Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
Thesis / Dissertation
2023
Permanent link to this Item
http://hdl.handle.net/11427/43141
Authors
Pedlar, V.
Publisher
University of Cape Town
Department
Department of Statistical Sciences
Faculty
Faculty of Science
Abstract
Generating high-quality text in under-resourced, morphologically complex languages like isiZulu is vital for advancing natural language processing, yet such languages remain underexplored. Addressing this gap could improve text generation performance and enable broader applications. This study investigates and evaluates text generation techniques for isiZulu, together with the challenges the language poses. Three models (AWD-LSTM, Transformer with NLL loss, and Transformer with Entmax loss) were assessed under several decoding strategies: greedy decoding, beam search, nucleus sampling, top-k sampling, temperature sampling, and α-entmax sampling. Evaluation used ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model performed best with temperature sampling at t = 0.7, while the Transformer with NLL loss excelled with nucleus sampling at p = 0.90. The Transformer with Entmax loss, a novel sparse language model, reached maximum diversity with α-entmax sampling at α = 1.2. The entmax-based sparse language model shows promise in handling the challenges posed by languages like isiZulu, offering an alternative to softmax for improving text generation performance. These insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages.
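As context for two of the decoding strategies compared in the abstract, the following is a minimal sketch of temperature sampling and nucleus (top-p) sampling over a vector of logits, written in plain Python. The function names and the toy logits are illustrative assumptions, not code from the thesis, and the sketch operates on a single decoding step rather than a full generation loop.

```python
import math
import random

def softmax(logits, t=1.0):
    """Convert logits to probabilities, scaled by temperature t."""
    m = max(x / t for x in logits)                # subtract max for numerical stability
    exps = [math.exp(x / t - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def temperature_sample(logits, t=0.7, rng=random):
    """Sample a token id; lower t sharpens the distribution (t -> 0 is greedy)."""
    probs = softmax(logits, t)
    return rng.choices(range(len(probs)), weights=probs)[0]

def nucleus_sample(logits, p=0.90, rng=random):
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:                               # accumulate tokens until mass p is covered
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    weights = [probs[i] for i in kept]            # renormalisation is implicit in choices()
    return rng.choices(kept, weights=weights)[0]
```

With a sharply peaked toy distribution such as `[5.0, 1.0, 0.0, -1.0]`, nucleus sampling at p = 0.90 keeps only the top token, while temperature sampling at t = 0.7 can still occasionally pick a lower-probability token.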
Description
Reference:
Pedlar, V. 2023. Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language. . University of Cape Town ,Faculty of Science ,Department of Statistical Sciences. http://hdl.handle.net/11427/43141