Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
| dc.contributor.advisor | Britz, Stefan | |
| dc.contributor.advisor | Buys, Jan | |
| dc.contributor.author | Pedlar, Victoria | |
| dc.date.accessioned | 2026-04-28T11:31:14Z | |
| dc.date.available | 2026-04-28T11:31:14Z | |
| dc.date.issued | 2023 | |
| dc.date.updated | 2026-04-28T11:21:45Z | |
| dc.description.abstract | Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering a promising alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages. | |
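Two of the decoding strategies named in the abstract, temperature sampling (the AWD-LSTM's best setting, t = 0.7) and nucleus sampling (the NLL Transformer's best setting, p = 0.90), can be sketched minimally as follows. This is an illustrative toy over a small logit vector, not code from the thesis; function names and values are assumptions for demonstration.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def temperature_sample(logits, t=0.7, rng=random):
    # Divide logits by t before softmax; t < 1 sharpens the
    # distribution toward high-probability tokens.
    probs = softmax([x / t for x in logits])
    return rng.choices(range(len(logits)), weights=probs)[0]

def nucleus_sample(logits, p=0.90, rng=random):
    # Keep the smallest set of probability-sorted tokens whose
    # cumulative mass reaches p, then renormalise and sample.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]
print(temperature_sample(toy_logits))  # a token id from the sharpened distribution
print(nucleus_sample(toy_logits))      # a token id from the p=0.90 nucleus
```

Both strategies trade off quality against diversity: lowering t or p concentrates probability mass on fewer tokens, which is what the reported per-model tuning of these hyperparameters explores.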
| dc.identifier.apacitation | Pedlar, V. (2023). <i>Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language</i>. (). University of Cape Town, Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/43141 | en_ZA |
| dc.identifier.chicagocitation | Pedlar, Victoria. <i>"Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language."</i> University of Cape Town, Faculty of Science, Department of Statistical Sciences, 2023. http://hdl.handle.net/11427/43141 | en_ZA |
| dc.identifier.citation | Pedlar, V. 2023. Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language. University of Cape Town, Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/43141 | en_ZA |
| dc.identifier.ris | TY - Thesis / Dissertation AU - Pedlar, Victoria AB - Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering a promising alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages.
DA - 2023 DB - OpenUCT DP - University of Cape Town KW - Statistical Sciences KW - isiZulu KW - AWD-LSTM KW - Transformer with NLL Loss LK - https://open.uct.ac.za PB - University of Cape Town PY - 2023 T1 - Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language TI - Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language UR - http://hdl.handle.net/11427/43141 ER - | en_ZA |
| dc.identifier.uri | http://hdl.handle.net/11427/43141 | |
| dc.identifier.vancouvercitation | Pedlar V. Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language. []. University of Cape Town, Faculty of Science, Department of Statistical Sciences, 2023 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/43141 | en_ZA |
| dc.language.iso | en | |
| dc.language.rfc3066 | eng | |
| dc.publisher.department | Department of Statistical Sciences | |
| dc.publisher.faculty | Faculty of Science | |
| dc.publisher.institution | University of Cape Town | |
| dc.subject | Statistical Sciences | |
| dc.subject | isiZulu | |
| dc.subject | AWD-LSTM | |
| dc.subject | Transformer with NLL Loss | |
| dc.title | Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language | |
| dc.type | Thesis / Dissertation | |
| dc.type.qualificationlevel | Masters | |