Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language

dc.contributor.advisorBritz, Stefan
dc.contributor.advisorBuys, Jan
dc.contributor.authorPedlar, Victoria
dc.date.accessioned2026-04-28T11:31:14Z
dc.date.available2026-04-28T11:31:14Z
dc.date.issued2023
dc.date.updated2026-04-28T11:21:45Z
dc.description.abstractGenerating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for advances in natural language processing, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study investigates and evaluates text generation techniques for isiZulu and the challenges the language poses. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies including greedy decoding, beam search, nucleus sampling, top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation used ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss performed best with nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model shows promise in handling the challenges posed by languages like isiZulu, offering an alternative to softmax for enhancing text generation performance. The study's insights could inform future research on more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages.
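Two of the decoding strategies named in the abstract, temperature sampling and nucleus (top-p) sampling, can be sketched over a raw logits vector as below. This is a minimal illustrative sketch, not code from the thesis; the function names and the toy logits are assumptions for the example, and the parameter defaults (t = 0.7, p = 0.90) mirror the settings the abstract reports as best for the AWD-LSTM and the Transformer with NLL Loss respectively.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logits vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def temperature_sample(logits, t=0.7, rng=None):
    # Divide logits by t before the softmax; t < 1 sharpens the
    # distribution (closer to greedy), t > 1 flattens it.
    rng = rng or np.random.default_rng(0)
    probs = softmax(np.asarray(logits, dtype=float) / t)
    return int(rng.choice(len(probs), p=probs))

def nucleus_sample(logits, p=0.90, rng=None):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p, renormalise over that "nucleus", then sample.
    rng = rng or np.random.default_rng(0)
    probs = softmax(np.asarray(logits, dtype=float))
    order = np.argsort(probs)[::-1]           # tokens, most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # size of the nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()
    return int(keep[rng.choice(len(keep), p=kept)])

# Toy example: token 0 dominates, so a p = 0.90 nucleus contains only it.
logits = [5.0, 2.0, 1.0, 0.0]
print(nucleus_sample(logits, p=0.90))      # always token 0 here
print(temperature_sample(logits, t=0.7))   # usually token 0, occasionally others
```

α-Entmax sampling (used with the Entmax loss model) generalises this by replacing the softmax with a sparse entmax transformation that can assign exactly zero probability to unlikely tokens, so no truncation threshold is needed.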
dc.identifier.apacitationPedlar, V. (2023). <i>Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language</i>. University of Cape Town, Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/43141
dc.identifier.chicagocitationPedlar, Victoria. <i>"Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language."</i> University of Cape Town, Faculty of Science, Department of Statistical Sciences, 2023. http://hdl.handle.net/11427/43141
dc.identifier.citationPedlar, V. 2023. Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language. University of Cape Town, Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/43141
dc.identifier.ris TY - Thesis / Dissertation AU - Pedlar, Victoria AB - Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for advances in natural language processing, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study investigates and evaluates text generation techniques for isiZulu and the challenges the language poses. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies including greedy decoding, beam search, nucleus sampling, top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation used ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss performed best with nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model shows promise in handling the challenges posed by languages like isiZulu, offering an alternative to softmax for enhancing text generation performance. The study's insights could inform future research on more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages.
DA - 2023 DB - OpenUCT DP - University of Cape Town KW - Statistical Sciences KW - isiZulu KW - AWD-LSTM KW - Transformer with NLL Loss LK - https://open.uct.ac.za PB - University of Cape Town PY - 2023 T1 - Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language TI - Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language UR - http://hdl.handle.net/11427/43141 ER -
dc.identifier.urihttp://hdl.handle.net/11427/43141
dc.identifier.vancouvercitationPedlar V. Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language. []. University of Cape Town, Faculty of Science, Department of Statistical Sciences, 2023 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/43141
dc.language.isoen
dc.language.rfc3066eng
dc.publisher.departmentDepartment of Statistical Sciences
dc.publisher.facultyFaculty of Science
dc.publisher.institutionUniversity of Cape Town
dc.subjectStatistical Sciences
dc.subjectisiZulu
dc.subjectAWD-LSTM
dc.subjectTransformer with NLL Loss
dc.titleOpen-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
dc.typeThesis / Dissertation
dc.type.qualificationlevelMasters
Files
Original bundle
Name: thesis_sci_2023_pedlar victoria.pdf
Size: 2.04 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.72 KB
Format: Item-specific license agreed upon to submission
Collections