Subword segmental neural language generation for Nguni languages

Thesis / Dissertation

2025

Publisher

University of Cape Town

Abstract
Deep learning models for text generation are now able to produce fluent and coherent text in many conversational settings. However, such models require large training datasets and are primarily designed for a limited number of high-resource languages. These advances are not directly applicable to low-resource languages with distinctive linguistic characteristics. In this thesis we develop text generation models for the Nguni languages of South Africa: isiXhosa, isiZulu, isiNdebele, and Siswati. The Nguni languages are agglutinative and conjunctively written, so words are formed by stringing together morphemes. We design neural models that suit the morphological complexity of the Nguni languages by explicitly modelling the segmentation of words into subword units. We propose subword segmental modelling, a neural architecture and training algorithm that learns subword segmentation during training. The standard approach to subword modelling is to apply data-driven algorithms such as byte-pair encoding (BPE) during preprocessing. Subword segmental modelling represents a departure from this paradigm: instead of casting subword segmentation as a preprocessing step, we incorporate it into end-to-end learning, allowing the model to discover the optimal subword units for a particular language and task. Explicitly modelling the complex subword structure of the Nguni languages serves as an inductive bias for more efficient training on typically limited training data. We present subword segmental models for three natural language generation tasks.

Our first model is for autoregressive language modelling. We propose the subword segmental language model (SSLM), a decoder-only model that learns subword segmentation to optimise its language modelling objective. On average across the four Nguni languages, SSLM achieves lower (better) perplexity-based intrinsic evaluation scores than tokenisation-based language models. We also evaluate SSLM as an unsupervised morphological segmenter, showing that its learned subwords are closer to morphemes than standard subword tokens. Since SSLM is our first instantiation of subword segmental modelling, we present a detailed analysis of the architectural components and hyperparameters we found to be influential during development.
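To make the training objective concrete: a subword segmental model treats the segmentation of a word as a latent variable and marginalises over all possible segmentations with a forward dynamic program. The sketch below is a minimal illustration of that marginalisation, not the thesis's implementation. segment_logprob stands in for the learned neural segment distribution, and the uniform per-character cost, the max_seg_len cap, and the example word are placeholder assumptions.

    import math

    def logsumexp(xs):
        # Numerically stable log(sum(exp(x) for x in xs)).
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def segment_logprob(segment, history):
        # Placeholder segment scorer. In a real subword segmental model this
        # is a neural distribution over the next segment given the history;
        # here every character simply costs log(1/27) as a stand-in.
        return len(segment) * math.log(1.0 / 27.0)

    def word_logprob(word, max_seg_len=5):
        # Forward dynamic program marginalising over all segmentations:
        #   alpha[t] = logsumexp over s of alpha[s] + log p(word[s:t] | word[:s])
        # with segments capped at max_seg_len characters.
        alpha = [-math.inf] * (len(word) + 1)
        alpha[0] = 0.0  # the empty prefix has probability 1
        for t in range(1, len(word) + 1):
            scores = [alpha[s] + segment_logprob(word[s:t], word[:s])
                      for s in range(max(0, t - max_seg_len), t)]
            alpha[t] = logsumexp(scores)
        return alpha[len(word)]

    # isiXhosa "ndiyakuthanda" ("I love you"): ndi-ya-ku-thanda
    print(word_logprob("ndiyakuthanda"))

Because the marginal is maximised during training, segmentation decisions are learned jointly with the language model rather than fixed in preprocessing, which is the departure from the BPE paradigm described above.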
Our second model extends subword segmental modelling to neural machine translation (NMT). We propose subword segmental machine translation (SSMT), an encoder-decoder model that learns target-language subword segmentation to optimise its sequence-to-sequence translation objective. To generate translations with SSMT, we propose dynamic decoding, a decoding algorithm for generating text with subword segmental architectures. SSMT outperforms tokenisation-based NMT on the Nguni languages, achieving large gains in the extremely low-resource setting of English-to-Siswati translation. As with SSLM, we show that SSMT learns subword boundaries more closely aligned with morpheme boundaries than tokenisation-based subwords. SSMT also exhibits greater morphological compositional generalisation: the ability to generalise to novel combinations of known morphemes. We extend SSMT to multilingual translation, where it learns a single target-side subword segmentation scheme to optimise performance across multiple translation directions. Comparing multilingual SSMT to multilingual tokenisation-based NMT, we find that multilingual SSMT does induce cross-lingual transfer, but to a lesser extent than multilingual tokenisation. In cross-lingual finetuning experiments, SSMT improves transfer between unrelated languages. Our experiments confirm that decisions around subword segmentation greatly affect cross-lingual performance. We also show that differences in orthographic word boundary alignment between languages can impede cross-lingual transfer.

Our third and final model combines subword segmental modelling with a copy mechanism for the task of data-to-text generation. We propose the subword segmental pointer generator (SSPG), which jointly learns to segment words and copy subwords to optimise data-to-text generation. We also propose unmixed decoding, a text generation algorithm for copy-equipped subword segmental models. On isiXhosa data-to-text generation, SSPG outperforms tokenisation-based architectures trained from scratch. Beyond reference-based evaluation, we develop an extractive evaluation framework that measures how faithfully model generations capture the expected data content. This evaluation shows that SSPG combines entity copying and morphological composition more effectively.

Across all three tasks, and for all four Nguni languages, subword segmental modelling consistently equals or outperforms equivalent tokenisation-based models. Its performance gains are greatest for extremely low-resource languages and tasks. Through linguistically informed evaluations, we show that subword segmental modelling successfully acquires particular aspects of Nguni-language morphology: its subword units resemble morphemes more closely than subword tokens do, and it applies morphological composition effectively. Subword segmental modelling proves effective for the Nguni languages, offering a promising new approach to text generation for low-resource, morphologically complex languages.
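As background to SSPG's copy mechanism: pointer-generator models, the family SSPG's name refers to, mix the decoder's vocabulary distribution with a distribution over copying tokens from the input. The sketch below shows that standard mixture in its generic form; how SSPG integrates it with learned subword segmentation and unmixed decoding is detailed in the thesis itself, and all names here are illustrative.

    import numpy as np

    def copy_mixture(p_vocab, attention, src_ids, p_gen):
        # Standard pointer-generator mixture (See et al., 2017):
        #   P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attn[i] where src[i] = w
        # p_vocab:   decoder's distribution over the output vocabulary
        # attention: decoder's attention weights over source positions
        # src_ids:   vocabulary ids of the source tokens
        p_final = p_gen * p_vocab
        for i, tok in enumerate(src_ids):
            p_final[tok] += (1.0 - p_gen) * attention[i]
        return p_final

    # Toy example: 6 output subwords, 3 source subwords (id 2 appears twice).
    rng = np.random.default_rng(0)
    p_vocab = rng.dirichlet(np.ones(6))
    attention = rng.dirichlet(np.ones(3))
    p = copy_mixture(p_vocab, attention, src_ids=[2, 4, 2], p_gen=0.7)
    print(p, p.sum())  # still a proper distribution: sums to 1.0

In SSPG the copied units are learned subwords rather than fixed tokens, which is what lets the model combine copying of entities with morphological composition, as the extractive evaluation above indicates.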