Computational Analyses of South African English – a Data-Driven Approach

Thesis / Dissertation

2024

Abstract
South African English remains relatively understudied across its multiple sub-varieties. An inclusive study of the language across these sub-varieties will enable us to uncover words and types of words unique to South African English that have been adopted or donated between the sub-varieties. This is important given South Africa's multilingual, multi-social society and the influence this has had on South African English. Such a study can also be used to improve large language models used in generative artificial intelligence, spellcheckers, sentiment analysis and speech-to-text technologies used in commercial applications. Computational techniques such as Part-of-Speech (POS) tagging, a sub-technique of Natural Language Processing, can assist in understanding sentence structure and consequently aid in uncovering donor-adopter relationships between sub-varieties of a language. The accuracy of POS taggers on South African English therefore needs to be understood. This dissertation adopts computational, data-driven approaches to studying South African English corpora to determine how accurate POS taggers are on South African English and whether accuracy can be improved by creating extensions to a POS tagging model. Two models, both trained on English, are used as baselines to predict POS tags on South African English corpora: a single-layer neural network POS tagging model using word feature representations, and a bidirectional long short-term memory (BiLSTM) neural network POS tagging model. Two modifications to the BiLSTM model are then made: the first creates a dual-language model by including the Afrikaans language, and the second trains the tokenizer and POS tagging neural processors of the dual-language model on words unique to South African English. The evaluations show that the modified models are more accurate than the baseline models.
When run on two South African English corpora, the baseline models achieve a POS tagging F-Score of 0.69 on average across both corpora and baseline models. The modified models, evaluated on the same corpora, achieve a POS tagging F-Score of 0.71 on average across both corpora and modified models. When run on words unique to South African English, the baseline models achieve an average F-Score of 0.62, while the modified models achieve an average F-Score of 0.72 on the same dataset. The results demonstrate that improvements to POS tagging on South African English can be made by including Afrikaans in the model and by training this model on words unique to South African English. A novel Data-Driven Matching model is developed to investigate donor-adopter relationships in South African English. Results show that there is a commonality of use of words between South African English and Afrikaans, Sesotho and isiZulu. Of the words in the South African English corpora studied, 15.7% are observed to be in use in Afrikaans, 4.98% are used in Sesotho and 1.09% are used in isiZulu.
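To make the word-commonality percentages above more concrete, the following is a minimal sketch of how such a share could be computed. It is not the dissertation's Data-Driven Matching model, whose details are not given here. The function name `overlap_share`, the choice to count distinct word types rather than running tokens, and the toy word lists are all assumptions made purely for illustration.

```python
def overlap_share(corpus_tokens, lexicon):
    """Percentage of distinct corpus word types that also occur in `lexicon`.

    Hypothetical helper: counting by distinct type (not by token) and
    lowercasing before comparison are assumptions, not the thesis method.
    """
    # Keep alphabetic tokens only and normalise case before comparing.
    types = {tok.lower() for tok in corpus_tokens if tok.isalpha()}
    if not types:
        return 0.0
    lexicon_lower = {word.lower() for word in lexicon}
    hits = sum(1 for word in types if word in lexicon_lower)
    return 100.0 * hits / len(types)


# Toy data for illustration only; not drawn from the corpora studied.
toy_tokens = ["The", "braai", "was", "lekker", ",", "my", "bru", "!"]
toy_afrikaans_lexicon = {"braai", "lekker"}

share = overlap_share(toy_tokens, toy_afrikaans_lexicon)
print(f"{share:.2f}% of word types also appear in the toy lexicon")
```

A simple overlap of this kind only shows that a word is in use in both languages. It does not by itself establish which language donated the word and which adopted it.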