Computational Analyses of South African English – a Data-Driven Approach

De Lange, Jacques

dc.contributor.advisor	Keet, Catharina
dc.contributor.author	De Lange, Jacques
dc.date.accessioned	2025-02-03T11:22:27Z
dc.date.available	2025-02-03T11:22:27Z
dc.date.issued	2024
dc.date.updated	2025-02-03T11:21:34Z
dc.description.abstract	South African English, across its multiple sub-varieties, remains relatively understudied. An inclusive study of the language across these sub-varieties would make it possible to uncover words and types of words unique to South African English that have been adopted or donated between the sub-varieties. This is important given South Africa's multilingual, multi-social society and the influence this has had on South African English. Such a study can also be used to improve the large language models used in generative artificial intelligence, as well as the spellcheckers, sentiment analysis and speech-to-text technologies used in commercial applications. Computational techniques such as Part of Speech (POS) tagging, a sub-technique of Natural Language Processing, can assist in understanding sentence structure and consequently aid in uncovering donor-adopter relationships between sub-varieties of a language. The accuracy of POS taggers on South African English therefore needs to be understood. This dissertation adopts computational, data-driven approaches to studying South African English corpora to determine how accurate POS taggers are on South African English and whether accuracy can be improved by creating extensions to a POS tagging model. A single-layer neural network POS tagging model using word feature representations and a bidirectional long short-term memory (BLSTM) neural network POS tagging model, both trained on English, are used as baseline models to predict POS tags on South African English corpora. Two modifications to the BLSTM model are then made: the first creates a dual-language model by including the Afrikaans language, and the second trains the tokenizer and POS tagging neural processors of the dual-language model on words unique to South African English. The evaluations show that the modified models are more accurate than the baseline models.
Run on two South African English corpora, the baseline models achieve a POS tagging F-Score of 0.69 on average across both corpora and baseline models. The modified models, evaluated on the same corpora, achieve a POS tagging F-Score of 0.71 on average across both corpora and modified models. On words unique to South African English, the baseline models achieve an average F-Score of 0.62 and the modified models an average F-Score of 0.72 on the same dataset. The results demonstrate that POS tagging on South African English can be improved by including Afrikaans in the model and by training this model on words unique to South African English. A novel Data-Driven Matching model is developed to investigate donor-adopter relationships in South African English. Results show that there is a commonality of use of words between South African English and Afrikaans, Sesotho and isiZulu: 15.7% of the words in the South African English corpora studied are observed to be in use in Afrikaans, 4.98% in Sesotho and 1.09% in isiZulu.
dc.identifier.apacitation	De Lange, J. (2024). Computational Analyses of South African English – a Data-Driven Approach. Faculty of Science, Department of Computer Science. Retrieved from http://hdl.handle.net/11427/40870	en_ZA
dc.identifier.chicagocitation	De Lange, Jacques. "Computational Analyses of South African English – a Data-Driven Approach." Faculty of Science, Department of Computer Science, 2024. http://hdl.handle.net/11427/40870	en_ZA
dc.identifier.citation	De Lange, J. 2024. Computational Analyses of South African English – a Data-Driven Approach. Faculty of Science, Department of Computer Science. http://hdl.handle.net/11427/40870	en_ZA
dc.identifier.ris	TY - Thesis / Dissertation AU - De Lange, Jacques DA - 2024 DB - OpenUCT DP - University of Cape Town KW - Information technology LK - https://open.uct.ac.za PY - 2024 T1 - Computational Analyses of South African English – a Data-Driven Approach TI - Computational Analyses of South African English – a Data-Driven Approach UR - http://hdl.handle.net/11427/40870 ER -	en_ZA
dc.identifier.uri	http://hdl.handle.net/11427/40870
dc.identifier.vancouvercitation	De Lange J. Computational Analyses of South African English – a Data-Driven Approach. Faculty of Science, Department of Computer Science, 2024 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/40870	en_ZA
dc.language.rfc3066	Eng
dc.publisher.department	Department of Computer Science
dc.publisher.faculty	Faculty of Science
dc.subject	Information technology
dc.title	Computational Analyses of South African English – a Data-Driven Approach
dc.type	Thesis / Dissertation
dc.type.qualificationlevel	Masters
dc.type.qualificationlevel	MPhil
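
The abstract reports the share of South African English corpus words that are also in use in Afrikaans, Sesotho and isiZulu. The sketch below illustrates one simple way such a percentage could be computed. It is not the dissertation's Data-Driven Matching model, whose details are not given in this record. The function name `overlap_percentage`, the choice to count distinct word types rather than running tokens, and the toy word lists are all assumptions made for illustration.

```python
from typing import Iterable, Set


def overlap_percentage(corpus_tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Percentage of distinct corpus word types that also appear in `lexicon`.

    Assumption: matching is case-insensitive and only purely alphabetic
    tokens are counted. The dissertation may define matching differently.
    """
    corpus_types = {t.lower() for t in corpus_tokens if t.isalpha()}
    if not corpus_types:
        return 0.0
    lexicon_types = {w.lower() for w in lexicon}
    shared = corpus_types & lexicon_types
    return 100.0 * len(shared) / len(corpus_types)


# Toy data, invented for illustration only (not taken from the thesis corpora).
sae_tokens = ["We", "had", "a", "braai", "and", "then", "a", "lekker", "indaba"]
afrikaans_lexicon = {"braai", "lekker", "en", "had"}

share = overlap_percentage(sae_tokens, afrikaans_lexicon)
print(f"{share:.1f}% of corpus word types also occur in the lexicon")
```

Counting word types keeps frequent function words from dominating the figure; counting running tokens instead would answer a different question, namely how much of the running text is shared.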

Files

Original bundle

Name: thesis_sci_2024_de lange jacques.pdf
Size: 1.95 MB
Format: Adobe Portable Document Format

Collections

Masters