Computational Analyses of South African English – a Data-Driven Approach

dc.contributor.advisor: Keet, Catharina
dc.contributor.author: De Lange, Jacques
dc.date.accessioned: 2025-02-03T11:22:27Z
dc.date.available: 2025-02-03T11:22:27Z
dc.date.issued: 2024
dc.date.updated: 2025-02-03T11:21:34Z
dc.description.abstract: South African English across its multiple sub-varieties remains relatively understudied, and an inclusive study of the language across the sub-varieties will enable us to uncover words and types of words unique to South African English that have been adopted or donated between the sub-varieties. This is important given South Africa's multilingual, multi-social society and the influence this has had on South African English. Such a study can also be used to improve large language models used in generative artificial intelligence, spellcheckers, sentiment analysis and speech-to-text technologies used in commercial applications. Computational techniques such as Part-of-Speech (POS) tagging, a sub-technique of Natural Language Processing, can be used to assist in understanding sentence structure and, consequently, aid our uncovering of donor-adopter relationships between sub-varieties of a language. The accuracy of POS taggers on South African English therefore needs to be understood. This dissertation adopts computational data-driven approaches to studying South African English corpora to determine how accurate POS taggers are on South African English and whether accuracy can be improved by creating extensions to a POS tagging model. A single-layer neural network POS tagging model using word feature representations and a bidirectional long short-term memory (BLSTM) neural network POS tagging model, both trained on English, are used as baseline models to predict POS tags on South African English corpora. Two modifications to the BLSTM model are then made: the first creates a dual-language model by including the Afrikaans language, and the second trains the tokenizer and POS tagging neural processors of the dual-language model on words unique to South African English. The evaluations show that the accuracy of the modified models is improved compared to the baseline models.
The evaluation of the baseline models when run on two South African English corpora shows a POS tagging F-Score of 0.69 on average across both corpora and baseline models. The evaluation of the modified models on the same corpora shows a POS tagging F-Score of 0.71 on average across both corpora and modified models. Evaluating the baseline models when run on words unique to South African English shows an average F-Score of 0.62, and evaluating the modified models when run on the same dataset shows an average F-Score of 0.72. The results demonstrate that improvements to POS tagging on South African English can be made by including Afrikaans in the model and by training this model on words unique to South African English. A novel Data-Driven Matching model is developed to investigate donor-adopter relationships in South African English. Results show that there is a commonality of use of words between South African English and Afrikaans, Sesotho and isiZulu: 15.7% of the words in the South African English corpora studied are observed to be in use in Afrikaans, 4.98% in Sesotho and 1.09% in isiZulu.
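The overlap percentages reported in the abstract can be pictured with a small sketch. This is not the dissertation's Data-Driven Matching model, whose details are not given in this record. It is only a minimal illustration, assuming that "words in use in another language" is measured as the share of distinct corpus word types that also appear in a word list for that language. The function name `overlap_share` and the toy data are hypothetical.

```python
def overlap_share(corpus_tokens, lexicon):
    """Return the fraction of distinct corpus word types also found in `lexicon`.

    Assumption: matching is a case-insensitive exact string match on
    alphabetic tokens; the dissertation may define matching differently.
    """
    types = {t.lower() for t in corpus_tokens if t.isalpha()}
    if not types:
        return 0.0
    lex = {w.lower() for w in lexicon}
    return len(types & lex) / len(types)


# Toy data, invented for illustration only (not taken from the thesis corpora).
sae_tokens = ["The", "braai", "was", "lekker", "and", "the", "bakkie", "was", "full"]
afrikaans_lexicon = {"braai", "lekker", "bakkie", "was", "die"}

share = overlap_share(sae_tokens, afrikaans_lexicon)
print(f"Share of corpus word types found in the lexicon: {share:.1%}")
```

Counting types rather than tokens keeps frequent function words from dominating the figure. A token-based share would weight each word by how often it occurs, and the abstract does not say which of the two was used.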
dc.identifier.apacitation: De Lange, J. (2024). Computational Analyses of South African English – a Data-Driven Approach. Faculty of Science, Department of Computer Science. Retrieved from http://hdl.handle.net/11427/40870 [en_ZA]
dc.identifier.chicagocitation: De Lange, Jacques. "Computational Analyses of South African English – a Data-Driven Approach." Faculty of Science, Department of Computer Science, 2024. http://hdl.handle.net/11427/40870 [en_ZA]
dc.identifier.citation: De Lange, J. 2024. Computational Analyses of South African English – a Data-Driven Approach. Faculty of Science, Department of Computer Science. http://hdl.handle.net/11427/40870 [en_ZA]
dc.identifier.ris [en_ZA]:
TY - Thesis / Dissertation
AU - De Lange, Jacques
AB - (identical to dc.description.abstract)
DA - 2024
DB - OpenUCT
DP - University of Cape Town
KW - Information technology
LK - https://open.uct.ac.za
PY - 2024
T1 - Computational Analyses of South African English – a Data-Driven Approach
TI - Computational Analyses of South African English – a Data-Driven Approach
UR - http://hdl.handle.net/11427/40870
ER -
dc.identifier.uri: http://hdl.handle.net/11427/40870
dc.identifier.vancouvercitation: De Lange J. Computational Analyses of South African English – a Data-Driven Approach. Faculty of Science, Department of Computer Science, 2024 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/40870 [en_ZA]
dc.language.rfc3066: Eng
dc.publisher.department: Department of Computer Science
dc.publisher.faculty: Faculty of Science
dc.subject: Information technology
dc.title: Computational Analyses of South African English – a Data-Driven Approach
dc.type: Thesis / Dissertation
dc.type.qualificationlevel: Masters
dc.type.qualificationlevel: MPhil
Files
Original bundle
Name: thesis_sci_2024_de lange jacques.pdf
Size: 1.95 MB
Format: Adobe Portable Document Format