Recurrent neural network language models in the context of under-resourced South African languages

dc.contributor.advisor: Lacerda, Miguel
dc.contributor.author: Scarcella, Alessandro
dc.date.accessioned: 2019-02-08T13:55:47Z
dc.date.available: 2019-02-08T13:55:47Z
dc.date.issued: 2018
dc.date.updated: 2019-02-07T09:46:06Z
dc.description.abstract: Over the past five years, neural network models have been successful across a range of computational linguistic tasks. However, these triumphs have been concentrated in languages with significant resources, such as large datasets. Many languages, commonly referred to as under-resourced languages, have therefore received little attention and have yet to benefit from recent advances. This investigation aims to evaluate the implications of recent advances in neural network language modelling techniques for under-resourced South African languages. Rudimentary, single-layer recurrent neural networks (RNNs) were used to model four South African text corpora. The accuracy of these models was compared directly to that of legacy approaches. A suite of hybrid models was then tested. Across all four datasets, neural networks led to better-performing language models, either directly or as part of a hybrid model. A short examination of punctuation marks in text data revealed that performance metrics for language models are greatly overestimated when punctuation marks have not been excluded. The investigation concludes by appraising the sensitivity of RNN language models (RNNLMs) to dataset size: the datasets were artificially constrained and the accuracy of the models re-evaluated. It is recommended that future research within this domain be directed towards evaluating more sophisticated RNNLMs, as well as measuring their impact on application-focused tasks such as speech recognition and machine translation.
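The abstract describes evaluating rudimentary, single-layer RNN language models by their accuracy on text corpora. The thesis itself is not reproduced here; the following is only a minimal sketch, under assumed dimensions and a random illustrative token sequence, of what such a single-layer (Elman-style) RNNLM and a perplexity evaluation look like. All names, sizes, and data below are hypothetical, not taken from the thesis.

```python
import numpy as np

# Hypothetical minimal single-layer RNN language model (Elman network).
# Vocabulary size, hidden size, and the random "corpus" are illustrative only.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 16

# Parameters: input-to-hidden, hidden-to-hidden (recurrence), hidden-to-output.
Wxh = rng.normal(0, 0.1, (hidden_size, vocab_size))
Whh = rng.normal(0, 0.1, (hidden_size, hidden_size))
Why = rng.normal(0, 0.1, (vocab_size, hidden_size))

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def perplexity(token_ids):
    """Per-token perplexity of the RNN over a sequence of token ids."""
    h = np.zeros(hidden_size)
    log_prob = 0.0
    for prev, nxt in zip(token_ids[:-1], token_ids[1:]):
        x = np.zeros(vocab_size)
        x[prev] = 1.0                       # one-hot encode previous token
        h = np.tanh(Wxh @ x + Whh @ h)      # recurrent hidden-state update
        p = softmax(Why @ h)                # distribution over next token
        log_prob += np.log(p[nxt])
    n = len(token_ids) - 1
    return float(np.exp(-log_prob / n))

corpus = rng.integers(0, vocab_size, size=50).tolist()
print(perplexity(corpus))
```

With untrained near-zero weights the predictive distribution is close to uniform, so perplexity sits near the vocabulary size; training (e.g. by backpropagation through time) would drive it lower on real corpora.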
dc.identifier.citation: Scarcella, A. 2018. Recurrent neural network language models in the context of under-resourced South African languages. University of Cape Town.
dc.identifier.uri: http://hdl.handle.net/11427/29431
dc.language.iso: eng
dc.publisher.department: Department of Statistical Sciences
dc.publisher.faculty: Faculty of Science
dc.publisher.institution: University of Cape Town
dc.subject.other: Statistics
dc.title: Recurrent neural network language models in the context of under-resourced South African languages
dc.type: Master Thesis
dc.type.qualificationlevel: Masters
dc.type.qualificationname: MSc
Files

Original bundle
Name: thesis_sci_2018_scarcella_alessandro.pdf
Size: 1.56 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 0 B
Description: Item-specific license agreed upon at submission