Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech

dc.contributor.advisor: Britz, Stefan S
dc.contributor.advisor: Durbach, Ian
dc.contributor.author: Houston, Charles
dc.date.accessioned: 2023-03-06T10:16:35Z
dc.date.available: 2023-03-06T10:16:35Z
dc.date.issued: 2022
dc.date.updated: 2023-02-20T12:56:38Z
dc.description.abstract: Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning-based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to fine-tuning, three further techniques were investigated: layer freezing, data augmentation and re-initialization. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature-extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network with far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, demonstrating that fine-tuning can improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best-performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model.
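The time and frequency masking mentioned in the abstract zeroes out random bands of a spectrogram along each axis. A minimal NumPy sketch of the idea — the function name, mask counts, and maximum widths below are illustrative defaults, not the thesis's actual settings:

```python
import numpy as np

def mask_spectrogram(spec, n_time_masks=2, n_freq_masks=2,
                     max_time=10, max_freq=8, rng=None):
    """Apply time and frequency masking to a spectrogram.

    spec: 2-D array of shape (n_freq_bins, n_frames).
    Masked regions are zeroed; the input array is left untouched.
    """
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    n_freq_bins, n_frames = spec.shape
    # Time masks: zero out `w` consecutive frames starting at t0.
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_time + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        spec[:, t0:t0 + w] = 0.0
    # Frequency masks: zero out `w` consecutive frequency bins.
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_freq + 1))
        f0 = int(rng.integers(0, max(1, n_freq_bins - w)))
        spec[f0:f0 + w, :] = 0.0
    return spec
```

Because the masks are drawn fresh each time, every pass over the training data sees a differently corrupted version of each utterance, which is what makes this useful on a dataset as small as UASpeech.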
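The WER figures quoted above follow the standard word-level Levenshtein formulation: (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch (not Deep Speech's own implementation); note that insertions can push WER above 100%, which is how a baseline of 141.53% arises:

```python
def word_error_rate(reference, hypothesis):
    """WER = edit distance over words / number of reference words.

    Assumes a non-empty reference string.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("move the cursor up", "move cursor up")` gives 0.25 (one deletion over four reference words), while a hypothesis with many spurious insertions can score well above 1.0.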
dc.identifier.apacitation: Houston, C. (2022). <i>Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech</i>. Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/37267
dc.identifier.chicagocitation: Houston, Charles. <i>"Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech."</i> Faculty of Science, Department of Statistical Sciences, 2022. http://hdl.handle.net/11427/37267
dc.identifier.citation: Houston, C. 2022. Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech. Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/37267
dc.identifier.ris: TY - THES AU - Houston, Charles DA - 2022 DB - OpenUCT DP - University of Cape Town KW - Statistical Sciences LK - https://open.uct.ac.za PY - 2022 T1 - Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech TI - Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech UR - http://hdl.handle.net/11427/37267 ER -
dc.identifier.uri: http://hdl.handle.net/11427/37267
dc.identifier.vancouvercitation: Houston C. Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech. Faculty of Science, Department of Statistical Sciences; 2022 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/37267
dc.language.rfc3066: eng
dc.publisher.department: Department of Statistical Sciences
dc.publisher.faculty: Faculty of Science
dc.subject: Statistical Sciences
dc.title: Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
dc.type: Master Thesis
dc.type.qualificationlevel: Masters
dc.type.qualificationlevel: MSc
Files
Original bundle
Name: thesis_sci_2022_houston charles.pdf
Size: 1.91 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission