Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech

dc.contributor.advisor: Britz, Stefan S
dc.contributor.advisor: Durbach, Ian
dc.contributor.author: Houston, Charles
dc.date.accessioned: 2023-03-06T10:16:35Z
dc.date.available: 2023-03-06T10:16:35Z
dc.date.issued: 2022
dc.date.updated: 2023-02-20T12:56:38Z
dc.description.abstract: Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning-based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to fine-tuning, three further techniques were investigated: layer freezing, data augmentation and re-initialization. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature-extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network with far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, demonstrating that fine-tuning can improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best-performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model.
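The time and frequency masking mentioned in the abstract zeroes out random bands of a spectrogram along each axis. A minimal NumPy sketch of the idea — the function name, mask counts, and maximum widths below are illustrative defaults, not the thesis's actual settings:

```python
import numpy as np

def mask_spectrogram(spec, n_time_masks=2, n_freq_masks=2,
                     max_time=10, max_freq=8, rng=None):
    """Apply time and frequency masking to a spectrogram.

    spec: 2-D array of shape (n_freq_bins, n_frames).
    Masked regions are zeroed; the input array is left untouched.
    """
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    n_freq_bins, n_frames = spec.shape
    # Time masks: zero out `w` consecutive frames starting at t0.
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_time + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        spec[:, t0:t0 + w] = 0.0
    # Frequency masks: zero out `w` consecutive frequency bins.
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_freq + 1))
        f0 = int(rng.integers(0, max(1, n_freq_bins - w)))
        spec[f0:f0 + w, :] = 0.0
    return spec
```

Because the masks are drawn fresh each time, every pass over the training data sees a differently corrupted version of each utterance, which is what makes this useful on a dataset as small as UASpeech.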
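The WER figures quoted above follow the standard word-level Levenshtein formulation: (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch (not Deep Speech's own implementation); note that insertions can push WER above 100%, which is how a baseline of 141.53% arises:

```python
def word_error_rate(reference, hypothesis):
    """WER = edit distance over words / number of reference words.

    Assumes a non-empty reference string.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("move the cursor up", "move cursor up")` gives 0.25 (one deletion over four reference words), while a hypothesis with many spurious insertions can score well above 1.0.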
dc.identifier.apacitation: Houston, C. (2022). <i>Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech</i>. Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/37267
dc.identifier.chicagocitation: Houston, Charles. <i>"Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech."</i> Faculty of Science, Department of Statistical Sciences, 2022. http://hdl.handle.net/11427/37267
dc.identifier.citation: Houston, C. 2022. Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech. Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/37267
dc.identifier.ris: TY - THES AU - Houston, Charles DA - 2022 DB - OpenUCT DP - University of Cape Town KW - Statistical Sciences LK - https://open.uct.ac.za PY - 2022 T1 - Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech TI - Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech UR - http://hdl.handle.net/11427/37267 ER -
dc.identifier.uri: http://hdl.handle.net/11427/37267
dc.identifier.vancouvercitation: Houston C. Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech. Faculty of Science, Department of Statistical Sciences; 2022 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/37267
dc.language.rfc3066: eng
dc.publisher.department: Department of Statistical Sciences
dc.publisher.faculty: Faculty of Science
dc.subject: Statistical Sciences
dc.title: Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech
dc.type: Master Thesis
dc.type.qualificationlevel: Masters
dc.type.qualificationlevel: MSc
Files
Original bundle
Name: thesis_sci_2022_houston charles.pdf
Size: 1.91 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 0 B
Format: Item-specific license agreed upon to submission