Feasibility of individualised synthetic speech for children with complex communication needs in three South African languages (South African English, Afrikaans, and isiXhosa)

Terblanche, Camryn

Thesis / Dissertation

2025

Publisher

University of Cape Town

Department

Division of Communication Sciences and Disorders

Faculty

Faculty of Health Sciences

Abstract

Background: A person's voice is an expression of their identity. The uniqueness of a person's voice is influenced by both their physical and social attributes. For children with complex communication needs (CCN), however, the only functional way to communicate is sometimes an augmentative and alternative communication (AAC) device, specifically a speech-generating device. AAC users often lack a personal connection to the synthetic voices found on speech-generating devices. While these devices improve the quality of life for those with speech impairments, they often fail to capture the linguistic diversity present in the population, along with the uniqueness of an individual's natural voice.

Aim and Objectives: The overarching aim of this research is to develop a viable method for creating natural-sounding synthetic voices for South African children with CCN by using open-source speech synthesis software, taking into consideration the cultural assumptions and ideologies that influence the development of AAC systems for individuals with diverse backgrounds. The four primary objectives are:
1) To outline the strengths and weaknesses of existing speech synthesis systems. This involves a detailed examination of the challenges and advancements in the speech synthesis field.
2) To document multiple stakeholders' experiences and perceptions. This objective aims to capture the challenges faced by professionals and caregivers when implementing AAC within schools for Learners with Special Education Needs (LSEN), and to gather their ideas for overcoming these challenges, ensuring a comprehensive understanding of the context.
3) To delineate the process of generating naturalistic synthetic child speech. Tacotron 2, an open-source speech synthesis system, is used for three under-resourced languages (South African English/SAE, Afrikaans, and isiXhosa). This objective aims to provide insights into the feasibility of creating synthetic voices that match the vocal identity of children with CCN, addressing issues of speech diversity and under-resourced languages.
4) To evaluate and document multiple stakeholders' perspectives on the quality, acceptability, and utility of the newly created synthetic speech. This involves gathering feedback on the synthetic voices generated, in order to ascertain whether the method used would be accepted and deemed appropriate as an addition to AAC in South Africa. Understanding stakeholder perspectives is crucial for refining synthetic voices and ensuring their practicality and acceptance in real-world contexts.

Methods: The PhD project employs an exploratory, sequential, mixed methods design comprising three distinct phases. Phase 1 begins with a scoping review (Phase 1a), followed by focus group discussions (Phase 1b) utilising a descriptive qualitative design. In Phase 2, Tacotron 2 is employed for synthetic speech development; the naturalness of the synthetic voices created during this phase is assessed using a non-experimental, quantitative descriptive design. In the final phase (Phase 3), a triangulation mixed methods design is adopted. This approach combines qualitative insights gathered from focus groups with quantitative data, ensuring a comprehensive understanding of stakeholder perspectives and their broader assessment of the newly created synthetic speech.

Results: The scoping review in Phase 1a uncovered several challenges in child speech synthesis, emphasising the need for tailored solutions that account for the linguistic and age-related variation among children with CCN. Phase 1b revealed pervasive AAC implementation challenges in South Africa: reduced support and training, as well as crime-related safety concerns associated with the use of high-tech AAC devices. Limited accessibility further highlighted the barriers faced by children with CCN in low- and middle-income countries (LMICs). Phase 2's investigation into the feasibility of using Tacotron 2 to generate synthetic child speech showed promising outcomes. Despite challenges such as limited child speech data and literacy disparities among the children providing the speech data, we were able to create synthetic voices in three under-resourced South African languages (SAE, Afrikaans, and isiXhosa) using Tacotron 2. In Phase 3, stakeholders responded generally positively to the quality and acceptability of the newly created synthetic voices. Despite variations in prosody and intelligibility compared to natural child speech, stakeholders recognised potential benefits for children with CCN, with intelligibility ratings averaging 92%. The synthesis of qualitative and quantitative data enriched the understanding of the synthetic voices' practicality and acceptance, contributing to future AAC solutions for children with CCN in South Africa and similar contexts.

Conclusions: Collectively, this PhD research provides holistic insights into child speech synthesis, AAC implementation challenges, and stakeholder perspectives, especially in LMICs. The implications for service provision, safety, language diversity, and stakeholder involvement are evident. The findings lay the groundwork for advancing AAC interventions, promoting accessibility, and fostering inclusive decision-making processes, thereby enhancing communication solutions for children with CCN.