Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages

dc.contributor.advisor: Suleman, Hussein
dc.contributor.author: Chavula, Catherine
dc.date.accessioned: 2021-07-13T10:42:09Z
dc.date.available: 2021-07-13T10:42:09Z
dc.date.issued: 2021
dc.date.updated: 2021-07-13T10:40:53Z
dc.description.abstract: Most of the Web is published in languages that are not accessible to many potential users who are only able to read and understand their local languages. Many of these local languages are Resources Scarce Languages (RSLs) and lack the necessary resources, such as machine translation tools, to make available content more accessible. State-of-the-art preprocessing tools and retrieval methods are tailored for Web-dominant languages and, accordingly, documents written in RSLs are ranked low and are difficult to access in search results, resulting in a difficult and frustrating search experience for speakers of RSLs. In this thesis, we propose the use of language similarities to match, re-rank and return search results written in closely related languages to improve the quality of search results and user experience. We also explore the use of shared morphological features to build multilingual stemming tools. Focusing on six Bantu languages spoken in Southeastern Africa, we first explore how users would interact with search results written in related languages. We conduct a user study, examining the usefulness of, and user preferences for, ranking search results with different levels of intelligibility, and the types of emotions users experience when interacting with such results. Our results show that users can complete tasks using related-language search results but, as intelligibility decreases, more users struggle to complete search tasks and, consequently, experience negative emotions. Concerning ranking, we find that users prefer that relevant documents be ranked higher, and that intelligibility be used as a secondary criterion. Additionally, we use a User-Centered Design (UCD) approach to investigate enhanced interface features that could help users interact effectively with such search results. In a usability evaluation, our designed interface scored 86% on the System Usability Scale (SUS).
We then investigate whether ranking models that integrate relevance and intelligibility features would improve retrieval effectiveness. We develop these features by drawing on traditional Information Retrieval (IR) models and linguistics studies, and employ Learning To Rank (LTR) and unsupervised methods. Our evaluation shows that models that use both relevance and intelligibility features perform better than models that use relevance features only. Finally, we propose and evaluate morphological processing approaches that include multilingual stemming, using rules derived from morphological features common across the Bantu family of languages. Our evaluation of the proposed stemming approach shows that its performance is competitive on queries that use general terms. Overall, the thesis provides evidence that considering and matching search results written in closely related languages, as well as ranking and presenting them appropriately, improves the quality of retrieval and user experience for speakers of RSLs.
dc.identifier.apacitation: Chavula, C. (2021). Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages. Faculty of Science, Department of Computer Science. Retrieved from http://hdl.handle.net/11427/33614
dc.identifier.chicagocitation: Chavula, Catherine. "Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages." Faculty of Science, Department of Computer Science, 2021. http://hdl.handle.net/11427/33614
dc.identifier.citation: Chavula, C. 2021. Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages. Faculty of Science, Department of Computer Science. http://hdl.handle.net/11427/33614
dc.identifier.ris:
TY - Doctoral Thesis
AU - Chavula, Catherine
DA - 2021_
DB - OpenUCT
DP - University of Cape Town
KW - Resources Scarce Languages
KW - Bantu languages
KW - Southeastern Africa
LK - https://open.uct.ac.za
PY - 2021
T1 - Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages
TI - Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages
UR - http://hdl.handle.net/11427/33614
ER -
dc.identifier.uri: http://hdl.handle.net/11427/33614
dc.identifier.vancouvercitation: Chavula C. Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages. Faculty of Science, Department of Computer Science, 2021 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/33614
dc.language.rfc3066: eng
dc.publisher.department: Department of Computer Science
dc.publisher.faculty: Faculty of Science
dc.subject: Resources Scarce Languages
dc.subject: Bantu languages
dc.subject: Southeastern Africa
dc.title: Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages
dc.type: Doctoral Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationlevel: PhD
Files
Original bundle
Name: thesis_sci_2021_chavula catherine.pdf
Size: 3.31 MB
Format: Adobe Portable Document Format