Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages

Doctoral Thesis


Permanent link to this Item
Journal Title
Link to Journal
Journal ISSN
Volume Title
Most of the Web is published in languages that are not accessible to many potential users who are only able to read and understand their local languages. Many of these local languages are Resources Scarce Languages (RSLs) and lack the necessary resources, such as machine translation tools, to make available content more accessible. State of the art preprocessing tools and retrieval methods are tailored for Web dominant languages and, accordingly, documents written in RSLs are lowly ranked and difficult to access in search results, resulting in a struggling and frustrating search experience for speakers of RSLs. In this thesis, we propose the use of language similarities to match, re-rank and return search results written in closely related languages to improve the quality of search results and user experience. We also explore the use of shared morphological features to build multilingual stemming tools. Focusing on six Bantu languages spoken in Southeastern Africa, we first explore how users would interact with search results written in related languages. We conduct a user study, examining the usefulness and user preferences for ranking search results with different levels of intelligibility, and the types of emotions users experience when interacting with such results. Our results show that users can complete tasks using related language search results but, as intelligibility decreases, more users struggle to complete search tasks and, consequently, experience negative emotions. Concerning ranking, we find that users prefer that relevant documents be ranked higher, and that intelligibility be used as a secondary criterion. Additionally, we use a User-Centered Design (UCD) approach to investigate enhanced interface features that could assist users to effectively interact with such search results. Usability evaluation of our designed interface scored 86% using the System Usability Scale (SUS). We then investigate whether ranking models that integrate relevance and intelligibility features would improve retrieval effectiveness. We develop these features by drawing from traditional Information Retrieval (IR) models and linguistics studies, and employ Learning To Rank (LTR) and unsupervised methods. Our evaluation shows that models that use both relevance and intelligibility feature(s) have better performance when compared to models that use relevance features only. Finally, we propose and evaluate morphological processing approaches that include multilingual stemming, using rules derived from common morphological features across Bantu family of languages. Our evaluation of the proposed stemming approach shows that its performance is competitive on queries that use general terms. Overall, the thesis provides evidence that considering and matching search results written in closely related languages, as well as ranking and presenting them appropriately, improves the quality of retrieval and user experience for speakers of RSLs.