Investigating language preferences in improving multilingual Swahili information retrieval

Doctoral Thesis


Multilingual Information Retrieval (MLIR) systems are designed to retrieve information in multiple languages in response to a query posed in another language or in one of the languages in which a user is looking for information. Researchers have proposed several approaches for combining individual result lists into a single result list. Some are heuristics, such as round-robin, in which a result is drawn from each result list one at a time until all lists are exhausted, while others are Machine Learning (ML)-based, in which a model is trained using a variety of features from the query and the required documents. These approaches strive for topical relevance, which is the most important goal in satisfying users' information needs. However, multilingual speakers exhibit a variety of behaviours, some of which are unique to certain individuals based on their historical, cultural, and linguistic backgrounds. These behaviours are ignored in current MLIR system design and implementation. In particular, current MLIR systems do not take people's language preferences into account when ranking results. Studies have shown that users have different language preferences depending on their search topics, referred to as Topic-Language (T-L) preferences. This study proposes using T-L preferences to improve the relevance of the ranked MLIR results. To achieve this aim, we conducted a survey-based study to understand the information needs and Web search behaviour of Swahili-speaking Web users in Tanzania. One prominent behaviour of such multilingual Web users that emerged is code-switching. Several factors, such as information context and search topic, were identified as reasons for such frequent language switching. We then created a prototype multilingual search engine with which users interacted, in order to quantify how much the language of the query or of the selected results is influenced by the search topic.
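The round-robin heuristic mentioned above can be illustrated with a minimal sketch. It assumes each per-language result list is already ranked; the function name `round_robin_merge` and the document identifiers are hypothetical and are not taken from the thesis.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked lists, taking one result per list in each round.

    Lists that run out are skipped until every list is exhausted.
    """
    missing = object()  # marks positions where a shorter list has run out
    merged = []
    for round_items in zip_longest(*result_lists, fillvalue=missing):
        for doc in round_items:
            if doc is not missing:
                merged.append(doc)
    return merged


# Illustrative identifiers only: three Swahili results and two English results.
swahili_results = ["sw1", "sw2", "sw3"]
english_results = ["en1", "en2"]
print(round_robin_merge([swahili_results, english_results]))
# prints ['sw1', 'en1', 'sw2', 'en2', 'sw3']
```

Such a merge treats every language equally regardless of the topic, which is the limitation that language-preference-aware ranking is meant to address.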
Using the resulting query and click-through logs, we estimated the relationship between the search topic and the language of the query and of the clicked results. The findings revealed that Swahili-speaking Web users have language preferences for certain topics. For example, Kiswahili was significantly preferred as a results language in only 9% of the examined topics, English was preferred in 26% of the topics, and there was no preference for language of results in the remaining 65% of the topics. Based on these findings, we created the T-L-based algorithm, which re-ranks the results based on T-L associations/preferences. We evaluated our proposed T-L-based algorithm using click-through logs from our prototype guided multilingual search engine. The results show that incorporating language preferences into the ranking model significantly improves the relevance of MLIR results in some specific cases. The strength of the T-L association and the number of relevant results in the preferred language's list were found to be the driving factors in the performance improvement of the T-L-based algorithm. This thesis provides evidence that using language preferences can potentially improve the relevance of MLIR results for some topics that are preferentially expressed in specific languages. This is important in communities where information search and access are hampered by a variety of factors and there is a clear lineage in language use, where MLIR's topical relevance alone may not be sufficient.
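As a rough illustration of re-ranking by T-L preference, the sketch below boosts a result's score when its language is preferred for the query's topic. It is not the thesis's algorithm, whose details are not given in this abstract: the multiplicative scoring rule, the `boost` parameter, the preference table, and all identifiers are assumptions made for the example.

```python
def tl_rerank(results, topic, tl_preference, boost=0.5):
    """Re-rank merged results using a hypothetical topic-language preference table.

    results: list of (doc_id, language, relevance_score) tuples.
    tl_preference: dict mapping (topic, language) to a strength in [0, 1];
        pairs that are absent are treated as "no preference".
    boost: assumed weight controlling how strongly preference affects the score.
    """
    def adjusted_score(item):
        _doc_id, language, score = item
        strength = tl_preference.get((topic, language), 0.0)
        return score * (1.0 + boost * strength)

    # sorted() is stable, so results with equal adjusted scores keep their order.
    return sorted(results, key=adjusted_score, reverse=True)


# Illustrative data only: one topic with a Kiswahili preference.
results = [("en1", "en", 0.9), ("sw1", "sw", 0.8), ("en2", "en", 0.7)]
tl_preference = {("health", "sw"): 1.0}

print([doc for doc, _, _ in tl_rerank(results, "health", tl_preference)])
# prints ['sw1', 'en1', 'en2']
print([doc for doc, _, _ in tl_rerank(results, "sports", tl_preference)])
# prints ['en1', 'sw1', 'en2']
```

In this sketch a topic with no recorded preference leaves the original ranking unchanged, which mirrors the finding that most topics showed no language preference.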