Investigating local ancestry inference models in mixed ancestry individual genomes

Doctoral Thesis


Owing to historical events including the slave trade, agricultural interests, colonialism, and political and/or economic instability, the genomes of most modern humans are mosaics of segments originating from different populations. Such mosaics result from the interbreeding of two or more previously isolated populations, a process known as admixture. Known admixed populations include the mixed ancestry population of South Africa, Latin Americans and African Americans. Admixed individuals play important roles in understanding population history, disease aetiology, and personal genomics. Accordingly, efforts have been made to understand the genetic composition of such individuals, yielding several models that infer the ancestry of every chromosomal segment in admixed individuals (local ancestry). However, new research questions have emerged concerning the statistical and biological parameters of these models, as well as their performance across admixed datasets. This created the need to examine existing local ancestry inference models in order to identify and tackle their critical issues, which is the main goal of this thesis. We achieve this in four steps, constituting the main contributions of this PhD project: (1) qualitative assessment of existing models through a systematic review; (2) building a unified framework that integrates existing models for inferring and assessing local ancestry estimates; (3) quantitative assessment of existing methods within the same framework; and (4) proposing a model extension that accounts for natural selection and the origin of modern humans to improve the accuracy of local ancestry estimates. Firstly, we assess models using published results on different datasets and performance measures, to orient modellers and software developers on future trends in local ancestry inference.
Secondly, to address the challenges identified in (1), including model complexity reflected in the distinct inputs each model requires and the distinct output formats each produces, we design a unified framework, referred to as FRANC, to manipulate tool-specific inputs, deconvolve ancestry and standardise outputs, easing the inference process and paving the way for model assessment. Thirdly, using FRANC, we assess the performance of eight state-of-the-art models on simulated admixed population datasets involving three and five ancestral populations. LAMP-LD and LOTER performed better than the other six tested models on admixed populations involving five ancestral populations, while RFMIX, WINPOP, ELAI and LAMP-LD were comparable on admixed datasets involving three populations. Performance was evaluated using measures derived from the confusion matrix used in machine learning. Finally, we noted that it may be more practical to extend existing models to incorporate more realistic biological assumptions. Hence, we propose a nonparametric hidden Markov model that adjusts an existing model, mSPECTRUM, to account for natural selection and state persistence when deconvolving local ancestry, which should improve the accuracy of estimates. Like mSPECTRUM, the proposed model acknowledges the two common hypotheses on the origin of modern humans, making it comparable to mSPECTRUM, which has been shown to be competitive with HAPMIX, a benchmark for two-way admixtures. Together, these four steps contribute to the admixture analysis of populations.
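
The confusion-matrix evaluation mentioned above can be illustrated with a minimal sketch. It is not the FRANC implementation: the abstract does not say which measures were used, so precision, recall and accuracy are assumed here, and the function names and the ancestry labels ("AFR", "EUR", "ASN") are hypothetical. Each list position stands for one genomic marker, with its simulated (true) ancestry and the ancestry inferred by a model.

```python
from collections import Counter


def confusion_counts(true_labels, inferred_labels):
    """Count (true ancestry, inferred ancestry) pairs across markers."""
    if len(true_labels) != len(inferred_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    return Counter(zip(true_labels, inferred_labels))


def per_ancestry_measures(true_labels, inferred_labels):
    """Return precision, recall and accuracy for each ancestry (one-vs-rest)."""
    counts = confusion_counts(true_labels, inferred_labels)
    ancestries = sorted(set(true_labels) | set(inferred_labels))
    total = len(true_labels)
    measures = {}
    for a in ancestries:
        tp = counts[(a, a)]  # markers correctly assigned to ancestry a
        fp = sum(counts[(t, a)] for t in ancestries if t != a)  # wrongly called a
        fn = sum(counts[(a, p)] for p in ancestries if p != a)  # a called as other
        tn = total - tp - fp - fn
        measures[a] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "accuracy": (tp + tn) / total,
        }
    return measures


# Hypothetical three-way example: six markers along one haplotype.
true_anc = ["AFR", "AFR", "EUR", "EUR", "ASN", "ASN"]
inferred_anc = ["AFR", "EUR", "EUR", "EUR", "ASN", "AFR"]
result = per_ancestry_measures(true_anc, inferred_anc)
```

In this sketch the measures are computed one ancestry at a time against all others, which extends naturally from three-way to five-way admixture; averaging them across ancestries would be one possible way to rank models.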