Creating and analysing an African pan-genome

Bourn, Jessica Jean

Master Thesis

2022

Abstract

The human reference genome is currently a core resource for understanding the role of genetics in human health, disease, and variation, and has been invaluable in the development of clinical and computational tools for these purposes. However, the limited number of individual genomes used to create the reference has resulted in an underrepresentation of the extensive genetic diversity present in different human populations. Since an important use of the reference genome is to identify genetic variants that may be implicated in disease, this lack of diversity could limit the scientific utility of the reference for ethnic groups that are poorly represented in it. As a result, adaptations to the reference genome structure have been proposed. One such proposal is the use of multiple reference genomes, each of which represents a different human population. A logical and highly practical method of achieving this is through the use of a pan-genome, which is a curated collection of all the DNA sequences that are found within a population under study. Although African populations exhibit the greatest genetic diversity and variation in the world, the many and sometimes ancient ethnolinguistic groups from Africa are among those least represented within the reference genome.

Consequently, this study aimed to explore the feasibility of creating and analysing an African pan-genome, and to begin developing tools to achieve this. Seven distinct African regional ancestral groups have previously been identified: east African Nilo-Saharan, east African Afro-Asiatic, far west Niger-Congo, central west Niger-Congo, Bantu-speaking Niger-Congo, central African rainforest hunter-gatherer, and the Khoe and San. This study included and analysed samples from each group in order to assemble a more inclusive and representative pan-genome. A software pipeline developed by Duan et al. (2019), termed the HUman Pan-genome ANalysis (HUPAN) pipeline, was used to assemble the African pan-genome. As the HUPAN pipeline was originally designed to analyse only single populations, the inclusion of multiple populations required modifications and improvements, which were implemented after testing and analysing the pipeline on a smaller dataset of whole genome sequences.

Subsequently, a final dataset of 168 African high- and medium-coverage whole genome sequences representing the seven regional ancestral groups was submitted to the adapted HUPAN pipeline. For each group, nucleotide sequences that were absent from the human reference genome were assembled and extracted, which resulted in the identification of 43.37 Mbp of non-redundant non-reference genomic sequence and 31 novel predicted protein-coding genes from African individuals. Alignment to other pan-genome sequences, whole genomes from different human populations, and the complete telomere-to-telomere human genome validated a large portion of the sequences as non-reference and confirmed that the dataset contained sequences specific to African populations.

However, the gene presence-absence variation analysis of the pan-genome across all 168 samples revealed patterns of gene presence and absence that were strongly correlated with the sample dataset of origin, rather than with the ancestral group of origin. This hindered the identification of genuine genetic variation specific to the groups analysed. Further, it appears that previous pan-genomic research has not investigated the degree to which the genetic variation identified is dataset-specific rather than truly population-specific. The failure to acknowledge and account for the effects of spurious inter-dataset variation in previous pan-genomic research indicates that those analyses may be incomplete or ambiguous. This calls into question the methods currently used for pan-genomic research, and highlights that robust, standardised methods for human pan-genome research must be agreed on to ensure that comprehensive population-specific pan-genomes are produced in the future.

Despite this inherent weakness of pan-genomic research, this study successfully created and analysed a comprehensive and inclusive African pan-genome. Unique sets of non-reference sequences specific to African regional ancestral groups were identified and obtained, enabling the assembly of a non-redundant set of pan-African non-reference sequences. Furthermore, certain complex but previously unconsidered aspects of pan-genome research were identified and explored, and these observations may play a role in the advancement of pan-genome research in future.
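
The confounding described in the abstract, where gene presence-absence patterns track the dataset of origin rather than the ancestral group, can be illustrated with a small sketch. This is not the HUPAN pipeline or the method used in the thesis; it is a minimal, assumed check that compares mean Jaccard distances between samples sharing a label with distances between samples that do not. The function names, sample identifiers, gene names, and labels are all hypothetical toy values invented for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of genes called as present."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def separation_score(pav, labels):
    """Mean between-label distance minus mean within-label distance.

    A larger positive score means samples with the same label have more
    similar presence-absence profiles than samples with different labels.
    """
    within, between = [], []
    for s1, s2 in combinations(sorted(pav), 2):
        d = jaccard_distance(pav[s1], pav[s2])
        if labels[s1] == labels[s2]:
            within.append(d)
        else:
            between.append(d)
    if not within or not between:
        raise ValueError("need at least one within-label and one between-label pair")
    return sum(between) / len(between) - sum(within) / len(within)


# Hypothetical toy data: four samples, two datasets crossed with two groups.
# Each value is the set of pan-genome genes called as present in that sample.
pav = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2", "g3"},
    "s3": {"g1", "g4"},
    "s4": {"g1", "g4"},
}
dataset_of_origin = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
ancestral_group = {"s1": "X", "s2": "Y", "s3": "X", "s4": "Y"}

dataset_score = separation_score(pav, dataset_of_origin)
group_score = separation_score(pav, ancestral_group)

print("separation by dataset:", dataset_score)
print("separation by group:  ", group_score)
```

In this toy case the profiles separate cleanly by dataset and not by group, which is the kind of signal that would suggest inter-dataset artefacts (for example differences in coverage or sequencing protocol) rather than population-specific variation. A real analysis would need a design in which groups are sampled across several datasets, plus a significance test such as label permutation.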

Keywords

Bioinformatics

Collections

Masters
