Exploring new methodologies to identify disease-associated variants in African populations through the integration of patient genotype data and clinical phenotypes derived from routine health data: A case study for Type 2 Diabetes Mellitus in patients in the Western Cape Province, South Africa

Doctoral Thesis

2023

Abstract
Introduction

Knowledge of the genetic drivers of disease in African populations is poor, largely because of the limited human genomic data available from sub-Saharan Africa. Although the cost of generating human genomic data has fallen significantly, it remains a barrier to generating large-scale African genomic data. This project is therefore a proof-of-concept pilot study that demonstrates the implementation of a cost-effective, scalable genotyped virtual cohort that can address population-level genomic questions.

Methods

We optimised a tiered informed consent process that is suitable for the cohort study design and adapted it to conducting human genomic research in the African context. We used an existing dataset to explore statistical methods for modelling longitudinal routine health data into a standardised phenotype for genome-wide association studies (GWAS). We then conducted a feasibility study in which we piloted the tiered informed consent process, DNA collection by buccal swab, and DNA extraction from buccal swabs and peripheral blood samples. DNA samples were genotyped for approximately 2.2 million variants on the Infinium™ H3Africa Consortium Array V2. Genotyping quality control (QC) was performed in PLINK 1.9, and genome-wide imputation was run on the Sanger Imputation Service. We demonstrated successful variant calling, provided aggregate statistics for known aetiological variants for type 2 diabetes and severe COVID-19, and demonstrated the feasibility of running nested case-control GWAS with these data.
Results

We demonstrate the use of routine health data to provide complex phenotypes that can be linked to genotype data for both non-communicable diseases (diabetes) and infectious diseases (tuberculosis, HIV and COVID-19). In total, 459 participants consented to provide a DNA sample and access to their routine health data, and were included in the feasibility study. Of these, 343 DNA samples passed quality control, along with 1,782,023 genotyped variants, and were available for further analysis. While most of the cohort clustered with the 1000 Genomes African population, principal component analysis showed extensive population admixture. Using the routine health data of participants in the cohort, we identified 63 cases of severe COVID-19 and 280 controls for the COVID-19 analysis, and 93 cases and 250 controls for the type 2 diabetes analysis. Although the sample sizes were insufficient for a GWAS, we were able to evaluate known type 2 diabetes mellitus and COVID-19 variants in the study population.

Conclusion

We have described how we conceptualised and implemented a genotyped virtual population cohort in a resource-constrained environment. We are confident that this design and implementation are appropriate for scaling up the cohort to a size at which novel health discoveries can be made through nested case-control studies. In the interim, we demonstrate the analysis and validation of aetiological variants identified in other studies and populations.