Browsing by Department "Computational Biology Division"
- Item (Open Access): A pan-genome wide association study to identify genes associated with invasive Streptococcus pneumoniae (2023) Iranzadeh, Arash; Mulder, Nicola
Streptococcus pneumoniae (pneumococcus) is one of the leading causes of mortality in Africa. It asymptomatically colonizes the human nasopharynx. Invasive pneumococcal disease occurs when isolates spread to normally sterile sites such as the lungs, blood, and central nervous system. Colonization, though, does not necessarily lead to infection. Some isolates remain in the upper respiratory tract only, without causing any symptoms. This thesis hypothesized that invasive and non-invasive isolates differ genetically. We tested this hypothesis by applying a pan-genome approach using whole-genome sequencing short reads of 1477 samples from Malawi, including those obtained from the nasopharynx of carriers (825 samples) and from the blood and cerebrospinal fluid of patients (652 samples). In-silico serotyping identified 56 serotypes in the cohort, and statistical analysis showed that, despite vaccination, the prevalence of serotypes 1 and 12F increased amongst patients. Genomes were assembled, and a reference pan-genome for all strains was built. Short reads were aligned to the core genome, and core variants were called. The population structure was determined based on the distribution of variants in the pan-genome. Finally, genes with a significant presence in the invasive isolates were identified. Functional enrichment analysis of potential virulence genes was carried out to address how specific genes may contribute to pathogenesis.
The findings highlighted the features of the pneumococcus pan-genome in Malawi. The core and accessory genomes were characterized based on the functional analysis of genes. The core components included ribosomal subunits; subunits of F-type ATP synthase; and enzymes that catalyze the attachment of amino acids to tRNA molecules, DNA replication, DNA repair, and homologous recombination. Of the core and soft-core genes, 10.13% were uncharacterized. In the accessory genome, the study detected genes from Regions of Diversity (RDs), including subunits of V-type ATPases and a sodium/solute symporter from RD8a; enzymes from RD3 catalyzing capsule synthesis; subunits of the PsrP-secY2A2 pathogenicity island from RD10; genes from RD6 and RD7 involved in transposing mobile genetic elements; genes from RD2, RD8b, and RD12 participating in communication and competition; and genes from RD4 that assemble pilins into pili and anchor pili to the cell wall. Of the accessory genes, 53.58% were uncharacterized.
Most serotypes showed a similar prevalence in carriage and disease groups. However, the significant abundance of serotypes 1, 5, and 12F among patients compared to the carriage group suggested they are highly invasive with a short colonization period. These serotypes exhibited a remarkable genetic distinction from others, including the absence and presence of several genes in their genome structure. The lack of genes from a genomic island known as RD8a was the most pronounced difference between serotypes 1, 5, and 12F and the serotypes that were significantly prevalent in the nasopharynx. Genes in RD8a are involved in binding to epithelial cells and in carrying out aerobic respiration to synthesize ATP through oxidative phosphorylation. The absence of RD8a from serotypes 1, 5, and 12F may be associated with their short duration in the nasopharynx, where they need to bind to epithelial cells and access the free oxygen molecules required for aerobic respiration. Given this, the amount of ATP is likely to decline in serotypes 1, 5, and 12F, causing them to harbour more phosphotransferase systems to transport carbohydrates, since these transporters use phosphoenolpyruvate as the energy source instead of ATP.
In conclusion, serotypes 1, 5, and 12F, the most prevalent and invasive pneumococcal strains in Malawi, showed a considerable genetic distinction from other strains that may be associated with their short colonization period and quickness to infect the blood and cerebrospinal fluid.
- Item (Open Access): African Genomic Medicine Portal: A Web Portal for Biomedical Applications (2022-02-11) Othman, Houcemeddine; Zass, Lyndon; da Rocha, Jorge E B; Radouani, Fouzia; Samtal, Chaimae; Benamri, Ichrak; Kumuthini, Judit; Fakim, Yasmina J; Hamdi, Yosr; Mezzi, Nessrine; Boujemaa, Maroua; Okeke, Chiamaka Jessica; Tendwa, Maureen B; Sanak, Kholoud; Chaouch, Melek; Panji, Sumir; Kefi, Rym; Sallam, Reem M; Ghoorah, Anisah W; Romdhane, Lilia; Kiran, Anmol; Meintjes, Ayton P; Maturure, Perceval; Jmel, Haifa; Ksouri, Ayoub; Azzouzi, Maryame; Farahat, Mohammed A; Ahmed, Samah; Sibira, Rania; Turkson, Michael E E; Ssekagiri, Alfred; Parker, Ziyaad; Fadlelmola, Faisal M; Ghedira, Kais; Mulder, Nicola; Kamal Kassim, Samar
Genomics data are currently being produced at unprecedented rates, resulting in increased knowledge discovery and submission to public data repositories. Despite these advances, genomic information on African-ancestry populations remains significantly underrepresented compared with European- and Asian-ancestry populations. This information is typically segmented across several different biomedical data repositories, which often lack sufficient fine-grained structure and annotation to account for the diversity of African populations, leading to many challenges related to the retrieval, representation and findability of such information. To overcome these challenges, we developed the African Genomic Medicine Portal (AGMP), a database that contains metadata on genomic medicine studies conducted on African-ancestry populations. The metadata were curated from two public databases related to genomic medicine, PharmGKB and DisGeNET. The metadata retrieved from these source databases were limited to genomic variants that were associated with disease aetiology or treatment in the context of African-ancestry populations. Over 2000 variants relevant to populations of African ancestry were retrieved.
Subsequently, domain experts curated and annotated additional information associated with the studies that reported the variants, including geographical origin, ethnolinguistic group, level of association significance and other relevant study information, such as study design and sample size, where available. The AGMP functions as a dedicated resource through which to access African-specific information on genomics as applied to health research, through querying variants, genes, diseases and drugs. The portal and its corresponding technical documentation, implementation code and content are publicly available.
- Item (Open Access): Analysis of within-host evolution of Plasmodium Falciparum during treatment (2018) Okendo, Javan Ochieng; Mulder, Nicola; Andagalu, Ben
Antimalarial drugs impose strong selective pressure on Plasmodium falciparum parasite genomes and leave signatures of selection. The evolutionary basis of drug-resistant malaria in endemic and epidemic settings remains a scientific priority, and resolving it would significantly affect treatment outcomes. To understand the evolutionary changes in P. falciparum during treatment with ACTs, we used various approaches to test the neutral models of evolution using P. falciparum genomic data collected from Kombewa and Maseno in Kisumu, Kenya between 2013 and 2015. The ratio of non-synonymous to synonymous substitution rates (dN/dS) was used to predict the effect of selection on protein-coding loci of the Pfk13 gene. A logistic regression model was used to test the association between IC50s and the SNPs. mCSM and SDM were used to detect the effects of mutations on the Pfk13 gene, while the PRIMO web server was used to locate the SNPs on the Kelch13 propeller domain. Modeller V9.1 was used to predict the structure of the Kelch13 propeller domain, and the Posview webserver was used to predict ACT/Kelch13 interactions. Population differentiation was assessed by calculating FST with Microsatellite Analyzer, and customized R scripts with the relevant population genetics packages were used in the analysis. For samples collected in 2013, Tajima’s D was 4.53194, Fu and Li’s D* 2.13380, and Fu and Li’s F* 3.62142. In 2015, however, Tajima’s D was -2.42910, Fu and Li’s D* -5.2712, and Fu and Li’s F* -5.0045. The dN/dS was 1.0299 in 2013 and 2.6884 in 2015. Kenyan P. falciparum SNPs occur on the intra- or inter-blade domains of the PfK13 propeller domain. The FST analysis showed minimal population differentiation of the parasites during treatment.
Overall, there was no significant association between SNPs and IC50 values, although the SNP causing D547E showed an association with the artesunate IC50, and D559E with the AQ and MQ IC50s. Even though there is an exponential increase in the number of non-synonymous point mutations in the Pfk13 gene, the Kenyan P. falciparum strains remain sensitive to ACT drugs. Further research should deep sequence this region of chromosome 13, as this will provide more power for finding novel SNPs for further validation.
- Item (Open Access): Conserved recombination patterns across coronavirus subgenera (2022) de Klerk, Arne; Martin, Darrin Patrick
Recombination contributes to the genetic diversity found in coronaviruses and is known to be a prominent mechanism whereby they evolve. It is apparent, both from controlled experiments and in genome sequences sampled from nature, that patterns of recombination in coronaviruses are nonrandom and that this is likely attributable to a combination of sequence features that favour the occurrence of recombination breakpoints at specific genomic sites, and selection disfavouring the survival of recombinants within which favourable intra-genome interactions have been disrupted. Here we leverage available whole-genome sequence data for six coronavirus subgenera to identify specific patterns of recombination that are conserved between multiple subgenera and then identify the likely factors that underlie these conserved patterns. Specifically, we confirm the non-randomness of recombination breakpoints across all six tested coronavirus subgenera, locate conserved recombination hot- and cold-spots, and determine that the locations of transcriptional regulatory sequences are likely major determinants of conserved recombination breakpoint hot-spot locations. We find that while the locations of recombination breakpoints are not uniformly associated with degrees of nucleotide sequence conservation, they display significant tendencies in multiple coronavirus subgenera to occur in low guanine-cytosine content genome regions, in non-coding regions, at the edges of genes, and at sites within the Spike gene that are predicted to be minimally disruptive of Spike protein folding.
While it is apparent that sequence features such as transcriptional regulatory sequences are likely major determinants of where the template-switching events that yield recombination breakpoints most commonly occur, it is evident that selection against misfolded recombinant proteins also strongly impacts observable recombination breakpoint distributions in coronavirus genomes sampled from nature.
- Item (Open Access): Correction to: Human microbiota research in Africa: a systematic review reveals gaps and priorities for future research (BioMed Central, 2022-01-19) Allali, Imane; Abotsi, Regina E; Tow, Lemese A; Thabane, Lehana; Zar, Heather J; Mulder, Nicola M; Nicol, Mark P
An amendment to this paper has been published and can be accessed via the original article.
- Item (Open Access): Information content-based gene ontology semantic similarity approaches: toward a unified framework theory (2013) Mazandu, Gaston K; Mulder, Nicola J
Several approaches have been proposed for computing term information content (IC) and semantic similarity scores within the gene ontology (GO) directed acyclic graph (DAG). These approaches contributed to improving protein analyses at the functional level. Considering the recent proliferation of these approaches, a unified theory in a well-defined mathematical framework is necessary in order to provide a theoretical basis for validating these approaches. We review the existing IC-based ontological similarity approaches developed in the context of biomedical and bioinformatics fields to propose a general framework and unified description of all these measures. We have conducted an experimental evaluation to assess the impact of IC approaches, different normalization models, and correction factors on the performance of a functional similarity metric. Results reveal that considering only parents or only children of terms when assessing information content or semantic similarity scores negatively impacts the approach under consideration. This study produces a unified framework for current and future GO semantic similarity measures and provides a theoretical basis for comparing different approaches. The experimental evaluation of different approaches based on different term information content models paves the way towards a solution to the issue of scoring a term’s specificity in the GO DAG.
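As an illustration of the kind of measure the abstract above surveys, the following minimal sketch computes an annotation-based term information content, IC(t) = -log p(t), and a Resnik-style similarity (the IC of the most informative common ancestor) on a toy DAG. The term names and annotation counts are hypothetical, not real GO identifiers, and the descendant counting is a simplification; this is only one of the IC families the paper reviews, not the unified framework it proposes.

```python
import math

# Hypothetical toy DAG: each term maps to its set of parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis", "binding"},
}

# Hypothetical number of annotations made directly to each term.
DIRECT = {"root": 0, "binding": 2, "catalysis": 1, "dna_binding": 4, "kinase": 3}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_frequency(term):
    """Count annotations to the term or to any of its descendants."""
    return sum(n for t, n in DIRECT.items() if term in ancestors(t))


def information_content(term):
    """IC(t) = -log p(t), with p(t) relative to the root's frequency."""
    return -math.log(term_frequency(term) / term_frequency("root"))


def resnik(a, b):
    """IC of the most informative ancestor shared by both terms."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)
```

In this toy example, more specific terms such as `kinase` receive a higher IC than their parents, and `resnik("dna_binding", "kinase")` equals the IC of `binding`, their only shared non-root ancestor.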
- Item (Open Access): Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification (2023) Swanepoel, Phillip; Martin, Darrin
Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus to (1) simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program Recombination Detection Program 5 (RDP5), specifically the identification of the recombinant from a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5, with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align.
- Item (Open Access): The identification of cytotoxic T lymphocyte (CTL) escape in a large, longitudinal subtype C HIV-1 sequence dataset (2023) Mphahlele, Ruth; Williamson, Carolyn; Martin, Darren
Human Immunodeficiency Virus (HIV) rapidly escapes cytotoxic T lymphocyte (CTL) immune responses exerted by the host. Mutation patterns and HLA-associated footprints linked to viral escape have been identified, making it possible to use viral sequence data, combined with the host HLA allele information, to predict escape. Next-Generation Sequencing (NGS) approaches enable the generation of large sequence datasets, and the detection of viral populations present at very low frequencies in an infected individual at any given time. These datasets allow for the study of changes in viral populations within a host over time and provide a means to understand the kinetics and pathway(s) of escape. While tools exist that allow the prediction of escape in sequence data with small sequence numbers per sampling timepoint, these tools often have limitations in analysing large NGS data sets. In this project, we developed a workflow for identifying the kinetics of CTL escape in longitudinal HIV-1 next-generation datasets of gag sequences generated using an Illumina MiSeq platform over the duration of drug-naïve infection. The data set was generated from 15 women over a period of one to seven years and comprised 4583 short-read gag sequences (544 bp). We identified tools for identifying CTL escape in deep sequencing datasets and used pre-defined criteria to screen these tools. The outputs were validated using a test dataset from a previous study that identified escape. We selected the Epitope Matcher tool as having the most potential to identify CTL epitopes and escape mutations.
To further support evidence of escape and identify additional putative escape mutations, we identified sites with high Shannon entropy (>=0.25) and sites evolving under positive selection using HyPhy FUBAR. The sites were verified using the HLA association, CTL epitope variant and escape mutation lists, or data generated by Epitope Matcher. Using the Epitope Matcher tool, we identified seven HLA-B restricted gag epitopes in six individuals, with putative escape identified in all seven epitopes, commonly occurring in the late chronic phase of infection. The most common epitope in the population was YL9 (Gag HXB2 coordinates 296 to 304), found in 60% of the participants and restricted by HLA B*15:03, B*15:10 and B*42:01. Toggling of amino acids within epitopes, as a result of a potential fitness cost associated with a specific change, was observed in five of the seven epitopes. We further identified 35 high Shannon entropy sites, nine of which were found within epitopes identified by Epitope Matcher. Additionally, nine of the high Shannon entropy sites were evolving under positive selection. With supporting evidence, we can predict that the mutation T310S (found in the AW11 epitope, restricted by allele B*58:01) is likely to be associated with escape. This study is important in that it provides a pipeline that will enable semiautomated analysis of NGS data. Using this approach, we have provided a better understanding of the kinetics and frequency of CTL escape over the course of HIV infection. Additionally, we have identified frequently targeted sites across the Gag p24 region and across individuals. This study is relevant to inform CTL-based vaccine prevention and treatment strategies.
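The high-entropy site screen described in the abstract above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: it assumes an already aligned set of equal-length sequences, uses the natural logarithm (the abstract does not state the log base), ignores gap characters, and applies the 0.25 cutoff quoted in the text. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def high_entropy_sites(alignment, threshold=0.25):
    """Return 1-based alignment positions with entropy at or above the threshold."""
    length = len(alignment[0])
    sites = []
    for i in range(length):
        column = "".join(seq[i] for seq in alignment)
        if column_entropy(column) >= threshold:
            sites.append(i + 1)
    return sites
```

For example, in the four-sequence toy alignment `["ACT", "ACT", "AGT", "ACT"]`, only the second column varies, so it is the only position reported.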
- Item (Open Access): Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification (2025) Cullinan, Joshua; Martin, Darrin
This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy.
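For readers unfamiliar with the metrics named in the abstract above, the following minimal sketch shows how precision, recall and F1 are computed for the positive (here, recombinant) class. The label encoding (1 = recombinant, 0 = parental) and the function name are assumptions for illustration; the thesis's own evaluation code is not shown in this listing.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Because recombinants are the minority class, a model that labels everything as parental can still reach high accuracy while scoring zero on all three of these metrics, which is why they are preferred over accuracy for imbalanced data.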