Browsing by Subject "Bioinformatics"
- Item, Open Access: An African Genome Variation Database and its applications in human diversity and health (2021) Todt, Davis; Mulder, Nicola. African genomes exhibit the highest levels of sequence and haplotype diversity of all extant human populations. A combination of historical and geographical factors has contributed to the high level of genetic diversity in ancestral populations in Africa. Additionally, a series of concomitant migration events out of Africa, with founder populations harbouring only a subset of this genetic variation, have contributed to the relatively lower genetic diversity observed in non-Africans. Population genetic studies have refined our understanding of human evolutionary history and clinical genomic studies have resulted in improved patient outcomes. However, despite the increased throughput and decreased cost afforded by next-generation sequencing (NGS) and despite the relatively higher genetic variation in Africans, relatively little of the genomic data currently available is representative of diverse African populations. This may result in adverse outcomes in the context of minority populations with little representation in clinical databases. Given the under-representation of African genetic variation and the importance of highlighting and further characterizing it, the objectives of this project were to design, develop and deploy a proof of concept database and web application for the storage, analysis and visualization of African genetic variant data – the African Genome Variation Database (AGVD). The AGVD was developed according to software industry design standards. The project also explored available genomic tools and databases in order to leverage existing software solutions where suitable. Additionally, relevant data sets were identified for use during testing and validation of the pilot phase of the project.
To this end, the open access 1000 Genomes Project phase 3 dataset was selected and the genotypes for several chromosomes were loaded into the AGVD. The AGVD leverages the scalable, performant, and open source genomics engine OpenCGA for data storage and analysis. A custom front-end web application was developed by applying a novel approach to render and serve static Vue JS assets from the Python Flask microframework. The web application supports rich data search and filtering operations of loaded variants and allows end-users to visualize annotations of genomic loci and allele change, variant type, associated gene and transcript consequences, clinical significance, and allele frequency information for all annotated cohorts in a highly interactive manner. A bespoke REST API also supports future analytical functionality. The AGVD has demonstrated proof of concept in the secure and scalable storage and visualization of African genomic data, providing a viable solution for H3ABioNet to further extend in future iterations of the project and a valuable resource for researchers to explore African genetic variation.
- Item, Open Access: Applying, Evaluating and Refining Bioinformatics Core Competencies (An Update from the Curriculum Task Force of ISCB's Education Committee) (Public Library of Science, 2016) Welch, Lonnie; Brooksbank, Cath; Schwartz, Russell; Morgan, Sarah L; Gaeta, Bruno; Kilpatrick, Alastair M; Mietchen, Daniel; Moore, Benjamin L; Mulder, Nicola; Pauley, Mark; Pearson, William; Radivojac, Predrag; Rosenberg, Naomi; Rosenwald, Anne; Rustici, Gabriella; Warnow, Tandy
- Item, Open Access: A bioinformatic study on the feasibility of a cross-species proteomics analyses of mycobacteria (2013) Rajaonarifara, Elinambinina; Blackburn, Jonathan; Mulder, Nicola. Includes abstract. Includes bibliographical references.
- Item, Open Access: "Broadband" bioinformatics skills transfer with the Knowledge Transfer Programme (KTP): educational model for upliftment and sustainable development (Public Library of Science, 2015) Chimusa, Emile R; Mbiyavanga, Mamana; Masilela, Velaphi; Kumuthini, Judit. A shortage of practical skills and relevant expertise is possibly the primary obstacle to social upliftment and sustainable development in Africa. The "omics" fields, especially genomics, are increasingly dependent on the effective interpretation of large and complex sets of data. Despite abundant natural resources and population sizes comparable with many first-world countries from which talent could be drawn, countries in Africa still lag far behind the rest of the world in terms of specialized skills development. Moreover, there are serious concerns about disparities between countries within the continent. The multidisciplinary nature of the bioinformatics field, coupled with rare and depleting expertise, is a critical problem for the advancement of bioinformatics in Africa. We propose a formalized matchmaking system, which is aimed at reversing this trend, by introducing the Knowledge Transfer Programme (KTP). Instead of individual researchers travelling to other labs to learn, researchers with desirable skills are invited to join African research groups for six weeks to six months. Visiting researchers or trainers will pass on their expertise to multiple people simultaneously in their local environments, thus increasing the efficiency of knowledge transference. In return, visiting researchers have the opportunity to develop professional contacts, gain industry work experience, work with novel datasets, and strengthen and support their ongoing research. The KTP develops a network with a centralized hub through which groups and individuals are put into contact with one another and exchanges are facilitated by connecting both parties with potential funding sources.
This is part of the PLOS Computational Biology Education collection.
- Item, Open Access: Characterisation of the metabolome of Mycobacterium tuberculosis to identify new pathways and pathway holes (2014) Wolfenden, Kristen Marie; Mulder, Nicola. Due to high incidence rates and the development of new drug-resistant or multidrug-resistant strains of TB, the development of new medicines and treatments for tuberculosis is a necessity. In order to develop these drugs, Mycobacterium tuberculosis (Mtb) needs to be studied more completely; this study performs a characterisation of the metabolome of Mtb and comparison across the phylogenetic profile to identify notable pathways.
- Item, Open Access: Creating and analysing an African pan-genome (2022) Bourn, Jessica Jean; Mulder, Nicola. The human reference genome is currently a core resource for understanding the role of genetics in human health, disease, and variation, and has been invaluable in the development of clinical and computational tools for these purposes. However, the limited number of individual genomes used to create the reference has resulted in an underrepresentation of the extensive genetic diversity present in different human populations. Since an important use of the reference genome is to identify genetic variants that may be implicated in disease, this lack of diversity could limit the scientific utility of the reference for ethnic groups that are poorly represented in it. As a result, adaptations to the reference genome structure have been proposed. One such proposal has been the use of multiple reference genomes, each of which represents a different human population. A logical and highly practical method of achieving this is through the use of a pan-genome, which is a curated collection of all the DNA sequences that are found within a population under study. Despite the fact that African populations exhibit the greatest genetic diversity and variation in the world, the many and sometimes ancient ethnolinguistic groups from Africa are among those least represented within the reference genome. Consequently, this study aimed to explore the feasibility of creating and analysing an African pan-genome, and to begin developing tools to achieve this. Several distinct African regional ancestral groups – namely east African Nilo-Saharan, east African Afro-Asiatic, far west Niger-Congo, central west Niger-Congo, Bantu-speaking Niger-Congo, central African rainforest hunter-gatherer, and the Khoe and San – have previously been identified, and this study included and analysed samples from each group in order to assemble a more inclusive and representative pan-genome.
A software pipeline developed by Duan et al. (2019), termed the HUman Pan-genome ANalysis (HUPAN) pipeline, was used here to assemble the African pan-genome. As the HUPAN pipeline was originally designed to analyse only single populations, the inclusion of multiple populations required modifications and improvements, which were implemented following the testing and analysis of the pipeline using a smaller dataset of whole genome sequences. Subsequently, a final dataset of 168 African high- and medium-coverage whole genome sequences representing the seven separate regional ancestral groups was submitted to the adapted HUPAN pipeline. For each group, nucleotide sequences that were absent from the human reference genome were assembled and extracted, which resulted in the identification of 43.37 Mbp of non-redundant non-reference genomic sequence and 31 novel predicted protein-coding genes from African individuals. Alignment to other pan-genome sequences, whole genomes from different human populations, and the complete telomere-to-telomere human genome validated a large portion of the sequences as non-reference and confirmed that the dataset contained sequences specific to African populations. However, the gene presence-absence variation analysis of the pan-genome within all 168 samples revealed patterns of gene presence and absence that were strongly correlated with the sample dataset of origin, rather than with the ancestral group of origin. This hindered the identification of genuine genetic variation specific to the groups analysed. Further, it appears that previous pan-genomic research has not investigated the degree to which the genetic variation identified is dataset-specific or truly population-specific. Consequently, the failure to acknowledge and account for the effects of spurious inter-dataset variation in previous pan-genomic research indicates that those analyses may be incomplete or ambiguous.
This, therefore, calls into question the methods currently used for pan-genomic research, and highlights that robust, standardised methods for human pan-genome research must be agreed on to ensure that comprehensive population-specific pan-genomes are produced in the future. Despite this inherent weakness of pan-genomic research, this study successfully enabled the creation and analysis of a comprehensive and inclusive African pan-genome. Unique sets of non-reference sequences specific to African regional ancestral groups were identified and obtained, enabling the assembly of a non-redundant set of pan-African non-reference sequences. Furthermore, certain complex but previously unconsidered aspects of pan-genome research were identified and explored, and these observations may play a role in the advancement of pan-genome research in future.
- Item, Open Access: Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics (BioMed Central, 2018-11-29) Baichoo, Shakuntala; Souilmi, Yassine; Panji, Sumir; Botha, Gerrit; Meintjes, Ayton; Hazelhurst, Scott; Bendou, Hocine; Beste, Eugene d; Mpangase, Phelelani T; Souiai, Oussema; Alghali, Mustafa; Yi, Long; O’Connor, Brian D; Crusoe, Michael; Armstrong, Don; Aron, Shaun; Joubert, Fourie; Ahmed, Azza E; Mbiyavanga, Mamana; Heusden, Peter v; Magosi, Lerato E; Zermeno, Jennie; Mainzer, Liudmila S; Fadlelmola, Faisal M; Jongeneel, C. V; Mulder, Nicola. Background: The Pan-African bioinformatics network, H3ABioNet, comprises 27 research institutions in 17 African countries. H3ABioNet is part of the Human Health and Heredity in Africa program (H3Africa), an African-led research consortium funded by the US National Institutes of Health and the UK Wellcome Trust, aimed at using genomics to study and improve the health of Africans. A key role of H3ABioNet is to support H3Africa projects by building bioinformatics infrastructure such as portable and reproducible bioinformatics workflows for use on heterogeneous African computing environments. Processing and analysis of genomic data is an example of a big data application requiring complex interdependent data analysis workflows. Such bioinformatics workflows take the primary and secondary input data through several computationally-intensive processing steps using different software packages, where some of the outputs form inputs for other steps. Implementing scalable, reproducible, portable and easy-to-use workflows is particularly challenging. Results: H3ABioNet has built four workflows to support (1) the calling of variants from high-throughput sequencing data; (2) the analysis of microbial populations from 16S rDNA sequence data; (3) genotyping and genome-wide association studies; and (4) single nucleotide polymorphism imputation.
A week-long hackathon was organized in August 2016 with participants from six African bioinformatics groups, and US and European collaborators. Two of the workflows are built using the Common Workflow Language framework (CWL) and two using Nextflow. All the workflows are containerized for improved portability and reproducibility using Docker, and are publicly available for use by members of the H3Africa consortium and the international research community. Conclusion: The H3ABioNet workflows have been implemented with a view to offering ease of use for the end user and high levels of reproducibility and portability, all while following modern state-of-the-art bioinformatics data processing protocols. The H3ABioNet workflows will service the H3Africa consortium projects and are currently in use. All four workflows are also publicly available for research scientists worldwide to use and adapt for their respective needs. The H3ABioNet workflows will help develop bioinformatics capacity and assist genomics research within Africa and serve to increase the scientific output of H3Africa and its Pan-African Bioinformatics Network.
- Item, Open Access: The development of computational biology in South Africa: successes achieved and lessons learnt (Public Library of Science, 2016) Mulder, Nicola J; Christoffels, Alan; De Oliveira, Tulio; Gamieldien, Junaid; Hazelhurst, Scott; Joubert, Fourie; Kumuthini, Judit; Pillay, Ché S; Snoep, Jacky L; Bishop, Özlem Tastan; Tiffin, Nicki. Bioinformatics is now a critical skill in many research and commercial environments as biological data are increasing in both size and complexity. South African researchers recognized this need in the mid-1990s and responded by working with the government as well as international bodies to develop initiatives to build bioinformatics capacity in the country. Significant injections of support from these bodies provided a springboard for the establishment of computational biology units at multiple universities throughout the country, which took on teaching, basic research and support roles. Several challenges were encountered, for example with unreliability of funding, lack of skills, and lack of infrastructure. However, the bioinformatics community worked together to overcome these, and South Africa is now arguably the leading country in bioinformatics on the African continent. Here we discuss how the discipline developed in the country, highlighting the challenges, successes, and lessons learnt.
- Item, Open Access: Evaluating the predictive performance of cytotoxic T lymphocyte epitope prediction tools using Elispot assay data (2018) Meraba, Rebone Leboreng; Martin, Darren P. Computational T-cell epitope prediction tools have been previously devised to predict potential human leukocyte antigen (HLA) binding peptides from protein sequences. These tools complement Enzyme-linked immunosorbent spot (ELISpot) assays - a very commonly applied immunological technique that is used both to identify regions of pathogen genomes that trigger an immune response and to characterize the relationships between an individual's complement of HLA alleles and the degree of immunity that they display. If computational tools could accurately predict HLA-peptide binding, then these tools might be usable as a cheap and reliable alternative to ELISpot assays. A web-based IFN γ ELISpot assay dataset sharing resource, called IMMUNO-SHARE, was developed to enable the simple and straightforward storage and dissemination amongst researchers of large volumes of IFN γ ELISpot assay data. Such experimental data was next used to make HLA-peptide binding predictions with four frequently used T-cell epitope prediction tools - netMHC 3.2, IEDB_ANN, IEDB_ARB Matrix and IEDB_SMM. The predictive performances of all four tools, individually and collectively, were statistically assessed using non-parametric Spearman rank-order correlation tests. It was found that none of the four tested tools yielded binding affinity predictions that were detectably correlated with the observed ELISpot data. High false positive rates, where high predicted binding affinities between peptides and patient HLAs corresponded in these patients with no appreciable immune responses, were apparent for all four of the tested methods.
The low degree of correlation between ELISpot data and HLA-peptide binding predictions and, in particular, high false positive rates and relatively low true positive and true negative rates, indicate that the four tested tools would require substantial improvement before they could be seen as a viable alternative to ELISpot assays. Given that the accuracy of predictions of each of the four methods tested is largely dependent on both the quantity and quality of known true binder and true non-binder datasets that were used to train the HLA-peptide binding prediction methods implemented by the tools, it is plausible that the accuracy of these tools could be increased with larger training datasets. Retraining either the current methods or the next generation of prediction tools would therefore be greatly facilitated by the availability of large quantities of publicly available HLA-peptide binding interaction information. It is hoped that IMMUNO-SHARE or some other ELISpot data sharing resource could eventually meet this need.
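For readers unfamiliar with the statistic named in the abstract above, the following is a minimal sketch of a Spearman rank-order correlation in plain Python. It is not the thesis's code: the function names and the example numbers are hypothetical, and the sketch computes only the coefficient (the Pearson correlation of tie-averaged ranks), not the significance test.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values for illustration only (not data from the thesis):
    # a predicted binding score per peptide and an observed ELISpot response.
    predicted_score = [0.91, 0.15, 0.78, 0.40, 0.66]
    elispot_response = [12.0, 30.0, 5.0, 55.0, 8.0]
    print(round(spearman_rho(predicted_score, elispot_response), 3))
```

Working on ranks rather than raw values is what makes the test non-parametric: only the ordering of predictions and responses matters, not their scale.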
- Item, Open Access: The evolutionary impacts of secondary structures within genomes of eukaryote-infecting single-stranded DNA viruses (2015) Muhire, Brejnev Muhizi; Martin, Darren P. Secondary structures forming through base-pairing in virus genomes have been proven to regulate several processes during viral replication cycles, including genome replication, transcription, post-transcriptional activities, protein synthesis, genome packaging, generation of viral sub-genomes and evasion of host-cell immune responses. Although computational DNA/RNA folding methods based on free energy minimisation approaches are capable of predicting structures that form within virus genomes, these methods are not entirely accurate. Notably, many of the structures that are accurately predicted will likely have no biological importance within the genomes in which they reside because even randomly generated single-stranded RNA/DNA sequences will form stable secondary structures. Nevertheless, with additional genome evolution analyses involving the detection of natural selection, sequence co-evolution, and genetic recombination, it is possible to both validate the existence of, and infer the biological importance of, computationally predicted structures. Here I implement and deploy free bioinformatics tools to (1) automate nucleotide and protein sequence classification into datasets useful for downstream molecular evolution analyses; (2) improve the accuracy of computational virus-genome-scale secondary structure prediction; (3) enable the identification of biologically relevant secondary structures using signals of purifying selection, coevolution and recombination within aligned sequence datasets; and (4) enable efficient visualisation of structural and selection data for better characterisation of individual secondary structural elements.
Using these tools I carried out large-scale studies that predicted and characterised novel functional secondary structures that potentially regulate transcription, translation, gene splicing, and replication within the genomes of eukaryote-infecting ssDNA viruses (Circoviridae, Anelloviridae, Parvoviridae, Nanoviridae, and Geminiviridae). I show that purifying selection tends to be stronger at base-paired sites than it is at unpaired sites and, wherever mutations are tolerable within paired regions, I demonstrate that there exist strong associations between base-pairing and complementary coevolution. Finally, I show that the recombinant genomes of some, but not all, eukaryote-infecting ssDNA virus groups display weak evidence of both homologous and non-homologous recombination breakpoints preferentially occurring at genome sites that minimally disrupt secondary structures. Altogether, these results suggest that natural selection acting to maintain important biologically functional secondary structural elements has been a major process during the evolution of eukaryote-infecting ssDNA viruses.
- Item, Open Access: A flexible R package for nonnegative matrix factorization (BioMed Central Ltd, 2010) Gaujoux, Renaud; Seoighe, Cathal. BACKGROUND: Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, including signal processing, face recognition and text mining. Recent applications of NMF in bioinformatics have demonstrated its ability to extract meaningful information from high-dimensional data such as gene expression microarrays. Developments in NMF theory and applications have resulted in a variety of algorithms and methods. However, most NMF implementations have been on commercial platforms, while those that are freely available typically require programming skills. This limits their use by the wider research community. RESULTS: Our objective is to provide the bioinformatics community with an open-source, easy-to-use and unified interface to standard NMF algorithms, as well as with a simple framework to help implement and test new NMF methods. For that purpose, we have developed a package for the R/BioConductor platform. The package ports public code to R, and is structured to enable users to easily modify and/or add algorithms. It includes a number of published NMF algorithms and initialization methods and facilitates the combination of these to produce new NMF strategies. Commonly used benchmark data and visualization methods are provided to help in the comparison and interpretation of the results. CONCLUSIONS: The NMF package helps realize the potential of Nonnegative Matrix Factorization, especially in bioinformatics, providing easy access to methods that have already yielded new insights in many applications. Documentation, source code and sample data are available from CRAN.
- Item, Open Access: Identification of the virulence gene of Mycobacterium tuberculosis (2007) Rabiu, Halimah Adenike; Mulder, Nicola. The major thrust of this project is to identify and characterize potential virulence genes from M. tuberculosis. To this end, we have compiled and integrated information from various public databases to catalogue 246,573 microbial genes from 84 organisms, including pathogens and non-pathogenic microbes. We determined the phylogenetic distributions by grouping the proteins into families based on sequence similarity with the aid of BLASTP and the NCBI BLASTClust program.
- Item, Open Access: Impact of the mycobiome on the health of the female genital tract (2022) Gangiah, Tamlyn Kirstey; Mulder, Nicola; Masson, Lindi. The female genital tract (FGT) microbiome comprises a community of microorganisms. Imbalances in the microbiome are associated with vaginal diseases such as bacterial vaginosis (BV) and sexually transmitted infections (STIs). These diseases greatly burden South Africa, and young women in this region are at an increased risk of contracting vaginal diseases. Consequently, it is vital to investigate the factors that influence FGT health. The fungal constituent of the microbiome (the mycobiome) has been demonstrated to play a role in regulating mucosal health, especially when the bacterial component is disturbed. However, we have a limited understanding of the vaginal mycobiome since many microbiome studies have focused on bacterial communities and have neglected low abundance taxonomic groups, such as fungi. To reduce this knowledge deficit, we present the first large-scale metaproteomic study to define the taxonomic composition and potential functional processes of the vaginal mycobiome in South African women. We examined vaginal fungal communities present in optimal and non-optimal states (BV, STIs, and genital inflammation) by collecting lateral vaginal wall swabs from 123 women for liquid chromatography-tandem mass spectrometry. Taxonomic analysis requires representative sequence databases; however, since mycobiome research is relatively new, fungal databases are still in their infancy. As a result, metaproteomic methods are not optimized for fungal research. Therefore, we optimized a metaproteomic approach to increase fungal protein group assignments. With this, 50 fungal proteins belonging to 39 different genera were identified post quality-control and analysed for taxonomic and functional distributions.
Taxonomic analysis revealed that the vaginal mycobiome had a high relative abundance of Candida across optimal and non-optimal states. We observed changes in differential abundance at the genus and biological process level between optimal and non-optimal states for BV and Mycoplasma genitalium. In the BV positive state, most fungal proteins were significantly underabundant (p < 0.05) compared to the BV negative state, with the exception of Malassezia and Conidiobolus. Correspondingly, Nugent score was negatively associated with total fungal protein intensity, implying that the microenvironment during BV is less suitable for fungal growth. Furthermore, we assessed which clinical variables were associated with driving fungal community composition; results indicated that Nugent score, pro-inflammatory cytokines, chemokines, vaginal pH, Chlamydia trachomatis, and the presence of clue cells were involved. Lastly, we used publicly available vaginal proteome data to confirm our fungal identifications and suspect Candida, Debaryomyces, Kluyveromyces, Malassezia, Penicillium, Yarrowia, Aspergillus, Cryptococcus, Wallemia, Trichosporon, and Saccharomyces are likely true vaginal inhabitants. Thus, this study sets the groundwork for understanding the vaginal mycobiome and its association with prevalent vaginal diseases.
- Item, Open Access: Influence of gut microbiota on immune system in infants (2017) Kachambwa, Paidamoyo; Mulder, Nicola. Background and Methods: Microbiota play many significant, direct or indirect, beneficial and detrimental roles in humans. Microbiome development is established at infancy, where diet plays a directive role in the proliferation of gut microbes. It has been shown that the presence of a defined set of microbes increases the overall immunological capacity, which vaccines depend on to be effective. To date, little work has been done on the effect of the microbiota on the immune system at infancy, thus an analysis of the microbial ecology present in the infant's gut and its correlation with immune activation is needed. Expression of genes involved in mediating and regulating immunity can be measured as an indicator of immune activity. Vaccines work by stimulating an immune response which can be measured by gene expression levels. This affects the infant's ability to establish a strong immune system, which is also dictated at infancy. 16S rRNA sequence data generated from 134 infant stool samples, at vaccination points 0, 6 and 14 weeks from infants that were either breast or formula fed, was analysed using the Quantitative Insights Into Microbial Ecology (QIIME) pipeline to detect different taxonomic groups that make up a particular microbiome. Statistical analysis in R was used to quantify the diversity of the different microbial groups in the gut. Expression levels of immune-related genes were measured from blood samples that were stimulated by a Bacillus Calmette–Guérin (BCG) antigen and correlated with microbiota compositions. Results and Conclusion: Microbiome data showed initial differentiation between breast and mixed fed infants. 15% of 5 of the most abundant bacteria for breast fed infants were Bifidobacteriales, which are known for their probiotic properties.
The data did not fully cluster as the oldest samples were taken quite early at 14 weeks. Individual bacteria were correlated with individual gene expression level data. The study shows the relative abundance of particular bacteria, comparing against feeding modality and demonstrated how the microbiota correlates with gene expression levels. At week 14, Bifidobacterium of abundance below 0 (heatmap log₁₀ scale) generally correlated with high CASP3 gene expression levels in breast fed babies while abundances above 1 correlated with low gene expression levels. Gene expression at abnormal levels usually has undesirable effects which result in dysfunctional immune reactions that lead to conditions ranging from autoimmune diseases to cancer.
- Item, Open Access: Integration and visualisation of data in bioinformatics (2015) Salazar, Gustavo A; Mulder, Nicola. The most recent advances in laboratory techniques aimed at observing and measuring biological processes are characterised by their ability to generate large amounts of data. The more data we gather, the greater the chance of finding clues to understand the systems of life. This, however, is only true if the methods that analyse the generated data are efficient, effective, and robust enough to overcome the challenges intrinsic to the management of big data. The computational tools designed to overcome these challenges should also take into account the requirements of current research. Science demands specialised knowledge for understanding the particularities of each study; in addition, it is seldom possible to describe a single observation without considering its relationship with other processes, entities or systems. This thesis explores two closely related fields: the integration and visualisation of biological data. We believe that these two branches of study are fundamental in the creation of scientific software tools that respond to the ever increasing needs of researchers. The distributed annotation system (DAS) is a community project that supports the integration of data from federated sources and its visualisation on web and stand-alone clients. We have extended the DAS protocol to improve its search capabilities and also to support feature annotation by the community. We have also collaborated on the implementation of MyDAS, a server to facilitate the publication of biological data following the DAS protocol, and contributed to the design of the protein DAS client called DASty. Furthermore, we have developed a tool called probeSearcher, which uses the DAS technology to facilitate the identification of microarray chips that include probes for regions on proteins of interest.
Another community project in which we participated is BioJS, an open source library of visualisation components for biological data. This thesis includes a description of the project, our contributions to it and some developed components that are part of it. Finally, and most importantly, we combined several BioJS components over a modular architecture to create PINV, a web-based visualiser of protein-protein interaction (PPI) networks that takes advantage of the features of modern web technologies in order to explore PPI datasets on an almost ubiquitous platform (the web) and facilitates collaboration between scientific peers. This thesis includes a description of the design and development processes of PINV, as well as current use cases that have benefited from the tool and whose feedback has been the source of several improvements to PINV. Collectively, this thesis describes novel software tools that, by using modern web technologies, facilitate the integration, exploration and visualisation of biological data, which has the potential to contribute to our understanding of the systems of life.
- Item Open Access Investigation of HIV-TB co-infection through analysis of the potential impact of host genetic variation on host-pathogen protein interactions (2022) Heekes, Alexa Storme; Mulder, Nicola
HIV and Mycobacterium tuberculosis (Mtb) co-infection causes treatment and diagnostic difficulties, placing a major burden on health care systems in settings with high prevalence of both infectious diseases, such as South Africa. Human genetic variation adds further complexity, with variants affecting disease susceptibility and response to treatment. The identification of variants in African populations is affected by reference mapping bias, especially in complex regions like the Major Histocompatibility Complex (MHC), which plays an important role in the immune response to HIV and Mtb infection. We used a graph-based approach to identify novel variants in the MHC region within African samples without mapping to the canonical reference genome. We generated a host-pathogen functional interaction network made up of inter- and intraspecies protein interactions, gene expression during co-infection, drug-target interactions, and human genetic variation. Differential expression and network centrality properties were used to prioritise proteins that may be important in co-infection. Using the interaction network we identified 28 human proteins that interact with both pathogens ("bridge" proteins). Network analysis showed that while MHC proteins did not have significantly higher centrality measures than non-MHC proteins, bridge proteins had significantly shorter distances to MHC proteins. Proteins that were significantly differentially expressed during co-infection or contained variants clinically associated with HIV or TB also had significantly stronger network properties. Finally, we identified common and consequential variants within prioritised proteins that may be clinically associated with HIV and TB.
The integrated network was extensively annotated and stored in a graph database that enables rapid and high throughput prioritisation of sets of genes or variants, facilitates detailed investigations and allows network-based visualisation.
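The "bridge" protein definition in this abstract (human proteins that interact with both pathogens) can be illustrated with a minimal sketch. The edge format, the function name `find_bridge_proteins`, and the toy identifiers `H1`, `H2`, `H3` are assumptions made for illustration only; they are not taken from the thesis, which stores its network in a graph database.

```python
from collections import defaultdict

def find_bridge_proteins(interactions, required=("HIV", "Mtb")):
    """Return human proteins that interact with every pathogen in `required`.

    `interactions` is an iterable of (human_protein, pathogen, pathogen_protein)
    tuples. This layout is a hypothetical simplification of a host-pathogen
    interaction network.
    """
    pathogens_by_host = defaultdict(set)
    for human_protein, pathogen, _pathogen_protein in interactions:
        pathogens_by_host[human_protein].add(pathogen)
    wanted = set(required)
    # A bridge protein has at least one interaction partner from each pathogen.
    return sorted(p for p, seen in pathogens_by_host.items() if wanted <= seen)

# Toy data with made-up identifiers, not real interactions.
toy_interactions = [
    ("H1", "HIV", "hiv_a"),
    ("H1", "Mtb", "mtb_a"),
    ("H2", "HIV", "hiv_b"),
    ("H3", "Mtb", "mtb_b"),
]
bridges = find_bridge_proteins(toy_interactions)
```

In this toy network only `H1` has partners from both pathogens, so it is the only protein returned.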
- Item Open Access MScanner: a classifier for retrieving Medline citations (BioMed Central Ltd, 2008) Poulter, Graham; Rubin, Daniel; Altman, Russ; Seoighe, Cathal
BACKGROUND: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. RESULTS: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross-validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross-validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic.
In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. CONCLUSION: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.
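As an illustration of the kind of classifier this abstract describes (a Bayesian classifier over MeSH terms and journal of publication, ranking records by relevance), here is a minimal naive Bayes sketch. It is not MScanner's implementation: the Laplace smoothing, the decision to score only features that are present, the feature strings, and the PubMed IDs are all assumptions chosen for the example.

```python
import math
from collections import Counter

def train(relevant, background, alpha=1.0):
    """Learn a log-odds weight per feature from two sets of documents.

    Each document is a set of feature strings such as MeSH terms or a
    journal name. `alpha` is a Laplace smoothing constant (an assumption).
    """
    rel_counts = Counter(f for doc in relevant for f in set(doc))
    bg_counts = Counter(f for doc in background for f in set(doc))
    weights = {}
    for feature in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[feature] + alpha) / (len(relevant) + 2 * alpha)
        p_bg = (bg_counts[feature] + alpha) / (len(background) + 2 * alpha)
        weights[feature] = math.log(p_rel / p_bg)
    return weights

def score(doc, weights):
    """Sum the log-odds of the features present; unseen features add 0."""
    return sum(weights.get(f, 0.0) for f in set(doc))

def rank(candidates, weights):
    """Return candidate IDs ordered by decreasing relevance score."""
    return sorted(candidates, key=lambda pmid: score(candidates[pmid], weights),
                  reverse=True)

# Toy corpus with made-up features and IDs.
relevant_docs = [{"mesh:A", "journal:X"}, {"mesh:A"}]
background_docs = [{"mesh:B"}, {"mesh:B", "journal:Y"}]
weights = train(relevant_docs, background_docs)
ranking = rank({"1": {"mesh:A"}, "2": {"mesh:B"}}, weights)
```

Because each document reduces to a small set of features and scoring is a sum of precomputed weights, this style of classifier stays cheap per record, which is consistent with the abstract's emphasis on a concise document representation for speed.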
- Item Open Access MyDas, an extensible Java DAS server (Public Library of Science, 2012) Salazar, Gustavo A; García, Leyla J; Jones, Philip; Jimenez, Rafael C; Quinn, Antony F; Jenkinson, Andrew M; Mulder, Nicola; Martin, Maria; Hunter, Sarah; Hermjakob, Henning
A large number of diverse, complex, and distributed data resources are currently available in the Bioinformatics domain. The pace of discovery and the diversity of information mean that centralised reference databases like UniProt and Ensembl cannot integrate all potentially relevant information sources. From a user perspective, however, centralised access to all relevant information concerning a specific query is essential. The Distributed Annotation System (DAS) defines a communication protocol to exchange annotations on genomic and protein sequences; this standardisation enables clients to retrieve data from a myriad of sources, thus offering centralised access to end-users. We introduce MyDas, a web server that facilitates the publishing of biological annotations according to the DAS specification. It deals with the common functionality requirements of making data available, while also providing an extension mechanism in order to implement the specifics of data store interaction. MyDas allows the user to define where the required information is located along with its structure, and is then responsible for the communication protocol details.
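To make the idea of a standardised annotation request concrete, the sketch below builds a DAS-style `features` URL of the form `.../das/<source>/features?segment=<id>:<start>,<stop>`, which is how the DAS 1.x specification is usually described. The server address, the source name `mysource`, the segment identifier `SEQ1`, and the helper `das_features_url` are hypothetical; the abstract does not give any of them, and a real MyDas deployment should be checked against the specification.

```python
def das_features_url(base_url, source, segment_id, start=None, stop=None):
    """Build a DAS-style 'features' request URL for one segment.

    `base_url` is assumed to already end in the server's '/das' prefix.
    When `start` and `stop` are omitted the whole segment is requested.
    """
    if (start is None) != (stop is None):
        raise ValueError("start and stop must be given together")
    segment = segment_id if start is None else f"{segment_id}:{start},{stop}"
    return f"{base_url.rstrip('/')}/{source}/features?segment={segment}"

# Hypothetical server and source names, for illustration only.
url = das_features_url("http://example.org/das", "mysource", "SEQ1", 1, 100)
```

Because every compliant server answers the same request shape, a client can query many sources with one code path, which is the centralised-access benefit the abstract describes.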
- Item Open Access Network-based approach for post genome-wide association study analysis in admixed populations (2014) Mbiyavanga, Mamana; Mulder, Nicola
In this project, we review some existing pathway-based approaches for GWA study analyses by exploring different implemented methods for combining the effects of multiple modest genetic variants at gene and pathway levels. We then propose a graph-based method, ancGWAS, that incorporates the signal from a GWA study and the locus-specific ancestry into the human protein-protein interaction (PPI) network to identify significant sub-networks or pathways associated with the trait of interest. This network-based method applies centrality measures within linkage disequilibrium (LD) on the network to search for pathways and applies a scoring summary statistic on the resulting pathways to identify the most enriched pathways associated with complex diseases.
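One widely used way of combining several modest variant-level signals into a gene-level signal is Fisher's method, sketched below. This is a generic illustration and is not claimed to be the statistic ancGWAS uses; Fisher's method also assumes independent p-values, which variants in LD violate. The SNP and gene identifiers and the function names are made up for the example.

```python
import math
from collections import defaultdict

def fisher_combined_p(p_values):
    """Combine p-values (each in (0, 1]) with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null, whose upper tail has the closed
    form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    half = -sum(math.log(p) for p in p_values)  # half of the test statistic
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def gene_level_p(snp_p_values, snp_to_gene):
    """Group SNP p-values by gene and combine each group."""
    by_gene = defaultdict(list)
    for snp, p in snp_p_values.items():
        by_gene[snp_to_gene[snp]].append(p)
    return {gene: fisher_combined_p(ps) for gene, ps in by_gene.items()}

# Toy input with hypothetical identifiers.
snp_p = {"snp1": 0.5, "snp2": 0.5, "snp3": 0.05}
mapping = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B"}
gene_p = gene_level_p(snp_p, mapping)
```

With a single variant the combined value equals the input p-value, and with two variants it reduces to p1·p2·(1 − ln(p1·p2)), which gives a quick sanity check on the implementation.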
- Item Open Access Prevalence and frequency spectra of single nucleotide polymorphisms at exon-intron junctions of human genes (2008) Lupindo, Bukiwe; Seoighe, Cathal
In humans and other higher eukaryotes the observation of multiple splice isoforms for a given gene is common. However, it is not clear whether all of these alternatively spliced isoforms are a product of true alternative splicing or whether some are due to DNA sequence variations in human populations. Genetic variations that affect splicing have been shown to cause variation in splicing patterns and potentially are an important source of phenotypic variability among humans. Furthermore, variation in disease susceptibility and manifestation between individuals is often associated with genetic polymorphisms that determine the way in which genes are spliced. Hence, identification of genetic polymorphisms that might affect the way in which pre-mRNAs are spliced is an area of great interest.