Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
Thesis / Dissertation
2023
Abstract
Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Detecting recombination is therefore an important computational step in understanding the evolutionary history of nucleotide sequences, and accurately identifying recombinant sequences is particularly important for downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires simulated sequence data, and training statistical learning models requires large, realistic datasets. The goals of this study were thus (1) to simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) to use these datasets to improve one of the computational steps in recombination analysis performed by the computer program Recombination Detection Program 5 (RDP5), specifically the identification of the recombinant in a recombinant/parent/parent triplet.
Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets, which were then used to train a number of machine learning models to differentiate recombinants from their parental sequences. The systems built from these models identified recombinants substantially more accurately than the method currently implemented in RDP5, with accuracy gains of up to 26 percentage points.
Availability and implementation. Our simulation software is a fork of SANTA-SIM, written in Java. All source code is available at https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align
Reference:
Swanepoel, P. 2023. Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification. Faculty of Health Sciences, Computational Biology Division. http://hdl.handle.net/11427/39870