Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
| dc.contributor.advisor | Martin, Darrin | |
| dc.contributor.author | Swanepoel, Phillip | |
| dc.date.accessioned | 2024-06-05T13:17:14Z | |
| dc.date.available | 2024-06-05T13:17:14Z | |
| dc.date.issued | 2023 | |
| dc.date.updated | 2024-06-05T12:51:27Z | |
| dc.description.abstract | Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus (1) to simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) to use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program Recombination Detection Program 5 (RDP5), specifically the identification of the recombinant from a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5, with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align. | |
| dc.identifier.apacitation | Swanepoel, P. (2023). <i>Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification</i>. Faculty of Health Sciences, Computational Biology Division. Retrieved from http://hdl.handle.net/11427/39870 | en_ZA |
| dc.identifier.chicagocitation | Swanepoel, Phillip. <i>"Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification."</i> Faculty of Health Sciences, Computational Biology Division, 2023. http://hdl.handle.net/11427/39870 | en_ZA |
| dc.identifier.citation | Swanepoel, P. 2023. Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification. Faculty of Health Sciences, Computational Biology Division. http://hdl.handle.net/11427/39870 | en_ZA |
| dc.identifier.ris | TY - Thesis / Dissertation AU - Swanepoel, Phillip AB - Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus (1) to simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) to use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program Recombination Detection Program 5 (RDP5), specifically the identification of the recombinant from a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5, with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align. DA - 2023 DB - OpenUCT DP - University of Cape Town KW - Medicine LK - https://open.uct.ac.za PY - 2023 T1 - Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification TI - Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification UR - http://hdl.handle.net/11427/39870 ER - | en_ZA |
| dc.identifier.uri | http://hdl.handle.net/11427/39870 | |
| dc.identifier.vancouvercitation | Swanepoel P. Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification. Faculty of Health Sciences, Computational Biology Division, 2023 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/39870 | en_ZA |
| dc.language.rfc3066 | eng | |
| dc.publisher.department | Computational Biology Division | |
| dc.publisher.faculty | Faculty of Health Sciences | |
| dc.subject | Medicine | |
| dc.title | Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification | |
| dc.type | Thesis / Dissertation | |
| dc.type.qualificationlevel | Masters | |
| dc.type.qualificationlevel | MSc | |