Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification

dc.contributor.advisor: Martin, Darrin
dc.contributor.author: Cullinan, Joshua
dc.date.accessioned: 2025-11-06T07:22:35Z
dc.date.available: 2025-11-06T07:22:35Z
dc.date.issued: 2025
dc.date.updated: 2025-11-06T07:12:26Z
dc.description.abstract: This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy.
dc.identifier.apacitation: Cullinan, J. (2025). Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification. Faculty of Health Sciences, Computational Biology Division. Retrieved from http://hdl.handle.net/11427/42113 en_ZA
dc.identifier.chicagocitation: Cullinan, Joshua. "Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification." Faculty of Health Sciences, Computational Biology Division, 2025. http://hdl.handle.net/11427/42113 en_ZA
dc.identifier.citation: Cullinan, J. 2025. Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification. Faculty of Health Sciences, Computational Biology Division. http://hdl.handle.net/11427/42113 en_ZA
dc.identifier.ris:
TY - Thesis / Dissertation
AU - Cullinan, Joshua
AB - This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy.
DA - 2025
DB - OpenUCT
DP - University of Cape Town
KW - Medicine
LK - https://open.uct.ac.za
PY - 2025
T1 - Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
TI - Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
UR - http://hdl.handle.net/11427/42113
ER - en_ZA
dc.identifier.uri: http://hdl.handle.net/11427/42113
dc.identifier.vancouvercitation: Cullinan J. Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification. Faculty of Health Sciences, Computational Biology Division, 2025 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/42113 en_ZA
dc.language.iso: en
dc.language.rfc3066: eng
dc.publisher.department: Computational Biology Division
dc.publisher.faculty: Faculty of Health Sciences
dc.subject: Medicine
dc.title: Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
dc.type: Thesis / Dissertation
dc.type.qualificationlevel: Masters
dc.type.qualificationlevel: MSc
Files
Original bundle
Name: thesis_hsf_2025_cullinan joshua.pdf
Size: 8.14 MB
Format: Adobe Portable Document Format