Exploring topological data analysis in gene expression data: topology-driven biomarker discovery and clinical outcome prediction in oncology
Thesis / Dissertation
2025
Publisher
University of Cape Town
Abstract
This thesis is grounded in the fundamental observation that biological data has shape, and this shape matters. Beneath the high-dimensional, often noisy landscape of gene expression profiles lie hidden topological structures (connected components, loops, and voids) that capture the complex relationships driving cancer development and progression. By embracing this perspective, we position Topological Data Analysis (TDA) and persistent homology at the core of a novel analytical framework designed to tackle two key challenges in cancer research: clinical outcome prediction and biomarker discovery. In this study, we employ Weighted Gene Topological Data Analysis (WGTDA) to extract topological features from gene expression data, which serve as prognostic biomarkers for cancer classification, staging, and treatment response. Moreover, by integrating these topological features with machine learning models, we aim to enhance the predictive accuracy for clinical outcomes. For clinical outcome prediction, we transformed gene expression profiles into topological fingerprints using multiple co-expression measures: Pearson Correlation, Distance Correlation, and Weighted Topological Overlap (wTO) computed with both Pearson and Distance-based adjacencies. These topological features were analyzed using Random Forests. In parallel, we compared the predictive performance of traditional machine learning models (SVM, Gradient Boosting Decision Trees, Random Forest, and Neural Networks) trained on raw gene expression data against models incorporating the topological fingerprints. This comparative analysis was conducted across three classification tasks: cancer type (using TCGA-SARC, TCGA-PCPG, and TCGA-ESCA datasets), cancer staging (using TCGA-HNSC for stages I–IV), and treatment response (responders vs. non-responders).
For biomarker identification, the same three tasks were addressed using the best-performing co-expression measure to generate a global topological representation of the patient population. This provided a disease-level view, highlighting shared homological patterns to facilitate biomarker discovery. Additionally, a dedicated visualization tool has been developed to aid in interpreting these topological signatures and identifying critical biomarkers. The tool is available at https://nnyase.github.io/MSc-Thesis/.
WGTDA significantly enhanced phenotype prediction tasks by overcoming common pitfalls of traditional ML models on RNA-Seq data, such as overfitting and poor handling of class imbalance. TDA-derived features improved the generalizability of ML models in tasks such as cancer staging and treatment response prediction. Our findings strongly support the integration of TDA into clinical outcome prediction, demonstrating its value in capturing nuanced patterns that allow ML methods to learn more effectively. Moreover, WGTDA identified key gene signatures for cancer type, staging, and treatment response without relying on pre-existing biological assumptions, yielding biomarkers that are strongly supported by the existing literature. These results underscore the method's reliability and potential clinical utility in precision oncology.
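As a rough illustration of the idea behind the topological fingerprints described in the abstract, the sketch below computes a 0-dimensional persistence barcode (connected components only) from a Pearson-based gene-gene distance. It is not the WGTDA pipeline from the thesis: the distance `1 - |r|`, the function names `pearson` and `h0_barcode`, and the toy matrix are all assumptions made for this example, and loops and voids (higher-dimensional homology) would require a dedicated TDA library.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors.

    Assumes neither vector is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def h0_barcode(expr):
    """Finite death times of connected components (H0) across genes.

    expr: list of rows, one row of expression values per gene.
    Distance is assumed to be 1 - |r|. Every component is born at 0
    and dies at the edge length where it merges into another one.
    """
    n = len(expr)
    # All pairwise edges, sorted by increasing distance (the filtration).
    edges = sorted(
        (1.0 - abs(pearson(expr[i], expr[j])), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # the edge joins two components: one dies
            parent[ri] = rj
            deaths.append(dist)
    return deaths             # n - 1 finite bars; one component never dies


# Hypothetical toy data: 3 genes measured in 4 samples.
toy = [
    [1.0, 2.0, 3.0, 4.0],    # gene A
    [2.0, 4.0, 6.0, 8.0],    # gene B, perfectly correlated with A
    [1.0, -1.0, 1.0, -1.0],  # gene C, weakly anti-correlated with A and B
]
bars = h0_barcode(toy)
```

Here genes A and B merge almost immediately, while gene C joins much later; the resulting list of death times is one simple example of a topological summary that could be passed to a classifier such as a Random Forest.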
Reference:
Nyase, N. 2025. Exploring topological data analysis in gene expression data: topology-driven biomarker discovery and clinical outcome prediction in oncology. University of Cape Town, Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/42574