Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering

dc.contributor.advisor Buys, Jan
dc.contributor.author Alexander, Natalie
dc.date.accessioned 2025-11-10T09:49:58Z
dc.date.available 2025-11-10T09:49:58Z
dc.date.issued 2025
dc.date.updated 2025-11-10T09:45:15Z
dc.description.abstract Hospitals store patient information in relational databases known as Electronic Health Records (EHRs). Existing EHRs have filter and search options on the front end that are converted to SQL queries at the back end. However, these search and filter options become cumbersome when querying the EHR. While users could write custom SQL queries to query the EHR directly, this approach requires database expertise. Recent advancements in medical question-answering leverage text-to-SQL parsing, which translates a user's natural language question into an executable SQL query, enabling information retrieval from a database. However, current medical text-to-SQL research only addresses a limited scope of questions, known as answerable questions. Questions that the system cannot reliably answer (unanswerable questions) result in inexecutable or incorrect SQL predictions which may return incorrect information that affects clinical decision-making. This limitation underscores the need for a medical text-to-SQL system that can reliably address both answerable and unanswerable questions. This project aims to expand the coverage of questions answered by medical text-to-SQL systems, by addressing unanswerable questions that are out-of-schema or require medical knowledge to simplify complex medical jargon. More specifically, we focus on addressing real-world unanswerable questions related to diagnoses and medication. This research first explores methods for addressing out-of-schema questions by assessing how incorporating an unseen schema, during inference, enhances the performance of a sequence-to-sequence (T5) text-to-SQL model. We then compare this approach to the effectiveness of fine-tuning the model on a training dataset that includes these out-of-schema questions and their corresponding schema.
Secondly, this research examines how external medical knowledge sources, related to diagnoses and medication, can be used in data augmentation (either through retrieval-augmented generation or SQL post-processing) to improve the answerability of unanswerable questions with complex medical jargon. In addition, we ensure model reliability by applying answer abstention when the text-to-SQL model cannot reliably answer a question, while also ensuring that the model does not deteriorate the answerability of the original answerable questions. As a result of these experiments, we find that out-of-schema questions are addressed by fine-tuning a T5-Base model on a training dataset that includes out-of-schema question representations, excluding additional schema information. In addition, we find that fine-tuning a T5-Large model with retrieval-augmented generation, which incorporates medical knowledge from the SNOMED CT and RxNorm medical vocabularies, improves the model's ability to answer unanswerable questions with complex medical jargon. We also find that an entropy-based uncertainty estimation method, which uses K-means clustering to establish the abstention threshold, is suitable for answer abstention. Finally, we find that our proposed models do not compromise the answerability of the original answerable questions.
dc.identifier.apacitation Alexander, N. (2025). Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering. University of Cape Town, Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/42164 en_ZA
dc.identifier.chicagocitation Alexander, Natalie. "Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering." University of Cape Town, Faculty of Science, Department of Statistical Sciences, 2025. http://hdl.handle.net/11427/42164 en_ZA
dc.identifier.citation Alexander, N. 2025. Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering. University of Cape Town, Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/42164 en_ZA
dc.identifier.ris TY - Thesis / Dissertation AU - Alexander, Natalie AB - Hospitals store patient information in relational databases known as Electronic Health Records (EHRs). Existing EHRs have filter and search options on the front end that are converted to SQL queries at the back end. However, these search and filter options become cumbersome when querying the EHR. While users could write custom SQL queries to query the EHR directly, this approach requires database expertise. Recent advancements in medical question-answering leverage text-to-SQL parsing, which translates a user's natural language question into an executable SQL query, enabling information retrieval from a database. However, current medical text-to-SQL research only addresses a limited scope of questions, known as answerable questions. Questions that the system cannot reliably answer (unanswerable questions) result in inexecutable or incorrect SQL predictions which may return incorrect information that affects clinical decision-making. This limitation underscores the need for a medical text-to-SQL system that can reliably address both answerable and unanswerable questions. This project aims to expand the coverage of questions answered by medical text-to-SQL systems, by addressing unanswerable questions that are out-of-schema or require medical knowledge to simplify complex medical jargon. More specifically, we focus on addressing real-world unanswerable questions related to diagnoses and medication. This research first explores methods for addressing out-of-schema questions by assessing how incorporating an unseen schema, during inference, enhances the performance of a sequence-to-sequence (T5) text-to-SQL model. We then compare this approach to the effectiveness of fine-tuning the model on a training dataset that includes these out-of-schema questions and their corresponding schema.
Secondly, this research examines how external medical knowledge sources, related to diagnoses and medication, can be used in data augmentation (either through retrieval-augmented generation or SQL post-processing) to improve the answerability of unanswerable questions with complex medical jargon. In addition, we ensure model reliability by applying answer abstention when the text-to-SQL model cannot reliably answer a question, while also ensuring that the model does not deteriorate the answerability of the original answerable questions. As a result of these experiments, we find that out-of-schema questions are addressed by fine-tuning a T5-Base model on a training dataset that includes out-of-schema question representations, excluding additional schema information. In addition, we find that fine-tuning a T5-Large model with retrieval-augmented generation, which incorporates medical knowledge from the SNOMED CT and RxNorm medical vocabularies, improves the model's ability to answer unanswerable questions with complex medical jargon. We also find that an entropy-based uncertainty estimation method, which uses K-means clustering to establish the abstention threshold, is suitable for answer abstention. Finally, we find that our proposed models do not compromise the answerability of the original answerable questions. DA - 2025 DB - OpenUCT DP - University of Cape Town KW - Statistical Science LK - https://open.uct.ac.za PB - University of Cape Town PY - 2025 T1 - Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering TI - Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering UR - http://hdl.handle.net/11427/42164 ER - en_ZA
dc.identifier.uri http://hdl.handle.net/11427/42164
dc.identifier.vancouvercitation Alexander N. Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering. University of Cape Town, Faculty of Science, Department of Statistical Sciences, 2025 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/42164 en_ZA
dc.language.rfc3066 eng
dc.publisher.department Department of Statistical Sciences
dc.publisher.faculty Faculty of Science
dc.publisher.institution University of Cape Town
dc.subject Statistical Science
dc.title Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
dc.type Thesis / Dissertation
dc.type.qualificationlevel Masters
dc.type.qualificationlevel MSc
Files
Original bundle
Name:
thesis_sci_2025_alexander natalie.pdf
Size:
1.86 MB
Format:
Adobe Portable Document Format