Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering

Alexander, Natalie

Thesis / Dissertation

2025

Publisher

University of Cape Town

Department

Department of Statistical Sciences

Faculty

Faculty of Science

Abstract

Hospitals store patient information in relational databases known as Electronic Health Records (EHRs). Existing EHRs have filter and search options on the front end that are converted to SQL queries at the back end. However, these search and filter options become cumbersome when querying the EHR. While users could write custom SQL queries to query the EHR directly, this approach requires database expertise. Recent advancements in medical question-answering leverage text-to-SQL parsing, which translates a user's natural language question into an executable SQL query, enabling information retrieval from a database. However, current medical text-to-SQL research only addresses a limited scope of questions, known as answerable questions. Questions that the system cannot reliably answer (unanswerable questions) result in inexecutable or incorrect SQL predictions, which may return incorrect information that affects clinical decision-making. This limitation underscores the need for a medical text-to-SQL system that can reliably address both answerable and unanswerable questions. This project aims to expand the coverage of questions answered by medical text-to-SQL systems by addressing unanswerable questions that are out-of-schema or require medical knowledge to simplify complex medical jargon. More specifically, we focus on addressing real-world unanswerable questions related to diagnoses and medication. This research first explores methods for addressing out-of-schema questions by assessing how incorporating an unseen schema during inference enhances the performance of a sequence-to-sequence (T5) text-to-SQL model. We then compare this approach to the effectiveness of fine-tuning the model on a training dataset that includes these out-of-schema questions and their corresponding schema.
Secondly, this research examines how external medical knowledge sources, related to diagnoses and medication, can be used in data augmentation (either through retrieval-augmented generation or SQL post-processing) to improve the answerability of unanswerable questions with complex medical jargon. In addition, we ensure model reliability by applying answer abstention when the text-to-SQL model cannot reliably answer a question, while also ensuring that the model does not deteriorate the answerability of the original answerable questions. As a result of these experiments, we find that out-of-schema questions are addressed by fine-tuning a T5-Base model on a training dataset that includes out-of-schema question representations, excluding additional schema information. In addition, we find that fine-tuning a T5-Large model with retrieval-augmented generation, which incorporates medical knowledge from the SNOMED CT and RxNorm medical vocabularies, improves the model's ability to answer unanswerable questions with complex medical jargon. We also find that an entropy-based uncertainty estimation method, which uses K-means clustering to establish the abstention threshold, is suitable for answer abstention. Finally, we find that our proposed models do not compromise the answerability of the original answerable questions.
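
The entropy-based abstention idea mentioned above can be illustrated with a minimal sketch. The abstract does not give the thesis's exact procedure, so everything here is an assumption made for illustration: the uncertainty score is taken to be the mean Shannon entropy of the decoder's per-token probability distributions, K-means is run in one dimension with two clusters, and the threshold is placed at the midpoint between the two centroids. The function names (`mean_token_entropy`, `two_means_threshold`, `should_abstain`) are hypothetical and do not come from the thesis.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions.

    Assumption: each inner list is one decoding step's probabilities over the
    vocabulary, and the list of steps is non-empty.
    """
    total = 0.0
    for dist in token_distributions:
        total -= sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def two_means_threshold(scores, iterations=100):
    """1-D K-means with k=2 on a non-empty list of entropy scores.

    Centroids start at the minimum and maximum score. The returned threshold
    is the midpoint between the two final centroids (an illustrative choice).
    """
    low, high = min(scores), max(scores)
    for _ in range(iterations):
        # Assign each score to its nearest centroid (ties go to the low cluster).
        low_cluster = [s for s in scores if abs(s - low) <= abs(s - high)]
        high_cluster = [s for s in scores if abs(s - low) > abs(s - high)]
        new_low = sum(low_cluster) / len(low_cluster)
        # The high cluster is empty only when every score is identical.
        new_high = sum(high_cluster) / len(high_cluster) if high_cluster else high
        if new_low == low and new_high == high:
            break
        low, high = new_low, new_high
    return (low + high) / 2.0


def should_abstain(token_distributions, threshold):
    """Abstain when the prediction's mean token entropy exceeds the threshold."""
    return mean_token_entropy(token_distributions) > threshold
```

In such a setup, the threshold would be fitted once on entropy scores from a validation set containing both answerable and unanswerable questions, and then applied to each new SQL prediction: low-entropy predictions are executed, while high-entropy ones are withheld.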

Keywords

Statistical Science

Collections

Masters
