
Browsing by Author "Buys, Jan"

Now showing 1 - 7 of 7
  • Item
    Open Access
    From GNNs to sparse transformers: graph-based architectures for multi-hop question answering
    (2023) Acton, Shane; Buys, Jan
    Multi-hop Question Answering (MHQA) is a challenging task in NLP which typically involves processing very long sequences of context information. Sparse Transformers [7] have surpassed Graph Neural Networks (GNNs) as the state-of-the-art architecture for MHQA. Noting that the Transformer [4] is a particular message-passing GNN, in this work we perform an architectural analysis and evaluation to investigate why the Transformer outperforms other GNNs on MHQA. In particular, we compare attention-based and non-attention-based GNNs, and compare the Transformer's Scaled Dot Product (SDP) attention to the Additive Attention [2] used by the Graph Attention Network (GAT) [5]. We simplify existing GNN-based MHQA models and leverage this system to compare GNN architectures in a lower-compute setting than token-level models. We evaluate all of our model variations on the challenging MHQA task Wikihop [6]. Our results support the superiority of the Transformer architecture as a GNN in MHQA. However, we find that problem-specific graph structuring rules can outperform the random connections used in Sparse Transformers. We demonstrate that the Transformer benefits greatly from its use of residual connections [3], Layer Normalisation [1], and element-wise feed-forward neural networks, and show that all tested GNNs benefit from these too. We find that SDP attention can achieve higher task performance than Additive Attention. Finally, we also show that utilising edge type information alleviates performance losses introduced by sparsity.
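The two scoring functions being compared can be made concrete. Below is a minimal sketch, not the authors' code, of the Transformer's scaled dot-product attention next to Bahdanau-style additive attention as used by GAT, for a single query node attending over its neighbours; shapes and parameter names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sdp_scores(q, K):
    """Scaled dot-product scores: q.k / sqrt(d), one score per neighbour."""
    return (K @ q) / q.size(-1) ** 0.5                   # (num_neighbours,)

def additive_scores(q, K, W_a, v_a):
    """Additive (Bahdanau-style) scores: v.tanh(W[q;k]) per neighbour."""
    joint = torch.cat([q.expand(K.size(0), -1), K], dim=-1)
    return torch.tanh(joint @ W_a) @ v_a                 # (num_neighbours,)

d = 16
q, K = torch.randn(d), torch.randn(5, d)                 # one query node, 5 neighbours
W_a, v_a = torch.randn(2 * d, d), torch.randn(d)
print(F.softmax(sdp_scores(q, K), dim=0))                # attention weights over neighbours
print(F.softmax(additive_scores(q, K, W_a, v_a), dim=0))
```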
  • Item
    Open Access
    Hospital readmission prediction with long clinical notes
    (2022) Nurmahomed, Yassin; Buys, Jan
    Electronic health record (EHR) data is captured across many healthcare institutions, resulting in large amounts of diverse information that can be analysed for diagnosis, prognosis, treatment and prevention of disease. One type of data captured by EHRs is clinical notes, which are unstructured data written in natural language. We can leverage Natural Language Processing (NLP) to build machine learning (ML) models to gain understanding from clinical notes that will enable us to predict clinical outcomes. ClinicalBERT is a pre-trained Transformer-based model which is trained on clinical text and is able to predict 30-day hospital readmission from clinical notes. Although its performance is good, it suffers from a limitation on the size of the text sequence that is fed as input to the model. Models using longer sequences have been shown to perform better on different ML tasks, even with clinical text. In this work, we evaluate an ML model called Longformer, which is pre-trained and then fine-tuned on clinical text and is able to learn from longer sequences than previous models. Performance is evaluated against Deep Averaging Network (DAN) and Long Short-Term Memory (LSTM) baselines and previous state-of-the-art models in terms of Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC) and Recall at Precision of 70% (RP70). Longformer is able to best ClinicalBERT on two performance metrics; however, it is not able to surpass one of the baselines on any of the metrics. Training the model on early notes did not result in a substantial difference compared to training on discharge summaries. Our analysis shows that the model suffers from out-of-vocabulary words, as many biomedical concepts are missing from the original pre-training corpus.
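As an illustration of the setup described above, the following is a minimal sketch (assumed usage, not the dissertation's code) of scoring one clinical note with a Longformer sequence classifier via the Hugging Face transformers library. The public allenai/longformer-base-4096 checkpoint is used; in practice the classification head must first be fine-tuned on labelled readmission data, which is omitted here.

```python
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

name = "allenai/longformer-base-4096"            # accepts inputs up to 4,096 tokens
tokenizer = LongformerTokenizer.from_pretrained(name)
# num_labels=2: readmitted within 30 days or not (head is untrained until fine-tuned)
model = LongformerForSequenceClassification.from_pretrained(name, num_labels=2)

note = "Discharge summary: patient admitted with ..."    # a long clinical note
inputs = tokenizer(note, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print("P(30-day readmission) =", torch.softmax(logits, dim=-1)[0, 1].item())
```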
  • Item
    Open Access
    Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution
    (Public Library of Science, 2011) Murrell, Ben; Weighill, Thomas; Buys, Jan; Ketteringham, Robert; Moola, Sasha; Benade, Gerdus; Buisson, Lise du; Kaliski, Daniel; Hands, Tristan; Scheffler, Konrad
    Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
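The factorisation idea can be sketched with scikit-learn's NMF standing in for the authors' estimation procedure: basis matrices are learned from a large stack of flattened 20x20 amino-acid matrices, and a new alignment's model is a non-negative weighted combination of those fixed bases. The random data below is a stand-in for real alignment-derived matrices.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_alignments, k_bases = 200, 5
X = rng.random((n_alignments, 20 * 20))    # stand-in for flattened per-alignment matrices

nmf = NMF(n_components=k_bases, init="nndsvd", max_iter=500)
W = nmf.fit_transform(X)                   # per-alignment non-negative weights
H = nmf.components_                        # k_bases flattened 20x20 basis matrices

# For a new, small alignment: keep the bases fixed and fit only the weights.
x_new = rng.random((1, 20 * 20))
w_new = nmf.transform(x_new)               # non-negative weights for this alignment
model_matrix = (w_new @ H).reshape(20, 20) # the alignment-specific model
```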
  • Item
    Open Access
    Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
    (2023) Pedlar, Victoria; Britz, Stefan; Buys, Jan
    Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, Top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering an alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages.
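Two of the decoding strategies compared above, temperature and nucleus (top-p) sampling, can be sketched in a few lines; the toy logits below stand in for a real model's next-token distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def temperature_sample(logits, t=0.7):
    """Rescale logits by 1/t (t < 1 sharpens the distribution), then sample."""
    z = logits / t
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

def nucleus_sample(logits, p=0.90):
    """Sample from the smallest set of tokens whose total probability exceeds p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # tokens by descending probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    trimmed = probs[keep] / probs[keep].sum()         # renormalise within the nucleus
    return keep[rng.choice(len(keep), p=trimmed)]

logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])         # toy 5-token vocabulary
print(temperature_sample(logits), nucleus_sample(logits))
```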
  • Item
    Open Access
    Self-supervised text sentiment transfer with rationale predictions and pretrained transformers
    (2022) Sinclair, Neil; Buys, Jan
    Sentiment transfer involves changing the sentiment of a sentence, such as from a positive to a negative sentiment, whilst maintaining the informational content. Whilst this challenge in the NLP research domain can be framed as a translation problem, traditional sequence-to-sequence translation methods are inadequate due to the dearth of parallel corpora for sentiment transfer. Thus, sentiment transfer can be posed as an unsupervised learning problem where a model must learn to transfer from one sentiment to another in the absence of parallel sentences. Given that the sentiment of a sentence is often defined by a limited number of sentiment-specific words within the sentence, this problem can also be posed as a problem of identifying and altering sentiment-specific words as a means of transferring from one sentiment to another. In this dissertation we use a novel method of sentiment word identification from the interpretability literature called the method of rationales. This method identifies the words or phrases in a sentence that explain the 'rationale' for a classifier's class prediction, in this case the sentiment of a sentence. This method is then compared against a baseline heuristic sentiment word identification method. We also experiment with a pretrained encoder-decoder Transformer model, known as BART, as a method for improving upon previous sentiment transfer results. This pretrained model is fine-tuned first in an unsupervised manner as a denoising autoencoder to reconstruct sentences where sentiment words have been masked out. This fine-tuned model then generates a parallel corpus which is used to further fine-tune the final stage of the model in a self-supervised manner. Results were compared against a baseline using automatic evaluations of accuracy and BLEU score as well as human evaluations of content preservation, sentiment accuracy and sentence fluency. The results of this dissertation show that both the neural-network-based and heuristic methods of sentiment word identification achieve similar results across models for similar levels of sentiment word removal for the Yelp dataset. However, the heuristic approach leads to improved results with the pretrained model on the Amazon dataset. We also find that using the pretrained Transformer model improves upon the results of using the baseline LSTM trained from scratch for the Yelp dataset for all automatic metrics. The pretrained BART model scores higher across all human-evaluated outputs for both datasets, which is likely due to its larger size and pretraining corpus. These results also show a similar trade-off between content preservation and sentiment transfer accuracy as in previous research, with more favourable results on the Yelp dataset relative to the baseline.
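A minimal sketch of the mask-then-fill step described above, assuming a toy heuristic sentiment lexicon and the public facebook/bart-base checkpoint. The dissertation first fine-tunes BART as a denoising autoencoder on sentiment-masked sentences, which is omitted here, so the untuned model will tend to reconstruct rather than transfer sentiment.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

sentiment_words = {"terrible", "awful", "rude"}           # toy heuristic lexicon
sentence = "the food was terrible and the staff were rude"
masked = " ".join(tokenizer.mask_token if w in sentiment_words else w
                  for w in sentence.split())              # "the food was <mask> and ..."

inputs = tokenizer(masked, return_tensors="pt")
out = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```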
  • Item
    Open Access
    Subword segmental neural language generation for Nguni languages
    (2025) Meyer, Francois Rolihlahla; Buys, Jan
    Deep learning models for text generation are now able to produce fluent and coherent text in many conversational settings. However, such models require large training datasets and are primarily designed for a limited number of high-resource languages. These advances are not directly applicable to low-resource languages with distinctive linguistic characteristics. In this thesis we develop text generation models for the Nguni languages of South Africa -- isiXhosa, isiZulu, isiNdebele, and Siswati. The Nguni languages are agglutinative and conjunctively written, so words are formed by stringing together morphemes. We design neural models that suit the morphological complexity of the Nguni languages by explicitly modelling the segmentation of words into subword units. We propose subword segmental modelling, a neural architecture and training algorithm that learns subword segmentation during training. The standard approach to subword modelling is to apply data-driven algorithms such as byte-pair encoding (BPE) during preprocessing. Subword segmental modelling represents a departure from this paradigm: instead of casting subword segmentation as a preprocessing step, we incorporate it into end-to-end learning to allow the model to discover the optimal subword units for a particular language and task. Explicitly modelling the complex subword structure of Nguni languages serves as an inductive bias for more efficient training on the typically limited training data. In this thesis we present subword segmental models for three natural language generation tasks. Our first model is for autoregressive language modelling. We propose the subword segmental language model (SSLM), a decoder-only model that learns subword segmentation to optimise its language modelling objective. SSLM achieves lower (better) perplexity-based intrinsic evaluation scores than tokenisation-based language models, on average across the four Nguni languages. We also evaluate SSLM as an unsupervised morphological segmenter, showing that its learned subwords are closer to morphemes than standard subword tokens. Since SSLM is our first instantiation of subword segmental modelling, we present a detailed analysis of the architectural components and hyperparameters we found to be influential during development. Our second model extends subword segmental modelling to neural machine translation (NMT). We propose subword segmental machine translation (SSMT), an encoder-decoder model that learns target language subword segmentation to optimise its sequence-to-sequence translation objective. To generate translations with SSMT, we propose dynamic decoding, a decoding algorithm for generating text with subword segmental architectures. SSMT outperforms tokenisation-based NMT on Nguni languages, achieving large gains in the extremely low-resource setting of English to Siswati translation. As with SSLM, we show that SSMT learns subword boundaries more aligned with morpheme boundaries than tokenisation-based subwords. SSMT also exhibits greater morphological compositional generalisation, the ability to generalise to novel combinations of known morphemes. We extend SSMT to multilingual translation, where it learns a single target-side subword segmentation scheme to optimise performance across multiple translation directions. We compare multilingual SSMT to multilingual tokenisation-based NMT. Multilingual SSMT does induce cross-lingual transfer, but to a lesser extent than multilingual tokenisation.
In cross-lingual finetuning experiments, SSMT improves transfer between unrelated languages. Our experiments confirm that decisions around subword segmentation greatly affect cross-lingual performance. We also show that differences in orthographic word boundary alignment between languages can impede cross-lingual transfer. Our third and final model combines subword segmental modelling with a copy mechanism, for the task of data-to-text generation. We propose the subword segmental pointer generator (SSPG), which jointly learns to segment words and copy subwords to optimise data-to-text generation. We also propose unmixed decoding, a text generation algorithm for copy-equipped subword segmental models. On isiXhosa data-to-text, SSPG outperforms tokenisation-based architectures trained from scratch. Besides reference-based evaluation, we develop an extractive evaluation framework to measure how faithfully models capture the expected data content of generations. This shows that SSPG more effectively combines entity copying and morphological composition. Across all three tasks, and for all four Nguni languages, subword segmental modelling consistently equals or outperforms equivalent tokenisation-based models. Its performance gains are greatest for extremely low-resource languages and tasks. Through linguistically informed evaluations, we show that subword segmental modelling successfully acquires particular aspects of Nguni-language morphology. Its subword units resemble morphemes more closely than subword tokens and it effectively applies morphological composition. Subword segmental modelling proves effective for the Nguni languages, offering a promising new approach to text generation for low-resource, morphologically complex languages.
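The objective at the core of subword segmental modelling is a marginal over all segmentations of a string, computable with a simple forward dynamic program. The sketch below, with a toy lexicon score standing in for the neural segment decoder described in the thesis, shows the recursion.

```python
import math

def logaddexp(a, b):
    if a == -math.inf:
        return b
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def log_marginal(chars, seg_logprob, max_seg_len=4):
    """Forward DP: alpha[j] = logsumexp over i of alpha[i] + log p(chars[i:j])."""
    alpha = [-math.inf] * (len(chars) + 1)
    alpha[0] = 0.0                       # the empty prefix has probability 1
    for j in range(1, len(chars) + 1):
        for i in range(max(0, j - max_seg_len), j):
            alpha[j] = logaddexp(alpha[j], alpha[i] + seg_logprob(chars[i:j]))
    return alpha[-1]                     # log of the sum over all segmentations

# Toy segment scores (a stand-in for the neural segment decoder); unknown
# segments fall back to a per-character penalty.
toy = {"ngi": -1.0, "ya": -1.2, "thanda": -1.5}
score = lambda seg: toy.get(seg, -6.0 * len(seg))
print(log_marginal("ngiyathanda", score, max_seg_len=6))   # isiZulu: ngi-ya-thanda
```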
  • Item
    Open Access
    Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
    (2025) Alexander, Natalie; Buys, Jan
    Hospitals store patient information in relational databases known as Electronic Health Records (EHRs). Existing EHRs have filter and search options on the front end that are converted to SQL queries at the back end. However, these search and filter options become cumbersome when querying the EHR. While users could write custom SQL queries to query the EHR directly, this approach requires database expertise. Recent advancements in medical question-answering leverage text-to-SQL parsing, which translates a user's natural language question into an executable SQL query, enabling information retrieval from a database. However, current medical text-to-SQL research only addresses a limited scope of questions, known as answerable questions. Questions that the system cannot reliably answer (unanswerable questions) result in inexecutable or incorrect SQL predictions which may return incorrect information that affects clinical decision-making. This limitation underscores the need for a medical text-to-SQL system that can reliably address both answerable and unanswerable questions. This project aims to expand the coverage of questions answered by medical text-to-SQL systems, by addressing unanswerable questions that are out-of-schema or require medical knowledge to simplify complex medical jargon. More specifically, we focus on addressing real-world unanswerable questions related to diagnoses and medication. This research first explores methods for addressing out-of-schema questions by assessing how incorporating an unseen schema, during inference, enhances the performance of a sequence-to-sequence (T5) text-to-SQL model. We then compare this approach to the effectiveness of fine-tuning the model on a training dataset that includes these out-of-schema questions and their corresponding schema. Secondly, this research examines how external medical knowledge sources, related to diagnoses and medication, can be used in data augmentation (either through retrieval-augmented generation or SQL post-processing) to improve the answerability of unanswerable questions with complex medical jargon. In addition, we ensure model reliability by applying answer abstention when the text-to-SQL model cannot reliably answer a question, while also ensuring that the model does not deteriorate the answerability of the original answerable questions. As a result of these experiments, we find that out-of-schema questions are addressed by fine-tuning a T5-Base model on a training dataset that includes out-of-schema question representations, excluding additional schema information. In addition, we find that fine-tuning a T5-Large model with retrieval-augmented generation, which incorporates medical knowledge from the SNOMED CT and RxNorm medical vocabularies, improves the model's ability to answer unanswerable questions with complex medical jargon. We also find that an entropy-based uncertainty estimation method, which uses K-means clustering to establish the abstention threshold, is suitable for answer abstention. Finally, we find that our proposed models do not compromise the answerability of the original answerable questions.
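The abstention step lends itself to a short sketch; the numbers and interfaces below are illustrative assumptions, not the project's code. Entropy scores for a calibration set of generated SQL queries are clustered with two-cluster K-means, and one plausible threshold is the midpoint between the two cluster centres: queries scoring above it are abstained on.

```python
import numpy as np
from sklearn.cluster import KMeans

def sequence_entropy(token_distributions):
    """Mean token-level entropy over a generated SQL query's output distributions."""
    eps = 1e-12
    return float(np.mean([-(p * np.log(p + eps)).sum() for p in token_distributions]))

# Toy entropy scores for a calibration set of generated queries.
scores = np.array([0.10, 0.15, 0.20, 1.40, 1.60, 1.80]).reshape(-1, 1)
centres = np.sort(KMeans(n_clusters=2, n_init=10, random_state=0)
                  .fit(scores).cluster_centers_.ravel())
threshold = centres.mean()                        # boundary between the two clusters

def answer_or_abstain(entropy_score):
    return "abstain" if entropy_score > threshold else "answer"

print(threshold, answer_or_abstain(0.3), answer_or_abstain(1.5))
```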