Designing and developing a robust automated log file analysis framework for debugging complex system failure

Master Thesis


As engineering and computer systems become larger and more complex, additional challenges around the development, management and maintenance of these systems materialize. While these systems afford greater flexibility and capability, debugging failures that occur during their operation has become more challenging. One such system is the MeerKAT Radio Telescope's Correlator Beamformer (CBF), the signal processing powerhouse of the radio telescope. The majority of software and hardware systems generate log files detailing system operation during runtime. These log files have long been the go-to source of information for engineers when debugging system failures. As these systems become increasingly complex, the log files generated have exploded in both volume and complexity, because log messages are recorded for all interacting parts of a system. Manually inspecting log files to debug system failures is no longer feasible. Recent studies have explored data-driven, automated log file analysis techniques that aim to address this challenge and have focused on two major aspects: log parsing, in which unstructured, free-form text log files are transformed into a structured dataset by extracting a set of event templates that describe the various log messages; and log file analysis, in which data-driven techniques are applied to this structured dataset to model the system behaviour and identify failures. Previous work has yet to address the combination of these two aspects to realize an end-to-end framework for performing automated log file analysis. The objective of this dissertation is to design and develop a robust, end-to-end Automated Log File Analysis Framework capable of analysing log files generated by the MeerKAT CBF to assist in system debugging. The Data Miner, the Inference Engine and the complete framework are the major subsystems developed in this dissertation.
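The log parsing step described above can be illustrated with a minimal sketch. It is not the parser used in the dissertation: it simply masks variable fields with placeholders so that messages sharing the same fixed text collapse into one event template. The masking rules, the function names `to_template` and `parse`, and the HDFS-style sample messages are all assumptions made for illustration.

```python
import re

# Hypothetical masking rules, applied in order. Real log parsers use
# more elaborate, data-driven heuristics to separate fixed and variable text.
MASKS = [
    (re.compile(r"blk_-?\d+"), "<BLK>"),                            # block identifiers
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4 address, optional port
    (re.compile(r"\b\d+\b"), "<NUM>"),                              # remaining integers
]


def to_template(message: str) -> str:
    """Replace the variable parts of one log message with placeholders."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def parse(lines):
    """Turn raw log lines into (template -> event id, list of event ids)."""
    templates = {}
    events = []
    for line in lines:
        template = to_template(line.strip())
        # Assign the next free id the first time a template is seen.
        event_id = templates.setdefault(template, len(templates))
        events.append(event_id)
    return templates, events


if __name__ == "__main__":
    sample = [
        "Received block blk_123 of size 67108864 from 10.0.0.1",
        "Received block blk_-456 of size 1024 from 10.0.0.2",
        "Deleting block blk_123",
    ]
    templates, events = parse(sample)
    for template, event_id in templates.items():
        print(event_id, template)
    print(events)
```

The resulting sequence of event ids is the kind of structured dataset that a downstream analysis step can model.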
State-of-the-art, data-driven approaches to log parsing were considered and the best performing approaches were incorporated into the Data Miner. The Inference Engine implements an LSTM-based multi-class classifier that models the system behaviour and uses this model to perform anomaly detection, identifying failures from log files. The complete framework links these two components together in a software pipeline capable of ingesting unstructured log files and outputting assistive system debugging information. The performance and operation of the framework and its subcomponents are evaluated for correctness on a publicly available, labelled dataset consisting of log files from the Hadoop Distributed File System (HDFS). Because no labelled dataset is available for the MeerKAT CBF, the applicability and usefulness of the framework in that context are subjectively evaluated through a case study. The framework is able to correctly model system behaviour from log files, but anomaly detection performance is greatly impacted by the nature and quality of the log files available for tuning and training the framework. When analysing log files, the framework is able to identify anomalous events quickly, even when large log files are considered. While the design of the framework primarily considered the MeerKAT CBF, a robust and generalisable end-to-end framework for automated log file analysis was ultimately developed.
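The idea of using a multi-class next-event classifier for anomaly detection can be sketched as follows. This is not the Inference Engine from the dissertation: a simple table of observed counts stands in for the LSTM, and the decision rule (flag an event when it is not among the `top_k` most likely next events) is an assumption, as are the names `train`, `find_anomalies`, `window` and `top_k`.

```python
from collections import Counter


def train(sequences, window=2):
    """Count which event id follows each window of preceding event ids.

    A count table is used here in place of an LSTM purely to keep the
    sketch dependency-free; it plays the same role of predicting the
    next event class from recent history.
    """
    model = {}
    for seq in sequences:
        for i in range(window, len(seq)):
            context = tuple(seq[i - window:i])
            model.setdefault(context, Counter())[seq[i]] += 1
    return model


def find_anomalies(model, seq, window=2, top_k=1):
    """Return positions whose event is not among the top_k predictions."""
    anomalies = []
    for i in range(window, len(seq)):
        context = tuple(seq[i - window:i])
        counts = model.get(context, Counter())
        predicted = [event for event, _ in counts.most_common(top_k)]
        # An unseen context yields no predictions, so the event is flagged.
        if seq[i] not in predicted:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    normal_runs = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
    model = train(normal_runs)
    print(find_anomalies(model, [0, 1, 2, 3]))
    print(find_anomalies(model, [0, 1, 3, 3]))
```

Because the model is trained only on sequences assumed to be normal, its usefulness depends heavily on the quality of the training logs, which matches the limitation noted in the abstract.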