A Machine Learning Model for Octane Number Prediction

Spencer, Victor

Thesis / Dissertation

2023

Abstract

Assessing the quality of gasoline blends in blending circuits is an important task in quality control. Gasoline quality, however, cannot be measured directly on a process stream. Therefore, a quality indicator which can be determined from the stream composition is required. Various quality indicators have been used in the existing body of literature, but the indicator in this study will be the Research Octane Number (RON). This is an indicator which measures the ignition of gasoline relative to pure octane (Abdul-Gani et al. 2018). Previous research has used empirical models in the form of phenomenological and machine learning models (González 2019). Phenomenological models have been used in the past as a way of encoding an engineer's thought process as a system of differential equations. Machine learning models are data-driven, with regression and deep learning methods being the primary prediction models used in the literature. This study aims to develop a parsimonious machine learning model which can be used to predict the RON from the molar composition of the gasoline product stream. Regression, ensemble learning and Artificial Neural Networks (ANN) will be used specifically in this study. The ensemble learning models which will be trained are Bayesian Additive Regression Trees (BART) and Gradient Boosting Machines (GBM). The raw data will be scraped from multiple journals online and the data frame will comprise the volume compositions of the reference compounds and the RON of each blend. The existing data frame will be extended to include the molar composition of the structural groups present in each of the blends. The structural groups, which may be referred to as functional groups, are specific substituents within molecules which may be responsible for the characteristic chemical reactions of the respective molecules.
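As an illustration of how a volume composition could be extended with molar and structural-group compositions, the following is a minimal Python sketch and not the thesis's actual procedure. It assumes ideal mixing (additive volumes), approximate handbook densities and molar masses near 20 °C, and a hypothetical group decomposition (CH3, CH2, CH, C, aCH, aC) for three example compounds; the abstract does not state which compounds or group definitions were used.

```python
# Assumed approximate pure-component properties near 20 °C
# (density in g/mL, molar mass in g/mol). Illustrative values only.
PROPS = {
    "n-heptane":  {"density": 0.684, "molar_mass": 100.20},
    "iso-octane": {"density": 0.692, "molar_mass": 114.23},
    "toluene":    {"density": 0.867, "molar_mass": 92.14},
}

# Hypothetical structural-group counts per molecule.
# "aCH" and "aC" denote aromatic carbons with and without a hydrogen.
GROUPS = {
    "n-heptane":  {"CH3": 2, "CH2": 5},
    "iso-octane": {"CH3": 5, "CH2": 1, "CH": 1, "C": 1},
    "toluene":    {"CH3": 1, "aCH": 5, "aC": 1},
}


def mole_fractions(volume_fractions):
    """Convert volume fractions to mole fractions, assuming ideal mixing."""
    moles = {
        name: v * PROPS[name]["density"] / PROPS[name]["molar_mass"]
        for name, v in volume_fractions.items()
    }
    total = sum(moles.values())
    return {name: n / total for name, n in moles.items()}


def group_fractions(mole_fracs):
    """Mole fraction of each structural group in the blend."""
    counts = {}
    for name, x in mole_fracs.items():
        for group, n in GROUPS[name].items():
            counts[group] = counts.get(group, 0.0) + x * n
    total = sum(counts.values())
    return {group: c / total for group, c in counts.items()}


# Example blend given as volume fractions (made-up numbers).
blend = {"n-heptane": 0.10, "iso-octane": 0.70, "toluene": 0.20}
x = mole_fractions(blend)
g = group_fractions(x)
for name, value in x.items():
    print(f"{name}: {value:.3f}")
for group, value in g.items():
    print(f"{group}: {value:.3f}")
```

Two blends with similar RON but different compounds would produce different group-fraction vectors, which is the extra layer of information the structural groups are meant to supply.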
This addition of structural groups adds a layer of information to differentiate between blends with different compound compositions but similar RON. It was hypothesised that the molar compositions of the additives and their substituent structural groups would rank highest and the molar composition of n-heptane would have the lowest ranking. For the Multiple Linear Regression (MLR) models, two cases were trained: one with interaction parameters and another without. Both of these cases were trained with and without the composition constraints on the compound compositions. For the ensemble learning case, a BART model with 200 trees and a GBM model with 1998 trees were trained. Four Single Layer Feed-forward Neural Network (SLFN) models were trained, with 3, 5, 10 and 15 nodes respectively. The choice of neural network architecture was made because the data frame was small, with only 12 input variables and 350 observations. Prior to training the models, an Exploratory Data Analysis was carried out to assess the potential for dimensionality reduction, correlations and outliers. The final regression model was the interaction model with a test MSE of 7.54 and an adjusted R² of 0.986. The BART model obtained a test MSE of 13.74 and an adjusted R² of 0.983. The GBM model had a test MSE of 38.12 and an adjusted R² of 0.917. Lastly, the best performing ANN was the 10 node SLFN, which obtained a test MSE of 11.26 and an adjusted R² of 0.969. For each model, a variable importance analysis was carried out and it was observed that the molar composition of n-heptane consistently ranked high in the variable importance. In addition to these predictive statistics, the parity plots, residual plots and Analysis of Variance (ANOVA) were analysed and taken into consideration in evaluating the performance of each of the models trained. It was concluded that the MLR model performed best, followed by the BART model. The ANN models ranked third and the GBM model ranked last.
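The two statistics used to compare the models, and the feature expansion behind an interaction model, can be sketched in plain Python as follows. This is a hedged illustration only: the abstract does not say which interaction terms were included, so the helper `with_interactions` assumes all pairwise products, and the RON values below are made up.

```python
from itertools import combinations


def with_interactions(row):
    """Append all pairwise products x_i * x_j (i < j) to a feature row."""
    return list(row) + [a * b for a, b in combinations(row, 2)]


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# With 12 inputs, all pairwise interactions add C(12, 2) = 66 terms,
# giving 78 predictors in total.
expanded = with_interactions([0.1] * 12)
print(len(expanded))

# Made-up observed and predicted RON values for a small hold-out set.
y_true = [90.0, 92.0, 95.0, 98.0, 100.0, 103.0]
y_pred = [91.0, 91.5, 95.5, 97.0, 100.5, 102.0]
print(mse(y_true, y_pred))
print(adjusted_r2(y_true, y_pred, n_predictors=2))
```

Because the adjusted R² penalises the number of predictors, a full pairwise interaction model fitted to 350 observations has to reduce the residual error substantially to outscore the simpler additive model.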
The hypothesis that the molar compositions of the additives and their substituent structural groups would rank highest and n-heptane would be the lowest ranking component was disproved, as the molar composition of n-heptane and its substituent structural groups consistently ranked high. The recommendation for this study is to train the models with a more representative data set in future and to use a hybrid model comprising a phenomenological model and a machine learning model, both for best results and to reduce the bias of the model in regions with few data points. The next step of the study is the integration of the new model into the plant-wide Advanced Process Control (APC).

Keywords

Statistical Sciences


Collections

Masters
