Automated feature synthesis on big data using cloud computing resources

Saker, Vanessa

dc.contributor.advisor	Berman, Sonia
dc.contributor.author	Saker, Vanessa
dc.date.accessioned	2020-12-30T10:17:56Z
dc.date.available	2020-12-30T10:17:56Z
dc.date.issued	2020
dc.description.abstract	The data analytics process has many time-consuming steps. Combining data that sits in a relational data warehouse into a single relation, while aggregating important information in a meaningful way and preserving relationships across relations, is complex and time-consuming. This step is exceptionally important because many machine learning approaches require a single file as input (e.g. supervised and unsupervised learning, feature representation and feature learning). An analyst is required to manually combine relations while generating new, more impactful information points from data during the feature synthesis phase of the feature engineering process that precedes machine learning. Furthermore, the entire process is complicated by Big Data factors such as processing power and distributed data storage. There is an open-source package, Featuretools, that uses an innovative algorithm called Deep Feature Synthesis to accelerate the feature engineering step. However, when working with Big Data, it has two major limitations. The first is the curse of modularity: Featuretools stores data in memory to process it, so large datasets require a processing unit with a large memory. The second is that the package depends on data being stored in a Pandas DataFrame, which makes it challenging to use Featuretools with Big Data tools such as Apache Spark. This dissertation aims to examine the viability and effectiveness of using Featuretools for feature synthesis with Big Data on the AWS cloud computing platform. Exploring the impact of generated features is a critical first step in solving any data analytics problem. If this can be automated in a distributed Big Data environment with a reasonable investment of time and funds, data analytics exercises will benefit considerably. In this dissertation, a framework for automated feature synthesis with Big Data is proposed and an experiment is conducted to examine its viability.
Using this framework, an infrastructure was built on AWS to support the process of feature synthesis, making use of S3 storage buckets, Elastic Compute Cloud (EC2) services, and an Elastic MapReduce (EMR) cluster. A dataset of 95 million customers, 34 thousand fraud cases and 5.5 million transactions across three different relations was then loaded into the distributed relational database on the platform. The infrastructure was used to show how the dataset could be prepared to represent a business problem, and Featuretools was used to generate a single feature matrix suitable for inclusion in a machine learning pipeline. The results show that the approach was viable. The feature matrix contained 75 features generated from 12 input variables, and the process was time-efficient, with a total end-to-end run time of 3.5 hours and a cost of approximately R 814 (approximately $52). The framework can be applied to a different set of data and allows analysts to experiment on a small section of the data until a final feature set is decided, and then to scale the feature matrix easily to the full dataset. This ability to automate feature synthesis, iterate and scale up will save time in the analytics process while providing a richer feature set for better machine learning results.
dc.identifier.apacitation	Saker, V. (2020). Automated feature synthesis on big data using cloud computing resources. (Master Thesis). University of Cape Town. Retrieved from http://hdl.handle.net/11427/32452	en_ZA
dc.identifier.chicagocitation	Saker, Vanessa. "Automated feature synthesis on big data using cloud computing resources." Master Thesis, University of Cape Town, 2020. http://hdl.handle.net/11427/32452	en_ZA
dc.identifier.citation	Saker, V. 2020. Automated feature synthesis on big data using cloud computing resources. Master Thesis. University of Cape Town. http://hdl.handle.net/11427/32452	en_ZA
dc.identifier.ris	TY - Master Thesis AU - Saker, Vanessa AB - The data analytics process has many time-consuming steps. Combining data that sits in a relational database warehouse into a single relation while aggregating important information in a meaningful way and preserving relationships across relations, is complex and time-consuming. This step is exceptionally important as many machine learning algorithms require a single file format as an input (e.g. supervised and unsupervised learning, feature representation and feature learning, etc.). An analyst is required to manually combine relations while generating new, more impactful information points from data during the feature synthesis phase of the feature engineering process that precedes machine learning. Furthermore, the entire process is complicated by Big Data factors such as processing power and distributed data storage. There is an open-source package, Featuretools, that uses an innovative algorithm called Deep Feature Synthesis to accelerate the feature engineering step. However, when working with Big Data, there are two major limitations. The first is the curse of modularity - Featuretools stores data in-memory to process it and thus, if data is large, it requires a processing unit with a large memory. Secondly, the package is dependent on data stored in a Pandas DataFrame. This makes the use of Featuretools with Big Data tools such as Apache Spark, a challenge. This dissertation aims to examine the viability and effectiveness of using Featuretools for feature synthesis with Big Data on the cloud computing platform, AWS. Exploring the impact of generated features is a critical first step in solving any data analytics problem. If this can be automated in a distributed Big Data environment with a reasonable investment of time and funds, data analytics exercises will benefit considerably. 
In this dissertation, a framework for automated feature synthesis with Big Data is proposed and an experiment conducted to examine its viability. Using this framework, an infrastructure was built to support the process of feature synthesis on AWS that made use of S3 storage buckets, Elastic Cloud Computing services, and an Elastic MapReduce cluster. A dataset of 95 million customers, 34 thousand fraud cases and 5.5 million transactions across three different relations was then loaded into the distributed relational database on the platform. The infrastructure was used to show how the dataset could be prepared to represent a business problem, and Featuretools used to generate a single feature matrix suitable for inclusion in a machine learning pipeline. The results show that the approach was viable. The feature matrix produced 75 features from 12 input variables and was time efficient with a total end-to-end run time of 3.5 hours and a cost of approximately R 814 (approximately $52). The framework can be applied to a different set of data and allows the analysts to experiment on a small section of the data until a final feature set is decided. They are able to easily scale the feature matrix to the full dataset. This ability to automate feature synthesis, iterate and scale up, will save time in the analytics process while providing a richer feature set for better machine learning results. DA - 2020 DB - OpenUCT DP - University of Cape Town LK - https://open.uct.ac.za PY - 2020 T1 - Automated feature synthesis on big data using cloud computing resources TI - Automated feature synthesis on big data using cloud computing resources UR - http://hdl.handle.net/11427/32452 ER -	en_ZA
dc.identifier.uri	http://hdl.handle.net/11427/32452
dc.identifier.vancouvercitation	Saker V. Automated feature synthesis on big data using cloud computing resources. [Master Thesis]. University of Cape Town, 2020 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/32452	en_ZA
dc.language.iso	eng
dc.publisher	University of Cape Town
dc.publisher.department	Department of Statistical Sciences
dc.publisher.faculty	Faculty of Science
dc.subject.other	Computer Science
dc.subject.other	Data Analytics
dc.subject.other	Cloud Computing
dc.subject.other	Big Data
dc.title	Automated feature synthesis on big data using cloud computing resources
dc.type	Master Thesis
dc.type.qualificationlevel	Masters
dc.type.qualificationname	MSc
uct.type.publication	Research
uct.type.resource	Master Thesis
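
The abstract describes combining several relations into a single feature matrix by aggregating child records up to a parent entity. The following is a minimal hand-written sketch of that idea using plain pandas; it is not the Featuretools API and not the thesis's implementation. The table names, column names, and values are hypothetical toy data, chosen only to mirror the three relations (customers, transactions, fraud cases) mentioned in the abstract.

```python
import pandas as pd

# Hypothetical toy relations; names and values are illustrative only.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [100.0, 50.0, 20.0, 30.0],
})
fraud_cases = pd.DataFrame({"case_id": [900], "customer_id": [2]})


def build_feature_matrix(customers, transactions, fraud_cases):
    """Aggregate two child relations up to one row per customer."""
    # Summarise each customer's transactions with a few aggregations.
    txn_features = (
        transactions.groupby("customer_id")["amount"]
        .agg(["count", "sum", "mean", "max"])
        .add_prefix("txn_amount_")
    )
    # Count fraud cases per customer.
    fraud_features = (
        fraud_cases.groupby("customer_id").size().rename("fraud_case_count")
    )
    # Left-join onto the parent relation so every customer keeps a row.
    matrix = (
        customers.set_index("customer_id")
        .join(txn_features)
        .join(fraud_features)
    )
    # Customers with no child records get a count of zero rather than NaN.
    count_cols = ["txn_amount_count", "fraud_case_count"]
    matrix[count_cols] = matrix[count_cols].fillna(0).astype(int)
    return matrix


feature_matrix = build_feature_matrix(customers, transactions, fraud_cases)
print(feature_matrix)
```

Writing these aggregations by hand for every relation and every variable is the manual step the abstract says Deep Feature Synthesis is meant to automate. The sketch also keeps everything in an in-memory pandas DataFrame, which is the limitation the abstract raises for Big Data.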

Files

Original bundle

Name: thesis_sci_2020_saker_vanessa.pdf
Size: 12.14 MB
Format: Adobe Portable Document Format

Collections

Masters