Automated feature synthesis on big data using cloud computing resources

dc.contributor.advisor: Berman, Sonia
dc.contributor.author: Saker, Vanessa
dc.date.accessioned: 2020-12-30T10:17:56Z
dc.date.available: 2020-12-30T10:17:56Z
dc.date.issued: 2020
dc.description.abstract: The data analytics process has many time-consuming steps. Combining data that sits in a relational data warehouse into a single relation, while aggregating important information in a meaningful way and preserving relationships across relations, is complex and time-consuming. This step is exceptionally important because many machine learning algorithms require a single file as input (e.g. supervised and unsupervised learning, feature representation and feature learning, etc.). An analyst must manually combine relations while generating new, more informative data points during the feature synthesis phase of the feature engineering process that precedes machine learning. Furthermore, the entire process is complicated by Big Data factors such as processing power and distributed data storage. An open-source package, Featuretools, uses an innovative algorithm called Deep Feature Synthesis to accelerate the feature engineering step. However, it has two major limitations when working with Big Data. The first is the curse of modularity: Featuretools holds data in memory to process it, so large datasets require a processing unit with a large memory. The second is that the package depends on data being stored in a Pandas DataFrame, which makes it challenging to use Featuretools with Big Data tools such as Apache Spark. This dissertation examines the viability and effectiveness of using Featuretools for feature synthesis with Big Data on the cloud computing platform AWS. Exploring the impact of generated features is a critical first step in solving any data analytics problem. If this can be automated in a distributed Big Data environment with a reasonable investment of time and funds, data analytics exercises will benefit considerably. In this dissertation, a framework for automated feature synthesis with Big Data is proposed and an experiment is conducted to examine its viability.
Using this framework, an infrastructure was built to support the process of feature synthesis on AWS, making use of S3 storage buckets, Elastic Compute Cloud (EC2) services, and an Elastic MapReduce cluster. A dataset of 95 million customers, 34 thousand fraud cases and 5.5 million transactions across three different relations was then loaded into the distributed relational database on the platform. The infrastructure was used to show how the dataset could be prepared to represent a business problem, and Featuretools was used to generate a single feature matrix suitable for inclusion in a machine learning pipeline. The results show that the approach was viable. The feature matrix contained 75 features generated from 12 input variables, and the process was time-efficient, with a total end-to-end run time of 3.5 hours and a cost of approximately R 814 (approximately $52). The framework can be applied to a different set of data and allows analysts to experiment on a small section of the data until a final feature set is decided, after which they can easily scale the feature matrix to the full dataset. This ability to automate feature synthesis, iterate and scale up will save time in the analytics process while providing a richer feature set for better machine learning results.
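The core idea in the abstract, collapsing several related relations into one feature matrix by aggregating child rows onto their parent, can be illustrated with a minimal sketch. This is plain Python, not Featuretools and not the thesis code: the toy `customers` and `transactions` relations, the `synthesize_features` helper, and the choice of aggregations are all assumptions made for illustration, and the column names only loosely mimic the style of Deep Feature Synthesis output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy relations (not the thesis dataset).
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
    {"customer_id": 3, "region": "north"},
]
transactions = [
    {"transaction_id": 10, "customer_id": 1, "amount": 120.0},
    {"transaction_id": 11, "customer_id": 1, "amount": 80.0},
    {"transaction_id": 12, "customer_id": 2, "amount": 50.0},
]

# Assumed aggregation functions, applied to each parent's child values.
AGGREGATIONS = {"SUM": sum, "MEAN": mean, "MAX": max}


def synthesize_features(parents, children, child_name, key, value):
    """Return one row per parent with aggregates of the child relation."""
    # Group the child values by the foreign key that links them to a parent.
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[value])

    matrix = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        features = dict(parent)  # keep the parent's own columns
        features[f"COUNT({child_name})"] = len(values)
        for name, func in AGGREGATIONS.items():
            # Parents without child rows get None instead of raising an error.
            column = f"{name}({child_name}.{value})"
            features[column] = func(values) if values else None
        matrix.append(features)
    return matrix


feature_matrix = synthesize_features(
    customers, transactions, "transactions", "customer_id", "amount"
)
for row in feature_matrix:
    print(row)
```

Each output row corresponds to one customer, so the result is a single relation that a machine learning pipeline could consume. At the scale described in the abstract the child relation would not fit in a Python list, which is the in-memory limitation the dissertation sets out to address.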
dc.identifier.apacitation: Saker, V. (2020). Automated feature synthesis on big data using cloud computing resources. (Master Thesis). University of Cape Town. Retrieved from http://hdl.handle.net/11427/32452 [en_ZA]
dc.identifier.chicagocitation: Saker, Vanessa. "Automated feature synthesis on big data using cloud computing resources." Master Thesis., University of Cape Town, 2020. http://hdl.handle.net/11427/32452 [en_ZA]
dc.identifier.citation: Saker, V. 2020. Automated feature synthesis on big data using cloud computing resources. Master Thesis. University of Cape Town. http://hdl.handle.net/11427/32452 [en_ZA]
dc.identifier.ris: TY - Master Thesis AU - Saker, Vanessa DA - 2020 DB - OpenUCT DP - University of Cape Town LK - https://open.uct.ac.za PY - 2020 T1 - Automated feature synthesis on big data using cloud computing resources TI - Automated feature synthesis on big data using cloud computing resources UR - http://hdl.handle.net/11427/32452 ER - [en_ZA]
dc.identifier.uri: http://hdl.handle.net/11427/32452
dc.identifier.vancouvercitation: Saker V. Automated feature synthesis on big data using cloud computing resources. [Master Thesis]. University of Cape Town, 2020 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/32452 [en_ZA]
dc.language.iso: eng
dc.publisher: University of Cape Town
dc.publisher.department: Department of Statistical Sciences
dc.publisher.faculty: Faculty of Science
dc.subject.other: Computer Science
dc.subject.other: Data Analytics
dc.subject.other: Cloud Computing
dc.subject.other: Big Data
dc.title: Automated feature synthesis on big data using cloud computing resources
dc.type: Master Thesis
dc.type.qualificationlevel: Masters
dc.type.qualificationname: MSc
uct.type.publication: Research
uct.type.resource: Master Thesis
Files
Original bundle
Name: thesis_sci_2020_saker_vanessa.pdf
Size: 12.14 MB
Format: Adobe Portable Document Format