Small-scale distributed machine learning in R

Master's Thesis


Machine learning is increasing in popularity, both in applied and theoretical statistical fields. Machine learning models generally require large amounts of data to train and are thus computationally expensive, both in the absolute sense of actual compute time and in the relative sense of the numerical complexity of the underlying calculations. For students of machine learning in particular, appropriate computing power can be difficult to come by. Distributed machine learning, which involves sending tasks to a network of attached computers, can offer users access to significantly more computing power by leveraging the processors of many machines rather than a single one. This research outlines the core concepts of distributed computing and briefly surveys the more common approaches to parallel and distributed computing in R, with reference to the specific algorithms and aspects of machine learning that are investigated. One parallel backend, doRedis, offers particular advantages: it is easy to set up and implement, and it allows computers to be elastically attached to and detached from a distributed network. This thesis describes the core features of the doRedis package and shows, by applying it to certain aspects of the machine learning process, that it is both viable and beneficial to distribute these aspects. There is the potential for significant time savings when distributing machine learning model training. Particularly for students, the time required to set up a distributed doRedis network is far outweighed by the benefits. The implication that this research aims to explore is that students will be able to leverage the many computers often available in computer labs to train more complex machine learning models in less time than they otherwise could using the built-in parallel packages already common in R.
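The elastic setup described above can be illustrated with a minimal sketch. This is an assumed example, not taken from the thesis itself: it presumes a Redis server is reachable on its default host and port, and that the doRedis and foreach packages are installed. The queue name "jobs" is arbitrary.

```r
# Minimal doRedis sketch (assumes a running Redis server on localhost:6379).
library(doRedis)
library(foreach)

# Register a named work queue; any machine that starts workers against
# this queue name can join (or leave) the computation at any time.
registerDoRedis("jobs")

# Start two worker processes on this machine; other lab computers could
# call startLocalWorkers() against the same queue to attach elastically.
startLocalWorkers(n = 2, queue = "jobs")

# The foreach loop is now distributed across all attached workers.
results <- foreach(i = 1:10, .combine = c) %dopar% {
  sqrt(i)
}

# Clean up the queue when the job is finished.
removeQueue("jobs")
```

Because workers simply watch the Redis queue, detaching a computer mid-run does not break the computation; remaining workers pick up the outstanding tasks.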
In fact, certain machine learning packages that already parallelise model training can themselves be distributed across a network of computers, further increasing the gains realised by parallelisation. In this way, more complex machine learning becomes more accessible. This research outlines the benefits of distributing machine learning problems in an accessible, small-scale environment. This small-scale 'proof of concept' performs well enough to be viable for students, while also creating a bridge, and introducing the knowledge required, to deploy large-scale distribution of machine learning problems.
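As a hedged illustration of the point about packages that already parallelise training: caret's train() parallelises resampling internally via foreach, so registering doRedis as the foreach backend distributes that work across the network. This sketch is an assumption for illustration, not code from the thesis; it presumes a running Redis server and the caret, doRedis, and randomForest packages.

```r
# Sketch: distributing caret's internally parallelised training with doRedis.
library(doRedis)
library(caret)

registerDoRedis("ml-jobs")
startLocalWorkers(n = 2, queue = "ml-jobs")

# Cross-validated random forest on the built-in iris data; with
# allowParallel = TRUE, each resampling fold can be evaluated on a
# different worker attached to the "ml-jobs" queue.
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = trainControl(method = "cv", number = 5,
                                      allowParallel = TRUE))

removeQueue("ml-jobs")
```

The key design point is that caret is backend-agnostic: it only issues %dopar% calls, so swapping the registered backend from a single-machine one to doRedis requires no change to the modelling code.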