VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data
AbstractThe demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for “wide” datasets, current implementations such as Google’s PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest. This paper introduces CursedForest, a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. CursedForest is 9 and up to 89 times faster than Google’s PLANET and Yggdrasil, respectively, and is the first method capable of scaling to millions of features.