Fast Computation on Massive Data Sets
This chapter describes basic concepts and tools for tractably performing the computations described in the rest of this book. The need for fast algorithms for such analysis subroutines is becoming increasingly important as modern data sets are approaching billions of objects. With such data sets, even analysis operations whose computational cost is linearly proportional to the size of the data set present challenges, particularly since statistical analyses are inherently interactive processes, requiring that computations complete within some reasonable human attention span. For more sophisticated machine learning algorithms, the often worse-than-linear runtimes of straightforward implementations become quickly unbearable. The chapter looks at some techniques that can reduce such runtimes in a rigorous manner that does not sacrifice the accuracy of the analysis through unprincipled approximations. This is far more important than simply speeding up calculations: in practice, computational performance and statistical performance can be intimately linked. The ability of a researcher, within his or her effective time budget, to try more powerful models or to search parameter settings for each model in question, leads directly to better fits and predictions.