Title:
How to do Machine Learning on Massive Astronomical Datasets
Abstract:
I will describe algorithms and data structures that allow the most
powerful machine learning methods, which often scale quadratically or
even cubically with the number of data points, to run many orders of
magnitude faster than naive implementations. Such
techniques can make previously impossible statistical analyses
tractable on the scale of entire sky surveys. I will discuss
scalable algorithms we have developed for n-point correlation
functions, friends-of-friends clustering, nearest-neighbor search,
kernel density estimation,
nonparametric Bayes classification, principal component analysis,
local linear regression, isometric non-negative matrix factorization,
hidden Markov models, k-means, support vector machine-like
classifiers, Gaussian process regression, and Gaussian graphical
model inference, among others. In addition to techniques inspired by
computational geometry, fast multipole methods, and Monte Carlo
integration, we employ a distributed framework that can be thought
of as a higher-order version of Google's MapReduce. Our algorithms
have enabled several first-of-a-kind large-scale cosmological
analyses.
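
To make the contrast between naive quadratic scaling and a geometry-based
speedup concrete, here is a minimal sketch of pair counting within a
radius, the basic building block of a two-point correlation estimate. It
is not the algorithm presented in the talk: it is a simple single-tree
kd-tree range count written for illustration, and every name in it
(`KDNode`, `count_within`, `pair_count_tree`, `pair_count_naive`,
`leaf_size`) is hypothetical.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) * (a - b) for a, b in zip(p, q))


def pair_count_naive(points, r):
    """O(N^2) baseline: test every unordered pair."""
    r2 = r * r
    n = len(points)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if sqdist(points[i], points[j]) <= r2)


class KDNode:
    """kd-tree node storing its bounding box and point count (illustrative)."""

    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.count = len(points)
        self.points = None
        self.left = self.right = None
        if len(points) <= leaf_size:
            self.points = points          # leaf: keep the points themselves
        else:
            axis = depth % dim            # cycle through the split axes
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.left = KDNode(pts[:mid], leaf_size, depth + 1)
            self.right = KDNode(pts[mid:], leaf_size, depth + 1)


def count_within(node, q, r2):
    """Count points in `node` whose squared distance to q is <= r2."""
    dmin = dmax = 0.0
    for x, lo, hi in zip(q, node.lo, node.hi):
        nearest = max(lo - x, 0.0, x - hi)   # gap to the box along this axis
        farthest = max(x - lo, hi - x)       # far side of the box
        dmin += nearest * nearest
        dmax += farthest * farthest
    if dmin > r2:
        return 0                  # box entirely out of range: prune
    if dmax <= r2:
        return node.count         # box entirely in range: count wholesale
    if node.points is not None:
        return sum(1 for p in node.points if sqdist(p, q) <= r2)
    return count_within(node.left, q, r2) + count_within(node.right, q, r2)


def pair_count_tree(points, r):
    """Count unordered pairs within r using one tree query per point."""
    root = KDNode(list(points))
    r2 = r * r
    total = sum(count_within(root, q, r2) for q in points)
    return (total - len(points)) // 2  # drop self-matches, halve ordered pairs


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(2000)]
    print(pair_count_naive(pts, 0.05), pair_count_tree(pts, 0.05))
```

The two functions return the same count; the tree version skips or
accepts whole bounding boxes at once instead of testing every pair, which
is the kind of pruning that the computational-geometry techniques
mentioned above build on.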