MADlib is a library of machine learning and statistics functions that integrates into a relational database. For example, you can store labelled training data in a relational database and run logistic regression over it like this:
SELECT madlib.logregr_train(
'patients', -- source table
'patients_logregr', -- output table
'second_attack', -- labels
'ARRAY[1, treatment, trait_anxiety]', -- features
NULL, -- grouping columns
20, -- max number of iterations
'irls' -- optimizer
);
MADlib programming splits conceptually into two kinds: macro-programming and micro-programming. Macro-programming deals with partitioning matrices across nodes, moving matrix partitions, and operating on matrices in parallel. Micro-programming deals with writing efficient code that operates on a single chunk of a matrix on one node.
MADlib leverages user-defined aggregates to operate on matrices in parallel. A user-defined aggregate over a set of type T, with intermediate state of type A and result of type B, comes in three pieces:
- A transition function A -> T -> A folds over the set.
- A merge function A -> A -> A merges intermediate aggregates.
- A final function A -> B translates the final aggregate.

Standard user-defined aggregates aren't sufficient to express a lot of machine learning algorithms. They suffer two main problems:
MADlib user-defined code calls into fast linear algebra libraries (e.g. Eigen) for dense linear algebra. MADlib implements its own sparse linear algebra library in C. MADlib also provides a C++ abstraction for writing low-level linear algebra code. Notably, it translates C++ types into database types and integrates nicely with libraries like Eigen.
Least squares regression can be computed in a single pass over the data. Logistic regression and k-means clustering are iterative, so they require a Python driver to manage multiple iterations.
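To make the difference between a single pass and driver-managed iteration concrete, here is a minimal sketch in plain Python. It is not MADlib code. The helper `run_aggregate` and every function name below are hypothetical, lists of rows stand in for table partitions on different nodes, and both examples are deliberately tiny (one feature for regression, one dimension for k-means). Least squares needs one aggregate pass built from a fold, a merge, and a final step, while k-means reruns an aggregate pass from a driver loop until the centroids stop moving.

```python
from functools import reduce

def run_aggregate(partitions, init, transition, merge, final):
    """Fold each partition on its own, merge the partial states, then finalize."""
    partials = [reduce(transition, rows, init) for rows in partitions]
    return final(reduce(merge, partials))

# --- Single pass: simple least squares, y = intercept + slope * x ---
# The state holds the sufficient statistics (n, sum x, sum y, sum x^2, sum x*y).
def ols_transition(state, row):
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def ols_merge(a, b):
    return tuple(u + v for u, v in zip(a, b))

def ols_final(state):
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# --- Multiple passes: 1-D k-means, one aggregate pass per iteration ---
def kmeans_step(partitions, centroids):
    """Assign every point to its nearest centroid and return the new means."""
    k = len(centroids)

    def transition(state, x):
        sums, counts = state
        j = min(range(k), key=lambda i: abs(x - centroids[i]))
        return (sums[:j] + (sums[j] + x,) + sums[j + 1:],
                counts[:j] + (counts[j] + 1,) + counts[j + 1:])

    def merge(a, b):
        return (tuple(u + v for u, v in zip(a[0], b[0])),
                tuple(u + v for u, v in zip(a[1], b[1])))

    def final(state):
        sums, counts = state
        # A centroid that attracted no points keeps its old position.
        return [s / c if c else old
                for s, c, old in zip(sums, counts, centroids)]

    init = ((0.0,) * k, (0,) * k)
    return run_aggregate(partitions, init, transition, merge, final)

def kmeans_driver(partitions, centroids, max_iter=20, tol=1e-9):
    """Driver loop: rerun the aggregate until the centroids stop moving."""
    for _ in range(max_iter):
        new = kmeans_step(partitions, centroids)
        if max(abs(a - b) for a, b in zip(new, centroids)) < tol:
            return new
        centroids = new
    return centroids

if __name__ == "__main__":
    xy = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]          # two "nodes"
    print(run_aggregate(xy, (0, 0, 0, 0, 0),
                        ols_transition, ols_merge, ols_final))  # (1.0, 2.0)
    points = [[1.0, 2.0, 10.0], [11.0, 3.0, 12.0]]
    print(kmeans_driver(points, [0.0, 5.0]))            # [2.0, 11.0]
```

The regression state is a fixed-size tuple of sums, so partial states from different partitions can be added in any order. The k-means transition depends on the current centroids, which is why something outside the aggregate has to carry them from one pass to the next.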
Wisconsin implemented stochastic gradient descent in MADlib. Berkeley and Florida implemented some statistical text analytics features including text feature expansion, approximate string matching, Viterbi inference, and MCMC inference (though I don't know what any of these are).
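Stochastic gradient descent, at least, can be phrased in the aggregate style described earlier. The sketch below is my own illustration under stated assumptions, not the Wisconsin implementation. It assumes the aggregate state is the model itself, each row takes one gradient step on a squared-error loss, and partial models from different partitions are combined by a weighted average. All function names are hypothetical.

```python
from functools import reduce

def sgd_epoch(partitions, weights, step):
    """One pass over the data: each row nudges the model, partial models are averaged."""

    def transition(state, row):
        w, n = state
        x, y = row
        err = sum(wi * xi for wi, xi in zip(w, x)) - y   # prediction error
        return (tuple(wi - step * err * xi for wi, xi in zip(w, x)), n + 1)

    def merge(a, b):
        (wa, na), (wb, nb) = a, b
        total = na + nb
        if total == 0:
            return a
        # Average the two models, weighted by how many rows each one saw.
        return (tuple((na * u + nb * v) / total for u, v in zip(wa, wb)), total)

    init = (tuple(weights), 0)
    partials = [reduce(transition, rows, init) for rows in partitions]
    return reduce(merge, partials)[0]

def sgd_driver(partitions, dim, step=0.05, epochs=2000):
    """Driver loop: run a fixed number of passes, feeding each model into the next."""
    w = (0.0,) * dim
    for _ in range(epochs):
        w = sgd_epoch(partitions, w, step)
    return w

if __name__ == "__main__":
    # Rows are (features, label) with a constant feature for the intercept;
    # the data follow y = 1 + 2x exactly.
    data = [[((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)],
            [((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]]
    print(sgd_driver(data, dim=2))
```

On this noise-free data the model should approach an intercept of 1 and a slope of 2. The step size and epoch count are arbitrary choices for the toy example.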