Daisy Zhe Wang
MADlib is an open-source library (licensed under the 2-clause BSD license) for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. The MADlib mission is to foster widespread development of scalable analytic skills by harnessing efforts from commercial practice, academic research, and open source development. MADlib grew out of discussions between database-engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a seminal paper in VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib® software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal).
I was part of the MADlib effort at Berkeley back in the day. After I joined the University of Florida, my group DSR@UF became one of the initial four contributors to MADlib, supporting statistical text analytics over unstructured data.
MADlib is a Big Data Machine Learning library in SQL for Data Scientists. MADlib is supported over multiple SQL engines, including Postgres, Pivotal Greenplum Database, and Pivotal HAWQ. Together with Apache HAWQ, the MADlib open source project has transitioned its development and governance models to be in accordance with “The Apache Way”. MADlib@ASF is now an Apache Software Foundation incubator project!
The goal of MADlib is to provide an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. Eventually, MADlib should serve a role for scalable database systems similar to that of the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. Our VLDB 2012 paper introduces the MADlib project, including the background that led to its beginnings and the motivation for its open source nature. It gives an overview of the library’s architecture and design patterns and describes various statistical methods in that context. It includes performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster, and then reports on two initial efforts at incorporating academic research into MADlib, which is one of the project’s goals. MADlib@ASF is freely available, and the project is open for contributions.
The focus of the MADlib work at Florida, together with the effort at UC Berkeley, has been to integrate statistical text analytics into a DBMS. In many domains, structured data and unstructured text are both important assets for data analysis. The increasing use of text analysis in enterprise applications has raised customer expectations and created new opportunities for processing big data. State-of-the-art text analysis tools are based on statistical models and algorithms. Because MADlib aims to become a framework for statistical methods for data analysis at scale, it is important for it to include the basic statistical methods needed to implement text analysis tasks. Basic text analysis tasks include part-of-speech (POS) tagging, named entity extraction (NER), and entity resolution (ER). Different statistical models and algorithms are implemented for each of these tasks, with different runtime-accuracy tradeoffs.
Pushing Statistical Text Analysis into MADlib. Based on the MADlib framework, our group set out to implement statistical methods in SQL to support various text analysis tasks. We use conditional random fields (CRFs) as the basic statistical model to perform more advanced text analysis. Like hidden Markov models (HMMs), CRFs are a leading probabilistic model for solving many text analysis tasks, including POS tagging, NER and ER. To support sophisticated text analysis, we implement four key methods: text feature extraction, most-likely inference over a CRF (Viterbi), MCMC inference, and approximate string matching.
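To make the most-likely inference step concrete, here is a minimal sketch of Viterbi decoding over a linear-chain model, written in plain Python rather than in SQL. It is not MADlib's implementation: the function name `viterbi`, the score tables `EMISSION` and `TRANSITION`, and the toy weights are all hypothetical, and a real CRF would compute these scores as weighted sums of extracted text features instead of looking them up in two small tables.

```python
from typing import Dict, List, Tuple

Score = Dict[Tuple[str, str], float]


def viterbi(tokens: List[str], labels: List[str],
            emission: Score, transition: Score,
            default: float = -5.0) -> List[str]:
    """Return the highest-scoring label sequence for `tokens`.

    Scores are additive (log-space) potentials; any (token, label) or
    (previous label, label) pair missing from the tables gets `default`.
    """
    if not tokens:
        return []
    # best[i][y]: best score of any labeling of tokens[:i+1] that ends in y
    best = [{y: emission.get((tokens[0], y), default) for y in labels}]
    back: List[Dict[str, str]] = []  # back[i][y]: best previous label for y
    for tok in tokens[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Pick the previous label that maximizes score so far + transition.
            prev = max(labels,
                       key=lambda p: best[-1][p] + transition.get((p, y), default))
            scores[y] = (best[-1][prev]
                         + transition.get((prev, y), default)
                         + emission.get((tok, y), default))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label to recover the path.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy weights for a three-label POS tagger.
LABELS = ["DET", "NOUN", "VERB"]
EMISSION = {("the", "DET"): 2.0, ("dog", "NOUN"): 2.0,
            ("barks", "VERB"): 2.0, ("barks", "NOUN"): 1.0}
TRANSITION = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0,
              ("NOUN", "NOUN"): 0.0}

print(viterbi(["the", "dog", "barks"], LABELS, EMISSION, TRANSITION))
# ['DET', 'NOUN', 'VERB']
```

The dynamic program keeps one best score per label per position, so its cost grows linearly with sentence length and quadratically with the number of labels. That per-position, per-label table is also what makes the computation a natural fit for a relational representation.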
With MADlib now at the ASF, I am re-energized to continue contributing to the MADlib project in the following directions:
- Develop applications using MADlib in my Introduction to Data Science course at UF CISE
- Further develop statistical methods for statistical text analytics, large-scale probabilistic graphical inference (e.g., MCMC) and more
- Foster community around MADlib@ASF both with students and with professional data scientists
- Enable MADlib over other SQL engines such as Impala