DBlytics: Statistical Analysis on Data-Parallel Frameworks
When processing large data sets, data movement is often a major bottleneck. Moving data across geographical locations for processing is expensive. In-database analytics (DBlytics) aims to build sophisticated analytic algorithms into data-parallel systems, such as relational database management systems (RDBMS) and massively parallel processing (MPP) systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching, and fault tolerance. Below we list the research subprojects that contribute to this effort.
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, and machine-learning methods for structured and unstructured data. There are two significant motivations for in-database analytics tools such as MADlib. First, they harness the embarrassingly parallel processing power of parallel databases, turning the database into a data analytics engine capable of processing massive data sets. Second, they avoid the time cost of transferring large volumes of data between databases and outside tools. MADlib can be installed on PostgreSQL and Greenplum Database.
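The second motivation, keeping the computation next to the data, can be illustrated with a minimal sketch. This is not MADlib and does not use its API. It assumes a hypothetical table `points(x, y)` in an in-memory SQLite database and fits a line by least squares using only SQL aggregates, so a single summary row leaves the database instead of every data row. The names `fit_line_in_database` and `demo` are invented for this illustration.

```python
import sqlite3


def fit_line_in_database(conn, table="points"):
    """Fit y = intercept + slope * x using SQL aggregates only.

    Assumption: `table` is a trusted, hypothetical table with numeric
    columns x and y. Only one row of sums is transferred to the client.
    """
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM {table}"
    ).fetchone()
    if n < 2:
        raise ValueError("need at least two rows to fit a line")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical")
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


def demo():
    """Build a small synthetic table (y = 3 + 2x) and fit it in the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?)",
        [(float(x), 3.0 + 2.0 * x) for x in range(10)],
    )
    return fit_line_in_database(conn)
```

In a parallel database the same aggregates could in principle be computed per segment and then combined, which is the general pattern data-parallel analytic libraries rely on. The sketch runs on a single SQLite connection only.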
MADden is a demonstration of in-database text analysis algorithms. The demonstration focuses on answering queries for sports journalism, in particular over NFL data sets, using Mad Lib style queries. It makes the following contributions:
- Processing declarative ad hoc queries involving various statistical text analytic functions.
- Joining, aggregating, and querying over multiple data sources that contain both structured and text information.
- Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.
GPText is a system for large-scale text indexing, search, and ranking. It is a new system that integrates Greenplum DB, the MADlib analytic libraries, and the Apache Solr enterprise search platform. Combined with our MADlib algorithms, such as conditional random field (CRF) part-of-speech tagging, GPText is a highly scalable text analytics engine. GPText adds a Solr instance to each Greenplum DB segment, and the database communicates with these instances over HTTP. Text searches are then parallelized across segments. Using user-defined functions (UDFs), we can mix sophisticated search predicates, ranking, and database queries. In addition, we created an application that demonstrates the scalability of GPText and the MADlib algorithms over the Enron corpus. This application displays results using Sankey diagrams for flow analysis and other advanced analytics.
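The idea of parallelizing a text search across segments can be sketched as a scatter-gather pattern. The following is an illustrative toy only, not GPText, Solr, or Greenplum code. Each "segment" is assumed to be a plain dictionary of documents, scoring is simple term frequency, and threads stand in for the per-segment search instances. The names `search_segment` and `parallel_search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_segment(segment_docs, term):
    """Score the documents of one segment by term frequency.

    Stand-in for a per-segment search index; returns (score, doc_id) pairs.
    """
    needle = term.lower()
    hits = []
    for doc_id, text in segment_docs.items():
        score = text.lower().split().count(needle)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def parallel_search(segments, term, k=3):
    """Scatter the query to every segment, then gather and rank the top k hits."""
    workers = max(1, len(segments))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_segment = list(pool.map(lambda seg: search_segment(seg, term), segments))
    all_hits = [hit for hits in per_segment for hit in hits]
    # Highest score first; ties are broken by ascending document id.
    return heapq.nsmallest(k, all_hits, key=lambda hit: (-hit[0], hit[1]))
```

For example, `parallel_search(segments, "energy", k=2)` searches every segment dictionary in `segments` independently and merges the partial results into one ranked list, which mirrors the general shape of a search that is distributed over segments and combined at the end.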