Data Science Research

DBlytics: Statistical analysis on data parallel frameworks

When processing large data sets, data movement is often a major bottleneck. Moving data across geographical locations for processing is expensive. In-database analytics (DBlytics) aims to build sophisticated analytic algorithms into data-parallel systems, such as relational database management systems (RDBMS) and massively parallel processing (MPP) systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching, and fault tolerance. Below we list the research subprojects that contribute to this effort.

MADlib

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, and machine-learning methods for structured and unstructured data. There are two significant motivations for in-database analytics tools such as MADlib. First, they harness the embarrassingly parallel processing power of parallel databases and turn the database into a data analytics engine capable of processing massive data sets. Second, they avoid the time cost of transferring large volumes of data between the database and outside tools. MADlib can be installed on PostgreSQL and Greenplum Database.
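The second motivation can be illustrated with a minimal sketch. It does not use MADlib's API; it uses Python's built-in `sqlite3` module as a stand-in database, and the table `points` and the function `fit_line_in_database` are hypothetical names chosen for this example. A simple linear regression is fitted from SQL aggregates alone, so only five numbers leave the database instead of every row.

```python
import sqlite3


def fit_line_in_database(conn):
    """Fit y = slope * x + intercept from SQL aggregates only.

    The sums are computed inside the database; a parallel engine could
    compute them per partition and add the partial results together.
    """
    n, sx, sy, sxy, sxx = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM points"
    ).fetchone()
    # Closed-form least-squares solution from the sufficient statistics.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: points that lie exactly on y = 2x + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(100)],
)

slope, intercept = fit_line_in_database(conn)
print(slope, intercept)
```

Because each aggregate is a plain sum, the same query shape would parallelize naturally across database segments, which is the property the first motivation relies on.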

Faculty: Daisy Zhe Wang
Students: Kun Li
Collaborators: Joseph M. Hellerstein (UC Berkeley), EMC/Greenplum, University of Wisconsin-Madison
Software: MADlib

MADden

MADden is a demonstration of in-database text analysis algorithms. The demonstration focuses on answering queries for sports journalism, in particular over NFL data sets, using Mad Lib-style queries. It made the following contributions:

  • Processing declarative ad hoc queries involving various statistical text analytic functions.
  • Joining, querying, and aggregating over multiple data sources that contain both structured and text information.
  • Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.
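
As a rough illustration of the first two contributions, the sketch below registers a text function as a user-defined function and calls it from a declarative query that joins structured and text tables. It is not MADden's implementation: it assumes Python's built-in `sqlite3` module as the database, and the schema, the sample rows, and the `term_count` function are all hypothetical.

```python
import sqlite3


def term_count(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    return text.lower().split().count(term.lower())


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name term_count.
conn.create_function("term_count", 2, term_count)

# Hypothetical structured table (players) and text table (articles).
conn.executescript("""
CREATE TABLE players (name TEXT, team TEXT);
CREATE TABLE articles (id INTEGER, body TEXT);
INSERT INTO players VALUES ('Smith', 'TeamA'), ('Jones', 'TeamB');
INSERT INTO articles VALUES
  (1, 'Smith threw two touchdowns and Smith ran for one'),
  (2, 'Jones was injured early'),
  (3, 'A quiet week for the league');
""")

# Ad hoc query: how often is each team's player mentioned across articles?
rows = conn.execute("""
SELECT p.team, SUM(term_count(a.body, p.name)) AS mentions
FROM players AS p CROSS JOIN articles AS a
GROUP BY p.team
ORDER BY mentions DESC
""").fetchall()

print(rows)  # [('TeamA', 2), ('TeamB', 1)]
```

The ranked result could then feed a query-time visualization such as a histogram or ranked list.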

Faculty: Daisy Zhe Wang
Students: Christan Grant, Jordan Gumbs, Kun Li
Collaborators: George Chitouras (EMC/Greenplum)

GPText

GPText is a system for large-scale text indexing, search, and ranking. It is a new system that integrates Greenplum DB, the MADlib analytic libraries, and the Apache Solr enterprise search platform. Combined with our MADlib algorithms, such as Conditional Random Field part-of-speech tagging, GPText is a large-scale, scalable text analytics engine. GPText adds a Solr instance to each Greenplum DB segment, and the database communicates with these instances over HTTP. Text searches are then parallelized across segments. Using UDFs, we can mix sophisticated search predicates, ranking, and database queries. In addition, we created an application that demonstrates the scalability of GPText and MADlib algorithms over the Enron corpus. This application displays results using Sankey diagrams for flow analysis and other advanced analytics.
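
The idea of parallelizing a search across segments can be sketched as a scatter-gather pattern: each segment returns its local top results, and the partial lists are merged into a global ranking. The sketch below is only an illustration under stated assumptions. It does not call Solr or Greenplum; segments are modeled as in-memory dictionaries, the score is a plain term frequency, and `search_segment` and `parallel_search` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_segment(segment, term, k):
    """Return the local top-k (score, doc_id) pairs for one segment."""
    term = term.lower()
    scored = (
        (body.lower().split().count(term), doc_id)
        for doc_id, body in segment.items()
    )
    # Keep only documents that actually contain the term.
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def parallel_search(segments, term, k=2):
    """Query every segment concurrently, then merge the partial rankings."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = pool.map(lambda seg: search_segment(seg, term, k), segments)
        return heapq.nlargest(k, (hit for part in partials for hit in part))


# Hypothetical documents spread over two segments.
segments = [
    {1: "budget meeting budget review", 2: "lunch plans"},
    {3: "budget forecast", 4: "quarterly budget budget budget numbers"},
]

result = parallel_search(segments, "budget", k=2)
print(result)  # [(3, 4), (2, 1)]
```

Each segment only ships its k best hits to the merge step, so the amount of data exchanged stays small regardless of how many documents a segment holds.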

Faculty: Daisy Zhe Wang
Students: Christan Grant, Kun Li
Collaborators: George Chitouras (EMC/Greenplum), Sunny Khatri (EMC/Greenplum)

Publications

  • MADlib
  • MADden
  • GPText
