Kun Li
This work was recently accepted and will appear in the Proceedings of the VLDB Endowment, Vol. 8, Issue 5, to be presented in Kona in September 2015. It is part of our continued effort to enable large-scale, advanced statistical machine learning analytics inside a DBMS/SQL engine.
—
With the recent boom in Big Data analytics, many applications require both large-scale data processing and advanced statistical methods. Connecting tools for data processing (e.g., DBMSes) and tools for large-scale machine learning (e.g., GraphLab) through system-to-system integration has severe limitations, including inefficient data movement between systems, impedance mismatch in data representation, and data privacy issues. In the database community, there is renewed interest in integrating statistical machine learning (SML) algorithms into DBMSes. Such integration allows both SQL-based data processing and statistical data analytics, providing a full spectrum of solutions for data analytics in an integrated system.
Most SML algorithms fall into two classes in terms of parallel execution. The first, well-studied class requires multiple iterations over the same data. Such SML methods include linear regression, k-means, and EM algorithms, which can be parallelized within each iteration using naive data partitioning; the overall algorithm is driven by an outer iteration loop. Parallel implementations of this class are supported in MADlib and Mahout, and most commercial databases support such data-parallel SML algorithms in the form of user-defined aggregates (UDAs), with the iterations driven by external scripting languages.
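To make the data-parallel pattern concrete, here is a minimal C++ sketch of one k-means step expressed through the classic UDA interface (Init/Accumulate/Merge/Terminate), with convergence handled by an external driver loop. This is our own illustration under assumed names and signatures, not code from the paper, MADlib, or any specific DBMS.

    // Sketch only: one k-means iteration as a UDA; all names are illustrative.
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct KMeansStepUDA {
        std::vector<std::vector<double>> centroids;  // current centroids, fixed for this iteration
        std::vector<std::vector<double>> sums;       // per-cluster running sums of assigned points
        std::vector<std::size_t> counts;             // per-cluster point counts

        void Init(const std::vector<std::vector<double>>& c) {
            centroids = c;
            sums.assign(c.size(), std::vector<double>(c.front().size(), 0.0));
            counts.assign(c.size(), 0);
        }

        void Accumulate(const std::vector<double>& point) {  // process one tuple
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t k = 0; k < centroids.size(); ++k) {
                double dist = 0.0;
                for (std::size_t d = 0; d < point.size(); ++d) {
                    double diff = point[d] - centroids[k][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = k; }
            }
            for (std::size_t d = 0; d < point.size(); ++d) sums[best][d] += point[d];
            ++counts[best];
        }

        void Merge(const KMeansStepUDA& other) {     // combine partial states from data partitions
            for (std::size_t k = 0; k < sums.size(); ++k) {
                for (std::size_t d = 0; d < sums[k].size(); ++d) sums[k][d] += other.sums[k][d];
                counts[k] += other.counts[k];
            }
        }

        std::vector<std::vector<double>> Terminate() const {  // emit the new centroids
            auto next = sums;
            for (std::size_t k = 0; k < next.size(); ++k)
                if (counts[k] > 0)
                    for (double& v : next[k]) v /= static_cast<double>(counts[k]);
            return next;
        }
    };

    // External driver (e.g., a scripting-language loop around the SQL aggregate):
    //   centroids = initial_guess
    //   repeat { centroids = AGGREGATE(KMeansStepUDA over all tuples) } until converged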
A second class of SML algorithms involves pre-processing the data and constructing a large state over all of it. The state space cannot be naively partitioned, because the random variables in the state are correlated with one another. After the state is built, these algorithms perform iterative transitions (e.g., sampling, random walks) over the state space until a global objective function converges. Such operations are computation-intensive and involve no data flow. After convergence is reached, the state needs to be post-processed and converted back into tabular data. We dub this class of SML algorithms state-parallel algorithms; the state can be a graph, a matrix, an array, or another customized data structure. Examples of this type of SML algorithm include MCMC and belief propagation algorithms.
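The state-parallel pattern can be sketched as three phases: build one large state from all input tuples, iterate transitions over it until a convergence criterion is met, and then convert the state back into tuples. The toy sketch below illustrates the shape of this workflow; the local update rule and all names are placeholders of ours, not an actual MCMC or belief propagation implementation.

    // Sketch only: the build / iterate-to-convergence / extract phases of a state-parallel algorithm.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Tuple { int var_id; double value; };

    struct State {                               // could be a factor graph, matrix, or array
        std::vector<double> vars;                // correlated variables: no naive partitioning

        void Add(const Tuple& t) {               // state construction from input tuples
            if (t.var_id >= static_cast<int>(vars.size())) vars.resize(t.var_id + 1, 0.0);
            vars[t.var_id] = t.value;
        }

        double TransitionSweep() {               // one sweep of local updates (stand-in for sampling)
            double delta = 0.0;
            for (std::size_t i = 1; i < vars.size(); ++i) {
                double updated = 0.5 * (vars[i] + vars[i - 1]);  // placeholder local transition
                delta += std::fabs(updated - vars[i]);
                vars[i] = updated;
            }
            return delta;                        // feeds the global convergence check
        }

        std::vector<Tuple> Extract() const {     // post-process the state back into tuples
            std::vector<Tuple> out;
            for (std::size_t i = 0; i < vars.size(); ++i)
                out.push_back({static_cast<int>(i), vars[i]});
            return out;
        }
    };

    std::vector<Tuple> RunStateParallel(const std::vector<Tuple>& input) {
        State s;
        for (const auto& t : input) s.Add(t);    // build the large state with all the data
        while (s.TransitionSweep() > 1e-6) {}    // iterate until the global change is negligible
        return s.Extract();                      // result tuples go back to the DBMS
    }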
Graph-parallel algorithms are a special type of state-parallel algorithm whose state is an immutable graph. While parallel DBMSes and MapReduce frameworks cannot efficiently express graph-parallel algorithms, other solutions exist, such as GraphLab and GraphX, both of which provide graph-based abstractions. These graph-parallel systems simplify the design and implementation of algorithms over sparse graphs through a high-level abstraction, but they miss the opportunity to use more efficient data structures to represent the state space of a complete/dense graph, a matrix, or a dynamic graph. For example, if the state is a matrix, representing it as a generalized graph can make state building orders of magnitude slower and significantly hamper inference due to the worse access patterns of a generalized graph. Moreover, GraphLab does not support data-parallel processing for state construction, post-processing, tuple extraction, and querying. GraphX, on the other hand, has a less efficient edge-centric graph representation.
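As a rough illustration of the access-pattern point, the sketch below contrasts a dense matrix state stored as a flat, row-major array (O(1), cache-friendly cell access) with the same matrix encoded as a generalized graph, where every cell access pays for hashing and pointer chasing. Both data structures are our own simplified stand-ins, not GraphLab's or GraphX's actual representations.

    // Sketch only: two ways to represent a matrix-shaped state.
    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    struct DenseMatrixState {                    // flat, contiguous storage: O(1) cell access
        std::size_t n = 0;
        std::vector<double> cells;               // row-major layout
        explicit DenseMatrixState(std::size_t dim) : n(dim), cells(dim * dim, 0.0) {}
        double& At(std::size_t i, std::size_t j) { return cells[i * n + j]; }
    };

    struct GraphEncodedMatrix {                  // one vertex per row, one edge per cell
        std::vector<std::unordered_map<std::size_t, double>> adj;
        explicit GraphEncodedMatrix(std::size_t dim) : adj(dim) {}
        double& At(std::size_t i, std::size_t j) { return adj[i][j]; }  // hashing + pointer chasing
    };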
No previous work can efficiently support both data-parallel and state-parallel processing in a single system, which is essential for many new applications that apply SML algorithms over large amounts of data. To support such advanced data analytics applications, the UDA-GIST framework developed in this work unifies data-parallel and state-parallel processing by extending existing database frameworks.
We introduce an abstraction called Generalized Iterative State Transition (GIST) that generalizes the GraphLab API. GIST requires the specification of an inference algorithm in the form of four abstract data types: 1) the GIST State, representing the state space; 2) the Task, encoding the state-transition work of each iteration; 3) the Scheduler, responsible for generating and scheduling tasks; and 4) the convergence UDA, which ensures that the stopping condition of the GIST operation is observed. We efficiently implement and integrate the GIST operator into a DBMS along with user-defined functions (UDFs) and UDAs. The efficient GIST implementation is achieved using the following techniques: 1) asynchronous parallelization of state transitions, 2) an efficient and flexible state implementation, 3) a lock-free scheduler, and 4) code generation. The key to an efficient integration of the non-relational GIST operator with a relational DBMS engine is to use UDAs to build large states from DBMS tuples and to post-process and extract the result tuples from GIST.
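As a rough sketch of how the four abstract data types could be expressed as C++ interfaces, consider the following. The type names mirror the paper's abstractions, but the exact signatures and the driver description in the trailing comment are our assumptions, not the published API.

    // Sketch only: illustrative interfaces for the four GIST abstractions.
    #include <memory>

    struct Task {                                   // one unit of state-transition work
        virtual ~Task() = default;
    };

    struct GISTState {                              // the shared state space (graph, matrix, ...)
        virtual void DoTask(Task& t) = 0;           // apply one state-transition task
        virtual ~GISTState() = default;
    };

    struct Scheduler {                              // generates and schedules tasks, ideally lock-free
        virtual std::unique_ptr<Task> NextTask() = 0;
        virtual bool HasWork() const = 0;
        virtual ~Scheduler() = default;
    };

    struct ConvergenceUDA {                         // aggregates per-task statistics into a stop decision
        virtual void Accumulate(const Task& finished) = 0;
        virtual void Merge(const ConvergenceUDA& other) = 0;
        virtual bool ShouldStop() const = 0;
        virtual ~ConvergenceUDA() = default;
    };

    // Assumed overall flow inside the DBMS: a UDA builds the GISTState from tuples,
    // worker threads pull Tasks from the Scheduler and apply them asynchronously,
    // the ConvergenceUDA decides when to stop, and a terminating UDF/UDA extracts
    // result tuples from the state for further SQL processing.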
The UDA-GIST framework can support a large class of advanced SML-based applications in which both data-driven computation and large-state transitions are required. We intend the framework to be general enough for most SML algorithms, with support for both data-parallel and state-parallel computation. Compared with GraphLab, the framework accepts additional implementation complexity in exchange for expressiveness and performance: while application developers may need to implement their own scheduler to handle synchronization and deadlock resolution, they gain the flexibility to specify their own state representation and parallel execution strategy, which, as our experiments show, can achieve orders-of-magnitude performance gains.