We present ARCHIMEDES, a system for efficient query processing over probabilistic knowledge bases. ARCHIMEDES targets knowledge bases that contain incomplete and uncertain information, a consequence of the limitations of information sources and human knowledge. Answering queries over such knowledge bases requires efficient probabilistic inference. In this paper, we describe ARCHIMEDES's efficient knowledge expansion and query-driven inference over UDA-GIST, an in-database framework that unifies data-parallel and graph-parallel computation. With this efficient inference engine, ARCHIMEDES produces high-quality answers to queries over large uncertain knowledge bases. Using the Reverb-Sherlock and Wikilinks knowledge bases, we show that ARCHIMEDES achieves satisfactory quality with real-time performance.

# Probabilistic Knowledge Bases

A probabilistic knowledge base extends first-order knowledge bases to support uncertain facts and rules. The primary goal of modeling uncertainty is to represent knowledge mined by probabilistic information extraction algorithms, which produce uncertain facts and rules, as illustrated by the Reverb-Sherlock KB example in the table below:
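To make the data model concrete, the following is a minimal sketch of how a probabilistic KB can be represented: facts carry extraction confidences and rules carry learned weights. The entity and relation names below are illustrative placeholders, not entries from the actual Reverb-Sherlock KB.

```python
# Facts: (subject, relation, object) -> confidence from the extractor.
# Names here are illustrative, not from the real KB.
facts = {
    ("kale", "contains", "calcium"): 0.9,
    ("calcium", "prevents", "osteoporosis"): 0.8,
}

# Rules: (body patterns, head pattern) -> weight learned by the miner.
# Uppercase strings denote variables.
rules = [
    ((("X", "contains", "Y"), ("Y", "prevents", "Z")),
     ("X", "prevents", "Z"),
     1.5),
]

for body, head, weight in rules:
    print(f"weight={weight}: {body} => {head}")
```

Both facts and rules are uncertain; the weights are later combined by the factor-graph semantics described next.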

We view a probabilistic knowledge base as a template for constructing ground factor graphs, which encode a joint probability distribution over the facts, with the rules connecting them as factors. The factor graph for the above table is shown below in (b), with each fact explained in (c). Alternatively, the same knowledge can be represented as a knowledge graph in (a), with entities as nodes and relations as edges.

In a factor graph Φ = {φ1,…,φN}, the factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factor graph:

$$P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i \, n_i(x)\right)$$

where x is an assignment to the random vector X, n_i(x) is the number of true groundings of rule F_i in x, w_i is its weight, and Z is the partition function, i.e., the normalization constant.
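As a worked example of this distribution, the sketch below enumerates a toy model with two boolean facts f1, f2 and a single rule "f1 ⇒ f2" with weight w. It is illustrative only; n(x) counts the rule's true groundings and Z sums over all four assignments.

```python
import itertools
import math

w = 1.5  # rule weight

def n_true_groundings(f1, f2):
    # The rule "f1 => f2" has one grounding; it is false only when
    # f1 is true and f2 is false.
    return 0 if (f1 and not f2) else 1

# Partition function Z: sum of exp(w * n(x)) over all assignments x.
Z = sum(math.exp(w * n_true_groundings(f1, f2))
        for f1, f2 in itertools.product([False, True], repeat=2))

def prob(f1, f2):
    return math.exp(w * n_true_groundings(f1, f2)) / Z

total = sum(prob(f1, f2)
            for f1, f2 in itertools.product([False, True], repeat=2))
print(round(total, 6))  # the four probabilities sum to 1
```

Note that assignments violating the rule are not impossible, only exponentially less likely, which is what lets the model tolerate uncertain rules.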

ARCHIMEDES answers user queries by computing marginal probabilities of the query facts under the distribution defined above. This computation is called marginal inference in the probabilistic graphical models literature. Exact inference is tractable only for limited families of graphical models, so state-of-the-art MLN inference engines use sampling algorithms, including Markov chain Monte Carlo (MCMC) and MC-SAT. Observing that the ground factor graphs of real knowledge bases are large while user queries often touch only small parts of the knowledge graph, ARCHIMEDES employs a query-driven approach that focuses MCMC on the query nodes and avoids computation over the entire factor graph.

# System Architecture

To process queries efficiently, we design three key components of ARCHIMEDES: an inference engine for efficient knowledge expansion, which derives implicit knowledge from the existing KB; query-driven inference, which computes the probabilities of query facts; and the UDA-GIST framework for in-database data-parallel and graph-parallel analytics. We also provide a user interface for load, search, and update queries. The system architecture is shown in the figure below:

# Knowledge Expansion

To efficiently apply the inference rules, we represent a knowledge base as relational tables, a representation first introduced by ProbKB. Based on the relational model, we express the knowledge expansion algorithm as join queries between the facts and rules tables, one join for each rules table. Our experiments show that applying rules in batches yields a 200-300x speedup over state-of-the-art approaches.
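The batch join can be sketched in plain Python as follows. This is a simplified, illustrative stand-in for the actual in-database join queries: one hash join applies an entire rules table of the schema p(X,Y) ∧ q(Y,Z) ⇒ r(X,Z) in a single pass, and the data are made up.

```python
from collections import defaultdict

# Existing facts as (subject, relation, object) triples.
facts = {("kale", "contains", "calcium"),
         ("calcium", "prevents", "osteoporosis")}

# Rules table for the schema p(X,Y) AND q(Y,Z) => r(X,Z).
rules = [("contains", "prevents", "prevents")]

# Index facts by relation so rule application is a hash join,
# not a nested loop over all fact pairs.
by_rel = defaultdict(list)
for s, r, o in facts:
    by_rel[r].append((s, o))

new_facts = set()
for p, q, r in rules:            # one pass applies the whole rules table
    for x, y in by_rel[p]:
        for y2, z in by_rel[q]:
            if y == y2:          # join condition: shared variable Y
                new_facts.add((x, r, z))

new_facts -= facts               # keep only newly derived facts
print(new_facts)
```

Applying all rules of one schema in a single join, rather than grounding rules one at a time, is what the batch speedup in our experiments comes from.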

# Query-Driven Inference

ARCHIMEDES uses query-driven inference to speed up MLN inference algorithms by focusing computation on the query facts: a K-hop approximation restricts inference to the subgraph within K hops of the query nodes. The query-driven inference algorithm is built on the UDA-GIST analytics framework to achieve efficient inference inside a relational database system.
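The K-hop approximation can be sketched as a bounded breadth-first search from the query nodes. The adjacency list below is illustrative; in the real system the graph is the ground factor graph stored in relational tables.

```python
from collections import deque

# Illustrative adjacency list standing in for a ground factor graph.
graph = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q"],
    "c": ["a", "d"],
    "d": ["c"],
}

def k_hop(graph, query_nodes, k):
    """Return all nodes within k hops of the query nodes (bounded BFS)."""
    dist = {n: 0 for n in query_nodes}
    frontier = deque(query_nodes)
    while frontier:
        node = frontier.popleft()
        if dist[node] == k:      # do not expand past the hop limit
            continue
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                frontier.append(nbr)
    return set(dist)

print(k_hop(graph, ["q"], 1))  # {'q', 'a', 'b'}
print(k_hop(graph, ["q"], 2))  # {'q', 'a', 'b', 'c'}
```

Inference then runs only on the extracted subgraph, trading a bounded approximation error for a much smaller state.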

We use MCMC algorithms to compute the probabilities defined by the K-hop network. We optimize MCMC on the UDA-GIST in-database analytics framework: we build the factor graph with relational operations using User-Defined Aggregates (UDAs) and compute the probabilities of query facts by MCMC with the General Iterative State Transition (GIST) operator. The combined UDA-GIST framework extends relational database systems to support algorithms that require both data- and graph-parallel computation, including MCMC and MC-SAT. We describe the design of UDA-GIST below:
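For intuition about the MCMC step, the sketch below runs a Gibbs sampler on the toy two-variable model "f1 ⇒ f2" with weight w and estimates the marginal of f2, comparing against the exact answer obtained by enumeration. This is a didactic stand-in, not the production sampler.

```python
import math
import random

random.seed(0)
w = 1.5  # rule weight

def n_true(f1, f2):
    # True groundings of "f1 => f2" in state (f1, f2).
    return 0 if (f1 and not f2) else 1

def gibbs_marginal(var, steps=50_000):
    """Estimate P(var = True) by Gibbs sampling over (f1, f2)."""
    state = [False, False]
    hits = 0
    for _ in range(steps):
        for i in (0, 1):
            # Resample variable i conditioned on the other variable.
            state[i] = True
            p_true = math.exp(w * n_true(*state))
            state[i] = False
            p_false = math.exp(w * n_true(*state))
            state[i] = random.random() < p_true / (p_true + p_false)
        hits += state[var]
    return hits / steps

est = gibbs_marginal(1)  # marginal probability of f2

# Exact marginal by summing over the joint states, for comparison.
num = sum(math.exp(w * n_true(f1, True)) for f1 in (False, True))
den = sum(math.exp(w * n_true(f1, f2))
          for f1 in (False, True) for f2 in (False, True))
print(round(est, 3), round(num / den, 3))
```

On the K-hop network, the same resampling loop runs over many variables at once, which is where the graph-parallel GIST operator comes in.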

### UDA-GIST

Most major DBMSes support User-Defined Aggregates (UDAs) for parallel data analytics. UDAs are well suited to data-parallel workloads, where data are trivially partitioned and computation runs on the partitions in parallel. However, UDAs do not support efficient statistical inference algorithms that perform iterative transitions over a large graph-like state: data dependencies within the state prevent trivial partitioning, and DBMSes are fundamentally data-driven, with computation tied to the processing of tuples.

To meet the demand for in-database graph-parallel analytics, we propose the GIST abstraction, which generalizes the GraphLab API. We integrate the GIST operator into DBMSes using UDAs and User-Defined Functions. UDA-GIST unifies data-parallel computation (e.g., graph materialization) and graph-parallel computation (e.g., inference) in an integrated in-database analytics framework.
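The division of labor between the two operators can be sketched as follows. The class interfaces here are illustrative of the pattern, not the actual UDA-GIST API: the UDA materializes graph state from tuple partitions (data-parallel), and the GIST then iterates over that state (graph-parallel).

```python
class EdgeListUDA:
    """UDA pattern: accumulate tuples per partition, then merge partitions."""
    def __init__(self):
        self.edges = []
    def accumulate(self, src, dst):       # called once per input tuple
        self.edges.append((src, dst))
    def merge(self, other):               # combines parallel partial states
        self.edges.extend(other.edges)
    def terminate(self):                  # emits the materialized state
        return self.edges

class DegreeGIST:
    """GIST pattern: iterative computation over the materialized graph.
    A real GIST would run MCMC transitions; this one just computes degrees."""
    def __init__(self, edges):
        self.state = {}
        for s, d in edges:
            self.state[s] = self.state.get(s, 0) + 1
            self.state[d] = self.state.get(d, 0) + 1
    def run(self):
        return self.state

# Two partitions processed "in parallel", then merged and handed to GIST.
uda1, uda2 = EdgeListUDA(), EdgeListUDA()
uda1.accumulate("a", "b")
uda2.accumulate("b", "c")
uda1.merge(uda2)
result = DegreeGIST(uda1.terminate()).run()
print(result)  # {'a': 1, 'b': 2, 'c': 1}
```

The accumulate/merge/terminate shape mirrors how DBMSes already parallelize aggregates, which is why GIST integrates naturally on top of UDAs.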

# Experimental Results

### Knowledge Expansion

To evaluate knowledge expansion, we use Tuffy as the baseline. In the figures below, (a) and (b) compare the performance of ARCHIMEDES with Tuffy on a synthetic knowledge base with varying numbers of facts and rules.

We see that ARCHIMEDES achieves more than a 200x speedup over Tuffy at 10^7 facts. The speedup comes from the batch application of rules with join operations, supported by the relational knowledge base model. The precision of the inferred facts is shown in (c). We use semantic constraints and rule cleaning to improve the precision of the inferred facts. Combining these two methods, we infer 22,654 new facts at precision 0.65 using the top 50% of rules, and 16,394 new facts at precision 0.75 using the top 20% of rules.

### Query-Driven Inference

The figures below, (a)-(c), report runtime results for query-driven inference with K-hop approximation for different numbers of hops on large, medium, and small clusters. Query-driven inference achieves a speedup of more than one order of magnitude over computation on the entire factor graph, with acceptable error in the computed probabilities.

We evaluate the performance and scalability of UDA-GIST with cross-document coreference on the Wikilinks dataset, shown in (d). With our solution, coreference analysis over the entire Wikilinks dataset, 27 times larger than what the current state of the art can handle, completes within a manageable 10-hour computation on a single system.

For the complete version with references, please see the full paper.