Yang Chen, Xiaofeng Zhou
Recent development in information extraction and data management systems arouses elevating efforts in constructing large knowledge bases (KBs). These knowledge bases store information in a structured format, facilitating efficient processing and querying. Examples of these knowledge bases include DBpedia, DeepDive, Freebase, Google Knowledge Graph, Knowledge Vault, NELL, OpenIE, ProBase, ProbKB, and YAGO. However, the extracted knowledge bases often contain incomplete and uncertain knowledge. The query results, consequently, contain incomplete information. To improve the quality of the queries, we design the ArchimedesOne knowledge base system to process user queries over such incomplete and uncertain knowledge bases.
System Overview
ArchimedesOne supports three types of queries: load, search, and update. These queries are supported by an efficient knowledge expansion algorithm and query-driven inference algorithm on UDA-GIST. In knowledge expansion, ArchimedesOne derives missing and implicit facts using first-order inference rules. Specifically, we model a knowledge base as relational tables. This relational model is first introduced by the ProbKB work (SIGMOD’14) and proves efficient in rule mining by applying inference rules in batches using join queries (SIGMOD’16). The main challenge with inference rules is that they have flexible structures. To adapt for their structures, we utilize structural equivalence to divide rules into equivalent classes so that each equivalent class has a fixed table format. For example, the following rules:
isBornInCountry(x, y) ← isBornInState(x, z), isAStateOf(z, y)
isBornInState(x, y) ← isBornInCity(x, z), isACityOf(z, y)
are structural equivalent because they have the same structure and argument order. Storing them in a single relational table, e.g.:
H(x, y) | B1(x, z) | B2(z, y)
------------------+-----------------+----------------
isBornInCountry | isBornInState | isAStateOf
isBornInState | isBornInCity | isACityOf
we are able to apply them in batches, as described in the SIGMOD’14 paper.
In query-driven inference, ArchimedesOne computes or updates marginal probabilities of query results. To achieve real-time response, we approximate inference by extracting K-hop sub-networks of the ground factor graph, consisting of nodes within K hops from the query nodes. The K-hop approximation is based on the observation that neighbors of the query nodes have more influence than distant nodes. To achieve real-time response, we use an additional network limit parameter to control the expansion of K-hop sub-networks as K increases. As a result, we achieve an 18 times of speedup compared to inference over the entire factor graph by choosing K = 2, with an acceptable error of 0.04 in probabilities.
We implement query-driven inference using the UDA-GIST in-database analytics framework (VLDB’15). UDA-GIST utilizes User Defined Aggregates (UDAs) and extends it with a new operator, General Iterative State Transition (GIST), that performs iterative transitions of computation over a large state. UDA-GIST combined extends relational database systems by allowing users to define UDAs and GISTs capable of expressing complex analytics algorithms including MCMC and MC-SAT.
User Interface
We design a web interface for ArchimedesOne as shown in Figure 1. Users can query NELL-Sports and Reverb-Sherlock KBs by providing triples of facts and rules or apply the example queries from the drop down menu at the bottom. ArchimedesOne runs knowledge expansion and query-driven inference to answer the queries. The current web interface supports load, search, and update queries.
ArchimedesOne returns the query result in a table with the (predicate, subject, object) triples and the computed probabilities. Each fact has an optional prior probability. As new facts and rules are added to the KB, the probabilities may be updated. Both prior and posterior probabilities are returned to the user for comparison, as shown in the figure. The facts may result from the knowledge base or first-order inference, which the user may find out by looking at the factor graph, as described below.
Visualization
ArchimedesOne explains the query results by providing a visualization of the factor graph. When users click on a row of the result, its K-hop network will be retrieved and visualized, as shown in Figure 2.
The visualization helps users understand the query result. For instance, to explain the result “Ken Kesey born in Colorado,” the user may examine its neighboring factors, each representing a ground first-order inference rule. The ground rules represent of the lineage of a particular fact, e.g., “bornIn(Ken Kesey, Colorado) ← bornIn(Ken Kesey, La Junta), isLocatedIn(La Junta, Colorado).” We use d3.js for interactive visualization of factor graphs.
To summarize, the ArchimedesOne system allows users to query incomplete and uncertain knowledge bases. The system is based on efficient knowledge expansion and query-driven inference algorithms, supported by the UDA-GIST designed for in-database data-parallel and state-parallel analytics.