Miguel Rodriguez
The amount of information available on the web has motivated a number of efforts to create large-scale knowledge bases (KBs), each with its own method of automatically extracting relevant information from unstructured text. Despite sharing the same data model, each project is unique, with its own strengths and weaknesses in ontology size, factual completeness, extraction method, accuracy, and domain coverage. SigmaKB is a probabilistic fusion system that can incorporate multiple knowledge bases into a single, cohesive master KB.
System Overview
SigmaKB shares the goals of data integration systems: improving the ability to answer complex queries over multiple data sources in uncertain environments. Rather than integrating all data sources into a single, monolithic KB, we remain modular, querying each KB individually and fusing the results on the fly. Aggregation across individual KBs is handled by Consensus Maximization Fusion, our previous work (NAACL16), which leverages complementary and conflicting data values to give the user a probabilistic interpretation of the results.
The key feature of SigmaKB compared to other data integration systems is its probabilistic knowledge fusion component. Rather than simply taking the union of results from all individual KBs, SigmaKB contains a reasoning component that combines duplicate and conflicting entries into a single, cohesive response returned to the user. SigmaKB unifies two major knowledge bases, YAGO and NELL, along with 69 smaller KBs from the 2015 Slot Filler Validation (SFV) evaluation.
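To make the difference between a plain union and a fused response concrete, here is a minimal Python sketch. It is not Consensus Maximization Fusion: it uses a normalized weighted vote as a simple stand-in, and the function name `fuse_answers`, the input layout, and the per-KB weights are all assumptions made for illustration.

```python
from collections import defaultdict


def fuse_answers(results_by_kb, kb_weights=None):
    """Merge per-KB answers into one ranked list with provenance.

    results_by_kb: {kb_name: [(answer, confidence in [0, 1]), ...]}
    kb_weights:    optional {kb_name: weight}; defaults to 1.0 per KB.

    NOTE: a normalized weighted vote, used as a stand-in for the
    fusion step. It is NOT Consensus Maximization Fusion.
    """
    kb_weights = kb_weights or {}
    score = defaultdict(float)
    sources = defaultdict(set)
    for kb, answers in results_by_kb.items():
        weight = kb_weights.get(kb, 1.0)
        for answer, confidence in answers:
            key = answer.strip().lower()  # naive duplicate detection
            score[key] += weight * confidence
            sources[key].add(kb)
    total = sum(score.values())
    if total == 0:
        return []
    fused = [
        {
            "answer": key,
            "probability": score[key] / total,
            "sources": sorted(sources[key]),
        }
        for key in score
    ]
    # Highest fused probability first.
    return sorted(fused, key=lambda row: row["probability"], reverse=True)


if __name__ == "__main__":
    # Made-up answers to a single query, for illustration only.
    results = {
        "YAGO": [("Honolulu", 0.9)],
        "NELL": [("honolulu", 0.8), ("Kenya", 0.3)],
    }
    for row in fuse_answers(results):
        print(row)
```

Duplicate answers reinforce each other and conflicting answers compete for probability mass, while the `sources` field preserves which KBs contributed each entry.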
Knowledge bases may differ greatly in their schemas, with relations and properties that differ in both name and granularity. SigmaKB combines these ontologies into a single mediated ontology by taking the union across all KBs and canonicalizing relations that are semantically equivalent. Alignment algorithms commonly rely on syntactic and structural comparisons between relations. For structural analysis, we implemented the PARIS algorithm, a probabilistic technique that looks at how subject-object pairs participate in relations across different KBs. The query processing module uses the mediated ontology to translate user queries into a logical query plan across all individual KBs. We first push the translated query to each knowledge base separately, then aggregate and fuse the results. The system architecture and the query plan for a sample query are shown in the following figures.
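The structural idea behind relation alignment can be illustrated with a short Python sketch. This is a simplified overlap score, not the PARIS algorithm itself, which estimates alignment probabilities iteratively. The sketch assumes entities already share identifiers across KBs, and the function names, the threshold, and the relation names in the example are hypothetical.

```python
def pair_overlap(pairs_a, pairs_b):
    """Fraction of relation A's (subject, object) pairs also found in B."""
    pairs_a, pairs_b = set(pairs_a), set(pairs_b)
    if not pairs_a:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a)


def align_relations(kb_a, kb_b, threshold=0.5):
    """Map each relation in kb_a to its best-matching relation in kb_b.

    kb_a, kb_b: {relation_name: iterable of (subject, object) pairs}
    A match requires overlap in both directions to reach `threshold`
    (an assumed cutoff, not a value from the SigmaKB system).
    """
    mapping = {}
    for rel_a, pairs_a in kb_a.items():
        scored = [
            (min(pair_overlap(pairs_a, pairs_b),
                 pair_overlap(pairs_b, pairs_a)), rel_b)
            for rel_b, pairs_b in kb_b.items()
        ]
        if scored:
            best_score, best_rel = max(scored)
            if best_score >= threshold:
                mapping[rel_a] = best_rel
    return mapping


def translate_query(relation, mapping):
    """Rewrite a mediated relation name for one KB; None if unsupported."""
    return mapping.get(relation)


if __name__ == "__main__":
    # Toy data with made-up relation names.
    kb1 = {
        "bornIn": [("obama", "honolulu"), ("einstein", "ulm")],
        "diedIn": [("einstein", "princeton")],
    }
    kb2 = {
        "birthplace": [("obama", "honolulu"), ("einstein", "ulm"),
                       ("curie", "warsaw")],
        "deathplace": [("einstein", "princeton")],
    }
    mapping = align_relations(kb1, kb2)
    print(mapping)
    print(translate_query("bornIn", mapping))
```

Requiring overlap in both directions keeps a narrow relation from being merged with a much broader one that merely contains it; relaxing that to a one-directional score would instead capture sub-relation links, which matters when granularity differs between KBs.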
Interface
The user interface layer allows the user to submit queries directly to SigmaKB using SQL. Query results are displayed in tabular form along with provenance information. In addition to the specific knowledge bases each entry originated from, we display a unified confidence obtained using CM Fusion. Clicking on an entry brings up a tooltip with further KB-specific information and the option to mark the fact as incorrect as user feedback.
Slot Filler Validation 2016
A modified version of SigmaKB participated in the 2016 Slot Filler Validation evaluation and took first place for the second year in a row. Compared to our 2015 system, we introduce two major features. First, given the computational complexity of jointly reasoning over the complete set of input KBs, we present a query-driven approach that reasons only over the set of candidate answers to a specific query. This approach also allows our system to be more selective when answering 1-hop queries, since it evaluates only those answers whose corresponding 0-hop part is presumed correct. The second feature is an extra ensemble layer that combines signals from three types of uncertainty: (1) from the extractors, (2) from source documents, and (3) from users' beliefs. In addition to these main features, we also used a distance metric to partially disambiguate shallow entity names from different slot filler runs. The official SFV scoring metrics for each of the submitted runs are summarized in the following table.
Overall, our system outperformed the best individual runs, thus achieving the purpose of the ensemble. The most interesting result is the precision gained by the 3rd run on 1-hop queries at the cost of recall. Run 1 uses all runs, and 1-hop queries are usually answered by a small number of systems with very low precision. By running Consensus Maximization Fusion (CMF) with all systems included, SigmaKB becomes more selective and avoids including incorrect 1-hop fillers, which are penalized for having a wrong 0-hop counterpart.
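The 1-hop selectivity discussed above can be sketched in a few lines of Python. This is only an illustration of the filtering idea, assuming each 0-hop filler already carries a fused confidence. The function name `select_one_hop`, the data layout, and the 0.5 threshold are assumptions, not details of the submitted system.

```python
def select_one_hop(zero_hop, one_hop, threshold=0.5):
    """Keep 1-hop fillers whose 0-hop parent is presumed correct.

    zero_hop: {zero_hop_filler: fused confidence in [0, 1]}
    one_hop:  [(zero_hop_filler, one_hop_filler, confidence), ...]
    threshold: assumed cutoff for treating a 0-hop filler as correct.
    """
    trusted = {
        filler for filler, confidence in zero_hop.items()
        if confidence >= threshold
    }
    return [
        (parent, filler, confidence)
        for parent, filler, confidence in one_hop
        if parent in trusted
    ]


if __name__ == "__main__":
    # Made-up candidates for one query.
    zero_hop = {"Acme Corp": 0.9, "Acme Inc": 0.2}
    one_hop = [
        ("Acme Corp", "Jane Doe", 0.7),
        ("Acme Inc", "John Roe", 0.8),
    ]
    print(select_one_hop(zero_hop, one_hop))
```

A 1-hop filler with high confidence of its own is still dropped when its 0-hop parent falls below the threshold, which trades recall for precision in the same direction as the results described above.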