Today’s World Wide Web contains huge volumes of information about people, places, and events all over the world. Unfortunately, this information is intended for human readers and is not ready for direct machine processing, making it hard to query and analyze systematically. To overcome this limitation, researchers are converting it into structured formats, e.g., tables, by means of human collaboration, information extraction, data fusion, etc. We call the resulting structured database a knowledge base. One prominent effort to build a world-scale knowledge base is the Semantic Web. Today, many open knowledge bases are available for query and analysis, including DBpedia, Freebase, YAGO, and WordNet.
A key challenge faced by many current knowledge bases is that much information is not explicitly stated in web pages but may be inferred from existing sources. For instance, we know from Wikipedia pages that kale contains calcium and that calcium prevents osteoporosis, but the conclusion that kale prevents osteoporosis has to be derived by logical inference. To expand knowledge bases with this implicit knowledge, we design the ProbKB system.
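The chaining step above can be sketched as a single Horn-rule application. This is a minimal illustration, not ProbKB code; the fact sets and relation names are hypothetical.

```python
# Sketch of one Horn-rule inference step:
#   contains(x, y) AND prevents(y, z) => prevents(x, z)
contains = {("Kale", "Calcium")}
prevents = {("Calcium", "osteoporosis")}

# Join the two fact sets on the shared variable y to derive new facts.
inferred = {(x, z) for (x, y1) in contains
                   for (y2, z) in prevents if y1 == y2}
print(inferred)  # {('Kale', 'osteoporosis')}
```

The join on the shared variable `y` is exactly what a relational database does efficiently, which motivates the SQL-based design described next.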
We focus on two key challenges: efficiency and quality. First, web knowledge bases are big: there are millions or billions of facts and thousands of inference rules. In the state-of-the-art approach, the inference rules are stored in text files, which are parsed and applied sequentially by external C++ or Java programs during inference. This execution model is inefficient when the number of rules is large, since it issues one query per rule. Instead, exploiting the structure of the rules, we design a novel relational model for the inference rules and an efficient SQL-based inference algorithm that applies the rules in batches. Furthermore, our approach allows us to run inference over massively parallel processing (MPP) databases, such as Greenplum, which parallelize the inference process. We experimentally validate the ProbKB inference algorithm using a machine-constructed knowledge base, OpenIE, and a rule base, Sherlock, and observe over a 100× speedup over the state of the art during loading and inference.
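The batch idea can be sketched as follows: store the rules themselves in a relation and apply all of them with one join, rather than issuing one query per rule. This is a simplified sketch using SQLite; the schema and rule encoding are illustrative assumptions, not ProbKB’s actual relational model.

```python
import sqlite3

# Facts are triples; rules of the form head(x, z) <- body1(x, y), body2(y, z)
# are stored as data, so one SQL join applies every rule to every matching
# fact pair in a single batch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE facts (rel TEXT, arg1 TEXT, arg2 TEXT)")
cur.execute("CREATE TABLE rules (head TEXT, body1 TEXT, body2 TEXT)")

cur.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("contains", "Kale", "Calcium"),
    ("contains", "Milk", "Calcium"),
    ("prevents", "Calcium", "osteoporosis"),
])
cur.executemany("INSERT INTO rules VALUES (?, ?, ?)", [
    ("prevents", "contains", "prevents"),
])

# One batch inference step: join rules against the fact table twice.
cur.execute("""
    INSERT INTO facts
    SELECT r.head, f1.arg1, f2.arg2
    FROM rules r
    JOIN facts f1 ON f1.rel = r.body1
    JOIN facts f2 ON f2.rel = r.body2 AND f2.arg1 = f1.arg2
""")
conn.commit()

rows = cur.execute(
    "SELECT arg1 FROM facts WHERE rel = 'prevents' AND arg2 = 'osteoporosis'"
).fetchall()
print(sorted(r[0] for r in rows))  # ['Calcium', 'Kale', 'Milk']
```

Because the whole step is a declarative SQL query, an MPP engine such as Greenplum can parallelize it across segments without any changes to the logic.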
Second, since all facts and rules are constructed automatically by statistical algorithms, they are inevitably noisy. Worse, erroneous facts tend to propagate rapidly and generate more errors along the inference chain. It is therefore important to detect and recover from errors early to maintain a high-quality knowledge base. Analyzing the error distribution, we observe that the majority of errors are caused by ambiguous entities, incorrect rules, and propagated errors: ambiguous entities invalidate equality checks in join queries, and incorrect rules are applied repeatedly to different facts across inference iterations. Thus, we use a set of semantic constraints from Leibniz to detect erroneous facts (extracted or propagated) and ambiguous entities, and we clean the rules based on their statistical properties. Our cleaning methods achieve 0.6 higher precision than the original OpenIE-Sherlock knowledge base.
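One simple kind of semantic check can be sketched as a functional constraint: a relation that should map each subject to at most one object flags a subject with conflicting objects as ambiguous or erroneous. The relation, constraint, and facts below are hypothetical illustrations, not the actual constraints used in ProbKB.

```python
from collections import defaultdict

# Illustrative facts: "Georgia" is ambiguous (a US state and a country),
# so it violates a functional constraint on located_in.
facts = [
    ("located_in", "Atlanta", "Georgia"),
    ("located_in", "Georgia", "USA"),
    ("located_in", "Georgia", "Caucasus"),
]
functional = {"located_in"}  # assumed: one object per subject

# Group objects by (relation, subject) and flag violators.
objects = defaultdict(set)
for rel, subj, obj in facts:
    if rel in functional:
        objects[(rel, subj)].add(obj)

flagged = {subj for (rel, subj), objs in objects.items() if len(objs) > 1}
print(flagged)  # {'Georgia'}
```

Flagging such entities before inference keeps them from joining with unrelated facts and spreading errors through later iterations.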
As a next step, we will continue working on quality control. We will investigate whether our semantic constraint approach can be moved up the knowledge base construction pipeline to the rule-learning phase. If so, we will obtain a much cleaner rule set, making error recovery in the inference phase much easier.
For more information, please refer to our publication.