Data Science Research

ProbKB: Web-Scale Probabilistic Knowledge Base

Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. The ProbKB project aims to build a web-scale probabilistic knowledge base through scalable learning and inference. This goal is supported by two current projects:

Mining first-order knowledge

We design the Ontological Pathfinding algorithm, which scales first-order rule mining to web knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model that applies inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm that breaks the mining task into smaller independent subtasks, and a pruning strategy that eliminates unsound and resource-consuming rules before they are applied. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.
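To make the idea of scoring and pruning candidate rules concrete, here is a minimal Python sketch. It is not the Ontological Pathfinding implementation: the toy facts, the function names, the single rule shape head(x, z) <- body1(x, y), body2(y, z), and the confidence threshold are all assumptions made for illustration, and the partitioning and parallelization steps are omitted.

```python
from collections import defaultdict

# Toy facts as (predicate, subject, object) triples; all names are made up.
FACTS = {
    ("bornIn", "alice", "paris"),
    ("bornIn", "bob", "lyon"),
    ("bornIn", "carol", "berlin"),
    ("cityOf", "paris", "france"),
    ("cityOf", "lyon", "france"),
    ("cityOf", "berlin", "germany"),
    ("nationality", "alice", "france"),
    ("nationality", "bob", "france"),
}


def index_by_predicate(facts):
    """Group (subject, object) pairs by predicate."""
    index = defaultdict(set)
    for pred, subj, obj in facts:
        index[pred].add((subj, obj))
    return index


def score_path_rule(index, head, body1, body2):
    """Score the candidate rule head(x, z) <- body1(x, y), body2(y, z).

    Returns (support, confidence). Support counts derived pairs that are
    already known head facts; confidence divides support by the number
    of all derived pairs.
    """
    # Hash join of the two body predicates on the shared variable y.
    by_first = defaultdict(set)
    for y, z in index[body2]:
        by_first[y].add(z)
    derived = {(x, z) for x, y in index[body1] for z in by_first.get(y, ())}
    if not derived:
        return 0, 0.0
    support = len(derived & index[head])
    return support, support / len(derived)


def prune(index, candidates, min_confidence=0.5):
    """Keep candidates with some support and enough confidence.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept = []
    for rule in candidates:
        support, confidence = score_path_rule(index, *rule)
        if support > 0 and confidence >= min_confidence:
            kept.append((rule, support, confidence))
    return kept
```

In this sketch a candidate such as `("nationality", "bornIn", "cityOf")` is kept because two of its three derived pairs are already known facts, while a candidate whose body join is empty is dropped before it is ever used for inference.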

First-order inference engine

We design an efficient inference engine to infer implicit knowledge from existing knowledge bases: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) we implement ProbKB on massively parallel processing (MPP) databases to achieve further scalability; 3) we combine several quality control methods that identify erroneous rules and facts, as well as ambiguous entities, to improve the precision of inferred facts. The ProbKB inference engine outperforms the state-of-the-art inference engine in terms of both performance and quality.
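The batch idea can be illustrated with a small sketch using Python's built-in `sqlite3` module. The schema (`facts`, `rules`), the column names, the toy data, and the restriction to one rule shape are assumptions chosen for illustration; they are not the ProbKB schema, and the sketch ignores rule weights and fact probabilities. The point it shows is that storing rules as rows lets a single join apply every rule of the same shape at once, instead of issuing one query per rule.

```python
import sqlite3


def build_kb():
    """Create an in-memory toy knowledge base (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE facts (pred TEXT, subj TEXT, obj TEXT,
                            UNIQUE (pred, subj, obj));
        -- One row per rule of the shape head(x, z) <- body1(x, y), body2(y, z).
        CREATE TABLE rules (head TEXT, body1 TEXT, body2 TEXT);
    """)
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
        ("bornIn", "alice", "paris"),
        ("bornIn", "bob", "lyon"),
        ("cityOf", "paris", "france"),
        ("cityOf", "lyon", "france"),
        ("partOf", "france", "europe"),
    ])
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?)", [
        ("nationality", "bornIn", "cityOf"),
        ("inContinent", "cityOf", "partOf"),
    ])
    return conn


def count_facts(conn):
    return conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]


def expand_once(conn):
    """Apply all rules in one join; return the number of new facts."""
    before = count_facts(conn)
    conn.execute("""
        INSERT OR IGNORE INTO facts (pred, subj, obj)
        SELECT DISTINCT r.head, f1.subj, f2.obj
        FROM rules AS r
        JOIN facts AS f1 ON f1.pred = r.body1
        JOIN facts AS f2 ON f2.pred = r.body2 AND f2.subj = f1.obj
    """)
    return count_facts(conn) - before


def expand_to_fixpoint(conn, max_rounds=10):
    """Repeat batch application until no new facts appear."""
    total = 0
    for _ in range(max_rounds):
        added = expand_once(conn)
        if added == 0:
            break
        total += added
    return total
```

On the toy data, the first call to `expand_once(build_kb())` adds four facts (two `nationality` facts and two `inContinent` facts), and a second call on the same connection adds none. The `UNIQUE` constraint together with `INSERT OR IGNORE` keeps already-known facts from being duplicated.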

Faculty: Daisy Zhe Wang
Students: Yang Chen, Sean Goldberg, Soumitra Siddharth Johri

Publications

  • Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases
    Yang Chen, Sean Goldberg, Daisy Zhe Wang, Soumitra Siddharth Johri
    Proceedings of the ACM SIGMOD International Conference on Management of Data, 2016
  • Knowledge Expansion over Probabilistic Knowledge Bases
    Yang Chen, Daisy Zhe Wang
    Proceedings of the ACM SIGMOD International Conference on Management of Data, 2014
  • Web-Scale Knowledge Inference Using Markov Logic Networks
    Yang Chen, Daisy Zhe Wang
    Proceedings of the ICML Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs (SLG), 2013, Atlanta

Software

  • Ontological Pathfinding
    Mining first-order knowledge from large knowledge bases.
  • Knowledge Expansion
    Inferring hidden knowledge from knowledge bases.

Data

  • Freebase data dump; please contact Yang Chen for the clean 388M Freebase facts.
  • 36,625 Freebase first-order rules

Acknowledgments

The ProbKB project is partially supported by NSF IIS Award #1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google. We also thank Dr. Milenko Petrovic and Dr. Alin Dobra for helpful discussions on query optimization.

For any questions, please contact Yang Chen or Dr. Daisy Zhe Wang.
