• Home
  • Blog
  • People
  • Projects
  • Publications
  • Seminars
  • DSR Expo
  • Courses

Data Science Research

Eureka: Efficient Query Processing over Large Probabilistic Knowledge Bases

Due to the uncertainty, incompleteness and inconsistency introduced by automatic extraction processes, query results from current large-scale knowledge bases (KBs) are incomplete, erroneous and conflicting. The research objective of this proposal is to extend state-of-the-art KB systems to create a probabilistic first-order KB system that can infer missing knowledge using rules, prune conflicting knowledge using constraints, and return confidence values for resulting tuples. The new system and algorithms developed in this proposal can enable advanced online data analysis through a declarative query interface over the large uncertain graphs that exist in many high-impact applications, including knowledge bases, social networks, and biological networks.

The research objective of this proposal is to extend the data model, query language, query processing and optimization techniques of state-of-the-art KB systems to support a probabilistic first-order KB system. The P.I. will design a probabilistic KB graph data model; extend SPARQL into a probabilistic graph query language with additional inference operators; invent new query execution and optimization techniques for scalable inference queries; and implement a new query processing system using a unified data-parallel and graph-parallel system over web-scale probabilistic KB graphs.
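As a rough illustration of the three capabilities named above (inferring missing knowledge with rules, pruning conflicts with constraints, and returning confidence values), the following Python sketch works over a tiny in-memory set of triples. The facts, the rule and its weight, the functional constraint on `born_in`, and the product-of-confidences scoring are all assumptions made for this example; they are not the project's data model or inference semantics.

```python
# Hypothetical toy KB: (subject, predicate, object) -> extraction confidence.
facts = {
    ("alice", "born_in", "paris"): 0.9,
    ("alice", "born_in", "rome"): 0.3,  # conflicts with the fact above
    ("paris", "located_in", "france"): 0.95,
    ("rome", "located_in", "italy"): 0.95,
}

# Hypothetical rule: born_in(x, y) AND located_in(y, z) => born_in_country(x, z).
# The last element is an assumed rule weight.
RULE = ("born_in", "located_in", "born_in_country", 0.8)

# Hypothetical constraint: a subject has at most one object for these predicates.
FUNCTIONAL = {"born_in"}


def prune_functional(kb):
    """Keep only the highest-confidence object per (subject, functional predicate)."""
    best = {}
    for (s, p, o), conf in kb.items():
        if p in FUNCTIONAL and (s, p) not in best or (
            p in FUNCTIONAL and conf > best[(s, p)][1]
        ):
            best[(s, p)] = (o, conf)
    return {
        (s, p, o): conf
        for (s, p, o), conf in kb.items()
        if p not in FUNCTIONAL or best[(s, p)][0] == o
    }


def apply_rule(kb, rule):
    """Join the two body predicates on the shared variable and score the head."""
    p1, p2, head, weight = rule
    second = {}
    for (s, p, o), conf in kb.items():
        if p == p2:
            second.setdefault(s, []).append((o, conf))
    inferred = {}
    for (x, p, y), c1 in kb.items():
        if p != p1:
            continue
        for z, c2 in second.get(y, []):
            score = weight * c1 * c2  # assumes independent evidence
            key = (x, head, z)
            inferred[key] = max(inferred.get(key, 0.0), score)
    return inferred


clean = prune_functional(facts)
new_facts = apply_rule(clean, RULE)
for triple, conf in new_facts.items():
    print(triple, round(conf, 3))
```

Pruning before inference keeps the low-confidence `rome` fact from producing a conflicting derived tuple; a real system would reason over rules and constraints jointly rather than in two fixed passes.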

POC: Dr. Daisy Zhe Wang

Projects

ProbKB: Scalable Learning and Inference over Large Probabilistic Knowledge Bases

Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this project, we design the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from web-scale knowledge bases and apply the rules to uncover implicit facts. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule mining algorithm, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

Based on the mining algorithm and the optimizations, we develop an efficient inference engine. As a result, we infer 0.9 billion new facts from Freebase in 17.19 hours. We use cross validation to evaluate the inferred facts, and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approaches outperform state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
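To make the idea of mining and then applying a first-order rule concrete, here is a minimal Python sketch that scores one candidate rule of the form p(x, y) AND q(y, z) => r(x, z) by support and confidence, and lists the facts the rule would add. It is not the OP algorithm itself (no parallelism, pruning, or partitioning); the triples, predicate names, and the `score_path_rule` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy facts as (subject, predicate, object) triples.
triples = [
    ("ann", "works_at", "acme"),
    ("bob", "works_at", "acme"),
    ("cat", "works_at", "initech"),
    ("acme", "based_in", "boston"),
    ("initech", "based_in", "austin"),
    ("ann", "lives_in", "boston"),
    ("cat", "lives_in", "austin"),
    ("bob", "lives_in", "denver"),
]


def score_path_rule(triples, p, q, r):
    """Score p(x, y) AND q(y, z) => r(x, z); return support, confidence, new facts."""
    q_index = defaultdict(set)  # y -> {z} for every q(y, z)
    head = set()                # (x, z) pairs already stated by r
    for s, pred, o in triples:
        if pred == q:
            q_index[s].add(o)
        if pred == r:
            head.add((s, o))
    # All (x, z) pairs reachable by following p and then q.
    body = {
        (x, z)
        for x, pred, y in triples
        if pred == p
        for z in q_index.get(y, ())
    }
    support = len(body & head)
    confidence = support / len(body) if body else 0.0
    return support, confidence, sorted(body - head)


support, confidence, inferred = score_path_rule(
    triples, "works_at", "based_in", "lives_in"
)
print(support, round(confidence, 2), inferred)
```

A mining system would typically keep only rules whose confidence clears a chosen threshold before applying them; the threshold and the candidate generation strategy are left out of this sketch.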

For more details, please visit the ProbKB homepage.

Archer: Query-Driven Machine Learning

Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database; however, for large datasets this is an expensive process. Moreover, such approaches are wasteful because, in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: selection-driven ER and join-driven ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support the two classes of ER queries.
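The following Python sketch illustrates the selection-driven idea in its simplest form: a generic Metropolis-Hastings sampler whose proposals only ever change the cluster of the queried mention, so no effort is spent resolving unrelated entities. It is not the variant developed in this work; the mentions, the trigram similarity, the scoring function, and its `threshold` and `scale` parameters are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical mentions; the query targets the entity behind mention 0.
mentions = ["daisy wang", "d. wang", "daisy zhe wang", "kun li", "k. li"]
QUERY = 0


def trigrams(s):
    """Return the set of character trigrams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(a, b):
    """Jaccard overlap of character trigrams (an illustrative choice)."""
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)


def cluster_score(cluster, threshold=0.2, scale=10.0):
    """Log-score: similar pairs reward the cluster, dissimilar pairs penalize it."""
    members = sorted(cluster)
    return scale * sum(
        similarity(mentions[i], mentions[j]) - threshold
        for k, i in enumerate(members)
        for j in members[k + 1:]
    )


def resolve_query_entity(steps=2000, seed=0):
    """Sample the membership of the query entity's cluster only."""
    rng = random.Random(seed)
    cluster = {QUERY}
    score = cluster_score(cluster)
    counts = [0] * len(mentions)
    others = [i for i in range(len(mentions)) if i != QUERY]
    for _ in range(steps):
        # Symmetric proposal: toggle one non-query mention in or out.
        proposal = cluster ^ {rng.choice(others)}
        new_score = cluster_score(proposal)
        if new_score >= score or rng.random() < math.exp(new_score - score):
            cluster, score = proposal, new_score
        for i in cluster:
            counts[i] += 1
    # Fraction of samples in which each mention co-refers with the query.
    return [c / steps for c in counts]


membership = resolve_query_entity()
for text, freq in zip(mentions, membership):
    print(f"{text!r}: {freq:.2f}")
```

The returned frequencies approximate how often each mention was sampled into the query entity's cluster; mentions with high frequency are the likely matches for the selection query.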

For more details, please visit the Archer homepage.

DBlytics: Statistical analysis on data parallel frameworks

When processing large data, data movement is often a major bottleneck to computation. Moving data across geographical locations for processing is expensive. In-Database Analytics (DBlytics) aims to build sophisticated analytic algorithms into data parallel systems, such as relational databases (RDBMS) and massively parallel processing (MPP) systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching and fault tolerance. Below we list research sub-projects that contribute to this effort.

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data. There are two significant motivations for in-database analytics tools such as MADlib. Firstly, they harness the embarrassingly parallel processing power of parallel databases and turn the database into a data analytics engine capable of processing massive data. Secondly, in-database analytics tools avoid the time cost of transferring large volumes of data between databases and outside tools. MADlib can be installed on PostgreSQL and Greenplum databases.
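The benefit of avoiding data transfer can be seen in a small Python sketch that fits a one-variable linear regression by letting the database compute the sufficient statistics, so that only a single row of aggregates leaves it. SQLite (via the standard `sqlite3` module) stands in for a parallel database here, and the `points` table and its contents are hypothetical; this does not use or reproduce the MADlib API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],  # hypothetical data, y = 2x + 1
)

# The database scans the table and returns five aggregates in one row;
# the raw rows never leave the database.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM points"
).fetchone()

# Closed-form least-squares estimates from the sufficient statistics.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
conn.close()
```

Because sums can be computed independently per partition and then added together, the same aggregate query parallelizes naturally across the segments of a data-parallel system.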

For more details, please visit the DBlytics/MADlib homepage.

Publications

  • Query-driven Sampling for Collective Entity Resolution
    Christan Grant, Daisy Zhe Wang, Michael Wick
    IEEE 17th International Conference on Information Reuse and Integration, 2016
  • ArchimedesOne: Query Processing over Probabilistic Knowledge Bases
    Xiaofeng Zhou, Yang Chen, Daisy Zhe Wang
    Proceedings of the VLDB Endowment, 2016
  • SigmaKB: Multiple Uncertain Knowledge Base Fusion
    Miguel E. Rodríguez, Sean Goldberg, Daisy Zhe Wang
    Proceedings of the VLDB Endowment, 2016
  • Multimodal Ensemble Fusion for Disambiguation and Retrieval
    Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Ishan Patwa, Dihong Gong, Chunsheng Victor Fang
    Proceedings of the IEEE Multimedia Magazine, 2016
  • Scalable Image Retrieval with Multimodal Fusion
    Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Chunsheng Victor Fang
    Proceedings of the 29th International FLAIRS conference, 2016
  • Consensus Maximization Fusion of Probabilistic Information Extractors
    Miguel E. Rodríguez, Sean Goldberg, Daisy Zhe Wang
    Proceedings of the 15th Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT), 2016
  • Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases
    Yang Chen, Sean Goldberg, Daisy Zhe Wang, Soumitra Siddharth Johri
    To Appear in the Proceedings of the ACM SIGMOD International Conference on Management of Data, 2016
  • Optimizing Sampling-based Entity Resolution over Streaming Documents
    Christan Grant, Daisy Zhe Wang
    Proceedings of SDM Big Data & Streaming Analytics Workshop, 2015
  • A Topic-Based Search, Visualization, and Exploration System
    Christan Grant, Clint P. George, Virupaksha Kanjilal, Supriya Nirkhiwale, Joseph Wilson, Daisy Zhe Wang
    Proceedings of the 28th International FLAIRS Conference, 2015
  • A Challenge for Long-term Knowledge Base Maintenance
    Christan Grant, Daisy Zhe Wang
    Proceedings of ACM Journal on Data and Information Quality, 2015
  • UDA-GIST: An In-database Framework to Unify Data-Parallel and State-Parallel Analytics
    Kun Li, Daisy Zhe Wang, Alin Dobra, Christopher Dudley
    Proceedings of the VLDB Endowment, 2015
  • Efficient In-Database Analytics with Graphical Models
    Daisy Zhe Wang, Yang Chen, Christan Grant, Kun Li
    IEEE Data Engineering Bulletin, 2014
  • Knowledge Expansion over Probabilistic Knowledge Bases
    Yang Chen, Daisy Zhe Wang
    Proceedings of the ACM SIGMOD International Conference on Management of Data, 2014
  • SemMemDB: In-Database Knowledge Activation
    Yang Chen, Milenko Petrovic, Micah H. Clark
    Proceedings of the 27th International FLAIRS Conference, 2014
  • GPText: Greenplum Parallel Statistical Text Analysis Framework
    Kun Li, Christan Grant, Daisy Zhe Wang, Sunny Khatri, George Chitouras
    Data Analytics in the Cloud workshop (DanaC) at SIGMOD, 2013
  • Web-Scale Knowledge Inference Using Markov Logic Networks
    Yang Chen, Daisy Zhe Wang
    Proceedings of ICML workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs (SLG), 2013
  • MADden: Query-Driven Statistical Text Analytics
    Christan Grant, Jordan Gumbs, Kun Li, Daisy Zhe Wang, George Chitouras
    Proceedings of the 21st ACM CIKM International Conference on Information and Knowledge Management, 2012
  • Automatic Knowledge Base Construction using Probabilistic Extraction, Deductive Reasoning, and Human Feedback
    Daisy Zhe Wang, Yang Chen, Sean Goldberg, Christan Grant, Kun Li
    Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), 2012
  • The MADlib Analytics Library or MAD Skills, the SQL
    Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleks Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, Arun Kumar
    Proceedings of the VLDB Endowment, 2012

Open-Source Repositories

  • Knowledge Expansion
  • Ontological Pathfinding
  • 36,625 First-Order Inference Rules Mined from Freebase

People

  • Dr. Daisy Zhe Wang
  • Dr. Christan Grant
  • Dr. Kun Li
  • Yang Chen
  • Sean Goldberg
  • Miguel E. Rodríguez
  • Yang Peng
  • Xiaofeng Zhou
