Eureka: Efficient Query Processing over Large Probabilistic Knowledge Bases
Due to the uncertainty, incompleteness, and inconsistency introduced by automatic extraction processes, query results from current large-scale knowledge bases (KBs) are incomplete, erroneous, and conflicting. The research objective of this proposal is to extend state-of-the-art KB systems into a probabilistic first-order KB system that can infer missing knowledge using rules, prune conflicting knowledge using constraints, and return confidence values for resulting tuples. The new system and algorithms developed in this proposal can enable advanced online data analysis through a declarative query interface over the large uncertain graphs that exist in many high-impact applications, including knowledge bases, social networks, and biological networks.
Concretely, this proposal extends the data model, query language, and query processing and optimization techniques of state-of-the-art KB systems to support a probabilistic first-order KB system. The P.I. will design a probabilistic KB graph data model; extend SPARQL into a probabilistic graph query language with additional inference operators; invent new query execution and optimization techniques for scalable inference queries; and implement a new query processing system using a unified data-parallel and graph-parallel system over web-scale probabilistic KB graphs.
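To make the three capabilities named above concrete (inferring with rules, pruning with constraints, returning confidence values), here is a minimal Python sketch. It is not the proposed system. The `Fact` type, the example triples, the single rule `born_in(x, c) AND located_in(c, k) -> nationality(x, k)`, the functional constraint on `born_in`, and the choice to multiply confidences by a hypothetical `rule_weight` (which assumes independent premises) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subj: str
    pred: str
    obj: str
    conf: float  # extraction confidence in [0, 1]


def prune_functional(facts, functional_preds):
    """Constraint step: a functional predicate allows one object per subject,
    so keep only the most confident candidate (illustrative policy)."""
    kept, best = [], {}
    for f in facts:
        if f.pred not in functional_preds:
            kept.append(f)
            continue
        key = (f.subj, f.pred)
        if key not in best or f.conf > best[key].conf:
            best[key] = f
    return kept + list(best.values())


def infer_nationality(facts, rule_weight=0.9):
    """Rule step: born_in(x, c) AND located_in(c, k) -> nationality(x, k).
    Confidence is the product of the premises and the assumed rule weight."""
    born = [f for f in facts if f.pred == "born_in"]
    located = [f for f in facts if f.pred == "located_in"]
    return [
        Fact(b.subj, "nationality", l.obj, b.conf * l.conf * rule_weight)
        for b in born
        for l in located
        if b.obj == l.subj
    ]


facts = [
    Fact("Ada", "born_in", "London", 0.9),
    Fact("Ada", "born_in", "Paris", 0.3),  # conflicts with the fact above
    Fact("London", "located_in", "UK", 1.0),
    Fact("Paris", "located_in", "France", 1.0),
]
clean = prune_functional(facts, {"born_in"})
inferred = infer_nationality(clean)
```

After pruning, only the higher-confidence birthplace survives, so the rule produces a single `nationality` fact whose confidence reflects both the extraction confidence and the rule weight. A real probabilistic KB would use a principled model for combining uncertainty rather than a plain product.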
POC: Dr. Daisy Zhe Wang
Projects
ProbKB: Scalable Learning and Inference over Large Probabilistic Knowledge Bases
Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this project, we design the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from web-scale knowledge bases and apply the rules to uncover implicit facts. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule mining algorithm, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.
Based on the mining algorithm and the optimizations, we develop an efficient inference engine, with which we infer 0.9 billion new facts from Freebase in 17.19 hours. We use cross-validation to evaluate the inferred facts and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approaches outperform state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
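As a rough illustration of what "applying inference rules to uncover implicit facts" means, the following Python sketch forward-chains two kinds of first-order rules over a set of triples until no new facts appear. It is not the OP algorithm or the ProbKB inference engine: it has no rule mining, pruning, partitioning, or parallelism, and the rule encoding, predicate names, and `apply_rules` function are hypothetical.

```python
def apply_rules(facts, rules):
    """Forward-chain over (pred, subj, obj) triples until a fixpoint.

    Each rule is (head, body), where body is either
      (p,)    meaning p(x, y) -> head(x, y), or
      (p, q)  meaning p(x, y) AND q(y, z) -> head(x, z).
    Returns only the newly inferred facts.
    """
    known = set(facts)
    while True:
        # Index the current facts by predicate for quick lookup.
        by_pred = {}
        for p, s, o in known:
            by_pred.setdefault(p, []).append((s, o))

        new = set()
        for head, body in rules:
            if len(body) == 1:
                for x, y in by_pred.get(body[0], []):
                    new.add((head, x, y))
            else:
                p, q = body
                # Hash join on the shared variable y.
                q_by_subj = {}
                for y, z in by_pred.get(q, []):
                    q_by_subj.setdefault(y, []).append(z)
                for x, y in by_pred.get(p, []):
                    for z in q_by_subj.get(y, []):
                        new.add((head, x, z))

        new -= known
        if not new:
            return known - set(facts)
        known |= new


facts = {
    ("born_in", "Ada", "London"),
    ("capital_of", "London", "UK"),
}
rules = [
    ("located_in", ("capital_of",)),            # capital_of(x, y) -> located_in(x, y)
    ("citizen_of", ("born_in", "located_in")),  # born_in(x, y) AND located_in(y, z) -> citizen_of(x, z)
]
inferred = apply_rules(facts, rules)
```

The second rule only fires in the second pass, once `located_in(London, UK)` has been derived, which is why the loop runs to a fixpoint instead of applying each rule once.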
For more details, please visit the ProbKB homepage.
Archer: Query-Driven Machine Learning
Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database; however, for large datasets this is an expensive process. Moreover, such approaches are wasteful because, in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: selection-driven ER and join-driven ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support the two classes of ER queries.
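The idea of spending sampling effort only where a query needs it can be sketched as follows. This is a simplified illustration, not Archer's algorithm: the token-level Jaccard similarity, the affinity `threshold`, the `temperature`, and the fixed proposal weights that favor mentions resembling the query are all assumptions, and selectivity-based scheduling is not shown. Because the proposal weights do not depend on the current state, the proposal is symmetric and the Metropolis-Hastings acceptance ratio reduces to the ratio of state scores.

```python
import math
import random


def similarity(a, b):
    """Token-level Jaccard similarity between two mention strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def query_driven_er(mentions, query_idx, steps=5000, threshold=0.25,
                    temperature=0.05, seed=0):
    """Sample cluster labels with Metropolis-Hastings and return the
    cluster that contains the query mention."""
    rng = random.Random(seed)
    n = len(mentions)
    # Pairwise affinity: positive values are evidence for "same entity".
    aff = [[similarity(a, b) - threshold for b in mentions] for a in mentions]
    labels = list(range(n))  # start with every mention in its own cluster
    # Fixed proposal weights: prefer mentions that resemble the query.
    weights = [0.05 + similarity(mentions[query_idx], m) for m in mentions]

    for _ in range(steps):
        m = rng.choices(range(n), weights=weights)[0]
        new_label = rng.randrange(n)
        old_label = labels[m]
        if new_label == old_label:
            continue
        # Change in total within-cluster affinity if m moves clusters.
        gain = sum(aff[m][j] for j in range(n)
                   if j != m and labels[j] == new_label)
        loss = sum(aff[m][j] for j in range(n)
                   if j != m and labels[j] == old_label)
        delta = gain - loss
        # Accept with probability min(1, exp(delta / temperature)).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            labels[m] = new_label

    return {mentions[j] for j in range(n) if labels[j] == labels[query_idx]}


mentions = ["Daisy Wang", "Daisy Zhe Wang", "D Wang", "Kun Li", "K Li", "Yang Chen"]
cluster = query_driven_er(mentions, query_idx=0)
```

Mentions unrelated to the query are rarely proposed, so most of the sampling budget goes to the part of the database the query touches. The result is a single sample from the chain, so a practical system would aggregate over many samples.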
For more details, please visit the Archer homepage.
DBlytics: Statistical analysis on data parallel frameworks
DBlytics
When processing large data, data movement is often a major bottleneck to computation. Moving data across geographical locations for processing is expensive. In-database analytics (DBlytics) aims to build sophisticated analytic algorithms into data-parallel systems, such as relational databases (RDBMS) and massively parallel processing (MPP) systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching, and fault tolerance. Below we list research sub-projects that contribute to this effort.
MADlib
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, and machine-learning methods for structured and unstructured data. There are two significant motivations for in-database analytics tools such as MADlib. First, they harness the embarrassingly parallel processing power of parallel databases and turn the database into a data analytics engine capable of processing massive data. Second, they avoid the time cost of transferring large volumes of data between databases and outside tools. MADlib can be installed on PostgreSQL and Greenplum Database.
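The second motivation, avoiding data transfer, can be illustrated with a small Python sketch that fits a one-variable linear regression by letting the database compute the sufficient statistics with SQL aggregates, so only five numbers leave the database instead of every row. This does not use MADlib or its API: SQLite (from the Python standard library) stands in for PostgreSQL or Greenplum, and the `points` table and `linregr_in_db` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
# Synthetic data on the line y = 2x + 1.
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(x), 2.0 * x + 1.0) for x in range(10)],
)


def linregr_in_db(conn):
    """Least-squares fit of y = slope * x + intercept.

    The aggregates are evaluated inside the database; the client only
    combines five scalars.
    """
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM points"
    ).fetchone()
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


slope, intercept = linregr_in_db(conn)
```

Because sums are decomposable, a parallel database can compute them per partition and merge the partial results, which is the property that makes this style of analytics data-parallel.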
For more details, please visit the DBlytics/MADlib homepage.
Publications
- Query-driven Sampling for Collective Entity Resolution
Christan Grant, Daisy Zhe Wang, Michael Wick
IEEE 17th International Conference on Information Reuse and Integration, 2016
- ArchimedesOne: Query Processing over Probabilistic Knowledge Bases
Xiaofeng Zhou, Yang Chen, Daisy Zhe Wang
Proceedings of the VLDB Endowment, 2016
- SigmaKB: Multiple Uncertain Knowledge Base Fusion
Miguel E. Rodríguez, Sean Goldberg, Daisy Zhe Wang
Proceedings of the VLDB Endowment, 2016
- Multimodal Ensemble Fusion for Disambiguation and Retrieval
Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Ishan Patwa, Dihong Gong, Chunsheng Victor Fang
Proceedings of the IEEE Multimedia Magazine, 2016
- Scalable Image Retrieval with Multimodal Fusion
Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Chunsheng Victor Fang
Proceedings of the 29th International FLAIRS Conference, 2016
- Consensus Maximization Fusion of Probabilistic Information Extractors
Miguel E. Rodríguez, Sean Goldberg, Daisy Zhe Wang
Proceedings of the 15th Conference of the North American Chapter of the Association of Computational Linguistics (NAACL HLT), 2016
- Ontological Pathfinding: Mining First-Order Knowledge from Large Knowledge Bases
Yang Chen, Sean Goldberg, Daisy Zhe Wang, Soumitra Siddharth Johri
To Appear in the Proceedings of the ACM SIGMOD International Conference on Management of Data, 2016
- Optimizing Sampling-based Entity Resolution over Streaming Documents
Christan Grant, Daisy Zhe Wang
Proceedings of SDM Big Data & Streaming Analytics Workshop, 2015
- A Topic-Based Search, Visualization, and Exploration System
Christan Grant, Clint P. George, Virupaksha Kanjilal, Supriya Nirkhiwale, Joseph Wilson, Daisy Zhe Wang
Proceedings of the 28th International FLAIRS Conference, 2015
- A Challenge for Long-term Knowledge Base Maintenance
Christan Grant, Daisy Zhe Wang
Proceedings of ACM Journal on Data and Information Quality, 2015
- UDA-GIST: An In-database Framework to Unify Data-Parallel and State-Parallel Analytics
Kun Li, Daisy Zhe Wang, Alin Dobra, Christopher Dudley
Proceedings of the VLDB Endowment, 2015
- Efficient In-Database Analytics with Graphical Models
Daisy Zhe Wang, Yang Chen, Christan Grant, Kun Li
IEEE Data Engineering Bulletin, 2014
- Knowledge Expansion over Probabilistic Knowledge Bases
Yang Chen, Daisy Zhe Wang
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2014
- SemMemDB: In-Database Knowledge Activation
Yang Chen, Milenko Petrovic, Micah H. Clark
Proceedings of the 27th International FLAIRS Conference, 2014
- GPText: Greenplum Parallel Statistical Text Analysis Framework
Kun Li, Christan Grant, Daisy Zhe Wang, Sunny Khatri, George Chitouras
Data Analytics in the Cloud Workshop (DanaC) at SIGMOD, 2013
- Web-Scale Knowledge Inference Using Markov Logic Networks
Yang Chen, Daisy Zhe Wang
Proceedings of ICML Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs (SLG), 2013
- MADden: Query-Driven Statistical Text Analytics
Christan Grant, Jordan Gumbs, Kun Li, Daisy Zhe Wang, George Chitouras
Proceedings of the 21st ACM CIKM International Conference on Information and Knowledge Management, 2012
- Automatic Knowledge Base Construction using Probabilistic Extraction, Deductive Reasoning, and Human Feedback
Daisy Zhe Wang, Yang Chen, Sean Goldberg, Christan Grant, Kun Li
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), 2012
- The MADlib Analytics Library or MAD Skills, the SQL
Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleks Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, Arun Kumar
Proceedings of the VLDB Endowment, 2012
Open-Source Repositories
People
Dr. Daisy Zhe Wang | Dr. Christan Grant | Dr. Kun Li | Yang Chen
Sean Goldberg | Miguel E. Rodríguez | Yang Peng | Xiaofeng Zhou