Data Science Research


Streaming Fact Extraction for Wikipedia Entities at Web-Scale

May 12, 2014

Morteza Shahriari Nia, Christan Grant, Yang Peng

Wikipedia.org (WP) is the largest and most popular general reference work on the Internet. Presently, there is a considerable time lag between the publication of an event and its citation in WP: the median lag for a sample of about 60K web pages cited by WP articles in the living-people category is over a year (Frank et al., 2013). Moreover, discovering facts that are relevant and cite-worthy for a given WP entity across the Internet is quite challenging.

Consider an example sentence: “Boris Berezovsky, who made his fortune in Russia in the 1990s, passed away March 2013.” There are two persons named Boris Berezovsky in Wikipedia: one a businessman and the other a pianist. Any extraction needs to take this ambiguity into account (a.k.a. entity resolution). Then, we match the sentence against a list of topics and find a match to the topic DateOfDeath, whose value in the sentence is March 2013.

Table 1: The set of possible slot names for each entity type.

In this work, we introduce an efficient system that extracts facts for given WP entities from a stream of documents. Fact extraction is the task of matching each sentence to the {subject — verb — adverbial/complement} sentence structure. The subject represents the WP entity, the verb is the relation type (slot) as in Table 1, and the adverbial/complement represents the value of the associated slot. In our example, the entity is Boris Berezovsky and the slot we extract is DateOfDeath with a slot value of March 2013. The resulting extraction, containing an entity, slot name, and slot value, is a fact.

Figure 1: System Architecture. Components are logical groups noted with dotted boxes.

Our system is built with a pipeline-style architecture, depicted in Figure 1. The three logical components are Model, for entity resolution; Wikipedia Citation, to annotate cite-worthy documents; and Slot Filling, to generate the actual slot values.

Model. Using regular expressions, we extract the bold phrases of the initial paragraph of the WP entity page as aliases. Then we generate possible forms of writing (e.g. ‘Boris Berezovsky’ can also be written as ‘Berezovsky, Boris’). Next, we iterate over documents in the stream and filter out all documents that do not explicitly contain a string matching the list of entities.
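The alias-building and stream-filtering steps above can be sketched as follows. This is a minimal illustration, not the system's actual (Java) implementation; it assumes wikitext-style bold markup (`'''...'''`) for the lead-paragraph aliases and handles only the two-token "Last, First" rewrite.

```python
import re

def extract_aliases(intro_wikitext):
    """Treat bold phrases ('''...''') in the WP lead paragraph as aliases."""
    return re.findall(r"'''(.+?)'''", intro_wikitext)

def alias_forms(name):
    """Generate alternate written forms, e.g. 'Boris Berezovsky' ->
    'Berezovsky, Boris' (only the two-token case is handled here)."""
    forms = {name}
    parts = name.split()
    if len(parts) == 2:
        forms.add(f"{parts[1]}, {parts[0]}")
    return forms

def filter_stream(documents, entities):
    """Drop documents that do not explicitly contain any alias string."""
    aliases = set()
    for e in entities:
        aliases |= alias_forms(e)
    return [d for d in documents if any(a in d for a in aliases)]
```

The explicit string match is deliberately cheap: it is a recall-oriented pre-filter, with disambiguation between same-named entities left to later stages.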

Wikipedia Citation. The corpus of documents comes in the form of chunk files, each of which contains thousands of documents. The corpus is processed by a two-layer filter system, referred to as the Document Chunk Filter and the Document Filter. The purpose of these filters is to reduce I/O cost while generating slot values for various entities: the Document Chunk Filter removes chunk files that do not contain a mention of any desired entity, and the Document Filter removes individual documents that do not mention an entity.
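A sketch of the two-layer filter, assuming a chunk is simply a list of document strings (the real corpus format is richer):

```python
def chunk_filter(chunks, aliases):
    """Document Chunk Filter: pass through only chunk files (lists of
    document strings) that mention at least one entity alias."""
    return [c for c in chunks if any(a in doc for doc in c for a in aliases)]

def document_filter(chunk, aliases):
    """Document Filter: within a surviving chunk, keep only the documents
    that themselves mention an alias."""
    return [doc for doc in chunk if any(a in doc for a in aliases)]
```

The coarse chunk-level pass lets the system skip reading and decompressing most files entirely, which is where the I/O savings come from.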

Slot Filling. We extract fact values from sentences according to a list of patterns. We define slot-value extraction patterns as a tuple of five values ⟨p1, p2, p3, p4, p5⟩, where p1 represents the type of entity from the set {FACILITY, ORGANIZATION, PERSON}. p2 represents a slot name from Table 1. p3 is the pattern content — a string found in the sentence that identifies a slot name. The pattern evaluator uses a direction (left or right) found in p4 to explore the sentence. The final element p5 represents the type of the slot value. An example pattern is ⟨PER, DateOfDeath, passed away, right, NP⟩.
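A simplified sketch of pattern evaluation over the running example. For brevity it takes the entire text on the indicated side of the pattern content as the slot value, whereas the system restricts the value to a phrase of type p5 (e.g. an NP):

```python
from collections import namedtuple

# <p1, p2, p3, p4, p5> = (entity type, slot name, content, direction, value type)
Pattern = namedtuple("Pattern", "entity_type slot_name content direction value_type")

PATTERNS = [Pattern("PER", "DateOfDeath", "passed away", "right", "NP")]

def extract_facts(entity, entity_type, sentence, patterns):
    """Scan the sentence for each pattern's content string and take the
    text on the indicated side as the candidate slot value."""
    facts = []
    for p in patterns:
        if p.entity_type != entity_type:
            continue
        idx = sentence.find(p.content)
        if idx == -1:
            continue
        if p.direction == "right":
            value = sentence[idx + len(p.content):].strip(" .,")
        else:
            value = sentence[:idx].strip(" .,")
        facts.append((entity, p.slot_name, value))
    return facts
```

Each returned triple (entity, slot name, slot value) is one fact in the sense defined above.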

Inference and constraints. The output contains many duplicate entries. Duplicates can be present in a window of rows; we use a window size of two, meaning we only compare adjacent rows (of extractions). Two rows are duplicates if they contain the exact same extraction, if they have the same slot name and a similar slot value, or if the extractions for a particular slot type come from the same sentence. New slots can be deduced from existing slots by defining inference rules. For example, two slots for the task are “FounderOf” and “FoundedBy”. A safe assumption is that these slot names are biconditional logical connectives over the entities and slot values. Therefore, we can express a rule “X FounderOf Y ” ↔ “Y FoundedBy X” where X and Y are single unique entities. Additionally, we found that the slot name “Contact Meet PlaceTime” could be inferred as “Contact Meet Entity” if the Entity was a FAC and the extracted sentence contained an additional ORG/FAC tag, and so on.
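The window-of-two de-duplication and the FounderOf/FoundedBy rule can be sketched on (entity, slot, value) triples; exact equality stands in for the similarity test on slot values:

```python
def dedup_adjacent(rows):
    """Window-of-two de-duplication: drop a row whose immediately
    preceding row has the same slot name and slot value."""
    out = []
    for row in rows:
        if out and out[-1][1:] == row[1:]:  # same (slot, value) as previous
            continue
        out.append(row)
    return out

def apply_founder_rule(facts):
    """Biconditional rule: 'X FounderOf Y' <-> 'Y FoundedBy X'."""
    inferred = []
    for (x, slot, y) in facts:
        if slot == "FounderOf":
            inferred.append((y, "FoundedBy", x))
        elif slot == "FoundedBy":
            inferred.append((y, "FounderOf", x))
    return facts + inferred
```

Other rules, such as the Contact Meet PlaceTime inference, follow the same shape: a condition on the existing fact plus a rewrite of its slot name.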

This effort was a part of the Knowledge Base Acceleration (KBA) track at the 2013 National Institute of Standards and Technology (NIST) Text REtrieval Conference (TREC). TREC KBA will continue in 2014 with an updated corpus.

Our system was developed in Java, and we tested our techniques on the large-scale KBA corpus. A more detailed write-up of this system was published at the 2014 FLAIRS Conference. Our paper will be released shortly.
