One of the research themes of the UF Data Science Research lab is adapting machine learning algorithms to real-time interactive query engines. Our prior work looked at implementing and optimizing information extraction and statistical machine learning operations in exploratory query processing systems.
To push the envelope further, our newly completed paper (under submission), entitled “Query-Driven Sampling for Collective Entity Resolution,” describes a sampling-based entity resolution (ER) algorithm optimized for the case when the user has one or more specific entities in mind. The paper describes three new techniques and the resulting performance improvements. Below we motivate the problem and walk through an example of our solution.
Cross-document entity resolution is the process of identifying and linking/grouping different manifestations — mentions, noun phrases, named entities — of the same real-world object across document boundaries. ER is a notoriously difficult and expensive task, for two core reasons. First, language is ambiguous: a single entity may have several wholly different manifestations. A person may have formal titles and nicknames, all different from their legal name. Conversely, multiple distinct entities may share identical names or properties. Second, modern data sets are large, and entity resolution becomes dramatically more expensive as they grow: each new record must be compared against every existing one, so the number of pairwise comparisons grows quadratically, and the space of possible entity clusterings grows faster still.
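To make the growth concrete, here is a quick back-of-the-envelope calculation (an illustration, not from the paper) of how a naive all-pairs ER pass scales:

```python
def pairwise_comparisons(n):
    """Number of mention pairs a naive all-pairs ER pass must score: n choose 2."""
    return n * (n - 1) // 2

print(pairwise_comparisons(1_000))  # 499500
print(pairwise_comparisons(2_000))  # 1999000 -- doubling the data ~4x the work
```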
Here is a simplified example. Suppose we are interested in documents involving a TV host named Jimmy. A search of the dataset will return several named entities describing talk show hosts named Jimmy. Inspecting the results, we see that there may be multiple talk show hosts named Jimmy. Below is a sample of the sentences returned from a large dataset.
…Late-night host and comic Jimmy Fallon was born on…
…the Kanye West appearance on Jimmy Kimmel Live last…
…Jimmy Kimmel Shares His “Only Complaint” About Jimmy Fallon…
…funny scene with Jimmy Kimmel where he attempts to fire his biggest fan…
…while on the right is Jimmy Fallon playing the same mission on his TV show…
Given millions of similar mentions across the documents in a dataset, you can imagine how difficult it is for an algorithm to distinguish between entities. Cross-document entity resolution attempts to cluster the mentions that correspond to the same real-world entity. Our observation is that often we are not interested in resolving all the entities, but only a small subset, or watchlist, of entities. Given this assumption, we describe a new method of resolving entities given a watchlist of query mentions.
To make the problem more tractable, typical approaches reduce the data set by removing all the mentions that obviously cannot refer to the entities of interest. However, it is difficult to enumerate all possible references to an entity, so this reduction, known as blocking, may lower the maximum achievable recall.
Developing these blocking schemes is a bit of a black art. For example, blocking on first and last name may not be effective for companies or locations, especially in data sets with messy or non-canonicalized terms. Careless blocking may separate records that belong in the same entity cluster across canopies. Each block is isolated, so resolution is not performed across blocks.
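As an illustration (my own, not from the paper), a naive first-and-last-token blocking key already shows how coreferent mentions can be split across blocks:

```python
from collections import defaultdict

def block_key(mention):
    """Naive key: first and last whitespace token, lowercased.
    Tolerable for clean person names; poor for companies or messy text."""
    tokens = mention.lower().split()
    return (tokens[0], tokens[-1]) if tokens else ("", "")

def make_blocks(mentions):
    """Group mentions by key; resolution then runs only within each block."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[block_key(m)].append(m)
    return blocks

blocks = make_blocks(["Jimmy Fallon", "Jimmy Kimmel", "Mr. Jimmy Fallon"])
# "Jimmy Fallon" and "Mr. Jimmy Fallon" land in different blocks, so they
# can never be resolved to the same entity -- recall is lost up front.
```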
Query-Driven Sampling-based Statistical Inference for ER
In query-driven entity resolution, the user provides a template or example entity (the query node). The algorithm then does two things. First, it generates a probabilistic canopy to identify the records most similar to the query mention. Second, the Metropolis-Hastings sampler is biased toward the items in this implicit block. In other words, the entity resolution algorithm uses the implicit block to perform sampling over a large set of mentions; the algorithm becomes query-aware.
This method works well when a user is interested in a handful, or watchlist, of entities. The index structure can be created in parallel for each item on the watchlist. The sampling index is a version of Vose’s alias method. The frequency with which each record is sampled is based on its normalized distance to the query mention, where distance is a heuristic built from any set of similarity comparisons.
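A minimal sketch of the table construction in Vose’s alias method follows (a generic implementation, not the paper’s code); the input weights would be each record’s normalized similarity to the query mention. Construction is O(n), and, as described below, each draw from the finished tables takes constant time:

```python
def build_alias(weights):
    """Vose's alias method: O(n) table setup for O(1) weighted sampling.
    Returns (prob, alias): per-column acceptance probabilities and aliases."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]   # scale so the mean weight is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                          # l covers s's leftover column mass
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:                   # numerical leftovers round to 1
        prob[i] = 1.0
    return prob, alias
```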
Instead of always selecting a mention uniformly at random, we first flip a coin to decide whether to use the traditional sampling method or to perform a biased sample (depicted as a spin-wheel) over the Vose structure.
When the spin-wheel is selected, two additional coin flips select a row and column of the structure that the alias method produces. The index provides biased samples in constant time, and more similar mentions are sampled more frequently. As proved in prior work, the error introduced by not sampling uniformly is exactly the mutual influence of the query node and each node in the set.
No risky blocking needs to be performed; instead, once the query is expressed, a sampling data structure is created and sampling proceeds across the full set of records. We see orders-of-magnitude speedups for queries over both popular and rare entities on the Wikilinks and other datasets.