Sean Goldberg
Information Extraction (IE) is the task of converting unstructured free text into a more structured form for better searching, analysis, and organization. Automating IE is crucial to making sense of the enormous amount of unstructured information being put on the web every day. Even the most current state-of-the-art algorithms are still prone to errors, because machines generally lack the prior knowledge and contextual ability that humans use to interpret language semantically.
This doesn’t mean we should throw out these methods. On the contrary, there are far more tasks on which we can train machines to perform well than tasks on which we can’t. What would be useful is a way to take all the ambiguous, uncertain parts of the extraction process and send them to a human for correction or confirmation.
CASTLE is a system we designed to do just that. It’s primarily a system for performing IE tasks and storing the results in a probabilistic database (PDB), but with an additional layer of data cleaning using humans. CASTLE differs from other human-in-the-loop systems, such as those that employ active learning, in that it’s designed for web-scale inference. Crowdsourcing through Amazon Mechanical Turk (AMT) provides a large human workforce for fast, cheap labor. Data cleaning corrections are posed to the crowd in the form of questions and are optimized to scale with the data, getting the most “bang for our buck”.
CASTLE accepts free text as input, uses a Conditional Random Field (CRF) to annotate the data according to a specific task (POS tagging, field segmentation, named entity recognition, etc.), and deposits the data, its tags, and the probabilistic information the CRF used to choose those tags into a PDB.
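To make this stage concrete, here is a minimal sketch of what CRF-based extraction with per-token uncertainty might look like, assuming a citation-segmentation task and the sklearn-crfsuite library. The feature function, the training variables (train_tokens, train_labels), and the output tuple format are illustrative stand-ins, not CASTLE’s actual code or schema.

```python
# Sketch of the extraction stage: tag tokens with a CRF and keep the
# per-token marginal probabilities so they can be stored in the PDB.
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple, illustrative feature function for token i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[0].isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

def featurize(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# train_tokens / train_labels: tokenized citations and gold labels
# such as AUTHOR / TITLE (hypothetical training data).
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit([featurize(t) for t in train_tokens], train_labels)

def extract(tokens):
    """Return (token, label, marginal distribution) triples for the PDB."""
    feats = featurize(tokens)
    labels = crf.predict_single(feats)
    marginals = crf.predict_marginals_single(feats)  # dict label -> prob per token
    return list(zip(tokens, labels, marginals))
```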
The PDB is then scanned, possible errors are identified, and they are converted into questions according to two metrics: mutual information and information density. Mutual information ensures that we select questions whose answers give information that can be used to improve other parts of the database. For example, consider the task of segmenting a scientific citation between title and author by labeling each token as one or the other. A question to the crowd asks for the label of a specific token. Selecting a token near the suspected boundary between title and author gives much more mutual information about neighboring tokens than selecting random tokens elsewhere.
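The boundary intuition can be made concrete with a rough sketch of a mutual-information score for a candidate question. It assumes per-token marginals and adjacent-pair marginals (e.g., from the CRF’s forward-backward pass) are already available; the function names and scoring are our illustration, not CASTLE’s actual selection code.

```python
import math

def mutual_information(pair_marginal, p_i, p_j):
    """MI between the labels of adjacent tokens i and j.

    pair_marginal: dict (label_i, label_j) -> P(y_i, y_j), assumed to come
    from the CRF's forward-backward pass.
    p_i, p_j: dicts label -> marginal probability for each token.
    """
    mi = 0.0
    for (a, b), p_ab in pair_marginal.items():
        if p_ab > 0:
            mi += p_ab * math.log(p_ab / (p_i[a] * p_j[b]))
    return mi

def question_score(idx, marginals, pair_marginals):
    """How much a crowd answer at token idx tells us about its neighbors."""
    score = 0.0
    if idx > 0:  # edge between idx-1 and idx
        score += mutual_information(pair_marginals[idx - 1],
                                    marginals[idx - 1], marginals[idx])
    if idx + 1 < len(marginals):  # edge between idx and idx+1
        score += mutual_information(pair_marginals[idx],
                                    marginals[idx], marginals[idx + 1])
    return score
```

Tokens near an uncertain title/author boundary score highest under this kind of measure, which is why they make better questions than randomly chosen tokens.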
The other metric, information density, eliminates redundancy. Some documents contain frequently occurring tokens, so asking one question and applying the answer to all of those tokens has the same effect as asking a separate question about each one. We use a form of contextual clustering to identify tokens that appear in the same context and are likely to share the same label.
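As a toy illustration of information density, the sketch below groups token occurrences by an exact (previous token, token, next token) key, which is a crude stand-in for CASTLE’s contextual clustering; the assumption is that occurrences in one cluster share a label, so one crowd answer covers them all.

```python
from collections import defaultdict

def cluster_by_context(corpus):
    """Group token occurrences that share the same local context.

    corpus: list of token lists (one list per document). The context key
    here is a simplification of CASTLE's contextual clustering.
    """
    clusters = defaultdict(list)
    for d, tokens in enumerate(corpus):
        for i, tok in enumerate(tokens):
            prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
            next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
            clusters[(prev_tok, tok.lower(), next_tok)].append((d, i))
    return clusters

def density(cluster):
    """One question about a cluster representative stands in for every
    occurrence in the cluster, so its value grows with the cluster size."""
    return len(cluster)
```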

Finally, once questions are selected for cleaning, CASTLE automatically queries the AMT service, dispatches the questions, and retrieves the results. CASTLE trades off between the two metrics to select the tokens that are most informative in terms of both mutual information and information density. “Cleaned” tokens are deposited back into the database, where the inference engine may be run again. CASTLE performs “constrained inference”, using the available evidence from the crowd to improve the remaining, unselected results even further.
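To show what “constrained inference” amounts to, here is a minimal sketch of Viterbi decoding over a generic linear-chain model in which crowd answers are clamped as hard evidence. The emission and transition score matrices are assumed to come from the trained CRF; this is an illustration of the idea, not CASTLE’s actual inference engine.

```python
import numpy as np

def constrained_viterbi(emissions, transitions, evidence):
    """Viterbi decoding with crowd answers fixed as hard evidence.

    emissions: (T, L) array of per-token label scores (log space).
    transitions: (L, L) array of label-to-label scores (log space).
    evidence: dict position -> label index confirmed by the crowd.
    """
    T, L = emissions.shape
    NEG = -1e18
    scores = emissions.copy()
    # Clamp crowd-verified positions by forbidding every other label.
    for t, lab in evidence.items():
        scores[t, :] = NEG
        scores[t, lab] = emissions[t, lab]

    dp = np.full((T, L), NEG)
    back = np.zeros((T, L), dtype=int)
    dp[0] = scores[0]
    for t in range(1, T):
        for l in range(L):
            cand = dp[t - 1] + transitions[:, l] + scores[t, l]
            back[t, l] = int(np.argmax(cand))
            dp[t, l] = cand[back[t, l]]

    # Backtrace the best label sequence consistent with the evidence.
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

Because the clamped positions constrain the transitions into and out of their neighbors, a single crowd answer can flip the labels of nearby uncertain tokens as well.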
We believe CASTLE represents an important step toward a hybrid system that combines the strengths of human and machine computation to achieve fast, cheap, and, above all, accurate results in information extraction.