Introduction
To keep up with the enormous pace of data being produced on the web, we need automatic methods for converting it into a more structured form for analyzing and querying. For text data, this amounts to generating labels that describe the underlying semantic concepts, a task known as Information Extraction (IE). Examples include parsing the authors and title out of a bibliographic citation or pulling people, places, and organizations out of a tweet.
A common way to perform automatic IE is through statistical machine learning, but even state-of-the-art methods make errors. This is especially true when the model is trained in one domain (say, on newspaper articles) and applied to another (like Twitter). In many cases, these errors are easily corrected by adding a small amount of human input. Human editing is slow and potentially costly, though, so we don't want to add too much: we want just the right amount to balance the speed of automation with the precision of a human.
This short blog post outlines a system developed at the DSR Lab called pi-CASTLE for performing interactive inference, combining human and machine computation to solve problems in information extraction. We will describe the model we use, known as a Conditional Random Field (CRF), explain how we select examples for labeling using information functions, and present some results. For more details, please see our paper (Goldberg et al. 2017).
Model
What makes labeling tasks in language inherently difficult are the structural dependencies between words in a sentence. The semantic meaning of a word depends on its position and relation to the other words in the sentence. Although the goal is to label individual words (also known as tokens), the entire sentence and all token relationships must be considered jointly to reach an optimal decision. We call such problems structured prediction problems because the output is a set of correlated labels.
The image above shows the common Conditional Random Field (CRF) model as applied to a sequence of word tokens in a bibliographic citation. The blue nodes represent the observed words and the white nodes the labels corresponding to those words, such as Title or Author. These label nodes are also called "hidden nodes" because they are not observed and must be inferred. Edges mark dependencies: each hidden node depends on its observed token and on other hidden nodes in the sequence. In full, this represents a statistical sequence model and can be used to score the probability of different label sequences in a sentence or citation according to the following formula:
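A standard way to write this for a linear-chain CRF is the following (a sketch consistent with the description below; the exact notation in the paper may differ):

$$
P(H \mid X) \;=\; \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{i} \lambda_i \, f_i(h_{t-1}, h_t, X, t) \right)
$$

Here $Z(X)$ is the normalizing partition function, which sums the same exponentiated score over all possible label sequences.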
The left-hand side states that we want the probability of the hidden label sequence H conditioned on the observed sequence X (which is just the words or tokens). We build up this probability as a sum over weighted features f for each token t in the sequence, where the lambda values are the feature weights. Features encode common syntactic or lexical properties of a word, such as whether titles and authors are co-located in a citation or whether a word appears in a dictionary of common author names.
The globally optimal label sequence (found using the Viterbi algorithm, which is outside the scope of this post) can be visualized as tracing a path through the "label lattice".
Note that in the above figure, the chosen label path is incorrect and labels Cyril Fonlupt as part of the Title. pi-CASTLE is about selecting tokens that are likely to be incorrect, obtaining their true labels from a human, and propagating those corrections to other nodes in the sequence. This is done using a technique known as constrained inference. For example, let's say we choose to ask for the label of the token Cyril.
Now the inference path is constrained to pass through a particular section of the lattice, since we can treat the human answer as new ground truth. The relational model picks up this correction and propagates it to nearby areas of the lattice. In this way a single token correction can lead to many additional tokens being corrected.
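To make the constrained inference step concrete, here is a minimal sketch in Python/NumPy (not the pi-CASTLE implementation): it assumes log-space emission and transition score matrices and clamps the constrained positions before running a standard Viterbi pass.

```python
import numpy as np

def constrained_viterbi(emissions, transitions, constraints=None):
    """Viterbi decoding for a linear-chain model with optional
    per-position label constraints (e.g. human corrections).

    emissions:   (T, L) array of per-token label scores (log space)
    transitions: (L, L) array of label-to-label scores (log space)
    constraints: dict {position: forced_label_index}
    Returns the highest-scoring label sequence as a list of ints.
    """
    constraints = constraints or {}
    T, L = emissions.shape
    NEG = -1e30  # effectively negative infinity

    # Clamp constrained positions by masking out all competing labels.
    scores = emissions.copy()
    for pos, label in constraints.items():
        scores[pos, :] = NEG
        scores[pos, label] = emissions[pos, label]

    # Standard Viterbi recursion over the (now masked) lattice.
    dp = np.full((T, L), NEG)
    back = np.zeros((T, L), dtype=int)
    dp[0] = scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + transitions  # (prev label, current label)
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + scores[t]

    # Backtrace the best path.
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Clamping a position removes all competing labels at that node, so the best path is forced through the human-provided label, and the transition scores carry that evidence to neighboring tokens.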
Selection
The selection problem in pi-CASTLE is to uncover those nodes in the graphical model with the largest information value, as determined by the amount of correction propagation they would trigger. These nodes are batched and sent to humans for labeling, whereupon inference is applied a second time to propagate additional corrections.
Our paper describes a number of different information functions by which tokens are evaluated for selection. The two ingredients of information for each token are its uncertainty (how unsure the model is of its correct label) and its influence (how much a correction would be propagated to other nodes in the model). The marginal distribution over labels for each token (what's the probability a single word has the label Title or Author?) can be scored using a simple entropy function. Entropy measures whether the distribution is smooth (high entropy) or peaked on a particular value (low entropy), so high entropy can be used as a signal of high uncertainty.
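As a sketch (the exact form in the paper may differ slightly), the entropy score for a hidden node $h_t$ with marginal distribution $p(h_t = y \mid X)$ over the label set $\mathcal{Y}$ is:

$$
\mathrm{Ent}(h_t) \;=\; -\sum_{y \in \mathcal{Y}} p(h_t = y \mid X) \, \log p(h_t = y \mid X)
$$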
The above equation just says to evaluate each hidden node in terms of its marginal entropy.
Another useful information-theoretic concept is Mutual Information (MI), which directly describes how much information observing one random variable gives about another. MI can be written as a comparison of entropies between random variables, and so captures both uncertainty and influence in a single information function. The full MI would have to consider the information between a node and the joint distribution of all other nodes in the sequence, which has exponential run-time. As an approximation, we take the pair-wise MI between a node and its left and right neighbors. Thus we are considering the effect that selecting a node has on its neighbors.
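As a sketch of this pairwise approximation (again, the notation may differ slightly from the paper), the mutual information between a node and its right neighbor is:

$$
I(h_t; h_{t+1}) \;=\; \sum_{y, y'} p(h_t = y, h_{t+1} = y' \mid X) \,\log \frac{p(h_t = y, h_{t+1} = y' \mid X)}{p(h_t = y \mid X)\, p(h_{t+1} = y' \mid X)}
$$

A token's score can then be taken as the sum of its MI with its left and right neighbors, which is large when correcting the token would substantially change our beliefs about the neighboring labels.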
Our full selection algorithm is as follows. Given a collection of token sequences, we run the inference and generate a prediction based on a trained model. We then apply local information functions to each token independently, rank them, and select the top k according to our budget, which is a small fraction of the total problem size. Corrections are applied and the inference algorithm is run one additional time in a constrained form to allow corrections to propagate.
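Putting the pieces together, here is a minimal sketch of one round of the selection loop in Python (not the pi-CASTLE implementation; the `model`, `score_fn`, and `ask_human` interfaces are hypothetical placeholders):

```python
def select_and_correct(sequences, model, score_fn, ask_human, budget):
    """One round of interactive inference (a sketch, not the paper's code).

    sequences: list of token sequences
    model:     hypothetical object with marginals(seq) -> (T, L) array and
               predict(seq, constraints=None) -> label sequence
    score_fn:  information function mapping marginals to per-token scores
               (e.g. marginal entropy or pairwise mutual information)
    ask_human: callable (seq, token_index) -> true label, standing in for
               the human annotation step
    budget:    number of tokens we can afford to have labeled
    """
    # 1. Run inference once and score every token with the information function.
    scored = []
    for s_idx, seq in enumerate(sequences):
        marginals = model.marginals(seq)
        for t_idx, score in enumerate(score_fn(marginals)):
            scored.append((score, s_idx, t_idx))

    # 2. Rank all tokens globally and take the top k within the budget.
    scored.sort(reverse=True)
    selected = scored[:budget]

    # 3. Collect human labels for the selected tokens.
    corrections = {}
    for _, s_idx, t_idx in selected:
        corrections.setdefault(s_idx, {})[t_idx] = ask_human(sequences[s_idx], t_idx)

    # 4. Re-run inference with the corrections as constraints so they propagate.
    return [model.predict(seq, constraints=corrections.get(s_idx))
            for s_idx, seq in enumerate(sequences)]
```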
Our paper includes an additional optimization to reduce redundancy across sequences through clustering. Since selections are performed in batch, we can cluster tokens we expect to share a label and apply a correction to all tokens in the cluster. We incorporate cluster significance into our information functions for entropy and mutual information. Clustering combined with correction propagation provides many ways in which a single token correction can have far greater influence on the rest of the problem.
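A minimal sketch of the cluster propagation idea (the cluster assignments themselves would come from a separate clustering step over token features; the names here are hypothetical):

```python
from collections import defaultdict

def propagate_to_clusters(corrections, cluster_of):
    """Spread each human correction to every token in the same cluster.

    corrections: {(seq_idx, tok_idx): label} gathered from the human round
    cluster_of:  {(seq_idx, tok_idx): cluster_id} for all tokens
    """
    # Group tokens by cluster id.
    members = defaultdict(list)
    for token, cid in cluster_of.items():
        members[cid].append(token)

    # Copy each correction onto its cluster-mates
    # (tokens the human labeled directly keep their own label).
    expanded = dict(corrections)
    for token, label in corrections.items():
        for other in members[cluster_of[token]]:
            expanded.setdefault(other, label)
    return expanded
```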
Experiments
The main setting where we find interactive inference to be particularly useful is knowledge transfer: a machine learning classifier is trained on data from one domain and then applied to a different one, making it easier to reuse in new applications. Our two experimental tasks are bibliographic extraction and named entity recognition (NER). The bibliographic extraction problem aims to label citations with semantic information such as the Title, Author, Conference, etc. NER processes full sentences to extract words referring to people, locations, and organizations.
We generated instances of knowledge transfer for both applications. The bibliographic labeling task was trained on the CORA dataset, which consists of papers from a wide range of scientific fields. Our test set was the DBLP dataset, which comprises primarily computer science papers. The NER task used a classifier pre-trained on newswire data (from the CoNLL 2003 shared task) and was tested on a set of Twitter data.
The figures below show experiments comparing entropy, mutual information, and a cluster-size baseline for both tasks. The x-axis represents the number of corrections made by the user and the y-axis shows the resulting F1 (technically micro-averaged F1; see the paper for more details). The graphs show how much performance improves as we apply corrections and that, for mutual information in particular, the most valuable examples are corrected first. The effect is stronger in the bibliographic labeling task because the relative positions of tokens carry more signal there.
Future Work
Our work showed that small, guided human corrections can have a potentially large impact on reducing errors in information extraction inference. We're excited to be extending this work in a number of novel ways. We are currently investigating generalizations that handle graphs of arbitrary shape rather than just linear chains. We're applying this to important problems in Knowledge Base completion, where fact-based Knowledge Bases can be automatically extended to include implicit information. This represents another problem where interactive human computation can play an outsized role in improving machine learning inference.