The Text Analysis Conference (TAC) is a series of evaluation workshops organized by NIST to encourage research in Natural Language Processing and related applications. TAC focuses on Knowledge Base Population (KBP): automated systems that discover information about entities in a large corpus and incorporate it into a knowledge base. The TAC-KBP evaluation is composed of a number of tracks, each focused on a specific aspect of the problem, including entity discovery and linking, event extraction, cold start relation extraction, slot filling, and slot filler validation. The DSR@UF lab participated in the 2015 Slot Filler Validation – Ensemble (SFV-ensemble) track. The goal of this track is to refine the relation extraction output of both the cold start and slot filling tracks by combining the extractions and/or applying more intensive linguistic processing. We approached this as a knowledge fusion problem and proposed a semi-supervised application of Consensus Maximization, "Consensus Maximization Fusion of Probabilistic Information Extractors", which combines a number of supervised and unsupervised ensemble models to promote high-quality facts and eliminate incorrect ones from the aggregate. Our system achieved a higher F1 than all participating relation extractors and all other systems submitted to the SFV task. Below we motivate the problem and briefly explain our method.
Information extractors (IEs) are used to construct or expand Knowledge Bases (KBs). However, information extraction pipelines are not perfectly accurate, and different models exhibit different strengths and weaknesses. Exploiting complementary and contradictory extractions while merging the output of multiple IE pipelines can therefore improve the quality of the resulting KB. An ideal combination removes erroneous extractions and promotes correct ones, adjusting their confidence. This process is known as Knowledge Fusion, and it is not trivial. For instance, more does not mean better: a majority vote among extractors that does not distinguish the merit of each may perform poorly if all the extractors are weak, or if they all make similar mistakes. On the other hand, supervised methods such as stacking achieve better performance by learning a weight for each extractor and combining them as a weighted sum, but this approach suffers from the difficulty of obtaining training data, resulting in high precision but low recall over all facts.
Formally, the SFV task takes two inputs: a set of queries of the form (Subject, Relation), and a set of answers to those queries produced by multiple slot fillers (relation extractors). We use the queries to cluster the answers for the same (Subject, Relation) pair. From this clustering we also derive an evaluation dataset, consisting of all distinct triples (subject, relation, object) extracted by any system. For instance, the following table shows some of the answers for the query (Facebook, org:city_of_headquarters) along with their extraction system id and confidence score. The triples this example cluster contributes to the evaluation dataset are (Facebook, org:city_of_headquarters, Menlo Park), (Facebook, org:city_of_headquarters, San Francisco), (Facebook, org:city_of_headquarters, Chinatown) and (Facebook, org:city_of_headquarters, California).
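The clustering and evaluation-dataset construction above can be sketched in a few lines of Python. The answer records below are hypothetical (system ids and confidence scores are invented for illustration); only the query and the candidate cities come from the example.

```python
from collections import defaultdict

# Hypothetical answer records: (system_id, subject, relation, object, confidence).
answers = [
    ("sys_a", "Facebook", "org:city_of_headquarters", "Menlo Park",    0.95),
    ("sys_b", "Facebook", "org:city_of_headquarters", "Menlo Park",    0.80),
    ("sys_b", "Facebook", "org:city_of_headquarters", "San Francisco", 0.40),
    ("sys_c", "Facebook", "org:city_of_headquarters", "Chinatown",     0.25),
    ("sys_c", "Facebook", "org:city_of_headquarters", "California",    0.30),
]

# Cluster all answers by their (Subject, Relation) query.
clusters = defaultdict(list)
for system_id, subj, rel, obj, conf in answers:
    clusters[(subj, rel)].append((system_id, obj, conf))

# The evaluation dataset is the set of distinct triples across all systems.
evaluation_dataset = sorted({(subj, rel, obj)
                             for (subj, rel), hits in clusters.items()
                             for _, obj, _ in hits})
```

Note that duplicate answers from different systems (here, Menlo Park from sys_a and sys_b) collapse into a single triple in the evaluation dataset, while their per-system provenance is preserved in the clusters.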
As mentioned before, majority voting and stacking have complementary strengths. We propose a cross-model ensemble method that leverages both: Consensus Maximization (CM), an ensemble method that combines the output of supervised classification models and unsupervised clusterings. First, using queries, answers and ground truth from previous evaluations, we train a number of stacked ensembles that differ from each other in the feature vector and classification model used. We then use the trained models to assign labels (Yes/No) to the evaluation dataset. This stacked ensemble approach was introduced to the SFV task by Viswanathan et al. Its biggest weakness is that the set of participating systems changes from one evaluation to the next: systems that did not take part in previous evaluations contribute no training data, so stacking ignores them. We treat each such unseen system as a clustering that divides the evaluation dataset into its own extractions (Yes) and all other extractions (No). Using these outputs, we build the CM bipartite graph between the elements of the evaluation dataset (O) and the groups formed by the stacked models and clusters (G), as shown in the following figure. We then solve the constrained optimization problem defined by CM to obtain aggregated labels for the evaluation dataset. Finally, we apply some functional constraints: each relation has a specific multiplicity, e.g. a person can only be born in one city/country, while an organization can have multiple subsidiaries. These constraints are given by the problem definition.
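The CM optimization over the bipartite graph can be sketched as alternating block-coordinate updates, following the usual graph-based consensus maximization formulation (Gao et al.): objects take the average label vector of the groups they belong to, and groups take the average of their member objects, pulled toward their initial labels by a penalty weight. This is a minimal sketch, not our exact implementation; the function name, parameters, and assumption that every object belongs to at least one group (and every group has at least one object) are ours.

```python
import numpy as np

def consensus_maximization(A, Y, labeled, alpha=2.0, iters=50):
    """A: n_objects x n_groups 0/1 membership matrix (object i is in group j).
    Y: n_groups x 2 initial (Yes, No) label vectors, from stacked models
       and from the per-system clusters.
    labeled: per-group weights (nonzero for groups whose initial labels
       should anchor the solution, e.g. supervised stacking outputs).
    Returns an n_objects x 2 matrix of consensus label probabilities."""
    n, g = A.shape
    Q = Y.astype(float).copy()          # group label estimates
    U = np.full((n, 2), 0.5)            # object label estimates
    for _ in range(iters):
        # Objects average the current labels of the groups containing them.
        U = (A @ Q) / A.sum(axis=1, keepdims=True)
        # Groups average their member objects, regularized toward Y.
        denom = A.sum(axis=0)[:, None] + alpha * labeled[:, None]
        Q = (A.T @ U + alpha * labeled[:, None] * Y) / denom
    return U
```

On a toy graph with two conflicting groups, an object claimed only by the "Yes" group converges toward Yes, one claimed only by the "No" group toward No, and one claimed by both stays undecided. The final Yes probabilities are what the subsequent functional-constraint step (e.g. keeping only the top answer for single-valued relations) would operate on.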
We submitted three runs in the 2015 evaluation that differ in the training data used for the supervised portion of the final ensemble: 2013-only, 2014-only, and both. Our results are shown in the following table. The main idea behind CM Fusion is to take into account the answers from potentially well-ranked extractors that stacking meta-classifiers omit due to lack of training data. CM Fusion outperforms both approaches in terms of F1 by greatly increasing recall while maintaining high precision.