The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text annotations in a database. However, the output of current state-of-the-art structured prediction methods is likely to contain errors, so it is important to be able to manage the overall uncertainty of the database. At the same time, the advent of crowdsourcing has enabled humans to aid machine algorithms at scale. In this article, we introduce pi-CASTLE, a system that optimizes and integrates human and machine computing as applied to a complex structured prediction problem involving Conditional Random Fields (CRFs). We propose strategies grounded in information theory to select a token subset, formulate questions for the crowd to label, and integrate these labelings back into the database using a method of constrained inference. On both a text segmentation task over academic citations and a named entity recognition task over tweets, we show an order of magnitude improvement in accuracy gain over baseline methods.
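To make the idea of information-theoretic token selection concrete, the following is a minimal sketch and not the paper's actual method. It assumes that per-token marginal label distributions from a CRF are already available, and it ranks tokens by Shannon entropy so that the most uncertain ones would be sent to the crowd first. The function names `token_entropy` and `select_tokens`, the label set, and the probability values are all hypothetical and chosen only for illustration.

```python
import math


def token_entropy(marginal):
    """Shannon entropy (in bits) of one token's label distribution."""
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def select_tokens(marginals, k):
    """Return the indices of the k tokens with the highest marginal entropy."""
    ranked = sorted(
        range(len(marginals)),
        key=lambda i: token_entropy(marginals[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-token marginals for a four-token citation fragment.
# In a real system these would come from CRF inference, not be hard-coded.
marginals = [
    {"AUTHOR": 0.98, "TITLE": 0.01, "VENUE": 0.01},  # confident
    {"AUTHOR": 0.50, "TITLE": 0.50, "VENUE": 0.00},  # two-way ambiguity
    {"AUTHOR": 0.05, "TITLE": 0.90, "VENUE": 0.05},  # fairly confident
    {"AUTHOR": 0.34, "TITLE": 0.33, "VENUE": 0.33},  # near uniform
]

# Pick the two most uncertain tokens as candidates for crowd questions.
selected = select_tokens(marginals, 2)
print(selected)
```

Per-token entropy is only one possible uncertainty measure. It ignores the dependencies between neighboring labels that a CRF models, so a selection strategy for structured prediction would likely need to account for how one answer propagates to other tokens.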
Authors:
Sean Goldberg, Daisy Zhe Wang, Christan Grant
Bibtex:
@article{goldberg2017probabilistically,
  title={A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction},
  author={Goldberg, Sean and Wang, Daisy Zhe and Grant, Christan},
  journal={Journal of Data and Information Quality (JDIQ)},
  volume={8},
  number={2},
  pages={10},
  year={2017},
  publisher={ACM}
}
Download:
[paper]