We consider the problem of extracting text instances of predefined categories (e.g. city and person) from the Web. Instances of a category may be scattered across thousands of independent sources in many different formats and with considerable noise, which makes open-domain information extraction a challenging problem. Learning syntactic rules such as “cities such as _” or “_ is a city” in a semi-supervised manner from a few labeled examples is usually unreliable, because 1) high-quality syntactic rules are rare and 2) the learning task is usually underconstrained. To address these problems, we propose to learn multimodal rules that overcome the limitations of purely syntactic rules. The multimodal rules are learned from information sources of different modalities, motivated by the intuition that information that is difficult to disambiguate in one modality may be easy to recognize in another.

To demonstrate the effectiveness of this method, we have built an end-to-end multimodal information extraction system that takes unannotated raw web pages as input and produces a set of extracted instances (e.g. Boston is an instance of city) as output. More specifically, our system learns reliable relationships between multimodal information through multimodal relation analysis over large amounts of unstructured data. Based on the learned relationships, we then train a set of multimodal rules for information extraction. Experimental evaluation shows that multimodal learning achieves greater accuracy for information extraction. The overall algorithm consists of three stages, described below.
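For concreteness, the following is a minimal Python sketch of the kind of syntactic rules mentioned above (e.g. “cities such as _” and “_ is a city”); the regular expressions and the candidate-extraction logic are illustrative assumptions, not the system's actual implementation.

    import re

    # Illustrative syntactic patterns for the "city" category; these patterns
    # and the extraction logic are assumptions made for this sketch only.
    SYNTACTIC_PATTERNS = {
        "city": [
            re.compile(r"cities such as ([A-Z][a-zA-Z]+)"),
            re.compile(r"([A-Z][a-zA-Z]+) is a city"),
        ],
    }

    def extract_candidates(text, category):
        """Return candidate instances of `category` matched by any syntactic rule."""
        candidates = set()
        for pattern in SYNTACTIC_PATTERNS.get(category, []):
            for match in pattern.finditer(text):
                candidates.add(match.group(1))
        return candidates

    # Example: extract_candidates("We visited cities such as Boston.", "city")
    # returns {'Boston'}.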
Stage 1: Multimodal Relation Analysis
To enable multimodal learning, we first learn the relationships between concepts in the text and image modalities. This stage produces a set of related visual concepts for each predefined text category, which will be used to develop multimodal classification rules in the next stage.
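A minimal sketch of how such text-image relationships could be estimated, assuming each web page has already been reduced to a pair of (category mentions found in its text, visual concepts detected in its images); the PMI-style scoring used here is an assumption for illustration, not necessarily the scoring used in the paper.

    import math
    from collections import Counter

    def related_visual_concepts(pages, category, top_k=5):
        """Rank visual concepts by a PMI-style association with `category`.

        `pages` is an iterable of (mentions, concepts) pairs, where `mentions`
        is the set of category names found in a page's text and `concepts` is
        the set of visual concepts detected in its images (both assumed inputs).
        """
        n = 0
        cat_count = 0
        concept_counts = Counter()
        joint_counts = Counter()
        for mentions, concepts in pages:
            n += 1
            has_cat = category in mentions
            cat_count += has_cat
            for c in set(concepts):
                concept_counts[c] += 1
                if has_cat:
                    joint_counts[c] += 1
        # PMI(category, concept) = log( p(cat, c) / (p(cat) * p(c)) )
        scores = {
            c: math.log((joint_counts[c] * n) / (cat_count * concept_counts[c]))
            for c in joint_counts
        }
        return sorted(scores, key=scores.get, reverse=True)[:top_k]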
Stage 2: Learning Multimodal Rules
The confidence score of an instance is calculated from the multimodal rules that it matches. A multimodal rule defines how an instance is matched and what confidence score the instance receives when the rule matches. This stage generates a set of useful multimodal rules for information extraction.
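As an illustration, a multimodal rule can be represented along the following lines, pairing a syntactic context with a related visual concept from Stage 1; the field names and the matching criterion shown here are assumptions for this sketch.

    from dataclasses import dataclass

    @dataclass
    class MultimodalRule:
        category: str        # target category, e.g. "city"
        text_pattern: str    # syntactic context, e.g. "cities such as {}"
        visual_concept: str  # related visual concept from Stage 1, e.g. "skyline"
        confidence: float    # score contributed when this rule matches

        def matches(self, instance, page_text, page_concepts):
            """An instance matches if the page contains the instance in the rule's
            textual context and also exhibits the rule's visual concept."""
            return (self.text_pattern.format(instance) in page_text
                    and self.visual_concept in page_concepts)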
Stage 3: Multimodal Information Extraction
In the final stage, we apply the learned multimodal rules to extract information from real-world data.
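The final step can then be sketched as follows, reusing the MultimodalRule sketch above; summing the confidences of all matched rules and thresholding the total is an illustrative aggregation, not necessarily the paper's exact scoring.

    from collections import defaultdict

    def extract_instances(candidates, pages, rules, threshold=1.0):
        """Return (instance, category) pairs whose aggregated confidence over
        all matched multimodal rules reaches `threshold`.

        `pages` is a list of (page_text, page_concepts) pairs (assumed input).
        """
        scores = defaultdict(float)
        for instance in candidates:
            for page_text, page_concepts in pages:
                for rule in rules:
                    if rule.matches(instance, page_text, page_concepts):
                        scores[(instance, rule.category)] += rule.confidence
        return {pair for pair, score in scores.items() if score >= threshold}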
For more details, please see our paper (Gong et al., ACM MM 2017).