We consider the problem of automatically extracting visual objects from web images. Despite the extraordinary advances in deep learning, visual object detection remains a challenging task. To overcome the limitations of purely visual techniques, we propose to exploit the meta text surrounding images on the Web for improved detection accuracy. In this work, we present a multimodal learning algorithm that integrates text information into visual knowledge extraction, and we develop a system that takes raw webpages and a small set of training images from ImageNet as inputs and automatically extracts visual knowledge. Experimental results on 46 object categories show that extraction precision improves significantly from 73% (with state-of-the-art deep learning models) to 81%, which is equivalent to a 31% reduction in error rate.
Multimodal Embeddings
Our algorithm is closely related to the skip-gram model, which is trained to learn word embeddings by maximizing the following objective function:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t),$$

where $w_1, w_2, \ldots, w_T$ is the sequence of training words in the corpus and $c$ is the size of the window around the target word $w_t$. We extend this skip-gram model to a multimodal corpus in order to learn vector embeddings for both text words and image concepts, such that objects with similar semantic meanings are also close to each other in the embedding space.
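As a concrete illustration of this extension, the following minimal sketch trains skip-gram embeddings over a multimodal corpus in which image concepts appear as pseudo-words alongside text. The use of gensim, the pseudo-word naming (e.g., IMG:dog), and the toy corpus are illustrative assumptions rather than details of the actual system.

```python
# A minimal sketch of learning multimodal embeddings with a skip-gram model.
# Assumptions (not from the paper): gensim's Word2Vec (v4.x API) is used, and
# image concepts are injected into the text stream as pseudo-words such as
# "IMG:dog" so that words and image concepts share one embedding space.
from gensim.models import Word2Vec

# Each training "sentence" interleaves surrounding text words with the
# pseudo-word of the image concept found in the same page region.
multimodal_corpus = [
    ["golden", "retriever", "playing", "fetch", "IMG:dog"],
    ["tabby", "cat", "sleeping", "on", "sofa", "IMG:cat"],
    ["vintage", "sports", "car", "for", "sale", "IMG:car"],
]

model = Word2Vec(
    sentences=multimodal_corpus,
    vector_size=300,   # embedding dimensionality
    window=5,          # context window size c
    sg=1,              # use the skip-gram objective
    min_count=1,
)

# Words with similar semantics should lie close to the image concept.
print(model.wv.most_similar("IMG:dog", topn=5))
```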
Structure Learning and Prediction
Given candidate image objects along with text words describing these objects, our goal is to predict a confidence score that each image object belongs to one of a set of predefined image categories. Mathematically, we model the probability that an image $I_n$ contains objects of category $c$ with a logistic regression model over $W_n$, the set of multimodal words describing $I_n$. The model parameters are learned by maximizing a regularized objective function.
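The following sketch illustrates the scoring step under simplifying assumptions that are not taken from the paper: each candidate image is represented by the mean embedding of its multimodal words $W_n$, and an off-the-shelf L2-regularized logistic regression stands in for the regularized objective described above.

```python
# A hedged sketch of the per-category scoring step. Assumptions (not from the
# paper): images are represented by mean-pooled embeddings of their multimodal
# words, and sklearn's L2-regularized logistic regression replaces the paper's
# exact regularized objective.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy embedding table standing in for the multimodal embeddings learned above.
wv = {w: rng.normal(size=50) for w in
      ["golden", "retriever", "IMG:dog", "tabby", "cat", "IMG:cat", "puppy"]}

def image_features(words, wv, dim=50):
    """Mean-pool the embeddings of the multimodal words W_n describing an image."""
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy training data: (W_n, label) pairs, where label = 1 if the image
# belongs to the target category c and 0 otherwise.
train = [(["golden", "retriever", "IMG:dog"], 1),
         (["tabby", "cat", "IMG:cat"], 0)]

X = np.stack([image_features(w, wv) for w, _ in train])
y = np.array([label for _, label in train])

# C is the inverse of the L2 regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# Confidence score that a new candidate image contains category c.
score = clf.predict_proba(image_features(["puppy", "IMG:dog"], wv).reshape(1, -1))[:, 1]
print(score)
```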
Experiments and Results
We evaluate our approach on a collection of web pages and images derived from the Common Crawl dataset, which is publicly available on Amazon S3. The data is processed to extract image objects along with text tags, resulting in around 10 million tagged images for our study. Table 1 shows some example documents.
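For illustration, a minimal sketch of pairing images with surrounding text tags on a single webpage is shown below; the actual Common Crawl pipeline is more involved, and the tag heuristics here (alt text plus the enclosing element's text) are assumptions made for this example.

```python
# An illustrative sketch of pairing images with surrounding text on a web
# page. The helpers and tag choices are assumptions for illustration, not the
# paper's actual Common Crawl processing pipeline.
from bs4 import BeautifulSoup

def extract_tagged_images(html):
    """Return (image_url, text_tags) pairs from one raw webpage."""
    soup = BeautifulSoup(html, "html.parser")
    tagged = []
    for img in soup.find_all("img"):
        tags = []
        # Use the alt text and the text of the enclosing element as tags.
        if img.get("alt"):
            tags.extend(img["alt"].lower().split())
        if img.parent is not None:
            tags.extend(img.parent.get_text(" ", strip=True).lower().split())
        if img.get("src"):
            tagged.append((img["src"], tags))
    return tagged

html = '<div><img src="dog.jpg" alt="golden retriever"><p>A dog playing fetch</p></div>'
print(extract_tagged_images(html))
```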
Quantitative evaluation on 46 image categories shows that, on average, the multimodal approach improves image prediction precision by 8.48 points, from 72.95% to 81.43%, which is equivalent to a 31% reduction in error rate. To examine the effectiveness intuitively, we visualize extracted examples in Table 3. From these examples, we observe that the baseline Uni. approach extracts objects with the highest visual detection score (1st row), while the proposed Mul. approach leverages both text and visual information (2nd row). We also observe that the text descriptions of images retrieved with Mul. (2nd row) are more consistent with the visual objects in the images. The second image in the first row is a false-positive extraction, which illustrates the unreliability of algorithms that rely on a single source of information.
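As a sanity check, the relative error reduction implied by the reported precision figures can be recomputed directly:

```python
# Quick check of the reported relative error reduction from the precision
# figures above (72.95% -> 81.43%).
base_prec, mul_prec = 0.7295, 0.8143
base_err, mul_err = 1 - base_prec, 1 - mul_prec
reduction = (base_err - mul_err) / base_err
print(f"error rate: {base_err:.4f} -> {mul_err:.4f} "
      f"(relative reduction {reduction:.1%})")   # ~31%
```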
For more details, please see our paper (Gong et al., IJCAI 2017).