
Extracting Visual Knowledge from the Web with Multimodal Learning

May 26, 2017 · publications, research directions

We consider the problem of automatically extracting visual objects from web images. Despite extraordinary advances in deep learning, visual object detection remains a challenging task. To overcome the deficiencies of purely visual techniques, we propose to make use of the meta text surrounding images on the Web for enhanced detection accuracy. In this work we present a multimodal learning algorithm that integrates text information into visual knowledge extraction. We developed a system that takes raw webpages and a small set of training images from ImageNet as input and automatically extracts visual knowledge. Experimental results on 46 object categories show that extraction precision improves significantly, from 73% (with state-of-the-art deep learning programs) to 81%, which is equivalent to a 31% reduction in error rate.

 

Multimodal Embeddings

Our algorithm is closely related to the skip-gram model, which is trained to learn word embeddings by maximizing the following objective function:

\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p\left(w_{t+j} \mid w_t\right)
\]

where w1, w2, …, wT is the sequence of training words in the corpus, and c is the size of the window around the target word wt. We extend this skip-gram model to a multimodal corpus to learn vector embeddings for both text words and image concepts, such that objects with similar semantic meanings are also close to each other in the embedding space.
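As a minimal sketch of this idea, the snippet below runs an off-the-shelf skip-gram implementation (gensim's Word2Vec) over a toy "multimodal corpus" in which image concepts appear as special tokens interleaved with the surrounding text; the IMG: prefix and the example sentences are purely illustrative, not the paper's actual data or code.

```python
# Sketch: learn joint embeddings for text words and image concepts by running
# a standard skip-gram model over a corpus where detected image concepts are
# inserted as special tokens next to their surrounding text (illustrative only).
from gensim.models import Word2Vec

multimodal_corpus = [
    ["a", "golden", "retriever", "playing", "fetch", "IMG:dog"],
    ["vintage", "sports", "car", "at", "the", "show", "IMG:car"],
    ["a", "tabby", "cat", "sleeping", "on", "the", "couch", "IMG:cat"],
]

model = Word2Vec(
    sentences=multimodal_corpus,
    vector_size=100,   # embedding dimension
    window=5,          # context window size c around the target token
    sg=1,              # skip-gram objective
    min_count=1,
    negative=5,        # negative sampling instead of the full softmax
)

# Words and image concepts now share one embedding space, so semantically
# related tokens should end up close to each other.
print(model.wv.most_similar("IMG:dog", topn=3))
```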

Structure Learning and Prediction

Given candidate image objects along with text words describing these objects, our goal is to predict a confidence score indicating whether the image objects belong to predefined image categories. Mathematically, we model the probability that an image In contains objects of category c with a logistic regression model:

where Wn is a set of multimodal words describing image In. To learn the model parameters, we maximize the following regularized objective function:

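The exact forms of the logistic model and the regularized objective are given in the paper. As a rough illustration of the prediction step only, the sketch below pools the multimodal embeddings of the words Wn describing an image and passes the result through a per-category sigmoid; the mean-pooling choice and all parameter names here are assumptions made for illustration, not the paper's exact formulation.

```python
# Illustrative scoring of one image against one category, assuming the
# multimodal word embeddings have already been learned. Parameters would be
# fit by maximizing a regularized log-likelihood over labeled training images.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def category_score(word_vectors, theta_c, bias_c):
    """Confidence that an image belongs to category c, computed from the
    embeddings of the multimodal words W_n describing it (illustrative)."""
    x = np.mean(word_vectors, axis=0)            # pool the word/concept embeddings
    return sigmoid(np.dot(theta_c, x) + bias_c)  # per-category logistic output

# Toy usage: three 100-d embeddings for the words describing one image,
# and randomly initialized parameters for one category.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(3, 100))
theta_c, bias_c = rng.normal(size=100), 0.0
print(category_score(word_vectors, theta_c, bias_c))
```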
Experiments and Results

We evaluate our approach on a collection of web pages and images derived from the Common Crawl dataset, which is publicly available on Amazon S3. The data is processed to extract image objects along with text tags, resulting in around 10 million tagged images for our study. Table 1 shows some example documents.
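A rough sketch of this kind of preprocessing, assuming warcio and BeautifulSoup are used to walk a Common Crawl WARC file and pair image URLs with their alt text and surrounding text; the paper's actual pipeline may differ.

```python
# Illustrative sketch: pair image URLs with nearby text tags from a
# Common Crawl WARC file (not the paper's actual extraction code).
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def tagged_images(warc_path):
    """Yield (image_url, text_tags) pairs extracted from HTML responses."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read()
            soup = BeautifulSoup(html, "html.parser")
            for img in soup.find_all("img"):
                src = img.get("src")
                if not src:
                    continue
                # Use alt text plus the enclosing element's text as "meta text".
                alt = img.get("alt", "")
                context = img.parent.get_text(" ", strip=True) if img.parent else ""
                yield src, (alt + " " + context).strip()

# Example usage (the path below is a placeholder):
# for url, tags in tagged_images("CC-MAIN-example.warc.gz"):
#     print(url, tags[:80])
```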

Quantitative evaluation on 46 image categories shows that, on average, the multimodal approach improves image prediction precision by 8.48 points, from 72.95% to 81.43%; equivalently, the error rate drops from 27.05% to 18.57%, a relative reduction of about 31%. To examine the effectiveness intuitively, we visualize extracted examples in Table 3. From these examples, we see that the baseline Uni. approach extracts objects with the highest visual detection score (1st row), while the proposed Mul. approach leverages both text and visual information (2nd row). We also observe that the text descriptions for images retrieved with Mul. (2nd row) are more consistent with the visual objects in the images. The second image in the first row is a false-positive extraction, which also illustrates the unreliability of algorithms that rely on a single source of information.

 

For more details, please see our paper (Gong et al., IJCAI 2017).
