We consider the problem of automatically extracting visual objects from web images. Despite extraordinary advances in deep learning, visual object detection remains a challenging task. To overcome the limitations of purely visual techniques, we propose to exploit the meta text surrounding images on the Web for improved detection accuracy. In this work we present a multimodal learning algorithm that integrates text information into visual knowledge extraction. We developed a system that takes raw webpages and a small set of training images from ImageNet as inputs and automatically extracts visual knowledge. Experimental results on 46 object categories show that extraction precision improves significantly, from 73% (with state-of-the-art deep learning methods) to 81%, equivalent to a 31% reduction in error rate.
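The abstract describes combining visual detector confidence with evidence from the meta text surrounding an image. The paper's actual fusion algorithm is not given here; the following is a minimal hypothetical sketch of one simple late-fusion scheme (the keyword-matching text score, the `alpha` weight, and all function names are illustrative assumptions, not the authors' method).

```python
# Hypothetical late-fusion sketch (NOT the paper's exact algorithm):
# combine a visual detector's confidence with a score derived from the
# webpage text surrounding the image.

def text_score(surrounding_text, category_keywords):
    """Fraction of category keywords found in the text near the image."""
    words = set(surrounding_text.lower().split())
    hits = sum(1 for kw in category_keywords if kw in words)
    return hits / len(category_keywords)

def fused_score(visual_confidence, surrounding_text, category_keywords,
                alpha=0.7):
    """Weighted sum of visual and textual evidence (alpha is an assumption)."""
    return alpha * visual_confidence + (1 - alpha) * text_score(
        surrounding_text, category_keywords)

# A moderately confident visual detection backed by matching meta text
# scores higher than the visual evidence alone.
score = fused_score(0.6, "photo of a golden retriever dog in the park",
                    ["dog", "retriever"])
print(round(score, 2))  # 0.7*0.6 + 0.3*1.0 = 0.72
```

A real system would replace the keyword match with a learned text model, but the sketch illustrates how textual context can raise (or lower) a borderline visual detection.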
Authors: Dihong Gong, Daisy Zhe Wang
Bibtex:
@inproceedings{gongextracting,
  title     = {Extracting Visual Knowledge from the Web with Multimodal Learning},
  author    = {Gong, Dihong and Wang, Daisy Zhe},
  booktitle = {Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, {IJCAI} 2017, Melbourne, Australia, August 19-25, 2017},
  pages     = {1718--1724},
  year      = {2017},
}
Download:
[pdf]