Learning to Search for Images without Annotations
Advisor(s): Arnold W.M. Smeulders (promotor), Cees G.M. Snoek (co-promotor).
This thesis contributes to teaching machines what is in an image without relying on direct manual annotation as training data. We rely either on tagged data from social media platforms to recognize concepts, or on object semantics and layout to recognize scenes. We focus our effort on image search.
We first demonstrate that concept detectors can be learned using tagged examples from social media platforms. We show that using tagged images and videos directly as ground truth for learning can be problematic because of the noisy nature of tags. To address this, based on extensive experimental analysis, we recommend calculating the relevance of tags and selecting only relevant positive and relevant negative examples for learning. In addition, we present four best practices, which led to a winning entry in the TRECVID 2013 benchmark for the semantic indexing with no annotations task. Following the finding that important concepts rarely appear as tags on social media platforms, we propose using semantic knowledge from an ontology to improve the calculation of tag relevance and to enrich the training data for learning concept detectors of rare tags.
When searching for images of a particular scene, we show that object classifiers, rather than annotated scene images, can recognize scenes reasonably well. We exploit 15,000 object classifiers trained with a convolutional neural network. Since not all objects contribute equally to describing a scene, we show that pooling only the 100 most prominent object classifiers per image is sufficient to recognize its scene. Furthermore, we go to the extreme of recognizing scenes after removing all object identities. We refer to the image positions most likely to contain objects as things. We show that the ensemble of properties of things (size, position, aspect ratio and prominent color), and those alone, can discriminate scenes. The benefit of removing all object identities is that we also eliminate the learning of object classifiers from the process, and thus demonstrate that scenes can be recognized with no learning at all.
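The selection of the most prominent object classifiers per image can be illustrated with a minimal sketch. It assumes each image is already described by a list of object-classifier scores and that "most prominent" means highest score; the function name `keep_top_k`, the zeroing of the remaining scores, and the toy values are illustrative assumptions, not the thesis implementation.

```python
import heapq


def keep_top_k(scores, k=100):
    """Keep the k highest object-classifier scores of one image, zero the rest.

    `scores` is a list of floats, one per object classifier (hypothetical
    input format). The output has the same length as the input.
    """
    if k <= 0:
        return [0.0] * len(scores)
    # Indices of the k largest scores; if k exceeds len(scores), all are kept.
    top_indices = set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))
    return [s if i in top_indices else 0.0 for i, s in enumerate(scores)]


# Toy example with 5 object classifiers instead of 15,000, and k=2 instead of 100.
scores = [0.10, 0.70, 0.05, 0.90, 0.30]
sparse = keep_top_k(scores, k=2)
```

The resulting sparse vector could then serve as the image representation for a scene recognizer; how the thesis combines or pools these scores beyond this selection step is not specified here.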
Overall, this thesis presents alternative ways to learn which concept is in an image or which scene it belongs to, without using manually annotated data, for the goal of image search. It investigates new approaches for teaching machines to recognize the visually depicted environment captured in images, while dispensing with the annotation process.
Intelligent Sensory Information Systems group
The world is full of digital images and videos. In this deluge of visual information, the grand challenge is to unlock its content. This quest is the central research aim of the Intelligent Sensory Information Systems group. We address the complete knowledge chain of image and video retrieval by machine and human. Topics of study are semantic understanding, image and video mining, interactive picture analytics, and scalability. Our research strives for automation that matches human visual cognition, interaction surpassing man and machine intelligence, visualization blending it all in interfaces giving instant insight, and database architectures for extremely large visual collections. Our research culminates in state-of-the-art image and video search engines, which we evaluate in leading benchmarks (often as the best performer), in user studies, and in challenging applications.