Storytelling Machines for Video Search
Supervisors: Arnold W.M. Smeulders (promotor), Cees G.M. Snoek (co-promotor).
This thesis studies the fundamental question: what vocabulary of concepts is suited for machines to describe video content? Answering this question involves two annotation steps: first, specifying a list of concepts by which videos are described; second, labeling a set of videos per concept as its examples or counterexamples. The vocabulary is then constructed as a set of video concept detectors, learned from the provided annotations by supervised learning.
Starting from handcrafting the vocabulary by manual annotation, we gradually automate vocabulary construction by concept composition, and by learning from human stories. As a case study, we focus on vocabularies for describing events, such as marriage proposal, graduation ceremony, and changing a vehicle tire, in videos.
As the first step, we rely on an extensive pool of manually specified concepts to study the question: what are the best practices for handcrafting the vocabulary? From our analysis, we conclude that the vocabulary should encompass thousands of concepts of various types, including object, action, scene, people, animal, and attribute. Moreover, the vocabulary should include detectors for both generic and specific concepts, trained and normalized in an appropriate way.
We alleviate the manual labor of vocabulary construction by addressing the next research question: can a machine learn novel concepts by composition? We propose an algorithm that learns new concepts by composing ground concepts with Boolean logic connectives, e.g. “ride-AND-bike”. We demonstrate that concept composition is an effective way to infer the annotations needed for training new concept detectors, without additional human annotation.
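The idea of inferring annotations by composition can be illustrated with a minimal sketch. It is not the algorithm proposed in the thesis, which also has to decide which compositions are worth learning; it only shows how labels for a composite concept could follow from existing ground-concept labels without new human annotation. The label values and the helper names `compose_and` and `compose_or` are hypothetical.

```python
from typing import Dict, List

# Hypothetical binary annotations for five videos
# (1 = concept present, 0 = concept absent). Illustrative values only.
ground_labels: Dict[str, List[int]] = {
    "ride": [1, 1, 0, 0, 1],
    "bike": [1, 0, 1, 0, 1],
}


def compose_and(a: List[int], b: List[int]) -> List[int]:
    """A video is a positive for 'a-AND-b' only if it is positive for both."""
    return [x & y for x, y in zip(a, b)]


def compose_or(a: List[int], b: List[int]) -> List[int]:
    """A video is a positive for 'a-OR-b' if it is positive for either."""
    return [x | y for x, y in zip(a, b)]


# Inferred annotations for the composite concepts; these could then serve
# as training labels for a new detector.
ride_and_bike = compose_and(ground_labels["ride"], ground_labels["bike"])
ride_or_bike = compose_or(ground_labels["ride"], ground_labels["bike"])

print(ride_and_bike)  # [1, 0, 0, 0, 1]
print(ride_or_bike)   # [1, 1, 1, 0, 1]
```

In this sketch, the composite labels cost nothing beyond the ground annotations, which is the labor saving the paragraph above describes.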
As a further step towards reducing the manual labor of vocabulary construction, we investigate the question: can a machine learn its vocabulary from human stories, such as video captions or subtitles? By analyzing the human stories using topic models, we effectively extract the concepts that humans use for describing videos. Moreover, we show that the occurrences of concepts in stories can be used effectively as weak supervision to train concept detectors.
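A minimal sketch of weak supervision from stories follows. It assumes the vocabulary is already given, whereas the thesis extracts it with topic models, and it uses plain term matching as a stand-in for whatever matching the thesis actually applies. The captions, the vocabulary, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Set

# Hypothetical captions ("stories") attached to three videos.
captions: Dict[str, str] = {
    "video_1": "A man rides a bike along the beach.",
    "video_2": "Two dogs play in the park.",
    "video_3": "She fixes the bike tire in the garage.",
}

# Hypothetical concept vocabulary, assumed to be given.
vocabulary: List[str] = ["bike", "dog", "tire"]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def weak_labels(
    captions: Dict[str, str], vocabulary: List[str]
) -> Dict[str, Dict[str, int]]:
    """Label a video positive for a concept if the term occurs in its caption."""
    labels: Dict[str, Dict[str, int]] = {}
    for video_id, text in captions.items():
        tokens = tokenize(text)
        labels[video_id] = {concept: int(concept in tokens) for concept in vocabulary}
    return labels


labels = weak_labels(captions, vocabulary)
for video_id, concept_labels in labels.items():
    print(video_id, concept_labels)
```

Exact matching misses "dogs" for the concept "dog" in `video_2`, so that label comes out as 0. This kind of noise is why such labels count as weak supervision; stemming or lemmatization would reduce it but not remove it.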
Finally, we address the question: how to learn the vocabulary from human stories? We learn the vocabulary as an embedding from videos into their stories. We utilize the correlations between terms to learn the embedding more effectively. More specifically, we learn similar embeddings for terms that frequently co-occur in the stories, as these terms are usually synonyms. Furthermore, we extend our embedding to learn the vocabulary from various video modalities, including audio and motion. This enables us to generate more natural descriptions by incorporating concepts from various modalities, e.g. the laughing and singing concepts from audio, and the jumping and dancing concepts from motion.
Intelligent Sensory Information Systems group
The world is full of digital images and videos. In this deluge of visual information, the grand challenge is to unlock its content. This quest is the central research aim of the Intelligent Sensory Information Systems group. We address the complete knowledge chain of image and video retrieval by machine and human. Topics of study are semantic understanding, image and video mining, interactive picture analytics, and scalability. Our research strives for automation that matches human visual cognition, interaction surpassing man and machine intelligence, visualization blending it all in interfaces giving instant insight, and database architectures for extremely large visual collections. Our research culminates in state-of-the-art image and video search engines which we evaluate in leading benchmarks, often as the best performer, in user studies, and in challenging applications.