Kimiaki Shirahama

Intelligent Video Processing Using Data Mining Techniques

Supervisor(s) and Committee member(s): Prof. Dr. Kuniaki Uehara (supervisor)


Because the amount of video data on the Web is increasing rapidly, much research effort has been devoted to developing video retrieval methods that can efficiently retrieve videos of interest. Given limited manpower, it is highly desirable to develop retrieval methods that use features automatically extracted from videos. However, since such features only represent physical contents (e.g. color, edge, motion), retrieval methods require knowledge of how to use and integrate features to retrieve videos relevant to queries. To obtain such knowledge, this thesis concentrates on ‘video data mining’, where videos are analyzed using data mining techniques that extract previously unknown, interesting patterns from the underlying data. In this way, patterns for retrieving shots relevant to queries are extracted as explicit knowledge.

Queries can be classified into three types. For the first type, a user can find keywords suitable for retrieving relevant videos. For the second type, the user cannot find such keywords because of lexical ambiguity, but can provide some example videos. For the third type, the user has neither keywords nor example videos. Thus, this thesis develops a video retrieval system with ‘multi-modal’ interfaces by implementing three video data mining methods, one for each of the three query types. For the first query type, the system provides a ‘Query-By-Keyword’ (QBK) interface, for which patterns that characterize videos relevant to certain keywords are extracted. For the second query type, a ‘Query-By-Example’ (QBE) interface is provided, in which relevant videos are retrieved based on their similarity to example videos provided by the user. To this end, patterns for defining meaningful shot similarities are extracted using the example videos. For the third query type, a ‘Query-By-Browsing’ (QBB) interface is devised, in which abnormal video editing patterns are detected to characterize impressive segments in videos, so that the user can browse these videos to find keywords or example videos. Finally, to improve retrieval performance, the integration of QBK and QBE is explored, where information from the text and image/video modalities is interchanged using a knowledge base that represents relations among semantic contents.

The developed video data mining methods and the integration method are summarized as follows.

The method for the QBK interface is based on the observation that a given semantic content is presented by concatenating several shots taken by different cameras. Thus, this method extracts ‘sequential patterns’ which relate adjacent shots relevant to certain keyword queries. Such patterns are extracted by connecting characteristic features in adjacent shots. However, extracting sequential patterns is computationally expensive because a huge number of feature sequences have to be examined as candidate patterns. Hence, time constraints are adopted to eliminate semantically irrelevant sequences of features.
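
The effect of a time constraint on the number of candidate sequences can be illustrated with a minimal sketch. This is not the algorithm developed in the thesis: it only counts ordered pairs of features, and the data layout (each video as a list of per-shot feature sets), the function name, and the parameters `max_gap` and `min_support` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(videos, max_gap=1, min_support=2):
    """Return ordered feature pairs (a, b) found in at least min_support videos.

    A pair is counted when a occurs in one shot and b occurs in a later shot
    that is at most max_gap shots away (the hypothetical time constraint).
    """
    support = Counter()
    for shots in videos:  # shots: list of feature sets in temporal order
        seen = set()
        for i, j in combinations(range(len(shots)), 2):
            if j - i > max_gap:
                continue  # time constraint: skip shots that are too far apart
            for a in shots[i]:
                for b in shots[j]:
                    seen.add((a, b))
        support.update(seen)  # each video contributes at most once per pair
    return {pair: n for pair, n in support.items() if n >= min_support}


videos = [
    [{"dark"}, {"fast_motion"}, {"loud"}],
    [{"dark"}, {"fast_motion"}],
    [{"dark"}, {"bright"}, {"fast_motion"}],
]
print(frequent_pairs(videos, max_gap=1))  # {('dark', 'fast_motion'): 2}
```

With `max_gap=1` only features in adjacent shots are paired, so the third video does not support the pair; relaxing the constraint to `max_gap=2` raises its support to 3, at the cost of examining more candidates.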

The method for the QBE interface focuses on the large variation among relevant shots. That is, even for the same query, relevant shots contain significantly different features because of varied camera techniques and settings. Thus, ‘rough set theory’ is used to extract multiple patterns which characterize different subsets of example shots. This pattern extraction requires counter-example shots to compare against the example shots, but the user does not provide them. Hence, ‘partially supervised learning’ is used to collect counter-example shots from the large set of remaining shots in the database. In particular, to characterize the boundary between relevant and irrelevant shots, the method collects counter-example shots which are as similar to the example shots as possible.
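
The idea of collecting counter-examples near the boundary can be sketched as follows. This is a simplified stand-in, not the partially supervised learning procedure of the thesis: it assumes shots are plain feature vectors, uses Euclidean distance, and introduces a hypothetical `min_dist` threshold below which an unlabeled shot is treated as possibly relevant and therefore not used as a counter-example.

```python
import math


def nearest_example_distance(shot, examples):
    """Distance from a shot to its closest example shot."""
    return min(math.dist(shot, e) for e in examples)


def collect_counter_examples(examples, unlabeled, k=2, min_dist=1.0):
    """Pick the k unlabeled shots closest to the examples, excluding shots
    closer than min_dist (assumed here to be possibly relevant)."""
    scored = [(nearest_example_distance(s, examples), s) for s in unlabeled]
    candidates = [(d, s) for d, s in scored if d >= min_dist]
    candidates.sort(key=lambda item: item[0])  # most similar first
    return [s for _, s in candidates[:k]]


examples = [(0.0, 0.0), (1.0, 0.0)]
unlabeled = [(0.5, 0.0), (2.5, 0.0), (5.0, 5.0), (0.0, 2.0)]
print(collect_counter_examples(examples, unlabeled))
# [(2.5, 0.0), (0.0, 2.0)]
```

The shot at `(0.5, 0.0)` is skipped because it is too close to the examples, and the distant shot at `(5.0, 5.0)` is not selected because closer candidates describe the boundary better.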

The method for the QBB interface assumes that impressive actions of a character are presented by abnormal video editing patterns. For example, thrilling actions of the character are presented by shots with very short durations, while his/her romantic actions are presented by shots with very long durations. Based on this, the method detects ‘bursts’ as patterns consisting of abnormally short or long durations of the character’s appearance. The method first performs a probabilistic time-series segmentation to divide a video into segments characterized by distinct patterns of the character’s appearance. It then examines whether or not each segment contains a burst.
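
The second step, deciding whether a segment is a burst, can be sketched under strong simplifying assumptions. The sketch below takes the segmentation as given (it does not perform the probabilistic segmentation) and flags a segment when its mean appearance duration deviates from the video-wide mean by more than a hypothetical z-score threshold; the function name, the threshold, and the use of a z-score are illustrative choices, not the criterion used in the thesis.

```python
from statistics import mean, pstdev


def find_bursts(segments, z_threshold=1.0):
    """Label each segment 'short', 'long', or None.

    segments: list of lists of appearance durations (e.g. in seconds).
    A segment is a burst if its mean duration is at least z_threshold
    standard deviations below or above the mean over the whole video.
    """
    all_durations = [d for seg in segments for d in seg]
    mu = mean(all_durations)
    sigma = pstdev(all_durations)
    labels = []
    for seg in segments:
        z = (mean(seg) - mu) / sigma if sigma > 0 else 0.0
        if z <= -z_threshold:
            labels.append("short")  # abnormally short appearances
        elif z >= z_threshold:
            labels.append("long")  # abnormally long appearances
        else:
            labels.append(None)
    return labels


segments = [[5, 6, 5], [1, 1, 1], [12, 13, 12]]
print(find_bursts(segments))  # [None, 'short', 'long']
```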

The integration of QBK and QBE is achieved by constructing a ‘video ontology’ where concepts such as Person, Car and Building are organized into a hierarchical structure. Specifically, the ontology is constructed by considering the generalization/specialization relations among concepts and their co-occurrences in the same shots. Based on the video ontology, concepts related to a keyword query are selected by tracing its hierarchical structure. Shots in which few of the selected concepts are detected are filtered out, and then QBE is performed on the remaining shots.
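
The filtering step can be illustrated with a small sketch. It assumes the ontology is a dictionary mapping each concept to its more specific concepts, that tracing the hierarchy means collecting all descendants of the query concept, and that a shot is kept when at least `min_hits` selected concepts are detected in it. These representations, the concept names beyond those mentioned above, and the threshold are assumptions; co-occurrence relations are left out for brevity.

```python
def related_concepts(ontology, concept):
    """Collect a concept and all of its descendants in the hierarchy."""
    selected, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in selected:
            selected.add(current)
            stack.extend(ontology.get(current, []))
    return selected


def filter_shots(shots, selected, min_hits=1):
    """Keep shot ids in which at least min_hits selected concepts are detected."""
    return [sid for sid, detected in shots.items()
            if len(detected & selected) >= min_hits]


# Hypothetical ontology fragment: concept -> more specific concepts
ontology = {"Vehicle": ["Car", "Bus"], "Car": ["Taxi"]}
# Hypothetical concept detection results per shot
shots = {
    "s1": {"Taxi", "Road"},
    "s2": {"Person"},
    "s3": {"Bus", "Car"},
}

selected = related_concepts(ontology, "Vehicle")
print(filter_shots(shots, selected))  # ['s1', 's3']
```

QBE would then be run only on the shots that survive the filter, here `s1` and `s3`.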

Experimental results validate the effectiveness of all the developed methods. In the future, the multi-modal video retrieval system will be extended by adding a ‘Query-By-Gesture’ (QBG) interface based on virtual reality techniques. This will enable a user to create example shots for arbitrary queries by synthesizing his/her gestures, 3DCG and background images.

CS 24 Uehara Laboratory at Graduate School of System Informatics, Kobe University


Our research group aims to develop fundamental and practical technologies for utilizing knowledge extracted from multimedia data. To this end, we conduct research in broad areas of artificial intelligence, specifically machine learning, video data mining, time-series data analysis, information retrieval, trend analysis, and knowledge discovery, typically with large amounts of data.

As part of these research efforts, we are developing a multi-modal video retrieval system where different media, such as text, image, video, and audio, are analyzed using machine learning and data mining techniques. We formulate video retrieval as a classification problem: discriminating between shots that are relevant and irrelevant to a query. Various techniques, such as rough set theory, partially supervised learning, multi-task learning, and Hidden Markov Models (HMMs), are applied to the classification. Recently, we began to develop a gesture-based video retrieval system where information from various sensors, including RGB cameras, depth sensors, and magnetic sensors, is fused using virtual reality and computer vision techniques. In addition, transfer learning and collaborative filtering are utilized to refine the video annotation.

Another pillar of our research group is concerned with deeper analysis of natural language text. Our primary focus is to distill both the explicit and the implicit information contained therein. The former is generally treated as a problem of information extraction, question answering, passage retrieval, and annotation, and the latter as one of hypothesis discovery and text mining. Explicit information is directly described in text but not readily accessible to computers because it is embedded in complex human language. Implicit information, on the other hand, cannot be found in a single document and is only understood by synthesizing knowledge fragments scattered across a large number of documents. We take statistical natural language processing (NLP) and machine learning based approaches, such as kernel-based online learning and transductive transfer learning, to tackle these problems.

As described above, the common foundation underlying our research methodologies is machine learning, which requires more and more computing power as large-scale data become increasingly available and algorithms become more complex. To meet this demand, we are also engaged in developing parallel machine learning frameworks using MapReduce, MPI, Cell, and GPGPU. This work is ongoing and will be shared with the research community soon. More details of our research group can be found on our web site at
