Author: Iveel Jargalsaikhan
Supervisor(s) and Committee member(s): Noel E. O'Connor and Suzanne Little (joint supervisors)
The task of automatically categorizing and localizing human actions in video sequences is valuable for a variety of applications, such as detecting relevant activities in surveillance video, summarizing and indexing video sequences, or organizing a digital video library according to the actions it contains. However, it remains challenging for computers to robustly recognize actions due to cluttered backgrounds, camera motion, occlusion, viewpoint changes and the geometric and photometric variance of objects. An important question in action recognition is how to efficiently and effectively represent a video scene while maintaining its discriminative appearance, motion and contextual cues.

Recently, local feature-based action recognition methods have gained popularity due to their simplicity and state-of-the-art performance on various benchmark datasets. However, the existing feature representation schemes, e.g. Bag-of-Features, Fisher vectors and VLAD, ignore the spatial and temporal cues carried by the local features, e.g. their spatio-temporal locations and relationships. Motivated by this observation, this thesis aims to overcome this underlying limitation of feature representation by proposing a new way to construct a graph structure that captures the spatial and temporal relationships between local features while maintaining discriminative power. The key contributions can be summarized as follows: (i) a comprehensive evaluation of several key elements in the recognition pipeline; (ii) a novel video-graph-based human action recognition framework; (iii) an evaluation of the different techniques involved in the video graph construction process; and (iv) an extension of the proposed video-graph-based analysis to the challenging problem of action localization.
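The thesis's actual graph construction is not detailed in this abstract, but the core idea it builds on, linking local features that are close in both space and time rather than discarding their locations as Bag-of-Features does, can be sketched minimally as follows. All function names, parameters and thresholds here are illustrative assumptions, not the thesis's actual design:

```python
from math import hypot

def build_video_graph(features, spatial_radius=50.0, temporal_window=5):
    """Connect local features that are close in both space and time.

    `features` is a list of (t, x, y) tuples -- a simplified stand-in
    for real spatio-temporal interest points, which would also carry
    appearance and motion descriptors. Returns a list of undirected
    edges as index pairs.
    """
    edges = []
    for i in range(len(features)):
        t1, x1, y1 = features[i]
        for j in range(i + 1, len(features)):
            t2, x2, y2 = features[j]
            # Link two features only if they co-occur within a short
            # temporal window AND lie within a spatial radius.
            if (abs(t1 - t2) <= temporal_window
                    and hypot(x1 - x2, y1 - y2) <= spatial_radius):
                edges.append((i, j))
    return edges

# Two nearby detections plus one distant outlier: only the first
# pair is linked, so the graph retains local spatio-temporal layout.
points = [(0, 10, 10), (2, 20, 15), (40, 300, 200)]
print(build_video_graph(points))  # → [(0, 1)]
```

A representation built on such a graph (e.g. via graph kernels or subgraph statistics) can then discriminate between actions whose local features are similar in appearance but arranged differently in space and time, which is precisely the information that orderless encodings such as Bag-of-Features discard.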