VIREO-VH: Libraries and Tools for Threading and Visualizing a Large Video Collection

Affiliation: VIREO, City University of Hong Kong, Kowloon, Hong Kong



“Video Hyperlinking” refers to the creation of links connecting videos that share near-duplicate segments. Like hyperlinks in HTML documents, video links help users navigate videos of similar content and facilitate the mining of iconic clips (or visual memes) spread among videos. Figure 1 shows some examples of iconic clips, which can be leveraged for linking videos; the resulting links are potentially useful for multimedia tasks such as video search, mining, and analytics.

VIREO-VH [1] is open-source software developed by the VIREO research team. The software provides end-to-end support for the creation of hyperlinks, including libraries and tools for threading and visualizing videos in a large collection. The major software components are: near-duplicate keyframe retrieval, partial near-duplicate localization with time alignment, and galaxy visualization. These functionalities are mostly implemented based on state-of-the-art technologies, and each is developed as an independent tool with flexibility in mind, so that users can substitute any of the components with their own implementation. The earlier versions of the software are LIP-VIREO and SOTU, which have been downloaded more than 3,500 times. VIREO-VH has been used internally by VIREO since 2007, and has evolved over the years based on experience developing various multimedia applications, such as news event evolution analysis, novelty reranking, multimedia-based question answering [2], cross-media hyperlinking [3], and social video monitoring.

Figure 1: Examples of iconic clips.


The software components include video pre-processing, bag-of-words based inverted file indexing for scalable near-duplicate keyframe search, localization of partial near-duplicate segments [4], and galaxy visualization of a video collection, as shown in Figure 2. The open-source release comprises over 400 methods and 22,000 lines of code.

The workflow of the open source is as follows. Given a collection of videos, the visual content is indexed based on a bag-of-words (BoW) representation. Near-duplicate keyframes are retrieved and then temporally aligned in a pairwise manner among videos. Segments of a video that are near-duplicates of segments in other videos in the collection are then hyperlinked, with the start and end times of the segments explicitly logged. The end product is a galaxy browser, in which the videos are visualized as a galaxy of clusters on a Web browser, each cluster being a group of videos that are hyperlinked directly or indirectly through transitivity propagation. User-friendly interaction is provided: end users can zoom in and out to glance at the collection or closely inspect the relationships between videos.

Figure 2: Overview of VIREO-VH software architecture.


VIREO-VH can be used either as an end-to-end system that takes a video collection as input and outputs visual hyperlinks, or as independent functions for the development of different applications.

For content owners interested in content-wise analysis of a video collection, VIREO-VH can be used as an end-to-end system by simply specifying the location of a video collection and the output paths (Figure 3). The resulting output can then be viewed with the provided interactive interface, which gives a glimpse of the video relationships in the collection.

Figure 3: Interface for end-to-end processing of video collection.

VIREO-VH also provides libraries that grant researchers programmatic access. The libraries consist of various classes (e.g., Vocab, HE, Index, SearchEngine and CNetwork) providing functions for vocabulary and Hamming signature training [5], keyframe indexing, near-duplicate keyframe search, and video alignment; users can refer to the manual for details. Furthermore, the components of VIREO-VH are developed independently to provide flexibility, so users can substitute any of the components with their own implementation. This capability is particularly useful for benchmarking users' own choices of algorithms. For example, users can supply their own visual vocabulary and Hamming medians, while using the open source to build the index and retrieve near-duplicate keyframes. The following few lines of code implement a typical image retrieval system:

#include "Vocab_Gen.h"
#include "Index.h"
#include "HE.h"
#include "SearchEngine.h"
...
// train a visual vocabulary using the descriptors in folder "dir_desc";
// here we train a hierarchical vocabulary with 1M leaf nodes
// (3 layers, 100 nodes per layer)
Vocab_Gen::genVoc("dir_desc", 100, 3);
// load the pre-trained vocabulary from disk
Vocab* voc = new Vocab(100, 3, 128);
voc->loadFromDisk("vk_words/");
// Hamming Embedding training for the vocabulary
HE* he = new HE(32, 128, p_mat, 1000000, 12);
he->train(voc, "matrix", 8);
// index the descriptors with an inverted file
Index::indexFiles(voc, he, "dir_desc/", ".feat", "out_dir/", 8);
// load the index and conduct online search for the images in "query_desc"
SearchEngine* engine = new SearchEngine(voc, he);
engine->loadIndexes("out_dir/");
engine->search_dir("query_desc", "result_file", 100);
...


We use a video collection consisting of 220 videos (around 31 hours) as an example. The collection was crawled from YouTube using the keyword “economic collapse”. Using our open source with default parameter settings, a total of 35 partial near-duplicate (ND) segments are located, resulting in 10 visual clusters (or snippets). Figure 4 shows two examples of the snippets. In our experiments, the precision of ND localization is as high as 0.95 and the recall is 0.66. Table 1 lists the running time for each step. The experiment was conducted on a PC with a dual-core 3.16 GHz CPU and 3 GB of RAM. In total, creating a galaxy view for 31.2 hours of video (more than 4,000 keyframes) can be completed within 2.5 hours using our open source. More details can be found in [6].

Pre-processing 75 minutes
ND Retrieval 59 minutes
Partial ND localization 8 minutes
Galaxy Visualization 55 seconds

Table 1: The running time for processing 31.2 hours of videos.

Figure 4: Examples of visual snippets mined from a collection of 220 videos. For ease of visualization, each cluster is tagged with a timeline description from Wikipedia using the techniques developed in [3].


The open source software described in this article was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 119610).



[2] W. Zhang, L. Pang and C. W. Ngo. Snap-and-Ask: Answering Multimodal Question by Naming Visual Instance. In ACM Multimedia, Nara, Japan, October 2012. (Demo)

[3] S. Tan, C. W. Ngo, H. K. Tan and L. Pang. Cross Media Hyperlinking for Search Topic Browsing. In ACM Multimedia, Arizona, USA, November 2011. (Demo)

[4] H. K. Tan, C. W. Ngo, R. Hong and T. S. Chua. Scalable Detection of Partial Near-Duplicate Videos by Visual-Temporal Consistency. In ACM Multimedia, pages 145-154, 2009.

[5] H. Jegou, M. Douze and C. Schmid. Improving Bag-of-Features for Large Scale Image Search. IJCV, 87(3):192-212, May 2010.

[6] L. Pang, W. Zhang and C. W. Ngo. Video Hyperlinking: Libraries and Tools for Threading and Visualizing a Large Video Collection. In ACM Multimedia, Nara, Japan, October 2012.
