Authors:Matthijs Douze and Hervé Jégou

Affiliation:INRIA LEAR and Texmex

URL:http://yael.gforge.inria.fr

## Introduction

Yael is a library implementing computationally intensive functions used in large scale image retrieval, such as neighbor search, clustering and inverted files. The library offers interfaces for C, Python and Matlab.

The motivation of Yael is twofold. We aim at providing:

- core and optimized instructions and methods commonly used for large-scale multimedia retrieval systems
- more sophisticated functions associated with state-of-the-art methods, such as the Fisher vector, VLAD, Hamming Embedding or more generally methods based on inverted file systems, such as selective match kernels.

Yael is intended as an API and does not implement a retrieval system in an integrated manner: only a few test programs are available for key tasks such as k-means. Yet this can be done on top of it with a few dozen lines of Matlab or Python code.

Yael started as an open-source spin-off of INRIA LEAR‘s proprietary library Bigimbaz. The objective was to isolate performance-critical primitives that could be re-used in other projects. Yael’s design choices were: implemented in C for simplicity, but using an object-oriented design (structs with constructors/destructors), interface with Python as high-level language to facilitate administrative tasks.

Yael is designed to handle dense data in float, as it is primarily used for signal processing tasks where the quality of the representation is determined by the number of dimensions rather than the precision of the components. In the Matlab interface, single matrices, and float32 in Python. Yael was designed initially to manipulate matrices in C. It was interfaced for Python using SWIG, which gives low-level access to the full library. An additional Numpy layer (ynumpy) is provided for high-level functions. The most important functions of Yael are wrapped in Mex to be callable from Matlab.

Performance is very important. Yael has computed k-means with hundreds of thousand centroids and routinely manipulate matrices that occupy more than 1/2 the machine’s RAM. This means that it has to be lightweight and 64-bit clean. The design choices of Yael are governed by efficiency concerns more than by portability. As a result, the library may work only with severely down-graded performance if instructions are not provided by the processor. In particular, Yael relies on SSE instructions such as the SSE 4.2 popcnt instruction. The library is maintained for Linux and MacOS. Yael relies on as few external libraries as possible. The only mandatory ones are BLAS/Lapack (for performance). Other libraries (Python’s C interface, Matlab’s mex, Arpack, OpenMP) are optional.

Yael and related packages are downloaded around 600 times per month.

This article addresses the recognition of images of the same scene or object, and how Yael can perform this kind of operation. Here is an example of two images of the same scene that we would like to match:

We will explain how to compute descriptors (aka signatures) for the images, and how to find descriptors that are similar between images.

We are going to work on the 100 first query images of the Holidays dataset, and their associated database examples. The images and associated SIFT descriptors can be downloaded from here: Images and SIFT descriptors.

## Image indexing

Imagine a user that has a large image collection with photos of buildings, with as associated metadata the GPS location of the building. Given a new photo of a building, taken with a mobile phone, the user wants to find the location where the photo was taken. This is where image indexing comes into play.

Image indexing means constructing an index referencing the images from a collection. This index has a search function that can be used to retrieve the images that are most similar to a query image.

At build time and search time, the index is stored in RAM. This is orders of magnitude faster than disk-based implementations, such as those used in SQL database engines. However, for large datasets, this requires either a lot of RAM or a very compact representation per image. Yael provides this compact representation, so that you do not need to buy the RAM.

In combination with efficient matrix manipulation environments like Matlab and Numpy, Yael makes the process of building an index and searching in it very simple.

## Extracting image descriptors

Local image descriptors are vectors computed each on an area of the image. The areas are selected to contain strong contrast changes, with a 2D signal processing filter. Then the descriptor vector is computed from the gradient or frequency content in the area.

Local descriptors are typically designed to be invariant to some classes of transformations: translations, illumination changes, rotations, etc. At the same time, they should be discriminant enough to distinguish relevant differences on the patches, eg. different patterns on the facade of a building. There is a long line of research on designing local image features with appropriate tradeoffs in terms of invariance / discriminance / computational cost, see for example this comparison of affine covariant features.

In the images above, local descriptors extracted on the skyline ought to be very similar. Therefore, these images should be easy to match.

Local descriptors can be extracted using any local description algorithm, as long as they can be compared with L2 distances, ie. descriptors that are far away in L2 space are also considered different in image content. For example, OpenCV provides an implementation of the SURF descriptor, and VLFeat contains a SIFT implementation.

For this example, we will use the SIFT implementation provided along with the Holidays dataset. In the “Descriptor extraction” section of http://lear.inrialpes.fr/~jegou/data.php, download the executable (there is a Mac OS X version and a Linux version).

The pre-processing applied to images before analyzing them to extract signatures can have a dramatic effect on the retrieval performance. Ideally, images should be equalized so that their luminance is similar and resized into dimensions that are not too different. This can be performed in a number of ways, eg. with Imagemagick. In our case, we’ll just use a few command-line utilities from netpbm.

In total, the steps that extract the descriptors from a single image are:

infile=xxxx.jpg tmpfile=${infile/jpg/pgm} outfile=${infile/jpg/siftgeo} # Rescaling and intensity normalization djpeg $infile | ppmtopgm | pnmnorm -bpercent=0.01 -wpercent=0.01 -maxexpand=400 | pamscale -pixels $[1024*768] > $tmpfile # Compute descriptors compute_descriptors -i $tmpfile -o4 $outfile -hesaff -sift

This should be applied to all the images that are to be indexed, and the ones that will be queried.

The remainder of this article presents the main functions used in Yael to do image retrieval. They are implemented in the two languages supported by Yael: Python and Matlab.

## Image indexing in Python with Fisher vectors

A global image descriptor is a vector that characterizes the whole image. The Euclidean distance between the descriptors of two images should be higher for different images than for similar images. There are many popular types of global descriptors, like color histograms or GIST descriptors.

Here, we use a statistical tool derived from the Fisher kernel to aggregate the local SIFT descriptors of an image into a global image descriptor: the Fisher vector (FV). See Aggregating local image descriptors into compact codes for more details. You may also be interested in INRIA’s Fisher vector implementation which is a Matlab version of this example, on the complete Holidays dataset.

The most important functions of Yael are available in Python via the ynumpy module. They all manipulate c-compact float32 or int32 matrices.

The FV computation relies on a training where a Gaussian Mixture Model (GMM) is fitted to a set of representative local descriptors. For simplicity, we are going to use the descriptors of the database we index. To load the database descriptors, use the ynumpy.siftgeo_read function:

for imname in image_names: desc, meta = ynumpy.siftgeo_read(imname) image_descs.append(desc)

The meta component contains the SIFT descriptor’s meta-information (location and size of the area, orientation, etc.). We do not use this information to compute the FV.

Next we sample the descriptors to reduce their dimensionality by PCA and computing a GMM. This involves some standard numpy code, and the ynumpy.gmm_learn function. For a GMM of size k (let’s set it to 64), we need about 1000*k training descriptors

k = 64 n_sample = k * 1000 # choose n_sample descriptors at random sample_indices = np.random.choice(all_desc.shape[0], n_sample) sample = all_desc[sample_indices] # train GMM gmm = ynumpy.gmm_learn(sample, k)

The GMM is a tuple containing the a-priori weights per mixture component, the mixture centres and the diagonal of the component covariance matrices (the model assumes a diagonal matrix, otherwise the descriptor would be way too long).

The training is finished. The next stage is to encode the SIFTs into one vector per image:

image_fvs = [] for image_desc in image_descs: # compute the Fisher vector, using only the derivative w.r.t mu fv = ynumpy.fisher(gmm, image_desc, include = 'mu') image_fvs.append(fv)

All the database descriptors are stacked as lines of a single matrix image_fvs, and all queries image descriptors in another matrix query_fvs. Then the Euclidean nearest neighbors of each query (and hence the most similar images) can be retrieved with:

# get the 8 NNs for all query images in the image_fvs array results, distances = ynumpy.knn(query_fvs, image_fvs, nnn = 8)

Now we display the search results for a few query images. There is one line per query image, which shows the image, and a row of retrieval results. The correct results have a green rectangle around them, negative ones a red rectangle.

Note that the query image always appears as the first retrieval result, because it is included in the dataset.

Image indexing based on global descriptors like the Fisher Vector is very efficient and easy to implement using Yael. For larger datasets (more than a few tens of thousand images), it is useful to use vector quantization or hashing techniques to perform the nearest-neighbor search faster.

## Image indexing in Matlab with inverted files

In this chapter, we directly index all the local SIFT descriptors of the database images into an indexing structure in RAM called the inverted file. Each SIFT descriptor is assigned an index in [1,k] using a quantization function. The inverted file contains k lists, one per possible index. When a SIFT from an image is assigned to an index 1 ≤ i ≤ k, the id of this image is added to the list i.

In the example below, we show how to use an inverted file of Yael from Matlab. More specifically, the inverted file we consider supports binary signatures, as proposed in the Hamming Embedding approach described in this paper.

Before launching the code, please ensure that

- You have a working and compiled version of Yael’s matlab interface
- The corresponding directory (‘YAELDIR/matlab’) is in your matlab Path. If not, use the addpath(‘YAELDIR/matlab’) to add it.

To start with, we define the parameters of the indexing method. Here, we choose a vocabulary of size k=1024. We also set some parameters specific to Hamming embedding.

k = 1024; % Vocabulary size dir_data = './holidays_100/'; % data directory % Parameters For Hamming Embedding nbits = 128; % Typical values are 32, 64 or 128 bits ht = floor(nbits*24/64); % Hamming Embedding threshold

Hereafter, we show how we typically load a set of images and descriptors stored in separate files. We use the standard matlab functions `arrayfun` and `cellfun` to perform operations in batch. The descriptors are assumed stored in the siftgeo format, therefore we read them with the yael ‘siftgeo_read’ function.

sifts = cell(); for i = 1:numel(img_list) [sifts_i, meta] = siftgeo_read(img_list{i}); sifts{i} = sifts_i; end

Now, we are going to learn the visual vocabulary with k-means and subsequently construct the inverted file structure for Hamming Embedding. We learn it on Holidays itself to avoid requiring another dataset. But note that this should be avoided for a true system, and a proper evaluation should employ an external dataset for dictionary learning.

vtrain = [sifts{:}]; vtrain = vtrain (:, 1:2:end); tic C = yael_kmeans (vtrain, k, 'niter', 10); % We provide the codebook and the function that performs the assignment, % here it is the exact nearest neighbor function yael_nn ivfhe = yael_ivf_he (k, nbits, vtrain, @yael_nn, C);

We can add the descriptors of all the database images to the inverted file. Here, Each local descriptor receives an identifier. This is not a requirement: another possible choice would be to use directly the id of the image. But in this case we could not use this output for spatial verification. In our case, the descriptor id will be used to display the matches.

descid_to_imgid = zeros (totsifts, 1); % desc to image conversion imgid_to_descid = zeros (nimg, 1); % for finding desc id lastid = 0; for i = 1:nimg ndes = nsifts(i); % number of descriptors % Add the descriptors to the inverted file. % The function returns the visual words (and binary signatures), [vw,bits] = ivfhe.add (ivfhe, lastid+(1:ndes), sifts{i}); imnorms(i) = norm(hist(vw,1:k)); descid_to_imgid(lastid+(1:ndes)) = i; imgid_to_descid(i) = lastid; lastid = lastid + ndes; end

Finally, we make some queries. We compute the number of matches `n_immatches` between query and database images. We invoke the standard Matlab function `accumarray`, which in essence compute here a histogram weighted by the match weights.

Queries = [1 13 23 42 63 83]; for q = 1:numel(Queries) qimg = Queries(q) matches = ivfhe.query (ivfhe, int32(1:nsifts(qimg)), sifts{qimg}, ht); % Translate to image identifiers and count number of matches per image, m_imids = descid_to_imgid(matches(2,:)); n_immatches = hist (m_imids, 1:nimg); % Images are ordered by descreasing score [~, idx] = sort (n_immatches, 'descend'); % Display results ... end

The output looks as follows. The query is the top-left image, and then the queries are displayed. The title gives the number of matches and the normalized score used to rank the images. The matches are displayed in yellow (and the non-matching descriptors in red).

## Conclusion

Yael is a small library that contains many primitives that are useful for image indexing, nearest-neighbor search, sorting, etc. It at the base of several state-of-the-art implementations of image indexing packages. Reference [1] describes the implementation tradeoffs of some of Yael’s main functions, and provides more references to research papers whose results were obtained with Yael.

In the code above, only the main function calls were shown, see the Yael tutorial for a fully functional version of the code, and the main Yael website for the complete documentation.