Understanding how bags of visual words work

Since their introduction about ten years ago [1], bags of visual words (also called bags of features or bags of keypoints) have been widely used in the computer vision community for image classification and recognition.

Computing the similarity between two pictures is complicated because a single image contains many pixels. Usually, scientists extract features such as color, shape or texture in order to compare images. One difficulty with these standard techniques is computing features that are robust to rotation, zoom, illumination changes, noise and occlusion. Another difficulty is that most of these techniques need to segment the object before describing it.

Interest points (or keypoints), detected and described with methods such as SIFT or SURF, solve most of these problems: they are robust and do not need any segmentation, so they are very easy to use. Extracting interest points is therefore a good basis for comparing images. After extracting the points, there are mainly two options: 1) match the points of one image with the points of another image in order to do stitching, object recognition and localization, or 2) build a statistical description of each image by counting the different kinds of keypoints it contains. The second option is the Bags of Visual Words (BoVW) technique, which is used for image classification.
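To make the first option more concrete, here is a minimal sketch of brute-force keypoint matching. It assumes the descriptors have already been extracted and are stored as plain lists of floats; the function names `euclidean` and `match_keypoints` and the toy data are made up for this illustration and do not come from any library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_keypoints(desc_a, desc_b):
    """For each descriptor of image A, find the index of the closest descriptor of image B."""
    matches = []
    for i, d in enumerate(desc_a):
        j = min(range(len(desc_b)), key=lambda k: euclidean(d, desc_b[k]))
        matches.append((i, j))
    return matches

# Toy 2-D "descriptors" (hypothetical data; real SIFT or SURF descriptors
# have 64 or 128 dimensions).
image_a = [[0.0, 0.0], [1.0, 1.0]]
image_b = [[0.9, 1.1], [0.1, -0.1], [5.0, 5.0]]
matches = match_keypoints(image_a, image_b)
print(matches)
```

Matching of this kind compares two specific images point by point. The second option, BoVW, instead summarizes each image on its own, which is why it suits classification.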

How do bags of visual words work?

Here is the principle in 4 simple steps:

  1. Extract the keypoints of the images. You can use SURF to do this.
  2. Create a visual dictionary by clustering the keypoint descriptors of all the images. You can use k-means with k fixed between 200 and 2,000, for example 1,000.
  3. For each image, find the cluster that each keypoint belongs to, then build a histogram with 1,000 bins, where each bin corresponds to a cluster. The value of a bin is the number of keypoints of the image that fall in the related cluster.
  4. Each image is now described by a vector of fixed length, so you can do supervised classification, for example with an SVM.
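Steps 2 and 3 above can be sketched in a few lines of plain Python. This is only an illustration under several assumptions: the descriptors are already extracted (step 1) and stored as lists of floats, the toy data is 2-D with k = 2 instead of real descriptors with k = 1,000, and the helper names (`kmeans`, `bovw_histogram`, etc.) are hypothetical. The k-means here is a basic Lloyd's iteration, not an optimized implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the cluster center closest to point."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))

def kmeans(points, k, iterations=20, seed=0):
    """Basic Lloyd's k-means; returns k cluster centers (the visual words)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for c, group in enumerate(groups):
            if group:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def bovw_histogram(descriptors, centers):
    """Step 3: count how many keypoints of one image fall in each cluster."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1
    return hist

# Hypothetical toy descriptors for two images.
image_1 = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1]]
image_2 = [[5.1, 5.0], [4.9, 5.0], [0.0, 0.0]]

# Step 2: the dictionary is learned from the descriptors of all images.
vocabulary = kmeans(image_1 + image_2, k=2)

# Step 3: one histogram per image, with one bin per visual word.
hist_1 = bovw_histogram(image_1, vocabulary)
hist_2 = bovw_histogram(image_2, vocabulary)
print(hist_1, hist_2)
```

Whatever the number of keypoints in an image, its histogram always has k bins. That fixed length is what allows the vectors to be passed to a classifier in step 4.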


[1] Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004, May). Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV (Vol. 1, No. 1-22, pp. 1-2).
