Using Flickr images for 3D reconstruction

At the 2013 Electronic Imaging symposium, Steve Seitz, from the University of Washington and Google, gave a very interesting keynote entitled “A Trillion Photos”.

The principle is to exploit the millions of images found in databases such as Flickr. The aim of the Building Rome in a Day project is to harvest images from Flickr by simply typing keywords such as “Rome” or “Venice”. Many images are unusable because they cannot be matched with any other image – pictures of a restaurant, of a family, etc. On the other hand, the most touristic places, such as San Marco, are photographed from many different angles. By using a standard pipeline such as SIFT + RANSAC + FLANN, it is possible to match the images and then perform the 3D reconstruction.
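The real pipeline estimates full camera geometry, but the robust-fitting idea behind the RANSAC step can be shown on a toy problem. The sketch below (my own illustration, not the project's code) fits a simple 2D translation between two sets of matched keypoints, some of which are wrong matches:

```python
import random

def ransac_translation(matches, iters=200, tol=1.0, seed=0):
    """Estimate a 2D translation (dx, dy) from noisy point matches.

    `matches` is a list of ((x1, y1), (x2, y2)) pairs; some are outliers.
    Classic RANSAC loop: hypothesize a model from a minimal sample
    (one match suffices for a translation), count inliers, and keep the
    hypothesis with the most support.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1          # minimal-sample hypothesis
        inliers = [m for m in matches
                   if abs((m[1][0] - m[0][0]) - dx) < tol
                   and abs((m[1][1] - m[0][1]) - dy) < tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (dx, dy), inliers
    return best_model, best_inliers

# 8 good matches shifted by (5, -3), plus 2 gross outliers.
good = [((x, y), (x + 5, y - 3)) for x, y in [(0, 0), (1, 2), (3, 1), (4, 4),
                                              (2, 5), (6, 0), (7, 3), (5, 5)]]
bad = [((0, 0), (40, 40)), ((1, 1), (-30, 7))]
model, inliers = ransac_translation(good + bad)
print(model, len(inliers))   # (5, -3) with 8 inliers
```

In the actual reconstruction the model would be an essential or fundamental matrix estimated from SIFT matches, but the consensus loop is the same.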

In this video, the pyramids represent the estimated shooting positions. The reconstruction was made using 14,079 pictures. The reconstruction of Venice used 250,000 images on 496 computing cores: 27 hours were necessary for matching and 38 hours for reconstruction.

Understanding how bags of visual words work

For about ten years [1], bags of visual words (also called bags of features or bags of keypoints) have been widely used in the computer vision community for image classification and recognition.

Computing the similarity between two pictures is complicated because an image contains many pixels. Usually, scientists extract features such as color, shape or texture in order to compare images. One difficulty with standard techniques is to compute features that are robust to rotation, zoom, illumination, noise and occlusion. Another difficulty is that most techniques need to segment the object before describing it.

Interest points (or keypoints) such as SIFT, SURF, etc. solve most of these problems: they are robust and do not need any segmentation, so they are very easy to use. Extracting interest points to compare images is a good idea. After extracting the points, there are mainly two options: 1) matching the points of one image with the points of another image, in order to do stitching, object recognition and localization, or 2) making a statistical description of an image by counting the different “kinds” of keypoints it contains: this is the Bags of Visual Words (BoVW) technique. BoVW is used for image classification.

How do bags of visual words work?

Here is the principle in 4 simple steps:

  1. Extract the keypoints of the images. You can use SURF to do this.
  2. Create a visual dictionary by clustering all the keypoints. You can use k-means and fix k between 200 and 2,000, for example 1,000.
  3. For each image, check which cluster each keypoint belongs to. You then build a histogram with 1,000 bins, where each bin corresponds to a cluster. The value of a bin is the number of keypoints of the image that fall into the related cluster.
  4. Each image is now described by a vector, so you can do supervised classification, for example with an SVM.
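The four steps above can be sketched with a toy k-means in NumPy. Everything is shrunk so it runs instantly: real descriptors would be 64- or 128-dimensional SURF/SIFT vectors and k would be around 1,000, not 3.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: the cluster centres become the 'visual words'."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centre
        d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return centres

def bovw_histogram(descriptors, centres):
    """Step 3: count how many keypoints fall into each cluster."""
    d = np.linalg.norm(descriptors[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centres))
    return hist / hist.sum()      # normalise so image size does not matter

# Toy data: pretend each image yields a few 2-D "descriptors".
rng = np.random.default_rng(1)
all_desc = np.vstack([rng.normal(c, 0.2, size=(30, 2))
                      for c in [(0, 0), (5, 5), (0, 5)]])
vocab = kmeans(all_desc, k=3)                        # the visual dictionary
image_desc = rng.normal((5, 5), 0.2, size=(10, 2))   # keypoints of one image
hist = bovw_histogram(image_desc, vocab)
print(hist)
```

The resulting `hist` vector is what you would feed to the SVM in step 4.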


[1] Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004, May). Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV (Vol. 1, No. 1-22, pp. 1-2).

Handwritten digits recognition based on neural networks

During the summers of 2012 and 2013 I supervised two master's internships on handwritten digit recognition at the digitizing company Gestform. The first one was about the basic notions for reading a handwritten digit in a simple case, and the second one was about the segmentation of digits.

Here is an example of what we want to do:

Two things are done here: 1) determining which parts of the text contain digits, and 2) reading the digits. Almost the same tools are used for both parts.

Which parts of the text contain digits?

First, we try to segment the text line by line. Several algorithms can be applied for this; here a simple one is used: RLSA (Run-Length Smoothing Algorithm).
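The horizontal pass of RLSA is only a few lines: short runs of background between two ink pixels are filled, which merges the characters of a line into one blob. Here is a minimal sketch on a binary NumPy array (1 = ink, 0 = background):

```python
import numpy as np

def rlsa_horizontal(img, threshold):
    """Run-Length Smoothing: fill runs of background (0) no longer than
    `threshold` when they sit between two foreground (1) pixels.
    Applied row by row; the same idea transposed gives the vertical pass."""
    out = img.copy()
    for row in out:
        run_start = None
        for x, v in enumerate(row):
            if v == 1:
                if run_start is not None and x - run_start <= threshold:
                    row[run_start:x] = 1       # bridge the short gap
                run_start = x + 1
    return out

line = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 1]])
smoothed = rlsa_horizontal(line, threshold=2)
print(smoothed.tolist())   # [[1, 1, 1, 1, 0, 0, 0, 0, 1]]
```

The 2-pixel gap is bridged, the 4-pixel gap is kept; on a page image, horizontal projections of the smoothed result give the line boundaries.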

Then each connected component is extracted and analyzed in order to classify it as digit or non-digit.

As we can see, before extracting connected components some preprocessing should be done, such as mathematical morphology (dilation and erosion), because some parts of the digits are cut.

In order to extract the features homogeneously, some preprocessing is applied before feature extraction to normalize them: correcting the slope and the angle of the digit.

Then many features are extracted, such as:

  • Hu invariant moments
  • Projection histograms
  • Profile histograms
  • Intersection with horizontal lines
  • Position of holes
  • Extremities
  • Junction points
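Two of the simplest features in this list can be sketched directly on a binary digit image: the projection histograms (row and column ink counts) and the left profile (distance from the left edge to the first ink pixel in each row). This is my own minimal illustration, not Gestform's code:

```python
import numpy as np

def projection_histograms(img):
    """Row and column sums of the binary digit image (1 = ink)."""
    return img.sum(axis=1), img.sum(axis=0)

def left_profile(img):
    """For each row, the distance from the left edge to the first ink
    pixel (the image width if the row is empty)."""
    h, w = img.shape
    first = np.argmax(img == 1, axis=1)
    empty = img.sum(axis=1) == 0
    return np.where(empty, w, first)

# A crude 5x4 "1": a vertical stroke in column 2.
one = np.zeros((5, 4), dtype=int)
one[:, 2] = 1
rows, cols = projection_histograms(one)
print(rows.tolist())              # [1, 1, 1, 1, 1]
print(cols.tolist())              # [0, 0, 5, 0]
print(left_profile(one).tolist()) # [2, 2, 2, 2, 2]
```

The right profile is the mirror image of the same computation, and concatenating such descriptors gives part of the 124-dimensional vector mentioned below.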

Profile histogram


Projection histogram


Intersection with horizontal and vertical lines

They are all concatenated into one vector of 124 dimensions. Another vector is built from the Freeman chain code (a histogram of 128 dimensions).


Freeman chain code
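The raw Freeman chain code itself is easy to compute: walking along the contour, each step between neighbouring pixels is encoded as one of 8 directions. How exactly the 128-dimensional histogram was built from it is not detailed here, so the sketch below only produces the code sequence:

```python
def freeman_chain(points):
    """Freeman 8-direction chain code of a contour given as a list of
    successive pixel coordinates (x, y), with y growing downwards.
    Directions: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE."""
    dirs = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
            (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}
    return [dirs[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

# A 2x2 square walked clockwise starting from the top-left corner.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
code = freeman_chain(square)
print(code)   # [0, 6, 4, 2]
```

A histogram of these direction codes (possibly computed per image zone to reach 128 bins) is rotation-sensitive but cheap, which is why it is kept as a separate descriptor.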

After extracting the features, two neural networks are used to classify each connected component as digit or non-digit. The first one has 124 inputs and the second one 128; each has 2 outputs: D (digit) or R (reject, i.e. non-digit). Many examples have to be used to train the classifiers (around 10,000 for each class).
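The post only specifies the input and output sizes, so the following forward pass is a sketch under assumptions: one hidden layer of 32 sigmoid units (an arbitrary choice) and random, untrained weights, just to show the shape of the computation from the 124-feature vector to the two class scores:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer, sigmoid activation, softmax over the 2 classes
    D (digit) and R (reject)."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())     # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hid = 124, 32                     # 124 = size of the feature vector
W1, b1 = rng.normal(size=(n_hid, n_in)) * 0.1, np.zeros(n_hid)
W2, b2 = rng.normal(size=(2, n_hid)) * 0.1, np.zeros(2)
p = mlp_forward(rng.normal(size=n_in), W1, b1, W2, b2)
print(p)   # two class probabilities summing to 1
```

The second network is identical except for its 128 inputs; training would fit the weights by backpropagation on the ~10,000 examples per class.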


Reading the digits

Here, the same features and neural networks are used, but instead of 2 classes (digit / non-digit), 10 classes are used (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).

You can download training examples here. Some examples of 6 digits:



Correcting some classification errors

Many digits touch other digits and are classified as R, so we introduced a new class, “DD” (double digit). Furthermore, by using the sequences it is possible to correct some errors. For example, if you are looking for a 5-digit postal code, it is possible to change a result such as RRDDDRDRR into RRDDDDDRR, or to filter noise: RRRRRRDRRRR -> RRRRRRRRRRR. An HMM is used for this. Here is an HMM designed for postal codes, with the classes D (digit), DD (double digit) and R (reject / non-digit):
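As a much simpler illustration of the same idea (not the HMM itself, and ignoring the DD class), the two corrections above can be reproduced with two hand-written rules on the label string:

```python
import re

def correct_sequence(seq, code_len=5):
    """Toy stand-in for HMM decoding on a label string ('D' = digit,
    'R' = reject). If the span from the first to the last D is exactly
    `code_len` labels long, any R inside it is assumed to be a misread
    digit; otherwise an isolated D flanked by R's is assumed to be noise."""
    first, last = seq.find("D"), seq.rfind("D")
    if first != -1 and last - first + 1 == code_len:
        return seq[:first] + "D" * code_len + seq[last + 1:]
    return re.sub(r"(?<=R)D(?=R)", "R", seq)

print(correct_sequence("RRDDDRDRR"))     # RRDDDDDRR
print(correct_sequence("RRRRRRDRRRR"))   # RRRRRRRRRRR
```

The HMM generalizes this: instead of hard rules, it scores every possible label sequence against transition probabilities shaped like a postal code and keeps the most likely one.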


Double digits segmentation

In order to segment double digits, the “drop fall” algorithm is used.

The drop-fall algorithm can be seen as a drop of water sliding along the digits. 4 drop falls can be computed, depending on whether the starting point is set up/left, up/right, down/right or down/left. Then, in order to choose the best segmentation, the resulting digits are recognized by the neural network, and the pair of digits with the best recognition rate is kept.


Yi-Kai Chen and Jhing-Fa Wang. Segmentation of single- or multiple-touching handwritten numeral string using background and foreground analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1304–1317, 2000.

Britto de S., Robert Sabourin, Flavio Bortolozzi, Ching Y. Suen, et al. A string length predictor to control the level building of HMMs for handwritten numeral recognition. In Proceedings of the 16th International Conference on Pattern Recognition, volume 4, pages 31–34. IEEE, 2002.

R. V. Kulkarni and P. N. Vasambekar. An overview of segmentation techniques for handwritten connected digits. In 2010 International Conference on Signal and Image Processing (ICSIP), pages 479–482. IEEE, 2010.

Umapada Pal, Abdel Belaïd, and Christophe Choisy. Water reservoir based approach for touching numeral segmentation. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, pages 892–896. IEEE, 2001.

Ma Rui, Du Jie, Gu Yunhua, and Yan Yunyang. An improved drop-fall algorithm based on background analysis for handwritten digits segmentation. In WRI Global Congress on Intelligent Systems (GCIS '09), volume 4, pages 374–378. IEEE, 2009.

Javad Sadri, Ching Y. Suen, and Tien D. Bui. Automatic segmentation of unconstrained handwritten numeral strings. In Ninth International Workshop on Frontiers in Handwriting Recognition (IWFHR-9), pages 317–322. IEEE, 2004.

Jie Zhou and Ching Y. Suen. Unconstrained numeral pair recognition using enhanced error correcting output coding: a holistic approach. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, pages 484–488. IEEE, 2005.

Clément Chatelain, Guillaume Koch, Laurent Heutte, and Thierry Paquet. Une méthode dirigée par la syntaxe pour l'extraction de champs numériques dans les courriers entrants. 2006.

Image Quality Assessment for predicting OCR

The aim of OCR (Optical Character Recognition) is to read and transcribe the text present in an image.

The most famous commercial OCR software is certainly FineReader, from the ABBYY company. OmniPage, from Nuance, is also quite famous. Among open source OCRs, Tesseract (maintained by Google) works quite well.

OCR systems work quite well on relatively “new”, black & white, 300 dpi document images. Rotation, size and font of the characters, blur, stains, noise, etc. will decrease the quality of the OCR. Most of the time, several preprocessing steps are applied to the images before submitting them to the OCR: deskewing, despeckling, segmentation, etc.

Some researchers are working on predicting OCR quality by using IQA (Image Quality Assessment). Three kinds of methods exist: full-reference [1], reduced-reference [2] and no-reference [3].

[1] Licia Capodiferro, Giovanni Jacovitti, and Elio D. Di Claudio. Two-dimensional approach to full-reference image quality assessment based on positional structural information. IEEE Transactions on Image Processing, 21(2):505–516, 2012.

[2] Abdul Rehman and Zhou Wang. Reduced-reference image quality assessment by structural similarity estimation. IEEE Transactions on Image Processing, 21(8):3378–3389, 2012.

[3] Alexandre Ciancio, A. L. N. Targino da Costa, Eduardo A. B. da Silva, et al. No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Transactions on Image Processing, 20(1):64–75, 2011.