How to create a word cloud from text files with R

Here is the word cloud obtained from 100 scientific papers from the ICDAR 2013 conference (image: wordcloud).

Text mining is a useful tool for getting an overview of the subjects or important words in a text collection such as websites, books, articles, etc.

Creating a word cloud from text files with R is easy. The first thing to do is to install R and two packages: “tm” and “wordcloud” (these packages may require other packages; just follow R's instructions). Then put all the text files you want to analyze in the same directory, and write the following code in R:

 

# Loading libraries
library(tm)
library(wordcloud)

# Define the folder where the text files are ("en" assumes the texts are in English)
a <- Corpus(DirSource("C:/MyPath/FolderContaining/TxtFiles"), readerControl = list(language = "en"))

# Preprocessing text
a <- tm_map(a, removeNumbers) # Not necessary if numbers are important for you
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
# Since tm 0.6, base functions such as tolower must be wrapped in content_transformer()
a <- tm_map(a, content_transformer(tolower))
# Stopwords are words such as "we", "the", "and", "so", etc. You can add your own words to the list
a <- tm_map(a, removeWords, c(stopwords("english"), "can", "also", "may"))
# a <- tm_map(a, stemDocument, language = "english") # You can also apply stemming if you want

# Computing the term document matrix
tdm <- TermDocumentMatrix(a)

# Transforming data for wordcloud
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
myNames <- names(v)
d <- data.frame(word=myNames, freq=v)

# Making and displaying the cloud
wordcloud(d$word, d$freq, min.freq=150)

Image Quality Assessment for predicting OCR

The aim of OCR (Optical Character Recognition) is to read and transcribe the text present in an image.

The best-known commercial OCR software is certainly FineReader from ABBYY. OmniPage, from Nuance, is also well known. Among open-source OCR engines, Tesseract (Google) works quite well.

OCR systems work quite well on relatively “new”, black-and-white, 300 dpi document images. Rotation, character size and font, blur, stains, noise, etc. decrease the quality of the OCR output. Most of the time, several preprocessing steps are applied to the images before submitting them to the OCR: deskewing, despeckling, segmentation, etc.

Some researchers are working on predicting OCR performance by using IQA (Image Quality Assessment). Three kinds of methods exist: full-reference [1], reduced-reference [2] and no-reference [3].

[1] CAPODIFERRO, Licia, JACOVITTI, Giovanni, and DI CLAUDIO, Elio D. Two-Dimensional Approach to Full-Reference Image Quality Assessment Based on Positional Structural Information. IEEE Transactions on Image Processing, 2012, vol. 21, no. 2, p. 505-516.

[2] REHMAN, Abdul and WANG, Zhou. Reduced-reference image quality assessment by structural similarity estimation. IEEE Transactions on Image Processing, 2012, vol. 21, no. 8, p. 3378-3389.

[3] CIANCIO, Alexandre, DA COSTA, ALN Targino, DA SILVA, Eduardo AB, et al. No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Transactions on Image Processing, 2011, vol. 20, no. 1, p. 64-75.

Document image recognition

Document image matching

Recognizing a document image can be complex because the method must be robust to translation, rotation and scale changes. The documents may also be degraded (noise, stains, cuts, etc.).

Techniques based on interest points such as SIFT and SURF are commonly used on natural images (photographs). I worked on an extension of this approach to quickly recognize document models supplied by a user, such as an identity card, a passport, a train ticket, etc.

The method is simple and can be extended to many other kinds of document images. It is divided into four main steps:

  1. Extraction of interest points. (SURF)
  2. Description of points. (SURF)
  3. Matching the current image points with those of the query image. (FLANN)
  4. Estimation of a 4-parameter transformation. (RANSAC)

The technological choices in brackets may be replaced in the future by newer algorithms that are more efficient and better suited to the context.
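To illustrate the last step, here is a minimal Python sketch of a RANSAC estimation. It is not the implementation from the publication: it assumes that the four parameters are a translation (two values), a rotation and a scale, and that the matches from step 3 are already available as pairs of points. The function names, the tolerance and the synthetic matches are made up for the example.

```python
import cmath
import math
import random


def similarity_from_two_matches(p1, p2, q1, q2):
    """Return (a, b) such that q = a * p + b maps p1 -> q1 and p2 -> q2.

    Points are complex numbers x + 1j*y. The complex factor a encodes scale
    and rotation, and b encodes the translation: four real parameters in total.
    """
    a = (q1 - q2) / (p1 - p2)
    b = q1 - a * p1
    return a, b


def ransac_similarity(matches, n_iter=200, tol=3.0, seed=0):
    """Keep the two-match hypothesis that agrees with the most matches.

    matches: list of ((x, y), (x2, y2)) pairs, model point -> image point.
    Returns (a, b, inliers), where inliers are indices into matches.
    """
    rng = random.Random(seed)
    pts = [(complex(*p), complex(*q)) for p, q in matches]
    best = (None, None, [])
    for _ in range(n_iter):
        (p1, q1), (p2, q2) = rng.sample(pts, 2)
        if p1 == p2:
            continue  # degenerate sample, cannot define a transform
        a, b = similarity_from_two_matches(p1, p2, q1, q2)
        inliers = [i for i, (p, q) in enumerate(pts) if abs(a * p + b - q) <= tol]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best


# Synthetic matches: rotation of 30 degrees, scale 1.5, translation (40, -10),
# followed by three deliberately wrong matches (outliers).
true_a = 1.5 * cmath.exp(1j * math.radians(30))
true_b = complex(40, -10)
model = [(x, y) for x in range(0, 100, 20) for y in range(0, 60, 20)]
matches = []
for x, y in model:
    q = true_a * complex(x, y) + true_b
    matches.append(((x, y), (q.real, q.imag)))
matches += [((5, 5), (300, 300)), ((50, 10), (-200, 80)), ((70, 30), (0, 500))]

a, b, inliers = ransac_similarity(matches)
scale = abs(a)
angle_deg = math.degrees(cmath.phase(a))
print(scale, angle_deg, len(inliers))
```

Two matches are enough to fix the four parameters, which is why each RANSAC iteration draws only two of them. The wrong matches are then rejected because almost no other match agrees with the transformation they produce.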

The details of the technique can be found in the CIFED 2012 publication: Recognition and Extraction of identity documents (in French).

 

Document image deskew

A useful preprocessing step for document image analysis is to detect the orientation of the document and then deskew it.

Straight document image and skewed document image [1]

Several methods exist for this. Be aware, however, that most techniques are effective on documents containing text and can be disrupted if photos or lines are present in the document. A simple remedy is to remove the connected components that cause the problem, or to keep only the components likely to be text.

The two simplest and most commonly used methods are the horizontal projection profile and line detection with the Hough transform. Both are applied to a binarized (black and white) document.

Horizontal projection profile.

The method consists in counting, for each horizontal row of pixels, the number of black pixels. This gives a histogram.
The image is then rotated by a given angle and a new histogram is computed.
The histogram with the tallest peaks is the one corresponding to a horizontal page. We can then deduce the rotation angle.
Of course, the more angles have to be tested, the longer the method takes.
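The procedure can be sketched in a few lines of Python. This is only an illustration under several assumptions: the page is given as a list of black pixel coordinates rather than as an image, "tallest peaks" is measured by the sum of the squared row counts (one possible criterion among others), and the function names, angle range and step are arbitrary.

```python
import math


def profile_score(black_pixels, angle_deg):
    """Sharpness of the horizontal projection profile after rotating by angle_deg.

    black_pixels: iterable of (x, y) coordinates of black pixels.
    The score is the sum of squared row counts: it is large when the black
    pixels concentrate in a few rows (tall peaks) and small when they spread out.
    """
    t = math.radians(angle_deg)
    sin_t, cos_t = math.sin(t), math.cos(t)
    histogram = {}
    for x, y in black_pixels:
        row = round(x * sin_t + y * cos_t)  # y coordinate after the rotation
        histogram[row] = histogram.get(row, 0) + 1
    return sum(count * count for count in histogram.values())


def estimate_correction_angle(black_pixels, max_angle=10.0, step=0.5):
    """Return the tested rotation (in degrees) that gives the sharpest profile."""
    pixels = list(black_pixels)
    n_steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda angle: profile_score(pixels, angle))


# Synthetic page: five horizontal text lines of 200 pixels, skewed by +3 degrees.
skew = math.radians(3.0)
page = []
for line_y in range(0, 100, 20):
    for x in range(200):
        page.append((x * math.cos(skew) - line_y * math.sin(skew),
                     x * math.sin(skew) + line_y * math.cos(skew)))

correction = estimate_correction_angle(page)
print(correction)  # -3.0: rotating by -3 degrees straightens the lines
```

The cost grows with the number of tested angles, since every black pixel is projected once per angle. A common compromise is a coarse search followed by a finer one around the best candidate.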

Projection profile [1]

Hough

The Hough transform can be applied to the centers of the connected components or directly to pixels. Usually, not all the image pixels are used, only the black pixels that have a white pixel below them: the goal is to use the bottom row of the characters. For more details on Hough we can refer to this article.
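As an illustration, the pixel selection and the voting can be sketched in Python as follows. This is a simplified version that makes its own assumptions: the image is a list of rows of 0 (white) and 1 (black), only angles between -10 and +10 degrees are tested, and the skew is taken from the single accumulator cell with the most votes. The function names and the synthetic image are made up for the example.

```python
import math
from collections import Counter


def bottom_pixels(image):
    """Black pixels (value 1) that have a white pixel (value 0) just below them.

    image: list of rows, each a list of 0/1 values, row 0 at the top.
    Pixels on the last row are treated as having white below them.
    """
    height = len(image)
    points = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            below_is_white = y + 1 == height or image[y + 1][x] == 0
            if value == 1 and below_is_white:
                points.append((x, y))
    return points


def hough_skew(points, max_angle=10.0, step=0.5):
    """Return the line angle (in degrees) that collects the most Hough votes.

    For every tested angle, each point votes for the line of that slope
    passing through it; lines are indexed by (angle, rounded offset rho).
    """
    n_steps = int(round(2 * max_angle / step))
    angles = [-max_angle + i * step for i in range(n_steps + 1)]
    votes = Counter()
    for angle in angles:
        t = math.radians(angle)
        sin_t, cos_t = math.sin(t), math.cos(t)
        for x, y in points:
            rho = round(y * cos_t - x * sin_t)  # offset of the line from the origin
            votes[(angle, rho)] += 1
    (best_angle, _), _ = votes.most_common(1)[0]
    return best_angle


# Synthetic image: three text lines, 3 pixels thick, with a slope of 4 degrees
# (angles are measured in image coordinates, with y pointing down).
width, height = 120, 80
image = [[0] * width for _ in range(height)]
for baseline in (20, 40, 60):
    for x in range(width):
        y = round(baseline + x * math.tan(math.radians(4.0)))
        for dy in range(3):
            image[y - dy][x] = 1

points = bottom_pixels(image)
skew_angle = hough_skew(points)
print(len(points), skew_angle)
```

Keeping only the bottom pixels reduces the number of votes to one per column and per text line in this example, and makes the aligned points sit on the bottom edge of the characters instead of spreading over their whole height.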

Other techniques

Boris Epshtein [2], from Google, published a paper at the ICDAR conference in 2011. His method is based on the inter-line spaces.

Bibliography

[1] Document image skew detection: Survey and annotated bibliography, Hull J.J., SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE, volume 29, pages 40–66, 1998, WORLD SCIENTIFIC PUBLISHING.

[2] Determining Document Skew Using Inter-Line Spaces, Epshtein, B., Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 27–31, 2011, IEEE.

ICDAR 2011

The 11th International Conference on Document Analysis and Recognition took place from the 19th to the 21st of September 2011 in Beijing. The program can be found here.

I presented a poster on my work on document image classification in an industrial context where thousands of documents are scanned each day. My paper presents a new method for fast indexing of document images. One of the difficulties is that the number of classes and the nature of the documents are completely unknown. Many descriptors are extracted, such as the number of words, the number of images, the number of tables, statistics on the height and width of connected components and their bounding boxes, the local density values of the components, etc. Then the number of classes is estimated and a clustering is built based on this number of classes. We provide an “assisted” classification tool based on CBIR techniques and relevance feedback.