How to create a word cloud from text files with R

Here is the word cloud obtained from 100 scientific papers from the ICDAR 2013 conference: [image: wordcloud]

Text mining is a useful way to get an overview of the main subjects or important words in a text collection such as websites, books, or articles.

Creating a word cloud from text files with R is easy. First, install R and two packages: “tm” and “wordcloud” (these packages may depend on other packages; just follow R's instructions to install them). Then put all the text files you want to analyze in the same directory and run the following code in R:

 

# Loading libraries
library(tm)
library(wordcloud)

# Define the folder where the text files are ("en" because the texts are in English)
a <- Corpus(DirSource("C:/MyPath/FolderContaining/TxtFiles"), readerControl = list(language = "en"))

# Preprocessing text
a <- tm_map(a, removeNumbers) # Not necessary if numbers are important for you
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, content_transformer(tolower)) # tm >= 0.6 needs content_transformer() around base R functions
# Stopwords are words such as "we" "the" "and" "so", etc. You can add your own words to the list
a <- tm_map(a, removeWords, c(stopwords("english"), "can", "also", "may"))
# a <- tm_map(a, stemDocument, language = "english") # You can also do stemming if you want (requires the SnowballC package)

# Computing the term document matrix
tdm <- TermDocumentMatrix(a)

# Transforming data for wordcloud
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
myNames <- names(v)
d <- data.frame(word=myNames, freq=v)

# Making and displaying the cloud (only words occurring at least 150 times are drawn; lower min.freq for a small corpus)
wordcloud(d$word, d$freq, min.freq=150)
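
To try the pipeline without preparing a folder of text files, the same steps can be run on a few sentences held in memory. The following is only a minimal sketch: the three sentences and the variable names (docs, corpus, freq) are made up for illustration, and VCorpus(VectorSource(...)) is used here in place of DirSource, so it is not the exact setup used for the ICDAR papers.

```r
library(tm)

# Three tiny made-up documents standing in for the text files (assumption)
docs <- c("Word clouds show frequent words.",
          "Frequent words appear larger in the cloud.",
          "The cloud ignores rare words.")

# Build an in-memory corpus instead of reading a directory
corpus <- VCorpus(VectorSource(docs))

# Same kind of preprocessing as above
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix, then total frequency of each term
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = freq, stringsAsFactors = FALSE)

# Inspect the most frequent terms before drawing anything
print(head(d))
```

Looking at head(d) first helps in choosing a sensible min.freq: with such a small corpus, a call like wordcloud(d$word, d$freq, min.freq = 1) would be needed for any word to be drawn.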
