Discovering the new Digital world

03 Dec 2016

NLP: Working with Unstructured Data with the `tm` package

Over the last few weeks I have spent some time experimenting with the tm package. Some learnings and examples of how to use it can be found here. It is a great package for text mining and for transforming free text into features that can be used in further data analysis.

Background Information

Text mining is the process of extracting useful insights and information from text and transforming it, using NLP (Natural Language Processing) and analytical methods, into data that can be used for further analysis.

There are many packages that can be used for Natural Language Processing in R, but one of them is the cornerstone: the tm package.

'In recent years, we have elaborated a framework to be used in packages dealing with the processing of written material: the package tm. Extension packages in this area are highly recommended to interface with tm's basic routines...' (from CRAN website)

The tm package provides a comprehensive text mining framework for R. More information about it can be found in the “Text Mining Infrastructure in R” paper (Journal of Statistical Software) and in the “Introduction to the tm Package” vignette.
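To make the idea of turning free text into features more concrete, here is a minimal sketch of a typical tm workflow: build a corpus, clean it, and convert it into a document-term matrix. The three sample sentences and the variable names (`docs`, `corpus`, `dtm`, `features`) are made up for illustration, and the choice of cleaning steps is just one reasonable assumption, not a recommended recipe.

```r
library(tm)

# Three made-up documents, used only for illustration
docs <- c("Text mining turns free text into data.",
          "The tm package is a text mining framework for R.",
          "Free text is unstructured data.")

# Build an in-memory corpus from the character vector
corpus <- VCorpus(VectorSource(docs))

# Basic cleaning: lower-case, drop punctuation and English stop words,
# then collapse the extra whitespace left behind
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# One row per document, one column per term, cells hold term counts
dtm <- DocumentTermMatrix(corpus)

# Convert to a plain matrix so the counts can be used as features
features <- as.matrix(dtm)
```

Calling `inspect(dtm)` prints a summary of the matrix. Note that with its default settings `DocumentTermMatrix` only keeps terms of at least three characters, so very short tokens such as "tm" do not appear as columns.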