Discovering the new Digital world

22 Dec 2016

NLP: Classification using a Naive Bayes classifier

This post applies the Naive Bayes approach to a specific problem: classifying SMS messages as spam (“an undesired message, e.g. advertising”) or ham (“a desired message containing valuable information that is not considered spam”). The supporting code can be found here.

The data used for this playground exercise is the SMS Spam Collection v. 1, a public set of SMS messages collected for mobile phone spam research, in which each message is labeled as spam or ham.
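As a rough illustration of how such labeled data could be read, here is a minimal sketch. It assumes each line of the dataset file has the form label, a tab character, then the message text; that layout, the helper name `parse_sms_lines`, and the two sample lines are assumptions made for this example and are not taken from the supporting code or from the real corpus.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Turn 'label<TAB>message' lines into a list of (label, text) pairs.

    The tab-separated layout is an assumption about the dataset file.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only on the first tab so tabs inside the message survive.
        label, text = line.split("\t", 1)
        pairs.append((label, text))
    return pairs


# Two made-up lines in the assumed format (not real corpus entries).
sample = [
    "ham\tSee you at lunch tomorrow\n",
    "spam\tWIN a free prize now, reply YES\n",
]

messages = parse_sms_lines(sample)
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

With a real file, the same function could be fed an open file handle instead of the `sample` list, since iterating over a text file also yields one line at a time.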

Background information

Q: What is classification?

‘In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “non-spam” classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.’ wikipedia

Classification is a type of supervised learning problem.

Q: What is supervised learning?

‘Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.’ wikipedia

Q: What is a Naive Bayes classifier?

‘Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.’ wikipedia
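To make the definition above more concrete, here is a minimal from-scratch sketch of one common variant: a bag-of-words (multinomial) Naive Bayes classifier with add-one smoothing, working in log space. It is not necessarily what the supporting code does. The function names, the whitespace tokenizer, and the four toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def train_naive_bayes(examples):
    """examples: list of (label, text) pairs.

    Returns (log_priors, log_likelihoods, vocab).
    """
    doc_counts = Counter()              # number of messages per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for label, text in examples:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)

    total_docs = sum(doc_counts.values())
    log_priors = {c: math.log(n / total_docs) for c, n in doc_counts.items()}

    log_likelihoods = {}
    for c in doc_counts:
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[c].values()) + len(vocab)
        log_likelihoods[c] = {
            w: math.log((word_counts[c][w] + 1) / denom) for w in vocab
        }
    return log_priors, log_likelihoods, vocab


def classify(text, log_priors, log_likelihoods, vocab):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c in log_priors:
        score = log_priors[c]
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += log_likelihoods[c][word]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
train = [
    ("spam", "win a free prize now"),
    ("spam", "free cash prize call now"),
    ("ham", "see you at lunch"),
    ("ham", "call me when you are free"),
]

model = train_naive_bayes(train)
print(classify("free prize now", *model))
print(classify("see you at lunch", *model))
```

The “naive” part is the sum inside `classify`: each word contributes its own log likelihood independently of the other words in the message, which is exactly the independence assumption the quote refers to.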

For a better overview and understanding of the theory behind text classification and Naive Bayes classifiers, the material created by Dan Jurafsky & Christopher Manning for the “Natural Language Processing” MOOC at Coursera is a great starting point.

Another resource is the book “Speech and Language Processing” by Dan Jurafsky & James H. Martin.

Enjoy the learnings…