Spam Filtering Using Naive Bayes Classifier

Overview

Abstract
Structure of Spam Filter
Introduction
Example text for spam
Bayes Theorem
Types of Naive Bayes Algorithm
Steps to Classify the emails
Conclusion

Abstract

This article shows how to predict whether an email is spam using the Naive Bayes algorithm, one of the standard techniques in text classification. It is a small example aimed at machine learning beginners. Anyone with an email address has received unwanted messages, which we call spam. Modern spam filtering software continuously struggles to detect these messages and mark them as spam. Here we discuss one algorithm for the task, the Naive Bayes classifier, which is a supervised machine learning algorithm.

Structure of the Spam Filter

Introduction

The Naive Bayes classifier is a supervised machine learning algorithm based on Bayes’ theorem. It is one of the most popular tools for spam filtering.

Example text for spam

Examples of spam email text:

1. Have a pleasurable stay! Get up to 30% off + Flat 20% Cashback on Oyo Room bookings done via Paytm.

2. Let’s Talk Fashion! Get flat 40% Cashback on Backpacks, Watches, Perfumes, Sunglasses & more.

We can classify spam emails by keywords such as offer, cashback, and gifts. Bayesian filters learn a spam probability for each word; words such as “Viagra” and “refinance” typically receive a high one.

Bayes Theorem

Bayes’ theorem gives the probability of an event given that another event has already occurred. It is a mathematical formula for determining conditional probability.

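Bayes’ theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) P(A) / P(B)

where

P(A|B) is the posterior probability: the probability of class A given predictor B.

P(B|A) is the likelihood: the probability of predictor B given class A.

P(A) is the class prior probability.

P(B) is the predictor prior probability.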
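
As a worked illustration of the theorem, the sketch below computes the posterior probability that an email is spam given that it contains one keyword. All the numbers are hypothetical and chosen only to make the arithmetic easy to follow; the function name `posterior` is likewise just for illustration.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
p_spam = 0.4               # P(A): prior probability that an email is spam
p_word_given_spam = 0.30   # P(B|A): "cashback" appears in a spam email
p_word_given_ham = 0.02    # "cashback" appears in a non-spam email

# P(B): overall probability of seeing the word, summed over spam and ham.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 3))  # 0.909
```

With these assumed values, seeing the word raises the spam probability from the prior of 0.4 to about 0.91.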

Types of Naive Bayes Algorithm

There are several variants of the Naive Bayes classifier. The ones discussed here are:

  1. Multivariate Bernoulli Naive Bayes
  2. Multinomial Naive Bayes
  3. Gaussian Naive Bayes
  4. Flexible Bayes

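Multivariate Bernoulli Naive Bayes

The multivariate Bernoulli NB represents each message as a vector of binary attributes, where each attribute shows whether a given token occurs in the message or not.

Gaussian Naive Bayes

The multivariate Bernoulli NB can be modified for real-valued attributes by assuming that, within each category c, each attribute x_i follows a normal distribution g(x_i; µ_{i,c}, σ_{i,c}).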

Multinomial Naive Bayes

Multinomial NB with TF (term frequency) attributes treats each message d as a bag of tokens, containing each token as many times as it occurs in d. Hence, d can be represented by a vector x = ⟨x_1, ..., x_m⟩, where each x_i is the number of occurrences of token t_i in d. Furthermore, each message d of category c is seen as the result of independently picking |d| tokens from the token set F with replacement, where token t_i is picked with probability p(t_i | c).
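
The multinomial model can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production classifier: the toy messages, the function names `train_multinomial_nb` and `classify`, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter
from math import log

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    vocab = {t for d in docs for t in d}
    class_docs = Counter(labels)
    token_counts = {c: Counter() for c in class_docs}
    for tokens, c in zip(docs, labels):
        token_counts[c].update(tokens)
    log_prior = {c: log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c, counts in token_counts.items():
        total = sum(counts.values())
        # Add-one smoothing so tokens unseen in a class do not give log(0).
        log_like[c] = {t: log((counts[t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return vocab, log_prior, log_like

def classify(tokens, vocab, log_prior, log_like):
    # x_i = number of occurrences of token t_i in the message;
    # tokens outside the training vocabulary are ignored.
    tf = Counter(t for t in tokens if t in vocab)
    scores = {c: log_prior[c] + sum(x * log_like[c][t] for t, x in tf.items())
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy training data, invented for illustration.
docs = [
    "get flat cashback offer".split(),
    "cashback offer on watches".split(),
    "meeting agenda for monday".split(),
    "project meeting notes".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(docs, labels)

print(classify("flat cashback on backpacks".split(), *model))  # spam
print(classify("monday project notes".split(), *model))        # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.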

Flexible Bayes

Instead of using a single normal distribution for each attribute per category, Flexible Bayes (FB) models p(x_i | c) as the average of L_{i,c} normal distributions with different mean values.
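
The difference between a single normal distribution and the Flexible Bayes average can be sketched as follows. The function names and sample values are hypothetical, and the sketch assumes that all component distributions share one standard deviation, which the text above does not specify.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution g(x; mu, sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def flexible_bayes_pdf(x, means, sigma):
    """Average of len(means) normal densities with different means.

    Assumption for this sketch: every component uses the same sigma.
    """
    return sum(normal_pdf(x, mu, sigma) for mu in means) / len(means)

# Single normal distribution for an attribute within one category.
single = normal_pdf(1.0, mu=0.0, sigma=1.0)

# Flexible Bayes: average of two normals centred at 0.0 and 2.0.
mixture = flexible_bayes_pdf(1.0, means=[0.0, 2.0], sigma=1.0)

print(round(single, 4), round(mixture, 4))
```

With a single mean, `flexible_bayes_pdf` reduces to the plain Gaussian case.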

Steps to classify the emails

Preprocessing

We use NLTK for processing the messages, WordCloud and matplotlib for visualization, pandas for loading data, and NumPy for generating random probabilities for the train-test split.

Example code for preprocessing

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from math import log, sqrt
import pandas as pd
import numpy as np
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline

This code uses the Porter algorithm for stemming.

Tokenizing

Then we tokenize each message in the dataset. Tokenization is the task of splitting up a message into pieces and throwing away the punctuation characters.
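
A dependency-free sketch of this step is shown below. It uses a regular expression rather than NLTK so that it runs without downloading any tokenizer data; the function name `tokenize` and the choice to lowercase the text are assumptions made for the example. NLTK's `word_tokenize` splits text differently and returns punctuation marks as separate tokens, so it would need an extra filtering step to drop them.

```python
import re

def tokenize(message):
    """Lowercase a message and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", message.lower())

tokens = tokenize("Let's Talk Fashion! Get flat 40% Cashback on Backpacks & more.")
print(tokens)
# ['let', 's', 'talk', 'fashion', 'get', 'flat', '40', 'cashback',
#  'on', 'backpacks', 'more']
```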

Training

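Analyze each email into its component words.

Create a spam probability for each word w:

P[w] = Cspam(w) / (Cham(w) + Cspam(w))

where Cspam(w) and Cham(w) are the number of times w appears in spam and in non-spam (ham) training emails.

Store these spamminess values in a database.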
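
The training steps can be sketched as below. This is one possible reading, not necessarily the author's exact method: it assumes that a word's spamminess is the fraction of the emails containing it that are spam, it counts each word once per email, and it uses an in-memory dict in place of a database. The function name and the toy emails are hypothetical.

```python
from collections import Counter

def word_spamminess(spam_docs, ham_docs):
    """Return {word: spam count / (spam count + ham count)}."""
    c_spam, c_ham = Counter(), Counter()
    for tokens in spam_docs:
        c_spam.update(set(tokens))   # count each word once per email
    for tokens in ham_docs:
        c_ham.update(set(tokens))
    words = set(c_spam) | set(c_ham)
    return {w: c_spam[w] / (c_spam[w] + c_ham[w]) for w in words}

# Toy tokenized emails, invented for illustration.
spam_docs = [["flat", "cashback", "offer"], ["cashback", "on", "watches"]]
ham_docs = [["meeting", "on", "monday"]]

spamminess = word_spamminess(spam_docs, ham_docs)  # stands in for the database
print(spamminess["cashback"], spamminess["on"], spamminess["meeting"])
# 1.0 0.5 0.0
```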

Example code for training the classifiers

import os

import numpy as np

from collections import Counter
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC

# make_Dictionary and extract_features are helper functions assumed to be
# defined elsewhere; they are not shown in this article.

# Create a dictionary of words with its frequency
train_dir = 'train-emails'
dictionary = make_Dictionary(train_dir)

# Prepare feature vectors per training email and its labels
train_labels = np.zeros(702)
# The slice end is exclusive, so 702 is needed to label the last email
train_labels[351:702] = 1
train_matrix = extract_features(train_dir)

# Training SVM and Naive Bayes classifier
model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix, train_labels)
model2.fit(train_matrix, train_labels)

# Test the unseen emails for Spam
test_dir = 'test-emails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)

# Rows are true labels, columns are predicted labels
print(confusion_matrix(test_labels, result1))
print(confusion_matrix(test_labels, result2))
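
The training code above calls two helper functions that the article does not show. The sketch below is only a guess at what such helpers might do, and it works on in-memory strings rather than email directories. The names `build_dictionary` and `build_feature_matrix`, the `size` parameter, and the toy messages are all hypothetical.

```python
from collections import Counter

def build_dictionary(messages, size=3000):
    """Return the `size` most frequent alphabetic words across all messages."""
    counts = Counter(word
                     for m in messages
                     for word in m.lower().split()
                     if word.isalpha())
    return [w for w, _ in counts.most_common(size)]

def build_feature_matrix(messages, dictionary):
    """One row per message; column j counts occurrences of dictionary[j]."""
    index = {w: j for j, w in enumerate(dictionary)}
    matrix = []
    for m in messages:
        row = [0] * len(dictionary)
        for word in m.lower().split():
            if word in index:
                row[index[word]] += 1
        matrix.append(row)
    return matrix

# Toy messages, invented for illustration.
messages = ["flat cashback offer", "cashback on watches", "meeting on monday"]
dictionary = build_dictionary(messages, size=2)
features = build_feature_matrix(messages, dictionary)
print(dictionary)
print(features)
```

A matrix of word counts like this is the kind of input that `MultinomialNB.fit` expects, with one row per email.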

Filtering

For each message M, calculate the overall message filtering indication:

I[M] = f(s[M])

where s[M] is the spamminess of the message, derived from the stored values of its words, and f is a filter-dependent function. Then:

if I[M] > threshold
    the message is marked as spam
else
    the message is marked as non-spam
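
The filtering rule can be sketched as follows. Because f depends on the filter, this example picks one common choice as an assumption: combining the per-word spam probabilities under the naive independence assumption. The function names, the default value of 0.5 for unknown words, the clamping bounds, the threshold of 0.9, and the stored values are all hypothetical.

```python
from math import prod

def spam_indication(tokens, spamminess, default=0.5):
    """I[M]: combine per-word spam probabilities into one score in (0, 1)."""
    probs = [spamminess.get(t, default) for t in tokens]
    # Clamp so a single 0.0 or 1.0 cannot force the result or divide by zero.
    probs = [min(max(p, 0.01), 0.99) for p in probs]
    spam_score = prod(probs)
    ham_score = prod(1 - p for p in probs)
    return spam_score / (spam_score + ham_score)

def filter_message(tokens, spamminess, threshold=0.9):
    """Mark the message as spam when the indication exceeds the threshold."""
    if spam_indication(tokens, spamminess) > threshold:
        return "spam"
    return "non-spam"

# Hypothetical stored spamminess values.
spamminess = {"cashback": 0.95, "offer": 0.9, "meeting": 0.05}

print(filter_message(["cashback", "offer", "today"], spamminess))  # spam
print(filter_message(["meeting", "today"], spamminess))            # non-spam
```

Raising the threshold makes the filter more conservative, so fewer legitimate emails are marked as spam at the cost of letting more spam through.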

Conclusion

Before predicting spam emails, the messages must be preprocessed and tokenized. Once the messages are split into tokens, a trained Naive Bayes classifier can label each email as spam or ham.


About Monisha M

Editorial Staff Intern Pandian Saraswathi Yadav Engineering College, interested in Data Analytics and Machine Learning.
