Introduction
Spam is a fact of life on the internet. If you enable comments or contact sections on your website, you will have to deal with spammers. To prevent your site from making a poor first impression, you’ll need to find a way to stop spam in its tracks. This is especially important if you are developing a website without a content management system like WordPress, which comes bundled with spam-filtering plugins. You could also use an API like Akismet; however, this comes at a cost, which you can avoid by implementing a reasonably accurate model of your own.
Kaggle and data science bootcamps are great for learning how to build and optimize models, but they don’t teach you how to actually use these models in real-world scenarios. There is a major difference between building a model and deploying it for end users on the internet.
In this tutorial, you’re going to build an SMS spam detection web application. The application will be built in Python using the Django framework, and will include a machine learning model that you will train to detect SMS spam using a Naive Bayes classifier.
Naive Bayes classification
Naive Bayes is a simple probabilistic classifier built on the assumption that all features of the model are independent of one another. In the context of spam filters, we assume that every word in a message is independent of all other words, and we count words without regard to their context.
Given the set of words in a message, the classifier estimates the probability that the message is spam or not spam. The estimate is based on Bayes' formula, whose components are computed from the word frequencies across the whole collection of training messages.
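To make the idea concrete, here is a minimal from-scratch sketch of the scoring rule: for each class we add the log of the class prior to the logs of the smoothed per-word probabilities, and pick the class with the higher score. The four training messages, the function names, and the use of Laplace smoothing with alpha=1 are assumptions made for illustration; the tutorial itself relies on scikit-learn rather than this code.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 1 = spam, 0 = ham
train = [
    ("win free prize now", 1),
    ("free cash win", 1),
    ("see you at lunch", 0),
    ("are you coming home", 0),
]


def fit(examples):
    """Count words per class and messages per class."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def log_scores(text, word_counts, class_counts, vocab, alpha=1.0):
    """Return log P(class) + sum of log P(word | class) for each class."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are skipped
            # Laplace smoothing keeps unseen (word, class) pairs from zeroing the score
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[label] = score
    return scores


def predict(text, word_counts, class_counts, vocab):
    scores = log_scores(text, word_counts, class_counts, vocab)
    return max(scores, key=scores.get)


model = fit(train)
print(predict("free prize", *model))    # label for a spam-like message
print(predict("see you home", *model))  # label for a ham-like message
```

Working with logarithms avoids multiplying many small probabilities together, which would otherwise underflow on longer messages.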
Model Building
The data is a collection of SMS messages tagged as spam or ham that can be found here. First, we will use this dataset to build a prediction model that will accurately classify which texts are spam and which are not, then save the model to be used later for predictions.
Exploration of dataset
The first thing that should be done is to import dependencies. If you do not have the libraries installed, kindly do so before proceeding.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import joblib
import pickle
Next, we load the dataset using pandas:
df = pd.read_csv('https://raw.githubusercontent.com/paulwababu/datasets/main/spam.csv', encoding = 'latin-1')
print(df.head())
Drop the unwanted columns, like so:
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
We have to convert the non-numerical labels 'ham' and 'spam' into numerical values using the pandas map() function:
df['label'] = df['v1'].map({'ham': 0, 'spam': 1})
Then we have to separate the feature column (the independent variable) from the target column (the dependent variable).
The feature column is the one we predict from, and the target column holds the values we try to predict.
X = df['v2']
y = df['label']
ML Model Building
Let us now proceed to building our actual model.
cv = CountVectorizer()
X = cv.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = MultinomialNB()
model.fit(X_train,y_train)
#model.score(X_test,y_test)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1587
           1       0.93      0.92      0.92       252

    accuracy                           0.98      1839
   macro avg       0.96      0.95      0.96      1839
weighted avg       0.98      0.98      0.98      1839
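If the columns of the report are unfamiliar, this minimal sketch shows how precision, recall, and F1 are computed for the spam class. The labels are made up for illustration and are not taken from the SMS dataset, and the helper name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of the messages flagged as spam, how many really were spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real spam messages, how many were flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = spam, 0 = ham
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a spam filter, precision on the spam class matters a lot, because a false positive means a genuine message gets flagged.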
Not only is the Naive Bayes classifier easy to implement, it also gives very good results.
In the model-building code, we create a CountVectorizer, which transforms each message into a vector of word counts over the vocabulary of the whole dataset. We then split the data into training and test sets, fit a multinomial Naive Bayes model on the training set, and use the test set to produce the classification report. The multinomial variant is suitable for classification with discrete features (e.g., word counts for text classification).
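To see what a count vector looks like, here is a simplified pure-Python stand-in for the bag-of-words step. It is not scikit-learn's implementation; the function names are hypothetical, and the tokenizer (lowercase, tokens of two or more word characters, alphabetically sorted vocabulary) only approximates CountVectorizer's default behaviour.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(texts):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def transform(texts, vocab):
    """Turn each text into a list of counts, one entry per vocabulary word."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        # Words missing from the vocabulary are simply dropped
        rows.append([counts.get(word, 0) for word in vocab])
    return rows


texts = ["Free prize, claim your free prize", "See you at lunch"]
vocab = fit_vocabulary(texts)
matrix = transform(texts, vocab)

print(list(vocab))
for row in matrix:
    print(row)
```

Note that the vocabulary is fixed once it has been learned: a new message is always mapped onto the same columns, which is why the fitted vectorizer has to be kept alongside the model.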
Model and Vectorizer Persistence
After training the model, we need a way to persist it for future use without having to retrain. To achieve this, save the model to disk by adding the following lines of code:
# Save the model
joblib_file = "MultinomialNaiveBayesModel.joblib"
joblib.dump(model, joblib_file)
We also need to save the CountVectorizer we fitted earlier. It holds the vocabulary learned from the training data, and without it new messages cannot be converted into the same feature columns the model was trained on.
# Save the vectorizer
vec_file = 'MultinomialNaiveBayesModelVectorizer.pickle'
with open(vec_file, 'wb') as f:
    pickle.dump(cv, f)
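A quick way to check that the saved files are usable is to load them back and classify a message. The sketch below trains on four made-up messages (an assumption, standing in for the real dataset) and writes to a temporary directory so it does not touch your actual files; only the two file names are taken from the tutorial.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the SMS dataset (1 = spam, 0 = ham)
texts = ["win free prize now", "free cash win", "see you at lunch", "are you coming home"]
labels = [1, 1, 0, 0]

cv = CountVectorizer()
model = MultinomialNB().fit(cv.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "MultinomialNaiveBayesModel.joblib")
    vec_path = os.path.join(tmp, "MultinomialNaiveBayesModelVectorizer.pickle")

    # Save both artifacts the same way the tutorial does
    joblib.dump(model, model_path)
    with open(vec_path, "wb") as f:
        pickle.dump(cv, f)

    # Load them back as a fresh process would
    loaded_model = joblib.load(model_path)
    with open(vec_path, "rb") as f:
        loaded_cv = pickle.load(f)

# The loaded vectorizer must be used with transform(), never fit_transform()
prediction = loaded_model.predict(loaded_cv.transform(["free prize"]))
print(int(prediction[0]))
```

Keep in mind that pickle and joblib files should only be loaded from sources you trust, and that they are generally tied to the scikit-learn version that created them.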
If we later need to retrain the model, we can use the partial_fit function to keep updating it with newly labelled messages in case its accuracy degrades over time. I will post a blog later that addresses how to identify and correct dataset shift in machine learning.
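As a rough sketch of incremental training, MultinomialNB.partial_fit can be called batch by batch. The messages and labels here are invented for illustration, and the example uses partial_fit from the very first batch, which is an assumption rather than what the training script above does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical first batch (1 = spam, 0 = ham)
initial_texts = ["win free prize now", "see you at lunch"]
initial_labels = [1, 0]

cv = CountVectorizer()
X_initial = cv.fit_transform(initial_texts)

model = MultinomialNB()
# On the first call, classes must list every label the model will ever see
model.partial_fit(X_initial, initial_labels, classes=[0, 1])

# Hypothetical later batch of newly labelled messages
new_texts = ["free prize waiting", "lunch at noon"]
new_labels = [1, 0]

# Reuse the fitted vocabulary: words unseen at fit time ("waiting", "noon") are ignored
X_new = cv.transform(new_texts)
model.partial_fit(X_new, new_labels)

print(model.class_count_)
print(int(model.predict(cv.transform(["free prize"]))[0]))
```

Because the vocabulary is frozen, this approach cannot learn brand-new spam words; when the wording of spam changes a lot, refitting both the vectorizer and the model on fresh data is the safer option.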
Turning the Spam Message Classifier into a Django Web Application
Having trained and saved the model for classifying SMS messages in the previous section, we will develop a web application that consists of a simple web page with a form that lets us enter a message. After we submit the message, the application classifies it and stores the result, and an inbox page shows whether each message was labelled spam or not spam.
Below is a snapshot of the final implementation
Following Python best practices, we will create a virtual environment for our project, and install the required packages.
First, create the project directory.
$ mkdir djangoapp
$ cd djangoapp
Now, create a virtual environment and install the required packages.
For macOS and Unix systems:
$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ pip install django requests numpy joblib scikit-learn
For Windows:
$ python -m venv myenv
$ myenv\Scripts\activate
(myenv) $ pip install django requests numpy joblib scikit-learn
Setting Up Your Django Application
First, navigate to the directory djangoapp we created and establish a Django project.
(myenv) $ django-admin startproject mainapp
This will auto-generate some files for your project skeleton:
mainapp/
    manage.py
    mainapp/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py
Now, navigate to the directory you just created (make sure you are in the same directory as manage.py) and create your app directory.
(myenv) $ python manage.py startapp monitor
This will create the following:
monitor/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py
In the mainapp/settings.py file, look for the INSTALLED_APPS list and add the app we just created:
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'monitor',  # new line
]
Make sure you are in the monitor directory, then create a new directory called templates and a new file called urls.py. The directory structure of the monitor application should look like this:
monitor/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    templates/
    models.py
    tests.py
    urls.py
    views.py
In your mainapp/urls.py file, include the monitor app's URLs, which we shall create next:
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    #path('admin/', admin.site.urls),
    path('', include('monitor.urls')),  # monitor app url
]
Now, in the monitor/urls.py file, add the URL patterns for our two views:
from django.urls import path
from . import views

urlpatterns = [
    path('', views.sms, name='sms'),
    path('inbox/', views.inbox, name='inbox'),
]
Let’s create another directory to store our machine learning model. I’ll also add the dataset to the project for those who want to have the whole dataset. (It is not compulsory to create a data folder.) Create these folders in the same directory as manage.py, and be sure to move the vectorizer file and the joblib file we created earlier into the ml/models folder.
(myenv) $ mkdir ml
(myenv) $ mkdir ml/models
(myenv) $ mkdir ml/data
We also need to tell Django where our machine learning model and our vectorizer file are located. Add these lines to the settings.py file:
import os
MODELS = os.path.join(BASE_DIR, 'ml/models')
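If you want to see which path this setting produces, the following sketch resolves it outside Django. The BASE_DIR value is a made-up example (in a real project Django defines it in settings.py), and the pathlib form is shown only as an equivalent alternative.

```python
import os
from pathlib import Path

# Hypothetical project root; Django normally sets BASE_DIR in settings.py
BASE_DIR = Path("/srv/djangoapp/mainapp")

# The form used in settings.py
MODELS = os.path.join(BASE_DIR, "ml/models")

# An equivalent pathlib form
MODELS_PATH = BASE_DIR / "ml" / "models"

# How the model file path is later built from the setting
model_file = os.path.join(MODELS, "MultinomialNaiveBayesModel.joblib")

print(MODELS)
print(model_file)
```

Either form works with os.path.join, since it accepts both strings and Path objects.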
Load Model and Vectorizer through apps.py
Load your machine learning models and your vectorizer in apps.py so that when the application starts, the trained model is loaded only once. Otherwise, the trained model is loaded each time an endpoint is called, and then the response time will be slower.
Let’s update apps.py
import os
import pickle

import joblib
from django.apps import AppConfig
from django.conf import settings


class ApiConfig(AppConfig):
    name = 'api'
    # Load the trained Naive Bayes model once, when this module is imported
    MODEL_FILE = os.path.join(settings.MODELS, "MultinomialNaiveBayesModel.joblib")
    model = joblib.load(MODEL_FILE)


class VectorizerConfig(AppConfig):
    name = 'api2'
    # The vectorizer was saved with pickle.dump, so load it with pickle.load
    MODEL_FILE = os.path.join(settings.MODELS, "MultinomialNaiveBayesModelVectorizer.pickle")
    with open(MODEL_FILE, 'rb') as f:
        model = pickle.load(f)
Edit models.py
Create the database model we shall use to store our classified messages. In the monitor/models.py file:
from django.db import models


# Create your models here.
class Monitor2(models.Model):
    message = models.CharField(max_length=50, blank=True, null=True)
    SPAM = 1
    HAM = 0
    IS_SPAM_OR_NAH = [(SPAM, 'spam'), (HAM, 'not_spam')]
    messageClassified = models.IntegerField(choices=IS_SPAM_OR_NAH, null=True)
    contact = models.CharField(max_length=50, blank=True, null=True)
Edit views.py
The views will be mainly responsible for three tasks:
- Process incoming POST requests.
- Make a prediction with the incoming data and give the result as a Response.
- Display the classified text into a HTML template.
from django.shortcuts import render

from .apps import ApiConfig, VectorizerConfig
from .models import Monitor2


def sms(request):
    if request.method == 'POST':
        number = request.POST['contact']
        message = str(request.POST['message'])
        # Reuse the model and vectorizer that were loaded once in apps.py
        naiveModel = ApiConfig.model
        naiveVect = VectorizerConfig.model
        # Vectorize the message with the vocabulary learned during training
        vect = naiveVect.transform([message])
        # predict() returns an array; store the single label as a plain int
        my_prediction = int(naiveModel.predict(vect)[0])
        saveNow = Monitor2(
            message=message,
            messageClassified=my_prediction,
            contact=number,
        )
        saveNow.save()
    return render(request, 'sms.html')


# inbox view
def inbox(request):
    dataSaved = Monitor2.objects.all()
    data = {
        "dataSaved": dataSaved,
    }
    return render(request, 'inbox.html', data)
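The prediction step of the sms view can also be pulled out into a small helper, which makes it easier to test without a running server. This is only a sketch: classify_message, FakeVectorizer, and FakeModel are hypothetical names, and the two fake classes are stand-ins that mimic the transform() and predict() methods of the real scikit-learn objects so the example runs without the trained files.

```python
def classify_message(model, vectorizer, text):
    """Return 1 for spam or 0 for ham as a plain Python int."""
    features = vectorizer.transform([str(text)])
    prediction = model.predict(features)
    # predict() returns one label per input row; we passed exactly one row
    return int(prediction[0])


class FakeVectorizer:
    """Stand-in for the fitted CountVectorizer: counts the word 'free'."""

    def transform(self, texts):
        return [[text.lower().count("free")] for text in texts]


class FakeModel:
    """Stand-in for the trained classifier: flags any row mentioning 'free'."""

    def predict(self, rows):
        return [1 if row[0] > 0 else 0 for row in rows]


print(classify_message(FakeModel(), FakeVectorizer(), "FREE entry to win"))
print(classify_message(FakeModel(), FakeVectorizer(), "See you at lunch"))
```

In the real view you would pass ApiConfig.model and VectorizerConfig.model instead of the fakes.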
In the monitor/templates folder, create the sms.html and inbox.html web pages and add the lines below:
monitor/templates/sms.html file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
</head>
<body>
<form id="myform" method="POST">
{% csrf_token %}
<div class="row">
<div class="col-6 form-group">
<input type="text" name="name1" class="form-control p-4" placeholder="Your Name" required="required">
</div>
<div class="col-6 form-group">
<input type="text" name="contact" class="form-control p-4" placeholder="Your Contact" required="required">
</div>
</div>
<div class="form-group">
<textarea class="form-control py-3 px-4" name="message" rows="5" placeholder="Message" required="required"></textarea>
</div>
<div>
<button class="btn btn-primary py-3 px-5" type="submit">Send Message</button>
</div>
</form>
</body>
</html>
monitor/templates/inbox.html file:
<!DOCTYPE html>
<html>
<style>
table, th, td {
border:1px solid black;
}
</style>
<body>
<h2>A basic HTML table</h2>
<table style="width:100%">
<tr>
<th>#</th>
<th>From</th>
<th>Body</th>
<th>Classification</th>
</tr>
{% for x in dataSaved %}
<tr>
<td>{{ forloop.counter }}</td>
<td>{{ x.contact }}</td>
<td>{{ x.message }}</td>
{% if x.messageClassified == 1 %}
<td>Spam</td>
{% else %}
<td>Non Spam</td>
{% endif %}
</tr>
{% endfor %}
</table>
</body>
</html>
Make the necessary migrations and start the development server like so:
(myenv) $ python manage.py makemigrations
(myenv) $ python manage.py migrate
(myenv) $ python manage.py runserver
Testing if it works!
Head over to http://127.0.0.1:8000 and submit the form with both spam and non-spam messages.
Proceed to http://127.0.0.1:8000/inbox/ to check out the classified data! Below is a snapshot of my implementation, sorry I couldn't make the CSS ;(