Spam messages classification
Text mining (deriving information from text) is a broad field that has gained popularity as the volume of generated text data has grown. Machine learning models have been used to automate many applications, such as sentiment analysis, document classification, topic classification, text summarization, and machine translation.
Spam filtering is a beginner's example of a document classification task: classifying an email as spam or non-spam (a.k.a. ham). The spam folder in your Gmail account is a familiar example. So let's get started building a spam filter on a publicly available mail corpus. I have extracted an equal number of spam and non-spam emails from the Ling-spam corpus. The extracted subset on which we will be working can be downloaded from here. Note that the code below reads a tab-separated file named SMSSpamCollection, with one label and one message per row.
We will walk through the following steps to build this application:
1. Preparing the text data
2. Creating the word dictionary
3. Feature extraction
4. Training the classifier
cd MyDrive/
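import pandas as pd
# Tab-separated file with no header row: first column is the label, second is the message text
msg = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'messages'])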
msg
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
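corpus = []
# Build the stop-word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))
for i in range(len(msg)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', msg['messages'][i])
    review = review.lower()
    review = review.split()
    # Stemming (not lemmatization): drop stop words and reduce each word to its Porter stem
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)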
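To make the cleaning step easier to inspect, here is a minimal sketch of the same idea on a single message, using only the standard library. It leaves out stemming, and the tiny `STOP_WORDS` set and the `clean_message` helper are illustrative assumptions, not part of NLTK or of the code above.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "the", "to", "you", "your", "have", "is", "for", "now"}

def clean_message(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(clean_message("WINNER!! You have won a prize. Call 0906 now!"))
# prints: winner won prize call
```

Digits and punctuation become spaces, so phone numbers and prices disappear entirely. That is a deliberate simplification in this pipeline, although such tokens can be informative for spam detection.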
# Creating the bag-of-words (BoW) model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)  # keep the 2500 most frequent words, discarding rarer ones
X = cv.fit_transform(corpus).toarray()
y = pd.get_dummies(msg['label'])  # one indicator column per label value
y = y.iloc[:, 1].values  # second column; if the labels are 'ham' and 'spam', spam maps to 1
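The bag-of-words step can be sketched by hand to show roughly what the count matrix contains. This is an illustration with a hypothetical `bag_of_words` helper and toy data, not CountVectorizer's implementation; among other details, it ignores that CountVectorizer's default tokenizer drops one-character tokens.

```python
from collections import Counter

def bag_of_words(corpus, max_features):
    """Return (vocabulary, count_matrix) for a list of cleaned strings."""
    totals = Counter(word for doc in corpus for word in doc.split())
    # Keep the most frequent words across the corpus, then order them alphabetically
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    matrix = []
    for doc in corpus:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

toy_corpus = ["free prize call", "call me later", "free free entry"]
vocab, X_toy = bag_of_words(toy_corpus, max_features=2)
print(vocab)   # ['call', 'free']
print(X_toy)   # [[1, 1], [1, 0], [0, 2]]
```

Each row is one message and each column is one vocabulary word, so word order inside a message is lost and only the counts remain.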
#train test split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size =.20,random_state = 0)
#train model
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train,Y_train)
Y_pred = spam_detect_model.predict(X_test)
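For intuition about what the classifier learns, here is a minimal standard-library sketch of multinomial Naive Bayes with additive (Laplace) smoothing. The function names and the toy data are assumptions for illustration; scikit-learn's MultinomialNB is the model actually used above, and this sketch only mirrors its basic formula with alpha = 1.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Fit class log-priors and per-word log-likelihoods with additive smoothing."""
    n_features = len(X[0])
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(word_totals) + alpha * n_features
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_likelihood": [math.log((c + alpha) / denom) for c in word_totals],
        }
    return model

def predict(model, row):
    """Return the label with the highest log-posterior score for one count vector."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            count * ll for count, ll in zip(row, params["log_likelihood"])
        )
    return max(model, key=score)

# Toy count matrix; columns are counts of "call", "free", "later"; 1 = spam, 0 = ham
X_toy = [[1, 2, 0], [0, 3, 0], [1, 0, 1], [0, 0, 2]]
y_toy = [1, 1, 0, 0]
nb = train_multinomial_nb(X_toy, y_toy)
print(predict(nb, [0, 1, 0]))  # 1 (a message containing only "free")
print(predict(nb, [1, 0, 1]))  # 0 (a message containing "call" and "later")
```

Smoothing matters because a word never seen in one class would otherwise get probability zero and veto that class for any message containing it.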
from sklearn.metrics import confusion_matrix
confusion_m = confusion_matrix(Y_test,Y_pred)
confusion_m
from sklearn.metrics import accuracy_score
acc = accuracy_score(Y_test,Y_pred)
print(acc)
With stemming, the model reaches 98% accuracy on the held-out test set.
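Accuracy alone hides which kind of mistake the filter makes, so it can help to read the confusion matrix cell by cell. The sketch below uses made-up toy labels (not results from this model) and a hypothetical `confusion_counts` helper; it follows the same layout as scikit-learn's confusion_matrix, with rows for the true label and columns for the predicted label.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix: rows = true label, columns = predicted label (0 = ham, 1 = spam)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels for illustration only
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)      # of the real spam messages, how many were caught
print(accuracy, precision, recall)  # 0.8 0.75 0.75
```

For a spam filter, a false positive (ham sent to the spam folder) is usually the more costly error, so precision on the spam class is worth checking alongside accuracy.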