IMDB Sentiment Analysis

Importing Libraries

In [1]:
# importing the libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import string
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split as tts
In [2]:
# importing the dataset
data = pd.read_csv("IMDB-Dataset.csv")

Understanding the Dataset

In [3]:
# first rows of the dataset
data.head()
Out[3]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [4]:
# shape of the data
data.shape
Out[4]:
(50000, 2)
In [5]:
# column names 
data.columns
Out[5]:
Index(['review', 'sentiment'], dtype='object')
In [6]:
# count of unique values in the column
data['sentiment'].value_counts()
Out[6]:
positive    25000
negative    25000
Name: sentiment, dtype: int64

The dataset is balanced: there is an equal number of data points in the positive and negative sentiment classes.

In [7]:
# first 10 rows of the dataset
data.head(10)
Out[7]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
5 Probably my all-time favorite movie, a story o... positive
6 I sure would like to see a resurrection of a u... positive
7 This show was an amazing, fresh & innovative i... negative
8 Encouraged by the positive comments about this... negative
9 If you like original gut wrenching laughter yo... positive
In [8]:
# last 5 rows of the dataset
data.tail(5)
Out[8]:
review sentiment
49995 I thought this movie did a down right good job... positive
49996 Bad plot, bad dialogue, bad acting, idiotic di... negative
49997 I am a Catholic taught in parochial elementary... negative
49998 I'm going to have to disagree with the previou... negative
49999 No one expects the Star Trek movies to be high... negative

Processing the Data

In [9]:
def clean_text1(text):
    # first cleaning pass: lowercase, then strip bracketed text,
    # punctuation, and tokens that contain digits
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

cleaned1 = clean_text1  # alias applied to the review column below
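
As a quick check of what this first pass does, the function can be applied to a made-up snippet (not a row from the dataset): bracketed text, punctuation, and tokens containing digits are all removed. Note that HTML remnants such as <br /> are not handled specially; only their punctuation is stripped, which is why the bare token br survives in the cleaned reviews shown below.

sample = "An [edited] review: 10/10, truly GREAT!!! Watch b4 it leaves."
print(clean_text1(sample))
# bracketed text, punctuation, and the digit-containing tokens "10/10" and "b4" are removed
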
In [10]:
data['review'] = data['review'].apply(cleaned1)  # apply the first cleaning pass
In [11]:
data.head()
Out[11]:
review sentiment
0 one of the other reviewers has mentioned that ... positive
1 a wonderful little production br br the filmin... positive
2 i thought this was a wonderful way to spend ti... positive
3 basically theres a family where a little boy j... negative
4 petter matteis love in the time of money is a ... positive
In [12]:
# second round of cleaning: strip leftover quote marks, commas, and newlines
def clean_text2(text):
    text = re.sub('[\'",]', '', text)
    text = re.sub(r'\n', '', text)
    return text

cleaned2 = clean_text2
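
Most punctuation is already gone after the first pass, so this second pass is mainly a safety net. On a made-up string it behaves like this:

snippet = 'it was "fine",\n really'
print(clean_text2(snippet))
# -> it was fine really   (the quotes, the comma, and the newline are stripped)
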
In [13]:
data['review'] = data['review'].apply(cleaned2)  # apply the second cleaning pass
data.head()
Out[13]:
review sentiment
0 one of the other reviewers has mentioned that ... positive
1 a wonderful little production br br the filmin... positive
2 i thought this was a wonderful way to spend ti... positive
3 basically theres a family where a little boy j... negative
4 petter matteis love in the time of money is a ... positive

Splitting the Data

In [14]:
x = data.iloc[:, 0].values   # review text
y = data.iloc[:, 1].values   # sentiment labels
In [40]:
xtrain, xtest, ytrain, ytest = tts(x, y, test_size=0.26, random_state=5)
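
Since the two classes are perfectly balanced, a plain random split is adequate here. If you want to guarantee that the 50/50 ratio is preserved exactly in both the training and test sets, train_test_split also accepts a stratify argument; a sketch of that alternative (not what was run above):

# alternative split that preserves the positive/negative ratio in both subsets
xtrain_s, xtest_s, ytrain_s, ytest_s = tts(
    x, y, test_size=0.26, random_state=5, stratify=y
)
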

Extracting the Features

In [41]:
# TF-IDF turns each raw review into a vector of weighted term frequencies
tf = TfidfVectorizer()
from sklearn.pipeline import Pipeline
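
To get a feel for what TfidfVectorizer produces before wiring it into a pipeline, here is a tiny standalone illustration on two made-up documents: fit_transform learns a vocabulary and returns a sparse matrix with one row of TF-IDF weights per document.

# toy illustration of TF-IDF on two short documents (not part of the dataset)
toy_docs = ["a wonderful little production", "a wonderful waste of time"]
toy_tf = TfidfVectorizer()
toy_matrix = toy_tf.fit_transform(toy_docs)
print(toy_tf.get_feature_names_out())  # learned vocabulary (get_feature_names() on older scikit-learn)
print(toy_matrix.shape)                # (number of documents, vocabulary size)
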

Building the Model

In [42]:
from sklearn.linear_model import LogisticRegression

# pipeline: TF-IDF vectorization followed by a logistic regression classifier
classifier = LogisticRegression()
model = Pipeline([('vectorizer', tf), ('classifier', classifier)])

model.fit(xtrain, ytrain)
Out[42]:
Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', LogisticRegression())])
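
Because the vectorizer and the classifier are wrapped in a single Pipeline, the fitted model accepts raw review text directly; the strings below are made up, and the exact predictions naturally depend on the trained model.

# the pipeline vectorizes and classifies raw text in one step
print(model.predict(["what a wonderful, heartwarming film"]))          # expected: ['positive']
print(model.predict(["a complete waste of two hours, awful acting"]))  # expected: ['negative']
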
In [43]:
ypred = model.predict(xtest)  # predicted sentiment labels for the test set
In [44]:
# accuracy on the test set
accuracy_score(ytest, ypred)
Out[44]:
0.8955384615384615
In [45]:
# confusion matrix: rows are true labels, columns are predicted labels (order: negative, positive)
A = confusion_matrix(ytest, ypred)
print(A)
[[5629  757]
 [ 601 6013]]
In [46]:
# F1 score for the 'negative' class (row/column 0 of the confusion matrix)
precision = A[0][0] / (A[0][0] + A[1][0])   # TP / (TP + FP), column 0
recall = A[0][0] / (A[0][0] + A[0][1])      # TP / (TP + FN), row 0
F1 = 2 * precision * recall / (precision + recall)
print(F1)
0.8923589093214964
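
The manual computation above corresponds to the class in row/column 0 of the confusion matrix, i.e. the negative class. The same numbers can be cross-checked with scikit-learn's built-in metrics:

from sklearn.metrics import f1_score, classification_report

# F1 for the 'negative' class, cross-checking the manual calculation above
print(f1_score(ytest, ypred, pos_label='negative'))
# per-class precision, recall and F1 in one report
print(classification_report(ytest, ypred))
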