IMDB Sentiment Analysis

Importing Libraries

In [1]:
# importing the libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import string
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split as tts
In [2]:
# importing the dataset
data = pd.read_csv("IMDB-Dataset.csv")

Understanding the Dataset

In [3]:
# first rows of the dataset
data.head()
Out[3]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
In [4]:
# shape of the data
data.shape
Out[4]:
(50000, 2)
In [5]:
# column names 
data.columns
Out[5]:
Index(['review', 'sentiment'], dtype='object')
In [6]:
# count of unique values in the column
data['sentiment'].value_counts()
Out[6]:
positive    25000
negative    25000
Name: sentiment, dtype: int64

The dataset is balanced: there is an equal number of data points in the positive and negative sentiment classes.

In [7]:
# first 10 rows of the dataset
data.head(10)
Out[7]:
review sentiment
0 One of the other reviewers has mentioned that ... positive
1 A wonderful little production. <br /><br />The... positive
2 I thought this was a wonderful way to spend ti... positive
3 Basically there's a family where a little boy ... negative
4 Petter Mattei's "Love in the Time of Money" is... positive
5 Probably my all-time favorite movie, a story o... positive
6 I sure would like to see a resurrection of a u... positive
7 This show was an amazing, fresh & innovative i... negative
8 Encouraged by the positive comments about this... negative
9 If you like original gut wrenching laughter yo... positive
In [8]:
# last 5 rows of the dataset
data.tail(5)
Out[8]:
review sentiment
49995 I thought this movie did a down right good job... positive
49996 Bad plot, bad dialogue, bad acting, idiotic di... negative
49997 I am a Catholic taught in parochial elementary... negative
49998 I'm going to have to disagree with the previou... negative
49999 No one expects the Star Trek movies to be high... negative

Processing the Data

In [9]:
def clean_text1(text):
    # first cleaning pass: lowercase, then strip bracketed text,
    # punctuation, and tokens that contain digits
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

cleaned1 = clean_text1  # alias applied to the review column below
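
As a quick check of what this first pass does, the function can be applied to a made-up snippet (not a row from the dataset): bracketed text, punctuation, and tokens containing digits are all removed. Note that HTML remnants such as <br /> are not handled specially; only their punctuation is stripped, which is why the bare token br survives in the cleaned reviews shown below.

sample = "An [edited] review: 10/10, truly GREAT!!! Watch b4 it leaves."
print(clean_text1(sample))
# bracketed text, punctuation, and the digit-containing tokens "10/10" and "b4" are removed
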
In [10]:
data['review'] = data['review'].apply(cleaned1)  # apply the first cleaning pass
In [11]:
data.head()
Out[11]:
review sentiment
0 one of the other reviewers has mentioned that ... positive
1 a wonderful little production br br the filmin... positive
2 i thought this was a wonderful way to spend ti... positive
3 basically theres a family where a little boy j... negative
4 petter matteis love in the time of money is a ... positive
In [12]:
# second round of cleaning: strip leftover quote marks, commas, and newlines
def clean_text2(text):
    text = re.sub('[\'",]', '', text)
    text = re.sub(r'\n', '', text)
    return text

cleaned2 = clean_text2
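
Most punctuation is already gone after the first pass, so this second pass is mainly a safety net. On a made-up string it behaves like this:

snippet = 'it was "fine",\n really'
print(clean_text2(snippet))
# -> it was fine really   (the quotes, the comma, and the newline are stripped)
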
In [13]:
data['review'] = data['review'].apply(cleaned2)  # apply the second cleaning pass
data.head()
Out[13]:
review sentiment
0 one of the other reviewers has mentioned that ... positive
1 a wonderful little production br br the filmin... positive
2 i thought this was a wonderful way to spend ti... positive
3 basically theres a family where a little boy j... negative
4 petter matteis love in the time of money is a ... positive

Splitting the Data

In [14]:
x = data.iloc[:, 0].values   # review text
y = data.iloc[:, 1].values   # sentiment labels
In [40]:
xtrain, xtest, ytrain, ytest = tts(x, y, test_size=0.26, random_state=5)
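
Since the two classes are perfectly balanced, a plain random split is adequate here. If you want to guarantee that the 50/50 ratio is preserved exactly in both the training and test sets, train_test_split also accepts a stratify argument; a sketch of that alternative (not what was run above):

# alternative split that preserves the positive/negative ratio in both subsets
xtrain_s, xtest_s, ytrain_s, ytest_s = tts(
    x, y, test_size=0.26, random_state=5, stratify=y
)
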

Extracting the Features

In [41]:
# TF-IDF turns each raw review into a vector of weighted term frequencies
tf = TfidfVectorizer()
from sklearn.pipeline import Pipeline
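
To get a feel for what TfidfVectorizer produces before wiring it into a pipeline, here is a tiny standalone illustration on two made-up documents: fit_transform learns a vocabulary and returns a sparse matrix with one row of TF-IDF weights per document.

# toy illustration of TF-IDF on two short documents (not part of the dataset)
toy_docs = ["a wonderful little production", "a wonderful waste of time"]
toy_tf = TfidfVectorizer()
toy_matrix = toy_tf.fit_transform(toy_docs)
print(toy_tf.get_feature_names_out())  # learned vocabulary (get_feature_names() on older scikit-learn)
print(toy_matrix.shape)                # (number of documents, vocabulary size)
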

Building the Model

In [42]:
from sklearn.linear_model import LogisticRegression

# pipeline: TF-IDF vectorization followed by a logistic regression classifier
classifier = LogisticRegression()
model = Pipeline([('vectorizer', tf), ('classifier', classifier)])

model.fit(xtrain, ytrain)
Out[42]:
Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', LogisticRegression())])
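
Because the vectorizer and the classifier are wrapped in a single Pipeline, the fitted model accepts raw review text directly; the strings below are made up, and the exact predictions naturally depend on the trained model.

# the pipeline vectorizes and classifies raw text in one step
print(model.predict(["what a wonderful, heartwarming film"]))          # expected: ['positive']
print(model.predict(["a complete waste of two hours, awful acting"]))  # expected: ['negative']
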
In [43]:
ypred = model.predict(xtest)  # predicted sentiment labels for the test set
In [44]:
# accuracy on the test set
accuracy_score(ytest, ypred)
Out[44]:
0.8955384615384615
In [45]:
# confusion matrix: rows are true labels, columns are predicted labels (order: negative, positive)
A = confusion_matrix(ytest, ypred)
print(A)
[[5629  757]
 [ 601 6013]]
In [46]:
# F1 score for the 'negative' class (row/column 0 of the confusion matrix)
precision = A[0][0] / (A[0][0] + A[1][0])   # TP / (TP + FP), column 0
recall = A[0][0] / (A[0][0] + A[0][1])      # TP / (TP + FN), row 0
F1 = 2 * precision * recall / (precision + recall)
print(F1)
0.8923589093214964
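
The manual computation above corresponds to the class in row/column 0 of the confusion matrix, i.e. the negative class. The same numbers can be cross-checked with scikit-learn's built-in metrics:

from sklearn.metrics import f1_score, classification_report

# F1 for the 'negative' class, cross-checking the manual calculation above
print(f1_score(ytest, ypred, pos_label='negative'))
# per-class precision, recall and F1 in one report
print(classification_report(ytest, ypred))
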