Sentiment Analysis in Python using NLTK

Many organizations analyze troves of reviews for outreach and measurement, but are they capturing the overall tone of those reviews? When I reference the concept of tone, I mean whether the text conveys a positive or negative reception. This is where sentiment analysis can be used to determine the tone of reception, whether it's reviews of the latest product or feedback from surveys. Sentiment analysis refers to the process of analyzing large volumes of text to determine whether it expresses a positive, neutral, or negative sentiment. It has become largely commoditized as tools make it quick to run a sentiment check, and in this blog I'm exploring the concept using the Natural Language Toolkit (NLTK) for Python. This is adapted from a kaggle.com notebook, with some modifications to the original workflow.

Use of Kaggle Notebooks

To get started you'll need the following items:

  • A Kaggle.com account (otherwise, download the dataset and notebook and run them in Google Colab or locally via a Jupyter notebook) (link)
  • Python

If you don't have an account on Kaggle, you can download the dataset that mirrors the code.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

import nltk

We start with the imports we'll need, assuming we're running in some form of Jupyter notebook so we can step through code as it executes, and then we read our data. I'll add a describe statement, but roughly the output should mirror what's below.

# Read in data
df = pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')
print(df.shape)
df = df.head(500)
print(df.shape)
df.describe() # Added to give us a preview

We can see our columns include HelpfulnessNumerator, HelpfulnessDenominator, Score, and Time.

The next block should give us a better idea of the dataset.

df.head()

Based on the information we have, we can begin some exploratory analysis. We know we have a "Score" column; here it is graphed in a line chart.

ax = df['Score'].value_counts().sort_index() \
    .plot(kind='line',
          title='Count of Reviews by Stars',
          figsize=(10, 5))
ax.set_xlabel('Review by Stars')
plt.show()

We can also model this as a bar chart to make the maximum a bit more readable.
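A minimal sketch of that bar-chart variant (only `kind` changes from the line chart above); the small frame here is a hypothetical stand-in for the real `df` so the snippet is self-contained:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in for the review data: a handful of star scores
df = pd.DataFrame({'Score': [5, 5, 5, 4, 3, 5, 1, 2, 5, 4]})

ax = df['Score'].value_counts().sort_index() \
    .plot(kind='bar',
          title='Count of Reviews by Stars',
          figsize=(10, 5))
ax.set_xlabel('Review by Stars')
```

In a notebook you would follow this with `plt.show()` as in the line-chart example.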

Now, to see what text we are working with: we know our DataFrame has a ["Text"] column, so let's observe it and make a random selection.

text = df["Text"]
print(text)

Let's go with number 497, since I see the text "Kettle chips" there; I'll opt for something a little out there.

text = df["Text"][497]
print(text)

Now that we have our text, we can run the following steps against our NLTK import.
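One caveat: the tokenizer, tagger, and chunker each rely on NLTK data packages that may not be present on a fresh install. Assuming the standard package names (these can vary slightly across NLTK versions), a one-time download looks like this:

```python
import nltk

# One-time downloads for the models used below (cached after the first run)
for pkg in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(pkg, quiet=True)
```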

# First we tokenize our text
tokens = nltk.word_tokenize(text)
tokens[:10]

# Second we use pos_tag on our tokens to tag parts of speech (adjectives/adverbs often carry sentiment)
tagged = nltk.pos_tag(tokens)
tagged[:10]

# Third we run chunk.ne_chunk to group named entities within the tagged tokens
entities = nltk.chunk.ne_chunk(tagged)
entities.pprint()

As you can see, this outputs "Kettle" tagged as a Person and "Chips" as an Organization.

Use of VADER Sentiment Scoring

NLTK's SentimentIntensityAnalyzer is used to extract the neg/neu/pos scores of the text.

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. It is a sentiment analysis tool designed to understand social media text and informal language such as comments, reviews, and tweets. It works by analyzing the polarity of words and assigning a score, with the compound score ranging from -1 to +1:

  • Compound score > 0.05: positive sentiment
  • Compound score < -0.05: negative sentiment
  • Compound score between -0.05 and 0.05: neutral sentiment
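Those cut-offs are easy to express as a small helper; this is just the thresholds above written out, not something shipped with NLTK:

```python
def label_sentiment(compound):
    """Map a VADER compound score to a label using the common cut-offs."""
    if compound > 0.05:
        return 'positive'
    if compound < -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment(0.62))  # positive
```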

If you are running this in a Jupyter notebook and still need to install NLTK, you can run the following to use the library.

%pip install nltk
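The analyzer also depends on the VADER lexicon, which ships separately from the library itself; a one-time download covers it:

```python
import nltk
nltk.download('vader_lexicon', quiet=True)  # lexicon used by SentimentIntensityAnalyzer
```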

As an example of the code we are going to run, here are a few calls along with their polarity-score outputs.

from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

sia = SentimentIntensityAnalyzer()

sia.polarity_scores("This is the most lackluster customer support ever.")

Now we can pass our Kettle Chips review into the SentimentIntensityAnalyzer. As we can see, neutral comes back as the highest value in the sentiment polarity score.

sia.polarity_scores(text)

Now we can run the polarity score across our dataset; we use tqdm to show progress as we iterate over the DataFrame.

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

Then we run the following code.

west = pd.DataFrame(res).T
west = west.reset_index().rename(columns={'index': 'Id'})
west = west.merge(df, how='left')
west.head()

We can now plot the merged results, which I've stored as "west".

ax = sns.barplot(data=west, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

Now we can break this down further, since the merged dataset has 'pos', 'neu' and 'neg' columns.

figure, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=west, x='Score', y='pos', ax=axs[0])
sns.barplot(data=west, x='Score', y='neu', ax=axs[1])
sns.barplot(data=west, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.tight_layout()
plt.show()

Where to go from here?

From the small dataset we've analyzed, this introduces how sentiment analysis works as well as the measurements used to classify and group text. It is a small deviation from the Kaggle notebook with the same dataset, with just some minimal changes. Credit to Rob Mulla for the original code; the deviations and changes of nomenclature were made to describe the concepts further and share this knowledge widely. From here we can explore our own datasets, such as product reviews, to identify whether our product (or "fictional" product) is well received or not. We could also add a word cloud to identify trends that were either clustered or common.
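As a sketch of that trend idea, without reaching for a word-cloud library yet, a plain frequency count over a few made-up reviews (standing in for the real "Text" column) shows the shape of it:

```python
from collections import Counter
import re

# Hypothetical reviews standing in for the real 'Text' column
reviews = [
    'These kettle chips are great, great crunch!',
    'Great flavor but too salty for me.',
    'Not great, the chips were stale.',
]

# Lowercase, split into words, and count occurrences
words = re.findall(r"[a-z']+", ' '.join(reviews).lower())
common = Counter(words).most_common(2)
print(common)  # [('great', 4), ('chips', 2)]
```

The same counts are what a word cloud would render visually, with font size standing in for frequency.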

Sources

https://www.geeksforgeeks.org/python/python-sentiment-analysis-using-vader [Source 1]

https://www.kaggle.com/code/robikscube/sentiment-analysis-python-youtube-tutorial [Source 2]