N-Gram models

python

datacamp

machine learning

nlp

feature engineering

Author

kakamana

Published

March 30, 2023

Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews

This N-Gram models notebook is part of the DataCamp course Introduction to Natural Language Processing in Python. The course teaches techniques for extracting useful information from text and converting it into a format suitable for ML models. You will learn about POS tagging, named entity recognition, readability scores, n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn how to calculate the similarity between two documents. Along the way, you will predict the sentiment of movie reviews and build recommenders for movies and TED Talks. By the end of the course, you will be able to engineer critical features from any text and solve some of the most challenging problems in data science.

This is my learning experience of data science through DataCamp. These repository contributions are part of my learning journey through my graduate program, the Master of Applied Data Science (MADS) at the University of Michigan, as well as DeepLearning.AI, Coursera & DataCamp. You can find similar articles & more stories on my Medium & LinkedIn profiles. I am also available on Kaggle, GitHub blogs & GitHub repos. Thank you for your motivation, support & valuable feedback.

These include projects, coursework & notebooks that I worked through during my data science journey. They are created for reproducibility & future reference only. All source code, slides and screenshots are the intellectual property of the respective content authors. If you find this content beneficial, kindly consider a learning subscription from DeepLearning.AI, Coursera or DataCamp.

Code

!pip install spacy

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy

plt.rcParams['figure.figsize'] = (8, 8)

Building a bag of words model

Bag of words model
- Extract word tokens
- Compute frequency of word tokens
- Construct a word vector out of these frequencies and vocabulary of corpus
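
The three steps above can be sketched in plain Python. This is only an illustration under simple assumptions (whitespace tokenization, lowercasing); the helper name `bag_of_words` is hypothetical and is not how scikit-learn implements `CountVectorizer`.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, vectors) for a list of documents."""
    # Step 1: extract word tokens (naive whitespace split, lowercased)
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary of the corpus, sorted so every vector uses the same order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        # Step 2: compute the frequency of each token in this document
        counts = Counter(toks)
        # Step 3: build the word vector from the frequencies and the vocabulary
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


toy_docs = ["the lion is the king", "the lion roars"]
toy_vocabulary, toy_vectors = bag_of_words(toy_docs)
print(toy_vocabulary)  # ['is', 'king', 'lion', 'roars', 'the']
print(toy_vectors)     # [[1, 1, 1, 0, 2], [0, 0, 1, 1, 1]]
```

Each document becomes a vector as long as the vocabulary, with a zero wherever a vocabulary word does not occur in it.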

BoW model for movie taglines

In this exercise, you have been provided with a corpus of more than 7000 movie tag lines. Your job is to generate the bag of words representation bow_matrix for these taglines. For this exercise, we will ignore the text preprocessing step and generate bow_matrix directly.

Code

movies = pd.read_csv('dataset/movie_overviews.csv').dropna()
movies['tagline'] = movies['tagline'].str.lower()
movies.head()

	id	title	overview	tagline
1	8844	Jumanji	When siblings Judy and Peter discover an encha...	roll the dice and unleash the excitement!
2	15602	Grumpier Old Men	A family wedding reignites the ancient feud be...	still yelling. still fighting. still ready for...
3	31357	Waiting to Exhale	Cheated on, mistreated and stepped on, the wom...	friends are the people who let you be yourself...
4	11862	Father of the Bride Part II	Just when George Banks has recovered from his ...	just when his world is back to normal... he's ...
5	949	Heat	Obsessive master thief, Neil McCauley leads a ...	a los angeles crime saga

Code

corpus = movies['tagline']

Code

from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)

(7033, 6614)

You now know how to generate a bag of words representation for a given corpus of documents. Notice that the word vectors have more than 6600 dimensions. However, most of these dimensions are zero, since most words do not occur in any particular tagline.
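
To make the point about zeros concrete, the sketch below measures how sparse a bag of words matrix is. It uses a small made-up corpus (`toy_corpus` is an assumption, not the movie dataset) and relies on the `nnz` attribute of the SciPy sparse matrix that `CountVectorizer` returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the taglines
toy_corpus = [
    "roll the dice and unleash the excitement",
    "a los angeles crime saga",
    "still yelling still fighting",
]

toy_bow = CountVectorizer().fit_transform(toy_corpus)
n_rows, n_cols = toy_bow.shape

# nnz is the number of explicitly stored (non-zero) entries
sparsity = 1.0 - toy_bow.nnz / (n_rows * n_cols)
print(toy_bow.shape, round(sparsity, 3))
```

Even with three short documents, roughly two thirds of the entries are zero; the same calculation on `bow_matrix` should show a far higher share.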

Analyzing dimensionality and preprocessing

You have been provided with a lem_corpus that contains lowercased, lemmatized, and stopword-free versions of the movie taglines from the previous exercise.

In this exercise, you are required to generate the bag of words representation bow_lem_matrix for these lemmatized taglines and to compare its shape to that of the bow_matrix obtained in the previous exercise.

Code

!python -m spacy download en_core_web_sm

Code

nlp = spacy.load('en_core_web_sm')
stopwords = spacy.lang.en.stop_words.STOP_WORDS

Code

lem_corpus = corpus.apply(lambda row: ' '.join([t.lemma_ for t in nlp(row)
                                                if t.lemma_ not in stopwords
                                                and t.lemma_.isalpha()]))

Code

lem_corpus

1                            roll dice unleash excitement
2                                   yell fight ready love
3                            friend people let let forget
4                              world normal surprise life
5                                  los angeles crime saga
                              ...                        
9091                         kingsglaive final fantasy xv
9093                       happen vegas stay vegas happen
9095    decorate officer devote family man defend hono...
9097                              god incarnate city doom
9098                                      band know story
Name: tagline, Length: 7033, dtype: object

Code

vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)

(7033, 4941)
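
Preprocessing reduced the number of features from 6614 to 4941 while the number of taglines stayed the same. The sketch below reproduces the effect on a toy scale. Note the swap: instead of spaCy lemmatization it uses scikit-learn's built-in English stop word list, and `toy_taglines` is a made-up corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not the movie taglines
toy_taglines = [
    "the lion is the king of the jungle",
    "the lion is an endangered species",
    "lions have lifespans of a decade",
]

# Vocabulary size without any preprocessing
raw_features = CountVectorizer().fit_transform(toy_taglines).shape[1]

# Vocabulary size after dropping scikit-learn's built-in English stop words
filtered_features = (
    CountVectorizer(stop_words="english").fit_transform(toy_taglines).shape[1]
)

print(raw_features, filtered_features)
```

Lemmatization would shrink the vocabulary further, for example by mapping "lions" and "lion" to the same feature.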

Mapping feature indices with feature names

In the previous exercise, we learned that CountVectorizer does not necessarily index the vocabulary alphabetically. In this exercise, we will learn how to map each feature index to its corresponding feature name.

Code

sentences = ['The lion is the king of the jungle',
             'Lions have lifespans of a decade',
             'The lion is an endangered species']

Code

vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(sentences)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
bow_df

	an	decade	endangered	have	is	jungle	king	lifespans	lion	lions	of	species	the
0	0	0	0	0	1	1	1	0	1	0	1	0	3
1	0	1	0	1	0	0	0	1	0	1	1	0	0
2	1	0	1	0	1	0	0	0	1	0	0	1	1

Observe that the column names refer to the token whose frequency is being recorded. For example, since the first column name is an, the first feature represents how often 'an' occurs in a given sentence. The feature-name lookup (get_feature_names_out() in scikit-learn 1.0 and later, formerly get_feature_names()) returns the vocabulary ordered by feature index, which gives the mapping between feature indices and vocabulary terms.

Building a BoW Naive Bayes classifier

Steps
- Text preprocessing
- Building a bag-of-words model (or representation)
- Machine Learning

BoW vectors for movie reviews

You are given two pandas Series, X_train and X_test, which contain movie reviews. They represent the training and testing review data, respectively. Your task is to preprocess the reviews and generate BoW vectors for these two sets using CountVectorizer.

After we have generated the BoW vector matrices X_train_bow and X_test_bow, we will be able to apply a machine learning model to them and conduct sentiment analysis.

Code

movie_reviews = pd.read_csv('dataset/movie_reviews_clean.csv')
movie_reviews.head()

	review	sentiment
0	this anime series starts out great interesting...	0.0
1	some may go for a film like this but i most as...	0.0
2	i ve seen this piece of perfection during the ...	1.0
3	this movie is likely the worst movie i ve ever...	0.0
4	it ll soon be 10 yrs since this movie was rele...	1.0

Code

X = movie_reviews['review']
y = movie_reviews['sentiment']

Code

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Code

vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)

(757, 15158)
(253, 15158)
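
Both matrices have the same number of columns because the vocabulary is learned from the training reviews only, and the test reviews are then projected onto it. A minimal sketch with made-up documents (`train_docs` and `test_docs` are assumptions) shows what happens to a word that appears only in the test data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test documents
train_docs = ["great movie", "terrible movie"]
test_docs = ["great soundtrack"]

split_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training documents only
train_bow = split_vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; unseen words are silently ignored
test_bow = split_vectorizer.transform(test_docs)

print(sorted(split_vectorizer.vocabulary_))  # ['great', 'movie', 'terrible']
print(test_bow.toarray())                    # [[1 0 0]]
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the classifier trained on the training columns could no longer be applied.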

Predicting the sentiment of a movie review

In the previous exercise, you generated bag-of-words representations for the training and test movie review data. In this exercise, we will use these representations to train a Naive Bayes classifier that detects the sentiment of a movie review, and then compute its accuracy on the test set. Because this is a binary classification problem, the model can only classify a review as positive (1) or negative (0). It cannot detect neutral reviews.

Code

from sklearn.naive_bayes import MultinomialNB

# Some sentiment labels are missing, which raises "Input y contains NaN",
# so keep only the rows that have a label
train_mask = y_train.notna().to_numpy()
test_mask = y_test.notna().to_numpy()

# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow[train_mask], y_train[train_mask])

# Measure the accuracy
accuracy = clf.score(X_test_bow[test_mask], y_test[test_mask])
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = 'The movie was terrible. The music was underwhelming and the acting mediocre.'
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))
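
For a self-contained view of the same workflow, here is a toy sketch with four hand-written training reviews and complete labels. The data is made up purely for illustration and says nothing about the accuracy on the real review set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled reviews: 1 = positive, 0 = negative
train_reviews = ["good great fun",
                 "great acting good plot",
                 "terrible boring bad",
                 "bad plot boring acting"]
train_labels = [1, 1, 0, 0]

# Vectorize the training reviews and fit the classifier on the counts
toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB()
toy_clf.fit(toy_vectorizer.fit_transform(train_reviews), train_labels)

# New reviews must go through the same fitted vectorizer
new_reviews = ["good fun", "boring bad"]
predictions = toy_clf.predict(toy_vectorizer.transform(new_reviews))
print(predictions)  # [1 0]
```

Words such as "good" were only seen in positive reviews, so they pull the prediction towards class 1, and the reverse holds for "boring" and "bad".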

Building n-gram models

BoW shortcomings
    Example
        The movie was good and not boring -> positive
        The movie was not good and boring -> negative
    Both sentences have exactly the same BoW representation!
    The context of the words is lost.
    The sentiment depends on the position of "not".
n-grams
    A contiguous sequence of n elements (or words) in a given document.
    Bi-grams (n=2) / Tri-grams (n=3)
n-gram shortcomings
    They increase the number of dimensions, which leads to the curse of dimensionality.
    Higher-order n-grams are rare.
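
The two example sentences can be checked directly. The sketch below vectorizes them once with unigrams and once with bigrams only (`ngram_range=(2, 2)` is chosen here just to isolate the effect of word order).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

pair = ["The movie was good and not boring",
        "The movie was not good and boring"]

# Unigram bag of words: word order is discarded
unigram_counts = CountVectorizer().fit_transform(pair).toarray()

# Bigrams only: adjacent word pairs keep some of the order
bigram_counts = CountVectorizer(ngram_range=(2, 2)).fit_transform(pair).toarray()

print(np.array_equal(unigram_counts[0], unigram_counts[1]))  # True
print(np.array_equal(bigram_counts[0], bigram_counts[1]))    # False
```

With bigrams, "not boring" and "not good" become distinct features, so a classifier can tell the two sentences apart.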

n-gram models for movie tag lines

In this exercise, we have been provided with a corpus of more than 7000 movie tag lines (the same corpus used above). Our job is to generate n-gram models up to n=1, n=2 and n=3 for this data and find the number of features for each model.

We will then compare the number of features generated for each model.

Code

# Generate n-grams up to n=1 (unigrams only)
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" %
      (ng1.shape[1], ng2.shape[1], ng3.shape[1]))

ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively
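
The jump in feature count comes from every distinct word sequence becoming its own feature. A small pure-Python sketch (the helper `ngrams` is hypothetical, not part of scikit-learn) shows which sequences a single sentence contributes.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the movie was not good".split()

print(ngrams(tokens, 1))  # ['the', 'movie', 'was', 'not', 'good']
print(ngrams(tokens, 2))  # ['the movie', 'movie was', 'was not', 'not good']
print(ngrams(tokens, 3))  # ['the movie was', 'movie was not', 'was not good']
```

Single words repeat often across a corpus, but longer sequences rarely do, so the vocabulary keeps growing as n increases.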

Higher order n-grams for sentiment analysis

Similar to a previous exercise, we are going to build a classifier that can detect if the review of a particular movie is positive or negative. However, this time, we will use n-grams up to n=2 for the task.

Code

ng_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_ng = ng_vectorizer.fit_transform(X_train)
X_test_ng = ng_vectorizer.transform(X_test)

Code

# Some sentiment labels are missing, which raises "Input y contains NaN",
# so keep only the rows that have a label
train_mask = y_train.notna().to_numpy()
test_mask = y_test.notna().to_numpy()

# Create a MultinomialNB object
clf_ng = MultinomialNB()

# Fit the classifier
clf_ng.fit(X_train_ng[train_mask], y_train[train_mask])

# Measure the accuracy
accuracy = clf_ng.score(X_test_ng[test_mask], y_test[test_mask])
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = 'The movie was not good. The plot had several holes and the acting lacked panache'
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

Comparing performance of n-gram models

You now know how to conduct sentiment analysis by converting text into various n-gram representations and feeding them to a classifier. In this exercise, we will conduct sentiment analysis for the same movie reviews as before using two n-gram models: unigrams and n-grams up to n=3.

We will then compare the performance using three criteria: accuracy of the model on the test set, time taken to execute the program and the number of features created when generating the n-gram representation.

Code

import time

start_time = time.time()

# Drop reviews with a missing sentiment label, which otherwise raise
# "Input y contains NaN"
reviews = movie_reviews.dropna(subset=['sentiment'])

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(reviews['review'],
                                                    reviews['sentiment'],
                                                    test_size=0.5,
                                                    random_state=42,
                                                    stratify=reviews['sentiment'])

# Generating n-grams (unigrams only)
vectorizer = CountVectorizer(ngram_range=(1,1))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. " %
      (time.time() - start_time, clf.score(test_X, test_y)))
print("The ngram representation had %i features." % (train_X.shape[1]))

Code

start_time = time.time()

# Drop reviews with a missing sentiment label, which otherwise raise
# "Input y contains NaN"
reviews = movie_reviews.dropna(subset=['sentiment'])

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(reviews['review'],
                                                    reviews['sentiment'],
                                                    test_size=0.5,
                                                    random_state=42,
                                                    stratify=reviews['sentiment'])

# Generating n-grams up to n=3
vectorizer = CountVectorizer(ngram_range=(1,3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. " %
      (time.time() - start_time, clf.score(test_X, test_y)))
print("The ngram representation had %i features." % (train_X.shape[1]))
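
Since the two cells differ only in `ngram_range`, the comparison can be wrapped in one helper that reports all three criteria. This is a sketch: `evaluate_ngram_range` is a hypothetical helper, and the tiny dataset below is made up, so its numbers are not representative of the movie reviews.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def evaluate_ngram_range(train_docs, train_labels, test_docs, test_labels,
                         ngram_range):
    """Vectorize, fit Naive Bayes and report accuracy, time and feature count."""
    start = time.perf_counter()
    vec = CountVectorizer(ngram_range=ngram_range)
    X_tr = vec.fit_transform(train_docs)   # learn the vocabulary on training data
    X_te = vec.transform(test_docs)        # reuse it for the test data
    model = MultinomialNB().fit(X_tr, train_labels)
    return {
        "ngram_range": ngram_range,
        "accuracy": model.score(X_te, test_labels),
        "seconds": time.perf_counter() - start,
        "n_features": X_tr.shape[1],
    }


# Hypothetical toy data with complete labels
toy_train = ["good great fun", "great acting good plot",
             "terrible boring bad", "bad plot boring acting"]
toy_train_labels = [1, 1, 0, 0]
toy_test = ["good plot great fun", "boring bad acting"]
toy_test_labels = [1, 0]

results = [
    evaluate_ngram_range(toy_train, toy_train_labels,
                         toy_test, toy_test_labels, rng)
    for rng in [(1, 1), (1, 3)]
]
for row in results:
    print(row)
```

On real data the interesting question is whether the extra features and run time of the larger range buy a meaningful gain in accuracy.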