Dealing with text data

March 19, 2023

Finally, we’ll look at ways to engineer columnar features from unstructured text data, how different approaches affect how much context is extracted from a text, and how to balance the need for context against the risk of creating too many features.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
speech_df = pd.read_csv('dataset/inaugural_speeches.csv')
                Name         Inaugural Address                      Date                                               text
0  George Washington   First Inaugural Address  Thursday, April 30, 1789  Fellow-Citizens of the Senate and of the House...
1  George Washington  Second Inaugural Address     Monday, March 4, 1793  Fellow Citizens: I AM again called upon by th...
2         John Adams         Inaugural Address   Saturday, March 4, 1797  WHEN it was first perceived, in early times, t...
3   Thomas Jefferson   First Inaugural Address  Wednesday, March 4, 1801  Friends and Fellow-Citizens: CALLED upon to u...
4   Thomas Jefferson  Second Inaugural Address     Monday, March 4, 1805  PROCEEDING, fellow-citizens, to that qualifica...

Encoding text

Text cleanup

To convert an unstructured text string into a set of numeric columns that can be ingested by a machine learning model, multiple steps need to be taken. You should standardize your data and eliminate any characters that might cause problems later.

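# Replace all non-letter characters with a whitespace
# (regex=True is required on recent pandas, which otherwise treats the pattern as a literal string)
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)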

# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())
0    fellow citizens of the senate and of the house...
1    fellow citizens   i am again called upon by th...
2    when it was first perceived  in early times  t...
3    friends and fellow citizens   called upon to u...
4    proceeding  fellow citizens  to that qualifica...
Name: text_clean, dtype: object
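
To see exactly what the cleanup does, here is a minimal sketch on two made-up strings (the `toy` Series is hypothetical, not part of the speeches dataset). Every non-letter, including digits and punctuation, becomes a space, so runs of several spaces are expected in the result.

```python
import pandas as pd

# Hypothetical strings standing in for raw speech text
toy = pd.Series(["Fellow-Citizens: I AM again called!", "It's 1793, friends."])

# Replace every non-letter with a space, then lower-case the result
cleaned = toy.str.replace(r"[^a-zA-Z]", " ", regex=True).str.lower()

print(cleaned.tolist())
# Splitting on whitespace shows the surviving tokens
print(cleaned.str.split().tolist())
```

Note that the apostrophe in "It's" turns into a space, which leaves a stray token "s". Whether that matters depends on the model you plan to train.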

Feature-rich text

It is possible to calculate the length and number of words of free form text once the text has been cleaned and standardized.

# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average length of word
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print these columns for every speech
print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']])
                                           text_clean  char_cnt  word_cnt  \
0   fellow citizens of the senate and of the house...      8616      1432   
1   fellow citizens   i am again called upon by th...       787       135   
2   when it was first perceived  in early times  t...     13871      2323   
3   friends and fellow citizens   called upon to u...     10144      1736   
4   proceeding  fellow citizens  to that qualifica...     12902      2169   
5   unwilling to depart from examples of the most ...      7003      1179   
6   about to add the solemnity of an oath to the o...      7148      1211   
7   i should be destitute of feeling if i was not ...     19894      3382   
8   fellow citizens   i shall not attempt to descr...     26322      4466   
9   in compliance with an usage coeval with the ex...     17753      2922   
10  fellow citizens   about to undertake the arduo...      6818      1130   
11  fellow citizens   the will of the american peo...      7061      1179   
12  fellow citizens  the practice of all my predec...     23527      3912   
13  called from a retirement which i had supposed ...     32706      5585   
14  fellow citizens   without solicitation on my p...     28739      4821   
15  elected by the american people to the highest ...      6599      1092   
16  my countrymen   it a relief to feel that no he...     20089      3348   
17  fellow citizens   i appear before you this day...     16820      2839   
18  fellow citizens of the united states   in comp...     21032      3642   
19  fellow countrymen     at this second appearing...      3934       706   
20  citizens of the united states   your suffrages...      6521      1138   
21  fellow citizens   under providence i have been...      7736      1342   
22  fellow citizens   we have assembled to repeat ...     14969      2498   
23  fellow citizens   we stand to day upon an emin...     17774      2990   
24  fellow citizens   in the presence of this vast...     10155      1695   
25  fellow citizens   there is no constitutional o...     26175      4399   
26  my fellow citizens   in obedience of the manda...     12340      2028   
27  fellow citizens   in obedience to the will of ...     23691      3980   
28  my fellow citizens   when we assembled here on...     13426      2216   
29  my fellow citizens  no people on earth have mo...      5565       991   
30  my fellow citizens   anyone who has taken the ...     32160      5439   
31  there has been a change of government  it bega...      9554      1712   
32  my fellow citizens   the four years which have...      8402      1535   
33  my countrymen   when one surveys the world abo...     20294      3348   
34  my countrymen   no one can contemplate current...     23937      4055   
35  my countrymen   this occasion is not alone the...     22961      3771   
36  i am certain that my fellow americans expect t...     10910      1888   
37  when four years ago we met to inaugurate a pre...     10629      1831   
38  on each national day of inauguration since    ...      7674      1371   
39  mr  chief justice  mr  vice president  my frie...      3086       573   
40  mr  vice president  mr  chief justice  and fel...     13707      2292   
41  my friends  before i begin the expression of t...     14003      2475   
42  the price of peace mr  chairman  mr  vice pres...      9277      1688   
43  vice president johnson  mr  speaker  mr  chief...      7706      1390   
44  my fellow countrymen  on this occasion  the oa...      8242      1502   
45  senator dirksen  mr  chief justice  mr  vice p...     11701      2152   
46  mr  vice president  mr  speaker  mr  chief jus...     10048      1835   
47  for myself and for our nation  i want to thank...      6934      1238   
48  senator hatfield  mr  chief justice  mr  presi...     13787      2457   
49  senator mathias  chief justice burger  vice pr...     14601      2586   
50  mr  chief justice  mr  president  vice preside...     12536      2342   
51  my fellow citizens today we celebrate the myst...      9119      1608   
52  my fellow citizens at this last presidential i...     12374      2201   
53  president clinton  distinguished guests and my...      9084      1606   
54  vice president cheney  mr  chief justice  pres...     12199      2122   
55  my fellow citizens     i stand here today humb...     13637      2452   
56  vice president biden  mr  chief justice  membe...     12174      2151   
57  chief justice roberts  president carter  presi...      8555      1488   

0          6.016760  
1          5.829630  
2          5.971158  
3          5.843318  
4          5.948363  
5          5.939779  
6          5.902560  
7          5.882318  
8          5.893865  
9          6.075633  
10         6.033628  
11         5.988974  
12         6.014059  
13         5.856043  
14         5.961211  
15         6.043040  
16         6.000299  
17         5.924621  
18         5.774849  
19         5.572238  
20         5.730228  
21         5.764531  
22         5.992394  
23         5.944482  
24         5.991150  
25         5.950216  
26         6.084813  
27         5.952513  
28         6.058664  
29         5.615540  
30         5.912852  
31         5.580607  
32         5.473616  
33         6.061529  
34         5.903083  
35         6.088836  
36         5.778602  
37         5.805025  
38         5.597374  
39         5.385689  
40         5.980366  
41         5.657778  
42         5.495853  
43         5.543885  
44         5.487350  
45         5.437268  
46         5.475749  
47         5.600969  
48         5.611315  
49         5.646172  
50         5.352690  
51         5.671020  
52         5.621990  
53         5.656289  
54         5.748822  
55         5.561582  
56         5.659693  
57         5.749328  
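
One caveat about the ratio above: `char_cnt` counts spaces as well as letters, so `char_cnt / word_cnt` is somewhat larger than the true mean word length. A small sketch on a made-up cleaned string (hypothetical, not taken from the dataset) shows the difference; the letters-only variant is one possible alternative, not what the original code computes.

```python
import pandas as pd

# Hypothetical cleaned text with a run of three spaces, as produced by the cleanup step
s = pd.Series(["fellow citizens   i am again called"])

char_cnt = s.str.len()                                   # counts letters and spaces
word_cnt = s.str.split().str.len()                       # whitespace-separated tokens
letters = s.str.replace(" ", "", regex=False).str.len()  # letters only

ratio_with_spaces = char_cnt / word_cnt
ratio_letters_only = letters / word_cnt

print(char_cnt.iloc[0], word_cnt.iloc[0], letters.iloc[0])
print(ratio_with_spaces.iloc[0], ratio_letters_only.iloc[0])
```

Either version can serve as a feature, as long as the same definition is used for training and test data.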

Word counts

Word count (I)

In a similar manner to how you worked with categorical variables earlier, you can create features based on the actual content of each text.

For each unique word in the dataset a column is created.

For each entry, the number of times this word occurs is counted and the count value is entered into the respective column.

These “count” columns can then be used to train machine learning models.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
cv = CountVectorizer()

# Fit the vectorizer
cv.fit(speech_df['text_clean'])

# Print the first 20 feature names
# (get_feature_names_out requires scikit-learn 1.0 or later)
print(cv.get_feature_names_out()[:20].tolist())
['abandon', 'abandoned', 'abandonment', 'abate', 'abdicated', 'abeyance', 'abhorring', 'abide', 'abiding', 'abilities', 'ability', 'abject', 'able', 'ably', 'abnormal', 'abode', 'abolish', 'abolished', 'abolishing', 'aboriginal']

Counting words (II)

Once the vectorizer has been fitted to the data, it can be used to transform the text into an array of word counts. This array contains one row for each block of text and one column for each feature generated by the vectorizer.

# Apply the vectorizer
cv_transformed = cv.transform(speech_df['text_clean'])

# Convert the sparse result to a dense array and print its shape
cv_array = cv_transformed.toarray()
print(cv_array.shape)
(58, 9043)

Limiting your features

By default, CountVectorizer creates a feature for every single word in your corpus. This can lead to far too many features, often ones that have little analytical value.

To reduce the number of features, you can set the following parameters in CountVectorizer:

min_df : Ignore words that occur in fewer than this share of documents (a float between 0 and 1 is read as a proportion, an integer as an absolute number of documents). This can be used to remove outlier words that will not generalize across texts.

max_df : Ignore words that occur in more than this share of documents. This is useful for eliminating very common words that appear in nearly every document without adding value, such as “and” or “the”.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=0.2, max_df=0.8)

# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()

# Print the array shape
print(cv_array.shape)
(58, 818)

Text to DataFrame

To combine the count-based features with the rest of the dataset, first convert the array into a pandas DataFrame that uses the feature names you found earlier as column names, then concatenate it with the original DataFrame.

# Create a DataFrame with these features
cv_df = pd.DataFrame(cv_array,
                     columns=cv.get_feature_names_out()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
print(speech_df_new.head())
                Name         Inaugural Address                      Date  \
0  George Washington   First Inaugural Address  Thursday, April 30, 1789   
1  George Washington  Second Inaugural Address     Monday, March 4, 1793   
2         John Adams         Inaugural Address   Saturday, March 4, 1797   
3   Thomas Jefferson   First Inaugural Address  Wednesday, March 4, 1801   
4   Thomas Jefferson  Second Inaugural Address     Monday, March 4, 1805   

                                                text  \
0  Fellow-Citizens of the Senate and of the House...   
1  Fellow Citizens:  I AM again called upon by th...   
2  WHEN it was first perceived, in early times, t...   
3  Friends and Fellow-Citizens:  CALLED upon to u...   
4  PROCEEDING, fellow-citizens, to that qualifica...   

                                          text_clean  char_cnt  word_cnt  \
0  fellow citizens of the senate and of the house...      8616      1432   
1  fellow citizens   i am again called upon by th...       787       135   
2  when it was first perceived  in early times  t...     13871      2323   
3  friends and fellow citizens   called upon to u...     10144      1736   
4  proceeding  fellow citizens  to that qualifica...     12902      2169   

   avg_word_length  Counts_abiding  Counts_ability  ...  Counts_women  \
0         6.016760               0               0  ...             0   
1         5.829630               0               0  ...             0   
2         5.971158               0               0  ...             0   
3         5.843318               0               0  ...             0   
4         5.948363               0               0  ...             0   

   Counts_words  Counts_work  Counts_wrong  Counts_year  Counts_years  \
0             0            0             0            0             1   
1             0            0             0            0             0   
2             0            0             0            2             3   
3             0            1             2            0             0   
4             0            0             0            2             2   

   Counts_yet  Counts_you  Counts_young  Counts_your  
0           0           5             0            9  
1           0           0             0            1  
2           0           0             0            1  
3           2           7             0            7  
4           2           4             0            4  

[5 rows x 826 columns]

Term frequency-inverse document frequency

While counts of word occurrences can be useful for building models, words that occur many times may skew the results undesirably. To keep these common words from overpowering your model, a form of normalization can be used. This lesson uses term frequency-inverse document frequency (Tf-idf), which reduces the value of common words while increasing the weight of words that do not occur in many documents.

\[\begin{equation*} \text{TF-IDF} = \frac{\text{count of word occurrences}}{\text{total words in document}} \times \log{\left(\frac{\text{total number of docs}}{\text{number of docs word is in}}\right)} \end{equation*}\]

Note that scikit-learn's TfidfVectorizer uses a smoothed variant of this formula and normalizes each row, so its values will not match a hand calculation with the formula above exactly.

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.toarray(),
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_df.head())
   TFIDF_action  TFIDF_administration  TFIDF_america  TFIDF_american  \
0      0.000000              0.133415       0.000000        0.105388   
1      0.000000              0.261016       0.266097        0.000000   
2      0.000000              0.092436       0.157058        0.073018   
3      0.000000              0.092693       0.000000        0.000000   
4      0.041334              0.039761       0.000000        0.031408   

   TFIDF_americans  TFIDF_believe  TFIDF_best  TFIDF_better  TFIDF_change  \
0              0.0       0.000000    0.000000      0.000000      0.000000   
1              0.0       0.000000    0.000000      0.000000      0.000000   
2              0.0       0.000000    0.026112      0.060460      0.000000   
3              0.0       0.090942    0.117831      0.045471      0.053335   
4              0.0       0.000000    0.067393      0.039011      0.091514   

   TFIDF_citizens  ...  TFIDF_things  TFIDF_time  TFIDF_today  TFIDF_union  \
0        0.229644  ...      0.000000    0.045929          0.0     0.136012   
1        0.179712  ...      0.000000    0.000000          0.0     0.000000   
2        0.106072  ...      0.032030    0.021214          0.0     0.062823   
3        0.223369  ...      0.048179    0.000000          0.0     0.094497   
4        0.273760  ...      0.082667    0.164256          0.0     0.121605   

   TFIDF_united  TFIDF_war  TFIDF_way  TFIDF_work  TFIDF_world  TFIDF_years  
0      0.203593   0.000000   0.060755    0.000000     0.045929     0.052694  
1      0.199157   0.000000   0.000000    0.000000     0.000000     0.000000  
2      0.070529   0.024339   0.000000    0.000000     0.063643     0.073018  
3      0.000000   0.036610   0.000000    0.039277     0.095729     0.000000  
4      0.030338   0.094225   0.000000    0.000000     0.054752     0.062817  

[5 rows x 100 columns]
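
The values produced by `TfidfVectorizer` can be reproduced by hand, which helps in reading the table above. The sketch below assumes scikit-learn's default settings (raw counts as term frequency, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row) and uses a made-up three-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus
docs = ["peace and union", "peace and war", "union of states"]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)        # number of documents containing each word

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, sk)
print(matches)
```

Because of the row normalization, scores are comparable within a document but do not sum to one, and a word's score depends on which other words appear alongside it.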

Inspecting Tf-idf values

After creating Tf-idf features, you will often want to know which words score highest in each text. You can do this by isolating the row you want to examine and then sorting the scores from high to low.

# Isolate the row to be examined
sample_row = tv_df.iloc[0]

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())
TFIDF_government    0.367430
TFIDF_public        0.333237
TFIDF_present       0.315182
TFIDF_duty          0.238637
TFIDF_country       0.229644
Name: 0, dtype: float64

Transforming unseen data

The transformations you perform before training a machine learning model must also be applied to new, unseen data, and this includes creating vectors from text. Follow the same approach as in the last chapter: fit the vectorizer only on the training data, then apply it to the test data.

train_speech_df = speech_df.iloc[:45]
test_speech_df = speech_df.iloc[45:]
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(),
                          columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(test_tv_df.head())
TFIDF_action TFIDF_administration TFIDF_america TFIDF_american TFIDF_authority TFIDF_best TFIDF_business TFIDF_citizens TFIDF_commerce TFIDF_common
0 0.000000 0.029540 0.233954 0.082703 0.000000 0.000000 0.000000 0.022577 0.0 0.000000 ... 0.0 0.000000 0.115378 0.000000 0.024648 0.079050 0.033313 0.000000 0.299983 0.134749
1 0.000000 0.000000 0.547457 0.036862 0.000000 0.036036 0.000000 0.015094 0.0 0.000000 ... 0.0 0.019296 0.092567 0.000000 0.000000 0.052851 0.066817 0.078999 0.277701 0.126126
2 0.000000 0.000000 0.126987 0.134669 0.000000 0.131652 0.000000 0.000000 0.0 0.046997 ... 0.0 0.000000 0.075151 0.000000 0.080272 0.042907 0.054245 0.096203 0.225452 0.043884
3 0.037094 0.067428 0.267012 0.031463 0.039990 0.061516 0.050085 0.077301 0.0 0.000000 ... 0.0 0.098819 0.210690 0.000000 0.056262 0.030073 0.038020 0.235998 0.237026 0.061516
4 0.000000 0.000000 0.221561 0.156644 0.028442 0.087505 0.000000 0.109959 0.0 0.023428 ... 0.0 0.023428 0.187313 0.131913 0.040016 0.021389 0.081124 0.119894 0.299701 0.153133

5 rows × 100 columns


Using longer n-grams

So far, you have created features based on the individual words in each text. This can be very powerful in a machine learning model, but looking at words individually ignores a great deal of context. You can recover some of that context by using n-grams, which are sequences of n words grouped together. For example:

bigrams: Sequences of two consecutive words

trigrams: Sequences of three consecutive words

You can automatically create these in your dataset by specifying the ngram_range argument as a tuple (n1, n2) that includes all n-grams in the range n1 to n2.

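# Instantiate a trigram vectorizer
# (stop_words='english' is assumed here because the trigrams printed below contain no stop words)
cv_trigram_vec = CountVectorizer(max_features=100,
                                 stop_words='english',
                                 ngram_range=(3, 3))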

# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the first 10 trigram features
print(cv_trigram_vec.get_feature_names_out()[:10].tolist())
['ability preserve protect',
 'agriculture commerce manufactures',
 'america ideal freedom',
 'amity mutual concession',
 'anchor peace home',
 'ask bow heads',
 'best ability preserve',
 'best interests country',
 'bless god bless',
 'bless united states']

Finding the most common words

Once you have created your features, it is always advisable to inspect them to make sure they look as you would expect. This lets you catch errors early and may influence what further feature engineering you need to do.

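# Create a DataFrame of the features
cv_tri_df = pd.DataFrame(cv_trigram.toarray(),
                         columns=cv_trigram_vec.get_feature_names_out()).add_prefix('Counts_')

# Print the top 5 words in the sorted output
print(cv_tri_df.sum().sort_values(ascending=False).head())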
Counts_constitution united states    20
Counts_people united states          13
Counts_mr chief justice              10
Counts_preserve protect defend       10
Counts_president united states        8
dtype: int64
