Creating Features - Feature Engineering for Machine Learning

python

datacamp

feature engineering

machine learning

Author

kakamana

Published

March 19, 2023

Creating features

This chapter covers the basics of feature engineering and how to apply it. We’ll load, explore, and visualize a survey response dataset, see what types of data it contains, and learn why those types influence how you build your feature set. We’ll then create new features from both categorical and continuous columns with the pandas package.

This chapter, Creating features, is part of the DataCamp course: Feature engineering for machine learning in Python

This is my learning experience of data science through DataCamp. These repository contributions are part of my learning journey through my graduate program, the Master of Applied Data Science (MADS) at the University of Michigan, as well as DeepLearning.AI, Coursera & DataCamp. You can find similar articles & more stories on my Medium & LinkedIn profiles. I am also available on Kaggle, my GitHub blog & GitHub repos. Thank you for your motivation, support & valuable feedback.

These include projects, coursework & notebooks which I worked through during my data science journey. They are created for reproducibility & future reference purposes only. All source code, slides & screenshots are the intellectual property of the respective content authors. If you find this content beneficial, kindly consider a learning subscription from DeepLearning.AI, Coursera, or DataCamp.

Code

import pandas as pd

Feature generation: why do it?

* Different types of data:

  *  Continuous: either integers (or whole numbers) or floats (decimals)
  *  Categorical: one of a limited set of values, e.g., gender, country of birth
  *  Ordinal: ranked values often with no details of distance between them
  *  Boolean: True/False values
  *  Datetime: dates and times
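
As a rough illustration of how these types can be represented in pandas, here is a minimal sketch. The DataFrame `toy_df` and its column names are hypothetical and are not part of the survey dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy_df = pd.DataFrame({
    "salary": [52000.0, 61500.5, 48000.0],                       # continuous
    "country": ["USA", "Spain", "USA"],                          # categorical
    "education": ["Bachelor", "Master", "Bachelor"],             # ordinal
    "is_hobbyist": [True, False, True],                          # boolean
    "survey_date": ["2018-02-28", "2018-06-28", "2018-06-06"],   # datetime stored as text
})

# Strings are not treated as categories or dates until converted explicitly
toy_df["country"] = toy_df["country"].astype("category")
toy_df["education"] = pd.Categorical(
    toy_df["education"], categories=["Bachelor", "Master"], ordered=True
)
toy_df["survey_date"] = pd.to_datetime(toy_df["survey_date"])

print(toy_df.dtypes)
```

The ordered categorical keeps the ranking (Bachelor < Master) without implying any distance between the levels, which matches the description of ordinal data above.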

Code

so_survey_df = pd.read_csv('dataset/Combined_DS_v10.csv')
so_survey_df.head()

	SurveyDate	FormalEducation	ConvertedSalary	Hobby	Country	StackOverflowJobsRecommend	VersionControl	Age	Years Experience	Gender	RawSalary
0	2/28/18 20:20	Bachelor's degree (BA. BS. B.Eng.. etc.)	NaN	Yes	South Africa	NaN	Git	21	13	Male	NaN
1	6/28/18 13:26	Bachelor's degree (BA. BS. B.Eng.. etc.)	70841.0	Yes	Sweeden	7.0	Git;Subversion	38	9	Male	70,841.00
2	6/6/18 3:37	Bachelor's degree (BA. BS. B.Eng.. etc.)	NaN	No	Sweeden	8.0	Git	45	11	NaN	NaN
3	5/9/18 1:06	Some college/university study without earning ...	21426.0	Yes	Sweeden	NaN	Zip file back-ups	46	12	Male	21,426.00
4	4/12/18 22:41	Bachelor's degree (BA. BS. B.Eng.. etc.)	41671.0	Yes	UK	8.0	Git	39	7	Male	£41,671.00

Code

print(so_survey_df.dtypes)

SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object

Choosing specific data types

Datasets often have columns with multiple data types (like the one you’re working with). Most machine learning models require a consistent data type across features. Most feature engineering techniques only work with one type of data at a time. When working with DataFrames, you’ll often want to access just certain types of columns.

Code

# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int','float'])

# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)

Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age',
       'Years Experience'],
      dtype='object')
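
The same method can also return the complement of a selection. The sketch below uses a small hypothetical stand-in (`sample_df`) rather than the real survey file, and relies on the `'number'` selector, which matches every integer and float width.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of so_survey_df
sample_df = pd.DataFrame({
    "Country": ["USA", "Spain"],
    "Age": [21, 38],
    "ConvertedSalary": [70841.0, None],
})

# 'number' selects all integer and float columns, whatever their width
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()

# exclude= returns every column that is NOT of the given type
other_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['Age', 'ConvertedSalary']
print(other_cols)    # ['Country']
```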

Dealing with categorical features

* Encoding categorical features:

  *  One-hot encoding
  *  Dummy encoding

* One-hot vs. dummies:

  *  One-hot encoding: explainable features
  *  Dummy encoding: necessary information without duplication
One-hot encoding and dummy variables

To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. In this exercise, you will create both types of encoding with pd.get_dummies() and compare the created column sets.

Code

# Convert the Country column to a one hot encoded Data Frame
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the columns names
print(one_hot_encoded.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
      dtype='object')

Code

# Create dummy variables for the Country column
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the columns names
print(dummy.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],
      dtype='object')
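
The difference between the two encodings is easier to see on a tiny example. This is a minimal sketch on hypothetical data (`toy`), not on the survey DataFrame: with `drop_first=True`, the first category in sorted order is dropped and is represented by a row of all zeros.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"Country": ["France", "India", "UK", "India"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(toy, columns=["Country"], prefix="OH")

# Dummy encoding: the first category (France) is dropped
dummies = pd.get_dummies(toy, columns=["Country"], prefix="DM", drop_first=True)

print(one_hot.columns.tolist())  # ['OH_France', 'OH_India', 'OH_UK']
print(dummies.columns.tolist())  # ['DM_India', 'DM_UK']

# A row whose dummy columns are all 0 belongs to the dropped category
is_france = dummies.sum(axis=1) == 0
print(is_france.tolist())        # [True, False, False, False]
```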

Dealing with unusual categories

Some features have many different categories that are not evenly distributed. Take data scientists’ favorite languages: Python, R, and Julia are popular choices, but some people prefer bespoke options such as FORTRAN or C. In these cases, you may not want to create a feature for every value, but only for the ones that show up most often.

Code

countries = so_survey_df.Country

# Get the counts of each category
country_counts = countries.value_counts()

# Print the count values for each category
print(country_counts)

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Ukraine           9
Ireland           5
Name: Country, dtype: int64

Code

# Create a mask for categories that occur fewer than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
print(mask.head())

0    False
1    False
2    False
3    False
4    False
Name: Country, dtype: bool

Code

# Work on a copy so the assignment below does not modify so_survey_df
# or raise a SettingWithCopyWarning
countries = countries.copy()

# Label all other categories as Other
countries[mask] = 'Other'

# Print the updated category counts
print(countries.value_counts())

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Other            14
Name: Country, dtype: int64
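
The counting, masking, and relabeling steps above can be wrapped into one reusable function. This is only a sketch: `group_rare`, `min_count`, and `other_label` are hypothetical names introduced here, and the function returns a new Series instead of modifying its input.

```python
import pandas as pd

def group_rare(values: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Return a copy of `values` with categories seen fewer than `min_count` times relabeled."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare, replace the rest with other_label
    return values.where(~values.isin(rare), other_label)

# Hypothetical toy data: Ireland appears only once
toy_countries = pd.Series(["USA"] * 3 + ["Spain"] * 2 + ["Ireland"])
grouped = group_rare(toy_countries, min_count=2)

print(grouped.value_counts())
```

Because `Series.where` builds a new object, the original Series is left untouched and no chained-assignment warning can occur.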

Numeric variables

Binarizing columns

Even though numeric values can often be used without feature engineering, there are times when some manipulation is useful. For example, sometimes you don’t care about the magnitude of a value, only its direction, or whether it exists at all. In these cases you’ll want to binarize the column. The so_survey_df data has a lot of survey respondents who are working for free (without pay). Adding a new column titled Paid_Job will let you know whether each person is paid (their salary is greater than zero). Note that respondents with a missing ConvertedSalary are also marked 0 here, because a comparison with NaN evaluates to False.

Code

# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())

   Paid_Job  ConvertedSalary
0         0              NaN
1         1          70841.0
2         0              NaN
3         1          21426.0
4         1          41671.0
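
The same binarization can be written as a single vectorized comparison. The sketch below uses a hypothetical toy frame and also shows one possible way to keep missing salaries as missing rather than as 0; the `Paid_Job_nullable` column is an assumption for illustration and not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing and a zero salary
toy = pd.DataFrame({"ConvertedSalary": [np.nan, 70841.0, 0.0, 21426.0]})

# Comparisons with NaN evaluate to False, so missing salaries become 0
toy["Paid_Job"] = (toy["ConvertedSalary"] > 0).astype(int)

# Optional: use a nullable integer column so unknown salaries stay unknown
toy["Paid_Job_nullable"] = toy["Paid_Job"].astype("Int64")
toy.loc[toy["ConvertedSalary"].isna(), "Paid_Job_nullable"] = pd.NA

print(toy)
```

Whether a missing salary should count as unpaid or as unknown depends on the downstream model, so the second column is a design choice rather than a correction.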

Binning values

For many continuous variables, you care less about the exact value than about the bucket it falls into. Binning is useful when plotting values or simplifying machine learning models. It is mostly used on continuous variables where fine precision matters less, e.g. age, height, or wages.

Code

# Bin the continuous variable ConvertedSalary into 5 bins
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)

# Print the first 5 rows of the equal_binned column
print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())

          equal_binned  ConvertedSalary
0                  NaN              NaN
1  (-2000.0, 400000.0]          70841.0
2                  NaN              NaN
3  (-2000.0, 400000.0]          21426.0
4  (-2000.0, 400000.0]          41671.0
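
All three non-missing rows above land in the same bin because equal-width bins are stretched by the largest salaries. The sketch below reproduces that effect on hypothetical values and, purely for contrast, shows `pd.qcut`, which is not used in the original exercise and splits by quantiles instead.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salaries = pd.Series([20000, 30000, 40000, 50000, 1000000])

# Equal-width bins: every bin spans the same range of values
equal_width = pd.cut(salaries, bins=5)
print(equal_width.value_counts().sort_index())

# Quantile bins: every bin holds roughly the same number of rows
equal_freq = pd.qcut(salaries, q=5)
print(equal_freq.value_counts().sort_index())
```

With equal widths, four of the five values share the first bin and the outlier sits alone in the last one, while the quantile version places one value in each bin.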

Code

# Import numpy
import numpy as np

# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
                                         bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())

  boundary_binned  ConvertedSalary
0             NaN              NaN
1          Medium          70841.0
2             NaN              NaN
3             Low          21426.0
4             Low          41671.0
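
One detail worth checking with explicit boundaries is which side of each interval is closed. By default `pd.cut` uses right-closed intervals, so a salary of exactly 10000 is labeled 'Very low'. The sketch below uses hypothetical edge values to compare this with `right=False`.

```python
import numpy as np
import pandas as pd

bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Hypothetical values sitting on or near the bin boundaries
edge_values = pd.Series([10000, 10000.01, 50000, 150000, 150001])

# Default: intervals are (lower, upper], so a boundary belongs to the lower bin
right_closed = pd.cut(edge_values, bins=bins, labels=labels)

# right=False: intervals are [lower, upper), so a boundary belongs to the upper bin
left_closed = pd.cut(edge_values, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"value": edge_values,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```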