Mar 4, 2021
Have you ever been faced with a dataset so unclean and irregular that you felt terrified? 😯 If your answer is no, you have either been practicing deep learning for over 3 years, or you rarely practice at all. But jokes aside, we deep learners frequently face rather unclean data that needs a tedious amount of processing before it is usable. We are going to tackle several of these problems in this tutorial by solving a machine learning problem in a beginner-friendly way. Let’s get started!
In this tutorial, we cover the following:
NOTE: If at any point you feel confused, make sure to check my Colab notebook and follow along with it so you can see the full flow of data between data frames: Full Notebook Summary
Structured data is data that can be tabulated or visualized in a table format. In machine learning, this term usually refers to numerical data, also known as continuous values. Structured data is logically the opposite of unstructured data, which refers to data that cannot be easily tabularized or that has an irregular format; examples are image, text, and video data.
Structured data conforms to a tabular format with a relationship between the different rows and columns. Common examples of structured data are Excel files or SQL databases. — Big Data Framework1
In the real world, we gather a lot of structured data that follows the laws of the real world. When it is collected, no consideration is given to how it can be used for machine learning or for gathering insights. An example: imagine a law firm kept records on all their customers in the year 2019, with information on, say, how long the company’s representatives talked on the phone with customers, how much money customers invested in insurance, etc. In a record like this, there is raw data that is not readily usable for modeling. In such cases, employing machine learning cleaning methods is the only viable option. Let’s dive into the subject data for this tutorial.
The dataset for this tutorial contains two files:
In other words, we have a training data frame with 6500 rows and 20 columns and a testing/evaluating data frame with 3500 rows and 19 columns.
You work for a company that sells sculptures acquired from various artists around the world. Your task is to predict the cost required to ship these sculptures to customers based on the information provided in the dataset. The data frames contain several columns which represent the features we are working with, and we need to build a model that predicts the last column in the test set (Cost of Sculpture). Let’s visualize the dataset to see what it looks like.
To visualize the dataset, we first load it up and import the necessary packages and modules. I’m working on Colab for this tutorial, and the dataset is stored in my Drive. To follow along with this guide, download the dataset here and upload the zip file to Drive.
from google.colab import drive
drive.mount('/content/drive')
import os
import random
import tensorflow as tf
import numpy as np
import pandas as pd
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from zipfile import ZipFile

file_name = '/path/to/dataset'
with ZipFile(file_name, 'r') as grab:
    grab.extractall('/path/')
    print('Done')
raw_dataset = pd.read_csv('/path/to/extracted/csv')
dataset = raw_dataset.copy()
dataset.head()

# Checking for NaN values
print(dataset.isnull().sum())
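Since every preprocessing step we apply to the training set will also have to be applied to the test set, it helps to load the test file right away and confirm the shapes mentioned earlier. Here is a minimal sketch; the test CSV path and the raw_test name are placeholders of mine, not from the original notebook:

# Load the test set as well — '/path/to/extracted/test_csv' is a placeholder path
raw_test = pd.read_csv('/path/to/extracted/test_csv')

# Confirm the shapes mentioned earlier: train (6500, 20), test (3500, 19) — the test set has no Cost column
print(dataset.shape, raw_test.shape)
print(raw_test.isnull().sum())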
In the image below, we can see all 20 columns and their features: Customer Id, Artist Name, Artist Reputation, Height, Width, Weight, Material, Price of Sculpture, Base Shipping Price, International, Express Shipment, Installation Included, Transport, Fragile, Customer Information, Remote Location, Scheduled Date, Delivery Date, Customer Location, and Cost. We also have a lot of NaN values in our data; these will be tackled after dropping irrelevant columns with string values and overly tedious formats like dates.
Some particular columns contain string values and other formats that are irrelevant in predicting the cost of the artwork.
For this next step, we have two options to consider: should we embed the categorical columns first, or drop the irrelevant columns first?
Let’s approach this somewhat logically: if we start with the first option, we are likely to run into columns that can’t be embedded (or rather, would be too tedious to embed). But if we drop them first, we can exclude those absurdly irrelevant features before moving on to embedding the relevant columns.
Before dropping columns in the training and test set, let’s first pop out our target column (Cost) and save it. Later we reference it when fitting our model.
X, y = raw_dataset.drop('Cost', axis=1), raw_dataset.iloc[:,-1:]
y.head()
In the above code block, we separated the cost from the main train data frame and stored it in a new data frame. On printing the head, though, we can see something strange at work! The cost consists of both negative and positive values. This won’t do; we want only positive values. As I said earlier, real-world data never comes the way you expect. This could just be a wrong entry by a tired cashier, or a more acceptable explanation perhaps would be that for all art sculptures delivered with defects, the company holds the loss. Anyway, we can easily correct this using the absolute function in pandas: the absolute value of any number is the positive of that number.
train_Y = y['Cost'].abs()
train_Y.head()
With the above code, our target variable now contains only positive values. Now that’s taken care of, we can proceed to drop the irrelevant columns from our training set (and mirror the same drops on the test set, as shown after the code below).
import pandas as pd
import numpy as np
X = raw_dataset.copy()
train = X.drop(['Cost','Customer Id','Artist Name','Delivery Date','Scheduled Date','Customer Location'], axis=1)
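To keep the two frames consistent, the same drops need to be mirrored on the test frame. A minimal sketch, using the raw_test name from the earlier loading sketch (an assumption of mine); note the test set has no Cost column to drop:

# Mirror the same column drops on the test frame (it has no 'Cost' column)
test = raw_test.drop(['Customer Id', 'Artist Name', 'Delivery Date', 'Scheduled Date', 'Customer Location'], axis=1)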
Earlier, I checked our columns for NaN values. In total it returned about 4,000 NaN values, but they occur more in some columns than in others (especially Material, Transport, Remote Location, Width, Height, etc.). For this step, we want to first replace NaN values in categorical columns with more than two classes. Material, Remote Location, and Transport fit the description; the reason is that if we embed them with NaN values still present, we stand a chance of losing the relative relationships in our features. To do this, we are going to create a simple function that fills NaN values with the most frequently occurring category in the column. For example, in the Material column we have 7 classes (Brass, Stone, Aluminium, Bronze, Clay, etc.); our function replaces NaN values in each given column with its most frequent class.
# Function to replace NaN values with the mode (most frequent category) of a column
def replace_nan_most_freq(DataFrame, ColName):
    most_frequent_category = DataFrame[ColName].mode()[0]
    # Replace NaN values with the most frequent category, in a new "-Imputed" column
    DataFrame[ColName + "-Imputed"] = DataFrame[ColName]
    DataFrame[ColName + "-Imputed"].fillna(most_frequent_category, inplace=True)

# Call the function to impute the most frequent category
for Columns in ['Material', 'Remote Location', 'Transport']:
    replace_nan_most_freq(train, Columns)
    replace_nan_most_freq(test, Columns)

# Display the imputed result
train[['Material','Material-Imputed','Remote Location','Remote Location-Imputed','Transport','Transport-Imputed']].head(10)
test[['Material','Material-Imputed','Remote Location','Remote Location-Imputed','Transport','Transport-Imputed']].head(10)

# Drop the original columns, keeping only the imputed versions
train = train.drop(['Material', 'Remote Location', 'Transport'], axis=1)
test = test.drop(['Material', 'Remote Location', 'Transport'], axis=1)
With that, if we display the head of the train and test data frames, we get the following. As you can see, Material has been replaced by Material-Imputed, and likewise Transport and Remote Location.
For categorical embedding, we want to check for columns with a data type of object and label encode them using scikit-learn’s LabelEncoder.
# Get list of categorical variables
s = (train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
from sklearn.preprocessing import LabelEncoder

# Make copies to avoid changing the original data
label_X_train = train.copy()
label_X_test = test.copy()

# Apply a label encoder to each column with categorical data
# (fit on the train column and reuse it on the test column so the encodings match,
#  assuming the test set contains no unseen categories)
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(train[col])
    label_X_test[col] = label_encoder.transform(test[col])
With that, plotting the head gives us this:
All categorical columns have been label encoded to numerical values. The last step is to fill the remaining columns that still contain NaN values, like Width, Height, and Weight, because, if you remember, we only replaced the categorical NaN values above. The code below uses scikit-learn’s SimpleImputer.
from sklearn.impute import SimpleImputer

# Imputation (mean imputation by default); fit on train and apply the same statistics to test
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(label_X_train))
imputed_X_test = pd.DataFrame(my_imputer.transform(label_X_test))

# Imputation removed column names; put them back
imputed_X_train.columns = train.columns
imputed_X_test.columns = test.columns
Now if we check for Nan values, it returns zero across all columns.
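To convince ourselves of this, we can simply re-run the same NaN check on the imputed frames:

# Verify that no NaN values remain after imputation
print(imputed_X_train.isnull().sum())
print(imputed_X_test.isnull().sum())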
With that taken care of, we can now start doing some feature selection and deciding which columns are of no use to us. A lot of logical thinking is important as this is a real-world problem and real-world insight is required.
NOTE: Any column we decide to drop must also be dropped in the test set, and likewise any preprocessing step we take must be mirrored there, since the test set is what we will make predictions on, and inconsistent columns would result in errors.
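One lightweight way to guard against such mismatches is a quick sanity check after each preprocessing step. A minimal sketch:

# Sanity check: train and test must expose exactly the same feature columns
assert list(imputed_X_train.columns) == list(imputed_X_test.columns), \
    "Train and test feature columns are out of sync"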
When faced with a feature selection problem in deep learning and machine learning, there are several methods you can apply to arrive at better features during training, therefore, better model accuracy.
We now have 19 columns in our training set after dropping Cost.
Customer ID: A set of unique values associated with every customer (This adds absolutely nothing of value)
The above can be achieved with a few logical deductions and insights, but as you can see, there are still some uncertainties. By using statistical methods, tests, and libraries, we have a better basis to decide.
Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The example below uses the f_classif (ANOVA F-value) statistical test to select the 10 best features from the Art Exhibition dataset. Some of the feature columns must be dropped before you can use this method: those with string values, e.g. Customer Id and Artist Name.
# Apply the SelectKBest class to extract the top 10 best features
# x_final here is the fully preprocessed (imputed, label-encoded) training feature frame from the
# notebook, e.g. imputed_X_train; for a continuous target, f_regression is an alternative score function
from sklearn.feature_selection import SelectKBest, f_classif

bestfeatures = SelectKBest(score_func=f_classif, k=10)
fit = bestfeatures.fit(x_final, train_Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_final.columns)

# Concatenate the two data frames for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']  # naming the data frame columns
print(featureScores.nlargest(10, 'Score'))  # print the 10 best features
The top ten best features are displayed below, helping us get rid of uncertainties like the Fragile column.
You can get the feature importance of each feature in your dataset by using the feature_importances_ property of the model. Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is to your output variable. Feature importance is an inbuilt attribute of tree-based models; we will use ExtraTreesRegressor to extract the top 10 features from the dataset.
from sklearn.ensemble import ExtraTreesRegressor
import matplotlib.pyplot as plt

model = ExtraTreesRegressor()
model.fit(x_final, train_Y)
print(model.feature_importances_)  # built-in feature_importances_ attribute of tree-based models

# Plot a graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=x_final.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
Again, we see that certain feature columns are among the top 10 with this selection method. We now know which features to target when we limit the training features to, say, 10.
Correlation states how the features are related to each other or to the target variable. Correlation can be positive (an increase in one feature’s value increases the value of the target variable) or negative (an increase in one feature’s value decreases the value of the target variable). A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library.
# Get the correlations of each feature in the dataset
# (numeric_only=True keeps only numeric columns, which newer pandas versions require)
corrmat = raw_dataset.corr(numeric_only=True)
top_corr_features = corrmat.index

plt.figure(figsize=(20, 20))
# Plot the heat map
g = sns.heatmap(raw_dataset[top_corr_features].corr(numeric_only=True), annot=True, cmap="RdYlGn")
Now we are talking. Let’s round up the 11 features we want to use for training by selecting those that occur the most across the feature selection methods above.
We quickly drop the remaining columns while keeping the ones above, and finally proceed with scaling our data and fitting our model, using scikit-learn’s StandardScaler and RandomForestRegressor respectively.
# Drop the features we decided against (Remote Location was renamed to its imputed version earlier)
x_train = imputed_X_train.drop(['Fragile', 'Remote Location-Imputed', 'Installation Included'], axis=1)
x_test = imputed_X_test.drop(['Fragile', 'Remote Location-Imputed', 'Installation Included'], axis=1)
from sklearn.preprocessing import StandardScaler

# Scale the selected features; fit the scaler on train and reuse it for test
scaler = StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns=x_train.columns)
x_test = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns)
# Import the Random Forest model
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest regressor
rgf = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training set
rgf.fit(x_train, train_Y)
Last but not least, we need to run our model on the test set, then prepare and arrange the predictions in a new data frame with two columns: Customer Id and Cost.
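A minimal sketch of that final step, assuming the raw test frame (raw_test, from the earlier loading sketch) is still available so we can recover the Customer Id column that was dropped during preprocessing; the submission.csv file name is just an example:

# Predict the shipping cost for the test set
predictions = rgf.predict(x_test)

# Arrange the predictions in a submission data frame with Customer Id and Cost
submission = pd.DataFrame({
    'Customer Id': raw_test['Customer Id'],
    'Cost': predictions,
})
submission.to_csv('submission.csv', index=False)
submission.head()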
And this is what the final result looks like. You can download the .csv file here.
We have had fun with this project: cleaning, visualizing, dropping, performing feature engineering, and learning how to use the scikit-learn library for machine learning work. It’s a comprehensive guide where I decided to tackle problems to show you how you can solve similar ones. Naturally, you have questions to ask, and I’m only happy to see you ask them. The comments box below is available for your chats. Thanks for reading this article; I hope it has achieved its purpose of guiding you through the concepts of feature selection and engineering in deep learning. Chel.