One Hot Encoding – How to deal with categorical data in Machine Learning


Many machine learning models don’t work with categorical data. So what do we do in that case? Of course you can always just remove the categorical columns, but you would lose a lot of valuable information. So in this post, I share how you can use one hot encoding to make that information usable.

I stumbled upon this problem when I recently refreshed my knowledge on decision trees and tried to build one for classification on the Heart Failure Prediction Dataset on Kaggle. It’s a nice beginner dataset for classification. However, it has quite a few categorical variables, and many decision tree implementations (scikit-learn’s, for example) can’t deal with those directly 🙁

Can your categories be ordered or not?

Categorical features have a finite number of possible values (you could call them categories).

If your categories have an inherent ordering, you can simply map each value to its position in that order and use the resulting integers as your new feature. This is called “Integer encoding” (or “ordinal encoding”) in case you want to google it 🙂

Example:

  • Feature values: “big”, “medium”, “small”
  • New feature values: 3, 2, 1
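
As a minimal sketch of integer encoding with pandas: the column name `size` and the sample rows are made up for illustration, and the mapping is written by hand because only you know the correct order of your categories.

```python
import pandas as pd

# Hypothetical toy column; the values mirror the example above.
sizes = pd.Series(["big", "small", "medium", "big"], name="size")

# The order is defined by hand: small < medium < big.
size_order = {"small": 1, "medium": 2, "big": 3}

# Replace every category with its integer rank.
sizes_encoded = sizes.map(size_order)
print(sizes_encoded.tolist())  # [3, 1, 2, 3]
```

Any value missing from the mapping would end up as NaN, so it is worth checking the result for missing values afterwards.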

If the categorical values have no inherent ordering, you can instead use a strategy called “One Hot Encoding”, also known as using “dummy variables”, which we’ll take a closer look at now.

Example dataset

Here is the dataset as provided by Kaggle:

It has some “nice” columns like Age, which are clearly ordered and theoretically unlimited. Those need no further work for a decision tree. Side note: For some models you might want to normalize or standardize these numerical features though 😉

One hot encoding example: Creating a binary column for each category

One column that does need more work though is “ST_Slope”, which can have values in {“Up”, “Flat”, “Down”}. Here are the first five entries for this column:

The basic idea of one hot encoding is that you create a new column for every possible value in the categorical column, which then looks like this:

We now have a boolean column for every possible value (Up, Down, Flat) that tells us whether the patient has this ST slope (=1) or not (=0).
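
To make the idea concrete, here is a rough sketch that builds these binary columns by hand. The five sample values are invented for illustration and are not the actual first rows of the Kaggle dataset.

```python
import pandas as pd

# Made-up sample values, NOT the real first five entries of the dataset.
st_slope = pd.Series(["Up", "Flat", "Up", "Flat", "Down"], name="ST_Slope")

# One new 0/1 column per category: 1 where the row has that value, else 0.
categories = ["Up", "Flat", "Down"]
one_hot = pd.DataFrame(
    {category: (st_slope == category).astype(int) for category in categories}
)
print(one_hot)
```

Every row contains exactly one 1, which is where the name “one hot” comes from.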

You can probably drop one binary column

There is one last thing to say here. You might have noticed that one of the three columns is redundant, because we only need two columns to convey the information. If Down=0 and Flat=0, we know without checking that Up=1, because the column ST_Slope is never empty. So we can leave out the Up column. With the same reasoning, we could leave out either of the other columns instead.
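
The argument above can be checked with a few lines of pandas. This is only a sketch on made-up rows, and it assumes that every row has exactly one category set.

```python
import pandas as pd

# Made-up one-hot columns for five hypothetical patients.
one_hot = pd.DataFrame(
    {
        "Up": [1, 0, 1, 0, 0],
        "Flat": [0, 1, 0, 1, 0],
        "Down": [0, 0, 0, 0, 1],
    }
)

# Drop the Up column, keeping only Flat and Down.
reduced = one_hot.drop(columns="Up")

# Up is 1 exactly when neither Flat nor Down is 1.
up_reconstructed = 1 - reduced["Flat"] - reduced["Down"]

# Check that nothing was lost by dropping the column.
print((up_reconstructed == one_hot["Up"]).all())
```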

Applying one hot encoding with Python Pandas

You don’t need to do any of this by hand, because pandas offers you a handy function for it:

heart_data_x_encoded = pd.get_dummies(heart_data_x, drop_first=True)

The drop_first option does exactly what I just described above: it only creates x-1 columns for a feature with x possible values.
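
Here is a small sketch of that call on a toy frame. The column names are borrowed from the dataset for illustration, but the values are made up and `heart_data_x` here is not the real data.

```python
import pandas as pd

# Hypothetical mini-dataset: one numerical and two categorical columns.
heart_data_x = pd.DataFrame(
    {
        "Age": [40, 49, 37, 54],
        "Sex": ["M", "F", "M", "M"],
        "ST_Slope": ["Up", "Flat", "Up", "Down"],
    }
)

heart_data_x_encoded = pd.get_dummies(heart_data_x, drop_first=True)

# Age is passed through unchanged; each categorical column loses one category.
print(heart_data_x_encoded.columns.tolist())
# ['Age', 'Sex_M', 'ST_Slope_Flat', 'ST_Slope_Up']
```

Note that drop_first removes the first category in sorted order, so for ST_Slope it is Down that disappears here, not Up. As explained above, that makes no difference to the information content. Depending on your pandas version, the new columns hold booleans or 0/1 integers.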

You might also notice that you can just chuck the whole dataset into the function: by default, pandas one-hot-encodes the columns with object, string, or category dtype and leaves numerical columns alone. It is probably still safer to check the resulting dataset to see if everything is correct. Our heart-disease dataset now has the following features:
