It would be great if our data were always ready to be fed into a machine learning model. In most projects, however, that is not the case. Among the many possible issues, a likely one is that your data has missing values.
There are many possible reasons for missing data: perhaps the data was aggregated from different sources that don’t all share the same features, the system measuring specific values had outages, or the data storage has been corrupted in places.
Why we have to deal with missing values
Whatever the reason, we can’t just leave the holes in the data.
If the missing values are represented by a special sentinel value (like -999 or 0), they can skew the distribution of the feature – similar to the issues that arise with outliers.
Another possibility is that the values are represented as “None” in Python, a value that many machine learning algorithms are not equipped to handle at all.
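If your data uses such a sentinel value, a first step is to convert it into proper missing values so the techniques below can recognize it. Here is a minimal sketch, assuming a pandas DataFrame called training_data and -999 as the hypothetical sentinel:
import numpy as np
# assumption: -999 is used as a placeholder for "missing" in this dataset;
# replace() returns a new DataFrame with every occurrence converted to NaN
training_data = training_data.replace(-999, np.nan)
# check how many values per column are now marked as missing
print(training_data.isna().sum())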
The naive approach: Throw away the affected data
Strictly speaking, this is not an imputation method, because we are not replacing the missing values but simply getting rid of them. It can, however, be effective if the missing data only affects a few rows. If you have a pandas.DataFrame training_data, you can delete these rows like this:
# dropna() returns a new DataFrame, so re-assign the result to actually remove the rows
training_data = training_data.dropna()
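If only some of your columns are critical, you may prefer to drop a row only when those specific columns are missing. A small variation of the call above, using the hypothetical column name 'column1':
# only drop rows where 'column1' itself is missing; gaps in other columns are kept for imputation
training_data = training_data.dropna(subset=['column1'])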
Another method, which should only be chosen in extreme cases, is deleting a whole feature or column. If the feature value is missing in the majority of the data instances, it might not hold much information, so you could try deleting it:
# use axis=1 to make sure you are deleting a column, not a row (axis=0);
# drop() returns a new DataFrame, so re-assign the result to keep it
training_data = training_data.drop('feature_name', axis=1)
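Whether a column qualifies for this treatment is easy to check by looking at the fraction of missing values per column. A short sketch, assuming the same training_data DataFrame (the 0.5 threshold is an arbitrary example):
# fraction of missing values per column, sorted from most to least affected
missing_fraction = training_data.isna().mean().sort_values(ascending=False)
print(missing_fraction)
# example threshold: columns where more than half of the values are missing
candidates = missing_fraction[missing_fraction > 0.5].index.tolist()
print(candidates)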
Impute the missing values with the median of the existing values
A simple strategy that allows us to keep all the recorded data is using the median of the existing values in this feature.
One option is to compute this value by hand from your training dataset and then insert it into the missing spots. You have to do this separately for every column with missing values, like this:
# training_data of type pandas.DataFrame
median = training_data['column1'].median()
# re-assign the column; fillna(..., inplace=True) on a column selection is not reliable in newer pandas versions
training_data['column1'] = training_data['column1'].fillna(median)
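If several columns are affected, a small loop saves you from repeating those two lines by hand. A sketch, assuming all affected columns are numeric:
# fill every column that has missing values with that column's own median
for column in training_data.columns:
    if training_data[column].isna().any():
        median = training_data[column].median()
        training_data[column] = training_data[column].fillna(median)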
The simpler alternative is to use the SimpleImputer from scikit-learn:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
# fit_transform learns the median of each column and fills the gaps in the training data
training_data_imputed = imputer.fit_transform(training_data)
# transform reuses the training medians to fill the gaps in the test data
test_data_imputed = imputer.transform(test_data)
The advantage is that this approach learns the median of each feature with fit() and applies it to all columns at the same time with transform(). It also stores all the medians in the imputer instance, so you can easily apply the same transformation to new test data.
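Note that fit_transform() and transform() return NumPy arrays rather than DataFrames. If you want to keep working with a DataFrame, a small sketch like the following wraps the result back up and lets you inspect the learned medians, which scikit-learn stores in the statistics_ attribute:
import pandas as pd
# wrap the imputed array back into a DataFrame, keeping the original column names
training_data_imputed = pd.DataFrame(
    imputer.fit_transform(training_data), columns=training_data.columns
)
# one learned median per column, in column order
print(imputer.statistics_)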
Attention: Configure your imputation method ONLY on training data
No matter which imputation strategy you choose (“median”, “mean”, “most_frequent”, etc.), to get an accurate estimate of your model’s performance on the test data, you need to compute the imputation values on the training set only.
One reason is that in production you will probably evaluate single data instances, and a single instance does not give you a meaningful median to compute, so you will have to fall back on the stored training median anyway.
Another reason comes into play if you also standard-scale your feature values: if you use the training mean to center your values around 0, you should use training statistics for the imputation as well, otherwise you risk skewing your model.
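To tie both points together, here is a minimal sketch (with hypothetical variable names) of a scikit-learn Pipeline that chains imputation and standard scaling, so that both steps learn their statistics from the training data only:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# both steps learn their statistics (medians, means, standard deviations)
# from the training data only
preprocessing = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
training_data_prepared = preprocessing.fit_transform(training_data)
# the test data is transformed with the statistics learned on the training data
test_data_prepared = preprocessing.transform(test_data)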