Every month, Kaggle releases a beginner’s challenge on a simulated dataset, and this month I took part. In the August 2022 challenge we are given simulated data from a fictional product test series, and the task is to predict from the measurements whether the product will fail in each case.
In this post I’ll describe what new things I learned while trying my hand at the challenge.
Table of Contents
- The task and dataset
- Random Forest
- GridSearchCV: Tuning hyperparameters and using cross-validation
- The power of preprocessing and feature engineering
- Conclusions: A humbling experience and plans to learn feature engineering
The task and dataset
The task is binary classification. It is pitched as “an opportunity to help the fictional company Keep It Dry improve its main product Super Soaker. The product is used in factories to absorb spills and leaks.
The company has just completed a large testing study for different product prototypes. Can you use this data to build a model that predicts product failures?”
The dataset itself is simulated, but it is built to provide a real challenge: there are missing values, many features – most of which the community agreed were noise – and categorical features that differ between the training and test set. More on that below.
Random Forest
This was one of the first times I applied the random forest classifier. Thanks to the uniform scikit-learn interface – you always call the same .fit() and .predict() methods no matter which model you pick – it was not too different from using Decision Trees.
A new hyperparameter to tune here, for example, is n_estimators, which controls how many trees the forest is made of.
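To illustrate that point, here is a minimal sketch on toy data (make_classification is just a stand-in for the actual challenge data): the only thing that changes between the two models is the constructor; the rest of the calls are identical.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# toy data just for illustration -- not the challenge data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# same .fit()/.predict() interface, only the estimator (and its hyperparameters) changes
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, max_depth=4).fit(X, y)

tree_preds = tree.predict(X)
forest_preds = forest.predict(X)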
GridSearchCV: Tuning hyperparameters and using cross-validation
It is kind of embarrassing to admit that for most of my career I just wrote the loops for tuning hyperparameters by hand. One reason is that during my time at university, we implemented a lot of the algorithms ourselves (instead of using the scikit-learn implementations) and focused on understanding them rather than on optimizing hyperparameters for hard-to-learn datasets. So in a way, this is just part of “putting ML into production”, an area I’m improving in now that I work in industry, or more specifically consulting.
And while the for-loops are quick to write, it’s tedious to keep track of all the results per parameter combination to find the best one. This is where scikit-learn’s GridSearchCV class comes in – a great time-saver if you use one of their models. Here is how I used it with the RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier()

# every combination of these values is tried (3 * 4 * 3 = 36 candidates, each cross-validated)
param_grid = {'n_estimators': [3, 50, 100],
              'max_features': ['sqrt', 'log2', None, 10],
              'max_depth': [2, 3, 4]}

# X_train and y_train are the preprocessed training features and labels
grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='accuracy', return_train_score=True)
grid_search.fit(X_train, y_train)
You then have various ways to access the results:
# returns the best model, already trained
grid_search.best_estimator_
# returns the best test score (which is the mean of the cross-validation scores)
# here we get the accuracy, because that is our scoring function chosen above
grid_search.best_score_
# returns the best parameters (of the above model) as a dict
grid_search.best_params_
# output: {'max_depth': 4, 'max_features': None, 'n_estimators': 100}
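And if you want more than just the winner, the full bookkeeping that I used to do by hand lives in cv_results_, which is easy to inspect as a pandas DataFrame:

import pandas as pd

# one row per parameter combination, with mean train/test scores across the CV folds
# (mean_train_score is only available because of return_train_score=True above)
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'mean_train_score']]
      .sort_values('mean_test_score', ascending=False))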
The power of preprocessing and feature engineering
The participants who shared their code used a variety of models, each with quite a bit of success on the leaderboard. In this challenge, however, the preprocessing seemed to matter more than the choice of model.
First off, there were a lot of missing values in this dataset. I collected some basic approaches to filling missing values in this blog post earlier this week. Other participants used more advanced approaches as well – more than I could learn in a few hours.
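As one example of the basic approaches (not necessarily what the top submissions did), scikit-learn's SimpleImputer fills each missing value with a per-column statistic; the column names here are made up for illustration:

import pandas as pd
from sklearn.impute import SimpleImputer

# illustrative toy frame with missing values
df = pd.DataFrame({'measurement_1': [1.0, None, 3.0],
                   'measurement_2': [4.0, 5.0, None]})

# replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)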
Another aspect was the categorical columns. I know the basic approach of one-hot encoding categories – I wrote a blog post about that last month – to make them work with most machine learning models. Other participants used Weight of Evidence encoding or Label Encoding (even though the scikit-learn documentation says to use the latter only for target labels, not input features). A minimal sketch of the one-hot approach follows below.
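In this sketch the column name is made up for illustration; the detail that matters is handle_unknown='ignore', because, as mentioned above, the test set contained categories that never show up in training:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# illustrative categorical column, not the real challenge data
train = pd.DataFrame({'product_code': ['A', 'B', 'A', 'C']})
test = pd.DataFrame({'product_code': ['D', 'A']})

# unseen categories in the test set become all-zero rows instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train[['product_code']]).toarray()
test_encoded = encoder.transform(test[['product_code']]).toarray()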
A big issue is that most people don’t comment their code or explain their decisions. This makes it super hard for me to figure out if an approach like Label Encoding is just their favourite encoder or if there were specific clues in the data that made them choose this approach.
I can’t say for certain that preprocessing is where my issues were, but it seems likely, especially given how much noise other participants reported in the data. To be frank: my model did not perform much better than guessing (a score around 0.5), though even the highest score on the leaderboard was only about 0.6. Which brings me to:
Conclusions: A humbling experience and plans to learn feature engineering
My takeaway here is that I should spend some time with feature engineering techniques, especially concerning correlation and categorical variables. Again, my education was mainly focused on the machine learning algorithm side and less on working with imperfect datasets, so this gap makes sense (it’s still an uncomfortable realization though 😀 ).
All in all this experience was very humbling and I hope to have more success in the next challenge to cheer me up.