Machine Learning Study Plan for July 2022

Have you ever been totally motivated and started a course or bought a book, only to get bored, never finish it, and then feel bad about not finishing? Yeah, me too.

This month I’m trying a different approach: setting more specific goals built around things that excite me.

Goal 1: Kaggle Playground Series – Clustering Challenge

Each month Kaggle publishes a new dataset and task aimed at intermediate learners. You should have a basic understanding of Machine Learning and have done some tutorials and small challenges before attempting this. If you are close to completing independent projects, but the big challenges on Kaggle still feel intimidating (like they do for me, I wouldn’t even know where to start), then the Playground Series is a good way to dip your toes in.

I have tried these once before, but I want to make more of an effort to complete them, to improve my practical knowledge and my portfolio at the same time. For July this means trying a handful of different strategies on the dataset and improving my knowledge of clustering methods, since clustering is this month’s task.

The challenge is available for free on https://www.kaggle.com/competitions/tabular-playground-series-jul-2022

For that I want to read the chapters about clustering in both “An Introduction to Statistical Learning” (made available online for free by the authors) and “The Elements of Statistical Learning”, the more advanced cousin of the first book, and see which of those methods I can apply to the Kaggle dataset.

Luckily other participants can also publish their solution code for the challenge and I will study some of the successful notebooks to see which methods they have used. Of course, I don’t just want to copy them, but instead research the methods and hopefully understand why they work better than other methods.

Goal 2: Decision Trees towards XGBoost

Last week I wrote a post about a fascinating research paper concerning tree-methods, specifically advanced ones like XGBoost, and their robustness. I mentioned that I didn’t know enough yet to understand most of the research paper, so closing that gap is my goal for the next few months.

Decision Trees for Regression and Classification

[Figure: Classification decision tree example]

For this month I will aim to understand the basics of decision trees for regression as well as classification, which includes things like Information Gain and the Gini Index. And I want to build a small scikit-learn project for both tasks. Admittedly, I already started with that last month, so I only have some loose ends to tie up here and finish the projects.
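
To make the two terms concrete, here is a minimal sketch of how the Gini Index and Information Gain can be computed for a single candidate split. The function names and the toy "yes"/"no" labels are made up for illustration; this is plain Python, not scikit-learn's internal implementation.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy node: 5 "yes" and 5 "no", split into two groups of five.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

print(gini(parent))                                    # 0.5
print(round(gini(left), 2))                            # 0.32
print(round(information_gain(parent, left, right), 3)) # 0.278
```

A perfectly mixed two-class node has a Gini index of 0.5 and an entropy of 1 bit. A split is better the more it lowers these values in the child nodes, which is what the information gain measures.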

Bootstrapping

Next on the list towards more complex tree-methods is the method of bootstrapping and then random forests. Ideally I would get all the way to random forests, but I’ll consider that more of a stretch goal.
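
As a rough idea of what bootstrapping means in practice, the sketch below resamples a small dataset with replacement many times and looks at how the mean varies across the resamples. The data values, function names and the number of resamples are arbitrary assumptions for illustration. Random forests build on the same idea by training each tree on a different bootstrap sample of the training rows.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each of n_resamples bootstrap samples (seeded for repeatability)."""
    rng = random.Random(seed)
    return [mean(bootstrap_sample(data, rng)) for _ in range(n_resamples)]


# Made-up measurements, purely for illustration.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.6]

means = sorted(bootstrap_means(data))
# Rough 95% percentile interval: cut off 2.5% of the resampled means on each side.
lower, upper = means[25], means[974]

print(f"sample mean: {mean(data):.2f}")
print(f"approx. 95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

Because sampling is done with replacement, some values appear several times in a resample and others not at all. That variation is what makes each resample slightly different.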

Again, my main theory sources are the two books mentioned above and just good old googling for details.

Update after the month

Kaggle challenge: I categorise this as a huge success! I uploaded multiple solutions to the challenge and even read most of my “required reading” from the books I mentioned.

Though I can’t claim to have won, it was extremely interesting to see the code of others who scored better than I did throughout the challenge. Through reading their code I learned of Gaussian Mixtures as well as of supplementing unsupervised with supervised learning techniques. Both of these approaches were also new to my colleagues, so I could immediately share these ideas as well 🙂
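
For readers who have not met these two ideas, here is a minimal sketch of how they can fit together, assuming scikit-learn is installed. It uses synthetic blobs, not the Kaggle dataset, and it is not the code from any of the notebooks mentioned above. The 0.9 confidence threshold and the choice of a decision tree as the classifier are arbitrary assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: three well-separated 2D blobs (true labels are discarded).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=0,
)

# Step 1 (unsupervised): fit a Gaussian mixture and get soft cluster assignments.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # one row per point, one column per component
labels = probs.argmax(axis=1)  # hard cluster label per point

# Step 2 (supervised): treat confidently clustered points as pseudo-labelled
# training data and fit a classifier on them.
confident = probs.max(axis=1) > 0.9
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X[confident], labels[confident])

# The classifier can then assign every point, including the uncertain ones.
refined = clf.predict(X)
agreement = (refined == labels).mean()
print(f"share of points where classifier and mixture agree: {agreement:.2f}")
```

Unlike k-means, a Gaussian mixture gives each point a probability for every cluster, which is what makes it possible to pick out only the confident points for the supervised step.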

Decision Trees: This was only half a success. I had a rough few weeks last month that made me overthink content a lot. I ended up publishing two posts about how to preprocess categorical data with one-hot encoding and also a walkthrough of a decision tree classifier both here and on YouTube.

But I didn’t get around to understanding Information Gain and the Gini Index, nor did I learn about bootstrapping. Admittedly, I’m not very excited about the details of information gain right now, so I might skip that for the time being and continue with bootstrapping next month.

That’s it for this month. On to the next adventure 🙂
