I’m challenging myself to do a full data science project on a Kaggle data set in 14 days – including data understanding, preprocessing, modeling and evaluation. I’m sharing this to make sure I actually do it and to show a realistic portrayal of what is possible in 14 days if you’re not a Kaggle grandmaster.
I don’t know about you, but I’m part of the club of people who don’t get stuff done and procrastinate (or “do research”) until the deadline is fast approaching. So for this project I’m going for maximum accountability. And I’m inviting you to follow along:
Table of Contents
- Join me on this challenge
- Rules of the challenge
- The data set used
- Why the heck am I doing this to myself?
Join me on this challenge
Note: The challenge is over now, but you can still get access to all the code and updates here. Each daily update included:
- the code and/or a link to its current state
- interesting insights about the data, e.g. plots or first training results
- mistakes I made and tips on how to avoid or fix them in your own work
You have two options here:
- Code along and attempt the challenge yourself.
- Just observe and read the updates to see a realistic portrayal of data science work, no steps skipped.
- Okay, I lied, there is an option 3: Switch between the two and jump in with a bit of code whenever you want. You can always take my code and change things if you have a better idea on how to approach the topic.
You can unsubscribe at any time of course.
Rules of the challenge
Here are the simple rules:
- deadline: start on Monday, January 30th & finish on Sunday, February 12th
- limited time: I have a 9-5 job from Monday to Friday, so I really need to focus on the essentials.
- goal: Go through the stages of data understanding, modeling and evaluation. I will not have the best model out there, but every project improves our knowledge and techniques.
- no cheating: I’m not going to look at the results of other challenge participants.
- accountability: I’m sending out a daily update of my code and results for the world to judge (okay, I might be a bit nervous…)
The data set used
I’m using a synthetic data set created by Kaggle for their Playground Series, which is aimed at people getting started with Kaggle challenges. The competition deadline is already January 31st, so my results obviously won’t be in time for the challenge. It is still a great data set to practice on.
The data is based on this Credit Card Fraud Detection data set and falls into the category of anomaly detection (also called outlier detection), a special case of binary classification in which one class (here, the fraud cases) is much rarer than the other (the non-fraud cases).
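To make the class-imbalance point concrete, here is a minimal sketch using hypothetical labels (the ~0.2% fraud rate is an assumption for illustration, not a figure from the actual data set). It shows why plain accuracy is a misleading metric for anomaly detection: a model that always predicts "not fraud" still scores near-perfect accuracy.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical labels mimicking a fraud data set: roughly 0.2% positives.
n = 100_000
y = (rng.random(n) < 0.002).astype(int)

fraud_rate = y.mean()
print(f"fraud rate: {fraud_rate:.4%}")

# A trivial "model" that predicts non-fraud for every transaction
# still achieves very high accuracy, which is exactly why metrics
# like precision, recall, or AUPRC matter more here.
accuracy_all_negative = (y == 0).mean()
print(f"accuracy of always predicting non-fraud: {accuracy_all_negative:.4%}")
```

This is why evaluation for this kind of data usually focuses on how well the rare class is recovered rather than on overall accuracy.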
Why the heck am I doing this to myself?
Excellent question. This might be a terribly stressful idea and experience. Only one way to find out though.
Deadlines and external structure are among the big reasons why university or bootcamps often work better than self-teaching. I went to university and got my degrees, but eventually it was time for me to get a full-time job. I am, however, still far from having learned everything I want to learn, so now I’m experimenting with goal setting. In the past I set monthly goals, but it was easy to fall off the bandwagon with those.
I personally love challenges. I did Advent of Code last December, for example. One could also argue that my insisting on double majoring at university was a challenge, as was running a half marathon as my first ever running event. There is just something about things that seem just barely out of reach (or are they within reach?) that excites me immensely.
I fail some of these challenges because sometimes the preparation just wasn’t great, or too many things were going on in my life. And sometimes the challenge was simply more than I could handle.
Will this one be a success?