Winners of the Kaggle x Google Cloud & NCAA® March Madness Analytics Competition

The Fung Institute for Engineering Leadership

Jun 4, 2020

The power of data and machine learning tools can help us understand and make decisions for just about anything — whether it’s regarding health, finance, or in this case, sports.

This group of graduate engineering students — two Master of Engineering (MEng) and two Master of Science (MS) — at UC Berkeley saw the opportunity to leverage the power of data and machine learning in this unique analytics competition.

Online collaboration between team members Michael Karpe, Remi Thai, Emilien Etchevers, Haley Wohlever, and Kieran Janin

About the Competition

In the Google Cloud & NCAA® March Madness Analytics Competition hosted through Kaggle, teams were challenged to utilize machine learning techniques to conduct exploratory data analysis and uncover the “madness” of the famous men’s and women’s college basketball tournaments.

The teams needed to define and quantify the “madness” in order to tell a clear data story of its role in the tournament.

The MEng Data Story

After carefully evaluating the provided data, the MEng team concluded that entertainment value and excitement were key to analyzing the “madness” of the games. To quantify the entertainment value of the games, they developed metrics on leaderboard switches, suspense (the last moment the winning team took the lead), and point spread. To further measure the excitement surrounding the games, they added relevant Twitter data to their project and conducted a sentiment analysis of tweets posted during the tournament.
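As a rough illustration of how the three entertainment metrics could be computed from play-by-play scoring data, here is a minimal Python sketch. It is not the team's actual code. The `ScoringEvent` type, the `game_metrics` function, and the team names are hypothetical, and the sketch assumes that a "leaderboard switch" means a change in which team is ahead, that "suspense" is the game time at which the eventual winner last moved in front, and that point spread is the final margin.

```python
from typing import NamedTuple


class ScoringEvent(NamedTuple):
    """Hypothetical play-by-play record: one made basket or free throw."""
    seconds_elapsed: int  # game time since tip-off
    team: str             # team that scored
    points: int           # 1, 2, or 3


def game_metrics(events, home, away):
    """Return lead changes, suspense, and point spread for one game."""
    score = {home: 0, away: 0}
    ahead = None          # team currently in front, None while tied
    last_leader = None    # most recent team to have held a lead
    lead_changes = 0
    took_lead_at = {home: None, away: None}

    for ev in sorted(events, key=lambda e: e.seconds_elapsed):
        score[ev.team] += ev.points
        if score[home] > score[away]:
            now_ahead = home
        elif score[away] > score[home]:
            now_ahead = away
        else:
            now_ahead = None

        # A team "takes the lead" when it moves from tied/behind to ahead.
        if now_ahead is not None and now_ahead != ahead:
            took_lead_at[now_ahead] = ev.seconds_elapsed
            # Count a lead change only when the lead passes to the other team.
            if last_leader is not None and now_ahead != last_leader:
                lead_changes += 1
            last_leader = now_ahead
        ahead = now_ahead

    winner = ahead  # None if the score is tied after the last event
    return {
        "winner": winner,
        "lead_changes": lead_changes,
        "suspense_seconds": took_lead_at[winner] if winner else None,
        "point_spread": abs(score[home] - score[away]),
    }


# Made-up events for two hypothetical teams.
sample = [
    ScoringEvent(30, "Team A", 2),
    ScoringEvent(60, "Team B", 3),
    ScoringEvent(90, "Team A", 1),
    ScoringEvent(120, "Team B", 2),
    ScoringEvent(150, "Team A", 3),
]
print(game_metrics(sample, "Team A", "Team B"))
```

Under these definitions, a game with many lead changes, a late suspense time, and a small point spread would rank as more entertaining.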

These plots depict how the difference in seed between teams impacts scoring trends throughout the game.

We were able to interview the team — Michael Karpe (IEOR), Remi Le Thai (ME), Emilien Etchevers (IEOR), Haley Wohlever (ME), and Kieran Janin (CEE) — about their process through this competition.

How did the team get interested in this competition?

We found out about this competition while deciding on a final project topic for the introduction to machine learning class (IEOR 242: Applications in Data Science) we were enrolled in. When we found this competition in early February, it sparked our interest because March Madness was set to begin and end within the semester, which would have allowed us to get live updates, and thus see how our models would play out, as the tournament progressed. While, of course, COVID-19 prevented the NCAA games from being held this year, we still enjoyed analyzing tournament data from previous seasons.

A heat map of the number of Twitter mentions each team received throughout the 2019 season, and when those mentions took place, shows that each date on which a team was mentioned frequently corresponds to a tournament event involving that team.
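The counts behind a heat map like the one described above could be assembled roughly as follows. This is a sketch under assumptions, not the team's pipeline: the `mention_counts` function, the tweet format (a date paired with the tweet text), and the team handles are all hypothetical, and matching a handle by plain substring search is a deliberate simplification.

```python
from collections import Counter
from datetime import date


def mention_counts(tweets, team_handles):
    """Count tweets mentioning each team on each date.

    tweets: iterable of (date, text) pairs (hypothetical format).
    team_handles: mapping of team name to the handle to search for.
    Returns a Counter keyed by (team, date).
    """
    counts = Counter()
    for day, text in tweets:
        lowered = text.lower()
        for team, handle in team_handles.items():
            # Case-insensitive substring match; each tweet counts once per team.
            if handle.lower() in lowered:
                counts[(team, day)] += 1
    return counts


# Made-up tweets and handles for illustration only.
handles = {"Team A": "@TeamA", "Team B": "@TeamB"}
tweets = [
    (date(2019, 4, 6), "What a finish by @TeamA!"),
    (date(2019, 4, 6), "@TeamA vs @TeamB tonight"),
    (date(2019, 4, 8), "@teamb wins it all"),
]
counts = mention_counts(tweets, handles)
for (team, day), n in sorted(counts.items()):
    print(team, day.isoformat(), n)
```

Each (team, date) pair becomes one cell of the heat map, so a spike in a cell can be checked against the tournament schedule for that date.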

How did you decide on the final idea for your data story?

Our team convened many times before choosing the path for our story, and it eventually came down to us wanting to make something original and deciding what kind of data we could leverage to best optimize our models. We realized that Twitter was not only a readily available and robust data source, but would allow us to analyze and potentially predict people’s reactions to the games — something we felt was key in measuring madness.

What were some of the biggest challenges you faced getting to your solution?

There were a few main challenges we faced in coming to our solution: the subjectivity and openness of the prompt, and the massive amount of available data.

In most data science classes and projects, the objective is fairly clear, and the relevant data has been narrowed down for you. In this case, the problem statement was purposefully vague, and we were swimming in data, starting from 2015 and detailing the matches down to specific moves that occurred in each game. We had to make a lot of careful decisions about what factors to include in order to tell an accurate and understandable story.

Has this competition changed your knowledge of how data is used in industry today?

Participating in this competition brought to our attention a couple of things: firstly, that presentation is key, and secondly, to be careful when reading data analysis.

Industries want to have models that are accessible, so that a general audience can understand the message they are trying to get across. The visualization of data is a powerful way to portray information about a topic and often sways people’s likelihood to agree with an insight. While going through various data reports, at times we found ourselves drawn to and impressed by certain presentations, but, on looking carefully at their content, came to understand that not much had truly been uncovered.

Which brings us to how careful you have to be when reading data analysis — if you are really looking for something in data, there’s probably a way to “find” it. For this reason, it is always worth asking yourself: how important are these results for this company? for this message? This can help you — as it has helped us — make more informed decisions when interpreting presented data analyses.

Has this competition or process of using various machine learning (ML) techniques influenced any of your professional goals?

Michael: I have been studying ML for a few years, but this competition brought to light issues in ML I hadn’t previously encountered. Focusing on the analytical side of ML helped me consider the various factors that come into play when converting theoretical learning to real-world application.

Remi: Before I took this course, I had nothing to do with data science. This project was extremely educational and gave me practical insight into how to use data to tell a story. I hope to combine what I’ve learned here with my mechanical engineering background in the future.

Emilien: Before this, I wasn’t sure about pursuing ML/data analysis — my previous work focused on innovation management. However, this project and class made me realize that this is a direction I’d like to explore in industry.

Haley: I have a background in mechanical engineering and this was the first data science/ML class I’ve taken. I am currently on track to complete my PhD, and this class and competition have given me a great taste of how I might be able to apply ML concepts to my research.

Kieran: I come from a civil engineering background and had never before used data analytics techniques in my work. This experience of learning how to harness data for a bigger picture is something I’ll definitely be looking for in future job prospects.

Edited by Shivani Lamba.

Connect with the team: Michael Karpe, Remi Thai, Emilien Etchevers, Haley Wohlever, and Kieran Janin