Data Analysis (and Other Things): Using Machine Learning to Predict the NCAA Tournament

Every year around the middle of March, college basketball owns our country’s attention for the greatest event in all of sports: March Madness. Let’s be honest, the main reason March Madness garners so much attention beyond die-hard college basketball fans is the office bracket pool. People who otherwise wouldn’t have even known it was basketball season fill out their brackets based on team colors, mascots, or favorite cities they’ve visited. Others take a more analytical approach. This year, in an effort to gain an advantage in my office pool, I decided to combine my interest in college basketball with what I do professionally and use machine learning to fill out my bracket.

I hate when I hear somebody talking about how they predicted some crazy upset but fail to mention that they filled out seven different brackets. You didn’t predict anything, bud. My strategy has always been one bracket a year, and I usually fill it out in about 10 minutes so that I don’t overthink anything. This season, though, I’m going 100% with math and machine learning. No picking the hot teams, sentimental picks, or local favorites. Simply plugging in the numbers and seeing what comes out. This strategy has been used by statisticians for a while now, most notably by Ken Pomeroy, whose website KenPom.com has been ranking teams and predicting games as far back as 2002. For me, this is completely for learning purposes. If it were a foolproof strategy I'd be on my way to Vegas. After all, if every game were a coin flip, the odds of predicting a perfect bracket would be somewhere in the ‘1 in 9.2 quintillion’ range. Famous investor Warren Buffett, who previously offered $1 million to anyone who could predict a perfect bracket, has even upped his offer to $1 million a year for life to anyone who can predict the Sweet Sixteen (Berkshire Hathaway employees only). If I can correctly predict somewhere in the range of 70-75% of games, I’ll consider it a success. In testing throughout the season, my model was about 72% accurate.
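For the curious, the 9.2 quintillion figure is simple arithmetic. Here is a quick sketch in Python (used only for illustration; it is not the R code from this project), assuming a 64-team bracket with 63 games and treating every game as an independent 50/50 pick:

```python
# Assumption: 63 games in a 64-team bracket (play-in games ignored),
# each game treated as an independent coin flip.
GAMES_IN_BRACKET = 63


def perfect_bracket_outcomes(games: int = GAMES_IN_BRACKET) -> int:
    """Count the distinct ways to fill out a bracket of two-outcome games."""
    return 2 ** games


outcomes = perfect_bracket_outcomes()
print(f"1 in {outcomes:,}")  # 1 in 9,223,372,036,854,775,808
```

Real games are not coin flips, so anyone with some basketball knowledge faces better odds than this, but the number is still a useful worst case.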

The two technologies that I used for this project were R and Microsoft Azure Machine Learning. For those who don’t know, R is an open-source programming language that focuses on statistics. Azure Machine Learning is a cloud-based service that uses algorithms to learn from data and make predictions based on what it learns. In this experiment I used R to pull the data from the web, uploaded that data to Azure ML, and ran my machine learning experiment to predict the outcome of NCAA Tournament games. Going into this project I had virtually no hands-on experience with either of these tools. This was simply a way for me to learn some new technologies and have a little fun.

Now, let me tell you how I did it...

With any machine learning project that you work on, the data collection and cleansing stage will take up the bulk of your time. I wanted to automate the process of data collection so that I could run my R code each morning and have the latest game logs for every Division I game of the season in one unified file. For this experiment I pulled all of the data from www.sports-reference.com. I ultimately wanted to predict whether a team would win or lose a game, and this site provides all of the data that I needed to make that prediction. The data is also in HTML table format and easily extracted using the ‘XML’ library in R. This is the only source of data that I used for this experiment. I could have blended in other data sources, but for this first run I chose to keep it simple.
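To give a feel for what "extracting an HTML table" involves, here is a minimal sketch in Python using only the standard library. It is not the R code from this project. The sample HTML, the column names, and the helper names (`TableParser`, `parse_game_log`) are all made up for illustration, and a real page would need to be downloaded first and would have many more columns.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for a downloaded game-log page (invented values).
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Opponent</th><th>Result</th><th>Pts</th><th>FGA</th></tr>
  <tr><td>2017-01-07</td><td>Team A</td><td>W</td><td>75</td><td>52</td></tr>
  <tr><td>2017-01-11</td><td>Team B</td><td>L</td><td>75</td><td>68</td></tr>
</table>
"""


def parse_game_log(html: str) -> list:
    """Return one dict per game, keyed by the table's header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


game_log = parse_game_log(SAMPLE_HTML)
print(game_log[0]["Opponent"], game_log[0]["Result"])
```

Every value comes out as a string, so numeric columns still need to be converted before any modeling, which is part of why the cleansing stage takes so long.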

Once I was able to pull all of the data and get it into a format that I could work with in Azure ML, my next task was to choose which data points to use in my machine learning experiment. This task seems pretty straightforward, but there is a lot more that goes into it than you may think. As I said earlier, I wanted to be able to predict whether a team would win or lose a game. In Azure ML this is known as two-class classification because the outcome is binary (W or L). When looking at the dataset, I needed to find the data points that would help me most accurately predict whether a team would win or lose. This is where some basketball knowledge helps.
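As a small illustration of what "two-class" means in practice, the sketch below (Python, for illustration only) turns W/L results into 1/0 labels and measures accuracy as the share of predictions that match. The function names and the sample values are hypothetical.

```python
def encode_result(result: str) -> int:
    """Map a game result to a binary class label: 1 for a win, 0 for a loss."""
    labels = {"W": 1, "L": 0}
    key = result.strip().upper()
    if key not in labels:
        raise ValueError(f"Unexpected result: {result!r}")
    return labels[key]


def accuracy(predicted: list, actual: list) -> float:
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("Need two non-empty lists of equal length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Invented example: four actual results and four model predictions.
actual_labels = [encode_result(r) for r in ["W", "L", "W", "W"]]
predicted_labels = [1, 0, 0, 1]

print(actual_labels)                              # [1, 0, 1, 1]
print(accuracy(predicted_labels, actual_labels))  # 0.75
```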

If I handed you a box score at the end of a game without telling you which team won, you could probably determine the outcome for yourself by looking at a few key statistics. You might ask “Which team made more shots?” or “Which team got more rebounds?” The problem is that these are raw statistics that can vary greatly from game to game. A team may score 75 points in two separate games, but it may take them 68 shots to reach that mark in one game and 52 shots in the next. For this reason it is better to use statistics that have been normalized per possession or per opportunity rather than the standard box score statistics. This way we can determine how efficient a team was in a particular facet of the game.
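The 75-point example above works out like this. The sketch is in Python for illustration, and `points_per_shot` is a deliberately simplified, hypothetical measure (it ignores free throws and turnovers), not one of the data points used in the experiment:

```python
def points_per_shot(points: int, field_goal_attempts: int) -> float:
    """Points scored per field goal attempt (a simplified efficiency measure)."""
    if field_goal_attempts <= 0:
        raise ValueError("field_goal_attempts must be positive")
    return points / field_goal_attempts


game_one = points_per_shot(75, 68)  # 75 points on 68 shots
game_two = points_per_shot(75, 52)  # 75 points on 52 shots

print(f"{game_one:.2f} vs {game_two:.2f}")  # 1.10 vs 1.44
```

Same raw point total, but the second game was roughly 30% more efficient per shot, which the box score total alone would never show.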

For now, I’m not going to share the data points that I used. Maybe in the future...

After selecting my data points I was finally ready to run my experiment in Azure ML. As I said earlier, I used a two-class classifier to predict whether a team would win or lose. There are many algorithms to choose from when working with two-class classification. I tested a few but ultimately chose the Two-Class Neural Network. Artificial neural networks are loosely modeled on the networks of neurons in the human brain. The data points are passed in as input and progress through a hidden layer of interconnected nodes, where they are weighted and combined before being passed through as output. Those weights are what the network learns during training. Once the model is trained, it is scored and can be evaluated. The evaluation shows how well the experiment predicted the outcome of past games based on the training data that we fed it. Each prediction also comes with a score between 0 and 1 that indicates how confident the model is in that outcome.
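To make the input, hidden layer, and output idea concrete, here is a toy one-hidden-layer network written in plain Python. It is only a sketch of the general technique, not Azure ML's Two-Class Neural Network module and not the model from this experiment. The two features (per-game efficiency differences) and every number in the training set are invented, and the hyperparameters are arbitrary.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs: int, n_hidden: int, seed: int = 0) -> dict:
    """Small random weights so the hidden nodes start out different."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def forward(net: dict, x: list) -> tuple:
    """Inputs -> weighted hidden nodes -> one output probability of a win."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    prob = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden))
                   + net["b_out"])
    return hidden, prob


def train(net: dict, samples: list, labels: list,
          learning_rate: float = 0.5, epochs: int = 2000) -> None:
    """Per-sample gradient descent on log loss (backpropagation)."""
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            hidden, prob = forward(net, x)
            d_out = prob - y  # gradient at the output node
            for j, h in enumerate(hidden):
                # Uses the output weight from before this step's update.
                d_hidden = d_out * net["w_out"][j] * h * (1.0 - h)
                net["w_out"][j] -= learning_rate * d_out * h
                net["b_hidden"][j] -= learning_rate * d_hidden
                for i, xi in enumerate(x):
                    net["w_hidden"][j][i] -= learning_rate * d_hidden * xi
            net["b_out"] -= learning_rate * d_out


def log_loss(net: dict, samples: list, labels: list) -> float:
    total = 0.0
    for x, y in zip(samples, labels):
        _, p = forward(net, x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)


# Invented training set: [shooting efficiency diff, rebounding diff] per game.
samples = [[1.2, 0.5], [0.8, -0.3], [0.3, 0.9],
           [-0.7, -0.4], [-1.1, 0.2], [-0.4, -1.0]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = win, 0 = loss

net = init_network(n_inputs=2, n_hidden=3)
loss_before = log_loss(net, samples, labels)
train(net, samples, labels)
loss_after = log_loss(net, samples, labels)

probabilities = [forward(net, x)[1] for x in samples]
for x, p in zip(samples, probabilities):
    print(x, f"win probability {p:.2f}")
```

The final sigmoid is what keeps every output between 0 and 1, which is why the result can be read as a confidence in a win rather than just a W or L.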

This has been a trial-and-error process over the past couple of months. It took me a while to define the data points that gave me the most accurate predictions. The good thing is that with only a couple of minor changes I’ll be able to use this code again next season to refine the process and hopefully predict games more accurately. As for the final product, I’ll be posting that on Thursday after the first game tips. Maybe it will help me win my office pool, maybe I’ll lose to Sue in accounting. Either way, I’ve learned a lot that will help me in my job, and I already have a head start for next year’s tournament.


Sunday, March 12, 2017
