# Predicting the 2023 NBA MVP with Machine Learning

The 2022–23 NBA regular season has finished, players have recorded their final regular season stats, and we’ve got all the information needed to predict the NBA Most Valuable Player award.

Most years, there is a standout candidate whose performances were head and shoulders above the rest, leading to an anticlimactic announcement that everybody saw coming.

But 2022–23 is not most seasons.

This year, the finalists are Nikola Jokić, Joel Embiid and Giannis Antetokounmpo, all putting up some of the greatest individual seasons of all time in the same year. It’s unprecedented.

It’s one of the tightest races for the award ever, but don’t take my word for it: Charles Barkley *guaranteeeed* on Inside the NBA, “This is going to be the closest vote ever, in my opinion”, and we all know that Chuck is usually right about things.

*Sidenote: I’m a diehard Celtics fan, and while Jayson Tatum has had an almost-MVP worthy season, if there’s any way I can twist the data to argue that he should win, I’m going to do that (I couldn’t).*

Keep reading, because we’re going to use the AI & Analytics Engine to predict who will win the 2022–23 NBA MVP, using the power of machine learning, without needing a single line of code.

# The problem with the MVP award

The MVP trophy is the most prestigious individual award given to a player. But what are the criteria for being the MVP? It should be an easy question, and there should be clear, consistent parameters defining what constitutes the MVP, so the award can be given fairly every year (there aren’t).

The issue is trying to define “Valuable”, because there are many different opinions:

- The best player in the league?
- The most impactful player in the league?
- The best player on the best team?
- The most valuable player to his team?
- The player that contributes to winning the most?

There’s not really a clear answer. But here’s what we do know: narratives play a big part. Voter fatigue is a real thing, otherwise Michael Jordan and LeBron James would have just kept winning year after year. But people get bored, and it’s a shame to see an all-time great finish their career without one on their resume. Often, the question isn’t really “Who’s the MVP?” but rather “Whose turn is it to win MVP?”.

This makes using machine learning a bit tricky, because narrative is impossible to measure. So the side quest we’ll also be going on is answering the question “Which metrics are most important in winning the MVP?”.

# The 2023 NBA MVP candidates

The problem this year is that each candidate has their own claim: Jokić is the most efficient, Embiid is the most dominant, and Giannis won the most. Even so, there’s not much separating them in each category.

There’s one thing that Embiid has that the other two don’t though: narrative. Jokić has won the award the previous two years, and Giannis won the two years before that. But Embiid has never won it, managing only to come second the last two years.

We’ve seen ex-players turned media personalities like Rajon Rondo and Jalen Rose say Embiid is their choice, while JJ Redick has chosen Giannis, citing “the best player on the best team”.

If you’re unaware or need a refresher, these are some summary statistics (games played, points/rebounds/assists per game and team wins) for each player, to give you an idea of their respective seasons.

# Using machine learning to predict the MVP

The first thing we have to do is understand how the MVP is decided. Each year, 100–130 members of the NBA media vote. Each voter ranks five players, who receive 10, 7, 5, 3 and 1 points for first through fifth place. The points are tallied, and the player with the most points wins.

There are two approaches we can take to predicting the winner with machine learning:

**Classification:** The first way is a classification method. We define a target column called “Is_MVP”, which contains `1` if the player is an MVP and `0` otherwise. The problem with building the training data this way is that it is heavily imbalanced. Each season there are hundreds of players but only a single MVP, so across the 40 seasons in our data we would have only 40 positive labels. This creates technical difficulties in training and evaluation.

**Regression:** The second way is a regression method, which predicts a number. Because the number of voters changes each season, we can use the metric “MVP_award_share”, which is the number of votes a player received divided by the number of possible votes. This works much better, because each year there are about 10–20 players who receive at least one vote.
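To make the award share idea concrete, here is a minimal Python sketch. It assumes the usual ballot weighting of 10, 7, 5, 3 and 1 points for first through fifth place, and that the maximum possible total is every voter ranking the player first. The function name and the ballot breakdown are made up for illustration and are not taken from the dataset.

```python
# Points a voter awards for 1st through 5th place on a ballot (assumed weighting).
POINTS_PER_PLACE = (10, 7, 5, 3, 1)


def award_share(place_counts, num_voters):
    """Return points won divided by the maximum points available.

    place_counts: number of 1st, 2nd, 3rd, 4th and 5th place votes received.
    num_voters: number of ballots cast that season.
    """
    points = sum(p * c for p, c in zip(POINTS_PER_PLACE, place_counts))
    # Maximum possible: every voter puts the player first.
    max_points = POINTS_PER_PLACE[0] * num_voters
    return points / max_points


# Hypothetical ballot breakdown, made up for illustration only.
example_share = award_share(place_counts=(60, 30, 10, 0, 0), num_voters=100)
print(round(example_share, 3))  # 0.86
```

Because the share is a ratio, a season with 100 voters and a season with 130 voters end up on the same 0 to 1 scale, which is what makes it usable as a regression target.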

It’s worth noting that we’re not making any decisions about what *valuable* means, or who had the objectively best season. We’re looking at which statistics seem to correlate with being voted MVP in the past (side quest), and according to those criteria, predicting which player this year had the most *MVP-ish* season.

# The dataset

The data is taken from this dataset, which scraped Basketball Reference and contains every player’s stats for every season from 1982–2022. There are a few groups of statistics that we’ll be using to predict the MVP share variable:

## Games played and winning percentage

Some humble but important stats: the number of games that a player plays, and how many games their team won. There is a slight problem with both, because there have been four seasons since 1982 in which the total games played was less than 82 (two were due to lockouts, and two were due to COVID). I therefore adjusted both to be a percentage of the possible games available.
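The adjustment is just a division, but a small sketch shows why it matters. The function names are hypothetical, and the 72-game schedule is an assumed example of a shortened season rather than a figure from the dataset.

```python
def games_played_pct(games_played, team_games):
    """Fraction of the team's games that the player appeared in."""
    if team_games <= 0:
        raise ValueError("team_games must be positive")
    return games_played / team_games


def win_pct(wins, losses):
    """Fraction of its games that the team won."""
    return wins / (wins + losses)


# Hypothetical player-seasons: the same 60 appearances, different schedule lengths.
full_season = games_played_pct(60, 82)   # standard 82-game schedule
short_season = games_played_pct(60, 72)  # shortened schedule (assumed 72 games)
print(f"{full_season:.3f} vs {short_season:.3f}")
```

Sixty appearances look quite different once the schedule length is taken into account, so raw game counts would unfairly penalise players from the shortened seasons.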

## Counting stats per game

These are the stock-standard basketball statistics: how many points, *rebounds, assists, steals, blocks, and turnovers* they averaged a game. You get the gist.

## Percentage stats per game

There’s a bit of an issue when discussing per-game stats. The pace of the league changes over eras: in the 1980s and 2020s, teams play quickly, whereas the 1990s were slow. Players in higher-paced eras have more possessions in which to record per-game stats, so including percentage versions of the per-game stats helps adjust for this.

## Shooting accuracy

*True Shooting% (TS%)* is a statistic that takes into account that 3-pointers are worth more than 2-pointers, and incorporates free throw accuracy. In an ideal world, we’d adjust different eras by using *True Shooting% relative to league average (TS%+)*, but that data wasn’t in the dataset I used. Ahh well, not a big deal.
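For the curious, here is a rough sketch of how TS% is computed. The formula isn’t given in this post, so treat it as my assumption: it is the commonly published definition, with 0.44 as the usual weighting for free throw attempts. The stat line is made up.

```python
def true_shooting_pct(points, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)), the commonly published formula."""
    shooting_attempts = 2 * (fga + 0.44 * fta)
    if shooting_attempts == 0:
        return 0.0  # no attempts, so nothing to measure
    return points / shooting_attempts


# Hypothetical per-game line: 30 points on 20 field goal attempts and 10 free throws.
ts = true_shooting_pct(points=30, fga=20, fta=10)
print(f"{ts:.3f}")
```

Because points already include threes and free throws, a player who scores the same points on fewer attempts gets a higher TS%, regardless of how those points were scored.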

## Advanced metrics

Ahh yes, advanced metrics, the phrase that makes NBA old-heads shudder. These are various metrics created by data scientists with a love for sports, with the goal of quantifying how good a sports player is. The ones we’ll be using are:

- *Player Efficiency Rating (PER):* A measure of per-minute production standardized such that the league average is 15.
- *Win Shares (WS, OWS, DWS):* An estimate of the number of wins contributed by a player. This also has offense and defence variations.
- *Box Plus/Minus (BPM, OBPM, DBPM):* A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. This also has offense and defence variations.
- *Value Over Replacement Player (VORP):* A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season.

You don’t need to know exactly how they’re all calculated and the differences between them, because it’s not really important. Just remember that generally, the higher the number = the better they played.

# The methodology

## Data filtering

The first step is processing the training data. With over 17,000 entries, player-seasons with a non-zero MVP share represented around 3% of the data, so it’s worth filtering. The immediate thing that comes to mind is filtering out player-seasons with few games played or few minutes per game. So I limited the training data to only those with:

- Greater than 30 minutes per game
- Greater than 60% of games played (equivalent to 49 in a regular 82 game season)

This reduced the training dataset to 3,500 entries, where non-zero MVP share represented 15% of the total. For each season, there were roughly 80–90 qualifying players. One interesting note is that the number of players who got a vote has trended downwards, meaning voters have moved closer to a consensus in the last decade.
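The two filtering rules can be sketched in a few lines of plain Python. The rows and field names below are hypothetical stand-ins for the real dataset columns, which aren’t listed in this post.

```python
# Hypothetical player-season rows; field names and values are made up.
player_seasons = [
    {"player": "Starter A", "minutes_per_game": 35.2, "games_played_pct": 0.95, "mvp_award_share": 0.40},
    {"player": "Starter B", "minutes_per_game": 33.0, "games_played_pct": 0.55, "mvp_award_share": 0.0},
    {"player": "Bench C", "minutes_per_game": 18.4, "games_played_pct": 0.98, "mvp_award_share": 0.0},
    {"player": "Starter D", "minutes_per_game": 31.1, "games_played_pct": 0.80, "mvp_award_share": 0.0},
]

MIN_MINUTES_PER_GAME = 30
MIN_GAMES_PLAYED_PCT = 0.60


def qualifies(row):
    """Keep only player-seasons above both thresholds."""
    return (
        row["minutes_per_game"] > MIN_MINUTES_PER_GAME
        and row["games_played_pct"] > MIN_GAMES_PLAYED_PCT
    )


training_rows = [row for row in player_seasons if qualifies(row)]

# Share of the remaining rows that received at least one MVP vote.
vote_getters = sum(1 for row in training_rows if row["mvp_award_share"] > 0)
share_with_votes = vote_getters / len(training_rows)
print([row["player"] for row in training_rows], share_with_votes)
```

Dropping low-minute and low-availability seasons removes rows that could never realistically earn a vote, which is why the proportion of non-zero targets goes up.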

## Machine learning with the AI & Analytics Engine

With the data ready, it was time to go to the AI & Analytics Engine.

The first step is to upload the training data, which the machine learning models use to learn how to predict the unknown 2023 data. As mentioned, this is a regression problem, because we are predicting a numerical value for the MVP award share column.

The next step is to define the feature set, the predictors that matter. It was important to deselect the player information columns, like name and team, so the models learn from performance stats rather than picking up spurious patterns tied to particular players or teams.

## Model training

The next stage is building the models. I built three different models, each using a different tree-based machine learning algorithm that trains on the data in its own way. Despite having fairly similar prediction quality (R² scores), they work differently and produce different results, so the average of the three will be taken.
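If the R² score is unfamiliar, here is a minimal sketch, assuming the Engine reports the standard coefficient of determination. The award share values are made up for illustration.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (otherwise SS_tot is zero).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up award shares for four player-seasons.
actual = [0.0, 0.1, 0.5, 0.9]
perfect = r2_score(actual, actual)  # predictions match exactly
baseline = r2_score(actual, [sum(actual) / len(actual)] * len(actual))  # always guess the mean
print(perfect, baseline)
```

A score of 1 means the predictions match perfectly, while a score of 0 means the model does no better than always guessing the average award share.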

## Getting the predictions

After each model has trained, the last step is to upload the 2022–23 season test data, where the MVP share is obviously unknown. The Engine spits out a CSV file. We repeat that process for each of the three models, then put the results into a spreadsheet.
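I did the averaging in a spreadsheet, but the same step can be sketched in Python. The model labels, player labels and predicted shares below are made up and are not the Engine’s actual outputs.

```python
from statistics import mean

# Made-up predicted award shares, NOT the real model outputs.
predictions = {
    "model_a": {"Player X": 0.60, "Player Y": 0.50, "Player Z": 0.30},
    "model_b": {"Player X": 0.40, "Player Y": 0.55, "Player Z": 0.25},
    "model_c": {"Player X": 0.45, "Player Y": 0.60, "Player Z": 0.35},
}


def average_predictions(per_model):
    """Average each player's predicted share across all models."""
    # Only keep players that every model produced a prediction for.
    players = set.intersection(*(set(preds) for preds in per_model.values()))
    return {
        player: mean(preds[player] for preds in per_model.values())
        for player in players
    }


averaged = average_predictions(predictions)
# Rank players from highest to lowest averaged award share.
ranking = sorted(averaged, key=averaged.get, reverse=True)
print(ranking)
```

Averaging smooths out the quirks of any single model, so a player that one model loves and another ignores lands somewhere in between.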

# Hurry up and get to the results already

## Which stats correlate to winning an MVP

Now for that little side quest from earlier: let’s investigate the stats that MVP players tend to be really strong in historically. The Feature Importance tab in the Engine allows us to see exactly how much each feature impacts each model. Here are the results.

Win shares, player efficiency rating and win-loss% are all leading indicators, being in the top four for all three models. The XGBoost and LightGBM regression models both have PPG as the 3rd most impactful, whereas in the extremely randomized trees model it ranks 6th (that plays a big part in each model’s predictions).

## The Winner of the 2023 NBA MVP will be…

Alright, finally the results. Here they are.

We see that the extremely randomized trees model heavily favours Jokić, whereas the XGBoost and LightGBM regressions both moderately favour Embiid. All three consider Giannis to have had a strong, but not quite *MVP-ish* season.

After all of that work, there’s still only a marginal difference splitting Embiid and Jokić in the average of all three models. In my opinion, it’s going to Embiid, due to a factor that machine learning can’t possibly quantify: the narrative. He’s been so close the past two years, it’s hard to see it not going to him.

*NOTE: The MVP has just been announced, and Joel Embiid has indeed won with 0.915 MVP award share, Jokić coming second with 0.674, and Giannis third with 0.606. I’m also happy to say Jayson Tatum was the consensus fourth place, with 0.280.*

*This post is part of my ongoing series of blog articles on using machine learning algorithms in the AI & Analytics Engine to predict sport events and results.*

**Predicting the 2022 World Cup with Machine Learning**

Which nation will win the world's largest sporting event?

**Predicting the 2023 NBA MVP with Machine Learning**

Who will claim MVP honors in one of the tightest races in memory?

*Check them out if you're interested! And if you have any requests, let me know, I'm available on LinkedIn.*