A machine learning project using Statcast data to predict batter performance with a direction-aware expected weighted on-base average (xwOBA) model.
Click the button below to generate a random plate appearance and see its predicted xwOBA compared to the actual outcome.
Actual wOBA values:
0.855 = Single, 1.248 = Double, 1.575 = Triple, 2.014 = Home Run
To begin, I built a Random Forest model with scikit-learn that predicts xwOBA from two primary inputs: exit velocity (launch speed) and launch angle. The live demonstration above runs a snapshot of this model's logic. I trained the model on a large Statcast dataset, which is common practice in baseball analytics for understanding how batted-ball characteristics translate into offensive value.
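As a rough illustration of the two-feature setup described above, here is a minimal sketch using scikit-learn's `RandomForestRegressor`. The training data is synthetic and the toy target function, hyperparameters, and the helper name `predict_xwoba` are assumptions made for this example, not the project's actual data or configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for Statcast batted balls (NOT real data).
n = 2000
launch_speed = rng.uniform(60, 115, n)  # exit velocity, mph
launch_angle = rng.uniform(-30, 60, n)  # degrees

# Toy target (an assumption): values peak for hard-hit balls near a
# 25-degree launch angle and are capped at the home run value of 2.014.
woba_value = np.clip(
    2.0 * np.exp(-((launch_angle - 25) / 12) ** 2) * (launch_speed - 60) / 55,
    0.0,
    2.014,
)

# Two input features, as in the text: exit velocity and launch angle.
X = np.column_stack([launch_speed, launch_angle])
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=20, random_state=0)
model.fit(X, woba_value)


def predict_xwoba(speed, angle):
    """Return the predicted xwOBA for one batted ball (hypothetical helper)."""
    return float(model.predict(np.array([[speed, angle]]))[0])
```

Because a random forest averages training targets within each leaf, its predictions stay inside the range of wOBA values seen in training, which is a convenient property for a bounded metric like this one.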

This scatter plot shows the relationship between a batted ball's launch angle, exit velocity, and wOBA value. Warmer colors (red/orange) indicate a higher wOBA and cluster in the barrel zone, where launch angle and exit velocity are both near optimal.
The next phase of the project was a model that predicts a given player's overall wOBA for a season. wOBA is a crucial metric in modern baseball, and MLB itself uses a similar approach to generate player-specific statistics like xwOBA. By evaluating all of a player's batted balls over a season, these models provide a more stable and predictive measure of offensive skill than traditional batting average or slugging percentage.
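One simple way to roll per-ball predictions up to a season figure is to average them for each player and season. The sketch below assumes each batted ball is a dictionary with hypothetical keys `batter`, `season`, and `xwoba`. It covers batted balls only, whereas a full wOBA calculation also accounts for events such as walks and hit-by-pitches.

```python
from collections import defaultdict


def season_xwoba(batted_balls):
    """Average per-ball xwOBA predictions for each (batter, season) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of xwOBA, ball count]
    for ball in batted_balls:
        key = (ball["batter"], ball["season"])
        totals[key][0] += ball["xwoba"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up values for illustration only.
sample = [
    {"batter": "A", "season": 2023, "xwoba": 0.9},
    {"batter": "A", "season": 2023, "xwoba": 0.1},
    {"batter": "B", "season": 2023, "xwoba": 1.2},
]
season_averages = season_xwoba(sample)
```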

The graph above shows the correlation between each player's actual wOBA and the xwOBA predicted by the initial model, which serves as a check on the model's accuracy.
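A comparison like this can be summarized numerically with a correlation coefficient and an error measure. The following is a minimal sketch; the function names are illustrative, and the choice of Pearson's r and RMSE is an assumption rather than the metric the project reports.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Here `actual` would hold each player's season wOBA and `predicted` the matching xwOBA. A high correlation with a low RMSE suggests the model tracks real outcomes closely.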
My key hypothesis was that the direction of a batted ball has a meaningful effect on its expected outcome, which the base model does not capture. In particular, I expected pulled balls to have a different expected outcome than balls hit to the opposite field. By adding a feature that distinguishes between these directions based on the batter's handedness, I was able to improve the model's accuracy. The improvement was small, a 4% increase in accuracy, but in data analytics even a small gain can have a large impact. This demonstrates that while MLB's current models are excellent, there is always room for refinement by incorporating more nuanced data points.
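A direction feature of this kind could be derived roughly as follows. This sketch assumes Statcast's `hc_x`/`hc_y` hit coordinates and `stand` (batter handedness) fields, a commonly used approximation for home plate's position in that coordinate system, and a hypothetical 15-degree cutoff. None of these are confirmed as the project's actual choices.

```python
import math

# Assumed approximate location of home plate in hc_x / hc_y coordinates.
HOME_X, HOME_Y = 125.42, 198.27
PULL_THRESHOLD_DEG = 15.0  # hypothetical cutoff between center and pull/opposite


def spray_angle(hc_x, hc_y):
    """Degrees from center field: negative toward left field, positive toward right."""
    return math.degrees(math.atan2(hc_x - HOME_X, HOME_Y - hc_y))


def hit_direction(hc_x, hc_y, stand):
    """Classify a batted ball as 'pull', 'center', or 'opposite' for the batter."""
    angle = spray_angle(hc_x, hc_y)
    # Mirror left-handed batters so that a negative angle always means "pulled".
    if stand == "L":
        angle = -angle
    if angle < -PULL_THRESHOLD_DEG:
        return "pull"
    if angle > PULL_THRESHOLD_DEG:
        return "opposite"
    return "center"
```

The resulting category (or the mirrored angle itself) can then be supplied to the model as an extra input alongside exit velocity and launch angle.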