A machine learning project using Statcast data to predict batter performance with a direction-aware expected weighted on-base average (xwOBA) model.
Click the button below to generate a random plate appearance and see its predicted xwOBA compared to the actual outcome.
Actual wOBA values:
0.855 = Single, 1.248 = Double, 1.575 = Triple, 2.014 = Home Run
To begin, I created a Random Forest model using scikit-learn that predicts xwOBA from two primary inputs: exit velocity (launch speed) and launch angle. The live demonstration above is a snapshot version of this model's logic. I trained the model on a large dataset of Statcast batted balls, a common practice in baseball analytics for understanding how batted-ball characteristics translate into offensive production.
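As a rough illustration of the two-feature setup described above, here is a minimal sketch. It assumes a `RandomForestRegressor` (the model could equally be a classifier over outcomes) and uses synthetic data in place of real Statcast rows; the toy target function, sample size, and hyperparameters are all made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Statcast batted balls (NOT real data).
rng = np.random.default_rng(0)
n = 2000
launch_speed = rng.uniform(60, 115, n)  # exit velocity, mph
launch_angle = rng.uniform(-30, 60, n)  # degrees

# Toy target: value peaks for hard contact near a 25-degree launch angle.
woba_value = np.clip(
    2.0 * np.exp(-((launch_angle - 25) / 12) ** 2) * (launch_speed - 60) / 55,
    0.0,
    2.014,
)

X = np.column_stack([launch_speed, launch_angle])
X_train, X_test, y_train, y_test = train_test_split(
    X, woba_value, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative, not tuned.
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)  # R^2 on held-out batted balls
hard_line_drive = model.predict([[105.0, 25.0]])[0]
weak_grounder = model.predict([[70.0, -10.0]])[0]
print(f"R^2: {r2:.3f}")
print(f"105 mph at 25 deg: {hard_line_drive:.3f}")
print(f"70 mph at -10 deg: {weak_grounder:.3f}")
```

With real data, the two input columns would correspond to the exit velocity and launch angle fields of each batted ball, and the target would be the wOBA value of the actual outcome.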

This scatter plot shows the relationship between a batted ball's launch angle, exit velocity, and its wOBA value. The warmer colors (red/orange) indicate a higher wOBA, demonstrating the barrel zone of optimal launch angle and exit velocity.
The next phase of the project involved creating a model that predicts a given player's overall wOBA over a season. wOBA is a crucial metric in modern baseball, and MLB itself uses a similar approach to generate player-specific statistics like xwOBA. By evaluating all of a player's batted balls over a season, these models provide a more stable and predictive measure of offensive skill than traditional batting average or slugging percentage.
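The season-level step can be sketched as a simple average over each player's batted balls. The snippet below is a simplified illustration: it uses the wOBA values listed on this page, assumes outs are worth 0, ignores walks and strikeouts, and the player names, season, and predictions are invented.

```python
from collections import defaultdict

# wOBA values per outcome as listed on this page; outs are assumed to be worth 0.
WOBA_VALUE = {
    "single": 0.855,
    "double": 1.248,
    "triple": 1.575,
    "home_run": 2.014,
    "out": 0.0,
}

def season_summary(batted_balls):
    """Average actual and predicted wOBA per (player, season).

    batted_balls: iterable of (player, season, outcome, predicted_xwoba).
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # actual sum, predicted sum, count
    for player, season, outcome, predicted in batted_balls:
        entry = totals[(player, season)]
        entry[0] += WOBA_VALUE[outcome]
        entry[1] += predicted
        entry[2] += 1
    return {
        key: {"woba": actual / count, "xwoba": pred / count, "batted_balls": count}
        for key, (actual, pred, count) in totals.items()
    }

# Invented rows for illustration only.
sample = [
    ("Player A", 2023, "single", 0.60),
    ("Player A", 2023, "out", 0.20),
    ("Player A", 2023, "home_run", 1.70),
    ("Player B", 2023, "out", 0.10),
    ("Player B", 2023, "double", 0.90),
]
summary = season_summary(sample)
for key, stats in summary.items():
    print(key, stats)
```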

The graph above shows the correlation between a player's actual wOBA and the xwOBA predicted by the initial model, a great way to validate the model's accuracy.
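One way to put numbers on that comparison is to compute the Pearson correlation and root-mean-square error between each player's actual wOBA and predicted xwOBA. This is a minimal sketch; the function names and the sample values are illustrative, not results from the project.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rmse(xs, ys):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Invented per-player season values for illustration only.
actual_woba = [0.310, 0.345, 0.290, 0.400, 0.365]
predicted_xwoba = [0.320, 0.335, 0.300, 0.385, 0.350]

print(f"r = {pearson_r(actual_woba, predicted_xwoba):.3f}")
print(f"RMSE = {rmse(actual_woba, predicted_xwoba):.4f}")
```

A high correlation with a low RMSE indicates that the model both ranks players correctly and lands close to their actual values.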
My key hypothesis was that the direction of a batted ball has a meaningful effect on its expected wOBA, which the base model does not capture because it sees only exit velocity and launch angle. Specifically, I expected pulled balls to have different expected outcomes than balls hit to the opposite field. By adding a feature that classifies each batted ball's direction relative to the batter's handedness, I was able to improve the model's accuracy. The improvement was small, a 4% increase in accuracy, but in data analytics even a small gain can have a large impact. This suggests that while MLB's current models are excellent, there is still room for refinement by incorporating more nuanced data points.
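A direction feature of this kind might be derived roughly as follows. This sketch is not the project's actual feature engineering: it assumes Statcast-style hit coordinates (`hc_x`, `hc_y`) and batter side (`stand`), uses a commonly cited community approximation for home plate's position (125.42, 198.27), and picks an arbitrary 15-degree band for "center".

```python
import math

# Assumed home plate position in Statcast hit-coordinate space
# (a widely used community approximation, not an official constant).
HOME_X, HOME_Y = 125.42, 198.27

def spray_angle(hc_x, hc_y):
    """Horizontal angle in degrees: negative = left field, positive = right field."""
    return math.degrees(math.atan2(hc_x - HOME_X, HOME_Y - hc_y))

def direction(hc_x, hc_y, stand, center_width=15.0):
    """Classify a batted ball as 'pull', 'center', or 'oppo'.

    stand: 'R' or 'L', the side of the plate the batter hit from.
    center_width: hypothetical half-width (degrees) of the 'center' band.
    """
    angle = spray_angle(hc_x, hc_y)
    # Right-handed batters pull to left field, left-handed batters to right field,
    # so flip the sign for righties to make positive mean "pull side".
    pull_side_angle = -angle if stand == "R" else angle
    if pull_side_angle > center_width:
        return "pull"
    if pull_side_angle < -center_width:
        return "oppo"
    return "center"

# The same ball toward left field is a pull for a righty and opposite field for a lefty.
print(direction(60.0, 120.0, "R"))
print(direction(60.0, 120.0, "L"))
print(direction(125.42, 100.0, "R"))
```

The resulting label (or the signed pull-side angle itself) can then be encoded and passed to the model alongside exit velocity and launch angle.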