Building an Expected Goals Model in Youth Soccer

In this article, I attempt to build one of the first publicly available xG models for youth soccer in my role as the Head of Analysis for the Boston Bolts.

Methodology:

I started by building a standard xG model that uses shot distance and angle to compute the probability of a shot resulting in a goal (xG). Then, to account for some of the variance in shots, I added a “shot type” categorical variable, which was statistically significant for each category. The values for this variable are listed in the table below:

Shot Type     Explanation
Normal        A standard shot which doesn't fit any of the criteria listed below
First Time    A shot that is not a volley but is taken first time, with no touch in-between
Volley        A shot that is taken when the ball is in the air and without a touch in-between
1 on 1        A shot where the attacking player only has to beat the goalkeeper, there is no defender marking them.
Header        A shot taken with the head of an attacking player
FK            A direct free kick shot
Penalty       A penalty. Has a default value of 0.65.
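
For readers who want to see the moving parts, here is a minimal Python sketch of a distance-and-angle xG calculation with a shot type adjustment. It is not the code behind this model. It assumes a logistic regression form, an 8-yard-wide goal, and shot coordinates in yards, and every coefficient in COEF is a made-up placeholder rather than a fitted value. Only the fixed penalty value of 0.65 comes from the table above.

```python
import math

GOAL_WIDTH_YD = 8.0   # assumption: full-size goal; smaller youth goals change this
PENALTY_XG = 0.65     # fixed penalty value from the shot type table

# Hypothetical placeholder coefficients, NOT fitted values from any real data.
COEF = {
    "intercept": -1.0,
    "distance": -0.1,
    "angle": 1.5,
    "shot_type": {
        "Normal": 0.0, "First Time": 0.2, "Volley": -0.3,
        "1 on 1": 0.8, "Header": -0.6, "FK": -0.5,
    },
}

def shot_features(x, y):
    """Return (distance, angle) for a shot.

    x is the distance from the goal line in yards, y is the lateral
    offset from the center of the goal in yards. The angle is the
    opening between the two posts as seen from the shot location,
    in radians.
    """
    distance = math.hypot(x, y)
    half_width = GOAL_WIDTH_YD / 2
    angle = math.atan2(GOAL_WIDTH_YD * x, x**2 + y**2 - half_width**2)
    return distance, angle

def xg(x, y, shot_type, coef=COEF):
    """Logistic xG from distance, angle, and a shot type offset."""
    if shot_type == "Penalty":
        return PENALTY_XG
    distance, angle = shot_features(x, y)
    z = (coef["intercept"]
         + coef["distance"] * distance
         + coef["angle"] * angle
         + coef["shot_type"][shot_type])  # unknown types raise KeyError
    return 1 / (1 + math.exp(-z))
```

With these placeholder coefficients, a central shot from 6 yards gets a higher xG than the same shot from 20 yards, which is the behavior you would expect from any distance-and-angle model.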

Data:

In my research, I struggled to find much publicly available data, so I had to use some data from the professional ranks for shot types that don’t happen as often (think headers and free kicks). This isn’t ideal and is something I could improve in v2 of the xG model. For headers and free kicks, I used publicly available data from StatsBomb and Wyscout, two reputable data sources in the soccer analytics community. The rest of the data came from test data that I logged for the Bolts during the 2023-24 season. In the next section, I compare the actual probability of a goal with the expected probability that my model produced.

Performance

On a call, Grant Rhines, the Head of Data & Insights at San Diego FC, suggested that it would be a fun project to test my xG model against the actual goal rate. So, this is my attempt at accomplishing that. Below is a scatterplot that compares the actual and expected probability of a shot registering as a goal.

This chart compares the xG values from my model with the actual proportion of goals at each xG level. The size of each point represents the number of shots.

What this chart shows is some bias in the model that I built. Overall, given the relatively small amount of data I used compared to other xG models and the limitations associated with youth soccer, I believe this model performs fairly well. It does underestimate the actual probability of a goal for xG values between 0.4 and 0.6 and overestimates it at an xG value of 0.8. As you can see from the size of the points, these are areas with few shots, so I could look to add more data points in the future to improve performance.
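
The comparison behind a chart like this can be sketched in a few lines of Python: group shots into xG bins, then compare the average predicted xG in each bin with the share of shots that were actually scored. The ten equal-width bins and the function name calibration_table are assumptions for illustration, and the input at the bottom is toy data, not shots from the Bolts.

```python
def calibration_table(xg_values, goals, n_bins=10):
    """Compare predicted xG with the actual goal rate per xG bin.

    xg_values: predicted probabilities between 0 and 1
    goals:     1 if the shot was scored, 0 otherwise
    """
    bins = {}
    for p, goal in zip(xg_values, goals):
        # An xG of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        total_xg, total_goals, shots = bins.get(idx, (0.0, 0, 0))
        bins[idx] = (total_xg + p, total_goals + goal, shots + 1)

    rows = []
    for idx in sorted(bins):
        total_xg, total_goals, shots = bins[idx]
        rows.append({
            "mean_xg": total_xg / shots,       # x-axis: expected
            "goal_rate": total_goals / shots,  # y-axis: actual
            "shots": shots,                    # point size
        })
    return rows

# Toy input only, to show the shape of the output.
example = calibration_table([0.05, 0.12, 0.18, 0.85], [0, 0, 1, 1])
```

A well-calibrated model has goal_rate close to mean_xg in every bin. Bins where shots is small are the ones to read with caution.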

This plot looks at xG by the shot type variable that I added. Once again, the size of each dot represents the number of shots for that shot type in the xG bin.

Headers and Free Kicks

Now, separating by shot type, some interesting trends emerge. Headers and free kicks with high xG values are scored far more often than the model predicts. This could be down to the variance in youth soccer, where walls aren’t as solid and goalkeepers don’t have as commanding a physical presence. There also aren’t many high-value free kicks or headers in the data, but this would be an interesting case study to look into to improve performance.
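
Because these bins hold so few shots, it helps to put a range around the observed goal rate before reading too much into it. One common option, which is not something this analysis used, is the Wilson score interval. The sketch below assumes a 95% interval (z = 1.96), and the numbers in the usage line are toy values.

```python
import math

def wilson_interval(goals, shots, z=1.96):
    """Wilson score interval for a goal rate of goals / shots."""
    if shots <= 0:
        raise ValueError("need at least one shot")
    p = goals / shots
    denom = 1 + z**2 / shots
    center = (p + z**2 / (2 * shots)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / shots + z**2 / (4 * shots**2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Toy numbers: 4 goals from 5 shots in one bin.
low, high = wilson_interval(4, 5)
```

With only a handful of shots the interval is very wide, so a bin that looks like a big miss for the model may still be consistent with it. The same goal rate over ten times as many shots gives a much narrower range.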

First Time Shots

First time shots are scored more often than the model predicts up to 0.4 xG, where the model is pretty good. Then there is a huge jump at 0.5 xG, where the model is well off the actual goal rate. I think this comes down to the passage of play surrounding the shot. Shots with a high first time xG value tend to come from cutbacks or low crosses across the six-yard box, where the goalkeeper has limited time to react and it is much easier to get the shot on target because you are closer to the goal. Shots around or just inside the 18-yard box, by contrast, could come from second balls or be dictated by the opponent’s defensive structure closing down a pass; overall, teams and goalkeepers have much more time to react here, and it is harder to direct a shot on goal. With first time shots, it seems to me that the distance and angle of the shot matter more than they do for standard shots.
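
One way to test that hunch in a future version would be to let distance and angle have their own effect for first time shots by adding interaction features. This is a hypothetical sketch of the feature step only, not part of the current model. The function name and feature names are made up for illustration.

```python
def with_first_time_interactions(distance, angle, shot_type):
    """Build model features with first time x distance/angle interactions.

    For any other shot type the interaction columns are zero, so a
    fitted model would learn a separate distance and angle slope
    that applies only to first time shots.
    """
    is_first_time = 1.0 if shot_type == "First Time" else 0.0
    return {
        "distance": distance,
        "angle": angle,
        "first_time": is_first_time,
        "first_time_x_distance": is_first_time * distance,
        "first_time_x_angle": is_first_time * angle,
    }
```

If the fitted coefficients on the two interaction columns turned out to be clearly different from zero, that would support the idea that location matters more for first time shots than for standard ones.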

Conclusion

Thank you all for reading this. I had a fun time getting back to writing and would like to get an article out every one or two weeks, as I have a lot more information I’d like to share. I’m a senior looking for positions in sports analytics, so if you’re interested in what I do, my GitHub is linked with a bunch of projects, models, and visualizations I’ve created. Additionally, feel free to shoot me a message on LinkedIn or Twitter. See you next time!
