Building a Database for a Soccer Academy

In this article, I will walk you through the steps I took to build a database for the Boston Bolts, an MLS NEXT academy with over 200 athletes under our analytics umbrella.

Why Create a Database?

Before I go into the details of what our database looks like, I need to first cover why we need a database. This can be broken down into three areas:

1) GitHub does not have the capacity to store all of our data. Towards the end of each month, due to the restrictions set by GitHub and our increasing amount of data (especially with our growing use of tracking data), storing our data in folders on GitHub is not a scalable solution.

2) Industry standard. Many clubs and companies within the industry already have and use databases, and knowing how to create, populate, and query tables using SQL is a crucial skill in the industry, so I figured, why not learn it!

3) Ease of access for coaches and new analysts to the cleanest and most up-to-date data. Two new analysts have joined our team, and we have a couple of coaches who I could see getting into analytics with Python and SQL. Making a database gives them access to clean, organized tables to work from.

Evaluating My Options

Working in an academy environment, there are not a lot of resources, even with the Bolts having one of the biggest analytics budgets in all of MLS NEXT. This meant I needed to find a product that was going to be free or close to free and easy to use (I am by no means an expert in data engineering). The search boiled down to three options. I ultimately chose Amazon RDS, mostly because most of my applications didn't use or require Google Cloud or Microsoft services.

1) Amazon RDS is a strong industry standard (it shows up in most job postings) and has a free tier, although costs can rise quickly after the free tier, so you have to be careful.

2) Google Cloud also has a free tier, though with more limited usage; it integrates well with other Google Cloud services.

3) Microsoft Azure, like the others, has a free tier, and like the others, costs can grow quickly. However, it lends itself well to a Microsoft ecosystem.

Drafting Tables

To draft the tables that would be useful for the database, I went back to our end-user applications to see which files and folders we were accessing the most. Once I did this, I looked through our data providers' products and discovered that a lot of the tables were essentially already made by the way the exports are set up. This drastically reduced the time that went into this step, as I now just needed to convert the tables I was essentially already using in Python into SQL and upload them to the database with a consistent and robust workflow.

Creating and Adding Data to Tables

To keep this article brief, I will walk you through the steps I used to create just one of the tables. Let's do this for the players_started table, which comes from our technical data provider, PlayerStatData (see below). The need for this table arose from our use of the team formation on the Post-Match Review App. To create this table, I needed to connect to our database using the MySQL connector in Python, run the code on a sample table to create the outline of the table (columns and data types), and begin adding data at a later step.
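Here is a minimal sketch of that step; the credentials are placeholders, and the column set is simplified (the real lineup export has more fields):

```python
import mysql.connector
import pandas as pd

# Sample lineup export from PlayerStatData (file name is a placeholder)
df = pd.read_csv("sample_lineup.csv")

# Connect to the RDS instance (credentials are placeholders)
conn = mysql.connector.connect(
    host="your-rds-endpoint.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="bolts",
)
cursor = conn.cursor()

# Create the table outline: every column is a string except Starts
cursor.execute("""
    CREATE TABLE IF NOT EXISTS players_started (
        Player VARCHAR(100),
        Team VARCHAR(100),
        Opposition VARCHAR(100),
        Match_Date VARCHAR(50),
        Starts INT
    )
""")

# Add the sample data frame's rows as the general format for the table
insert_query = """
    INSERT INTO players_started (Player, Team, Opposition, Match_Date, Starts)
    VALUES (%s, %s, %s, %s, %s)
"""
cursor.executemany(insert_query, df.values.tolist())
conn.commit()
cursor.close()
conn.close()
```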

Here, df is the sample data frame of the lineup table, and we add the values of the test data frame as the general format for the table. Aside from the Starts variable, the rest of the values are strings. NOTE: There is some cleaning and pre-processing that I do for each of the tables, which depends on the table itself, but I won't go into that in this article.

Now, to replicate this process for all of the games this season, I use a slightly different Python script. Since there is no direct API access to PlayerStatData, I still have to use my local environment to extract the files and upload them to the database, where our other analysts and coaches can access them.
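Below is a sketch of that script, again with placeholder paths, table columns, and credentials:

```python
import os
import mysql.connector
import pandas as pd

FOLDER = "PlayerStatData_exports"      # local folder of exported CSVs (placeholder)
PROCESSED_LOG = "processed_files.txt"  # record of files already in the database
reset = False  # set to True to reload every file after a structural change

# Load the names of files we've already processed (unless resetting)
processed_files = set()
if os.path.exists(PROCESSED_LOG) and not reset:
    with open(PROCESSED_LOG) as f:
        processed_files = set(f.read().splitlines())

conn = mysql.connector.connect(
    host="your-rds-endpoint.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="bolts",
)
cursor = conn.cursor()

insert_query = """
    INSERT INTO players_started (Player, Team, Opposition, Match_Date, Starts)
    VALUES (%s, %s, %s, %s, %s)
"""

for filename in os.listdir(FOLDER):
    # Skip files that are already in the database to avoid duplicates
    if not filename.endswith(".csv") or filename in processed_files:
        continue

    df = pd.read_csv(os.path.join(FOLDER, filename))
    # (table-specific cleaning and pre-processing would happen here)

    cursor.executemany(insert_query, df.values.tolist())
    conn.commit()

    # Log the file so future runs skip it
    with open(PROCESSED_LOG, "a") as f:
        f.write(filename + "\n")

cursor.close()
conn.close()
```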

Here, I am using my local environment to access the files from PlayerStatData. To save run time, I keep a text file of all the CSV files I have previously processed into the database, so I avoid running those through the for loop again, which saves run time and storage. If I need to change the structure of a table (e.g., adding a new column) and thus need to update the database with different data, I can set the reset flag to True, and the script will add all the files from that folder back into the database.

The for loop goes through each file in the folder and checks whether it is in processed_files. If it's not, we add it to the database; if it is, we skip it to avoid duplicated rows.

Implementation

For the database implementation, our end-user products are Streamlit apps (Post-Match Review and IDPs). Each week, I run Python scripts on my local machine to retrieve data and upload it to the database. These database tables are then accessed by the Streamlit apps, which are hosted from GitHub. One of our analysts (special shoutout to Sean Guillen) is optimizing this process with an EC2 instance that hosts our Python script: a cron job runs it automatically every Monday morning, pulling data from the PlayerData API (our physical data provider) and uploading it to the database. We're still refining this workflow ahead of the spring season, so if you have any suggestions or insights, we'd love to hear them!
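For illustration, the crontab entry on the EC2 instance might look something like this (the 6:00 a.m. Monday schedule and the script path are hypothetical):

```
0 6 * * 1 /usr/bin/python3 /home/ec2-user/pull_playerdata.py
```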

Visual Schema

This is the overall breakdown of the data providers and in-house analytics for the Bolts, with the tables, column names, and data types. It is a holistic view of the database's tables without the interactions among them (admittedly, there aren't many interactions occurring at the moment).

Next Steps

You may have noticed that there aren't many relationships between the tables here, and that is by design. To start out, we wanted to keep the database relatively small, with few relationships; however, as we continue to expand and grow, we may need to divide the tables up or make them more interactive with each other. An example involves the xg report and actions report tables: to assign a player the xG value for a particular shot they took at a point in the match, we need to join the table that holds the shot location (xg report) with the one that holds the player who took it (actions report), so the proper xG is assigned to the correct player. An example of this relationship is shown below:
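This is a sketch of the kind of query involved; the table and column names (xg_report, actions_report, Match_ID, Event_Time) are placeholders, and the real join keys may differ:

```python
import mysql.connector

conn = mysql.connector.connect(
    host="your-rds-endpoint.rds.amazonaws.com",  # placeholder credentials
    user="admin",
    password="your-password",
    database="bolts",
)
cursor = conn.cursor()

# Hypothetical join: pair each shot location in xg_report with the
# player who took it in actions_report, keyed on match and event time,
# so the right xG value lands on the right player
cursor.execute("""
    SELECT a.Player, x.Shot_X, x.Shot_Y, x.xG
    FROM xg_report AS x
    JOIN actions_report AS a
      ON x.Match_ID = a.Match_ID
     AND x.Event_Time = a.Event_Time
""")
shots_with_players = cursor.fetchall()
```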

Conclusion

Thank you for reading! Although we are only at the beginning of creating a database, I really enjoyed outlining this process, as it surfaced some potential issues and taught me the basics of data engineering. If you have any advice or helpful tips, please let me know! If you are interested in my work, feel free to connect with me on LinkedIn or Twitter—I'd love to chat. See you next time!
