About the project
Introduction
As soccer is a popular and widely watched sport around the world, with millions of fans paying close attention to major tournaments such as the World Cup, UEFA Champions League, and English Premier League, etc. Thus, fans, bettors, and sports commentators are interested in predicting these championships’ outcomes. What’s more important is that it would be valuable and worth knowing how they could improve their overall performance by analyzing strengths, weaknesses, home-away ground effect, and the physical and psychological conditions of the players.
The main goal of this report is to predict the results (win, lose, or draw) of European soccer games from seasons 2008 to 2016 based on both team’s and players’ attributes(weight, height, handedness, etc.), defensive and attacking strategies, and possible home-away ground effect, trying out the machine learning models. Data preprocessing (dimension reduction with PCA) and EDA (exploratory data analysis) will be done on the very first, followed by fitting different models together with fine-tuning parameters. Results, challenges, further steps, and final conclusion will be covered.
About the dataset
The dataset is downloaded from Kaggle’s European Soccer Database (https://www.kaggle.com/datasets/hugomathien/soccer). The creator Hugo Mathien aggregates match information from the official website of EA Sports FIFA games and football data in the UK. 103,916 teams, 51,958 matches, 11,075 players, and 183,978 players’ attributes over 8 years (between seasons 2008 and 2016) are recorded, including top-tier soccer leagues in 11 European countries (Belgium, England, France, Germany, Italy, the Netherlands, Poland, Portugal, Scotland, Spain, and Switzerland). 3 major tables (each with several subtables) are merged for future analysis:
- Player-related data (Player table, Player_Attributes table)
- Team-related data (Country table, League table, Team table, Team_Attribute table)
- Match-related data (Match table)