2 - 4) EDA - Match in-game shot information

81 minute read

import numpy as np 
import pandas as pd 

import sqlite3 as sql 

import matplotlib.pyplot as plt 
import matplotlib.patches as mpatches
import seaborn as sns 
import missingno as msno 

from utils import my_histogram
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
con = sql.connect("../data/database.sqlite")
df_match_ingame_shot = pd.read_csv("../data/df_match_ingame_shot.csv")
df_match_basic = pd.read_csv("../data/df_match_basic.csv")
  • Let’s analyze the goal and shot information from df_match_ingame_shot and df_match_basic table.
df_match_ingame_shot
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id
0 489042 378998 22 10261.0 goal n header 37799.0 38807.0
1 489042 379019 24 10260.0 goal n shot 24148.0 24154.0
2 489043 375546 4 9825.0 goal n shot 26181.0 39297.0
3 489044 378041 83 8650.0 goal n distance 30853.0 30889.0
4 489045 376060 4 8654.0 goal n shot 23139.0 36394.0
... ... ... ... ... ... ... ... ... ...
229033 2030171 4940379 19 8370.0 shot shotoff NaN 36130.0 NaN
229034 2030171 4940624 44 8370.0 shot shotoff NaN 34104.0 NaN
229035 2030171 4940738 49 8558.0 shot shotoff NaN 107930.0 NaN
229036 2030171 4940963 71 8370.0 shot shotoff NaN 210065.0 NaN
229037 2030171 4941072 83 8558.0 shot shotoff NaN 629579.0 NaN

229038 rows × 9 columns

df_match_basic
match_api_id country_id league_id season stage match_date home_team_api_id away_team_api_id home_team_goal away_team_goal match_result
0 492473 1 1 2008/2009 1 2008-08-17 9987 9993 1 1 draw
1 492474 1 1 2008/2009 1 2008-08-16 10000 9994 0 0 draw
2 492475 1 1 2008/2009 1 2008-08-16 9984 8635 0 3 away_win
3 492476 1 1 2008/2009 1 2008-08-17 9991 9998 5 0 home_win
4 492477 1 1 2008/2009 1 2008-08-16 7947 9985 1 3 away_win
... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 24558 24558 2015/2016 9 2015-09-22 10190 10191 1 0 home_win
25975 1992092 24558 24558 2015/2016 9 2015-09-23 9824 10199 1 2 away_win
25976 1992093 24558 24558 2015/2016 9 2015-09-23 9956 10179 2 0 home_win
25977 1992094 24558 24558 2015/2016 9 2015-09-22 7896 10243 0 0 draw
25978 1992095 24558 24558 2015/2016 9 2015-09-23 10192 9931 4 3 home_win

25979 rows × 11 columns

  • Combine the home team id, away team id, match date, match result, and season information to the df_match_ingame_shot table.
home_info = df_match_basic[["match_api_id", "home_team_api_id", "match_date", "match_result", "season"]].rename(columns = {"home_team_api_id": "team_api_id"})
home_info["home_away"] = "home"
away_info = df_match_basic[["match_api_id", "away_team_api_id", "match_date", "match_result", "season"]].rename(columns = {"away_team_api_id": "team_api_id"})
away_info["home_away"] = "away"
home_away_info = pd.concat([home_info, away_info], axis = 0)
df_match_ingame_shot = df_match_ingame_shot.merge(home_away_info, how = "left", on = ["match_api_id", "team_api_id"])
df_match_ingame_shot
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id match_date match_result season home_away
0 489042 378998 22 10261.0 goal n header 37799.0 38807.0 2008-08-17 draw 2008/2009 away
1 489042 379019 24 10260.0 goal n shot 24148.0 24154.0 2008-08-17 draw 2008/2009 home
2 489043 375546 4 9825.0 goal n shot 26181.0 39297.0 2008-08-16 home_win 2008/2009 home
3 489044 378041 83 8650.0 goal n distance 30853.0 30889.0 2008-08-16 away_win 2008/2009 away
4 489045 376060 4 8654.0 goal n shot 23139.0 36394.0 2008-08-16 home_win 2008/2009 home
... ... ... ... ... ... ... ... ... ... ... ... ... ...
229033 2030171 4940379 19 8370.0 shot shotoff NaN 36130.0 NaN 2015-10-23 home_win 2015/2016 home
229034 2030171 4940624 44 8370.0 shot shotoff NaN 34104.0 NaN 2015-10-23 home_win 2015/2016 home
229035 2030171 4940738 49 8558.0 shot shotoff NaN 107930.0 NaN 2015-10-23 home_win 2015/2016 away
229036 2030171 4940963 71 8370.0 shot shotoff NaN 210065.0 NaN 2015-10-23 home_win 2015/2016 home
229037 2030171 4941072 83 8558.0 shot shotoff NaN 629579.0 NaN 2015-10-23 home_win 2015/2016 away

229038 rows × 13 columns

  • Combine the league name to the df_match_ingame_shot table.
org_league = pd.read_sql(
    "select * from League", con
    )
df_match_ingame_shot = df_match_ingame_shot.merge(df_match_basic[["match_api_id", "league_id"]], how = "left", on = "match_api_id")
df_match_ingame_shot = df_match_ingame_shot.merge(org_league.rename(columns = {"id" : "league_id", "name": "league_name"}), 
                                                            how = "left", on = "league_id")
df_match_ingame_shot.drop(["league_id", "country_id"], axis = 1, inplace = True)
df_match_ingame_shot
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id match_date match_result season home_away league_name
0 489042 378998 22 10261.0 goal n header 37799.0 38807.0 2008-08-17 draw 2008/2009 away England Premier League
1 489042 379019 24 10260.0 goal n shot 24148.0 24154.0 2008-08-17 draw 2008/2009 home England Premier League
2 489043 375546 4 9825.0 goal n shot 26181.0 39297.0 2008-08-16 home_win 2008/2009 home England Premier League
3 489044 378041 83 8650.0 goal n distance 30853.0 30889.0 2008-08-16 away_win 2008/2009 away England Premier League
4 489045 376060 4 8654.0 goal n shot 23139.0 36394.0 2008-08-16 home_win 2008/2009 home England Premier League
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
229033 2030171 4940379 19 8370.0 shot shotoff NaN 36130.0 NaN 2015-10-23 home_win 2015/2016 home Spain LIGA BBVA
229034 2030171 4940624 44 8370.0 shot shotoff NaN 34104.0 NaN 2015-10-23 home_win 2015/2016 home Spain LIGA BBVA
229035 2030171 4940738 49 8558.0 shot shotoff NaN 107930.0 NaN 2015-10-23 home_win 2015/2016 away Spain LIGA BBVA
229036 2030171 4940963 71 8370.0 shot shotoff NaN 210065.0 NaN 2015-10-23 home_win 2015/2016 home Spain LIGA BBVA
229037 2030171 4941072 83 8558.0 shot shotoff NaN 629579.0 NaN 2015-10-23 home_win 2015/2016 away Spain LIGA BBVA

229038 rows × 14 columns

  • Combine the league name to the df_match_basic table.
df_match_basic = df_match_basic.merge(org_league.drop("country_id", axis = 1).rename(columns = {"id" : "league_id", "name": "league_name"}), 
                                                      how = "left", on = "league_id")
df_match_basic.drop(["league_id", "country_id"], axis = 1, inplace = True)
df_match_basic
match_api_id season stage match_date home_team_api_id away_team_api_id home_team_goal away_team_goal match_result league_name
0 492473 2008/2009 1 2008-08-17 9987 9993 1 1 draw Belgium Jupiler League
1 492474 2008/2009 1 2008-08-16 10000 9994 0 0 draw Belgium Jupiler League
2 492475 2008/2009 1 2008-08-16 9984 8635 0 3 away_win Belgium Jupiler League
3 492476 2008/2009 1 2008-08-17 9991 9998 5 0 home_win Belgium Jupiler League
4 492477 2008/2009 1 2008-08-16 7947 9985 1 3 away_win Belgium Jupiler League
... ... ... ... ... ... ... ... ... ... ...
25974 1992091 2015/2016 9 2015-09-22 10190 10191 1 0 home_win Switzerland Super League
25975 1992092 2015/2016 9 2015-09-23 9824 10199 1 2 away_win Switzerland Super League
25976 1992093 2015/2016 9 2015-09-23 9956 10179 2 0 home_win Switzerland Super League
25977 1992094 2015/2016 9 2015-09-22 7896 10243 0 0 draw Switzerland Super League
25978 1992095 2015/2016 9 2015-09-23 10192 9931 4 3 home_win Switzerland Super League

25979 rows × 10 columns

1. Goal

df_match_ingame_shot_goal = df_match_ingame_shot[df_match_ingame_shot.category == "goal"]

1.1. Goal trend

Q. Is there any goal trend change over the seasons?

goals_trend = df_match_basic.groupby(["league_name", "season"]).mean()[["home_team_goal", "away_team_goal"]].reset_index()
fig, axes = plt.subplots(4, 3, figsize = (20, 15))

for i, league in enumerate(goals_trend.league_name.unique()):
    home = goals_trend.loc[goals_trend.league_name == league, ["league_name", "season", "home_team_goal"]].rename(columns = {"home_team_goal" : "goals"})
    home["home_away"] = "home"

    away = goals_trend.loc[goals_trend.league_name == league, ["league_name", "season", "away_team_goal"]].rename(columns = {"away_team_goal" : "goals"})
    away["home_away"] = "away"
    
    home_m_away = home.copy()
    home_m_away["goals"] = home.goals - away.goals 
    home_m_away["home_away"] = "home goals - away goals"

    home_away = pd.concat([home, away, home_m_away], axis = 0)
    
    sns.lineplot(data = home_away, x = "season", y = "goals", hue = "home_away", marker = "o", ax = axes[i // 3, i % 3])

    axes[i // 3, i % 3].set_xlabel(" ", fontsize = 10)
    axes[i // 3, i % 3].set_ylabel(" ", fontsize = 10)
    axes[i // 3, i % 3].set_title(f"{league} goals trend", fontsize = 15)
    axes[i // 3, i % 3].tick_params(axis='x', labelrotation = 90)

plt.tight_layout()

png

  • Home team scores more goals than away team in all leagues.
  • The gap between the average number of home goals and the average number of away goals has remained similar across all seasons from the each league.

$\quad \rightarrow$ Since season has no effect in the goal trend, let’s check the goal distribution without considering the season.

1.2. Goal distribution

  • The number of goals has the most direct effect on winning or losing. If a team score more goals and get less, the team can win the game.
  • If we can identify the characteristics of a team that scores a lot of goals and a team that scores few goals, we will have an advantage in predicting the match result.

Q. How many more goals did the winning team score than the losing team?

plt.figure(figsize = (15, 6))
sns.countplot(x = abs(df_match_basic[df_match_basic.match_result != "draw"].home_team_goal - 
                      df_match_basic[df_match_basic.match_result != "draw"].away_team_goal))
plt.title("Difference of number of goals between winning team and losing tema", fontsize = 20)
plt.xlabel("Goal difference", fontsize = 13)
plt.ylabel("Count", fontsize = 13)
Text(0, 0.5, 'Count')

png

  • Almost half of the matches are won by a difference of only one goal.

Q. The difference of number of goals between home and away team.

plt.figure(figsize = (15, 6))
sns.countplot(x = df_match_basic.home_team_goal - df_match_basic.away_team_goal)
plt.title("Difference of number of goals between home team and away team", fontsize = 20)
plt.xlabel("Goal difference", fontsize = 13)
plt.ylabel("Count", fontsize = 13)
Text(0, 0.5, 'Count')

png

  • The side larger than 0 in the distribution is slightly thicker than the smaller side.
  • That is, home team usually scores more goals than the away team.

Q. What is the difference between home and away team win rates?

df_match_basic.match_result.value_counts()
home_win    11917
away_win     7466
draw         6596
Name: match_result, dtype: int64
sns.countplot(x = df_match_basic.match_result,
              order = ["home_win", "draw", "away_win"])
plt.xlabel("Match result", fontsize = 15)
plt.ylabel("Count", fontsize = 15)
Text(0, 0.5, 'Count')

png

  • Out of a total of 25979 matches
    • About 45% (11917 / 25979) were won by the home team
    • About 25% (6596 / 25979) were draw
    • About 29% (7466 / 25979) were won by the away team

$\color{magenta} \quad \rightarrow$ This percentage can be used as a baseline accuracy later.

Q. Is there a difference in the distribution of home and away team goals by league?

df_match_basic.groupby("league_name").home_team_goal.describe()
count mean std min 25% 50% 75% max
league_name
Belgium Jupiler League 1728.0 1.609375 1.293458 0.0 1.0 1.0 2.0 7.0
England Premier League 3040.0 1.550987 1.311615 0.0 1.0 1.0 2.0 9.0
France Ligue 1 3040.0 1.402961 1.170743 0.0 1.0 1.0 2.0 6.0
Germany 1. Bundesliga 2448.0 1.626634 1.339529 0.0 1.0 1.0 2.0 9.0
Italy Serie A 3017.0 1.500829 1.221797 0.0 1.0 1.0 2.0 7.0
Netherlands Eredivisie 2448.0 1.779820 1.405274 0.0 1.0 2.0 3.0 10.0
Poland Ekstraklasa 1920.0 1.394792 1.183249 0.0 0.0 1.0 2.0 6.0
Portugal Liga ZON Sagres 2052.0 1.408382 1.226192 0.0 0.0 1.0 2.0 8.0
Scotland Premier League 1824.0 1.429276 1.294928 0.0 0.0 1.0 2.0 9.0
Spain LIGA BBVA 3040.0 1.631250 1.388339 0.0 1.0 1.0 2.0 10.0
Switzerland Super League 1422.0 1.663150 1.363970 0.0 1.0 1.0 3.0 7.0
fig, axes = plt.subplots(1, 2, figsize = (20, 7), sharey = True)

sns.boxplot(data = df_match_basic, x = "home_team_goal", y = "league_name", showfliers = False, ax = axes[0])
axes[0].set_xlabel("goals", fontsize = 15)
axes[0].set_ylabel("League", fontsize = 15)
axes[0].set_title("Home team goals distribution", fontsize = 20)

sns.barplot(df_match_basic.groupby("league_name").home_team_goal.describe().reset_index(), x = "mean", y = "league_name")
axes[1].set_xlabel("goals", fontsize = 15)
axes[1].set_ylabel("League", fontsize = 15)
axes[1].set_title("Home team average goals per game", fontsize = 20)
Text(0.5, 1.0, 'Home team average goals per game')

png

  • Looking at the distribution of the number of home team goals by league over 8 seasons, it can be seen that the Dutch and Swiss leagues have more home team goals than other leagues.
  • Average Home team goals per game are around 1.5 across all leagues.
  • In the case of the France, Poland, Portugal, and Scotland, home team average goals per game are slightly lower than other leagues.
  • In the case of the Netherland, home team average goals per game are slightly higher than other leagues.
df_match_basic.groupby("league_name").away_team_goal.describe()
count mean std min 25% 50% 75% max
league_name
Belgium Jupiler League 1728.0 1.192130 1.125123 0.0 0.0 1.0 2.0 7.0
England Premier League 3040.0 1.159539 1.144629 0.0 0.0 1.0 2.0 6.0
France Ligue 1 3040.0 1.040132 1.059765 0.0 0.0 1.0 2.0 9.0
Germany 1. Bundesliga 2448.0 1.274918 1.200392 0.0 0.0 1.0 2.0 8.0
Italy Serie A 3017.0 1.116009 1.078392 0.0 0.0 1.0 2.0 7.0
Netherlands Eredivisie 2448.0 1.301062 1.240430 0.0 0.0 1.0 2.0 6.0
Poland Ekstraklasa 1920.0 1.030208 1.046159 0.0 0.0 1.0 2.0 6.0
Portugal Liga ZON Sagres 2052.0 1.126218 1.155469 0.0 0.0 1.0 2.0 6.0
Scotland Premier League 1824.0 1.204496 1.151384 0.0 0.0 1.0 2.0 6.0
Spain LIGA BBVA 3040.0 1.135855 1.161079 0.0 0.0 1.0 2.0 8.0
Switzerland Super League 1422.0 1.266526 1.176233 0.0 0.0 1.0 2.0 7.0
fig, axes = plt.subplots(1, 2, figsize = (20, 7), sharey = True)

sns.boxplot(data = df_match_basic, x = "away_team_goal", y = "league_name", showfliers = False, ax = axes[0])
axes[0].set_xlabel("goals", fontsize = 15)
axes[0].set_ylabel("League", fontsize = 15)
axes[0].set_title("Away team goals distribution", fontsize = 20)

sns.barplot(df_match_basic.groupby("league_name").away_team_goal.describe().reset_index(), x = "mean", y = "league_name")
axes[1].set_xlabel("goals", fontsize = 15)
axes[1].set_ylabel("League", fontsize = 15)
axes[1].set_title("Away team average goals per game", fontsize = 20)

plt.tight_layout()

png

  • The away team’s goals distribution over the eight seasons shows a nearly identical distribution across all leagues compared to the home team’s goals distributions.
  • Away teams score around 1.1 goals per game on average across all leagues.
  • In the case of the France and the Poland, away team average goals per game are slightly lower than other leagues.
  • In the case of the Germany, Netherlands, and Switzeland, away team average goals per game are slightly higher than other leagues.

  • Average home and away team goals per game vary by league.

$\color{magenta} \quad \rightarrow$ There is a difference in the average goals of home and away teams in each league, so it is a good idea to try fitting our model differently for each league later.

Q. What is the difference between home and away team win rates between leagues ?

df_match_result_pivot = df_match_basic.pivot_table(index = "league_name", columns = "match_result", values = "stage", aggfunc = "count")
df_match_result_pivot["sum"] = df_match_result_pivot.sum(axis = 1)

match_result_prop = df_match_result_pivot[["home_win", "away_win", "draw"]].divide(df_match_result_pivot["sum"], axis = 0).multiply(100)

ax = match_result_prop.plot.barh(stacked = True, figsize = (10, 8), width = 0.75, edgecolor = 'w')
ax.legend(['Home Team','Away Team','Draw'], bbox_to_anchor = (1.2, 1), loc = 'upper right')
plt.title('Probability of match result by league', fontsize = 20)
plt.xlabel('Proportion',fontsize = 15)
plt.ylabel('Leagues', fontsize = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

for  i, j in enumerate(match_result_prop.index):
    plt.text(match_result_prop.loc[j, 'home_win']/2, i, str(round(match_result_prop.loc[j, 'home_win'], 2)) + ' %', fontsize = 12)
    plt.text(match_result_prop.loc[j, 'home_win'] + match_result_prop.loc[j,'away_win']/3, i, str(round(match_result_prop.loc[j,'away_win'], 2)) + ' %',fontsize = 12)
    plt.text(match_result_prop.loc[j, 'home_win'] + match_result_prop.loc[j,'away_win']+ match_result_prop.loc[j, 'draw']/2, i, str(round(match_result_prop.loc[j, 'draw'], 2)) + '%', fontsize = 12)

png

  • The overall tendency for the home team to have the highest win rate and the lowest draw rate is similar across leagues.
  • However, there are some differences in the detailed ratio.

$\color{magenta} \quad \rightarrow$ This league-specific ratios can later be used as the baseline accuracy for each league when fitting the model for each league.

Q. How strong is the relationship between home and away average goals and wins ?

  • Usually, the home team scores more goals and the away team scores fewer goals, so the home team has a higher win rate.
  • It is important for us to predict in which matches the home team will lose, the away team will win, and neither team will win.

  • So let’s try to figure out the relationship between how many goals are scored at home & away and the winning percentage.
df_team = pd.read_csv("../data/df_team.csv")
distinct_team = df_team[["team_api_id", "team_long_name"]].drop_duplicates()
goal_match_result_info = distinct_team.merge(df_match_basic.groupby("home_team_api_id").mean().home_team_goal.reset_index().rename(columns = {"home_team_api_id" : "team_api_id",
                                                                                                 "home_team_goal" : "home_avg_goal"}),
                                             how = "left", on = "team_api_id")
goal_match_result_info = goal_match_result_info.merge(df_match_basic.groupby("home_team_api_id").mean().away_team_goal.reset_index().rename(columns = {"home_team_api_id" : "team_api_id",
                                                                                                 "away_team_goal" : "home_opponent_avg_goal"}),
                                                      how = "left", on = "team_api_id")
goal_match_result_info = goal_match_result_info.merge(df_match_basic.groupby("away_team_api_id").mean().away_team_goal.reset_index().rename(columns = {"away_team_api_id" : "team_api_id",
                                                                                                 "away_team_goal" : "away_avg_goal"}),
                                                      how = "left", on = "team_api_id")
goal_match_result_info = goal_match_result_info.merge(df_match_basic.groupby("away_team_api_id").mean().home_team_goal.reset_index().rename(columns = {"away_team_api_id" : "team_api_id",
                                                                                                 "home_team_goal" : "away_opponent_avg_goal"}),
                                                      how = "left", on = "team_api_id")
result_from_home = df_match_basic.groupby(["home_team_api_id", "match_result"]).count().match_api_id.reset_index() \
                                 .pivot_table(index = "home_team_api_id", columns = "match_result", values = "match_api_id").reset_index()
result_from_home = result_from_home.rename(columns = {"home_team_api_id" : "team_api_id", "home_win": "win_at_home", 
                                                      "draw": "draw_at_home", "away_win": "lose_at_home"})
result_from_home["num_home_matches"] = result_from_home.win_at_home + result_from_home.draw_at_home + result_from_home.lose_at_home
result_from_home["win_at_home"] = result_from_home.win_at_home / result_from_home.num_home_matches  
result_from_home["draw_at_home"] = result_from_home.draw_at_home / result_from_home.num_home_matches  
result_from_home["lose_at_home"] = result_from_home.lose_at_home / result_from_home.num_home_matches  

result_from_home.drop("num_home_matches", axis = 1, inplace = True)
result_from_away = df_match_basic.groupby(["away_team_api_id", "match_result"]).count().match_api_id.reset_index() \
                                 .pivot_table(index = "away_team_api_id", columns = "match_result", values = "match_api_id").reset_index()
result_from_away = df_match_basic.groupby(["away_team_api_id", "match_result"]).count().match_api_id.reset_index() \
                                 .pivot_table(index = "away_team_api_id", columns = "match_result", values = "match_api_id").reset_index()
result_from_away = result_from_away.rename(columns = {"away_team_api_id" : "team_api_id", "home_win": "lose_at_away", 
                                                      "draw": "draw_at_away", "away_win": "win_at_away"})
result_from_away["num_away_matches"] = result_from_away.win_at_away + result_from_away.draw_at_away + result_from_away.lose_at_away
result_from_away["win_at_away"] = result_from_away.win_at_away / result_from_away.num_away_matches  
result_from_away["draw_at_away"] = result_from_away.draw_at_away / result_from_away.num_away_matches  
result_from_away["lose_at_away"] = result_from_away.lose_at_away / result_from_away.num_away_matches  
result_from_away.drop("num_away_matches", axis = 1, inplace = True)
result_from_away
match_result team_api_id win_at_away draw_at_away lose_at_away
0 1601 0.316667 0.241667 0.441667
1 1773 0.133333 0.333333 0.533333
2 1957 0.200000 0.308333 0.491667
3 2033 0.173333 0.373333 0.453333
4 2182 0.416667 0.275000 0.308333
... ... ... ... ...
294 158085 0.204082 0.367347 0.428571
295 177361 0.200000 0.333333 0.466667
296 188163 0.294118 0.117647 0.588235
297 208931 0.157895 0.315789 0.526316
298 274581 0.166667 0.233333 0.600000

299 rows × 4 columns

goal_match_result_info = goal_match_result_info.merge(result_from_home[["team_api_id", "win_at_home", "draw_at_home", "lose_at_home"]], how = "left", on = "team_api_id") \
                                               .merge(result_from_away[["team_api_id", "win_at_away", "draw_at_away", "lose_at_away"]], how = "left", on = "team_api_id")
goal_match_result_info
team_api_id team_long_name home_avg_goal home_opponent_avg_goal away_avg_goal away_opponent_avg_goal win_at_home draw_at_home lose_at_home win_at_away draw_at_away lose_at_away
0 9930 FC Aarau 1.236111 1.652778 0.888889 2.152778 0.347222 0.166667 0.486111 0.138889 0.305556 0.555556
1 8485 Aberdeen 1.223684 0.967105 1.177632 1.381579 0.440789 0.250000 0.309211 0.348684 0.243421 0.407895
2 8576 AC Ajaccio 1.122807 1.350877 0.912281 1.877193 0.280702 0.333333 0.385965 0.105263 0.368421 0.526316
3 8564 Milan 1.794702 0.847682 1.480263 1.210526 0.609272 0.225166 0.165563 0.407895 0.296053 0.296053
4 10215 Académica de Coimbra 1.096774 1.258065 0.838710 1.532258 0.282258 0.387097 0.330645 0.169355 0.274194 0.556452
... ... ... ... ... ... ... ... ... ... ... ... ...
283 10192 BSC Young Boys 2.230769 1.160839 1.468531 1.447552 0.594406 0.237762 0.167832 0.398601 0.244755 0.356643
284 8021 Zagłębie Lubin 1.288889 1.200000 1.011111 1.377778 0.388889 0.300000 0.311111 0.266667 0.266667 0.466667
285 8394 Real Zaragoza 1.250000 1.328947 0.842105 1.828947 0.381579 0.223684 0.394737 0.184211 0.223684 0.592105
286 8027 Zawisza Bydgoszcz 1.433333 1.266667 1.066667 1.700000 0.433333 0.166667 0.400000 0.200000 0.300000 0.500000
287 10000 SV Zulte-Waregem 1.660377 1.273585 1.226415 1.367925 0.424528 0.301887 0.273585 0.311321 0.283019 0.405660

288 rows × 12 columns

target_team_api = goal_match_result_info.sort_values("win_at_home", ascending = False).iloc[:25].team_api_id
goal_match_result_info.loc[goal_match_result_info.team_api_id.isin(target_team_api), "home_win_class"] = "home_win_best_10%_teams"

target_team_api = goal_match_result_info.sort_values("win_at_home").iloc[:25].team_api_id
goal_match_result_info.loc[goal_match_result_info.team_api_id.isin(target_team_api), "home_win_class"] = "home_win_worst_10%_teams"

target_team_api = goal_match_result_info.sort_values("win_at_away", ascending = False).iloc[:25].team_api_id
goal_match_result_info.loc[goal_match_result_info.team_api_id.isin(target_team_api), "away_win_class"] = "away_win_best_10%_teams"

target_team_api = goal_match_result_info.sort_values("win_at_away").iloc[:25].team_api_id
goal_match_result_info.loc[goal_match_result_info.team_api_id.isin(target_team_api), "away_win_class"] = "away_win_worst_10%_teams"

goal_match_result_info
team_api_id team_long_name home_avg_goal home_opponent_avg_goal away_avg_goal away_opponent_avg_goal win_at_home draw_at_home lose_at_home win_at_away draw_at_away lose_at_away home_win_class away_win_class
0 9930 FC Aarau 1.236111 1.652778 0.888889 2.152778 0.347222 0.166667 0.486111 0.138889 0.305556 0.555556 NaN NaN
1 8485 Aberdeen 1.223684 0.967105 1.177632 1.381579 0.440789 0.250000 0.309211 0.348684 0.243421 0.407895 NaN NaN
2 8576 AC Ajaccio 1.122807 1.350877 0.912281 1.877193 0.280702 0.333333 0.385965 0.105263 0.368421 0.526316 NaN away_win_worst_10%_teams
3 8564 Milan 1.794702 0.847682 1.480263 1.210526 0.609272 0.225166 0.165563 0.407895 0.296053 0.296053 NaN NaN
4 10215 Académica de Coimbra 1.096774 1.258065 0.838710 1.532258 0.282258 0.387097 0.330645 0.169355 0.274194 0.556452 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
283 10192 BSC Young Boys 2.230769 1.160839 1.468531 1.447552 0.594406 0.237762 0.167832 0.398601 0.244755 0.356643 NaN NaN
284 8021 Zagłębie Lubin 1.288889 1.200000 1.011111 1.377778 0.388889 0.300000 0.311111 0.266667 0.266667 0.466667 NaN NaN
285 8394 Real Zaragoza 1.250000 1.328947 0.842105 1.828947 0.381579 0.223684 0.394737 0.184211 0.223684 0.592105 NaN NaN
286 8027 Zawisza Bydgoszcz 1.433333 1.266667 1.066667 1.700000 0.433333 0.166667 0.400000 0.200000 0.300000 0.500000 NaN NaN
287 10000 SV Zulte-Waregem 1.660377 1.273585 1.226415 1.367925 0.424528 0.301887 0.273585 0.311321 0.283019 0.405660 NaN NaN

288 rows × 14 columns

fig, axes = plt.subplots(2, 2, figsize = (12, 10))

sns.histplot(goal_match_result_info, x = "home_avg_goal", ax = axes[0, 0])
sns.histplot(goal_match_result_info, x = "home_opponent_avg_goal", ax = axes[0, 1])

axes[0, 0].set_xlabel("Average goal at home", fontsize = 15)
axes[0, 0].set_ylabel("Count", fontsize = 15)

axes[0, 1].set_ylabel("")
axes[0, 1].set_xlabel("Opponent's average goal at home", fontsize = 15)
axes[0, 1].yaxis.set_tick_params(labelleft = False)

sns.histplot(goal_match_result_info, x = "away_avg_goal", ax = axes[1, 0])
sns.histplot(goal_match_result_info, x = "away_opponent_avg_goal", ax = axes[1, 1])

axes[1, 0].set_xlabel("Average goal at away", fontsize = 15)
axes[1, 0].set_ylabel("Count", fontsize = 15)

axes[1, 1].set_ylabel("")
axes[1, 1].set_xlabel("Opponent's average goal at away", fontsize = 15)
axes[1, 1].yaxis.set_tick_params(labelleft = False)

plt.suptitle("Histogram of 299 soccer teams", fontsize = 20)

Text(0.5, 0.98, 'Histogram of 299 soccer teams')

png

  • It can be seen that the ability to score goals or block goals in home/away matches is very different for each team.
  • These differences will affect home and away win rates for each team.

  • Let’s compare the best and worst 10% teams with the win rate at home.
fig, axes = plt.subplots(1, 2, figsize = (12, 5))

sns.boxplot(goal_match_result_info, x = "home_avg_goal", y = "home_win_class", ax = axes[0])
sns.boxplot(goal_match_result_info, x = "home_opponent_avg_goal", y = "home_win_class", ax = axes[1])

axes[0].set_ylabel("Home win class", fontsize = 15)
axes[0].set_xlabel("Average goal at home", fontsize = 15)

axes[1].set_ylabel("")
axes[1].set_xlabel("Opponent's average goal at home", fontsize = 15)
axes[1].yaxis.set_tick_params(labelleft = False)

plt.tight_layout()

png

  • Teams in the best 10% of home win percentage score an average of 1.2 more goals at home than teams in the worst 10%.
  • For the best 10% teams with high win percentage at home, the average number of goals scored by the opposing team is around 0.8, very low compared to 1.6 from the worst 10% teams.
goal_match_result_info
team_api_id team_long_name home_avg_goal home_opponent_avg_goal away_avg_goal away_opponent_avg_goal win_at_home draw_at_home lose_at_home win_at_away draw_at_away lose_at_away home_win_class away_win_class
0 9930 FC Aarau 1.236111 1.652778 0.888889 2.152778 0.347222 0.166667 0.486111 0.138889 0.305556 0.555556 NaN NaN
1 8485 Aberdeen 1.223684 0.967105 1.177632 1.381579 0.440789 0.250000 0.309211 0.348684 0.243421 0.407895 NaN NaN
2 8576 AC Ajaccio 1.122807 1.350877 0.912281 1.877193 0.280702 0.333333 0.385965 0.105263 0.368421 0.526316 NaN away_win_worst_10%_teams
3 8564 Milan 1.794702 0.847682 1.480263 1.210526 0.609272 0.225166 0.165563 0.407895 0.296053 0.296053 NaN NaN
4 10215 Académica de Coimbra 1.096774 1.258065 0.838710 1.532258 0.282258 0.387097 0.330645 0.169355 0.274194 0.556452 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
283 10192 BSC Young Boys 2.230769 1.160839 1.468531 1.447552 0.594406 0.237762 0.167832 0.398601 0.244755 0.356643 NaN NaN
284 8021 Zagłębie Lubin 1.288889 1.200000 1.011111 1.377778 0.388889 0.300000 0.311111 0.266667 0.266667 0.466667 NaN NaN
285 8394 Real Zaragoza 1.250000 1.328947 0.842105 1.828947 0.381579 0.223684 0.394737 0.184211 0.223684 0.592105 NaN NaN
286 8027 Zawisza Bydgoszcz 1.433333 1.266667 1.066667 1.700000 0.433333 0.166667 0.400000 0.200000 0.300000 0.500000 NaN NaN
287 10000 SV Zulte-Waregem 1.660377 1.273585 1.226415 1.367925 0.424528 0.301887 0.273585 0.311321 0.283019 0.405660 NaN NaN

288 rows × 14 columns

fig, axes = plt.subplots(1, 2, figsize = (12, 5))

sns.boxplot(goal_match_result_info, x = "away_avg_goal", y = "away_win_class", ax = axes[0])
sns.boxplot(goal_match_result_info, x = "away_opponent_avg_goal", y = "away_win_class", ax = axes[1])

axes[0].set_ylabel("Away win class", fontsize = 15)
axes[0].set_xlabel("Average goal at away", fontsize = 15)

axes[1].set_ylabel("")
axes[1].set_xlabel("Opponent's average goal at away", fontsize = 15)
axes[1].yaxis.set_tick_params(labelleft = False)

png

  • Teams in the best 10% of away win percentage score an average of 1.2 more goals at away than teams in the worst 10%.
  • For the best 10% teams with high win percentage at away, the average number of goals scored by the opposing team is around 1, very low compared to 2.2 from the worst 10% teams.

$\color{magenta} \quad \rightarrow$ So, let's create a variable for each team that indicates how well they score goals when home, how well they block goals when home, how well they score goals when away, and how well they block goals when away:

$\quad \quad$ - Each team’s last 1 / 3 / 5 / 10 / 20 / 30 / 60 / 90 matches

$\quad \quad \quad \quad$ - average goal at home

$\ \quad \quad \quad \quad$ - average opponent’s goal at home

$\ \quad \quad \quad \quad$ - average goal at away

$\ \quad \quad \quad \quad$ - average opponent’s goal at away

</font>

home_goal_info = df_match_basic[["home_team_api_id", "match_date", "season", "home_team_goal", "away_team_goal"]].sort_values(["home_team_api_id", "match_date"])
for i in [1, 3, 5, 10, 20, 30, 60, 90]:
    home_goal_info[f"home_team_avg_goal_at_home_last_{i}_matches"] = home_goal_info.groupby('home_team_api_id')['home_team_goal'].shift(1).rolling(i).mean()
    home_goal_info[f"home_team_avg_oppnt_goal_at_home_last_{i}_matches"] = home_goal_info.groupby('home_team_api_id')['away_team_goal'].shift(1).rolling(i).mean()
away_goal_info = df_match_basic[["away_team_api_id", "match_date", "season", "home_team_goal", "away_team_goal"]].sort_values(["away_team_api_id", "match_date"])
for i in [1, 3, 5, 10, 20, 30, 60, 90]:
    away_goal_info[f"away_team_avg_goal_at_away_last_{i}_matches"] = away_goal_info.groupby('away_team_api_id')['away_team_goal'].shift(1).rolling(i).mean()
    away_goal_info[f"away_team_avg_oppnt_goal_at_away_last_{i}_matches"] = away_goal_info.groupby('away_team_api_id')['home_team_goal'].shift(1).rolling(i).mean()
df_team_win_goal_rolling_features = df_match_basic[["match_api_id", "match_date", "home_team_api_id", "away_team_api_id"]]
df_team_win_goal_rolling_features = df_team_win_goal_rolling_features.merge(home_goal_info.drop(["season", "home_team_goal", "away_team_goal"], axis = 1), how = "left", on = ["match_date", "home_team_api_id"]) \
                                                                     .merge(away_goal_info.drop(["season", "home_team_goal", "away_team_goal"], axis = 1), how = "left", on = ["match_date", "away_team_api_id"]) 
                                                                    
df_team_win_goal_rolling_features                                                         
match_api_id match_date home_team_api_id away_team_api_id home_team_avg_goal_at_home_last_1_matches home_team_avg_oppnt_goal_at_home_last_1_matches home_team_avg_goal_at_home_last_3_matches home_team_avg_oppnt_goal_at_home_last_3_matches home_team_avg_goal_at_home_last_5_matches home_team_avg_oppnt_goal_at_home_last_5_matches home_team_avg_goal_at_home_last_10_matches home_team_avg_oppnt_goal_at_home_last_10_matches home_team_avg_goal_at_home_last_20_matches home_team_avg_oppnt_goal_at_home_last_20_matches home_team_avg_goal_at_home_last_30_matches home_team_avg_oppnt_goal_at_home_last_30_matches home_team_avg_goal_at_home_last_60_matches home_team_avg_oppnt_goal_at_home_last_60_matches home_team_avg_goal_at_home_last_90_matches home_team_avg_oppnt_goal_at_home_last_90_matches away_team_avg_goal_at_away_last_1_matches away_team_avg_oppnt_goal_at_away_last_1_matches away_team_avg_goal_at_away_last_3_matches away_team_avg_oppnt_goal_at_away_last_3_matches away_team_avg_goal_at_away_last_5_matches away_team_avg_oppnt_goal_at_away_last_5_matches away_team_avg_goal_at_away_last_10_matches away_team_avg_oppnt_goal_at_away_last_10_matches away_team_avg_goal_at_away_last_20_matches away_team_avg_oppnt_goal_at_away_last_20_matches away_team_avg_goal_at_away_last_30_matches away_team_avg_oppnt_goal_at_away_last_30_matches away_team_avg_goal_at_away_last_60_matches away_team_avg_oppnt_goal_at_away_last_60_matches away_team_avg_goal_at_away_last_90_matches away_team_avg_oppnt_goal_at_away_last_90_matches
0 492473 2008-08-17 9987 9993 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 492474 2008-08-16 10000 9994 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 492475 2008-08-16 9984 8635 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 492476 2008-08-17 9991 9998 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 492477 2008-08-16 7947 9985 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 2015-09-22 10190 10191 0.0 2.0 0.333333 1.666667 1.6 1.2 1.6 1.3 1.55 1.30 1.500000 1.333333 1.533333 1.133333 1.400000 1.311111 3.0 3.0 1.666667 3.000000 2.0 2.8 1.1 2.3 0.85 1.60 1.166667 1.600000 1.033333 1.500000 0.977778 1.377778
25975 1992092 2015-09-23 9824 10199 1.0 0.0 1.666667 1.333333 1.4 1.4 1.1 1.6 0.95 1.55 1.133333 1.866667 NaN NaN NaN NaN 1.0 0.0 2.333333 0.666667 1.8 0.6 2.0 0.7 1.80 1.25 1.566667 1.533333 1.450000 1.633333 1.377778 1.588889
25976 1992093 2015-09-23 9956 10179 3.0 2.0 3.666667 2.000000 2.6 1.2 2.0 1.0 1.70 1.25 1.900000 1.333333 1.616667 1.133333 1.488889 1.222222 2.0 0.0 1.000000 1.333333 1.0 1.8 0.9 1.1 0.90 1.35 0.800000 1.366667 0.816667 1.550000 0.966667 1.355556
25977 1992094 2015-09-22 7896 10243 0.0 1.0 0.666667 1.333333 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 3.0 1.333333 2.000000 1.4 1.6 1.2 1.9 1.60 1.45 1.566667 1.500000 1.583333 1.500000 1.566667 1.455556
25978 1992095 2015-09-23 10192 9931 4.0 0.0 2.333333 0.666667 2.0 0.8 2.2 0.8 2.40 1.10 2.166667 1.266667 2.016667 1.166667 2.011111 1.222222 3.0 1.0 3.000000 1.333333 2.4 1.6 2.4 1.3 2.40 1.45 2.200000 1.266667 1.883333 1.166667 1.877778 1.188889

25979 rows × 36 columns

  • Also let's use the each team's win or lose percentage at home and away for recent 1 / 3 / 5 / 10 / 20 / 30 / 60 / 90 matches.
home_win_info = df_match_basic[["home_team_api_id", "match_date", "match_result"]].sort_values(["home_team_api_id", "match_date"])
home_win_info.loc[home_win_info.match_result == "home_win", "is_win"] = 1 
home_win_info.loc[home_win_info.match_result != "home_win", "is_win"] = 0 

home_win_info.loc[home_win_info.match_result == "away_win", "is_lose"] = 1
home_win_info.loc[home_win_info.match_result != "away_win", "is_lose"] = 0
for i in [1, 3, 5, 10, 20, 30, 60, 90]:
    home_win_info[f"home_team_home_win_percentage_last_{i}_matches"] = home_win_info.groupby('home_team_api_id')['is_win'].shift(1).rolling(i).mean()
    home_win_info[f"home_team_home_lose_percentage_last_{i}_matches"] = home_win_info.groupby('home_team_api_id')['is_lose'].shift(1).rolling(i).mean()
away_win_info = df_match_basic[["away_team_api_id", "match_date", "match_result"]].sort_values(["away_team_api_id", "match_date"])
away_win_info.loc[away_win_info.match_result == "away_win", "is_win"] = 1 
away_win_info.loc[away_win_info.match_result != "away_win", "is_win"] = 0 

away_win_info.loc[away_win_info.match_result == "home_win", "is_lose"] = 1
away_win_info.loc[away_win_info.match_result != "home_win", "is_lose"] = 0
for i in [1, 3, 5, 10, 20, 30, 60, 90]:
    away_win_info[f"away_team_away_win_percentage_last_{i}_matches"] = away_win_info.groupby('away_team_api_id')['is_win'].shift(1).rolling(i).mean()
    away_win_info[f"away_team_away_lose_percentage_last_{i}_matches"] = away_win_info.groupby('away_team_api_id')['is_lose'].shift(1).rolling(i).mean()
df_team_win_goal_rolling_features = df_team_win_goal_rolling_features.merge(home_win_info.drop(["match_result", "is_win", "is_lose"], axis = 1), how = "left", on = ["match_date", "home_team_api_id"]) \
                                                                     .merge(away_win_info.drop(["match_result", "is_win", "is_lose"], axis = 1), how = "left", on = ["match_date", "away_team_api_id"]) \
                                                                     .drop(["match_date", "home_team_api_id", "away_team_api_id"], axis = 1)
df_team_win_goal_rolling_features                                                         
match_api_id home_team_avg_goal_at_home_last_1_matches home_team_avg_oppnt_goal_at_home_last_1_matches home_team_avg_goal_at_home_last_3_matches home_team_avg_oppnt_goal_at_home_last_3_matches home_team_avg_goal_at_home_last_5_matches home_team_avg_oppnt_goal_at_home_last_5_matches home_team_avg_goal_at_home_last_10_matches home_team_avg_oppnt_goal_at_home_last_10_matches home_team_avg_goal_at_home_last_20_matches home_team_avg_oppnt_goal_at_home_last_20_matches home_team_avg_goal_at_home_last_30_matches home_team_avg_oppnt_goal_at_home_last_30_matches home_team_avg_goal_at_home_last_60_matches home_team_avg_oppnt_goal_at_home_last_60_matches home_team_avg_goal_at_home_last_90_matches home_team_avg_oppnt_goal_at_home_last_90_matches away_team_avg_goal_at_away_last_1_matches away_team_avg_oppnt_goal_at_away_last_1_matches away_team_avg_goal_at_away_last_3_matches away_team_avg_oppnt_goal_at_away_last_3_matches away_team_avg_goal_at_away_last_5_matches away_team_avg_oppnt_goal_at_away_last_5_matches away_team_avg_goal_at_away_last_10_matches away_team_avg_oppnt_goal_at_away_last_10_matches away_team_avg_goal_at_away_last_20_matches away_team_avg_oppnt_goal_at_away_last_20_matches away_team_avg_goal_at_away_last_30_matches away_team_avg_oppnt_goal_at_away_last_30_matches away_team_avg_goal_at_away_last_60_matches away_team_avg_oppnt_goal_at_away_last_60_matches away_team_avg_goal_at_away_last_90_matches away_team_avg_oppnt_goal_at_away_last_90_matches home_team_home_win_percentage_last_1_matches home_team_home_lose_percentage_last_1_matches home_team_home_win_percentage_last_3_matches home_team_home_lose_percentage_last_3_matches home_team_home_win_percentage_last_5_matches home_team_home_lose_percentage_last_5_matches home_team_home_win_percentage_last_10_matches home_team_home_lose_percentage_last_10_matches home_team_home_win_percentage_last_20_matches home_team_home_lose_percentage_last_20_matches home_team_home_win_percentage_last_30_matches home_team_home_lose_percentage_last_30_matches home_team_home_win_percentage_last_60_matches home_team_home_lose_percentage_last_60_matches home_team_home_win_percentage_last_90_matches home_team_home_lose_percentage_last_90_matches away_team_away_win_percentage_last_1_matches away_team_away_lose_percentage_last_1_matches away_team_away_win_percentage_last_3_matches away_team_away_lose_percentage_last_3_matches away_team_away_win_percentage_last_5_matches away_team_away_lose_percentage_last_5_matches away_team_away_win_percentage_last_10_matches away_team_away_lose_percentage_last_10_matches away_team_away_win_percentage_last_20_matches away_team_away_lose_percentage_last_20_matches away_team_away_win_percentage_last_30_matches away_team_away_lose_percentage_last_30_matches away_team_away_win_percentage_last_60_matches away_team_away_lose_percentage_last_60_matches away_team_away_win_percentage_last_90_matches away_team_away_lose_percentage_last_90_matches
0 492473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 492474 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 492475 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 492476 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 492477 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 0.0 2.0 0.333333 1.666667 1.6 1.2 1.6 1.3 1.55 1.30 1.500000 1.333333 1.533333 1.133333 1.400000 1.311111 3.0 3.0 1.666667 3.000000 2.0 2.8 1.1 2.3 0.85 1.60 1.166667 1.600000 1.033333 1.500000 0.977778 1.377778 0.0 1.0 0.000000 0.666667 0.4 0.4 0.4 0.4 0.40 0.30 0.4 0.266667 0.450000 0.233333 0.388889 0.333333 0.0 0.0 0.000000 0.666667 0.2 0.6 0.1 0.6 0.15 0.45 0.233333 0.466667 0.233333 0.516667 0.255556 0.477778
25975 1992092 1.0 0.0 1.666667 1.333333 1.4 1.4 1.1 1.6 0.95 1.55 1.133333 1.866667 NaN NaN NaN NaN 1.0 0.0 2.333333 0.666667 1.8 0.6 2.0 0.7 1.80 1.25 1.566667 1.533333 1.450000 1.633333 1.377778 1.588889 1.0 0.0 0.333333 0.000000 0.2 0.2 0.2 0.5 0.20 0.45 0.2 0.500000 NaN NaN NaN NaN 1.0 0.0 1.000000 0.000000 0.8 0.0 0.8 0.1 0.50 0.30 0.400000 0.433333 0.350000 0.450000 0.311111 0.455556
25976 1992093 3.0 2.0 3.666667 2.000000 2.6 1.2 2.0 1.0 1.70 1.25 1.900000 1.333333 1.616667 1.133333 1.488889 1.222222 2.0 0.0 1.000000 1.333333 1.0 1.8 0.9 1.1 0.90 1.35 0.800000 1.366667 0.816667 1.550000 0.966667 1.355556 1.0 0.0 0.666667 0.333333 0.6 0.2 0.5 0.1 0.45 0.30 0.5 0.300000 0.500000 0.283333 0.466667 0.333333 1.0 0.0 0.333333 0.333333 0.2 0.4 0.4 0.3 0.25 0.45 0.233333 0.500000 0.233333 0.550000 0.266667 0.466667
25977 1992094 0.0 1.0 0.666667 1.333333 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 3.0 1.333333 2.000000 1.4 1.6 1.2 1.9 1.60 1.45 1.566667 1.500000 1.583333 1.500000 1.566667 1.455556 0.0 1.0 0.333333 0.666667 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 1.0 0.333333 0.666667 0.4 0.4 0.3 0.4 0.45 0.30 0.466667 0.366667 0.450000 0.350000 0.455556 0.355556
25978 1992095 4.0 0.0 2.333333 0.666667 2.0 0.8 2.2 0.8 2.40 1.10 2.166667 1.266667 2.016667 1.166667 2.011111 1.222222 3.0 1.0 3.000000 1.333333 2.4 1.6 2.4 1.3 2.40 1.45 2.200000 1.266667 1.883333 1.166667 1.877778 1.188889 1.0 0.0 0.666667 0.333333 0.6 0.2 0.6 0.2 0.70 0.20 0.6 0.200000 0.566667 0.233333 0.522222 0.211111 1.0 0.0 1.000000 0.000000 0.6 0.2 0.7 0.1 0.70 0.20 0.666667 0.166667 0.533333 0.183333 0.511111 0.155556

25979 rows × 65 columns

  • Save the table.
df_team_win_goal_rolling_features.to_csv("../data/df_team_win_goal_rolling_features.csv", index = False)

1.3. Elapsed time of goals

Q. Are there specific times when goals are scored most?

plt.figure(figsize = (20, 10))
sns.histplot(data = df_match_ingame_shot_goal, x = "elapsed")
plt.xlabel("Elapsed time of goals", fontsize = 20)
plt.ylabel("count", fontsize = 20)
plt.vlines(x = 45, ymin = 0, ymax = 3500, colors = "red", linewidth = 3)
plt.vlines(x = 90, ymin = 0, ymax = 3500, colors = "red", linewidth = 3)
plt.text(x = 33, y = 3550, s = "End of the first half", color = "red", fontsize = 15)
plt.text(x = 75, y = 3550, s = "End of the second half", color = "red", fontsize = 15)
plt.title("Histogram of the elapsed time of goals", fontsize = 25)
Text(0.5, 1.0, 'Histogram of the elapsed time of goals')

png

  • Elapsed time of goals is almost uniformly distributed except for the right before the end of first and second half where a lot of goals are scored.

Q: Is there any difference in the distribution of the elapsed time of goals between winning and losing team ?

df_match_ingame_shot_goal
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id match_date match_result season home_away league_name
0 489042 378998 22 10261.0 goal n header 37799.0 38807.0 2008-08-17 draw 2008/2009 away England Premier League
1 489042 379019 24 10260.0 goal n shot 24148.0 24154.0 2008-08-17 draw 2008/2009 home England Premier League
2 489043 375546 4 9825.0 goal n shot 26181.0 39297.0 2008-08-16 home_win 2008/2009 home England Premier League
3 489044 378041 83 8650.0 goal n distance 30853.0 30889.0 2008-08-16 away_win 2008/2009 away England Premier League
4 489045 376060 4 8654.0 goal n shot 23139.0 36394.0 2008-08-16 home_win 2008/2009 home England Premier League
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
39975 1992228 5640015 71 10192.0 goal n NaN 37554.0 NaN 2016-05-25 away_win 2015/2016 away Switzerland Super League
39976 1992229 5639993 58 9824.0 goal n NaN 493418.0 NaN 2016-05-25 home_win 2015/2016 away Switzerland Super League
39977 1992229 5640008 67 10243.0 goal n NaN 197757.0 NaN 2016-05-25 home_win 2015/2016 home Switzerland Super League
39978 1992229 5640010 69 10243.0 goal n NaN 198082.0 NaN 2016-05-25 home_win 2015/2016 home Switzerland Super League
39979 1992229 5640020 76 10243.0 goal n NaN 121080.0 NaN 2016-05-25 home_win 2015/2016 home Switzerland Super League

39980 rows × 14 columns

win_team = pd.concat([df_match_ingame_shot_goal[(df_match_ingame_shot_goal.match_result == "home_win") & (df_match_ingame_shot_goal.home_away == "home")], 
                      df_match_ingame_shot_goal[(df_match_ingame_shot_goal.match_result == "away_win") & (df_match_ingame_shot_goal.home_away == "away")]])

lose_team = pd.concat([df_match_ingame_shot_goal[(df_match_ingame_shot_goal.match_result == "home_win") & (df_match_ingame_shot_goal.home_away == "away")], 
                      df_match_ingame_shot_goal[(df_match_ingame_shot_goal.match_result == "away_win") & (df_match_ingame_shot_goal.home_away == "home")]])
win_team["win_lose"] = "win_team"
lose_team["win_lose"] = "lose_team"

elapsed_win_lose = pd.concat([win_team[["elapsed", "win_lose"]], lose_team[["elapsed", "win_lose"]]])
plt.figure(figsize = (20, 10))
sns.histplot(elapsed_win_lose, x = "elapsed", hue = "win_lose")
plt.xlabel("Elapsed time of goals", fontsize = 20)
plt.ylabel("count", fontsize = 20)
plt.vlines(x = 45, ymin = 0, ymax = 3500, colors = "red", linewidth = 3)
plt.vlines(x = 90, ymin = 0, ymax = 3500, colors = "red", linewidth = 3)
plt.text(x = 33, y = 3550, s = "End of the first half", color = "red", fontsize = 15)
plt.text(x = 75, y = 3550, s = "End of the second half", color = "red", fontsize = 15)
plt.title("Histogram of the elapsed time of goals", fontsize = 25)

top = mpatches.Patch(color = "skyblue", label = 'Winning team')
bottom = mpatches.Patch(color = "orange", label = 'Losing team')
plt.legend(handles=[top, bottom], loc = "upper left", fontsize = 15)

<matplotlib.legend.Legend at 0x7ff683f6e140>

png

  • The losing team scores fewer goals, but there is no difference in the overall shape of the distribution of elapsed time of goals between the winning and losing team.

Q: Is there any difference in the distribution of the elapsed time of goals between home and away team ?

plt.figure(figsize = (20, 10))
sns.histplot(df_match_ingame_shot_goal, x = "elapsed", hue = "home_away")
plt.xlabel("Elapsed time of goals", fontsize = 20)
plt.ylabel("count", fontsize = 20)
plt.vlines(x = 45, ymin = 0, ymax = 3500, colors = "red", linewidth = 3)
plt.vlines(x = 90, ymin = 0, ymax = 3500, colors = "red", linewidth = 3)
plt.text(x = 33, y = 3550, s = "End of the first half", color = "red", fontsize = 15)
plt.text(x = 75, y = 3550, s = "End of the second half", color = "red", fontsize = 15)
plt.title("Histogram of the elapsed time of goals", fontsize = 25)

top = mpatches.Patch(color = "orange", label = 'Home team')
bottom = mpatches.Patch(color = "skyblue", label = 'Away team')
plt.legend(handles=[top, bottom], loc = "upper left", fontsize = 15)



<matplotlib.legend.Legend at 0x7ff683f4a740>

png

  • The away team scores fewer goals, but there is no difference in the overall shape of the distribution of elapsed time of goals between the home and away team.

$\color{magenta} \quad \rightarrow$ The elapsed time of goals doesn't seem to make much of a difference between home & away teams or winning & losing teams. Therefore, we decided not to use it in modeling.

2. Shot

df_match_ingame_shot_shot = df_match_ingame_shot[df_match_ingame_shot.category == "shot"]
df_match_ingame_shot_shot
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id match_date match_result season home_away league_name
39980 489042 378828 3 10260.0 shot shoton NaN 24154.0 NaN 2008-08-17 draw 2008/2009 home England Premier League
39981 489042 378866 7 10260.0 shot shoton NaN 24157.0 NaN 2008-08-17 draw 2008/2009 home England Premier League
39982 489042 378922 14 10260.0 shot shoton NaN 30829.0 NaN 2008-08-17 draw 2008/2009 home England Premier League
39983 489042 378923 14 10260.0 shot shoton NaN 30373.0 NaN 2008-08-17 draw 2008/2009 home England Premier League
39984 489042 378951 17 10260.0 shot shoton NaN 30373.0 NaN 2008-08-17 draw 2008/2009 home England Premier League
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
229033 2030171 4940379 19 8370.0 shot shotoff NaN 36130.0 NaN 2015-10-23 home_win 2015/2016 home Spain LIGA BBVA
229034 2030171 4940624 44 8370.0 shot shotoff NaN 34104.0 NaN 2015-10-23 home_win 2015/2016 home Spain LIGA BBVA
229035 2030171 4940738 49 8558.0 shot shotoff NaN 107930.0 NaN 2015-10-23 home_win 2015/2016 away Spain LIGA BBVA
229036 2030171 4940963 71 8370.0 shot shotoff NaN 210065.0 NaN 2015-10-23 home_win 2015/2016 home Spain LIGA BBVA
229037 2030171 4941072 83 8558.0 shot shotoff NaN 629579.0 NaN 2015-10-23 home_win 2015/2016 away Spain LIGA BBVA

189058 rows × 14 columns

df_match_ingame_shot_shot.type.value_counts()
shotoff    95303
shoton     93755
Name: type, dtype: int64
  • There are two kinds of shot type:
    • shoton: It means shot on target. Definition of the shot on target is as follow.
      • Goes into the net regardless of intent.
      • Is a clear attempt to score that would have gone into the net but for being saved by the goalkeeper or is stopped by a player who is the last line of defence with the goalkeeper having no chance of preventing the goal (last line block).
    • shotoff: It means shot off target. Definition of the shot on target is as follow.
      • Goes over or wide of the goal without making contact with another player.
      • Would have gone over or wide of the goal but for being stopped by a goalkeeper’s save or by an outfield player.
      • Directly hits the frame of the goal and a goal is not scored.
sns.countplot(x = df_match_ingame_shot_shot.type)
<AxesSubplot:xlabel='type', ylabel='count'>

png

  • The portion of the shot on target and the shot off target out of total shot are almost same.

Q. Are there any relationship between total number of shot, total number of shot on target, and total number of shot off target and number of goals.

goal_info = df_match_ingame_shot_goal.groupby(["match_api_id", "team_api_id"]).event_id.count().reset_index().rename(columns = {"event_id": "num_goals"})
shot_info = df_match_ingame_shot_shot.groupby(["match_api_id", "team_api_id"]).event_id.count().reset_index().rename(columns = {"event_id": "num_shots"})
shot_on_info = df_match_ingame_shot_shot[df_match_ingame_shot_shot.type == "shoton"].groupby(["match_api_id", "team_api_id"]).event_id.count().reset_index().rename(columns = {"event_id": "num_shots_on"})
shot_off_info = df_match_ingame_shot_shot[df_match_ingame_shot_shot.type == "shotoff"].groupby(["match_api_id", "team_api_id"]).event_id.count().reset_index().rename(columns = {"event_id": "num_shots_off"})
goal_shot_info = shot_info.merge(goal_info, how = "left", on = ["match_api_id", "team_api_id"]) \
                          .merge(shot_on_info, how = "left", on = ["match_api_id", "team_api_id"]) \
                          .merge(shot_off_info, how = "left", on = ["match_api_id", "team_api_id"]) \
                          [["match_api_id", "team_api_id", "num_shots_on", "num_shots_off", "num_shots", "num_goals"]] \
                          .fillna(0)
    
goal_shot_info[["num_shots_on", "num_shots_off", "num_shots", "num_goals"]].corr()
num_shots_on num_shots_off num_shots num_goals
num_shots_on 1.000000 0.328353 0.827919 0.105083
num_shots_off 0.328353 1.000000 0.801601 0.061156
num_shots 0.827919 0.801601 1.000000 0.102825
num_goals 0.105083 0.061156 0.102825 1.000000
  • If we check the correlations, the number of shot on target and the number of shots have higher correlation with the number of goals than the number of shot off.

Q. Is there any difference in the distribution of number of shots between winning team and losing team ?

target_bool = ((df_match_ingame_shot_shot.match_result == "home_win") & (df_match_ingame_shot_shot.home_away == "home")) | ((df_match_ingame_shot_shot.match_result == "away_win") & (df_match_ingame_shot_shot.home_away == "away"))
win_shot_info = df_match_ingame_shot_shot[target_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "values"})
win_shot_info["category"] = "shots"
win_shot_info["win_loss"] = "win"
target_bool = ((df_match_ingame_shot_shot.match_result == "home_win") & (df_match_ingame_shot_shot.home_away == "away")) | ((df_match_ingame_shot_shot.match_result == "away_win") & (df_match_ingame_shot_shot.home_away == "home"))
lose_shot_info = df_match_ingame_shot_shot[target_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "values"})
lose_shot_info["category"] = "shots"
lose_shot_info["win_loss"] = "lose"
target_bool = ((df_match_ingame_shot_shot.match_result == "home_win") & (df_match_ingame_shot_shot.home_away == "home")) | ((df_match_ingame_shot_shot.match_result == "away_win") & (df_match_ingame_shot_shot.home_away == "away"))
shot_on_bool = df_match_ingame_shot_shot.type == "shoton"
win_shot_on_info = df_match_ingame_shot_shot[target_bool & shot_on_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "values"})
win_shot_on_info["category"] = "shot_on"
win_shot_on_info["win_loss"] = "win"
target_bool = ((df_match_ingame_shot_shot.match_result == "home_win") & (df_match_ingame_shot_shot.home_away == "away")) | ((df_match_ingame_shot_shot.match_result == "away_win") & (df_match_ingame_shot_shot.home_away == "home"))
shot_on_bool = df_match_ingame_shot_shot.type == "shoton"
lose_shot_on_info = df_match_ingame_shot_shot[target_bool & shot_on_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "values"})
lose_shot_on_info["category"] = "shot_on"
lose_shot_on_info["win_loss"] = "lose"
target_bool = ((df_match_ingame_shot_shot.match_result == "home_win") & (df_match_ingame_shot_shot.home_away == "home")) | ((df_match_ingame_shot_shot.match_result == "away_win") & (df_match_ingame_shot_shot.home_away == "away"))
shot_on_bool = df_match_ingame_shot_shot.type == "shotoff"
win_shot_off_info = df_match_ingame_shot_shot[target_bool & shot_on_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "values"})
win_shot_off_info["category"] = "shot_off"
win_shot_off_info["win_loss"] = "win"
target_bool = ((df_match_ingame_shot_shot.match_result == "home_win") & (df_match_ingame_shot_shot.home_away == "away")) | ((df_match_ingame_shot_shot.match_result == "away_win") & (df_match_ingame_shot_shot.home_away == "home"))
shot_on_bool = df_match_ingame_shot_shot.type == "shotoff"
lose_shot_off_info = df_match_ingame_shot_shot[target_bool & shot_on_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "values"})
lose_shot_off_info["category"] = "shot_off"
lose_shot_off_info["win_loss"] = "lose"
win_lose_shot_info = pd.concat([win_shot_info, lose_shot_info, win_shot_on_info, lose_shot_on_info, win_shot_off_info, lose_shot_off_info])
fig, axes = plt.subplots(1, 3, figsize = (18, 5))

sns.histplot(data = win_lose_shot_info[win_lose_shot_info.category == "shots"], x = "values", hue = "win_loss", ax = axes[0])
axes[0].set_xlabel("Total number of shots", fontsize = 15)
axes[0].set_ylabel("Count", fontsize = 15)

sns.histplot(data = win_lose_shot_info[win_lose_shot_info.category == "shot_on"], x = "values", hue = "win_loss", ax = axes[1])
axes[1].set_xlabel("Total number of shots on target", fontsize = 15)
axes[1].set_ylabel("Count", fontsize = 15)

sns.histplot(data = win_lose_shot_info[win_lose_shot_info.category == "shot_off"], x = "values", hue = "win_loss", ax = axes[2])
axes[2].set_xlabel("Total number of shots off target", fontsize = 15)
axes[2].set_ylabel("Count", fontsize = 15)


Text(0, 0.5, 'Count')

png

  • The shape of the distribution of the winning team and the losing team is the same, but the distribution of the winning team is shifted slightly to the right.
  • It can be seen that the winning teams took slightly more shots overall.

Q. Is there any difference of shot on target ratio(the number of shots on target / the number of shots) between winnint team and losing team ?

win_lose_shot_info = win_shot_info[["match_api_id", "values"]].rename(columns = {"values": "win_num_shots"}).merge(lose_shot_info[["match_api_id", "values"]].rename(columns = {"values": "lose_num_shots"}), how = "left", on = "match_api_id") \
                                                                                                            .merge(win_shot_on_info[["match_api_id", "values"]].rename(columns = {"values": "win_num_shot_on"}), how = "left", on = "match_api_id") \
                                                                                                            .merge(lose_shot_on_info[["match_api_id", "values"]].rename(columns = {"values": "lose_num_shot_on"}), how = "left", on = "match_api_id") \
                                                                                                            .merge(win_shot_off_info[["match_api_id", "values"]].rename(columns = {"values": "win_num_shot_off"}), how = "left", on = "match_api_id") \
                                                                                                            .merge(lose_shot_off_info[["match_api_id", "values"]].rename(columns = {"values": "lose_num_shot_off"}), how = "left", on = "match_api_id")    
win_lose_shot_info["win_shot_ratio"] = win_lose_shot_info.win_num_shot_on / win_lose_shot_info.win_num_shots
win_lose_shot_info["lose_shot_ratio"] = win_lose_shot_info.lose_num_shot_on / win_lose_shot_info.lose_num_shots
win_lose_shot_info["shot_on_target_ratio_difference"] = win_lose_shot_info.win_shot_ratio - win_lose_shot_info.lose_shot_ratio

my_histogram(win_lose_shot_info, "shot_on_target_ratio_difference")

png

  • Above plot shows the histogram of the shot on target ratio difference between the winning team and the losing team.
  • There are also half of the matches where the losing team had the higher shot on target ratio.
  • Since the distribution is symmetric, it is difficult to say that the winning team had the particularly higher shot on target ratio than that of the losing team.

Q. Is there any difference in the distribution of number of shots between home team and away team ?

home_shot_info = df_match_ingame_shot_shot[df_match_ingame_shot_shot.home_away == "home"].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "home_num_shots"}).drop("team_api_id", axis = 1)
away_shot_info = df_match_ingame_shot_shot[df_match_ingame_shot_shot.home_away == "away"].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "away_num_shots"}).drop("team_api_id", axis = 1)

target_bool = (df_match_ingame_shot_shot.home_away == "home") & (df_match_ingame_shot_shot.type == "shoton")
home_shot_on_info = df_match_ingame_shot_shot[target_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "home_num_shots_on"}).drop("team_api_id", axis = 1)

target_bool = (df_match_ingame_shot_shot.home_away == "away") & (df_match_ingame_shot_shot.type == "shoton")
away_shot_on_info = df_match_ingame_shot_shot[target_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "away_num_shots_on"}).drop("team_api_id", axis = 1)

target_bool = (df_match_ingame_shot_shot.home_away == "home") & (df_match_ingame_shot_shot.type == "shotoff")
home_shot_off_info = df_match_ingame_shot_shot[target_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "home_num_shots_off"}).drop("team_api_id", axis = 1)

target_bool = (df_match_ingame_shot_shot.home_away == "away") & (df_match_ingame_shot_shot.type == "shotoff")
away_shot_off_info = df_match_ingame_shot_shot[target_bool].groupby(["match_api_id", "team_api_id"]).count().event_id.reset_index().rename(columns = {"event_id": "away_num_shots_off"}).drop("team_api_id", axis = 1)
home_away_shot_info = home_shot_info.merge(away_shot_info, how = "left", on = "match_api_id") \
                                    .merge(home_shot_on_info, how = "left", on = "match_api_id") \
                                    .merge(away_shot_on_info, how = "left", on = "match_api_id") \
                                    .merge(home_shot_off_info, how = "left", on = "match_api_id") \
                                    .merge(away_shot_off_info, how = "left", on = "match_api_id")
    
home_away_shot_info
match_api_id home_num_shots away_num_shots home_num_shots_on away_num_shots_on home_num_shots_off away_num_shots_off
0 489042 21 10.0 11.0 1.0 10.0 9.0
1 489043 25 5.0 12.0 2.0 13.0 3.0
2 489044 7 16.0 4.0 11.0 3.0 5.0
3 489045 12 22.0 5.0 7.0 7.0 15.0
4 489046 9 14.0 5.0 9.0 4.0 5.0
... ... ... ... ... ... ... ...
8457 2060642 10 8.0 2.0 5.0 8.0 3.0
8458 2060643 17 12.0 8.0 8.0 9.0 4.0
8459 2060644 16 6.0 8.0 3.0 8.0 3.0
8460 2060645 8 15.0 3.0 7.0 5.0 8.0
8461 2118418 10 8.0 6.0 6.0 4.0 2.0

8462 rows × 7 columns

fig, axes = plt.subplots(1, 3, figsize = (18, 5))

sns.histplot(data = home_away_shot_info[["home_num_shots", "away_num_shots"]].melt(), x = "value", hue = "variable", ax = axes[0])
axes[0].set_xlabel("Total number of shots", fontsize = 15)
axes[0].set_ylabel("Count", fontsize = 15)

sns.histplot(data = home_away_shot_info[["home_num_shots_on", "away_num_shots_on"]].melt(), x = "value", hue = "variable", ax = axes[1])
axes[1].set_xlabel("Total number of shots on target", fontsize = 15)
axes[1].set_ylabel("Count", fontsize = 15)

sns.histplot(data = home_away_shot_info[["home_num_shots_off", "away_num_shots_off"]].melt(), x = "value", hue = "variable", ax = axes[2])
axes[2].set_xlabel("Total number of shots off target", fontsize = 15)
axes[2].set_ylabel("Count", fontsize = 15)
Text(0, 0.5, 'Count')

png

  • The shape of the distribution of the home team and the away team is the same, but the distribution of the home team is shifted slightly to the right.
  • It can be seen that the home teams took slightly more shots overall.

Q. Is there any difference of shot on target ratio(the number of shots on target / the number of shots) between home team and away team ?

home_away_shot_info["home_shot_on_target_ratio"] = home_away_shot_info.home_num_shots_on / home_away_shot_info.home_num_shots
home_away_shot_info["away_shot_on_target_ratio"] = home_away_shot_info.away_num_shots_on / home_away_shot_info.away_num_shots
home_away_shot_info["shot_on_target_ratio_difference"] = home_away_shot_info.home_shot_on_target_ratio - home_away_shot_info.away_shot_on_target_ratio
my_histogram(home_away_shot_info, "shot_on_target_ratio_difference")

png

  • Above plot shows the histogram of the shot on target ratio difference between the home team and the away team.
  • There are also half of the matches where the away team had the higher shot on target ratio.
  • Since the distribution is symmetric, it is difficult to say that the home team had the particularly higher shot on target ratio than that of the away team.

  • We have found that:
    • The number of shots and the number of shots on target have higher correlation with the number of goals than the number of shot off target

    • The shot on target ratio (the number of shots on target / total number of shots) have no difference between winning team and the losing team.

    • The shot on target ratio (the number of shots on target / total number of shots) have no difference between home team and the away team.

    • The winning teams usually have more shots, shots on target, and shots off target than those of the losing teams. $\$

    • The home teams usually have more shots, shots on target, and shots off target than those of the away teams. $\$

$\ \quad \rightarrow$ Of the three shot related variables(totla number of shots, number of shots on target, number of shots off target), consider only the total number of shots.

  • Let’s check whether we can use the variables related to the number of shots for modeling.
df_match_ingame_shot_shot.match_api_id.drop_duplicates()
39980      489042
39992      489043
40006      489044
40021      489045
40033      489046
           ...   
133702    2030168
133709    2030169
133714    2030170
133724    2030171
177951     499537
Name: match_api_id, Length: 8464, dtype: int64
df_match_basic.match_api_id.drop_duplicates()
0         492473
1         492474
2         492475
3         492476
4         492477
          ...   
25974    1992091
25975    1992092
25976    1992093
25977    1992094
25978    1992095
Name: match_api_id, Length: 25979, dtype: int64
  • Total number of matches: 25,979
  • The number of matches that we have information about shots: 8,464 $\rightarrow$ That is, there are only 30% matches that have shots information. So it is hard to use the information related to the shots for training the model.

$\color{magenta} \quad \rightarrow$ So, we can show information related to number of shots into a dashboard later, but it is hard to use this information for modeling because there are too many matches that we don't have the shots information.

3. Summary

  • Create a variable for each team that indicates how well they score goals when home, how well they block goals when home, how well they score goals when away, and how well they block goals when away:
    • Each team’s last 1 / 3 / 5 / 10 / 20 / 30 / 60 / 90 matches
      • average goal at home
      • average opponent’s goal at home
      • average goal at away
      • average opponent’s goal at away
  • Also, use the each team’s win or lose percentage at home and away for recent 1 / 3 / 5 / 10 / 20 / 30 / 60 / 90 matches.

  • We can show information related to number of shots into a dashboard later, but it is hard to use this information for modeling because there are too many matches that we don’t have the shots information.

2.4.summary