1) Load and prepare data

107 minute read

import numpy as np 
import pandas as pd 

import sqlite3 as sql 
import xml.etree.ElementTree as ET
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
  • The european soccer database consists of the data related to the soccer matches of 10 european soccer leagues from 2008/2009 season to 2015/2016 season.
con = sql.connect("../data/database.sqlite")
sql_master_tables = pd.read_sql('select * from sqlite_master where type="table";',con)
sql_master_tables
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 4 CREATE TABLE sqlite_sequence(name,seq)
1 table Player_Attributes Player_Attributes 11 CREATE TABLE "Player_Attributes" (\n\t`id`\tIN...
2 table Player Player 14 CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA...
3 table Match Match 18 CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR...
4 table League League 24 CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA...
5 table Country Country 26 CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM...
6 table Team Team 29 CREATE TABLE "Team" (\n\t`id`\tINTEGER PRIMARY...
7 table Team_Attributes Team_Attributes 2 CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE...
pd.read_sql(
    "select * from sqlite_sequence", con
    )
name seq
0 Team 103916
1 Country 51958
2 League 51958
3 Match 51958
4 Player 11075
5 Player_Attributes 183978
6 Team_Attributes 1458
  • There are 7 tables.
    • Player related tables
      • Player table
      • Player_Attributes table
    • Team related tables
      • Country table
      • League table
      • Team table
      • Team_Attributes table
    • Match related tables
      • Match table

1. Player related tables: Player, Player_Attributes

1.1. Player

org_player = pd.read_sql(
    "select * from Player", con
    )

org_player.head()
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 00:00:00 182.88 187
1 2 155782 Aaron Cresswell 189615 1989-12-15 00:00:00 170.18 146
2 3 162549 Aaron Doran 186170 1991-05-13 00:00:00 170.18 163
3 4 30572 Aaron Galindo 140161 1982-05-08 00:00:00 182.88 198
4 5 23780 Aaron Hughes 17725 1979-11-08 00:00:00 182.88 154
org_player.shape
(11060, 7)
len(org_player.player_api_id.unique())
11060
len(org_player.player_fifa_api_id.unique())
11060
sum(org_player.groupby(["player_api_id"]).nunique().player_fifa_api_id != 1)
0
  • There are total 11,060 soccer players and their birthday, height, and weight information.
  • Each player has exactly one player_api_id, one player_fifa_api_id, and one (player_api_id, player_fifa_api_id) set. So, we can use any player id to identify a player.

1.2. Player_Attributes

org_player_attr = pd.read_sql(
    "select * from Player_Attributes", con
    )

org_player_attr.head()
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 63.0 41.0 45.0 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
org_player_attr.shape
(183978, 42)
  • There are total 38 player attributes.
  • player_fifa_api_id or player_api_id can be used to merge the Player and Player_Attribute tables.
len(org_player_attr.player_fifa_api_id.unique())
11062
len(org_player_attr.player_api_id.unique())
11060
  • There are different number of unique player_api_id and player_fifa_api_id.
  • It means that some player_api_id have different player_fifa_api_id.
target_bool = org_player_attr.groupby(["player_api_id"]).nunique().player_fifa_api_id != 1
target_player_api = org_player_attr.groupby(["player_api_id"]).nunique().player_fifa_api_id[target_bool].index
org_player_attr[org_player_attr.player_api_id.isin(target_player_api)].sort_values(["player_api_id", "date"]).head(15)
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
104862 104863 202486 11285 2007-02-22 00:00:00 62.0 68.0 right high medium 62.0 68.0 47.0 56.0 35.0 63.0 56.0 53.0 47.0 57.0 84.0 81.0 74.0 73.0 71.0 54.0 68.0 64.0 52.0 56.0 53.0 22.0 71.0 41.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104881 104882 208618 11285 2007-02-22 00:00:00 NaN NaN None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
104861 104862 202486 11285 2011-02-22 00:00:00 68.0 77.0 right medium medium 41.0 26.0 72.0 58.0 38.0 49.0 33.0 51.0 59.0 52.0 59.0 66.0 52.0 63.0 76.0 68.0 75.0 64.0 81.0 53.0 71.0 68.0 31.0 49.0 47.0 70.0 63.0 64.0 11.0 13.0 8.0 8.0 12.0
104860 104861 202486 11285 2011-08-30 00:00:00 65.0 69.0 right medium medium 41.0 26.0 72.0 58.0 38.0 49.0 33.0 51.0 59.0 51.0 54.0 51.0 51.0 58.0 41.0 68.0 47.0 50.0 73.0 53.0 71.0 65.0 31.0 49.0 47.0 69.0 63.0 64.0 11.0 13.0 8.0 8.0 12.0
104859 104860 202486 11285 2012-02-22 00:00:00 66.0 72.0 right medium medium 41.0 26.0 74.0 63.0 38.0 49.0 33.0 51.0 59.0 61.0 54.0 51.0 51.0 58.0 41.0 68.0 47.0 51.0 75.0 53.0 71.0 65.0 31.0 49.0 47.0 65.0 69.0 64.0 11.0 13.0 8.0 8.0 12.0
104880 104881 208618 11285 2012-08-31 00:00:00 62.0 68.0 right high medium 62.0 68.0 47.0 56.0 35.0 63.0 56.0 53.0 47.0 57.0 84.0 81.0 74.0 73.0 71.0 54.0 68.0 64.0 52.0 56.0 53.0 22.0 71.0 41.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104879 104880 208618 11285 2014-03-21 00:00:00 62.0 68.0 right high medium 62.0 68.0 47.0 56.0 35.0 63.0 56.0 53.0 47.0 57.0 84.0 81.0 74.0 73.0 71.0 54.0 68.0 64.0 52.0 56.0 53.0 22.0 71.0 41.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104878 104879 208618 11285 2014-04-11 00:00:00 63.0 68.0 right high medium 63.0 68.0 47.0 57.0 35.0 64.0 56.0 53.0 49.0 58.0 84.0 81.0 74.0 74.0 71.0 54.0 68.0 64.0 52.0 56.0 53.0 22.0 71.0 45.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104877 104878 208618 11285 2014-04-25 00:00:00 65.0 68.0 right high medium 63.0 65.0 47.0 62.0 35.0 65.0 56.0 67.0 49.0 64.0 84.0 81.0 74.0 74.0 71.0 54.0 68.0 64.0 52.0 56.0 53.0 22.0 71.0 51.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104858 104859 202486 11285 2014-09-18 00:00:00 65.0 68.0 right high medium 63.0 65.0 47.0 64.0 35.0 65.0 56.0 67.0 49.0 64.0 84.0 78.0 74.0 74.0 71.0 54.0 67.0 64.0 53.0 56.0 53.0 22.0 71.0 51.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104876 104877 208618 11285 2014-09-18 00:00:00 NaN NaN None None None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
104875 104876 208618 11285 2014-11-14 00:00:00 65.0 68.0 right high medium 63.0 65.0 47.0 64.0 35.0 65.0 56.0 67.0 49.0 64.0 84.0 78.0 74.0 74.0 71.0 54.0 67.0 64.0 53.0 56.0 53.0 22.0 71.0 51.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104874 104875 208618 11285 2014-11-28 00:00:00 65.0 68.0 right high medium 63.0 65.0 47.0 64.0 35.0 65.0 56.0 67.0 49.0 64.0 84.0 78.0 74.0 74.0 71.0 54.0 67.0 64.0 53.0 56.0 53.0 22.0 71.0 51.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104873 104874 208618 11285 2015-02-13 00:00:00 70.0 77.0 right high medium 70.0 65.0 47.0 72.0 35.0 70.0 56.0 68.0 62.0 69.0 84.0 78.0 74.0 77.0 71.0 54.0 67.0 64.0 53.0 56.0 53.0 22.0 76.0 58.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
104872 104873 208618 11285 2015-02-20 00:00:00 70.0 77.0 right high medium 70.0 65.0 47.0 72.0 43.0 70.0 56.0 68.0 62.0 69.0 84.0 78.0 74.0 77.0 71.0 54.0 67.0 64.0 53.0 56.0 53.0 22.0 76.0 58.0 55.0 22.0 24.0 20.0 8.0 15.0 8.0 8.0 6.0
  • For example, player_api_id 11285 have two player_fifa_api_id 208618 and 202486.
  • So it is better to use the player_api_id when merge tables.
  • There are 11060 players’ attributes based on the number of unique player_api_id.
  • Let’s make one player table by merging the Player and Player_Attributes table.
  • We drop the id and player_fifa_api_id columns because we will only use player_api_id to identify a player later.
df_player = org_player_attr.drop(["id", "player_fifa_api_id"], axis = 1) \
                           .merge(org_player.drop(["id", "player_fifa_api_id"], axis = 1), 
                                  how = "left", on = "player_api_id") 
               
df_player
player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes player_name birthday height weight
0 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187
1 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187
2 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 63.0 41.0 45.0 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187
3 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187
4 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183973 39902 2009-08-30 00:00:00 83.0 85.0 right medium low 84.0 77.0 59.0 89.0 77.0 84.0 86.0 78.0 84.0 85.0 66.0 72.0 77.0 86.0 73.0 76.0 58.0 72.0 67.0 81.0 56.0 78.0 86.0 88.0 83.0 22.0 31.0 30.0 9.0 20.0 84.0 20.0 20.0 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176
183974 39902 2009-02-22 00:00:00 78.0 80.0 right medium low 74.0 76.0 53.0 84.0 77.0 85.0 86.0 74.0 73.0 86.0 66.0 67.0 77.0 74.0 73.0 75.0 58.0 66.0 65.0 73.0 61.0 64.0 72.0 88.0 70.0 32.0 31.0 30.0 9.0 20.0 73.0 20.0 20.0 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176
183975 39902 2008-08-30 00:00:00 77.0 80.0 right medium low 74.0 71.0 53.0 84.0 77.0 85.0 86.0 74.0 73.0 86.0 66.0 67.0 77.0 74.0 73.0 75.0 58.0 66.0 65.0 73.0 67.0 64.0 72.0 88.0 70.0 32.0 31.0 30.0 9.0 20.0 73.0 20.0 20.0 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176
183976 39902 2007-08-30 00:00:00 78.0 81.0 right medium low 74.0 64.0 57.0 86.0 77.0 87.0 86.0 73.0 73.0 91.0 61.0 60.0 77.0 69.0 73.0 72.0 58.0 67.0 59.0 78.0 63.0 63.0 68.0 88.0 53.0 28.0 32.0 30.0 9.0 20.0 73.0 20.0 20.0 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176
183977 39902 2007-02-22 00:00:00 80.0 81.0 right medium low 74.0 68.0 57.0 88.0 77.0 87.0 86.0 53.0 78.0 91.0 58.0 64.0 77.0 66.0 73.0 72.0 58.0 67.0 59.0 78.0 63.0 63.0 68.0 88.0 53.0 38.0 32.0 30.0 9.0 9.0 78.0 7.0 15.0 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176

183978 rows × 44 columns

df_player.dtypes
player_api_id            int64
date                    object
overall_rating         float64
potential              float64
preferred_foot          object
attacking_work_rate     object
defensive_work_rate     object
crossing               float64
finishing              float64
heading_accuracy       float64
short_passing          float64
volleys                float64
dribbling              float64
curve                  float64
free_kick_accuracy     float64
long_passing           float64
ball_control           float64
acceleration           float64
sprint_speed           float64
agility                float64
reactions              float64
balance                float64
shot_power             float64
jumping                float64
stamina                float64
strength               float64
long_shots             float64
aggression             float64
interceptions          float64
positioning            float64
vision                 float64
penalties              float64
marking                float64
standing_tackle        float64
sliding_tackle         float64
gk_diving              float64
gk_handling            float64
gk_kicking             float64
gk_positioning         float64
gk_reflexes            float64
player_name             object
birthday                object
height                 float64
weight                   int64
dtype: object
  • birthday and date columns need to be changed to the datetime type.
  • Since there are other date columns that have different meanings in other tables, let’s change the column name date in this table to the player_measured_date.
df_player["player_measured_date"] = pd.to_datetime(df_player.date)
df_player["birthday"] = pd.to_datetime(df_player.birthday)
# just change the columns order
df_player = df_player[["player_api_id", "player_name", "birthday", "height", "weight", "player_measured_date", 
                       'overall_rating', 'potential', 'preferred_foot', 'attacking_work_rate', 'defensive_work_rate', 
                       'crossing', 'finishing', 'heading_accuracy', 'short_passing', 'volleys', 
                       'dribbling', 'curve', 'free_kick_accuracy', 'long_passing', 'ball_control', 
                       'acceleration', 'sprint_speed', 'agility', 'reactions', 'balance', 
                       'shot_power', 'jumping', 'stamina', 'strength', 'long_shots', 
                       'aggression', 'interceptions', 'positioning', 'vision', 'penalties', 
                       'marking', 'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling', 
                       'gk_kicking', 'gk_positioning', 'gk_reflexes']]

df_player
player_api_id player_name birthday height weight player_measured_date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 505942 Aaron Appindangoye 1992-02-29 182.88 187 2016-02-18 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 505942 Aaron Appindangoye 1992-02-29 182.88 187 2015-11-19 67.0 71.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 505942 Aaron Appindangoye 1992-02-29 182.88 187 2015-09-21 62.0 66.0 right medium medium 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 63.0 41.0 45.0 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 505942 Aaron Appindangoye 1992-02-29 182.88 187 2015-03-20 61.0 65.0 right medium medium 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 505942 Aaron Appindangoye 1992-02-29 182.88 187 2007-02-22 61.0 65.0 right medium medium 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183973 39902 Zvjezdan Misimovic 1982-06-05 180.34 176 2009-08-30 83.0 85.0 right medium low 84.0 77.0 59.0 89.0 77.0 84.0 86.0 78.0 84.0 85.0 66.0 72.0 77.0 86.0 73.0 76.0 58.0 72.0 67.0 81.0 56.0 78.0 86.0 88.0 83.0 22.0 31.0 30.0 9.0 20.0 84.0 20.0 20.0
183974 39902 Zvjezdan Misimovic 1982-06-05 180.34 176 2009-02-22 78.0 80.0 right medium low 74.0 76.0 53.0 84.0 77.0 85.0 86.0 74.0 73.0 86.0 66.0 67.0 77.0 74.0 73.0 75.0 58.0 66.0 65.0 73.0 61.0 64.0 72.0 88.0 70.0 32.0 31.0 30.0 9.0 20.0 73.0 20.0 20.0
183975 39902 Zvjezdan Misimovic 1982-06-05 180.34 176 2008-08-30 77.0 80.0 right medium low 74.0 71.0 53.0 84.0 77.0 85.0 86.0 74.0 73.0 86.0 66.0 67.0 77.0 74.0 73.0 75.0 58.0 66.0 65.0 73.0 67.0 64.0 72.0 88.0 70.0 32.0 31.0 30.0 9.0 20.0 73.0 20.0 20.0
183976 39902 Zvjezdan Misimovic 1982-06-05 180.34 176 2007-08-30 78.0 81.0 right medium low 74.0 64.0 57.0 86.0 77.0 87.0 86.0 73.0 73.0 91.0 61.0 60.0 77.0 69.0 73.0 72.0 58.0 67.0 59.0 78.0 63.0 63.0 68.0 88.0 53.0 28.0 32.0 30.0 9.0 20.0 73.0 20.0 20.0
183977 39902 Zvjezdan Misimovic 1982-06-05 180.34 176 2007-02-22 80.0 81.0 right medium low 74.0 68.0 57.0 88.0 77.0 87.0 86.0 53.0 78.0 91.0 58.0 64.0 77.0 66.0 73.0 72.0 58.0 67.0 59.0 78.0 63.0 63.0 68.0 88.0 53.0 38.0 32.0 30.0 9.0 9.0 78.0 7.0 15.0

183978 rows × 44 columns

df_player.to_csv("../data/df_player.csv", index = False)

1.4. Summary

df_player_summary

2. Match table

org_match = pd.read_sql(
    "select * from Match", con
    )

org_match.head()
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal away_team_goal home_player_X1 home_player_X2 home_player_X3 home_player_X4 home_player_X5 home_player_X6 home_player_X7 home_player_X8 home_player_X9 home_player_X10 home_player_X11 away_player_X1 away_player_X2 away_player_X3 away_player_X4 away_player_X5 away_player_X6 away_player_X7 away_player_X8 away_player_X9 away_player_X10 away_player_X11 home_player_Y1 home_player_Y2 home_player_Y3 home_player_Y4 home_player_Y5 home_player_Y6 home_player_Y7 home_player_Y8 home_player_Y9 home_player_Y10 home_player_Y11 away_player_Y1 away_player_Y2 away_player_Y3 away_player_Y4 away_player_Y5 away_player_Y6 away_player_Y7 away_player_Y8 away_player_Y9 away_player_Y10 away_player_Y11 home_player_1 home_player_2 home_player_3 home_player_4 home_player_5 home_player_6 home_player_7 home_player_8 home_player_9 home_player_10 home_player_11 away_player_1 away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11 goal shoton shotoff foulcommit card cross corner possession B365H B365D B365A BWH BWD BWA IWH IWD IWA LBH LBD LBA PSH PSD PSA WHH WHD WHA SJH SJD SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 1.73 3.40 5.00 1.75 3.35 4.20 1.85 3.2 3.5 1.80 3.3 3.75 NaN NaN NaN 1.70 3.30 4.33 1.90 3.3 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 1.95 3.20 3.60 1.80 3.30 3.95 1.90 3.2 3.5 1.90 3.2 3.50 NaN NaN NaN 1.83 3.30 3.60 1.95 3.3 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 2.38 3.30 2.75 2.40 3.30 2.55 2.60 3.1 2.3 2.50 3.2 2.50 NaN NaN NaN 2.50 3.25 2.40 2.63 3.3 2.50 2.35 3.25 2.65 2.50 3.20 2.50 2.30 3.20 2.75
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 1.44 3.75 7.50 1.40 4.00 6.80 1.40 3.9 6.0 1.44 3.6 6.50 NaN NaN NaN 1.44 3.75 6.00 1.44 4.0 7.50 1.45 3.75 6.50 1.50 3.75 5.50 1.44 3.75 6.50
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None None None None None None None None 5.00 3.50 1.65 5.00 3.50 1.60 4.00 3.3 1.7 4.00 3.4 1.72 NaN NaN NaN 4.20 3.40 1.70 4.50 3.5 1.73 4.50 3.40 1.65 4.50 3.50 1.65 4.75 3.30 1.67
org_match.shape
(25979, 115)
len(org_match.match_api_id.unique())
25979
  • There are total 25,979 matches and 115 match information variables.
  • Since there are 25,979 number of unique match_api_id, let’s use the match_api_id to identify a match.

  • Match information variables can be categorized in 4 categories
    • basic match information
      • country id, league id
      • home team id, away team id
      • season, stage, date
    • player information
      • home team players, away team players
    • betting information
    • in-game information
      • goal, shoton, shotoff
      • foulcommit, card
      • cross, corner
      • possession
      • home_team_goal, away_team_goal
  • Let’s divide the match table into 4 different tables to make analysis easier later.

2.1. Basic match information

basic_match_information_col = ["match_api_id", "country_id", "league_id", "home_team_api_id", "away_team_api_id", "season", "stage", "date", "home_team_goal", "away_team_goal"]

df_match_basic = org_match[basic_match_information_col]
df_match_basic
match_api_id country_id league_id home_team_api_id away_team_api_id season stage date home_team_goal away_team_goal
0 492473 1 1 9987 9993 2008/2009 1 2008-08-17 00:00:00 1 1
1 492474 1 1 10000 9994 2008/2009 1 2008-08-16 00:00:00 0 0
2 492475 1 1 9984 8635 2008/2009 1 2008-08-16 00:00:00 0 3
3 492476 1 1 9991 9998 2008/2009 1 2008-08-17 00:00:00 5 0
4 492477 1 1 7947 9985 2008/2009 1 2008-08-16 00:00:00 1 3
... ... ... ... ... ... ... ... ... ... ...
25974 1992091 24558 24558 10190 10191 2015/2016 9 2015-09-22 00:00:00 1 0
25975 1992092 24558 24558 9824 10199 2015/2016 9 2015-09-23 00:00:00 1 2
25976 1992093 24558 24558 9956 10179 2015/2016 9 2015-09-23 00:00:00 2 0
25977 1992094 24558 24558 7896 10243 2015/2016 9 2015-09-22 00:00:00 0 0
25978 1992095 24558 24558 10192 9931 2015/2016 9 2015-09-23 00:00:00 4 3

25979 rows × 10 columns

df_match_basic.dtypes
match_api_id         int64
country_id           int64
league_id            int64
home_team_api_id     int64
away_team_api_id     int64
season              object
stage                int64
date                object
home_team_goal       int64
away_team_goal       int64
dtype: object
  • The date column need to be converted to the datetime dtype.
  • Since there are other date columns that have different meanings in other tables, let’s change the column name date in this table to the match_date.
df_match_basic["match_date"] = pd.to_datetime(df_match_basic.date)
# Just change the columns order
df_match_basic = df_match_basic[["match_api_id", "country_id", "league_id", "season", "stage", 
                                 "match_date", "home_team_api_id", "away_team_api_id", "home_team_goal", "away_team_goal"]]
df_match_basic
match_api_id country_id league_id season stage match_date home_team_api_id away_team_api_id home_team_goal away_team_goal
0 492473 1 1 2008/2009 1 2008-08-17 9987 9993 1 1
1 492474 1 1 2008/2009 1 2008-08-16 10000 9994 0 0
2 492475 1 1 2008/2009 1 2008-08-16 9984 8635 0 3
3 492476 1 1 2008/2009 1 2008-08-17 9991 9998 5 0
4 492477 1 1 2008/2009 1 2008-08-16 7947 9985 1 3
... ... ... ... ... ... ... ... ... ... ...
25974 1992091 24558 24558 2015/2016 9 2015-09-22 10190 10191 1 0
25975 1992092 24558 24558 2015/2016 9 2015-09-23 9824 10199 1 2
25976 1992093 24558 24558 2015/2016 9 2015-09-23 9956 10179 2 0
25977 1992094 24558 24558 2015/2016 9 2015-09-22 7896 10243 0 0
25978 1992095 24558 24558 2015/2016 9 2015-09-23 10192 9931 4 3

25979 rows × 10 columns

  • Make the match result column that will be our target variable in the modeling.
    • match_result: home_win / draw / away_win
home_win_bool = df_match_basic.home_team_goal > df_match_basic.away_team_goal
draw_bool = df_match_basic.home_team_goal == df_match_basic.away_team_goal
away_win_bool = df_match_basic.home_team_goal < df_match_basic.away_team_goal

df_match_basic.loc[home_win_bool, "match_result"] = "home_win"
df_match_basic.loc[draw_bool, "match_result"] = "draw"
df_match_basic.loc[away_win_bool, "match_result"] = "away_win"

df_match_basic
match_api_id country_id league_id season stage match_date home_team_api_id away_team_api_id home_team_goal away_team_goal match_result
0 492473 1 1 2008/2009 1 2008-08-17 9987 9993 1 1 draw
1 492474 1 1 2008/2009 1 2008-08-16 10000 9994 0 0 draw
2 492475 1 1 2008/2009 1 2008-08-16 9984 8635 0 3 away_win
3 492476 1 1 2008/2009 1 2008-08-17 9991 9998 5 0 home_win
4 492477 1 1 2008/2009 1 2008-08-16 7947 9985 1 3 away_win
... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 24558 24558 2015/2016 9 2015-09-22 10190 10191 1 0 home_win
25975 1992092 24558 24558 2015/2016 9 2015-09-23 9824 10199 1 2 away_win
25976 1992093 24558 24558 2015/2016 9 2015-09-23 9956 10179 2 0 home_win
25977 1992094 24558 24558 2015/2016 9 2015-09-22 7896 10243 0 0 draw
25978 1992095 24558 24558 2015/2016 9 2015-09-23 10192 9931 4 3 home_win

25979 rows × 11 columns

df_match_basic.to_csv("../data/df_match_basic.csv", index = False)

2.2. Player information

player_col = ["match_api_id", 'home_player_X1', 'home_player_X2', 'home_player_X3', 'home_player_X4', 'home_player_X5', 
              'home_player_X6', 'home_player_X7', 'home_player_X8', 'home_player_X9', 'home_player_X10', 
              'home_player_X11', 'away_player_X1', 'away_player_X2', 'away_player_X3', 'away_player_X4', 
              'away_player_X5', 'away_player_X6', 'away_player_X7', 'away_player_X8', 'away_player_X9', 
              'away_player_X10', 'away_player_X11', 'home_player_Y1', 'home_player_Y2', 'home_player_Y3', 
              'home_player_Y4', 'home_player_Y5', 'home_player_Y6', 'home_player_Y7', 'home_player_Y8', 
              'home_player_Y9', 'home_player_Y10', 'home_player_Y11', 'away_player_Y1', 'away_player_Y2', 
              'away_player_Y3', 'away_player_Y4', 'away_player_Y5', 'away_player_Y6', 'away_player_Y7', 
              'away_player_Y8', 'away_player_Y9', 'away_player_Y10', 'away_player_Y11', 'home_player_1', 
              'home_player_2', 'home_player_3', 'home_player_4', 'home_player_5', 'home_player_6', 
              'home_player_7', 'home_player_8', 'home_player_9', 'home_player_10', 'home_player_11', 
              'away_player_1', 'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5', 
              'away_player_6', 'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10', 'away_player_11']
df_match_player = org_match[player_col]
df_match_player
match_api_id home_player_X1 home_player_X2 home_player_X3 home_player_X4 home_player_X5 home_player_X6 home_player_X7 home_player_X8 home_player_X9 home_player_X10 home_player_X11 away_player_X1 away_player_X2 away_player_X3 away_player_X4 away_player_X5 away_player_X6 away_player_X7 away_player_X8 away_player_X9 away_player_X10 away_player_X11 home_player_Y1 home_player_Y2 home_player_Y3 home_player_Y4 home_player_Y5 home_player_Y6 home_player_Y7 home_player_Y8 home_player_Y9 home_player_Y10 home_player_Y11 away_player_Y1 away_player_Y2 away_player_Y3 away_player_Y4 away_player_Y5 away_player_Y6 away_player_Y7 away_player_Y8 away_player_Y9 away_player_Y10 away_player_Y11 home_player_1 home_player_2 home_player_3 home_player_4 home_player_5 home_player_6 home_player_7 home_player_8 home_player_9 home_player_10 home_player_11 away_player_1 away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11
0 492473 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 492474 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 492475 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 492476 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 492477 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 1.0 2.0 4.0 6.0 8.0 4.0 6.0 3.0 5.0 7.0 5.0 1.0 2.0 4.0 6.0 8.0 4.0 6.0 3.0 5.0 7.0 5.0 1.0 3.0 3.0 3.0 3.0 6.0 6.0 8.0 8.0 8.0 11.0 1.0 3.0 3.0 3.0 3.0 6.0 6.0 8.0 8.0 8.0 11.0 42231.0 678384.0 95220.0 638592.0 413155.0 45780.0 171229.0 67333.0 119839.0 143790.0 195215.0 462944.0 563066.0 8800.0 67304.0 158253.0 133126.0 186524.0 93223.0 121115.0 232110.0 289732.0
25975 1992092 1.0 3.0 5.0 7.0 2.0 4.0 6.0 8.0 5.0 4.0 6.0 1.0 2.0 4.0 6.0 8.0 3.0 5.0 7.0 3.0 5.0 7.0 1.0 3.0 3.0 3.0 7.0 7.0 7.0 7.0 9.0 11.0 11.0 1.0 3.0 3.0 3.0 3.0 7.0 7.0 7.0 10.0 10.0 10.0 33272.0 41621.0 25813.0 257845.0 114735.0 42237.0 113227.0 358156.0 32343.0 531309.0 37257.0 42276.0 114792.0 150007.0 178119.0 27232.0 570830.0 260708.0 201704.0 36382.0 34082.0 95257.0
25976 1992093 1.0 2.0 4.0 6.0 8.0 4.0 6.0 3.0 5.0 7.0 5.0 1.0 2.0 4.0 6.0 8.0 4.0 6.0 3.0 5.0 7.0 5.0 1.0 3.0 3.0 3.0 3.0 6.0 6.0 8.0 8.0 8.0 11.0 1.0 3.0 3.0 3.0 3.0 6.0 6.0 8.0 8.0 8.0 11.0 157856.0 274779.0 177689.0 294256.0 42258.0 39979.0 173936.0 147959.0 451983.0 80564.0 289472.0 10637.0 67349.0 202663.0 32597.0 114794.0 188114.0 25840.0 482200.0 95230.0 451335.0 275122.0
25977 1992094 1.0 2.0 4.0 6.0 8.0 3.0 7.0 5.0 3.0 5.0 7.0 1.0 2.0 4.0 6.0 8.0 4.0 6.0 3.0 5.0 7.0 5.0 1.0 3.0 3.0 3.0 3.0 7.0 7.0 7.0 10.0 10.0 10.0 1.0 3.0 3.0 3.0 3.0 6.0 6.0 8.0 8.0 8.0 11.0 NaN 8881.0 173534.0 39646.0 282287.0 340790.0 393337.0 8893.0 614454.0 93229.0 178142.0 274776.0 121080.0 197757.0 260964.0 231614.0 113235.0 41116.0 462608.0 42262.0 92252.0 194532.0
25978 1992095 1.0 2.0 4.0 6.0 8.0 2.0 4.0 6.0 8.0 4.0 6.0 1.0 2.0 4.0 6.0 8.0 4.0 6.0 3.0 5.0 7.0 5.0 1.0 3.0 3.0 3.0 3.0 7.0 7.0 7.0 7.0 10.0 10.0 1.0 3.0 3.0 3.0 3.0 6.0 6.0 8.0 8.0 8.0 11.0 274787.0 492132.0 108451.0 25815.0 94553.0 384376.0 598355.0 36785.0 45174.0 302079.0 71764.0 156175.0 95216.0 172768.0 22834.0 458806.0 207234.0 25772.0 40274.0 34035.0 41726.0 527103.0

25979 rows × 67 columns

df_match_player.to_csv("../data/df_match_player.csv", index = False)

2.3. Betting information

betting_col = ["match_api_id", 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 
               'LBD', 'LBA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 
               'SJA', 'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA']
df_match_betting = org_match[betting_col]
df_match_betting
match_api_id B365H B365D B365A BWH BWD BWA IWH IWD IWA LBH LBD LBA PSH PSD PSA WHH WHD WHA SJH SJD SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 492473 1.73 3.40 5.00 1.75 3.35 4.20 1.85 3.2 3.5 1.80 3.3 3.75 NaN NaN NaN 1.70 3.30 4.33 1.90 3.3 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 492474 1.95 3.20 3.60 1.80 3.30 3.95 1.90 3.2 3.5 1.90 3.2 3.50 NaN NaN NaN 1.83 3.30 3.60 1.95 3.3 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60
2 492475 2.38 3.30 2.75 2.40 3.30 2.55 2.60 3.1 2.3 2.50 3.2 2.50 NaN NaN NaN 2.50 3.25 2.40 2.63 3.3 2.50 2.35 3.25 2.65 2.50 3.20 2.50 2.30 3.20 2.75
3 492476 1.44 3.75 7.50 1.40 4.00 6.80 1.40 3.9 6.0 1.44 3.6 6.50 NaN NaN NaN 1.44 3.75 6.00 1.44 4.0 7.50 1.45 3.75 6.50 1.50 3.75 5.50 1.44 3.75 6.50
4 492477 5.00 3.50 1.65 5.00 3.50 1.60 4.00 3.3 1.7 4.00 3.4 1.72 NaN NaN NaN 4.20 3.40 1.70 4.50 3.5 1.73 4.50 3.40 1.65 4.50 3.50 1.65 4.75 3.30 1.67
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25975 1992092 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25976 1992093 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25977 1992094 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25978 1992095 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

25979 rows × 31 columns

df_match_betting.to_csv("../data/df_match_betting.csv", index = False)

2.4. In-game information

in_game_col = ["match_api_id", 'goal', "home_team_goal", "away_team_goal", 'shoton', 
               'shotoff', 'foulcommit', 'card', 'cross', 'corner', 
               'possession']

df_match_ingame = org_match[in_game_col]
df_match_ingame
match_api_id goal home_team_goal away_team_goal shoton shotoff foulcommit card cross corner possession
0 492473 None 1 1 None None None None None None None
1 492474 None 0 0 None None None None None None None
2 492475 None 0 3 None None None None None None None
3 492476 None 5 0 None None None None None None None
4 492477 None 1 3 None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 None 1 0 None None None None None None None
25975 1992092 None 1 2 None None None None None None None
25976 1992093 None 2 0 None None None None None None None
25977 1992094 None 0 0 None None None None None None None
25978 1992095 None 4 3 None None None None None None None

25979 rows × 11 columns

df_match_ingame[~df_match_ingame.goal.isna()]
match_api_id goal home_team_goal away_team_goal shoton shotoff foulcommit card cross corner possession
1728 489042 <goal><value><comment>n</comment><stats><goals... 1 1 <shoton><value><stats><blocked>1</blocked></st... <shotoff><value><stats><shotoff>1</shotoff></s... <foulcommit><value><stats><foulscommitted>1</f... <card><value><comment>y</comment><stats><ycard... <cross><value><stats><crosses>1</crosses></sta... <corner><value><stats><corners>1</corners></st... <possession><value><comment>56</comment><event...
1729 489043 <goal><value><comment>n</comment><stats><goals... 1 0 <shoton><value><stats><blocked>1</blocked></st... <shotoff><value><stats><shotoff>1</shotoff></s... <foulcommit><value><stats><foulscommitted>1</f... <card /> <cross><value><stats><crosses>1</crosses></sta... <corner><value><stats><corners>1</corners></st... <possession><value><comment>65</comment><event...
1730 489044 <goal><value><comment>n</comment><stats><goals... 0 1 <shoton><value><stats><blocked>1</blocked></st... <shotoff><value><stats><shotoff>1</shotoff></s... <foulcommit><value><stats><foulscommitted>1</f... <card><value><comment>y</comment><stats><ycard... <cross><value><stats><crosses>1</crosses></sta... <corner><value><stats><corners>1</corners></st... <possession><value><comment>45</comment><event...
1731 489045 <goal><value><comment>n</comment><stats><goals... 2 1 <shoton><value><stats><shoton>1</shoton></stat... <shotoff><value><stats><shotoff>1</shotoff></s... <foulcommit><value><stats><foulscommitted>1</f... <card><value><comment>y</comment><stats><ycard... <cross><value><stats><crosses>1</crosses></sta... <corner><value><stats><corners>1</corners></st... <possession><value><comment>50</comment><event...
1732 489046 <goal><value><comment>n</comment><stats><goals... 4 2 <shoton><value><stats><blocked>1</blocked></st... <shotoff><value><stats><shotoff>1</shotoff></s... <foulcommit><value><stats><foulscommitted>1</f... <card><value><comment>y</comment><stats><ycard... <cross><value><stats><corners>1</corners></sta... <corner><value><stats><corners>1</corners></st... <possession><value><comment>51</comment><event...
... ... ... ... ... ... ... ... ... ... ... ...
25944 1992225 <goal><value><comment>o</comment><stats><owngo... 0 1 <shoton /> <shotoff /> <foulcommit /> <card><value><comment>y</comment><stats><ycard... <cross /> <corner /> <possession />
25945 1992226 <goal><value><comment>n</comment><stats><goals... 3 0 <shoton /> <shotoff /> <foulcommit /> <card><value><comment>y</comment><stats><ycard... <cross /> <corner /> <possession />
25946 1992227 <goal><value><comment>n</comment><stats><goals... 2 2 <shoton /> <shotoff /> <foulcommit /> <card><value><comment>y</comment><stats><ycard... <cross /> <corner /> <possession />
25947 1992228 <goal><value><comment>n</comment><stats><goals... 0 3 <shoton /> <shotoff /> <foulcommit /> <card><value><comment>y</comment><stats><ycard... <cross /> <corner /> <possession />
25948 1992229 <goal><value><comment>n</comment><stats><goals... 3 1 <shoton /> <shotoff /> <foulcommit /> <card><value><comment>y</comment><stats><ycard... <cross /> <corner /> <possession />

14217 rows × 11 columns

  • in-game related columns have xml format.
  • So we need to extract information from the xml data.
  • Here is the information that we can extract from the in-game related columns: https://docs.google.com/spreadsheets/d/1EIwJc1YxI5Uf1-gNGdjgmreI0JLSSicH/edit?usp=sharing&ouid=103945379872439212801&rtpof=true&sd=true

  • Extract the xml data from the goal column.
df_match_ingame_goal = pd.DataFrame(columns = ["match_api_id", "goal_id", "goal_elapsed", "goal_team_api_id", "goal_player1_id", "goal_player2_id",
                                               "goal_type", "goal_subtype"])

for match_api_id in df_match_ingame[~df_match_ingame.goal.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].goal.values[0]
    
    goal_id = []
    goal_elapsed = []
    goal_team_id = []
    goal_player1_id = []
    goal_palyer2_id = []
    goal_type = []
    goal_subtype = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": goal_id.append(child.text)
            elif child.tag == "elapsed": goal_elapsed.append(child.text)
            elif child.tag == "team": goal_team_id.append(child.text)
            elif child.tag == "player1": goal_player1_id.append(child.text)
            elif child.tag == "player2": goal_palyer2_id.append(child.text)
            elif child.tag == "goal_type": goal_type.append(child.text)
            elif child.tag == "subtype": goal_subtype.append(child.text)
            
        if len(goal_id) != num_elmt: goal_id.append(np.nan)
        if len(goal_elapsed) != num_elmt: goal_elapsed.append(np.nan)
        if len(goal_team_id) != num_elmt: goal_team_id.append(np.nan)
        if len(goal_player1_id) != num_elmt: goal_player1_id.append(np.nan)
        if len(goal_palyer2_id) != num_elmt: goal_palyer2_id.append(np.nan)
        if len(goal_type) != num_elmt: goal_type.append(np.nan)
        if len(goal_subtype) != num_elmt: goal_subtype.append(np.nan)
        
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(goal_id),
                         "goal_id": goal_id,
                         "goal_elapsed": goal_elapsed,
                         "goal_team_api_id": goal_team_id,
                         "goal_player1_id": goal_player1_id,
                         "goal_player2_id": goal_palyer2_id,
                         "goal_type": goal_type,
                         "goal_subtype": goal_subtype})
    
    df_match_ingame_goal = pd.concat([df_match_ingame_goal, temp])
    
df_match_ingame_goal
match_api_id goal_id goal_elapsed goal_team_api_id goal_player1_id goal_player2_id goal_type goal_subtype
0 489042 378998 22 10261 37799 38807 n header
1 489042 379019 24 10260 24148 24154 n shot
0 489043 375546 4 9825 26181 39297 n shot
0 489044 378041 83 8650 30853 30889 n distance
0 489045 376060 4 8654 23139 36394 n shot
... ... ... ... ... ... ... ... ...
2 1992228 5640015 71 10192 37554 NaN n NaN
0 1992229 5639993 58 9824 493418 NaN n NaN
1 1992229 5640008 67 10243 197757 NaN n NaN
2 1992229 5640010 69 10243 198082 NaN n NaN
3 1992229 5640020 76 10243 121080 NaN n NaN

39980 rows × 8 columns

  • Extract the xml data from the shoton and shotoff columns.
df_match_ingame_shoton = pd.DataFrame(columns = ["match_api_id", "shot_id", "shot_elapsed", "shot_type", "shot_team_api_id", "shot_player1_id"])

for match_api_id in df_match_ingame[~df_match_ingame.shoton.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].shoton.values[0]
    
    shot_id = []
    shot_elapsed = []
    shot_type = []
    shot_team_id = []
    shot_player1_id = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": shot_id.append(child.text)
            elif child.tag == "elapsed": shot_elapsed.append(child.text)
            elif child.tag == "team": shot_team_id.append(child.text)
            elif child.tag == "player1": shot_player1_id.append(child.text)
            elif child.tag == "type": shot_type.append(child.text)
             
        if len(shot_id) != num_elmt: shot_id.append(np.nan)
        if len(shot_elapsed) != num_elmt: shot_elapsed.append(np.nan)
        if len(shot_team_id) != num_elmt: shot_team_id.append(np.nan)
        if len(shot_player1_id) != num_elmt: shot_player1_id.append(np.nan)
        if len(shot_type) != num_elmt: shot_type.append(np.nan)
        
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(shot_id),
                         "shot_id": shot_id,
                         "shot_elapsed": shot_elapsed,
                         "shot_type": shot_type,
                         "shot_team_api_id": shot_team_id,
                         "shot_player1_id": shot_player1_id})
    
    df_match_ingame_shoton = pd.concat([df_match_ingame_shoton, temp])
df_match_ingame_shoton
match_api_id shot_id shot_elapsed shot_type shot_team_api_id shot_player1_id
0 489042 378828 3 shoton 10260 24154
1 489042 378866 7 shoton 10260 24157
2 489042 378922 14 shoton 10260 30829
3 489042 378923 14 shoton 10260 30373
4 489042 378951 17 shoton 10260 30373
... ... ... ... ... ... ...
6 2030171 4940770 53 shoton 8370 256727
7 2030171 4940869 60 shoton 8558 41622
8 2030171 4940889 62 shoton 8370 482331
9 2030171 4940940 68 shoton 8370 210065
10 2030171 4941144 90 shoton 8558 452243

93755 rows × 6 columns

df_match_ingame_shotoff = pd.DataFrame(columns = ["match_api_id", "shot_id", "shot_elapsed", "shot_type", "shot_team_api_id", "shot_player1_id"])

for match_api_id in df_match_ingame[~df_match_ingame.shotoff.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].shotoff.values[0]
    
    shot_id = []
    shot_elapsed = []
    shot_type = []
    shot_team_id = []
    shot_player1_id = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": shot_id.append(child.text)
            elif child.tag == "elapsed": shot_elapsed.append(child.text)
            elif child.tag == "team": shot_team_id.append(child.text)
            elif child.tag == "player1": shot_player1_id.append(child.text)
            elif child.tag == "type": shot_type.append(child.text)
             
        if len(shot_id) != num_elmt: shot_id.append(np.nan)
        if len(shot_elapsed) != num_elmt: shot_elapsed.append(np.nan)
        if len(shot_team_id) != num_elmt: shot_team_id.append(np.nan)
        if len(shot_player1_id) != num_elmt: shot_player1_id.append(np.nan)
        if len(shot_type) != num_elmt: shot_type.append(np.nan)
        
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(shot_id),
                         "shot_id": shot_id,
                         "shot_elapsed": shot_elapsed,
                         "shot_type": shot_type,
                         "shot_team_api_id": shot_team_id,
                         "shot_player1_id": shot_player1_id})
    
    df_match_ingame_shotoff = pd.concat([df_match_ingame_shotoff, temp])
df_match_ingame_shotoff
match_api_id shot_id shot_elapsed shot_type shot_team_api_id shot_player1_id
0 489042 378835 4 shotoff 10260 30373
1 489042 378845 5 shotoff 10261 37799
2 489042 378995 22 shotoff 10261 24228
3 489042 379075 31 shotoff 10261 38807
4 489042 379111 34 shotoff 10260 24154
... ... ... ... ... ... ...
1 2030171 4940379 19 shotoff 8370 36130
2 2030171 4940624 44 shotoff 8370 34104
3 2030171 4940738 49 shotoff 8558 107930
4 2030171 4940963 71 shotoff 8370 210065
5 2030171 4941072 83 shotoff 8558 629579

95303 rows × 6 columns

  • Merge the shoton and shotoff tables because they have same columns.
df_match_ingame_shot = pd.concat([df_match_ingame_shoton, df_match_ingame_shotoff])
df_match_ingame_shot
match_api_id shot_id shot_elapsed shot_type shot_team_api_id shot_player1_id
0 489042 378828 3 shoton 10260 24154
1 489042 378866 7 shoton 10260 24157
2 489042 378922 14 shoton 10260 30829
3 489042 378923 14 shoton 10260 30373
4 489042 378951 17 shoton 10260 30373
... ... ... ... ... ... ...
1 2030171 4940379 19 shotoff 8370 36130
2 2030171 4940624 44 shotoff 8370 34104
3 2030171 4940738 49 shotoff 8558 107930
4 2030171 4940963 71 shotoff 8370 210065
5 2030171 4941072 83 shotoff 8558 629579

189058 rows × 6 columns

len(df_match_ingame_shot.shot_id.unique())
189058
len(df_match_ingame_goal.goal_id.unique())
39980
len(pd.concat([df_match_ingame_goal.goal_id, df_match_ingame_shot.shot_id]).unique())
229038
  • Since goal and shot information are similar, let’s combine two tables.
df_match_ingame_goal
match_api_id goal_id goal_elapsed goal_team_api_id goal_player1_id goal_player2_id goal_type goal_subtype
0 489042 378998 22 10261 37799 38807 n header
1 489042 379019 24 10260 24148 24154 n shot
0 489043 375546 4 9825 26181 39297 n shot
0 489044 378041 83 8650 30853 30889 n distance
0 489045 376060 4 8654 23139 36394 n shot
... ... ... ... ... ... ... ... ...
2 1992228 5640015 71 10192 37554 NaN n NaN
0 1992229 5639993 58 9824 493418 NaN n NaN
1 1992229 5640008 67 10243 197757 NaN n NaN
2 1992229 5640010 69 10243 198082 NaN n NaN
3 1992229 5640020 76 10243 121080 NaN n NaN

39980 rows × 8 columns

df_match_ingame_goal = df_match_ingame_goal.rename(columns = {"goal_id": "event_id", 
                                                              "goal_elapsed": "elapsed", 
                                                              "goal_team_api_id": "team_api_id",
                                                              "goal_player1_id": "player1_api_id",
                                                              "goal_player2_id": "player2_api_id",
                                                              "goal_type": "type", 
                                                              "goal_subtype": "subtype"})
df_match_ingame_goal["category"] = "goal"
df_match_ingame_goal= df_match_ingame_goal[["match_api_id", "event_id", "elapsed", "team_api_id", "category", 
                                            "type", "subtype", "player1_api_id", "player2_api_id"]]
df_match_ingame_goal
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id
0 489042 378998 22 10261 goal n header 37799 38807
1 489042 379019 24 10260 goal n shot 24148 24154
0 489043 375546 4 9825 goal n shot 26181 39297
0 489044 378041 83 8650 goal n distance 30853 30889
0 489045 376060 4 8654 goal n shot 23139 36394
... ... ... ... ... ... ... ... ... ...
2 1992228 5640015 71 10192 goal n NaN 37554 NaN
0 1992229 5639993 58 9824 goal n NaN 493418 NaN
1 1992229 5640008 67 10243 goal n NaN 197757 NaN
2 1992229 5640010 69 10243 goal n NaN 198082 NaN
3 1992229 5640020 76 10243 goal n NaN 121080 NaN

39980 rows × 9 columns

df_match_ingame_shot = df_match_ingame_shot.rename(columns = {"shot_id": "event_id", 
                                                              "shot_elapsed": "elapsed", 
                                                              "shot_team_api_id": "team_api_id",
                                                              "shot_player1_id": "player1_api_id",
                                                              "shot_type": "type"})
df_match_ingame_shot["category"] = "shot"
df_match_ingame_shot["subtype"] = np.nan 
df_match_ingame_shot["player2_api_id"] = np.nan 


df_match_ingame_shot= df_match_ingame_shot[["match_api_id", "event_id", "elapsed", "team_api_id", "category", 
                                            "type", "subtype", "player1_api_id", "player2_api_id"]]
df_match_ingame_shot
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id
0 489042 378828 3 10260 shot shoton NaN 24154 NaN
1 489042 378866 7 10260 shot shoton NaN 24157 NaN
2 489042 378922 14 10260 shot shoton NaN 30829 NaN
3 489042 378923 14 10260 shot shoton NaN 30373 NaN
4 489042 378951 17 10260 shot shoton NaN 30373 NaN
... ... ... ... ... ... ... ... ... ...
1 2030171 4940379 19 8370 shot shotoff NaN 36130 NaN
2 2030171 4940624 44 8370 shot shotoff NaN 34104 NaN
3 2030171 4940738 49 8558 shot shotoff NaN 107930 NaN
4 2030171 4940963 71 8370 shot shotoff NaN 210065 NaN
5 2030171 4941072 83 8558 shot shotoff NaN 629579 NaN

189058 rows × 9 columns

df_match_ingame_shot = pd.concat([df_match_ingame_goal, df_match_ingame_shot])
df_match_ingame_shot
match_api_id event_id elapsed team_api_id category type subtype player1_api_id player2_api_id
0 489042 378998 22 10261 goal n header 37799 38807
1 489042 379019 24 10260 goal n shot 24148 24154
0 489043 375546 4 9825 goal n shot 26181 39297
0 489044 378041 83 8650 goal n distance 30853 30889
0 489045 376060 4 8654 goal n shot 23139 36394
... ... ... ... ... ... ... ... ... ...
1 2030171 4940379 19 8370 shot shotoff NaN 36130 NaN
2 2030171 4940624 44 8370 shot shotoff NaN 34104 NaN
3 2030171 4940738 49 8558 shot shotoff NaN 107930 NaN
4 2030171 4940963 71 8370 shot shotoff NaN 210065 NaN
5 2030171 4941072 83 8558 shot shotoff NaN 629579 NaN

229038 rows × 9 columns

df_match_ingame_shot.to_csv("../data/df_match_ingame_shot.csv", index = False)
  • Extract the xml data from the foulcommit and card columns.
df_match_ingame_foul = pd.DataFrame(columns = ["match_api_id", "foul_id", "foul_elapsed", "foul_type", "foul_team_api_id", 
                                               "foul_player1_id", "foul_player2_id"])

for match_api_id in df_match_ingame[~df_match_ingame.foulcommit.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].foulcommit.values[0]
    
    foul_id = []
    foul_elapsed = []
    foul_type = []
    foul_team_id = []
    foul_player1_id = []
    foul_player2_id = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": foul_id.append(child.text)
            elif child.tag == "elapsed": foul_elapsed.append(child.text)
            elif child.tag == "type": foul_type.append(child.text)
            elif child.tag == "team": foul_team_id.append(child.text)
            elif child.tag == "player1": foul_player1_id.append(child.text)
            elif child.tag == "player2": foul_player2_id.append(child.text)
            
             
        if len(foul_id) != num_elmt: foul_id.append(np.nan)
        if len(foul_elapsed) != num_elmt: foul_elapsed.append(np.nan)
        if len(foul_type) != num_elmt: foul_type.append(np.nan)
        if len(foul_team_id) != num_elmt: foul_team_id.append(np.nan)
        if len(foul_player1_id) != num_elmt: foul_player1_id.append(np.nan)
        if len(foul_player2_id) != num_elmt: foul_player2_id.append(np.nan)
    
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(foul_id),
                         "foul_id": foul_id,
                         "foul_elapsed": foul_elapsed,
                         "foul_type": foul_type,
                         "foul_team_api_id": foul_team_id,
                         "foul_player1_id": foul_player1_id,
                         "foul_player2_id": foul_player2_id,})
    
    df_match_ingame_foul = pd.concat([df_match_ingame_foul, temp])
df_match_ingame_foul
match_api_id foul_id foul_elapsed foul_type foul_team_api_id foul_player1_id foul_player2_id
0 489042 378824 1 foulcommit 10261 25518 32569
1 489042 378826 2 foulcommit 10261 30929 24157
2 489042 378830 3 foulcommit 10261 29581 24148
3 489042 378841 5 foulcommit 10260 30373 40565
4 489042 378894 10 foulcommit 10261 29581 30829
... ... ... ... ... ... ... ...
21 2030171 4941024 77 foulcommit 8558 213814 40220
22 2030171 4941052 79 foulcommit 8558 452243 75004
23 2030171 4941084 81 foulcommit 8558 629579 75004
24 2030171 4941106 88 foulcommit 8370 210065 37654
25 2030171 4941145 90 foulcommit 8370 183449 NaN

219234 rows × 7 columns

df_match_ingame_card = pd.DataFrame(columns = ["match_api_id", "card_id", "card_elapsed", "card_type", "card_team_api_id", "card_player1_id", "card_player2_id"])

for match_api_id in df_match_ingame[~df_match_ingame.card.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].card.values[0]
    
    card_id = []
    card_elapsed = []
    card_type = []
    card_team_id = []
    card_player1_id = []
    card_player2_id = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": card_id.append(child.text)
            elif child.tag == "elapsed": card_elapsed.append(child.text)
            elif child.tag == "card_type": card_type.append(child.text)
            elif child.tag == "team": card_team_id.append(child.text)
            elif child.tag == "player1": card_player1_id.append(child.text)
            elif child.tag == "player2": card_player2_id.append(child.text)
            
             
        if len(card_id) != num_elmt: card_id.append(np.nan)
        if len(card_elapsed) != num_elmt: card_elapsed.append(np.nan)
        if len(card_type) != num_elmt: card_type.append(np.nan)
        if len(card_team_id) != num_elmt: card_team_id.append(np.nan)
        if len(card_player1_id) != num_elmt: card_player1_id.append(np.nan)
        if len(card_player2_id) != num_elmt: card_player2_id.append(np.nan)
    
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(card_id),
                         "card_id": card_id,
                         "card_elapsed": card_elapsed,
                         "card_type": card_type,
                         "card_team_api_id": card_team_id,
                         "card_player1_id": card_player1_id,
                         "card_player2_id": card_player2_id,})
    
    df_match_ingame_card = pd.concat([df_match_ingame_card, temp])
df_match_ingame_card
match_api_id card_id card_elapsed card_type card_team_api_id card_player1_id card_player2_id
0 489042 379481 78 y 10260 24157 NaN
1 489042 379503 82 y 10260 30362 NaN
2 489042 379547 90 y 10260 30829 NaN
0 489044 377978 56 y 8650 37442 NaN
1 489044 378060 90 y 8650 46621 NaN
... ... ... ... ... ... ... ...
2 1992228 5640034 82 y 10192 25815 NaN
0 1992229 5639986 52 y 10243 198082 NaN
1 1992229 5639991 56 y 10243 121080 NaN
2 1992229 5639996 60 y 9824 214344 NaN
3 1992229 5639999 61 y 9824 294256 NaN

62114 rows × 7 columns

  • Since foul and card have similar information, let’s combine two tables.
df_match_ingame_foul = df_match_ingame_foul.rename(columns = {"foul_id": "event_id", 
                                                              "foul_elapsed": "elapsed",
                                                              "foul_team_api_id": "team_api_id", 
                                                              "foul_player1_id": "player1_api_id",
                                                              "foul_player2_id": "player2_api_id"})
df_match_ingame_foul
match_api_id event_id elapsed foul_type team_api_id player1_api_id player2_api_id
0 489042 378824 1 foulcommit 10261 25518 32569
1 489042 378826 2 foulcommit 10261 30929 24157
2 489042 378830 3 foulcommit 10261 29581 24148
3 489042 378841 5 foulcommit 10260 30373 40565
4 489042 378894 10 foulcommit 10261 29581 30829
... ... ... ... ... ... ... ...
21 2030171 4941024 77 foulcommit 8558 213814 40220
22 2030171 4941052 79 foulcommit 8558 452243 75004
23 2030171 4941084 81 foulcommit 8558 629579 75004
24 2030171 4941106 88 foulcommit 8370 210065 37654
25 2030171 4941145 90 foulcommit 8370 183449 NaN

219234 rows × 7 columns

df_match_ingame_card
match_api_id card_id card_elapsed card_type card_team_api_id card_player1_id card_player2_id
0 489042 379481 78 y 10260 24157 NaN
1 489042 379503 82 y 10260 30362 NaN
2 489042 379547 90 y 10260 30829 NaN
0 489044 377978 56 y 8650 37442 NaN
1 489044 378060 90 y 8650 46621 NaN
... ... ... ... ... ... ... ...
2 1992228 5640034 82 y 10192 25815 NaN
0 1992229 5639986 52 y 10243 198082 NaN
1 1992229 5639991 56 y 10243 121080 NaN
2 1992229 5639996 60 y 9824 214344 NaN
3 1992229 5639999 61 y 9824 294256 NaN

62114 rows × 7 columns

df_match_ingame_card = df_match_ingame_card.rename(columns = {"card_id": "event_id", 
                                                              "card_elapsed": "elapsed",
                                                              "card_type": "foul_type",
                                                              "card_team_api_id": "team_api_id", 
                                                              "card_player1_id": "player1_api_id",
                                                              "card_player2_id": "player2_api_id"})
df_match_ingame_card
match_api_id event_id elapsed foul_type team_api_id player1_api_id player2_api_id
0 489042 379481 78 y 10260 24157 NaN
1 489042 379503 82 y 10260 30362 NaN
2 489042 379547 90 y 10260 30829 NaN
0 489044 377978 56 y 8650 37442 NaN
1 489044 378060 90 y 8650 46621 NaN
... ... ... ... ... ... ... ...
2 1992228 5640034 82 y 10192 25815 NaN
0 1992229 5639986 52 y 10243 198082 NaN
1 1992229 5639991 56 y 10243 121080 NaN
2 1992229 5639996 60 y 9824 214344 NaN
3 1992229 5639999 61 y 9824 294256 NaN

62114 rows × 7 columns

df_match_ingame_foul = pd.concat([df_match_ingame_foul, df_match_ingame_card])
df_match_ingame_foul
match_api_id event_id elapsed foul_type team_api_id player1_api_id player2_api_id
0 489042 378824 1 foulcommit 10261 25518 32569
1 489042 378826 2 foulcommit 10261 30929 24157
2 489042 378830 3 foulcommit 10261 29581 24148
3 489042 378841 5 foulcommit 10260 30373 40565
4 489042 378894 10 foulcommit 10261 29581 30829
... ... ... ... ... ... ... ...
2 1992228 5640034 82 y 10192 25815 NaN
0 1992229 5639986 52 y 10243 198082 NaN
1 1992229 5639991 56 y 10243 121080 NaN
2 1992229 5639996 60 y 9824 214344 NaN
3 1992229 5639999 61 y 9824 294256 NaN

281348 rows × 7 columns

df_match_ingame_foul.to_csv("../data/df_match_ingame_foul.csv", index = False)
  • Extract the xml data from the cross and corner columns.
df_match_ingame_cross = pd.DataFrame(columns = ["match_api_id", "cross_id", "cross_elapsed", "cross_type", "cross_team_api_id", "cross_player1_id"])

for match_api_id in df_match_ingame[~df_match_ingame.cross.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].cross.values[0]
    
    cross_id = []
    cross_elapsed = []
    cross_type = []
    cross_team_id = []
    cross_player1_id = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": cross_id.append(child.text)
            elif child.tag == "elapsed": cross_elapsed.append(child.text)
            elif child.tag == "type": cross_type.append(child.text)
            elif child.tag == "team": cross_team_id.append(child.text)
            elif child.tag == "player1": cross_player1_id.append(child.text)
            
             
        if len(cross_id) != num_elmt: cross_id.append(np.nan)
        if len(cross_elapsed) != num_elmt: cross_elapsed.append(np.nan)
        if len(cross_type) != num_elmt: cross_type.append(np.nan)
        if len(cross_team_id) != num_elmt: cross_team_id.append(np.nan)
        if len(cross_player1_id) != num_elmt: cross_player1_id.append(np.nan)
    
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(cross_id),
                         "cross_id": cross_id,
                         "cross_elapsed": cross_elapsed,
                         "cross_type": cross_type,
                         "cross_team_api_id": cross_team_id,
                         "cross_player1_id": cross_player1_id})
    
    df_match_ingame_cross = pd.concat([df_match_ingame_cross, temp])
df_match_ingame_cross
match_api_id cross_id cross_elapsed cross_type cross_team_api_id cross_player1_id
0 489042 378863 7 cross 10260 30829
1 489042 378921 14 cross 10260 24148
2 489042 378960 19 corner 10261 38807
3 489042 378970 19 cross 10261 30929
4 489042 378978 20 cross 10260 24148
... ... ... ... ... ... ...
23 2030171 4940945 69 cross 8370 210065
24 2030171 4940996 74 cross 8370 46808
25 2030171 4941050 80 cross 8370 210065
26 2030171 4941089 85 cross 8370 40220
27 2030171 4941100 87 cross 8370 40220

284775 rows × 6 columns

df_match_ingame_cross.cross_type.unique()
array(['cross', 'corner', 'throwin'], dtype=object)
  • It seems that cross information alread include the corner information.
df_match_ingame_corner = pd.DataFrame(columns = ["match_api_id", "corner_id", "corner_elapsed", "corner_type", "corner_team_api_id", "corner_player1_id"])

for match_api_id in df_match_ingame[~df_match_ingame.corner.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].corner.values[0]
    
    corner_id = []
    corner_elapsed = []
    corner_type = []
    corner_team_id = []
    corner_player1_id = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": corner_id.append(child.text)
            elif child.tag == "elapsed": corner_elapsed.append(child.text)
            elif child.tag == "type": corner_type.append(child.text)
            elif child.tag == "team": corner_team_id.append(child.text)
            elif child.tag == "player1": corner_player1_id.append(child.text)
            
             
        if len(corner_id) != num_elmt: corner_id.append(np.nan)
        if len(corner_elapsed) != num_elmt: corner_elapsed.append(np.nan)
        if len(corner_type) != num_elmt: corner_type.append(np.nan)
        if len(corner_team_id) != num_elmt: corner_team_id.append(np.nan)
        if len(corner_player1_id) != num_elmt: corner_player1_id.append(np.nan)
    
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(corner_id),
                         "corner_id": corner_id,
                         "corner_elapsed": corner_elapsed,
                         "corner_type": corner_type,
                         "corner_team_api_id": corner_team_id,
                         "corner_player1_id": corner_player1_id})
    
    df_match_ingame_corner = pd.concat([df_match_ingame_corner, temp])
df_match_ingame_corner
match_api_id corner_id corner_elapsed corner_type corner_team_api_id corner_player1_id
0 489042 378960 19 corner 10261 38807
1 489042 378992 22 corner 10261 40565
2 489042 378996 22 corner 10261 38807
3 489042 379352 48 corner 10260 24154
4 489042 379366 51 corner 10261 38807
... ... ... ... ... ... ...
6 2030171 4940619 44 corner 8370 36130
7 2030171 4940734 49 corner 8558 498033
8 2030171 4940742 50 corner 8558 498033
9 2030171 4940990 73 corner 8558 498033
10 2030171 4941137 90 corner 8370 75004

87839 rows × 6 columns

temp = df_match_ingame_cross.merge(df_match_ingame_corner.rename(columns = {"corner_elapsed": "cross_elapsed"}), how = "left", on = ["match_api_id", "cross_elapsed"])
temp
match_api_id cross_id cross_elapsed cross_type cross_team_api_id cross_player1_id corner_id corner_type corner_team_api_id corner_player1_id
0 489042 378863 7 cross 10260 30829 NaN NaN NaN NaN
1 489042 378921 14 cross 10260 24148 NaN NaN NaN NaN
2 489042 378960 19 corner 10261 38807 378960 corner 10261 38807
3 489042 378970 19 cross 10261 30929 378960 corner 10261 38807
4 489042 378978 20 cross 10260 24148 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
296857 2030171 4940945 69 cross 8370 210065 NaN NaN NaN NaN
296858 2030171 4940996 74 cross 8370 46808 NaN NaN NaN NaN
296859 2030171 4941050 80 cross 8370 210065 NaN NaN NaN NaN
296860 2030171 4941089 85 cross 8370 40220 NaN NaN NaN NaN
296861 2030171 4941100 87 cross 8370 40220 NaN NaN NaN NaN

296862 rows × 10 columns

len(df_match_ingame_corner.corner_id.unique())
87839
len(temp.corner_id.unique())
63736
  • 63736 corner information out of 87839 corner information alread included in the cross information.
  • So we will only use the cross information combined with corner information that are not in the cross information.
temp = df_match_ingame_corner[~df_match_ingame_corner.corner_id.isin(df_match_ingame_cross.cross_id)]
temp.columns = df_match_ingame_cross.columns 
temp
match_api_id cross_id cross_elapsed cross_type cross_team_api_id cross_player1_id
1 489042 378992 22 corner 10261 40565
6 489042 379411 59 corner 10260 24148
0 489043 375535 3 corner 9825 39297
2 489043 375562 9 corner 9825 30960
6 489045 376655 40 corner 8528 35327
... ... ... ... ... ... ...
6 2030171 4940619 44 corner 8370 36130
7 2030171 4940734 49 corner 8558 498033
8 2030171 4940742 50 corner 8558 498033
9 2030171 4940990 73 corner 8558 498033
10 2030171 4941137 90 corner 8370 75004

45588 rows × 6 columns

df_match_ingame_cross = pd.concat([df_match_ingame_cross, temp])
df_match_ingame_cross = df_match_ingame_cross.rename(columns = {"cross_id": "event_id", 
                                                                "cross_elapsed": "elapsed", 
                                                                "cross_team_api_id": "team_api_id", 
                                                                "cross_player1_id": "player1_api_id"})
df_match_ingame_cross
match_api_id event_id elapsed cross_type team_api_id player1_api_id
0 489042 378863 7 cross 10260 30829
1 489042 378921 14 cross 10260 24148
2 489042 378960 19 corner 10261 38807
3 489042 378970 19 cross 10261 30929
4 489042 378978 20 cross 10260 24148
... ... ... ... ... ... ...
6 2030171 4940619 44 corner 8370 36130
7 2030171 4940734 49 corner 8558 498033
8 2030171 4940742 50 corner 8558 498033
9 2030171 4940990 73 corner 8558 498033
10 2030171 4941137 90 corner 8370 75004

330363 rows × 6 columns

df_match_ingame_cross.to_csv("../data/df_match_ingame_cross.csv", index = False)
  • Extract the xml data from the possesion column.
df_match_ingame_possesion = pd.DataFrame(columns = ["match_api_id", "pos_id", "pos_elapsed", "pos_home", "pos_away"])

for match_api_id in df_match_ingame[~df_match_ingame.possession.isna()].match_api_id.unique():
    xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].possession.values[0]
    
    pos_id = []
    pos_elapsed = []
    pos_home = [] 
    pos_away = []
    num_elmt = 0
    
    for item in ET.fromstring(xml).findall(".value"):
        num_elmt += 1
        
        for child in item:
            if child.tag == "id": pos_id.append(child.text)
            elif child.tag == "elapsed": pos_elapsed.append(child.text)
            elif child.tag == "homepos": pos_home.append(child.text)
            elif child.tag == "awaypos": pos_away.append(child.text)
            
            
             
        if len(pos_id) != num_elmt: pos_id.append(np.nan)
        if len(pos_elapsed) != num_elmt: pos_elapsed.append(np.nan)
        if len(pos_home) != num_elmt: pos_home.append(np.nan)
        if len(pos_away) != num_elmt: pos_away.append(np.nan)
    
    temp = pd.DataFrame({"match_api_id": [match_api_id] * len(pos_id),
                         "pos_id": pos_id,
                         "pos_elapsed": pos_elapsed,
                         "pos_home": pos_home,
                         "pos_away": pos_away})
    
    df_match_ingame_possesion = pd.concat([df_match_ingame_possesion, temp])
df_match_ingame_possesion = df_match_ingame_possesion.rename(columns = {"pos_id": "event_id", 
                                                          "pos_elapsed": "elapsed", 
                                                          "pos_home": "home_possession", 
                                                          "pos_away": "away_possession"})
df_match_ingame_possesion
match_api_id event_id elapsed home_possession away_possession
0 489042 379029 25 56 44
1 489042 379251 45 54 46
2 489042 379443 70 54 46
3 489042 379575 90 55 45
0 489043 375608 27 65 35
... ... ... ... ... ...
3 2030170 4947481 85 45 55
0 2030171 4940420 22 59 41
1 2030171 4940639 45 61 39
2 2030171 4940941 68 60 40
3 2030171 4941128 90 62 38

34815 rows × 5 columns

  • Since we need home team and away team api id, find and combine team ids by the match api id.
df_match_basic
match_api_id country_id league_id season stage match_date home_team_api_id away_team_api_id
0 492473 1 1 2008/2009 1 2008-08-17 9987 9993
1 492474 1 1 2008/2009 1 2008-08-16 10000 9994
2 492475 1 1 2008/2009 1 2008-08-16 9984 8635
3 492476 1 1 2008/2009 1 2008-08-17 9991 9998
4 492477 1 1 2008/2009 1 2008-08-16 7947 9985
... ... ... ... ... ... ... ... ...
25974 1992091 24558 24558 2015/2016 9 2015-09-22 10190 10191
25975 1992092 24558 24558 2015/2016 9 2015-09-23 9824 10199
25976 1992093 24558 24558 2015/2016 9 2015-09-23 9956 10179
25977 1992094 24558 24558 2015/2016 9 2015-09-22 7896 10243
25978 1992095 24558 24558 2015/2016 9 2015-09-23 10192 9931

25979 rows × 8 columns

df_match_ingame_possesion = df_match_ingame_possesion.merge(df_match_basic[["match_api_id", "home_team_api_id", "away_team_api_id"]], how = "left", on = "match_api_id") \
    [["match_api_id", "event_id", "elapsed", "home_team_api_id", "home_possession", "away_team_api_id", "away_possession"]]
df_match_ingame_possesion
match_api_id event_id elapsed home_team_api_id home_possession away_team_api_id away_possession
0 489042 379029 25 10260 56 10261 44
1 489042 379251 45 10260 54 10261 46
2 489042 379443 70 10260 54 10261 46
3 489042 379575 90 10260 55 10261 45
4 489043 375608 27 9825 65 8659 35
... ... ... ... ... ... ... ...
34810 2030170 4947481 85 7878 45 8603 55
34811 2030171 4940420 22 8370 59 8558 41
34812 2030171 4940639 45 8370 61 8558 39
34813 2030171 4940941 68 8370 60 8558 40
34814 2030171 4941128 90 8370 62 8558 38

34815 rows × 7 columns

df_match_ingame_possesion.to_csv("../data/df_match_ingame_possesion.csv", index = False)

2.5. Summary

df_match_1

df_match_2

3. Team related tables: Country, League, Team, Team_Attributes

3.1. Country

org_country = pd.read_sql(
    "select * from Country", con
    )

org_country
id name
0 1 Belgium
1 1729 England
2 4769 France
3 7809 Germany
4 10257 Italy
5 13274 Netherlands
6 15722 Poland
7 17642 Portugal
8 19694 Scotland
9 21518 Spain
10 24558 Switzerland
  • There are total 11 countries.

3.2. League

org_league = pd.read_sql(
    "select * from League", con
    )

org_league
id country_id name
0 1 1 Belgium Jupiler League
1 1729 1729 England Premier League
2 4769 4769 France Ligue 1
3 7809 7809 Germany 1. Bundesliga
4 10257 10257 Italy Serie A
5 13274 13274 Netherlands Eredivisie
6 15722 15722 Poland Ekstraklasa
7 17642 17642 Portugal Liga ZON Sagres
8 19694 19694 Scotland Premier League
9 21518 21518 Spain LIGA BBVA
10 24558 24558 Switzerland Super League
  • There are total 11 leagues from different countries.
  • Since there is only one league per country, country and league represent the same information.
  • Since we can get the country information from the league name, let’s do not use the country and only use the league table later.

3.3. Team

org_team = pd.read_sql(
    "select * from Team", con
    )

org_team.head()
id team_api_id team_fifa_api_id team_long_name team_short_name
0 1 9987 673.0 KRC Genk GEN
1 2 9993 675.0 Beerschot AC BAC
2 3 10000 15005.0 SV Zulte-Waregem ZUL
3 4 9994 2007.0 Sporting Lokeren LOK
4 5 9984 1750.0 KSV Cercle Brugge CEB
  • Since team_long_name and team_short_name have same information, we will only use the team_long_name.
org_team.shape
(299, 5)
len(org_team.team_api_id.unique())
299
len(org_team.team_fifa_api_id.unique())
286
org_team.isna().sum()
id                   0
team_api_id          0
team_fifa_api_id    11
team_long_name       0
team_short_name      0
dtype: int64
  • team_api_id and team_fifa_api_id have different number of unique values because team_fifa_api_id has 11 missing values.
  • So it is better to use team_api rather than use team_fifa_api_id to merge tables later.
  • There are 299 teams.

3.4. Team_Attributes

org_team_attr = pd.read_sql(
    "select * from Team_Attributes", con
    )

org_team_attr.head()
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing chanceCreationPassingClass chanceCreationCrossing chanceCreationCrossingClass chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 00:00:00 60 Balanced NaN Little 50 Mixed Organised 60 Normal 65 Normal 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 00:00:00 52 Balanced 48.0 Normal 56 Mixed Organised 54 Normal 63 Normal 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 434 9930 2015-09-10 00:00:00 47 Balanced 41.0 Normal 54 Mixed Organised 54 Normal 63 Normal 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 77 8485 2010-02-22 00:00:00 70 Fast NaN Little 70 Long Organised 70 Risky 70 Lots 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 77 8485 2011-02-22 00:00:00 47 Balanced NaN Little 52 Mixed Organised 53 Normal 48 Normal 52 Normal Organised 47 Medium 47 Press 52 Normal Cover
org_team_attr.shape
(1458, 25)
  • There are 21 team attributes.
len(org_team_attr.team_api_id.unique())
288
  • The number of teams that have team attributes are 288 out of total 299 teams.
  • This is because the team attributes were collected by team_fifa_api_id, which has 11 missing values.
  • **Note**: Team attributes are important features, so I think matches of teams without team attributes should be excluded from the analysis later.
team_country_info = pd.concat([df_match_basic[["home_team_api_id", "country_id"]].rename(columns = {"home_team_api_id": "team_api_id"}) ,
                               df_match_basic[["away_team_api_id", "country_id"]].rename(columns = {"away_team_api_id": "team_api_id"})]).drop_duplicates()
df_team = org_team_attr.merge(team_country_info, how = "left", on = "team_api_id") \
                       .merge(org_league.rename(columns = {"name": "league_name"}), how = "left", on = "country_id") \
                       .merge(org_team, how = "left", on = "team_api_id") 
df_team.date
0       2010-02-22 00:00:00
1       2014-09-19 00:00:00
2       2015-09-10 00:00:00
3       2010-02-22 00:00:00
4       2011-02-22 00:00:00
               ...         
1453    2011-02-22 00:00:00
1454    2012-02-22 00:00:00
1455    2013-09-20 00:00:00
1456    2014-09-19 00:00:00
1457    2015-09-10 00:00:00
Name: date, Length: 1458, dtype: object
  • Since there are other date columns that have different meanings in other tables, let’s change the column name date in this table to the player_measured_date.
df_team["team_measured_date"] = pd.to_datetime(df_team.date)
df_team = df_team[["team_api_id", "team_long_name", "league_name", "team_measured_date",
                  'buildUpPlaySpeed', 'buildUpPlaySpeedClass', 'buildUpPlayDribbling','buildUpPlayDribblingClass', 
                  'buildUpPlayPassing', 'buildUpPlayPassingClass', 'buildUpPlayPositioningClass', 'chanceCreationPassing', 
                  'chanceCreationPassingClass','chanceCreationCrossing', 'chanceCreationCrossingClass', 'chanceCreationShooting', 
                  'chanceCreationShootingClass','chanceCreationPositioningClass', 'defencePressure', 'defencePressureClass', 
                  'defenceAggression', 'defenceAggressionClass', 'defenceTeamWidth', 'defenceTeamWidthClass',
                  'defenceDefenderLineClass']]
df_team
team_api_id team_long_name league_name team_measured_date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing chanceCreationPassingClass chanceCreationCrossing chanceCreationCrossingClass chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9930 FC Aarau Switzerland Super League 2010-02-22 60 Balanced NaN Little 50 Mixed Organised 60 Normal 65 Normal 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 9930 FC Aarau Switzerland Super League 2014-09-19 52 Balanced 48.0 Normal 56 Mixed Organised 54 Normal 63 Normal 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 9930 FC Aarau Switzerland Super League 2015-09-10 47 Balanced 41.0 Normal 54 Mixed Organised 54 Normal 63 Normal 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 8485 Aberdeen Scotland Premier League 2010-02-22 70 Fast NaN Little 70 Long Organised 70 Risky 70 Lots 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 8485 Aberdeen Scotland Premier League 2011-02-22 47 Balanced NaN Little 52 Mixed Organised 53 Normal 48 Normal 52 Normal Organised 47 Medium 47 Press 52 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 10000 SV Zulte-Waregem Belgium Jupiler League 2011-02-22 52 Balanced NaN Little 52 Mixed Organised 52 Normal 48 Normal 53 Normal Organised 46 Medium 48 Press 53 Normal Cover
1454 10000 SV Zulte-Waregem Belgium Jupiler League 2012-02-22 54 Balanced NaN Little 51 Mixed Organised 47 Normal 52 Normal 50 Normal Organised 44 Medium 55 Press 53 Normal Cover
1455 10000 SV Zulte-Waregem Belgium Jupiler League 2013-09-20 54 Balanced NaN Little 51 Mixed Organised 47 Normal 52 Normal 32 Little Organised 44 Medium 58 Press 37 Normal Cover
1456 10000 SV Zulte-Waregem Belgium Jupiler League 2014-09-19 54 Balanced 42.0 Normal 51 Mixed Organised 47 Normal 52 Normal 32 Little Organised 44 Medium 58 Press 37 Normal Cover
1457 10000 SV Zulte-Waregem Belgium Jupiler League 2015-09-10 54 Balanced 42.0 Normal 51 Mixed Organised 47 Normal 52 Normal 32 Little Organised 44 Medium 58 Press 37 Normal Cover

1458 rows × 25 columns

team_api_id_grouped_count = df_team.groupby(["league_name", "team_long_name"]).team_api_id.nunique().reset_index()
team_api_id_grouped_count[team_api_id_grouped_count.team_api_id != 1]
league_name team_long_name team_api_id
15 Belgium Jupiler League Royal Excel Mouscron 2
191 Poland Ekstraklasa Polonia Bytom 2
193 Poland Ekstraklasa Widzew Łódź 2
  • Each of the 3 teams has 2 different team_api_id. Let’s use only one of the two team_api_id
df_team.loc[df_team.team_long_name == "Royal Excel Mouscron", "team_api_id"] = 9996 # 274581 -> 9996
df_team.loc[df_team.team_long_name == "Polonia Bytom", "team_api_id"] = 8020 # 8031 -> 8020
df_team.loc[df_team.team_long_name == "Widzew Łódź", "team_api_id"] = 8024 # 8244 -> 8024

df_team = df_team.drop_duplicates()
team_api_id_grouped_count = df_team.groupby(["league_name", "team_long_name"]).team_api_id.nunique().reset_index()
team_api_id_grouped_count[team_api_id_grouped_count.team_api_id != 1]
league_name team_long_name team_api_id
  • Now, all team has only one team_api_id.
  • Save the changed df_team table.
df_team.to_csv("../data/df_team.csv", index = False)
  • Also, fix the above team_api_ids in the all other tables.
df_match_basic.loc[df_match_basic.home_team_api_id == 274581, "home_team_api_id"] = 9996
df_match_basic.loc[df_match_basic.away_team_api_id == 274581, "away_team_api_id"] = 9996

df_match_basic.loc[df_match_basic.home_team_api_id == 8031, "home_team_api_id"] = 8020
df_match_basic.loc[df_match_basic.away_team_api_id == 8031, "away_team_api_id"] = 8020

df_match_basic.loc[df_match_basic.home_team_api_id == 8244, "home_team_api_id"] = 8024
df_match_basic.loc[df_match_basic.away_team_api_id == 8244, "away_team_api_id"] = 8024

df_match_basic.to_csv("../data/df_match_basic.csv", index = False)
df_match_ingame_cross.loc[df_match_ingame_cross.team_api_id == 274581, "team_api_id"] = 9996
df_match_ingame_cross.loc[df_match_ingame_cross.team_api_id == 8031, "team_api_id"] = 8020
df_match_ingame_cross.loc[df_match_ingame_cross.team_api_id == 8244, "team_api_id"] = 8024

df_match_ingame_cross.to_csv("../data/df_match_ingame_cross.csv", index = False)
df_match_ingame_foul.loc[df_match_ingame_foul.team_api_id == 274581, "team_api_id"] = 9996
df_match_ingame_foul.loc[df_match_ingame_foul.team_api_id == 8031, "team_api_id"] = 8020
df_match_ingame_foul.loc[df_match_ingame_foul.team_api_id == 8244, "team_api_id"] = 8024

df_match_ingame_foul.to_csv("../data/df_match_ingame_foul.csv", index = False)
df_match_ingame_possesion.loc[df_match_ingame_possesion.home_team_api_id == 274581, "home_team_api_id"] = 9996
df_match_ingame_possesion.loc[df_match_ingame_possesion.away_team_api_id == 274581, "away_team_api_id"] = 9996

df_match_ingame_possesion.loc[df_match_ingame_possesion.home_team_api_id == 8031, "home_team_api_id"] = 8020
df_match_ingame_possesion.loc[df_match_ingame_possesion.away_team_api_id == 8031, "away_team_api_id"] = 8020

df_match_ingame_possesion.loc[df_match_ingame_possesion.home_team_api_id == 8244, "home_team_api_id"] = 8024
df_match_ingame_possesion.loc[df_match_ingame_possesion.away_team_api_id == 8244, "away_team_api_id"] = 8024

df_match_ingame_possesion.to_csv("../data/df_match_ingame_possesion.csv", index = False)
df_match_ingame_shot.loc[df_match_ingame_shot.team_api_id == 274581, "team_api_id"] = 9996
df_match_ingame_shot.loc[df_match_ingame_shot.team_api_id == 8031, "team_api_id"] = 8020
df_match_ingame_shot.loc[df_match_ingame_shot.team_api_id == 8244, "team_api_id"] = 8024

df_match_ingame_shot.to_csv("../data/df_match_ingame_shot.csv", index = False)

3.6. Summary

df_team_summary

4. Overall Summary

df_overall_summary

  • **Note**: Team attributes are important features, so I think matches of teams without team attributes from the df_team should be excluded from the analysis later.