1) Load and prepare data
import numpy as np
import pandas as pd
import sqlite3 as sql
import xml.etree.ElementTree as ET
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
- The european soccer database consists of the data related to the soccer matches of 10 european soccer leagues from 2008/2009 season to 2015/2016 season.
con = sql.connect("../data/database.sqlite")
sql_master_tables = pd.read_sql('select * from sqlite_master where type="table";',con)
sql_master_tables
| type | name | tbl_name | rootpage | sql | |
|---|---|---|---|---|---|
| 0 | table | sqlite_sequence | sqlite_sequence | 4 | CREATE TABLE sqlite_sequence(name,seq) |
| 1 | table | Player_Attributes | Player_Attributes | 11 | CREATE TABLE "Player_Attributes" (\n\t`id`\tIN... |
| 2 | table | Player | Player | 14 | CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA... |
| 3 | table | Match | Match | 18 | CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR... |
| 4 | table | League | League | 24 | CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA... |
| 5 | table | Country | Country | 26 | CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM... |
| 6 | table | Team | Team | 29 | CREATE TABLE "Team" (\n\t`id`\tINTEGER PRIMARY... |
| 7 | table | Team_Attributes | Team_Attributes | 2 | CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE... |
pd.read_sql(
"select * from sqlite_sequence", con
)
| name | seq | |
|---|---|---|
| 0 | Team | 103916 |
| 1 | Country | 51958 |
| 2 | League | 51958 |
| 3 | Match | 51958 |
| 4 | Player | 11075 |
| 5 | Player_Attributes | 183978 |
| 6 | Team_Attributes | 1458 |
- There are 7 tables.
- Player related tables
- Player table
- Player_Attributes table
- Team related tables
- Country table
- League table
- Team table
- Team_Attributes table
- Match related tables
- Match table
- Player related tables
1. Player related tables: Player, Player_Attributes
1.1. Player
org_player = pd.read_sql(
"select * from Player", con
)
org_player.head()
| id | player_api_id | player_name | player_fifa_api_id | birthday | height | weight | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 505942 | Aaron Appindangoye | 218353 | 1992-02-29 00:00:00 | 182.88 | 187 |
| 1 | 2 | 155782 | Aaron Cresswell | 189615 | 1989-12-15 00:00:00 | 170.18 | 146 |
| 2 | 3 | 162549 | Aaron Doran | 186170 | 1991-05-13 00:00:00 | 170.18 | 163 |
| 3 | 4 | 30572 | Aaron Galindo | 140161 | 1982-05-08 00:00:00 | 182.88 | 198 |
| 4 | 5 | 23780 | Aaron Hughes | 17725 | 1979-11-08 00:00:00 | 182.88 | 154 |
org_player.shape
(11060, 7)
len(org_player.player_api_id.unique())
11060
len(org_player.player_fifa_api_id.unique())
11060
sum(org_player.groupby(["player_api_id"]).nunique().player_fifa_api_id != 1)
0
- There are total 11,060 soccer players and their birthday, height, and weight information.
- Each player has exactly one player_api_id, one player_fifa_api_id, and one (player_api_id, player_fifa_api_id) set. So, we can use any player id to identify a player.
1.2. Player_Attributes
org_player_attr = pd.read_sql(
"select * from Player_Attributes", con
)
org_player_attr.head()
| id | player_fifa_api_id | player_api_id | date | overall_rating | potential | preferred_foot | attacking_work_rate | defensive_work_rate | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | free_kick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 218353 | 505942 | 2016-02-18 00:00:00 | 67.0 | 71.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 71.0 | 70.0 | 45.0 | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 1 | 2 | 218353 | 505942 | 2015-11-19 00:00:00 | 67.0 | 71.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 71.0 | 70.0 | 45.0 | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 2 | 3 | 218353 | 505942 | 2015-09-21 00:00:00 | 62.0 | 66.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 63.0 | 41.0 | 45.0 | 54.0 | 48.0 | 65.0 | 66.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 3 | 4 | 218353 | 505942 | 2015-03-20 00:00:00 | 61.0 | 65.0 | right | medium | medium | 48.0 | 43.0 | 70.0 | 60.0 | 43.0 | 50.0 | 44.0 | 38.0 | 63.0 | 48.0 | 60.0 | 64.0 | 59.0 | 46.0 | 65.0 | 54.0 | 58.0 | 54.0 | 76.0 | 34.0 | 62.0 | 40.0 | 44.0 | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 4 | 5 | 218353 | 505942 | 2007-02-22 00:00:00 | 61.0 | 65.0 | right | medium | medium | 48.0 | 43.0 | 70.0 | 60.0 | 43.0 | 50.0 | 44.0 | 38.0 | 63.0 | 48.0 | 60.0 | 64.0 | 59.0 | 46.0 | 65.0 | 54.0 | 58.0 | 54.0 | 76.0 | 34.0 | 62.0 | 40.0 | 44.0 | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
org_player_attr.shape
(183978, 42)
- There are total 38 player attributes.
- player_fifa_api_id or player_api_id can be used to merge the Player and Player_Attribute tables.
len(org_player_attr.player_fifa_api_id.unique())
11062
len(org_player_attr.player_api_id.unique())
11060
- There are different number of unique player_api_id and player_fifa_api_id.
- It means that some player_api_id have different player_fifa_api_id.
target_bool = org_player_attr.groupby(["player_api_id"]).nunique().player_fifa_api_id != 1
target_player_api = org_player_attr.groupby(["player_api_id"]).nunique().player_fifa_api_id[target_bool].index
org_player_attr[org_player_attr.player_api_id.isin(target_player_api)].sort_values(["player_api_id", "date"]).head(15)
| id | player_fifa_api_id | player_api_id | date | overall_rating | potential | preferred_foot | attacking_work_rate | defensive_work_rate | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | free_kick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104862 | 104863 | 202486 | 11285 | 2007-02-22 00:00:00 | 62.0 | 68.0 | right | high | medium | 62.0 | 68.0 | 47.0 | 56.0 | 35.0 | 63.0 | 56.0 | 53.0 | 47.0 | 57.0 | 84.0 | 81.0 | 74.0 | 73.0 | 71.0 | 54.0 | 68.0 | 64.0 | 52.0 | 56.0 | 53.0 | 22.0 | 71.0 | 41.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104881 | 104882 | 208618 | 11285 | 2007-02-22 00:00:00 | NaN | NaN | None | None | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104861 | 104862 | 202486 | 11285 | 2011-02-22 00:00:00 | 68.0 | 77.0 | right | medium | medium | 41.0 | 26.0 | 72.0 | 58.0 | 38.0 | 49.0 | 33.0 | 51.0 | 59.0 | 52.0 | 59.0 | 66.0 | 52.0 | 63.0 | 76.0 | 68.0 | 75.0 | 64.0 | 81.0 | 53.0 | 71.0 | 68.0 | 31.0 | 49.0 | 47.0 | 70.0 | 63.0 | 64.0 | 11.0 | 13.0 | 8.0 | 8.0 | 12.0 |
| 104860 | 104861 | 202486 | 11285 | 2011-08-30 00:00:00 | 65.0 | 69.0 | right | medium | medium | 41.0 | 26.0 | 72.0 | 58.0 | 38.0 | 49.0 | 33.0 | 51.0 | 59.0 | 51.0 | 54.0 | 51.0 | 51.0 | 58.0 | 41.0 | 68.0 | 47.0 | 50.0 | 73.0 | 53.0 | 71.0 | 65.0 | 31.0 | 49.0 | 47.0 | 69.0 | 63.0 | 64.0 | 11.0 | 13.0 | 8.0 | 8.0 | 12.0 |
| 104859 | 104860 | 202486 | 11285 | 2012-02-22 00:00:00 | 66.0 | 72.0 | right | medium | medium | 41.0 | 26.0 | 74.0 | 63.0 | 38.0 | 49.0 | 33.0 | 51.0 | 59.0 | 61.0 | 54.0 | 51.0 | 51.0 | 58.0 | 41.0 | 68.0 | 47.0 | 51.0 | 75.0 | 53.0 | 71.0 | 65.0 | 31.0 | 49.0 | 47.0 | 65.0 | 69.0 | 64.0 | 11.0 | 13.0 | 8.0 | 8.0 | 12.0 |
| 104880 | 104881 | 208618 | 11285 | 2012-08-31 00:00:00 | 62.0 | 68.0 | right | high | medium | 62.0 | 68.0 | 47.0 | 56.0 | 35.0 | 63.0 | 56.0 | 53.0 | 47.0 | 57.0 | 84.0 | 81.0 | 74.0 | 73.0 | 71.0 | 54.0 | 68.0 | 64.0 | 52.0 | 56.0 | 53.0 | 22.0 | 71.0 | 41.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104879 | 104880 | 208618 | 11285 | 2014-03-21 00:00:00 | 62.0 | 68.0 | right | high | medium | 62.0 | 68.0 | 47.0 | 56.0 | 35.0 | 63.0 | 56.0 | 53.0 | 47.0 | 57.0 | 84.0 | 81.0 | 74.0 | 73.0 | 71.0 | 54.0 | 68.0 | 64.0 | 52.0 | 56.0 | 53.0 | 22.0 | 71.0 | 41.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104878 | 104879 | 208618 | 11285 | 2014-04-11 00:00:00 | 63.0 | 68.0 | right | high | medium | 63.0 | 68.0 | 47.0 | 57.0 | 35.0 | 64.0 | 56.0 | 53.0 | 49.0 | 58.0 | 84.0 | 81.0 | 74.0 | 74.0 | 71.0 | 54.0 | 68.0 | 64.0 | 52.0 | 56.0 | 53.0 | 22.0 | 71.0 | 45.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104877 | 104878 | 208618 | 11285 | 2014-04-25 00:00:00 | 65.0 | 68.0 | right | high | medium | 63.0 | 65.0 | 47.0 | 62.0 | 35.0 | 65.0 | 56.0 | 67.0 | 49.0 | 64.0 | 84.0 | 81.0 | 74.0 | 74.0 | 71.0 | 54.0 | 68.0 | 64.0 | 52.0 | 56.0 | 53.0 | 22.0 | 71.0 | 51.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104858 | 104859 | 202486 | 11285 | 2014-09-18 00:00:00 | 65.0 | 68.0 | right | high | medium | 63.0 | 65.0 | 47.0 | 64.0 | 35.0 | 65.0 | 56.0 | 67.0 | 49.0 | 64.0 | 84.0 | 78.0 | 74.0 | 74.0 | 71.0 | 54.0 | 67.0 | 64.0 | 53.0 | 56.0 | 53.0 | 22.0 | 71.0 | 51.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104876 | 104877 | 208618 | 11285 | 2014-09-18 00:00:00 | NaN | NaN | None | None | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 104875 | 104876 | 208618 | 11285 | 2014-11-14 00:00:00 | 65.0 | 68.0 | right | high | medium | 63.0 | 65.0 | 47.0 | 64.0 | 35.0 | 65.0 | 56.0 | 67.0 | 49.0 | 64.0 | 84.0 | 78.0 | 74.0 | 74.0 | 71.0 | 54.0 | 67.0 | 64.0 | 53.0 | 56.0 | 53.0 | 22.0 | 71.0 | 51.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104874 | 104875 | 208618 | 11285 | 2014-11-28 00:00:00 | 65.0 | 68.0 | right | high | medium | 63.0 | 65.0 | 47.0 | 64.0 | 35.0 | 65.0 | 56.0 | 67.0 | 49.0 | 64.0 | 84.0 | 78.0 | 74.0 | 74.0 | 71.0 | 54.0 | 67.0 | 64.0 | 53.0 | 56.0 | 53.0 | 22.0 | 71.0 | 51.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104873 | 104874 | 208618 | 11285 | 2015-02-13 00:00:00 | 70.0 | 77.0 | right | high | medium | 70.0 | 65.0 | 47.0 | 72.0 | 35.0 | 70.0 | 56.0 | 68.0 | 62.0 | 69.0 | 84.0 | 78.0 | 74.0 | 77.0 | 71.0 | 54.0 | 67.0 | 64.0 | 53.0 | 56.0 | 53.0 | 22.0 | 76.0 | 58.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
| 104872 | 104873 | 208618 | 11285 | 2015-02-20 00:00:00 | 70.0 | 77.0 | right | high | medium | 70.0 | 65.0 | 47.0 | 72.0 | 43.0 | 70.0 | 56.0 | 68.0 | 62.0 | 69.0 | 84.0 | 78.0 | 74.0 | 77.0 | 71.0 | 54.0 | 67.0 | 64.0 | 53.0 | 56.0 | 53.0 | 22.0 | 76.0 | 58.0 | 55.0 | 22.0 | 24.0 | 20.0 | 8.0 | 15.0 | 8.0 | 8.0 | 6.0 |
- For example, player_api_id 11285 have two player_fifa_api_id 208618 and 202486.
- So it is better to use the player_api_id when merge tables.
- There are 11060 players’ attributes based on the number of unique player_api_id.
1.3. Merge all player related tables.
- Let’s make one player table by merging the Player and Player_Attributes table.
- We drop the id and player_fifa_api_id columns because we will only use player_api_id to identify a player later.
df_player = org_player_attr.drop(["id", "player_fifa_api_id"], axis = 1) \
.merge(org_player.drop(["id", "player_fifa_api_id"], axis = 1),
how = "left", on = "player_api_id")
df_player
| player_api_id | date | overall_rating | potential | preferred_foot | attacking_work_rate | defensive_work_rate | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | free_kick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | player_name | birthday | height | weight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 505942 | 2016-02-18 00:00:00 | 67.0 | 71.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 71.0 | 70.0 | 45.0 | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 |
| 1 | 505942 | 2015-11-19 00:00:00 | 67.0 | 71.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 71.0 | 70.0 | 45.0 | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 |
| 2 | 505942 | 2015-09-21 00:00:00 | 62.0 | 66.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 63.0 | 41.0 | 45.0 | 54.0 | 48.0 | 65.0 | 66.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 |
| 3 | 505942 | 2015-03-20 00:00:00 | 61.0 | 65.0 | right | medium | medium | 48.0 | 43.0 | 70.0 | 60.0 | 43.0 | 50.0 | 44.0 | 38.0 | 63.0 | 48.0 | 60.0 | 64.0 | 59.0 | 46.0 | 65.0 | 54.0 | 58.0 | 54.0 | 76.0 | 34.0 | 62.0 | 40.0 | 44.0 | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 |
| 4 | 505942 | 2007-02-22 00:00:00 | 61.0 | 65.0 | right | medium | medium | 48.0 | 43.0 | 70.0 | 60.0 | 43.0 | 50.0 | 44.0 | 38.0 | 63.0 | 48.0 | 60.0 | 64.0 | 59.0 | 46.0 | 65.0 | 54.0 | 58.0 | 54.0 | 76.0 | 34.0 | 62.0 | 40.0 | 44.0 | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 | Aaron Appindangoye | 1992-02-29 00:00:00 | 182.88 | 187 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183973 | 39902 | 2009-08-30 00:00:00 | 83.0 | 85.0 | right | medium | low | 84.0 | 77.0 | 59.0 | 89.0 | 77.0 | 84.0 | 86.0 | 78.0 | 84.0 | 85.0 | 66.0 | 72.0 | 77.0 | 86.0 | 73.0 | 76.0 | 58.0 | 72.0 | 67.0 | 81.0 | 56.0 | 78.0 | 86.0 | 88.0 | 83.0 | 22.0 | 31.0 | 30.0 | 9.0 | 20.0 | 84.0 | 20.0 | 20.0 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 |
| 183974 | 39902 | 2009-02-22 00:00:00 | 78.0 | 80.0 | right | medium | low | 74.0 | 76.0 | 53.0 | 84.0 | 77.0 | 85.0 | 86.0 | 74.0 | 73.0 | 86.0 | 66.0 | 67.0 | 77.0 | 74.0 | 73.0 | 75.0 | 58.0 | 66.0 | 65.0 | 73.0 | 61.0 | 64.0 | 72.0 | 88.0 | 70.0 | 32.0 | 31.0 | 30.0 | 9.0 | 20.0 | 73.0 | 20.0 | 20.0 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 |
| 183975 | 39902 | 2008-08-30 00:00:00 | 77.0 | 80.0 | right | medium | low | 74.0 | 71.0 | 53.0 | 84.0 | 77.0 | 85.0 | 86.0 | 74.0 | 73.0 | 86.0 | 66.0 | 67.0 | 77.0 | 74.0 | 73.0 | 75.0 | 58.0 | 66.0 | 65.0 | 73.0 | 67.0 | 64.0 | 72.0 | 88.0 | 70.0 | 32.0 | 31.0 | 30.0 | 9.0 | 20.0 | 73.0 | 20.0 | 20.0 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 |
| 183976 | 39902 | 2007-08-30 00:00:00 | 78.0 | 81.0 | right | medium | low | 74.0 | 64.0 | 57.0 | 86.0 | 77.0 | 87.0 | 86.0 | 73.0 | 73.0 | 91.0 | 61.0 | 60.0 | 77.0 | 69.0 | 73.0 | 72.0 | 58.0 | 67.0 | 59.0 | 78.0 | 63.0 | 63.0 | 68.0 | 88.0 | 53.0 | 28.0 | 32.0 | 30.0 | 9.0 | 20.0 | 73.0 | 20.0 | 20.0 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 |
| 183977 | 39902 | 2007-02-22 00:00:00 | 80.0 | 81.0 | right | medium | low | 74.0 | 68.0 | 57.0 | 88.0 | 77.0 | 87.0 | 86.0 | 53.0 | 78.0 | 91.0 | 58.0 | 64.0 | 77.0 | 66.0 | 73.0 | 72.0 | 58.0 | 67.0 | 59.0 | 78.0 | 63.0 | 63.0 | 68.0 | 88.0 | 53.0 | 38.0 | 32.0 | 30.0 | 9.0 | 9.0 | 78.0 | 7.0 | 15.0 | Zvjezdan Misimovic | 1982-06-05 00:00:00 | 180.34 | 176 |
183978 rows × 44 columns
df_player.dtypes
player_api_id int64
date object
overall_rating float64
potential float64
preferred_foot object
attacking_work_rate object
defensive_work_rate object
crossing float64
finishing float64
heading_accuracy float64
short_passing float64
volleys float64
dribbling float64
curve float64
free_kick_accuracy float64
long_passing float64
ball_control float64
acceleration float64
sprint_speed float64
agility float64
reactions float64
balance float64
shot_power float64
jumping float64
stamina float64
strength float64
long_shots float64
aggression float64
interceptions float64
positioning float64
vision float64
penalties float64
marking float64
standing_tackle float64
sliding_tackle float64
gk_diving float64
gk_handling float64
gk_kicking float64
gk_positioning float64
gk_reflexes float64
player_name object
birthday object
height float64
weight int64
dtype: object
- birthday and date columns need to be changed to the datetime type.
- Since there are other date columns that have different meanings in other tables, let’s change the column name date in this table to the player_measured_date.
df_player["player_measured_date"] = pd.to_datetime(df_player.date)
df_player["birthday"] = pd.to_datetime(df_player.birthday)
# just change the columns order
df_player = df_player[["player_api_id", "player_name", "birthday", "height", "weight", "player_measured_date",
'overall_rating', 'potential', 'preferred_foot', 'attacking_work_rate', 'defensive_work_rate',
'crossing', 'finishing', 'heading_accuracy', 'short_passing', 'volleys',
'dribbling', 'curve', 'free_kick_accuracy', 'long_passing', 'ball_control',
'acceleration', 'sprint_speed', 'agility', 'reactions', 'balance',
'shot_power', 'jumping', 'stamina', 'strength', 'long_shots',
'aggression', 'interceptions', 'positioning', 'vision', 'penalties',
'marking', 'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling',
'gk_kicking', 'gk_positioning', 'gk_reflexes']]
df_player
| player_api_id | player_name | birthday | height | weight | player_measured_date | overall_rating | potential | preferred_foot | attacking_work_rate | defensive_work_rate | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | free_kick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | interceptions | positioning | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 505942 | Aaron Appindangoye | 1992-02-29 | 182.88 | 187 | 2016-02-18 | 67.0 | 71.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 71.0 | 70.0 | 45.0 | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 1 | 505942 | Aaron Appindangoye | 1992-02-29 | 182.88 | 187 | 2015-11-19 | 67.0 | 71.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 71.0 | 70.0 | 45.0 | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 2 | 505942 | Aaron Appindangoye | 1992-02-29 | 182.88 | 187 | 2015-09-21 | 62.0 | 66.0 | right | medium | medium | 49.0 | 44.0 | 71.0 | 61.0 | 44.0 | 51.0 | 45.0 | 39.0 | 64.0 | 49.0 | 60.0 | 64.0 | 59.0 | 47.0 | 65.0 | 55.0 | 58.0 | 54.0 | 76.0 | 35.0 | 63.0 | 41.0 | 45.0 | 54.0 | 48.0 | 65.0 | 66.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 3 | 505942 | Aaron Appindangoye | 1992-02-29 | 182.88 | 187 | 2015-03-20 | 61.0 | 65.0 | right | medium | medium | 48.0 | 43.0 | 70.0 | 60.0 | 43.0 | 50.0 | 44.0 | 38.0 | 63.0 | 48.0 | 60.0 | 64.0 | 59.0 | 46.0 | 65.0 | 54.0 | 58.0 | 54.0 | 76.0 | 34.0 | 62.0 | 40.0 | 44.0 | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 4 | 505942 | Aaron Appindangoye | 1992-02-29 | 182.88 | 187 | 2007-02-22 | 61.0 | 65.0 | right | medium | medium | 48.0 | 43.0 | 70.0 | 60.0 | 43.0 | 50.0 | 44.0 | 38.0 | 63.0 | 48.0 | 60.0 | 64.0 | 59.0 | 46.0 | 65.0 | 54.0 | 58.0 | 54.0 | 76.0 | 34.0 | 62.0 | 40.0 | 44.0 | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183973 | 39902 | Zvjezdan Misimovic | 1982-06-05 | 180.34 | 176 | 2009-08-30 | 83.0 | 85.0 | right | medium | low | 84.0 | 77.0 | 59.0 | 89.0 | 77.0 | 84.0 | 86.0 | 78.0 | 84.0 | 85.0 | 66.0 | 72.0 | 77.0 | 86.0 | 73.0 | 76.0 | 58.0 | 72.0 | 67.0 | 81.0 | 56.0 | 78.0 | 86.0 | 88.0 | 83.0 | 22.0 | 31.0 | 30.0 | 9.0 | 20.0 | 84.0 | 20.0 | 20.0 |
| 183974 | 39902 | Zvjezdan Misimovic | 1982-06-05 | 180.34 | 176 | 2009-02-22 | 78.0 | 80.0 | right | medium | low | 74.0 | 76.0 | 53.0 | 84.0 | 77.0 | 85.0 | 86.0 | 74.0 | 73.0 | 86.0 | 66.0 | 67.0 | 77.0 | 74.0 | 73.0 | 75.0 | 58.0 | 66.0 | 65.0 | 73.0 | 61.0 | 64.0 | 72.0 | 88.0 | 70.0 | 32.0 | 31.0 | 30.0 | 9.0 | 20.0 | 73.0 | 20.0 | 20.0 |
| 183975 | 39902 | Zvjezdan Misimovic | 1982-06-05 | 180.34 | 176 | 2008-08-30 | 77.0 | 80.0 | right | medium | low | 74.0 | 71.0 | 53.0 | 84.0 | 77.0 | 85.0 | 86.0 | 74.0 | 73.0 | 86.0 | 66.0 | 67.0 | 77.0 | 74.0 | 73.0 | 75.0 | 58.0 | 66.0 | 65.0 | 73.0 | 67.0 | 64.0 | 72.0 | 88.0 | 70.0 | 32.0 | 31.0 | 30.0 | 9.0 | 20.0 | 73.0 | 20.0 | 20.0 |
| 183976 | 39902 | Zvjezdan Misimovic | 1982-06-05 | 180.34 | 176 | 2007-08-30 | 78.0 | 81.0 | right | medium | low | 74.0 | 64.0 | 57.0 | 86.0 | 77.0 | 87.0 | 86.0 | 73.0 | 73.0 | 91.0 | 61.0 | 60.0 | 77.0 | 69.0 | 73.0 | 72.0 | 58.0 | 67.0 | 59.0 | 78.0 | 63.0 | 63.0 | 68.0 | 88.0 | 53.0 | 28.0 | 32.0 | 30.0 | 9.0 | 20.0 | 73.0 | 20.0 | 20.0 |
| 183977 | 39902 | Zvjezdan Misimovic | 1982-06-05 | 180.34 | 176 | 2007-02-22 | 80.0 | 81.0 | right | medium | low | 74.0 | 68.0 | 57.0 | 88.0 | 77.0 | 87.0 | 86.0 | 53.0 | 78.0 | 91.0 | 58.0 | 64.0 | 77.0 | 66.0 | 73.0 | 72.0 | 58.0 | 67.0 | 59.0 | 78.0 | 63.0 | 63.0 | 68.0 | 88.0 | 53.0 | 38.0 | 32.0 | 30.0 | 9.0 | 9.0 | 78.0 | 7.0 | 15.0 |
183978 rows × 44 columns
df_player.to_csv("../data/df_player.csv", index = False)
1.4. Summary

2. Match table
org_match = pd.read_sql(
"select * from Match", con
)
org_match.head()
| id | country_id | league_id | season | stage | date | match_api_id | home_team_api_id | away_team_api_id | home_team_goal | away_team_goal | home_player_X1 | home_player_X2 | home_player_X3 | home_player_X4 | home_player_X5 | home_player_X6 | home_player_X7 | home_player_X8 | home_player_X9 | home_player_X10 | home_player_X11 | away_player_X1 | away_player_X2 | away_player_X3 | away_player_X4 | away_player_X5 | away_player_X6 | away_player_X7 | away_player_X8 | away_player_X9 | away_player_X10 | away_player_X11 | home_player_Y1 | home_player_Y2 | home_player_Y3 | home_player_Y4 | home_player_Y5 | home_player_Y6 | home_player_Y7 | home_player_Y8 | home_player_Y9 | home_player_Y10 | home_player_Y11 | away_player_Y1 | away_player_Y2 | away_player_Y3 | away_player_Y4 | away_player_Y5 | away_player_Y6 | away_player_Y7 | away_player_Y8 | away_player_Y9 | away_player_Y10 | away_player_Y11 | home_player_1 | home_player_2 | home_player_3 | home_player_4 | home_player_5 | home_player_6 | home_player_7 | home_player_8 | home_player_9 | home_player_10 | home_player_11 | away_player_1 | away_player_2 | away_player_3 | away_player_4 | away_player_5 | away_player_6 | away_player_7 | away_player_8 | away_player_9 | away_player_10 | away_player_11 | goal | shoton | shotoff | foulcommit | card | cross | corner | possession | B365H | B365D | B365A | BWH | BWD | BWA | IWH | IWD | IWA | LBH | LBD | LBA | PSH | PSD | PSA | WHH | WHD | WHA | SJH | SJD | SJA | VCH | VCD | VCA | GBH | GBD | GBA | BSH | BSD | BSA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 00:00:00 | 492473 | 9987 | 9993 | 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | None | None | None | None | None | None | None | 1.73 | 3.40 | 5.00 | 1.75 | 3.35 | 4.20 | 1.85 | 3.2 | 3.5 | 1.80 | 3.3 | 3.75 | NaN | NaN | NaN | 1.70 | 3.30 | 4.33 | 1.90 | 3.3 | 4.00 | 1.65 | 3.40 | 4.50 | 1.78 | 3.25 | 4.00 | 1.73 | 3.40 | 4.20 |
| 1 | 2 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 00:00:00 | 492474 | 10000 | 9994 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | None | None | None | None | None | None | None | 1.95 | 3.20 | 3.60 | 1.80 | 3.30 | 3.95 | 1.90 | 3.2 | 3.5 | 1.90 | 3.2 | 3.50 | NaN | NaN | NaN | 1.83 | 3.30 | 3.60 | 1.95 | 3.3 | 3.80 | 2.00 | 3.25 | 3.25 | 1.85 | 3.25 | 3.75 | 1.91 | 3.25 | 3.60 |
| 2 | 3 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 00:00:00 | 492475 | 9984 | 8635 | 0 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | None | None | None | None | None | None | None | 2.38 | 3.30 | 2.75 | 2.40 | 3.30 | 2.55 | 2.60 | 3.1 | 2.3 | 2.50 | 3.2 | 2.50 | NaN | NaN | NaN | 2.50 | 3.25 | 2.40 | 2.63 | 3.3 | 2.50 | 2.35 | 3.25 | 2.65 | 2.50 | 3.20 | 2.50 | 2.30 | 3.20 | 2.75 |
| 3 | 4 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 00:00:00 | 492476 | 9991 | 9998 | 5 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | None | None | None | None | None | None | None | 1.44 | 3.75 | 7.50 | 1.40 | 4.00 | 6.80 | 1.40 | 3.9 | 6.0 | 1.44 | 3.6 | 6.50 | NaN | NaN | NaN | 1.44 | 3.75 | 6.00 | 1.44 | 4.0 | 7.50 | 1.45 | 3.75 | 6.50 | 1.50 | 3.75 | 5.50 | 1.44 | 3.75 | 6.50 |
| 4 | 5 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 00:00:00 | 492477 | 7947 | 9985 | 1 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None | None | None | None | None | None | None | None | 5.00 | 3.50 | 1.65 | 5.00 | 3.50 | 1.60 | 4.00 | 3.3 | 1.7 | 4.00 | 3.4 | 1.72 | NaN | NaN | NaN | 4.20 | 3.40 | 1.70 | 4.50 | 3.5 | 1.73 | 4.50 | 3.40 | 1.65 | 4.50 | 3.50 | 1.65 | 4.75 | 3.30 | 1.67 |
org_match.shape
(25979, 115)
len(org_match.match_api_id.unique())
25979
- There are total 25,979 matches and 115 match information variables.
-
Since there are 25,979 number of unique match_api_id, let’s use the match_api_id to identify a match.
- Match information variables can be categorized in 4 categories
- basic match information
- country id, league id
- home team id, away team id
- season, stage, date
- player information
- home team players, away team players
- betting information
- in-game information
- goal, shoton, shotoff
- foulcommit, card
- cross, corner
- possession
- home_team_goal, away_team_goal
- basic match information
- Let’s divide the match table into 4 different tables to make analysis easier later.
2.1. Basic match information
basic_match_information_col = ["match_api_id", "country_id", "league_id", "home_team_api_id", "away_team_api_id", "season", "stage", "date", "home_team_goal", "away_team_goal"]
df_match_basic = org_match[basic_match_information_col]
df_match_basic
| match_api_id | country_id | league_id | home_team_api_id | away_team_api_id | season | stage | date | home_team_goal | away_team_goal | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | 1 | 1 | 9987 | 9993 | 2008/2009 | 1 | 2008-08-17 00:00:00 | 1 | 1 |
| 1 | 492474 | 1 | 1 | 10000 | 9994 | 2008/2009 | 1 | 2008-08-16 00:00:00 | 0 | 0 |
| 2 | 492475 | 1 | 1 | 9984 | 8635 | 2008/2009 | 1 | 2008-08-16 00:00:00 | 0 | 3 |
| 3 | 492476 | 1 | 1 | 9991 | 9998 | 2008/2009 | 1 | 2008-08-17 00:00:00 | 5 | 0 |
| 4 | 492477 | 1 | 1 | 7947 | 9985 | 2008/2009 | 1 | 2008-08-16 00:00:00 | 1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | 24558 | 24558 | 10190 | 10191 | 2015/2016 | 9 | 2015-09-22 00:00:00 | 1 | 0 |
| 25975 | 1992092 | 24558 | 24558 | 9824 | 10199 | 2015/2016 | 9 | 2015-09-23 00:00:00 | 1 | 2 |
| 25976 | 1992093 | 24558 | 24558 | 9956 | 10179 | 2015/2016 | 9 | 2015-09-23 00:00:00 | 2 | 0 |
| 25977 | 1992094 | 24558 | 24558 | 7896 | 10243 | 2015/2016 | 9 | 2015-09-22 00:00:00 | 0 | 0 |
| 25978 | 1992095 | 24558 | 24558 | 10192 | 9931 | 2015/2016 | 9 | 2015-09-23 00:00:00 | 4 | 3 |
25979 rows × 10 columns
df_match_basic.dtypes
match_api_id int64
country_id int64
league_id int64
home_team_api_id int64
away_team_api_id int64
season object
stage int64
date object
home_team_goal int64
away_team_goal int64
dtype: object
- The date column need to be converted to the datetime dtype.
- Since there are other date columns that have different meanings in other tables, let’s change the column name date in this table to the match_date.
df_match_basic["match_date"] = pd.to_datetime(df_match_basic.date)
# Just change the columns order
df_match_basic = df_match_basic[["match_api_id", "country_id", "league_id", "season", "stage",
"match_date", "home_team_api_id", "away_team_api_id", "home_team_goal", "away_team_goal"]]
df_match_basic
| match_api_id | country_id | league_id | season | stage | match_date | home_team_api_id | away_team_api_id | home_team_goal | away_team_goal | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 | 9987 | 9993 | 1 | 1 |
| 1 | 492474 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 10000 | 9994 | 0 | 0 |
| 2 | 492475 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 9984 | 8635 | 0 | 3 |
| 3 | 492476 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 | 9991 | 9998 | 5 | 0 |
| 4 | 492477 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 7947 | 9985 | 1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-22 | 10190 | 10191 | 1 | 0 |
| 25975 | 1992092 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 9824 | 10199 | 1 | 2 |
| 25976 | 1992093 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 9956 | 10179 | 2 | 0 |
| 25977 | 1992094 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-22 | 7896 | 10243 | 0 | 0 |
| 25978 | 1992095 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 10192 | 9931 | 4 | 3 |
25979 rows × 10 columns
- Make the match result column that will be our target variable in the modeling.
- match_result: home_win / draw / away_win
home_win_bool = df_match_basic.home_team_goal > df_match_basic.away_team_goal
draw_bool = df_match_basic.home_team_goal == df_match_basic.away_team_goal
away_win_bool = df_match_basic.home_team_goal < df_match_basic.away_team_goal
df_match_basic.loc[home_win_bool, "match_result"] = "home_win"
df_match_basic.loc[draw_bool, "match_result"] = "draw"
df_match_basic.loc[away_win_bool, "match_result"] = "away_win"
df_match_basic
| match_api_id | country_id | league_id | season | stage | match_date | home_team_api_id | away_team_api_id | home_team_goal | away_team_goal | match_result | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 | 9987 | 9993 | 1 | 1 | draw |
| 1 | 492474 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 10000 | 9994 | 0 | 0 | draw |
| 2 | 492475 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 9984 | 8635 | 0 | 3 | away_win |
| 3 | 492476 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 | 9991 | 9998 | 5 | 0 | home_win |
| 4 | 492477 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 7947 | 9985 | 1 | 3 | away_win |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-22 | 10190 | 10191 | 1 | 0 | home_win |
| 25975 | 1992092 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 9824 | 10199 | 1 | 2 | away_win |
| 25976 | 1992093 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 9956 | 10179 | 2 | 0 | home_win |
| 25977 | 1992094 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-22 | 7896 | 10243 | 0 | 0 | draw |
| 25978 | 1992095 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 10192 | 9931 | 4 | 3 | home_win |
25979 rows × 11 columns
df_match_basic.to_csv("../data/df_match_basic.csv", index = False)
2.2. Player information
player_col = ["match_api_id", 'home_player_X1', 'home_player_X2', 'home_player_X3', 'home_player_X4', 'home_player_X5',
'home_player_X6', 'home_player_X7', 'home_player_X8', 'home_player_X9', 'home_player_X10',
'home_player_X11', 'away_player_X1', 'away_player_X2', 'away_player_X3', 'away_player_X4',
'away_player_X5', 'away_player_X6', 'away_player_X7', 'away_player_X8', 'away_player_X9',
'away_player_X10', 'away_player_X11', 'home_player_Y1', 'home_player_Y2', 'home_player_Y3',
'home_player_Y4', 'home_player_Y5', 'home_player_Y6', 'home_player_Y7', 'home_player_Y8',
'home_player_Y9', 'home_player_Y10', 'home_player_Y11', 'away_player_Y1', 'away_player_Y2',
'away_player_Y3', 'away_player_Y4', 'away_player_Y5', 'away_player_Y6', 'away_player_Y7',
'away_player_Y8', 'away_player_Y9', 'away_player_Y10', 'away_player_Y11', 'home_player_1',
'home_player_2', 'home_player_3', 'home_player_4', 'home_player_5', 'home_player_6',
'home_player_7', 'home_player_8', 'home_player_9', 'home_player_10', 'home_player_11',
'away_player_1', 'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5',
'away_player_6', 'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10', 'away_player_11']
df_match_player = org_match[player_col]
df_match_player
| match_api_id | home_player_X1 | home_player_X2 | home_player_X3 | home_player_X4 | home_player_X5 | home_player_X6 | home_player_X7 | home_player_X8 | home_player_X9 | home_player_X10 | home_player_X11 | away_player_X1 | away_player_X2 | away_player_X3 | away_player_X4 | away_player_X5 | away_player_X6 | away_player_X7 | away_player_X8 | away_player_X9 | away_player_X10 | away_player_X11 | home_player_Y1 | home_player_Y2 | home_player_Y3 | home_player_Y4 | home_player_Y5 | home_player_Y6 | home_player_Y7 | home_player_Y8 | home_player_Y9 | home_player_Y10 | home_player_Y11 | away_player_Y1 | away_player_Y2 | away_player_Y3 | away_player_Y4 | away_player_Y5 | away_player_Y6 | away_player_Y7 | away_player_Y8 | away_player_Y9 | away_player_Y10 | away_player_Y11 | home_player_1 | home_player_2 | home_player_3 | home_player_4 | home_player_5 | home_player_6 | home_player_7 | home_player_8 | home_player_9 | home_player_10 | home_player_11 | away_player_1 | away_player_2 | away_player_3 | away_player_4 | away_player_5 | away_player_6 | away_player_7 | away_player_8 | away_player_9 | away_player_10 | away_player_11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 492474 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 492475 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 492476 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 492477 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 3.0 | 5.0 | 7.0 | 5.0 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 3.0 | 5.0 | 7.0 | 5.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 6.0 | 8.0 | 8.0 | 8.0 | 11.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 6.0 | 8.0 | 8.0 | 8.0 | 11.0 | 42231.0 | 678384.0 | 95220.0 | 638592.0 | 413155.0 | 45780.0 | 171229.0 | 67333.0 | 119839.0 | 143790.0 | 195215.0 | 462944.0 | 563066.0 | 8800.0 | 67304.0 | 158253.0 | 133126.0 | 186524.0 | 93223.0 | 121115.0 | 232110.0 | 289732.0 |
| 25975 | 1992092 | 1.0 | 3.0 | 5.0 | 7.0 | 2.0 | 4.0 | 6.0 | 8.0 | 5.0 | 4.0 | 6.0 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 3.0 | 5.0 | 7.0 | 3.0 | 5.0 | 7.0 | 1.0 | 3.0 | 3.0 | 3.0 | 7.0 | 7.0 | 7.0 | 7.0 | 9.0 | 11.0 | 11.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 7.0 | 7.0 | 7.0 | 10.0 | 10.0 | 10.0 | 33272.0 | 41621.0 | 25813.0 | 257845.0 | 114735.0 | 42237.0 | 113227.0 | 358156.0 | 32343.0 | 531309.0 | 37257.0 | 42276.0 | 114792.0 | 150007.0 | 178119.0 | 27232.0 | 570830.0 | 260708.0 | 201704.0 | 36382.0 | 34082.0 | 95257.0 |
| 25976 | 1992093 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 3.0 | 5.0 | 7.0 | 5.0 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 3.0 | 5.0 | 7.0 | 5.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 6.0 | 8.0 | 8.0 | 8.0 | 11.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 6.0 | 8.0 | 8.0 | 8.0 | 11.0 | 157856.0 | 274779.0 | 177689.0 | 294256.0 | 42258.0 | 39979.0 | 173936.0 | 147959.0 | 451983.0 | 80564.0 | 289472.0 | 10637.0 | 67349.0 | 202663.0 | 32597.0 | 114794.0 | 188114.0 | 25840.0 | 482200.0 | 95230.0 | 451335.0 | 275122.0 |
| 25977 | 1992094 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 3.0 | 7.0 | 5.0 | 3.0 | 5.0 | 7.0 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 3.0 | 5.0 | 7.0 | 5.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 7.0 | 7.0 | 7.0 | 10.0 | 10.0 | 10.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 6.0 | 8.0 | 8.0 | 8.0 | 11.0 | NaN | 8881.0 | 173534.0 | 39646.0 | 282287.0 | 340790.0 | 393337.0 | 8893.0 | 614454.0 | 93229.0 | 178142.0 | 274776.0 | 121080.0 | 197757.0 | 260964.0 | 231614.0 | 113235.0 | 41116.0 | 462608.0 | 42262.0 | 92252.0 | 194532.0 |
| 25978 | 1992095 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 1.0 | 2.0 | 4.0 | 6.0 | 8.0 | 4.0 | 6.0 | 3.0 | 5.0 | 7.0 | 5.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 7.0 | 7.0 | 7.0 | 7.0 | 10.0 | 10.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 6.0 | 8.0 | 8.0 | 8.0 | 11.0 | 274787.0 | 492132.0 | 108451.0 | 25815.0 | 94553.0 | 384376.0 | 598355.0 | 36785.0 | 45174.0 | 302079.0 | 71764.0 | 156175.0 | 95216.0 | 172768.0 | 22834.0 | 458806.0 | 207234.0 | 25772.0 | 40274.0 | 34035.0 | 41726.0 | 527103.0 |
25979 rows × 67 columns
df_match_player.to_csv("../data/df_match_player.csv", index = False)
2.3. Betting information
betting_col = ["match_api_id", 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH',
'LBD', 'LBA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD',
'SJA', 'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA']
df_match_betting = org_match[betting_col]
df_match_betting
| match_api_id | B365H | B365D | B365A | BWH | BWD | BWA | IWH | IWD | IWA | LBH | LBD | LBA | PSH | PSD | PSA | WHH | WHD | WHA | SJH | SJD | SJA | VCH | VCD | VCA | GBH | GBD | GBA | BSH | BSD | BSA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | 1.73 | 3.40 | 5.00 | 1.75 | 3.35 | 4.20 | 1.85 | 3.2 | 3.5 | 1.80 | 3.3 | 3.75 | NaN | NaN | NaN | 1.70 | 3.30 | 4.33 | 1.90 | 3.3 | 4.00 | 1.65 | 3.40 | 4.50 | 1.78 | 3.25 | 4.00 | 1.73 | 3.40 | 4.20 |
| 1 | 492474 | 1.95 | 3.20 | 3.60 | 1.80 | 3.30 | 3.95 | 1.90 | 3.2 | 3.5 | 1.90 | 3.2 | 3.50 | NaN | NaN | NaN | 1.83 | 3.30 | 3.60 | 1.95 | 3.3 | 3.80 | 2.00 | 3.25 | 3.25 | 1.85 | 3.25 | 3.75 | 1.91 | 3.25 | 3.60 |
| 2 | 492475 | 2.38 | 3.30 | 2.75 | 2.40 | 3.30 | 2.55 | 2.60 | 3.1 | 2.3 | 2.50 | 3.2 | 2.50 | NaN | NaN | NaN | 2.50 | 3.25 | 2.40 | 2.63 | 3.3 | 2.50 | 2.35 | 3.25 | 2.65 | 2.50 | 3.20 | 2.50 | 2.30 | 3.20 | 2.75 |
| 3 | 492476 | 1.44 | 3.75 | 7.50 | 1.40 | 4.00 | 6.80 | 1.40 | 3.9 | 6.0 | 1.44 | 3.6 | 6.50 | NaN | NaN | NaN | 1.44 | 3.75 | 6.00 | 1.44 | 4.0 | 7.50 | 1.45 | 3.75 | 6.50 | 1.50 | 3.75 | 5.50 | 1.44 | 3.75 | 6.50 |
| 4 | 492477 | 5.00 | 3.50 | 1.65 | 5.00 | 3.50 | 1.60 | 4.00 | 3.3 | 1.7 | 4.00 | 3.4 | 1.72 | NaN | NaN | NaN | 4.20 | 3.40 | 1.70 | 4.50 | 3.5 | 1.73 | 4.50 | 3.40 | 1.65 | 4.50 | 3.50 | 1.65 | 4.75 | 3.30 | 1.67 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25975 | 1992092 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25976 | 1992093 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25977 | 1992094 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25978 | 1992095 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25979 rows × 31 columns
df_match_betting.to_csv("../data/df_match_betting.csv", index = False)
2.4. In-game information
in_game_col = ["match_api_id", 'goal', "home_team_goal", "away_team_goal", 'shoton',
'shotoff', 'foulcommit', 'card', 'cross', 'corner',
'possession']
df_match_ingame = org_match[in_game_col]
df_match_ingame
| match_api_id | goal | home_team_goal | away_team_goal | shoton | shotoff | foulcommit | card | cross | corner | possession | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | None | 1 | 1 | None | None | None | None | None | None | None |
| 1 | 492474 | None | 0 | 0 | None | None | None | None | None | None | None |
| 2 | 492475 | None | 0 | 3 | None | None | None | None | None | None | None |
| 3 | 492476 | None | 5 | 0 | None | None | None | None | None | None | None |
| 4 | 492477 | None | 1 | 3 | None | None | None | None | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | None | 1 | 0 | None | None | None | None | None | None | None |
| 25975 | 1992092 | None | 1 | 2 | None | None | None | None | None | None | None |
| 25976 | 1992093 | None | 2 | 0 | None | None | None | None | None | None | None |
| 25977 | 1992094 | None | 0 | 0 | None | None | None | None | None | None | None |
| 25978 | 1992095 | None | 4 | 3 | None | None | None | None | None | None | None |
25979 rows × 11 columns
df_match_ingame[~df_match_ingame.goal.isna()]
| match_api_id | goal | home_team_goal | away_team_goal | shoton | shotoff | foulcommit | card | cross | corner | possession | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1728 | 489042 | <goal><value><comment>n</comment><stats><goals... | 1 | 1 | <shoton><value><stats><blocked>1</blocked></st... | <shotoff><value><stats><shotoff>1</shotoff></s... | <foulcommit><value><stats><foulscommitted>1</f... | <card><value><comment>y</comment><stats><ycard... | <cross><value><stats><crosses>1</crosses></sta... | <corner><value><stats><corners>1</corners></st... | <possession><value><comment>56</comment><event... |
| 1729 | 489043 | <goal><value><comment>n</comment><stats><goals... | 1 | 0 | <shoton><value><stats><blocked>1</blocked></st... | <shotoff><value><stats><shotoff>1</shotoff></s... | <foulcommit><value><stats><foulscommitted>1</f... | <card /> | <cross><value><stats><crosses>1</crosses></sta... | <corner><value><stats><corners>1</corners></st... | <possession><value><comment>65</comment><event... |
| 1730 | 489044 | <goal><value><comment>n</comment><stats><goals... | 0 | 1 | <shoton><value><stats><blocked>1</blocked></st... | <shotoff><value><stats><shotoff>1</shotoff></s... | <foulcommit><value><stats><foulscommitted>1</f... | <card><value><comment>y</comment><stats><ycard... | <cross><value><stats><crosses>1</crosses></sta... | <corner><value><stats><corners>1</corners></st... | <possession><value><comment>45</comment><event... |
| 1731 | 489045 | <goal><value><comment>n</comment><stats><goals... | 2 | 1 | <shoton><value><stats><shoton>1</shoton></stat... | <shotoff><value><stats><shotoff>1</shotoff></s... | <foulcommit><value><stats><foulscommitted>1</f... | <card><value><comment>y</comment><stats><ycard... | <cross><value><stats><crosses>1</crosses></sta... | <corner><value><stats><corners>1</corners></st... | <possession><value><comment>50</comment><event... |
| 1732 | 489046 | <goal><value><comment>n</comment><stats><goals... | 4 | 2 | <shoton><value><stats><blocked>1</blocked></st... | <shotoff><value><stats><shotoff>1</shotoff></s... | <foulcommit><value><stats><foulscommitted>1</f... | <card><value><comment>y</comment><stats><ycard... | <cross><value><stats><corners>1</corners></sta... | <corner><value><stats><corners>1</corners></st... | <possession><value><comment>51</comment><event... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25944 | 1992225 | <goal><value><comment>o</comment><stats><owngo... | 0 | 1 | <shoton /> | <shotoff /> | <foulcommit /> | <card><value><comment>y</comment><stats><ycard... | <cross /> | <corner /> | <possession /> |
| 25945 | 1992226 | <goal><value><comment>n</comment><stats><goals... | 3 | 0 | <shoton /> | <shotoff /> | <foulcommit /> | <card><value><comment>y</comment><stats><ycard... | <cross /> | <corner /> | <possession /> |
| 25946 | 1992227 | <goal><value><comment>n</comment><stats><goals... | 2 | 2 | <shoton /> | <shotoff /> | <foulcommit /> | <card><value><comment>y</comment><stats><ycard... | <cross /> | <corner /> | <possession /> |
| 25947 | 1992228 | <goal><value><comment>n</comment><stats><goals... | 0 | 3 | <shoton /> | <shotoff /> | <foulcommit /> | <card><value><comment>y</comment><stats><ycard... | <cross /> | <corner /> | <possession /> |
| 25948 | 1992229 | <goal><value><comment>n</comment><stats><goals... | 3 | 1 | <shoton /> | <shotoff /> | <foulcommit /> | <card><value><comment>y</comment><stats><ycard... | <cross /> | <corner /> | <possession /> |
14217 rows × 11 columns
- in-game related columns have xml format.
- So we need to extract information from the xml data.
-
Here is the information that we can extract from the in-game related columns: https://docs.google.com/spreadsheets/d/1EIwJc1YxI5Uf1-gNGdjgmreI0JLSSicH/edit?usp=sharing&ouid=103945379872439212801&rtpof=true&sd=true
- Extract the xml data from the goal column.
df_match_ingame_goal = pd.DataFrame(columns = ["match_api_id", "goal_id", "goal_elapsed", "goal_team_api_id", "goal_player1_id", "goal_player2_id",
"goal_type", "goal_subtype"])
for match_api_id in df_match_ingame[~df_match_ingame.goal.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].goal.values[0]
goal_id = []
goal_elapsed = []
goal_team_id = []
goal_player1_id = []
goal_palyer2_id = []
goal_type = []
goal_subtype = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": goal_id.append(child.text)
elif child.tag == "elapsed": goal_elapsed.append(child.text)
elif child.tag == "team": goal_team_id.append(child.text)
elif child.tag == "player1": goal_player1_id.append(child.text)
elif child.tag == "player2": goal_palyer2_id.append(child.text)
elif child.tag == "goal_type": goal_type.append(child.text)
elif child.tag == "subtype": goal_subtype.append(child.text)
if len(goal_id) != num_elmt: goal_id.append(np.nan)
if len(goal_elapsed) != num_elmt: goal_elapsed.append(np.nan)
if len(goal_team_id) != num_elmt: goal_team_id.append(np.nan)
if len(goal_player1_id) != num_elmt: goal_player1_id.append(np.nan)
if len(goal_palyer2_id) != num_elmt: goal_palyer2_id.append(np.nan)
if len(goal_type) != num_elmt: goal_type.append(np.nan)
if len(goal_subtype) != num_elmt: goal_subtype.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(goal_id),
"goal_id": goal_id,
"goal_elapsed": goal_elapsed,
"goal_team_api_id": goal_team_id,
"goal_player1_id": goal_player1_id,
"goal_player2_id": goal_palyer2_id,
"goal_type": goal_type,
"goal_subtype": goal_subtype})
df_match_ingame_goal = pd.concat([df_match_ingame_goal, temp])
df_match_ingame_goal
| match_api_id | goal_id | goal_elapsed | goal_team_api_id | goal_player1_id | goal_player2_id | goal_type | goal_subtype | |
|---|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378998 | 22 | 10261 | 37799 | 38807 | n | header |
| 1 | 489042 | 379019 | 24 | 10260 | 24148 | 24154 | n | shot |
| 0 | 489043 | 375546 | 4 | 9825 | 26181 | 39297 | n | shot |
| 0 | 489044 | 378041 | 83 | 8650 | 30853 | 30889 | n | distance |
| 0 | 489045 | 376060 | 4 | 8654 | 23139 | 36394 | n | shot |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640015 | 71 | 10192 | 37554 | NaN | n | NaN |
| 0 | 1992229 | 5639993 | 58 | 9824 | 493418 | NaN | n | NaN |
| 1 | 1992229 | 5640008 | 67 | 10243 | 197757 | NaN | n | NaN |
| 2 | 1992229 | 5640010 | 69 | 10243 | 198082 | NaN | n | NaN |
| 3 | 1992229 | 5640020 | 76 | 10243 | 121080 | NaN | n | NaN |
39980 rows × 8 columns
- Extract the xml data from the shoton and shotoff columns.
df_match_ingame_shoton = pd.DataFrame(columns = ["match_api_id", "shot_id", "shot_elapsed", "shot_type", "shot_team_api_id", "shot_player1_id"])
for match_api_id in df_match_ingame[~df_match_ingame.shoton.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].shoton.values[0]
shot_id = []
shot_elapsed = []
shot_type = []
shot_team_id = []
shot_player1_id = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": shot_id.append(child.text)
elif child.tag == "elapsed": shot_elapsed.append(child.text)
elif child.tag == "team": shot_team_id.append(child.text)
elif child.tag == "player1": shot_player1_id.append(child.text)
elif child.tag == "type": shot_type.append(child.text)
if len(shot_id) != num_elmt: shot_id.append(np.nan)
if len(shot_elapsed) != num_elmt: shot_elapsed.append(np.nan)
if len(shot_team_id) != num_elmt: shot_team_id.append(np.nan)
if len(shot_player1_id) != num_elmt: shot_player1_id.append(np.nan)
if len(shot_type) != num_elmt: shot_type.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(shot_id),
"shot_id": shot_id,
"shot_elapsed": shot_elapsed,
"shot_type": shot_type,
"shot_team_api_id": shot_team_id,
"shot_player1_id": shot_player1_id})
df_match_ingame_shoton = pd.concat([df_match_ingame_shoton, temp])
df_match_ingame_shoton
| match_api_id | shot_id | shot_elapsed | shot_type | shot_team_api_id | shot_player1_id | |
|---|---|---|---|---|---|---|
| 0 | 489042 | 378828 | 3 | shoton | 10260 | 24154 |
| 1 | 489042 | 378866 | 7 | shoton | 10260 | 24157 |
| 2 | 489042 | 378922 | 14 | shoton | 10260 | 30829 |
| 3 | 489042 | 378923 | 14 | shoton | 10260 | 30373 |
| 4 | 489042 | 378951 | 17 | shoton | 10260 | 30373 |
| ... | ... | ... | ... | ... | ... | ... |
| 6 | 2030171 | 4940770 | 53 | shoton | 8370 | 256727 |
| 7 | 2030171 | 4940869 | 60 | shoton | 8558 | 41622 |
| 8 | 2030171 | 4940889 | 62 | shoton | 8370 | 482331 |
| 9 | 2030171 | 4940940 | 68 | shoton | 8370 | 210065 |
| 10 | 2030171 | 4941144 | 90 | shoton | 8558 | 452243 |
93755 rows × 6 columns
df_match_ingame_shotoff = pd.DataFrame(columns = ["match_api_id", "shot_id", "shot_elapsed", "shot_type", "shot_team_api_id", "shot_player1_id"])
for match_api_id in df_match_ingame[~df_match_ingame.shotoff.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].shotoff.values[0]
shot_id = []
shot_elapsed = []
shot_type = []
shot_team_id = []
shot_player1_id = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": shot_id.append(child.text)
elif child.tag == "elapsed": shot_elapsed.append(child.text)
elif child.tag == "team": shot_team_id.append(child.text)
elif child.tag == "player1": shot_player1_id.append(child.text)
elif child.tag == "type": shot_type.append(child.text)
if len(shot_id) != num_elmt: shot_id.append(np.nan)
if len(shot_elapsed) != num_elmt: shot_elapsed.append(np.nan)
if len(shot_team_id) != num_elmt: shot_team_id.append(np.nan)
if len(shot_player1_id) != num_elmt: shot_player1_id.append(np.nan)
if len(shot_type) != num_elmt: shot_type.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(shot_id),
"shot_id": shot_id,
"shot_elapsed": shot_elapsed,
"shot_type": shot_type,
"shot_team_api_id": shot_team_id,
"shot_player1_id": shot_player1_id})
df_match_ingame_shotoff = pd.concat([df_match_ingame_shotoff, temp])
df_match_ingame_shotoff
| match_api_id | shot_id | shot_elapsed | shot_type | shot_team_api_id | shot_player1_id | |
|---|---|---|---|---|---|---|
| 0 | 489042 | 378835 | 4 | shotoff | 10260 | 30373 |
| 1 | 489042 | 378845 | 5 | shotoff | 10261 | 37799 |
| 2 | 489042 | 378995 | 22 | shotoff | 10261 | 24228 |
| 3 | 489042 | 379075 | 31 | shotoff | 10261 | 38807 |
| 4 | 489042 | 379111 | 34 | shotoff | 10260 | 24154 |
| ... | ... | ... | ... | ... | ... | ... |
| 1 | 2030171 | 4940379 | 19 | shotoff | 8370 | 36130 |
| 2 | 2030171 | 4940624 | 44 | shotoff | 8370 | 34104 |
| 3 | 2030171 | 4940738 | 49 | shotoff | 8558 | 107930 |
| 4 | 2030171 | 4940963 | 71 | shotoff | 8370 | 210065 |
| 5 | 2030171 | 4941072 | 83 | shotoff | 8558 | 629579 |
95303 rows × 6 columns
- Merge the shoton and shotoff tables because they have same columns.
df_match_ingame_shot = pd.concat([df_match_ingame_shoton, df_match_ingame_shotoff])
df_match_ingame_shot
| match_api_id | shot_id | shot_elapsed | shot_type | shot_team_api_id | shot_player1_id | |
|---|---|---|---|---|---|---|
| 0 | 489042 | 378828 | 3 | shoton | 10260 | 24154 |
| 1 | 489042 | 378866 | 7 | shoton | 10260 | 24157 |
| 2 | 489042 | 378922 | 14 | shoton | 10260 | 30829 |
| 3 | 489042 | 378923 | 14 | shoton | 10260 | 30373 |
| 4 | 489042 | 378951 | 17 | shoton | 10260 | 30373 |
| ... | ... | ... | ... | ... | ... | ... |
| 1 | 2030171 | 4940379 | 19 | shotoff | 8370 | 36130 |
| 2 | 2030171 | 4940624 | 44 | shotoff | 8370 | 34104 |
| 3 | 2030171 | 4940738 | 49 | shotoff | 8558 | 107930 |
| 4 | 2030171 | 4940963 | 71 | shotoff | 8370 | 210065 |
| 5 | 2030171 | 4941072 | 83 | shotoff | 8558 | 629579 |
189058 rows × 6 columns
len(df_match_ingame_shot.shot_id.unique())
189058
len(df_match_ingame_goal.goal_id.unique())
39980
len(pd.concat([df_match_ingame_goal.goal_id, df_match_ingame_shot.shot_id]).unique())
229038
- Since goal and shot information are similar, let’s combine two tables.
df_match_ingame_goal
| match_api_id | goal_id | goal_elapsed | goal_team_api_id | goal_player1_id | goal_player2_id | goal_type | goal_subtype | |
|---|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378998 | 22 | 10261 | 37799 | 38807 | n | header |
| 1 | 489042 | 379019 | 24 | 10260 | 24148 | 24154 | n | shot |
| 0 | 489043 | 375546 | 4 | 9825 | 26181 | 39297 | n | shot |
| 0 | 489044 | 378041 | 83 | 8650 | 30853 | 30889 | n | distance |
| 0 | 489045 | 376060 | 4 | 8654 | 23139 | 36394 | n | shot |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640015 | 71 | 10192 | 37554 | NaN | n | NaN |
| 0 | 1992229 | 5639993 | 58 | 9824 | 493418 | NaN | n | NaN |
| 1 | 1992229 | 5640008 | 67 | 10243 | 197757 | NaN | n | NaN |
| 2 | 1992229 | 5640010 | 69 | 10243 | 198082 | NaN | n | NaN |
| 3 | 1992229 | 5640020 | 76 | 10243 | 121080 | NaN | n | NaN |
39980 rows × 8 columns
df_match_ingame_goal = df_match_ingame_goal.rename(columns = {"goal_id": "event_id",
"goal_elapsed": "elapsed",
"goal_team_api_id": "team_api_id",
"goal_player1_id": "player1_api_id",
"goal_player2_id": "player2_api_id",
"goal_type": "type",
"goal_subtype": "subtype"})
df_match_ingame_goal["category"] = "goal"
df_match_ingame_goal= df_match_ingame_goal[["match_api_id", "event_id", "elapsed", "team_api_id", "category",
"type", "subtype", "player1_api_id", "player2_api_id"]]
df_match_ingame_goal
| match_api_id | event_id | elapsed | team_api_id | category | type | subtype | player1_api_id | player2_api_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378998 | 22 | 10261 | goal | n | header | 37799 | 38807 |
| 1 | 489042 | 379019 | 24 | 10260 | goal | n | shot | 24148 | 24154 |
| 0 | 489043 | 375546 | 4 | 9825 | goal | n | shot | 26181 | 39297 |
| 0 | 489044 | 378041 | 83 | 8650 | goal | n | distance | 30853 | 30889 |
| 0 | 489045 | 376060 | 4 | 8654 | goal | n | shot | 23139 | 36394 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640015 | 71 | 10192 | goal | n | NaN | 37554 | NaN |
| 0 | 1992229 | 5639993 | 58 | 9824 | goal | n | NaN | 493418 | NaN |
| 1 | 1992229 | 5640008 | 67 | 10243 | goal | n | NaN | 197757 | NaN |
| 2 | 1992229 | 5640010 | 69 | 10243 | goal | n | NaN | 198082 | NaN |
| 3 | 1992229 | 5640020 | 76 | 10243 | goal | n | NaN | 121080 | NaN |
39980 rows × 9 columns
df_match_ingame_shot = df_match_ingame_shot.rename(columns = {"shot_id": "event_id",
"shot_elapsed": "elapsed",
"shot_team_api_id": "team_api_id",
"shot_player1_id": "player1_api_id",
"shot_type": "type"})
df_match_ingame_shot["category"] = "shot"
df_match_ingame_shot["subtype"] = np.nan
df_match_ingame_shot["player2_api_id"] = np.nan
df_match_ingame_shot= df_match_ingame_shot[["match_api_id", "event_id", "elapsed", "team_api_id", "category",
"type", "subtype", "player1_api_id", "player2_api_id"]]
df_match_ingame_shot
| match_api_id | event_id | elapsed | team_api_id | category | type | subtype | player1_api_id | player2_api_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378828 | 3 | 10260 | shot | shoton | NaN | 24154 | NaN |
| 1 | 489042 | 378866 | 7 | 10260 | shot | shoton | NaN | 24157 | NaN |
| 2 | 489042 | 378922 | 14 | 10260 | shot | shoton | NaN | 30829 | NaN |
| 3 | 489042 | 378923 | 14 | 10260 | shot | shoton | NaN | 30373 | NaN |
| 4 | 489042 | 378951 | 17 | 10260 | shot | shoton | NaN | 30373 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1 | 2030171 | 4940379 | 19 | 8370 | shot | shotoff | NaN | 36130 | NaN |
| 2 | 2030171 | 4940624 | 44 | 8370 | shot | shotoff | NaN | 34104 | NaN |
| 3 | 2030171 | 4940738 | 49 | 8558 | shot | shotoff | NaN | 107930 | NaN |
| 4 | 2030171 | 4940963 | 71 | 8370 | shot | shotoff | NaN | 210065 | NaN |
| 5 | 2030171 | 4941072 | 83 | 8558 | shot | shotoff | NaN | 629579 | NaN |
189058 rows × 9 columns
df_match_ingame_shot = pd.concat([df_match_ingame_goal, df_match_ingame_shot])
df_match_ingame_shot
| match_api_id | event_id | elapsed | team_api_id | category | type | subtype | player1_api_id | player2_api_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378998 | 22 | 10261 | goal | n | header | 37799 | 38807 |
| 1 | 489042 | 379019 | 24 | 10260 | goal | n | shot | 24148 | 24154 |
| 0 | 489043 | 375546 | 4 | 9825 | goal | n | shot | 26181 | 39297 |
| 0 | 489044 | 378041 | 83 | 8650 | goal | n | distance | 30853 | 30889 |
| 0 | 489045 | 376060 | 4 | 8654 | goal | n | shot | 23139 | 36394 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1 | 2030171 | 4940379 | 19 | 8370 | shot | shotoff | NaN | 36130 | NaN |
| 2 | 2030171 | 4940624 | 44 | 8370 | shot | shotoff | NaN | 34104 | NaN |
| 3 | 2030171 | 4940738 | 49 | 8558 | shot | shotoff | NaN | 107930 | NaN |
| 4 | 2030171 | 4940963 | 71 | 8370 | shot | shotoff | NaN | 210065 | NaN |
| 5 | 2030171 | 4941072 | 83 | 8558 | shot | shotoff | NaN | 629579 | NaN |
229038 rows × 9 columns
df_match_ingame_shot.to_csv("../data/df_match_ingame_shot.csv", index = False)
- Extract the xml data from the foulcommit and card columns.
df_match_ingame_foul = pd.DataFrame(columns = ["match_api_id", "foul_id", "foul_elapsed", "foul_type", "foul_team_api_id",
"foul_player1_id", "foul_player2_id"])
for match_api_id in df_match_ingame[~df_match_ingame.foulcommit.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].foulcommit.values[0]
foul_id = []
foul_elapsed = []
foul_type = []
foul_team_id = []
foul_player1_id = []
foul_player2_id = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": foul_id.append(child.text)
elif child.tag == "elapsed": foul_elapsed.append(child.text)
elif child.tag == "type": foul_type.append(child.text)
elif child.tag == "team": foul_team_id.append(child.text)
elif child.tag == "player1": foul_player1_id.append(child.text)
elif child.tag == "player2": foul_player2_id.append(child.text)
if len(foul_id) != num_elmt: foul_id.append(np.nan)
if len(foul_elapsed) != num_elmt: foul_elapsed.append(np.nan)
if len(foul_type) != num_elmt: foul_type.append(np.nan)
if len(foul_team_id) != num_elmt: foul_team_id.append(np.nan)
if len(foul_player1_id) != num_elmt: foul_player1_id.append(np.nan)
if len(foul_player2_id) != num_elmt: foul_player2_id.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(foul_id),
"foul_id": foul_id,
"foul_elapsed": foul_elapsed,
"foul_type": foul_type,
"foul_team_api_id": foul_team_id,
"foul_player1_id": foul_player1_id,
"foul_player2_id": foul_player2_id,})
df_match_ingame_foul = pd.concat([df_match_ingame_foul, temp])
df_match_ingame_foul
| match_api_id | foul_id | foul_elapsed | foul_type | foul_team_api_id | foul_player1_id | foul_player2_id | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378824 | 1 | foulcommit | 10261 | 25518 | 32569 |
| 1 | 489042 | 378826 | 2 | foulcommit | 10261 | 30929 | 24157 |
| 2 | 489042 | 378830 | 3 | foulcommit | 10261 | 29581 | 24148 |
| 3 | 489042 | 378841 | 5 | foulcommit | 10260 | 30373 | 40565 |
| 4 | 489042 | 378894 | 10 | foulcommit | 10261 | 29581 | 30829 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21 | 2030171 | 4941024 | 77 | foulcommit | 8558 | 213814 | 40220 |
| 22 | 2030171 | 4941052 | 79 | foulcommit | 8558 | 452243 | 75004 |
| 23 | 2030171 | 4941084 | 81 | foulcommit | 8558 | 629579 | 75004 |
| 24 | 2030171 | 4941106 | 88 | foulcommit | 8370 | 210065 | 37654 |
| 25 | 2030171 | 4941145 | 90 | foulcommit | 8370 | 183449 | NaN |
219234 rows × 7 columns
df_match_ingame_card = pd.DataFrame(columns = ["match_api_id", "card_id", "card_elapsed", "card_type", "card_team_api_id", "card_player1_id", "card_player2_id"])
for match_api_id in df_match_ingame[~df_match_ingame.card.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].card.values[0]
card_id = []
card_elapsed = []
card_type = []
card_team_id = []
card_player1_id = []
card_player2_id = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": card_id.append(child.text)
elif child.tag == "elapsed": card_elapsed.append(child.text)
elif child.tag == "card_type": card_type.append(child.text)
elif child.tag == "team": card_team_id.append(child.text)
elif child.tag == "player1": card_player1_id.append(child.text)
elif child.tag == "player2": card_player2_id.append(child.text)
if len(card_id) != num_elmt: card_id.append(np.nan)
if len(card_elapsed) != num_elmt: card_elapsed.append(np.nan)
if len(card_type) != num_elmt: card_type.append(np.nan)
if len(card_team_id) != num_elmt: card_team_id.append(np.nan)
if len(card_player1_id) != num_elmt: card_player1_id.append(np.nan)
if len(card_player2_id) != num_elmt: card_player2_id.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(card_id),
"card_id": card_id,
"card_elapsed": card_elapsed,
"card_type": card_type,
"card_team_api_id": card_team_id,
"card_player1_id": card_player1_id,
"card_player2_id": card_player2_id,})
df_match_ingame_card = pd.concat([df_match_ingame_card, temp])
df_match_ingame_card
| match_api_id | card_id | card_elapsed | card_type | card_team_api_id | card_player1_id | card_player2_id | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 379481 | 78 | y | 10260 | 24157 | NaN |
| 1 | 489042 | 379503 | 82 | y | 10260 | 30362 | NaN |
| 2 | 489042 | 379547 | 90 | y | 10260 | 30829 | NaN |
| 0 | 489044 | 377978 | 56 | y | 8650 | 37442 | NaN |
| 1 | 489044 | 378060 | 90 | y | 8650 | 46621 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640034 | 82 | y | 10192 | 25815 | NaN |
| 0 | 1992229 | 5639986 | 52 | y | 10243 | 198082 | NaN |
| 1 | 1992229 | 5639991 | 56 | y | 10243 | 121080 | NaN |
| 2 | 1992229 | 5639996 | 60 | y | 9824 | 214344 | NaN |
| 3 | 1992229 | 5639999 | 61 | y | 9824 | 294256 | NaN |
62114 rows × 7 columns
- Since foul and card have similar information, let’s combine two tables.
df_match_ingame_foul = df_match_ingame_foul.rename(columns = {"foul_id": "event_id",
"foul_elapsed": "elapsed",
"foul_team_api_id": "team_api_id",
"foul_player1_id": "player1_api_id",
"foul_player2_id": "player2_api_id"})
df_match_ingame_foul
| match_api_id | event_id | elapsed | foul_type | team_api_id | player1_api_id | player2_api_id | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378824 | 1 | foulcommit | 10261 | 25518 | 32569 |
| 1 | 489042 | 378826 | 2 | foulcommit | 10261 | 30929 | 24157 |
| 2 | 489042 | 378830 | 3 | foulcommit | 10261 | 29581 | 24148 |
| 3 | 489042 | 378841 | 5 | foulcommit | 10260 | 30373 | 40565 |
| 4 | 489042 | 378894 | 10 | foulcommit | 10261 | 29581 | 30829 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21 | 2030171 | 4941024 | 77 | foulcommit | 8558 | 213814 | 40220 |
| 22 | 2030171 | 4941052 | 79 | foulcommit | 8558 | 452243 | 75004 |
| 23 | 2030171 | 4941084 | 81 | foulcommit | 8558 | 629579 | 75004 |
| 24 | 2030171 | 4941106 | 88 | foulcommit | 8370 | 210065 | 37654 |
| 25 | 2030171 | 4941145 | 90 | foulcommit | 8370 | 183449 | NaN |
219234 rows × 7 columns
df_match_ingame_card
| match_api_id | card_id | card_elapsed | card_type | card_team_api_id | card_player1_id | card_player2_id | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 379481 | 78 | y | 10260 | 24157 | NaN |
| 1 | 489042 | 379503 | 82 | y | 10260 | 30362 | NaN |
| 2 | 489042 | 379547 | 90 | y | 10260 | 30829 | NaN |
| 0 | 489044 | 377978 | 56 | y | 8650 | 37442 | NaN |
| 1 | 489044 | 378060 | 90 | y | 8650 | 46621 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640034 | 82 | y | 10192 | 25815 | NaN |
| 0 | 1992229 | 5639986 | 52 | y | 10243 | 198082 | NaN |
| 1 | 1992229 | 5639991 | 56 | y | 10243 | 121080 | NaN |
| 2 | 1992229 | 5639996 | 60 | y | 9824 | 214344 | NaN |
| 3 | 1992229 | 5639999 | 61 | y | 9824 | 294256 | NaN |
62114 rows × 7 columns
df_match_ingame_card = df_match_ingame_card.rename(columns = {"card_id": "event_id",
"card_elapsed": "elapsed",
"card_type": "foul_type",
"card_team_api_id": "team_api_id",
"card_player1_id": "player1_api_id",
"card_player2_id": "player2_api_id"})
df_match_ingame_card
| match_api_id | event_id | elapsed | foul_type | team_api_id | player1_api_id | player2_api_id | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 379481 | 78 | y | 10260 | 24157 | NaN |
| 1 | 489042 | 379503 | 82 | y | 10260 | 30362 | NaN |
| 2 | 489042 | 379547 | 90 | y | 10260 | 30829 | NaN |
| 0 | 489044 | 377978 | 56 | y | 8650 | 37442 | NaN |
| 1 | 489044 | 378060 | 90 | y | 8650 | 46621 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640034 | 82 | y | 10192 | 25815 | NaN |
| 0 | 1992229 | 5639986 | 52 | y | 10243 | 198082 | NaN |
| 1 | 1992229 | 5639991 | 56 | y | 10243 | 121080 | NaN |
| 2 | 1992229 | 5639996 | 60 | y | 9824 | 214344 | NaN |
| 3 | 1992229 | 5639999 | 61 | y | 9824 | 294256 | NaN |
62114 rows × 7 columns
df_match_ingame_foul = pd.concat([df_match_ingame_foul, df_match_ingame_card])
df_match_ingame_foul
| match_api_id | event_id | elapsed | foul_type | team_api_id | player1_api_id | player2_api_id | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378824 | 1 | foulcommit | 10261 | 25518 | 32569 |
| 1 | 489042 | 378826 | 2 | foulcommit | 10261 | 30929 | 24157 |
| 2 | 489042 | 378830 | 3 | foulcommit | 10261 | 29581 | 24148 |
| 3 | 489042 | 378841 | 5 | foulcommit | 10260 | 30373 | 40565 |
| 4 | 489042 | 378894 | 10 | foulcommit | 10261 | 29581 | 30829 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1992228 | 5640034 | 82 | y | 10192 | 25815 | NaN |
| 0 | 1992229 | 5639986 | 52 | y | 10243 | 198082 | NaN |
| 1 | 1992229 | 5639991 | 56 | y | 10243 | 121080 | NaN |
| 2 | 1992229 | 5639996 | 60 | y | 9824 | 214344 | NaN |
| 3 | 1992229 | 5639999 | 61 | y | 9824 | 294256 | NaN |
281348 rows × 7 columns
df_match_ingame_foul.to_csv("../data/df_match_ingame_foul.csv", index = False)
- Extract the xml data from the cross and corner columns.
df_match_ingame_cross = pd.DataFrame(columns = ["match_api_id", "cross_id", "cross_elapsed", "cross_type", "cross_team_api_id", "cross_player1_id"])
for match_api_id in df_match_ingame[~df_match_ingame.cross.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].cross.values[0]
cross_id = []
cross_elapsed = []
cross_type = []
cross_team_id = []
cross_player1_id = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": cross_id.append(child.text)
elif child.tag == "elapsed": cross_elapsed.append(child.text)
elif child.tag == "type": cross_type.append(child.text)
elif child.tag == "team": cross_team_id.append(child.text)
elif child.tag == "player1": cross_player1_id.append(child.text)
if len(cross_id) != num_elmt: cross_id.append(np.nan)
if len(cross_elapsed) != num_elmt: cross_elapsed.append(np.nan)
if len(cross_type) != num_elmt: cross_type.append(np.nan)
if len(cross_team_id) != num_elmt: cross_team_id.append(np.nan)
if len(cross_player1_id) != num_elmt: cross_player1_id.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(cross_id),
"cross_id": cross_id,
"cross_elapsed": cross_elapsed,
"cross_type": cross_type,
"cross_team_api_id": cross_team_id,
"cross_player1_id": cross_player1_id})
df_match_ingame_cross = pd.concat([df_match_ingame_cross, temp])
df_match_ingame_cross
| match_api_id | cross_id | cross_elapsed | cross_type | cross_team_api_id | cross_player1_id | |
|---|---|---|---|---|---|---|
| 0 | 489042 | 378863 | 7 | cross | 10260 | 30829 |
| 1 | 489042 | 378921 | 14 | cross | 10260 | 24148 |
| 2 | 489042 | 378960 | 19 | corner | 10261 | 38807 |
| 3 | 489042 | 378970 | 19 | cross | 10261 | 30929 |
| 4 | 489042 | 378978 | 20 | cross | 10260 | 24148 |
| ... | ... | ... | ... | ... | ... | ... |
| 23 | 2030171 | 4940945 | 69 | cross | 8370 | 210065 |
| 24 | 2030171 | 4940996 | 74 | cross | 8370 | 46808 |
| 25 | 2030171 | 4941050 | 80 | cross | 8370 | 210065 |
| 26 | 2030171 | 4941089 | 85 | cross | 8370 | 40220 |
| 27 | 2030171 | 4941100 | 87 | cross | 8370 | 40220 |
284775 rows × 6 columns
df_match_ingame_cross.cross_type.unique()
array(['cross', 'corner', 'throwin'], dtype=object)
- It seems that cross information alread include the corner information.
df_match_ingame_corner = pd.DataFrame(columns = ["match_api_id", "corner_id", "corner_elapsed", "corner_type", "corner_team_api_id", "corner_player1_id"])
for match_api_id in df_match_ingame[~df_match_ingame.corner.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].corner.values[0]
corner_id = []
corner_elapsed = []
corner_type = []
corner_team_id = []
corner_player1_id = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": corner_id.append(child.text)
elif child.tag == "elapsed": corner_elapsed.append(child.text)
elif child.tag == "type": corner_type.append(child.text)
elif child.tag == "team": corner_team_id.append(child.text)
elif child.tag == "player1": corner_player1_id.append(child.text)
if len(corner_id) != num_elmt: corner_id.append(np.nan)
if len(corner_elapsed) != num_elmt: corner_elapsed.append(np.nan)
if len(corner_type) != num_elmt: corner_type.append(np.nan)
if len(corner_team_id) != num_elmt: corner_team_id.append(np.nan)
if len(corner_player1_id) != num_elmt: corner_player1_id.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(corner_id),
"corner_id": corner_id,
"corner_elapsed": corner_elapsed,
"corner_type": corner_type,
"corner_team_api_id": corner_team_id,
"corner_player1_id": corner_player1_id})
df_match_ingame_corner = pd.concat([df_match_ingame_corner, temp])
df_match_ingame_corner
| match_api_id | corner_id | corner_elapsed | corner_type | corner_team_api_id | corner_player1_id | |
|---|---|---|---|---|---|---|
| 0 | 489042 | 378960 | 19 | corner | 10261 | 38807 |
| 1 | 489042 | 378992 | 22 | corner | 10261 | 40565 |
| 2 | 489042 | 378996 | 22 | corner | 10261 | 38807 |
| 3 | 489042 | 379352 | 48 | corner | 10260 | 24154 |
| 4 | 489042 | 379366 | 51 | corner | 10261 | 38807 |
| ... | ... | ... | ... | ... | ... | ... |
| 6 | 2030171 | 4940619 | 44 | corner | 8370 | 36130 |
| 7 | 2030171 | 4940734 | 49 | corner | 8558 | 498033 |
| 8 | 2030171 | 4940742 | 50 | corner | 8558 | 498033 |
| 9 | 2030171 | 4940990 | 73 | corner | 8558 | 498033 |
| 10 | 2030171 | 4941137 | 90 | corner | 8370 | 75004 |
87839 rows × 6 columns
temp = df_match_ingame_cross.merge(df_match_ingame_corner.rename(columns = {"corner_elapsed": "cross_elapsed"}), how = "left", on = ["match_api_id", "cross_elapsed"])
temp
| match_api_id | cross_id | cross_elapsed | cross_type | cross_team_api_id | cross_player1_id | corner_id | corner_type | corner_team_api_id | corner_player1_id | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 489042 | 378863 | 7 | cross | 10260 | 30829 | NaN | NaN | NaN | NaN |
| 1 | 489042 | 378921 | 14 | cross | 10260 | 24148 | NaN | NaN | NaN | NaN |
| 2 | 489042 | 378960 | 19 | corner | 10261 | 38807 | 378960 | corner | 10261 | 38807 |
| 3 | 489042 | 378970 | 19 | cross | 10261 | 30929 | 378960 | corner | 10261 | 38807 |
| 4 | 489042 | 378978 | 20 | cross | 10260 | 24148 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 296857 | 2030171 | 4940945 | 69 | cross | 8370 | 210065 | NaN | NaN | NaN | NaN |
| 296858 | 2030171 | 4940996 | 74 | cross | 8370 | 46808 | NaN | NaN | NaN | NaN |
| 296859 | 2030171 | 4941050 | 80 | cross | 8370 | 210065 | NaN | NaN | NaN | NaN |
| 296860 | 2030171 | 4941089 | 85 | cross | 8370 | 40220 | NaN | NaN | NaN | NaN |
| 296861 | 2030171 | 4941100 | 87 | cross | 8370 | 40220 | NaN | NaN | NaN | NaN |
296862 rows × 10 columns
len(df_match_ingame_corner.corner_id.unique())
87839
len(temp.corner_id.unique())
63736
- 63736 corner information out of 87839 corner information alread included in the cross information.
- So we will only use the cross information combined with corner information that are not in the cross information.
temp = df_match_ingame_corner[~df_match_ingame_corner.corner_id.isin(df_match_ingame_cross.cross_id)]
temp.columns = df_match_ingame_cross.columns
temp
| match_api_id | cross_id | cross_elapsed | cross_type | cross_team_api_id | cross_player1_id | |
|---|---|---|---|---|---|---|
| 1 | 489042 | 378992 | 22 | corner | 10261 | 40565 |
| 6 | 489042 | 379411 | 59 | corner | 10260 | 24148 |
| 0 | 489043 | 375535 | 3 | corner | 9825 | 39297 |
| 2 | 489043 | 375562 | 9 | corner | 9825 | 30960 |
| 6 | 489045 | 376655 | 40 | corner | 8528 | 35327 |
| ... | ... | ... | ... | ... | ... | ... |
| 6 | 2030171 | 4940619 | 44 | corner | 8370 | 36130 |
| 7 | 2030171 | 4940734 | 49 | corner | 8558 | 498033 |
| 8 | 2030171 | 4940742 | 50 | corner | 8558 | 498033 |
| 9 | 2030171 | 4940990 | 73 | corner | 8558 | 498033 |
| 10 | 2030171 | 4941137 | 90 | corner | 8370 | 75004 |
45588 rows × 6 columns
df_match_ingame_cross = pd.concat([df_match_ingame_cross, temp])
df_match_ingame_cross = df_match_ingame_cross.rename(columns = {"cross_id": "event_id",
"cross_elapsed": "elapsed",
"cross_team_api_id": "team_api_id",
"cross_player1_id": "player1_api_id"})
df_match_ingame_cross
| match_api_id | event_id | elapsed | cross_type | team_api_id | player1_api_id | |
|---|---|---|---|---|---|---|
| 0 | 489042 | 378863 | 7 | cross | 10260 | 30829 |
| 1 | 489042 | 378921 | 14 | cross | 10260 | 24148 |
| 2 | 489042 | 378960 | 19 | corner | 10261 | 38807 |
| 3 | 489042 | 378970 | 19 | cross | 10261 | 30929 |
| 4 | 489042 | 378978 | 20 | cross | 10260 | 24148 |
| ... | ... | ... | ... | ... | ... | ... |
| 6 | 2030171 | 4940619 | 44 | corner | 8370 | 36130 |
| 7 | 2030171 | 4940734 | 49 | corner | 8558 | 498033 |
| 8 | 2030171 | 4940742 | 50 | corner | 8558 | 498033 |
| 9 | 2030171 | 4940990 | 73 | corner | 8558 | 498033 |
| 10 | 2030171 | 4941137 | 90 | corner | 8370 | 75004 |
330363 rows × 6 columns
df_match_ingame_cross.to_csv("../data/df_match_ingame_cross.csv", index = False)
- Extract the xml data from the possesion column.
df_match_ingame_possesion = pd.DataFrame(columns = ["match_api_id", "pos_id", "pos_elapsed", "pos_home", "pos_away"])
for match_api_id in df_match_ingame[~df_match_ingame.possession.isna()].match_api_id.unique():
xml = df_match_ingame[df_match_ingame.match_api_id == match_api_id].possession.values[0]
pos_id = []
pos_elapsed = []
pos_home = []
pos_away = []
num_elmt = 0
for item in ET.fromstring(xml).findall(".value"):
num_elmt += 1
for child in item:
if child.tag == "id": pos_id.append(child.text)
elif child.tag == "elapsed": pos_elapsed.append(child.text)
elif child.tag == "homepos": pos_home.append(child.text)
elif child.tag == "awaypos": pos_away.append(child.text)
if len(pos_id) != num_elmt: pos_id.append(np.nan)
if len(pos_elapsed) != num_elmt: pos_elapsed.append(np.nan)
if len(pos_home) != num_elmt: pos_home.append(np.nan)
if len(pos_away) != num_elmt: pos_away.append(np.nan)
temp = pd.DataFrame({"match_api_id": [match_api_id] * len(pos_id),
"pos_id": pos_id,
"pos_elapsed": pos_elapsed,
"pos_home": pos_home,
"pos_away": pos_away})
df_match_ingame_possesion = pd.concat([df_match_ingame_possesion, temp])
df_match_ingame_possesion = df_match_ingame_possesion.rename(columns = {"pos_id": "event_id",
"pos_elapsed": "elapsed",
"pos_home": "home_possession",
"pos_away": "away_possession"})
df_match_ingame_possesion
| match_api_id | event_id | elapsed | home_possession | away_possession | |
|---|---|---|---|---|---|
| 0 | 489042 | 379029 | 25 | 56 | 44 |
| 1 | 489042 | 379251 | 45 | 54 | 46 |
| 2 | 489042 | 379443 | 70 | 54 | 46 |
| 3 | 489042 | 379575 | 90 | 55 | 45 |
| 0 | 489043 | 375608 | 27 | 65 | 35 |
| ... | ... | ... | ... | ... | ... |
| 3 | 2030170 | 4947481 | 85 | 45 | 55 |
| 0 | 2030171 | 4940420 | 22 | 59 | 41 |
| 1 | 2030171 | 4940639 | 45 | 61 | 39 |
| 2 | 2030171 | 4940941 | 68 | 60 | 40 |
| 3 | 2030171 | 4941128 | 90 | 62 | 38 |
34815 rows × 5 columns
- Since we need home team and away team api id, find and combine team ids by the match api id.
df_match_basic
| match_api_id | country_id | league_id | season | stage | match_date | home_team_api_id | away_team_api_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 492473 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 | 9987 | 9993 |
| 1 | 492474 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 10000 | 9994 |
| 2 | 492475 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 9984 | 8635 |
| 3 | 492476 | 1 | 1 | 2008/2009 | 1 | 2008-08-17 | 9991 | 9998 |
| 4 | 492477 | 1 | 1 | 2008/2009 | 1 | 2008-08-16 | 7947 | 9985 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25974 | 1992091 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-22 | 10190 | 10191 |
| 25975 | 1992092 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 9824 | 10199 |
| 25976 | 1992093 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 9956 | 10179 |
| 25977 | 1992094 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-22 | 7896 | 10243 |
| 25978 | 1992095 | 24558 | 24558 | 2015/2016 | 9 | 2015-09-23 | 10192 | 9931 |
25979 rows × 8 columns
df_match_ingame_possesion = df_match_ingame_possesion.merge(df_match_basic[["match_api_id", "home_team_api_id", "away_team_api_id"]], how = "left", on = "match_api_id") \
[["match_api_id", "event_id", "elapsed", "home_team_api_id", "home_possession", "away_team_api_id", "away_possession"]]
df_match_ingame_possesion
| match_api_id | event_id | elapsed | home_team_api_id | home_possession | away_team_api_id | away_possession | |
|---|---|---|---|---|---|---|---|
| 0 | 489042 | 379029 | 25 | 10260 | 56 | 10261 | 44 |
| 1 | 489042 | 379251 | 45 | 10260 | 54 | 10261 | 46 |
| 2 | 489042 | 379443 | 70 | 10260 | 54 | 10261 | 46 |
| 3 | 489042 | 379575 | 90 | 10260 | 55 | 10261 | 45 |
| 4 | 489043 | 375608 | 27 | 9825 | 65 | 8659 | 35 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 34810 | 2030170 | 4947481 | 85 | 7878 | 45 | 8603 | 55 |
| 34811 | 2030171 | 4940420 | 22 | 8370 | 59 | 8558 | 41 |
| 34812 | 2030171 | 4940639 | 45 | 8370 | 61 | 8558 | 39 |
| 34813 | 2030171 | 4940941 | 68 | 8370 | 60 | 8558 | 40 |
| 34814 | 2030171 | 4941128 | 90 | 8370 | 62 | 8558 | 38 |
34815 rows × 7 columns
df_match_ingame_possesion.to_csv("../data/df_match_ingame_possesion.csv", index = False)
2.5. Summary


3. Team related tables: Country, League, Team, Team_Attributes
3.1. Country
org_country = pd.read_sql(
"select * from Country", con
)
org_country
| id | name | |
|---|---|---|
| 0 | 1 | Belgium |
| 1 | 1729 | England |
| 2 | 4769 | France |
| 3 | 7809 | Germany |
| 4 | 10257 | Italy |
| 5 | 13274 | Netherlands |
| 6 | 15722 | Poland |
| 7 | 17642 | Portugal |
| 8 | 19694 | Scotland |
| 9 | 21518 | Spain |
| 10 | 24558 | Switzerland |
- There are total 11 countries.
3.2. League
org_league = pd.read_sql(
"select * from League", con
)
org_league
| id | country_id | name | |
|---|---|---|---|
| 0 | 1 | 1 | Belgium Jupiler League |
| 1 | 1729 | 1729 | England Premier League |
| 2 | 4769 | 4769 | France Ligue 1 |
| 3 | 7809 | 7809 | Germany 1. Bundesliga |
| 4 | 10257 | 10257 | Italy Serie A |
| 5 | 13274 | 13274 | Netherlands Eredivisie |
| 6 | 15722 | 15722 | Poland Ekstraklasa |
| 7 | 17642 | 17642 | Portugal Liga ZON Sagres |
| 8 | 19694 | 19694 | Scotland Premier League |
| 9 | 21518 | 21518 | Spain LIGA BBVA |
| 10 | 24558 | 24558 | Switzerland Super League |
- There are total 11 leagues from different countries.
- Since there is only one league per country, country and league represent the same information.
- Since we can get the country information from the league name, let’s do not use the country and only use the league table later.
3.3. Team
org_team = pd.read_sql(
"select * from Team", con
)
org_team.head()
| id | team_api_id | team_fifa_api_id | team_long_name | team_short_name | |
|---|---|---|---|---|---|
| 0 | 1 | 9987 | 673.0 | KRC Genk | GEN |
| 1 | 2 | 9993 | 675.0 | Beerschot AC | BAC |
| 2 | 3 | 10000 | 15005.0 | SV Zulte-Waregem | ZUL |
| 3 | 4 | 9994 | 2007.0 | Sporting Lokeren | LOK |
| 4 | 5 | 9984 | 1750.0 | KSV Cercle Brugge | CEB |
- Since team_long_name and team_short_name have same information, we will only use the team_long_name.
org_team.shape
(299, 5)
len(org_team.team_api_id.unique())
299
len(org_team.team_fifa_api_id.unique())
286
org_team.isna().sum()
id 0
team_api_id 0
team_fifa_api_id 11
team_long_name 0
team_short_name 0
dtype: int64
- team_api_id and team_fifa_api_id have different number of unique values because team_fifa_api_id has 11 missing values.
- So it is better to use team_api rather than use team_fifa_api_id to merge tables later.
- There are 299 teams.
3.4. Team_Attributes
org_team_attr = pd.read_sql(
"select * from Team_Attributes", con
)
org_team_attr.head()
| id | team_fifa_api_id | team_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | chanceCreationPassingClass | chanceCreationCrossing | chanceCreationCrossingClass | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 434 | 9930 | 2010-02-22 00:00:00 | 60 | Balanced | NaN | Little | 50 | Mixed | Organised | 60 | Normal | 65 | Normal | 55 | Normal | Organised | 50 | Medium | 55 | Press | 45 | Normal | Cover |
| 1 | 2 | 434 | 9930 | 2014-09-19 00:00:00 | 52 | Balanced | 48.0 | Normal | 56 | Mixed | Organised | 54 | Normal | 63 | Normal | 64 | Normal | Organised | 47 | Medium | 44 | Press | 54 | Normal | Cover |
| 2 | 3 | 434 | 9930 | 2015-09-10 00:00:00 | 47 | Balanced | 41.0 | Normal | 54 | Mixed | Organised | 54 | Normal | 63 | Normal | 64 | Normal | Organised | 47 | Medium | 44 | Press | 54 | Normal | Cover |
| 3 | 4 | 77 | 8485 | 2010-02-22 00:00:00 | 70 | Fast | NaN | Little | 70 | Long | Organised | 70 | Risky | 70 | Lots | 70 | Lots | Organised | 60 | Medium | 70 | Double | 70 | Wide | Cover |
| 4 | 5 | 77 | 8485 | 2011-02-22 00:00:00 | 47 | Balanced | NaN | Little | 52 | Mixed | Organised | 53 | Normal | 48 | Normal | 52 | Normal | Organised | 47 | Medium | 47 | Press | 52 | Normal | Cover |
org_team_attr.shape
(1458, 25)
- There are 21 team attributes.
len(org_team_attr.team_api_id.unique())
288
- The number of teams that have team attributes are 288 out of total 299 teams.
- This is because the team attributes were collected by team_fifa_api_id, which has 11 missing values.
- **Note**: Team attributes are important features, so I think matches of teams without team attributes should be excluded from the analysis later.
3.5. Merge all team related tables
team_country_info = pd.concat([df_match_basic[["home_team_api_id", "country_id"]].rename(columns = {"home_team_api_id": "team_api_id"}) ,
df_match_basic[["away_team_api_id", "country_id"]].rename(columns = {"away_team_api_id": "team_api_id"})]).drop_duplicates()
df_team = org_team_attr.merge(team_country_info, how = "left", on = "team_api_id") \
.merge(org_league.rename(columns = {"name": "league_name"}), how = "left", on = "country_id") \
.merge(org_team, how = "left", on = "team_api_id")
df_team.date
0 2010-02-22 00:00:00
1 2014-09-19 00:00:00
2 2015-09-10 00:00:00
3 2010-02-22 00:00:00
4 2011-02-22 00:00:00
...
1453 2011-02-22 00:00:00
1454 2012-02-22 00:00:00
1455 2013-09-20 00:00:00
1456 2014-09-19 00:00:00
1457 2015-09-10 00:00:00
Name: date, Length: 1458, dtype: object
- Since there are other date columns that have different meanings in other tables, let’s change the column name date in this table to the player_measured_date.
df_team["team_measured_date"] = pd.to_datetime(df_team.date)
df_team = df_team[["team_api_id", "team_long_name", "league_name", "team_measured_date",
'buildUpPlaySpeed', 'buildUpPlaySpeedClass', 'buildUpPlayDribbling','buildUpPlayDribblingClass',
'buildUpPlayPassing', 'buildUpPlayPassingClass', 'buildUpPlayPositioningClass', 'chanceCreationPassing',
'chanceCreationPassingClass','chanceCreationCrossing', 'chanceCreationCrossingClass', 'chanceCreationShooting',
'chanceCreationShootingClass','chanceCreationPositioningClass', 'defencePressure', 'defencePressureClass',
'defenceAggression', 'defenceAggressionClass', 'defenceTeamWidth', 'defenceTeamWidthClass',
'defenceDefenderLineClass']]
df_team
| team_api_id | team_long_name | league_name | team_measured_date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | chanceCreationPassingClass | chanceCreationCrossing | chanceCreationCrossingClass | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9930 | FC Aarau | Switzerland Super League | 2010-02-22 | 60 | Balanced | NaN | Little | 50 | Mixed | Organised | 60 | Normal | 65 | Normal | 55 | Normal | Organised | 50 | Medium | 55 | Press | 45 | Normal | Cover |
| 1 | 9930 | FC Aarau | Switzerland Super League | 2014-09-19 | 52 | Balanced | 48.0 | Normal | 56 | Mixed | Organised | 54 | Normal | 63 | Normal | 64 | Normal | Organised | 47 | Medium | 44 | Press | 54 | Normal | Cover |
| 2 | 9930 | FC Aarau | Switzerland Super League | 2015-09-10 | 47 | Balanced | 41.0 | Normal | 54 | Mixed | Organised | 54 | Normal | 63 | Normal | 64 | Normal | Organised | 47 | Medium | 44 | Press | 54 | Normal | Cover |
| 3 | 8485 | Aberdeen | Scotland Premier League | 2010-02-22 | 70 | Fast | NaN | Little | 70 | Long | Organised | 70 | Risky | 70 | Lots | 70 | Lots | Organised | 60 | Medium | 70 | Double | 70 | Wide | Cover |
| 4 | 8485 | Aberdeen | Scotland Premier League | 2011-02-22 | 47 | Balanced | NaN | Little | 52 | Mixed | Organised | 53 | Normal | 48 | Normal | 52 | Normal | Organised | 47 | Medium | 47 | Press | 52 | Normal | Cover |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1453 | 10000 | SV Zulte-Waregem | Belgium Jupiler League | 2011-02-22 | 52 | Balanced | NaN | Little | 52 | Mixed | Organised | 52 | Normal | 48 | Normal | 53 | Normal | Organised | 46 | Medium | 48 | Press | 53 | Normal | Cover |
| 1454 | 10000 | SV Zulte-Waregem | Belgium Jupiler League | 2012-02-22 | 54 | Balanced | NaN | Little | 51 | Mixed | Organised | 47 | Normal | 52 | Normal | 50 | Normal | Organised | 44 | Medium | 55 | Press | 53 | Normal | Cover |
| 1455 | 10000 | SV Zulte-Waregem | Belgium Jupiler League | 2013-09-20 | 54 | Balanced | NaN | Little | 51 | Mixed | Organised | 47 | Normal | 52 | Normal | 32 | Little | Organised | 44 | Medium | 58 | Press | 37 | Normal | Cover |
| 1456 | 10000 | SV Zulte-Waregem | Belgium Jupiler League | 2014-09-19 | 54 | Balanced | 42.0 | Normal | 51 | Mixed | Organised | 47 | Normal | 52 | Normal | 32 | Little | Organised | 44 | Medium | 58 | Press | 37 | Normal | Cover |
| 1457 | 10000 | SV Zulte-Waregem | Belgium Jupiler League | 2015-09-10 | 54 | Balanced | 42.0 | Normal | 51 | Mixed | Organised | 47 | Normal | 52 | Normal | 32 | Little | Organised | 44 | Medium | 58 | Press | 37 | Normal | Cover |
1458 rows × 25 columns
team_api_id_grouped_count = df_team.groupby(["league_name", "team_long_name"]).team_api_id.nunique().reset_index()
team_api_id_grouped_count[team_api_id_grouped_count.team_api_id != 1]
| league_name | team_long_name | team_api_id | |
|---|---|---|---|
| 15 | Belgium Jupiler League | Royal Excel Mouscron | 2 |
| 191 | Poland Ekstraklasa | Polonia Bytom | 2 |
| 193 | Poland Ekstraklasa | Widzew Łódź | 2 |
- Each of the 3 teams has 2 different team_api_id. Let’s use only one of the two team_api_id
df_team.loc[df_team.team_long_name == "Royal Excel Mouscron", "team_api_id"] = 9996 # 274581 -> 9996
df_team.loc[df_team.team_long_name == "Polonia Bytom", "team_api_id"] = 8020 # 8031 -> 8020
df_team.loc[df_team.team_long_name == "Widzew Łódź", "team_api_id"] = 8024 # 8244 -> 8024
df_team = df_team.drop_duplicates()
team_api_id_grouped_count = df_team.groupby(["league_name", "team_long_name"]).team_api_id.nunique().reset_index()
team_api_id_grouped_count[team_api_id_grouped_count.team_api_id != 1]
| league_name | team_long_name | team_api_id |
|---|
- Now, all team has only one team_api_id.
- Save the changed df_team table.
df_team.to_csv("../data/df_team.csv", index = False)
- Also, fix the above team_api_ids in the all other tables.
df_match_basic.loc[df_match_basic.home_team_api_id == 274581, "home_team_api_id"] = 9996
df_match_basic.loc[df_match_basic.away_team_api_id == 274581, "away_team_api_id"] = 9996
df_match_basic.loc[df_match_basic.home_team_api_id == 8031, "home_team_api_id"] = 8020
df_match_basic.loc[df_match_basic.away_team_api_id == 8031, "away_team_api_id"] = 8020
df_match_basic.loc[df_match_basic.home_team_api_id == 8244, "home_team_api_id"] = 8024
df_match_basic.loc[df_match_basic.away_team_api_id == 8244, "away_team_api_id"] = 8024
df_match_basic.to_csv("../data/df_match_basic.csv", index = False)
df_match_ingame_cross.loc[df_match_ingame_cross.team_api_id == 274581, "team_api_id"] = 9996
df_match_ingame_cross.loc[df_match_ingame_cross.team_api_id == 8031, "team_api_id"] = 8020
df_match_ingame_cross.loc[df_match_ingame_cross.team_api_id == 8244, "team_api_id"] = 8024
df_match_ingame_cross.to_csv("../data/df_match_ingame_cross.csv", index = False)
df_match_ingame_foul.loc[df_match_ingame_foul.team_api_id == 274581, "team_api_id"] = 9996
df_match_ingame_foul.loc[df_match_ingame_foul.team_api_id == 8031, "team_api_id"] = 8020
df_match_ingame_foul.loc[df_match_ingame_foul.team_api_id == 8244, "team_api_id"] = 8024
df_match_ingame_foul.to_csv("../data/df_match_ingame_foul.csv", index = False)
df_match_ingame_possesion.loc[df_match_ingame_possesion.home_team_api_id == 274581, "home_team_api_id"] = 9996
df_match_ingame_possesion.loc[df_match_ingame_possesion.away_team_api_id == 274581, "away_team_api_id"] = 9996
df_match_ingame_possesion.loc[df_match_ingame_possesion.home_team_api_id == 8031, "home_team_api_id"] = 8020
df_match_ingame_possesion.loc[df_match_ingame_possesion.away_team_api_id == 8031, "away_team_api_id"] = 8020
df_match_ingame_possesion.loc[df_match_ingame_possesion.home_team_api_id == 8244, "home_team_api_id"] = 8024
df_match_ingame_possesion.loc[df_match_ingame_possesion.away_team_api_id == 8244, "away_team_api_id"] = 8024
df_match_ingame_possesion.to_csv("../data/df_match_ingame_possesion.csv", index = False)
df_match_ingame_shot.loc[df_match_ingame_shot.team_api_id == 274581, "team_api_id"] = 9996
df_match_ingame_shot.loc[df_match_ingame_shot.team_api_id == 8031, "team_api_id"] = 8020
df_match_ingame_shot.loc[df_match_ingame_shot.team_api_id == 8244, "team_api_id"] = 8024
df_match_ingame_shot.to_csv("../data/df_match_ingame_shot.csv", index = False)
3.6. Summary

4. Overall Summary

- **Note**: Team attributes are important features, so I think matches of teams without team attributes from the df_team should be excluded from the analysis later.