2 - 3) EDA - Match batting information

13 minute read

import numpy as np 
import pandas as pd 

import sqlite3 as sql 

import matplotlib.pyplot as plt 
import seaborn as sns 
import missingno as msno 
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
df_match_betting = pd.read_csv("../data/df_match_betting.csv")
df_match_betting
match_api_id B365H B365D B365A BWH BWD BWA IWH IWD IWA LBH LBD LBA PSH PSD PSA WHH WHD WHA SJH SJD SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 492473 1.73 3.40 5.00 1.75 3.35 4.20 1.85 3.2 3.5 1.80 3.3 3.75 NaN NaN NaN 1.70 3.30 4.33 1.90 3.3 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 492474 1.95 3.20 3.60 1.80 3.30 3.95 1.90 3.2 3.5 1.90 3.2 3.50 NaN NaN NaN 1.83 3.30 3.60 1.95 3.3 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60
2 492475 2.38 3.30 2.75 2.40 3.30 2.55 2.60 3.1 2.3 2.50 3.2 2.50 NaN NaN NaN 2.50 3.25 2.40 2.63 3.3 2.50 2.35 3.25 2.65 2.50 3.20 2.50 2.30 3.20 2.75
3 492476 1.44 3.75 7.50 1.40 4.00 6.80 1.40 3.9 6.0 1.44 3.6 6.50 NaN NaN NaN 1.44 3.75 6.00 1.44 4.0 7.50 1.45 3.75 6.50 1.50 3.75 5.50 1.44 3.75 6.50
4 492477 5.00 3.50 1.65 5.00 3.50 1.60 4.00 3.3 1.7 4.00 3.4 1.72 NaN NaN NaN 4.20 3.40 1.70 4.50 3.5 1.73 4.50 3.40 1.65 4.50 3.50 1.65 4.75 3.30 1.67
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1992091 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25975 1992092 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25976 1992093 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25977 1992094 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25978 1992095 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

25979 rows × 31 columns

  • Odds are an indication of how likely an outcome is to occur. For very likely outcomes, the odds are lower, meaning the bookmaker is expecting this outcome, and as such, a lower payout is attached to a winning bet.

  • Before the last letter means betting site. There are 10 betting sites.
  • Last letter
    • “H”: Home win odds
    • “D”: Draw odds
    • “A”: Away win odds
  • So there are 10 betting sites * 3 last letter = 30 betting information.

  • Let’s check the relation between odd and the match result.
con = sql.connect("../data/database.sqlite")
org_match = pd.read_sql(
    "select * from Match", con
    )
match_result_info = org_match[["match_api_id", "home_team_goal", "away_team_goal"]]
match_result_info["match_result"] = match_result_info.home_team_goal - match_result_info.away_team_goal
plt.figure(figsize = (20, 20))

corr = round(df_match_betting.merge(match_result_info, how = "left", on = "match_api_id").corr(), 2)

sns.heatmap(corr, annot = True, cmap = "BrBG", vmin = -1, vmax = 1, 
            linewidths = 0.5, cbar_kws = {"shrink" : 0.5})
<AxesSubplot:>

png

  • Match result (home team goal - away team goal) have some negative correlation with home team win odd.
  • Match result have some positive correlation with draw and away win odds.
  • That is, it seems that odds have some information related to the match result. So let’s use the odd variables as our predictors to predict the match result.

  • Also,
    • Home team win odd has a correlation close to 1 between different sites.
    • Draw and away team win odd have a correlation close to 1 between different sites.
  • That is, there are too many similar information. We need to reduce the betting related variables.
msno.matrix(df_match_betting)
<AxesSubplot:>

png

df_match_betting.isna().sum().sort_values(ascending = False)
PSA             14811
PSH             14811
PSD             14811
BSD             11818
BSH             11818
BSA             11818
GBH             11817
GBA             11817
GBD             11817
SJH              8882
SJA              8882
SJD              8882
IWD              3459
IWA              3459
IWH              3459
LBH              3423
LBD              3423
LBA              3423
VCH              3411
VCD              3411
VCA              3411
WHD              3408
WHH              3408
WHA              3408
BWA              3404
BWH              3404
BWD              3404
B365H            3387
B365A            3387
B365D            3387
match_api_id        0
dtype: int64
  • First, let’s exclude the betting variables with more than 4,000 missing values.
target_cols = df_match_betting.isna().sum()[df_match_betting.isna().sum() > 4000].index.values
df_match_betting = df_match_betting.drop(target_cols, axis = 1)
msno.matrix(df_match_betting)
<AxesSubplot:>

png

  • If there is a mssing value in the B356H, then all betting variables have mssing value.
df_match_betting[df_match_betting.B365H.isna()].shape
(3387, 19)
  • There are 3,387 matches that all betting variables have mssing values.
  • Simply drop these matches.
df_match_betting = df_match_betting[~df_match_betting.B365H.isna()]
target_id = df_match_betting.match_api_id.unique()
train_match_api_id = pd.read_csv("../data/train_match_api_id.csv")
train_match_api_id.shape
(18343, 1)
train_match_api_id = train_match_api_id[train_match_api_id.match_api_id.isin(target_id)]
train_match_api_id.shape
(17023, 1)
  • After drop the matches that don’t have any betting information, train matches become 18,343 -> 17,023.
test_match_api_id = pd.read_csv("../data/test_match_api_id.csv")
test_match_api_id.shape
(3018, 1)
test_match_api_id = test_match_api_id[test_match_api_id.match_api_id.isin(target_id)]
test_match_api_id.shape
(2662, 1)
  • After drop the matches that don’t have any betting information, test matches become 3,018 -> 2,662.
train_match_api_id.to_csv("../data/train_match_api_id.csv", index = False)
test_match_api_id.to_csv("../data/test_match_api_id.csv", index = False)
  • If th last letter is
    • “H”: Home win odds
    • “D”: Draw odds
    • “A”: Awa win odds
  • That is, if the last letter is same, they have same information between different sites.
  • So let’s use summary statistics as follow:
    • last letter “H” betting odds: min, max, mean, std
    • last letter “D” betting odds: min, max, mean, std
    • last letter “A” betting odds: min, max, mean, std
H_odds = ["B365H", "BWH", "IWH", "LBH", "WHH", "VCH"]
D_odds = ["B365D", "BWD", "IWD", "LBD", "WHD", "VCD"]
A_odds = ["B365A", "BWA", "IWA", "LBA", "WHA", "VCA"]
H_odds_stat = pd.DataFrame({"H_odd_min": df_match_betting[H_odds].min(axis = 1),
                            "H_odd_max": df_match_betting[H_odds].max(axis = 1),
                            "H_odd_mean": df_match_betting[H_odds].mean(axis = 1),
                            "H_odd_std": df_match_betting[H_odds].min(axis = 1)})
D_odds_stat = pd.DataFrame({"D_odd_min": df_match_betting[D_odds].min(axis = 1),
                            "D_odd_max": df_match_betting[D_odds].max(axis = 1),
                            "D_odd_mean": df_match_betting[D_odds].mean(axis = 1),
                            "D_odd_std": df_match_betting[D_odds].min(axis = 1)})
A_odds_stat = pd.DataFrame({"A_odd_min": df_match_betting[A_odds].min(axis = 1),
                            "A_odd_max": df_match_betting[A_odds].max(axis = 1),
                            "A_odd_mean": df_match_betting[A_odds].mean(axis = 1),
                            "A_odd_std": df_match_betting[A_odds].min(axis = 1)})
df_match_betting_stat = pd.concat([df_match_betting.match_api_id, H_odds_stat, D_odds_stat, A_odds_stat], axis = 1)
df_match_betting_stat
match_api_id H_odd_min H_odd_max H_odd_mean H_odd_std D_odd_min D_odd_max D_odd_mean D_odd_std A_odd_min A_odd_max A_odd_mean A_odd_std
0 492473 1.65 1.85 1.746667 1.65 3.2 3.4 3.325000 3.2 3.50 5.00 4.213333 3.50
1 492474 1.80 2.00 1.896667 1.80 3.2 3.3 3.241667 3.2 3.25 3.95 3.566667 3.25
2 492475 2.35 2.60 2.455000 2.35 3.1 3.3 3.233333 3.1 2.30 2.75 2.525000 2.30
3 492476 1.40 1.45 1.428333 1.40 3.6 4.0 3.791667 3.6 6.00 7.50 6.550000 6.00
4 492477 4.00 5.00 4.450000 4.00 3.3 3.5 3.416667 3.3 1.60 1.72 1.670000 1.60
... ... ... ... ... ... ... ... ... ... ... ... ... ...
24552 2030167 1.57 1.65 1.591667 1.57 3.3 4.0 3.758333 3.3 4.90 7.00 6.400000 4.90
24553 2030168 2.20 2.38 2.288333 2.20 3.1 3.4 3.208333 3.1 3.10 3.40 3.241667 3.10
24554 2030169 1.50 1.60 1.550000 1.50 3.5 4.2 3.900000 3.5 5.40 7.00 6.566667 5.40
24555 2030170 2.30 2.40 2.341667 2.30 3.1 3.4 3.250000 3.1 2.75 3.30 3.083333 2.75
24556 2030171 2.10 2.30 2.208333 2.10 3.3 3.6 3.433333 3.3 2.90 3.50 3.216667 2.90

22592 rows × 13 columns

msno.matrix(df_match_betting_stat)
<AxesSubplot:>

png

  • Finally, we have 3 result (home win, draw, away win) * 4 summary statistics (min, max, mean, std) = 12 betting related variables.

  • Save the new data frame.

df_match_betting_stat.to_csv("../data/df_match_betting_stat.csv", index = False)

Summary

  • Drop the matches that have the missing values in all betting related columns.
    • train matces: 18,343 -> 17,023
    • test matches: 3,018 -> 2,662
  • Drop the betting columns that have more than 4,000 missing values. (drop 12 columns)

  • If th last letter is
    • “H”: Home win odds
    • “D”: Draw odds
    • “A”: Awa win odds
  • That is, if the last letter is same, they have same information between different sites.
  • So we made summary statistics as follow:
    • last letter “H” betting odds: min, max, mean, std
    • last letter “D” betting odds: min, max, mean, std
    • last letter “A” betting odds: min, max, mean, std

2.3.summary