2 - 3) EDA - Match batting information

13 minute read

import numpy as np 
import pandas as pd 

import sqlite3 as sql 

import matplotlib.pyplot as plt 
import seaborn as sns 
import missingno as msno 

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

df_match_betting = pd.read_csv("../data/df_match_betting.csv")

df_match_betting

	match_api_id	B365H	B365D	B365A	BWH	BWD	BWA	IWH	IWD	IWA	LBH	LBD	LBA	PSH	PSD	PSA	WHH	WHD	WHA	SJH	SJD	SJA	VCH	VCD	VCA	GBH	GBD	GBA	BSH	BSD	BSA
0	492473	1.73	3.40	5.00	1.75	3.35	4.20	1.85	3.2	3.5	1.80	3.3	3.75	NaN	NaN	NaN	1.70	3.30	4.33	1.90	3.3	4.00	1.65	3.40	4.50	1.78	3.25	4.00	1.73	3.40	4.20
1	492474	1.95	3.20	3.60	1.80	3.30	3.95	1.90	3.2	3.5	1.90	3.2	3.50	NaN	NaN	NaN	1.83	3.30	3.60	1.95	3.3	3.80	2.00	3.25	3.25	1.85	3.25	3.75	1.91	3.25	3.60
2	492475	2.38	3.30	2.75	2.40	3.30	2.55	2.60	3.1	2.3	2.50	3.2	2.50	NaN	NaN	NaN	2.50	3.25	2.40	2.63	3.3	2.50	2.35	3.25	2.65	2.50	3.20	2.50	2.30	3.20	2.75
3	492476	1.44	3.75	7.50	1.40	4.00	6.80	1.40	3.9	6.0	1.44	3.6	6.50	NaN	NaN	NaN	1.44	3.75	6.00	1.44	4.0	7.50	1.45	3.75	6.50	1.50	3.75	5.50	1.44	3.75	6.50
4	492477	5.00	3.50	1.65	5.00	3.50	1.60	4.00	3.3	1.7	4.00	3.4	1.72	NaN	NaN	NaN	4.20	3.40	1.70	4.50	3.5	1.73	4.50	3.40	1.65	4.50	3.50	1.65	4.75	3.30	1.67
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
25974	1992091	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25975	1992092	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25976	1992093	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25977	1992094	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25978	1992095	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

25979 rows × 31 columns

Odds are an indication of how likely an outcome is to occur. For very likely outcomes, the odds are lower, meaning the bookmaker is expecting this outcome, and as such, a lower payout is attached to a winning bet.
Before the last letter means betting site. There are 10 betting sites.
Last letter
- “H”: Home win odds
- “D”: Draw odds
- “A”: Away win odds
So there are 10 betting sites * 3 last letter = 30 betting information.
Let’s check the relation between odd and the match result.

con = sql.connect("../data/database.sqlite")
org_match = pd.read_sql(
    "select * from Match", con
    )

match_result_info = org_match[["match_api_id", "home_team_goal", "away_team_goal"]]
match_result_info["match_result"] = match_result_info.home_team_goal - match_result_info.away_team_goal

plt.figure(figsize = (20, 20))

corr = round(df_match_betting.merge(match_result_info, how = "left", on = "match_api_id").corr(), 2)

sns.heatmap(corr, annot = True, cmap = "BrBG", vmin = -1, vmax = 1, 
            linewidths = 0.5, cbar_kws = {"shrink" : 0.5})

<AxesSubplot:>

png

Match result (home team goal - away team goal) have some negative correlation with home team win odd.
Match result have some positive correlation with draw and away win odds.
That is, it seems that odds have some information related to the match result. So let’s use the odd variables as our predictors to predict the match result.
Also,
- Home team win odd has a correlation close to 1 between different sites.
- Draw and away team win odd have a correlation close to 1 between different sites.
That is, there are too many similar information. We need to reduce the betting related variables.

msno.matrix(df_match_betting)

<AxesSubplot:>

png

df_match_betting.isna().sum().sort_values(ascending = False)

PSA             14811
PSH             14811
PSD             14811
BSD             11818
BSH             11818
BSA             11818
GBH             11817
GBA             11817
GBD             11817
SJH              8882
SJA              8882
SJD              8882
IWD              3459
IWA              3459
IWH              3459
LBH              3423
LBD              3423
LBA              3423
VCH              3411
VCD              3411
VCA              3411
WHD              3408
WHH              3408
WHA              3408
BWA              3404
BWH              3404
BWD              3404
B365H            3387
B365A            3387
B365D            3387
match_api_id        0
dtype: int64

First, let’s exclude the betting variables with more than 4,000 missing values.

target_cols = df_match_betting.isna().sum()[df_match_betting.isna().sum() > 4000].index.values
df_match_betting = df_match_betting.drop(target_cols, axis = 1)

msno.matrix(df_match_betting)

<AxesSubplot:>

png

If there is a mssing value in the B356H, then all betting variables have mssing value.

df_match_betting[df_match_betting.B365H.isna()].shape

(3387, 19)

There are 3,387 matches that all betting variables have mssing values.
Simply drop these matches.

df_match_betting = df_match_betting[~df_match_betting.B365H.isna()]

target_id = df_match_betting.match_api_id.unique()

train_match_api_id = pd.read_csv("../data/train_match_api_id.csv")
train_match_api_id.shape

(18343, 1)

train_match_api_id = train_match_api_id[train_match_api_id.match_api_id.isin(target_id)]
train_match_api_id.shape

(17023, 1)

After drop the matches that don’t have any betting information, train matches become 18,343 -> 17,023.

test_match_api_id = pd.read_csv("../data/test_match_api_id.csv")
test_match_api_id.shape

(3018, 1)

test_match_api_id = test_match_api_id[test_match_api_id.match_api_id.isin(target_id)]
test_match_api_id.shape

(2662, 1)

After drop the matches that don’t have any betting information, test matches become 3,018 -> 2,662.

train_match_api_id.to_csv("../data/train_match_api_id.csv", index = False)
test_match_api_id.to_csv("../data/test_match_api_id.csv", index = False)

If th last letter is
- “H”: Home win odds
- “D”: Draw odds
- “A”: Awa win odds
That is, if the last letter is same, they have same information between different sites.
So let’s use summary statistics as follow:
- last letter “H” betting odds: min, max, mean, std
- last letter “D” betting odds: min, max, mean, std
- last letter “A” betting odds: min, max, mean, std

H_odds = ["B365H", "BWH", "IWH", "LBH", "WHH", "VCH"]
D_odds = ["B365D", "BWD", "IWD", "LBD", "WHD", "VCD"]
A_odds = ["B365A", "BWA", "IWA", "LBA", "WHA", "VCA"]

H_odds_stat = pd.DataFrame({"H_odd_min": df_match_betting[H_odds].min(axis = 1),
                            "H_odd_max": df_match_betting[H_odds].max(axis = 1),
                            "H_odd_mean": df_match_betting[H_odds].mean(axis = 1),
                            "H_odd_std": df_match_betting[H_odds].min(axis = 1)})

D_odds_stat = pd.DataFrame({"D_odd_min": df_match_betting[D_odds].min(axis = 1),
                            "D_odd_max": df_match_betting[D_odds].max(axis = 1),
                            "D_odd_mean": df_match_betting[D_odds].mean(axis = 1),
                            "D_odd_std": df_match_betting[D_odds].min(axis = 1)})

A_odds_stat = pd.DataFrame({"A_odd_min": df_match_betting[A_odds].min(axis = 1),
                            "A_odd_max": df_match_betting[A_odds].max(axis = 1),
                            "A_odd_mean": df_match_betting[A_odds].mean(axis = 1),
                            "A_odd_std": df_match_betting[A_odds].min(axis = 1)})

df_match_betting_stat = pd.concat([df_match_betting.match_api_id, H_odds_stat, D_odds_stat, A_odds_stat], axis = 1)

df_match_betting_stat

	match_api_id	H_odd_min	H_odd_max	H_odd_mean	H_odd_std	D_odd_min	D_odd_max	D_odd_mean	D_odd_std	A_odd_min	A_odd_max	A_odd_mean	A_odd_std
0	492473	1.65	1.85	1.746667	1.65	3.2	3.4	3.325000	3.2	3.50	5.00	4.213333	3.50
1	492474	1.80	2.00	1.896667	1.80	3.2	3.3	3.241667	3.2	3.25	3.95	3.566667	3.25
2	492475	2.35	2.60	2.455000	2.35	3.1	3.3	3.233333	3.1	2.30	2.75	2.525000	2.30
3	492476	1.40	1.45	1.428333	1.40	3.6	4.0	3.791667	3.6	6.00	7.50	6.550000	6.00
4	492477	4.00	5.00	4.450000	4.00	3.3	3.5	3.416667	3.3	1.60	1.72	1.670000	1.60
...	...	...	...	...	...	...	...	...	...	...	...	...	...
24552	2030167	1.57	1.65	1.591667	1.57	3.3	4.0	3.758333	3.3	4.90	7.00	6.400000	4.90
24553	2030168	2.20	2.38	2.288333	2.20	3.1	3.4	3.208333	3.1	3.10	3.40	3.241667	3.10
24554	2030169	1.50	1.60	1.550000	1.50	3.5	4.2	3.900000	3.5	5.40	7.00	6.566667	5.40
24555	2030170	2.30	2.40	2.341667	2.30	3.1	3.4	3.250000	3.1	2.75	3.30	3.083333	2.75
24556	2030171	2.10	2.30	2.208333	2.10	3.3	3.6	3.433333	3.3	2.90	3.50	3.216667	2.90

22592 rows × 13 columns

msno.matrix(df_match_betting_stat)

<AxesSubplot:>

png

Finally, we have 3 result (home win, draw, away win) * 4 summary statistics (min, max, mean, std) = 12 betting related variables.
Save the new data frame.

df_match_betting_stat.to_csv("../data/df_match_betting_stat.csv", index = False)

Summary

Drop the matches that have the missing values in all betting related columns.
- train matces: 18,343 -> 17,023
- test matches: 3,018 -> 2,662
Drop the betting columns that have more than 4,000 missing values. (drop 12 columns)
If th last letter is
- “H”: Home win odds
- “D”: Draw odds
- “A”: Awa win odds
That is, if the last letter is same, they have same information between different sites.
So we made summary statistics as follow:
- last letter “H” betting odds: min, max, mean, std
- last letter “D” betting odds: min, max, mean, std
- last letter “A” betting odds: min, max, mean, std

2.3.summary

Youngjun Woo

Summary