3 - 1) Train machine learning models to predict match results (In progress)

53 minute read

import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt 
import matplotlib.patches as mpatches
import seaborn as sns 
import missingno as msno 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_validate
from sklearn import preprocessing

from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis 
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import tree
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb

import optuna
from optuna.integration import LightGBMPruningCallback
from optuna.integration import XGBoostPruningCallback
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report

from sklearn.inspection import permutation_importance

import imblearn
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
  • We have 5 kinds of variable sets:
    • Variable set 1: player attributes PC features
    • Variable set 2: betting information features
    • Variable set 3: team attributes features
    • Variable set 4: goal and win percentage rolling features
    • Variable set 5: each team’s Elo rating
df_match_basic = pd.read_csv("../data/df_match_basic.csv")
df_match_player_attr_pcs = pd.read_csv("../data/df_match_player_attr_pcs.csv")
df_match_betting_stat = pd.read_csv("../data/df_match_betting_stat.csv")
df_match_team_num_attr = pd.read_csv("../data/df_match_team_num_attr.csv")
df_team_win_goal_rolling_features = pd.read_csv("../data/df_team_win_goal_rolling_features.csv")
df_match_elo = pd.read_csv("../data/df_match_elo.csv")
  • First, let’s predict the match result with each variable set separately and compare the results.

1. Train test split

  • Use the last season (2015/2016) as the test set and all earlier seasons as the train set.
target_bool = (df_match_basic.match_api_id.isin(df_match_player_attr_pcs.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_match_betting_stat.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_match_team_num_attr.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_team_win_goal_rolling_features.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_match_elo.match_api_id))
    
target_matches = df_match_basic[target_bool]
test_match_api_id = target_matches[target_matches.season == "2015/2016"].match_api_id
train_match_api_id = target_matches[target_matches.season != "2015/2016"].match_api_id
print(len(train_match_api_id), len(test_match_api_id))
16988 2621
  • There are 16,988 matches in the train set and 2,621 matches in the test set.

2. Baseline accuracy

df_match_basic[df_match_basic.match_api_id.isin(train_match_api_id)].match_result.value_counts()
home_win    7840
away_win    4855
draw        4293
Name: match_result, dtype: int64
sns.countplot(x = df_match_basic[df_match_basic.match_api_id.isin(train_match_api_id)].match_result)
<AxesSubplot:xlabel='match_result', ylabel='count'>

(figure: count plot of match results in the train set)

  • About 46% (7,840 / 16,988) of the training matches were won by the home team.
  • That is, if we predicted every match as a home-team win, we would achieve about 46% accuracy; this serves as our baseline accuracy.
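This majority-class baseline can also be computed with scikit-learn's `DummyClassifier`; a minimal sketch on toy labels (the arrays below are illustrative, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels mimicking the class imbalance (illustrative only).
y = np.array(["home_win"] * 46 + ["away_win"] * 29 + ["draw"] * 25)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit ("home_win").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # fraction of the majority class, 0.46 here
```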

  • Let’s check the baseline accuracy in the test data set.
df_match_basic[df_match_basic.match_api_id.isin(test_match_api_id)].match_result.value_counts()
home_win    1161
away_win     801
draw         659
Name: match_result, dtype: int64
sns.countplot(x = df_match_basic[df_match_basic.match_api_id.isin(test_match_api_id)].match_result)
<AxesSubplot:xlabel='match_result', ylabel='count'>

(figure: count plot of match results in the test set)

  • The baseline accuracy on the test set is about 44% (1,161 / 2,621).

3. Modeling with all variable sets

3.1. Variable set 1: Player attributes PC features

df_match_player_attr_pcs = df_match_player_attr_pcs.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
df_match_player_attr_pcs = df_match_player_attr_pcs.set_index("match_api_id")
df_match_player_attr_pcs

(wide DataFrame preview truncated: 21,374 rows × 111 columns — 5 principal-component features for each of the 22 starters, columns home_player_1_pc_1 … away_player_11_pc_5, plus the match_result label)

  • Split the table into train and test set.
train_bool = df_match_player_attr_pcs.reset_index().match_api_id.isin(train_match_api_id)
test_bool = df_match_player_attr_pcs.reset_index().match_api_id.isin(test_match_api_id) 
df_pc_train = df_match_player_attr_pcs.reset_index()[train_bool].set_index("match_api_id")
df_pc_test = df_match_player_attr_pcs.reset_index()[test_bool].set_index("match_api_id")
X_pc_train = df_pc_train.drop("match_result", axis = 1)
y_pc_train = df_pc_train.match_result 

X_pc_test = df_pc_test.drop("match_result", axis = 1)
y_pc_test = df_pc_test.match_result
print("Number of train data: ", X_pc_train.shape[0])
print("Number of test data: ", X_pc_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

le = preprocessing.LabelEncoder()
le.fit(y_pc_train)

y_pc_train_encd = le.transform(y_pc_train)
y_pc_test_encd = le.transform(y_pc_test)
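`LabelEncoder` assigns integer labels in alphabetical order of the class names, so here away_win → 0, draw → 1, home_win → 2. A quick sketch of that mapping:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["home_win", "away_win", "draw", "home_win"])

# Classes are sorted alphabetically before being numbered.
print(le.classes_)                         # ['away_win' 'draw' 'home_win']
print(le.transform(["home_win", "draw"]))  # [2 1]
```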
names = ["KNN", 
         "LDA", 
         "QDA", 
         "Naive Bayes",
         "Logistic regression",
         "Decision tree", 
         "Random Forest",  
         "AdaBoost",
         "XGBoost",
         "Polynomial kernel SVM",
         "Radial kernel SVM",
         "GBM",
         "LightGBM"
        ]

classifiers = [
    KNeighborsClassifier(3),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    GaussianNB(), 
    LogisticRegression(),
    DecisionTreeClassifier(random_state = 42),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    xgb.XGBClassifier(),
    SVC(kernel = "poly", probability = True),
    SVC(kernel = "rbf", probability = True),
    GradientBoostingClassifier(),
    lgb.LGBMClassifier()
    ]
result_accuracy = pd.DataFrame(names, columns = ["model_name"])
# baseline accuracy

y_pred_baseline = le.transform(["home_win"])
baseline_accuracy = np.mean(y_pred_baseline == y_pc_test_encd)
result_accuracy["Baseline accuracy"] = baseline_accuracy
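Note that `y_pred_baseline` is a length-1 array, so the comparison above relies on NumPy broadcasting: the single encoded label is compared element-wise against every test label. A small sketch of the same mechanism:

```python
import numpy as np

y_pred = np.array([2])            # a single predicted label
y_true = np.array([2, 0, 1, 2])   # four true labels

# The length-1 array broadcasts across y_true, giving one boolean per element.
matches = (y_pred == y_true)
print(matches)          # [ True False False  True]
print(matches.mean())   # 0.5
```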
y_pred_dict = {}

for name, clf in zip(names, classifiers):
    clf.fit(X_pc_train, y_pc_train_encd)
    
    y_pred = clf.predict(X_pc_test)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_pc_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Player PC Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables
0 KNN 0.442961 41.320
1 LDA 0.442961 49.943
2 QDA 0.442961 42.350
3 Naive Bayes 0.442961 47.272
4 Logistic regression 0.442961 50.172
5 Decision tree 0.442961 38.077
6 Random Forest 0.442961 49.790
7 AdaBoost 0.442961 50.362
8 XGBoost 0.442961 47.196
9 Polynomial kernel SVM 0.442961 50.515
10 Radial kernel SVM 0.442961 50.897
11 GBM 0.442961 50.820
12 LightGBM 0.442961 49.866
  • Except for the KNN, QDA, and decision tree models, all models beat the baseline accuracy when using the player PC features.

3.2. Variable set 2: Betting information features

df_match_betting_stat = df_match_betting_stat.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_match_betting_stat.match_api_id.isin(train_match_api_id)
test_bool = df_match_betting_stat.match_api_id.isin(test_match_api_id) 
df_bet_train = df_match_betting_stat[train_bool].set_index("match_api_id")
df_bet_test = df_match_betting_stat[test_bool].set_index("match_api_id")
X_bet_train = df_bet_train.drop("match_result", axis = 1)
y_bet_train = df_bet_train.match_result 

X_bet_test = df_bet_test.drop("match_result", axis = 1)
y_bet_test = df_bet_test.match_result
print("Number of train data: ", X_bet_train.shape[0])
print("Number of test data: ", X_bet_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess variables before modeling.
# Transform the match_result class to numerical labels.
y_bet_train_encd = le.transform(y_bet_train)
y_bet_test_encd = le.transform(y_bet_test)
# Standardize features
col_names = X_bet_train.columns

scaler = StandardScaler()
scaler.fit(X_bet_train)

X_bet_train_std = pd.DataFrame(scaler.transform(X_bet_train), columns = col_names)
X_bet_test_std = pd.DataFrame(scaler.transform(X_bet_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_bet_train_std, y_bet_train_encd)
    
    y_pred = clf.predict(X_bet_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_bet_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Betting Statistics Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables
0 KNN 0.442961 41.320 44.220
1 LDA 0.442961 49.943 51.469
2 QDA 0.442961 42.350 40.557
3 Naive Bayes 0.442961 47.272 42.198
4 Logistic regression 0.442961 50.172 51.545
5 Decision tree 0.442961 38.077 43.304
6 Random Forest 0.442961 49.790 48.798
7 AdaBoost 0.442961 50.362 51.698
8 XGBoost 0.442961 47.196 50.439
9 Polynomial kernel SVM 0.442961 50.515 48.760
10 Radial kernel SVM 0.442961 50.897 51.393
11 GBM 0.442961 50.820 52.079
12 LightGBM 0.442961 49.866 52.041
  • Except for the KNN, QDA, Naive Bayes, and decision tree models, all models beat the baseline accuracy when using the betting statistics features.
  • Overall, accuracies are higher with the betting information than with the player PC features.

3.3 Variable set 3: Team attribute features

df_match_team_num_attr = df_match_team_num_attr.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_match_team_num_attr.match_api_id.isin(train_match_api_id)
test_bool = df_match_team_num_attr.match_api_id.isin(test_match_api_id) 
df_team_train = df_match_team_num_attr[train_bool].set_index("match_api_id")
df_team_test = df_match_team_num_attr[test_bool].set_index("match_api_id")
X_team_train = df_team_train.drop("match_result", axis = 1)
y_team_train = df_team_train.match_result 

X_team_test = df_team_test.drop("match_result", axis = 1)
y_team_test = df_team_test.match_result
print("Number of train data: ", X_team_train.shape[0])
print("Number of test data: ", X_team_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.
y_team_train_encd = le.transform(y_team_train)
y_team_test_encd = le.transform(y_team_test)
# fill the missing values with 0
X_team_train.fillna(0, inplace = True)
X_team_test.fillna(0, inplace = True)
# Standardize features
col_names = X_team_train.columns

scaler = StandardScaler()
scaler.fit(X_team_train)

X_team_train_std = pd.DataFrame(scaler.transform(X_team_train), columns = col_names)
X_team_test_std = pd.DataFrame(scaler.transform(X_team_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_team_train_std, y_team_train_encd)
    
    y_pred = clf.predict(X_team_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_team_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Team attribute Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables
0 KNN 0.442961 41.320 44.220 39.412
1 LDA 0.442961 49.943 51.469 45.670
2 QDA 0.442961 42.350 40.557 45.784
3 Naive Bayes 0.442961 47.272 42.198 46.280
4 Logistic regression 0.442961 50.172 51.545 45.555
5 Decision tree 0.442961 38.077 43.304 38.001
6 Random Forest 0.442961 49.790 48.798 45.326
7 AdaBoost 0.442961 50.362 51.698 46.814
8 XGBoost 0.442961 47.196 50.439 43.037
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601
10 Radial kernel SVM 0.442961 50.897 51.393 44.868
11 GBM 0.442961 50.820 52.079 47.310
12 LightGBM 0.442961 49.866 52.041 47.119
  • Except for the KNN, XGBoost, and decision tree models, all models beat the baseline accuracy when using the team attribute features.
  • Overall, accuracies with the team attribute features are lower than with the other variable sets.

3.4. Variable set 4: Goal and win percentage rolling features

df_team_win_goal_rolling_features = df_team_win_goal_rolling_features.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_team_win_goal_rolling_features.reset_index().match_api_id.isin(train_match_api_id)
test_bool = df_team_win_goal_rolling_features.reset_index().match_api_id.isin(test_match_api_id) 
df_rolling_train = df_team_win_goal_rolling_features[train_bool].set_index("match_api_id")
df_rolling_test = df_team_win_goal_rolling_features[test_bool].set_index("match_api_id")
X_rolling_train = df_rolling_train.drop("match_result", axis = 1)
y_rolling_train = df_rolling_train.match_result 

X_rolling_test = df_rolling_test.drop("match_result", axis = 1)
y_rolling_test = df_rolling_test.match_result
print("Number of train data: ", X_rolling_train.shape[0])
print("Number of test data: ", X_rolling_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

y_rolling_train_encd = le.transform(y_rolling_train)
y_rolling_test_encd = le.transform(y_rolling_test)
# fill missing values with 0

X_rolling_train.fillna(0, inplace = True)
X_rolling_test.fillna(0, inplace = True)
# Standardize features

col_names = X_rolling_train.columns

scaler = StandardScaler()
scaler.fit(X_rolling_train)

X_rolling_train_std = pd.DataFrame(scaler.transform(X_rolling_train), columns = col_names)
X_rolling_test_std = pd.DataFrame(scaler.transform(X_rolling_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_rolling_train_std, y_rolling_train_encd)
    
    y_pred = clf.predict(X_rolling_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_rolling_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Team's goal and win percentage rolling Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables Team's goal and win percentage rolling Variables
0 KNN 0.442961 41.320 44.220 39.412 43.342
1 LDA 0.442961 49.943 51.469 45.670 49.676
2 QDA 0.442961 42.350 40.557 45.784 45.059
3 Naive Bayes 0.442961 47.272 42.198 46.280 46.929
4 Logistic regression 0.442961 50.172 51.545 45.555 49.790
5 Decision tree 0.442961 38.077 43.304 38.001 39.489
6 Random Forest 0.442961 49.790 48.798 45.326 49.447
7 AdaBoost 0.442961 50.362 51.698 46.814 50.019
8 XGBoost 0.442961 47.196 50.439 43.037 48.607
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601 48.264
10 Radial kernel SVM 0.442961 50.897 51.393 44.868 49.828
11 GBM 0.442961 50.820 52.079 47.310 49.752
12 LightGBM 0.442961 49.866 52.041 47.119 48.989
  • Except for the KNN and decision tree models, all models beat the baseline accuracy when using the goal and win percentage rolling features.
  • Overall, the models perform reasonably well with these rolling features, close to the player PC results.

3.5. Variable set 5: each team’s Elo rating

df_match_elo = df_match_elo.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_match_elo.reset_index().match_api_id.isin(train_match_api_id)
test_bool = df_match_elo.reset_index().match_api_id.isin(test_match_api_id) 
df_elo_train = df_match_elo[train_bool].set_index("match_api_id")
df_elo_test = df_match_elo[test_bool].set_index("match_api_id")
X_elo_train = df_elo_train.drop("match_result", axis = 1)
y_elo_train = df_elo_train.match_result 

X_elo_test = df_elo_test.drop("match_result", axis = 1)
y_elo_test = df_elo_test.match_result
print("Number of train data: ", X_elo_train.shape[0])
print("Number of test data: ", X_elo_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

y_elo_train_encd = le.transform(y_elo_train)
y_elo_test_encd = le.transform(y_elo_test)
# fill missing values with 0

X_elo_train.fillna(0, inplace = True)
X_elo_test.fillna(0, inplace = True)
# Standardize features

col_names = X_elo_train.columns

scaler = StandardScaler()
scaler.fit(X_elo_train)

X_elo_train_std = pd.DataFrame(scaler.transform(X_elo_train), columns = col_names)
X_elo_test_std = pd.DataFrame(scaler.transform(X_elo_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_elo_train_std, y_elo_train_encd)
    
    y_pred = clf.predict(X_elo_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_elo_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Team's Elo rating related Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables Team's goal and win percentage rolling Variables Team's Elo rating related Variables
0 KNN 0.442961 41.320 44.220 39.412 43.342 40.710
1 LDA 0.442961 49.943 51.469 45.670 49.676 50.630
2 QDA 0.442961 42.350 40.557 45.784 45.059 38.878
3 Naive Bayes 0.442961 47.272 42.198 46.280 46.929 48.607
4 Logistic regression 0.442961 50.172 51.545 45.555 49.790 50.630
5 Decision tree 0.442961 38.077 43.304 38.001 39.489 38.573
6 Random Forest 0.442961 49.790 48.798 45.326 49.447 49.142
7 AdaBoost 0.442961 50.362 51.698 46.814 50.019 51.011
8 XGBoost 0.442961 47.196 50.439 43.037 48.607 48.150
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601 48.264 48.874
10 Radial kernel SVM 0.442961 50.897 51.393 44.868 49.828 50.439
11 GBM 0.442961 50.820 52.079 47.310 49.752 50.591
12 LightGBM 0.442961 49.866 52.041 47.119 48.989 49.561
  • Except for the KNN, QDA, and decision tree models, all models achieve higher accuracy than the baseline when each team’s Elo-rating-related features are used.
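The baseline accuracy of 0.442961 is simply the share of the majority class (home wins, 1161 of 2621 test matches), i.e. the accuracy of always predicting home_win. A minimal sketch, with toy labels standing in for y_elo_test:

```python
import pandas as pd

def baseline_accuracy(y):
    """Majority-class baseline: always predict the most frequent label."""
    return y.value_counts(normalize=True).max()

# Toy labels; in the notebook this would be baseline_accuracy(y_elo_test).
y_toy = pd.Series(["home_win"] * 46 + ["away_win"] * 31 + ["draw"] * 23)
print(baseline_accuracy(y_toy))  # 0.46
```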

3.6. Use all variables

  • Merge all feature tables.
df_all = df_match_player_attr_pcs.merge(df_match_betting_stat.drop("match_result", axis = 1), how = "left", on = ["match_api_id"]) \
                                 .merge(df_match_team_num_attr.drop("match_result", axis = 1), how = "left", on = ["match_api_id"]) \
                                 .merge(df_team_win_goal_rolling_features.drop("match_result", axis = 1), how = "left", on = ["match_api_id"])  \
                                 .merge(df_match_elo.drop("match_result", axis = 1), how = "left", on = ["match_api_id"])
                                 
  • Split the data into train and test set.
train_bool = df_all.match_api_id.isin(train_match_api_id)
test_bool = df_all.match_api_id.isin(test_match_api_id) 
df_all_train = df_all[train_bool].set_index("match_api_id")
df_all_test = df_all[test_bool].set_index("match_api_id")
X_all_train = df_all_train.drop("match_result", axis = 1)
y_all_train = df_all_train.match_result 

X_all_test = df_all_test.drop("match_result", axis = 1)
y_all_test = df_all_test.match_result
print("Number of train data: ", X_all_train.shape[0])
print("Number of test data: ", X_all_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

y_all_train_encd = le.transform(y_all_train)
y_all_test_encd = le.transform(y_all_test)
# fill missing values with 0

X_all_train.fillna(0, inplace = True)
X_all_test.fillna(0, inplace = True)
# Standardize features

col_names = X_all_train.columns

scaler = StandardScaler()
scaler.fit(X_all_train)

X_all_train_std = pd.DataFrame(scaler.transform(X_all_train), columns = col_names)
X_all_test_std = pd.DataFrame(scaler.transform(X_all_test), columns = col_names)

  • Save the tables.
df_all.to_csv("../data/df_all.csv", index = False)

train_match_api_id.to_csv("../data/train_match_api_id.csv", index = False)
test_match_api_id.to_csv("../data/test_match_api_id.csv", index = False)

X_all_train.to_csv("../data/X_all_train.csv", index = False)
X_all_test.to_csv("../data/X_all_test.csv", index = False)

X_all_train_std.to_csv("../data/X_all_train_std.csv", index = False)
X_all_test_std.to_csv("../data/X_all_test_std.csv", index = False)

y_all_train.to_csv("../data/y_all_train.csv", index = False)
y_all_test.to_csv("../data/y_all_test.csv", index = False)
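If these tables are reloaded in a later session, a quick round-trip check guards against silent schema or dtype changes. A minimal sketch using a temporary file rather than the real ../data paths:

```python
import os
import tempfile

import pandas as pd

# Toy table; writing then reading back should preserve shape, columns, dtypes.
df = pd.DataFrame({"match_api_id": [1, 2], "feature": [0.5, 1.5]})
path = os.path.join(tempfile.mkdtemp(), "df_demo.csv")  # hypothetical path
df.to_csv(path, index=False)
df_reloaded = pd.read_csv(path)
print(df_reloaded.equals(df))  # True
```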

for name, clf in zip(names, classifiers):
    clf.fit(X_all_train_std, y_all_train_encd)
    
    y_pred = clf.predict(X_all_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_all_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "All Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables Team's goal and win percentage rolling Variables Team's Elo rating related Variables All Variables
0 KNN 0.442961 41.320 44.220 39.412 43.342 40.710 43.686
1 LDA 0.442961 49.943 51.469 45.670 49.676 50.630 50.706
2 QDA 0.442961 42.350 40.557 45.784 45.059 38.878 46.051
3 Naive Bayes 0.442961 47.272 42.198 46.280 46.929 48.607 45.937
4 Logistic regression 0.442961 50.172 51.545 45.555 49.790 50.630 51.316
5 Decision tree 0.442961 38.077 43.304 38.001 39.489 38.573 41.892
6 Random Forest 0.442961 49.790 48.798 45.326 49.447 49.142 52.003
7 AdaBoost 0.442961 50.362 51.698 46.814 50.019 51.011 51.278
8 XGBoost 0.442961 47.196 50.439 43.037 48.607 48.150 49.447
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601 48.264 48.874 48.913
10 Radial kernel SVM 0.442961 50.897 51.393 44.868 49.828 50.439 51.240
11 GBM 0.442961 50.820 52.079 47.310 49.752 50.591 51.736
12 LightGBM 0.442961 49.866 52.041 47.119 48.989 49.561 51.164
  • When all variables are used, the random forest achieves the highest accuracy, 52.003%.
  • So, let’s tune the hyperparameters of the random forest.
  • Also, among the models with accuracy above 50%, the LightGBM is the fastest to tune, so let’s tune it as well.

  • Before tuning the hyperparameters, let’s check the confusion matrices of the random forest and the LightGBM.

Default Random Forest confusion matrix

rf_default = RandomForestClassifier()
rf_default.fit(X_all_train_std, y_all_train_encd)
rf_default_pred = rf_default.predict(X_all_test_std)
le.inverse_transform(y_all_test_encd)
array(['home_win', 'home_win', 'home_win', ..., 'home_win', 'draw',
       'home_win'], dtype=object)
rf_default_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                                 le.inverse_transform(rf_default_pred))
cm_display = ConfusionMatrixDisplay(confusion_matrix = rf_default_cm, 
                                    display_labels = le.inverse_transform(rf_default.classes_))
cm_display.plot();    

png

print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(rf_default_pred)))
              precision    recall  f1-score   support

    away_win       0.49      0.49      0.49       801
        draw       0.31      0.05      0.09       659
    home_win       0.53      0.77      0.63      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.40      2621
weighted avg       0.46      0.51      0.45      2621
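Because home_win dominates the test set, plain accuracy flatters a majority-leaning classifier; balanced_accuracy_score (imported at the top of the notebook but unused in the code shown) averages per-class recall instead. A toy illustration of how it penalizes an always-majority predictor; in the notebook it could be applied as balanced_accuracy_score(y_all_test_encd, rf_default_pred):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Toy labels: class 0 is the majority, and the classifier always predicts it.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0])

# Balanced accuracy = mean of per-class recalls = (1 + 0 + 0) / 3.
print(balanced_accuracy_score(y_true, y_pred))  # 0.3333...
```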

Default LightGBM confusion matrix

lgbm_default = lgb.LGBMClassifier()
lgbm_default.fit(X_all_train_std, y_all_train_encd)
lgbm_default_pred = lgbm_default.predict(X_all_test_std)
lgbm_default_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                                   le.inverse_transform(lgbm_default_pred))
cm_display = ConfusionMatrixDisplay(confusion_matrix = lgbm_default_cm, 
                                    display_labels = le.inverse_transform(lgbm_default.classes_))
cm_display.plot();    

png

print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(lgbm_default_pred)))
              precision    recall  f1-score   support

    away_win       0.49      0.48      0.49       801
        draw       0.30      0.07      0.11       659
    home_win       0.54      0.79      0.64      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.41      2621
weighted avg       0.46      0.51      0.46      2621

4. Hyperparameter tuning

4.1. Random forest

  • Candidate hyperparameters are as follows:
    • n_estimators: 100, 300, 500, 1000
    • max_depth: 3 ~ 20 with step = 2
    • max_features: sqrt, log2
    • min_samples_leaf: 1 ~ 10
    • min_samples_split: 2 ~ 10
def rf_objective(trial, X, y):
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [100, 300, 500, 1000]),
        "max_depth": trial.suggest_int("max_depth", 3, 20, step = 2),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
    }
    
    cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
    cv_scores = np.empty(5)
    
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        model = RandomForestClassifier(**param_grid, n_jobs = -1)
        model.fit(X_train, y_train)
          
        pred = model.predict(X_test)
        cv_scores[idx] = np.mean(pred == y_test)

    return np.mean(cv_scores)
rf_study = optuna.create_study(direction = "maximize", study_name = "RandomForest Classifier")
func = lambda trial: rf_objective(trial, X_all_train_std, y_all_train_encd)
rf_study.optimize(func, n_trials = 20)
  • Best parameters are as follows:
rf_study.best_params
{'n_estimators': 300,
 'max_depth': 5,
 'max_features': 'sqrt',
 'min_samples_leaf': 7,
 'min_samples_split': 4}
  • Let’s check the test accuracy with the best hyperparameter set.
rf_best = RandomForestClassifier(**rf_study.best_params)
rf_best.fit(X_all_train_std, y_all_train_encd)
rf_best_pred = rf_best.predict(X_all_test_std)
rf_best_accuracy = np.mean(rf_best_pred == y_all_test_encd)
print("Accuracy before tuning the hyperparameters: ", result_accuracy[result_accuracy.model_name == "Random Forest"]["All Variables"].values[0])
print("Accuracy after tuning the hyperparameters: ", rf_best_accuracy * 100)
Accuracy before tuning the hyperparameters:  52.003
Accuracy after tuning the hyperparameters:  52.04120564669973
  • Let’s check the confusion matrix of the tuned random forest model.
fig, axes = plt.subplots(1, 2, figsize = (15, 5))

# confusion matrix for the random forest with default hyperparameters

rf_default_display = ConfusionMatrixDisplay(confusion_matrix = rf_default_cm, 
                                            display_labels = le.inverse_transform(rf_default.classes_))

# confusion matrix for the random forest with the best hyperparameters

rf_tuned_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                               le.inverse_transform(rf_best_pred))
rf_best_display = ConfusionMatrixDisplay(confusion_matrix = rf_tuned_cm, 
                                    display_labels = le.inverse_transform(rf_best.classes_))

rf_default_display.plot(ax = axes[0])
axes[0].set_title("Random Forest before tuning", fontsize = 15)

rf_best_display.plot(ax = axes[1])
axes[1].set_title("Random Forest after tuning", fontsize = 15)

plt.tight_layout()

png

print("< Random Forest before tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(rf_default_pred)))

print("")
print("< Random Forest after tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(rf_best_pred)))
< Random Forest before tuning >

              precision    recall  f1-score   support

    away_win       0.49      0.49      0.49       801
        draw       0.31      0.05      0.09       659
    home_win       0.53      0.77      0.63      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.40      2621
weighted avg       0.46      0.51      0.45      2621


< Random Forest after tuning >

              precision    recall  f1-score   support

    away_win       0.50      0.50      0.50       801
        draw       0.00      0.00      0.00       659
    home_win       0.53      0.83      0.65      1161

    accuracy                           0.52      2621
   macro avg       0.34      0.44      0.38      2621
weighted avg       0.39      0.52      0.44      2621
  • Results for away_win and home_win have improved, but the tuned model no longer predicts any draws.

4.2. LightGBM

  • Candidate hyperparameters are as follows:
    • learning_rate: 0.01 ~ 0.3
    • num_leaves: 20 ~ 3000 with step = 20
    • max_depth: 3 ~ 12
    • min_data_in_leaf: 200 ~ 10000 with step = 100
    • max_bin: 200 ~ 300
    • lambda_l1: 0 ~ 100 with step = 5
    • lambda_l2: 0 ~ 100 with step = 5
    • min_gain_to_split: 0 ~ 15
    • bagging_fraction: 0.2 ~ 0.95 with step = 0.1
    • feature_fraction: 0.2 ~ 0.95 with step = 0.1
def lgbm_objective(trial, X, y):
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [10000]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step = 20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step = 100),
        "max_bin": trial.suggest_int("max_bin", 200, 300),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step = 5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step = 5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.95, step = 0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.95, step = 0.1
        ),
        "silent": 1,
        "verbose": -1
    }
    
    cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
    cv_scores = np.empty(5)
    
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model = lgb.LGBMClassifier(objective = "multiclass", num_class = 3, **param_grid, n_jobs = -1)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            #eval_metric = accuracy_score,
            early_stopping_rounds = 100,
            # callbacks=[
            #     LightGBMPruningCallback(trial, accuracy_score)
            # ],  # Add a pruning callback
            verbose = -1
        )
        preds = model.predict(X_test)
        accuracy = np.mean(y_test == preds)
        cv_scores[idx] = accuracy

    return np.mean(cv_scores)
lgbm_study = optuna.create_study(direction = "maximize", study_name = "LightGBM Classifier")
func = lambda trial: lgbm_objective(trial, X_all_train_std, y_all_train_encd)
lgbm_study.optimize(func, n_trials = 100)
  • Best parameters are as follows:
lgbm_study.best_params
{'n_estimators': 10000,
 'learning_rate': 0.29341244351241397,
 'num_leaves': 1560,
 'max_depth': 12,
 'min_data_in_leaf': 1800,
 'max_bin': 205,
 'lambda_l1': 20,
 'lambda_l2': 0,
 'min_gain_to_split': 12.558014144849205,
 'bagging_fraction': 0.8,
 'bagging_freq': 1,
 'feature_fraction': 0.30000000000000004}
lgb_best = lgb.LGBMClassifier(**lgbm_study.best_params, n_jobs = -1)
lgb_best.fit(X_all_train_std, y_all_train_encd)
lgb_best_pred = lgb_best.predict(X_all_test_std)
lgbm_best_accuracy = np.mean(lgb_best_pred == y_all_test_encd)
print("Accuracy before tuning the hyperparameters: ", result_accuracy[result_accuracy.model_name == "LightGBM"]["All Variables"].values[0])
print("Accuracy after tuning the hyperparameters: ", lgbm_best_accuracy * 100)
Accuracy before tuning the hyperparameters:  51.164
Accuracy after tuning the hyperparameters:  52.003052270125906
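Accuracy alone hides how well-calibrated the model's confidence is; log_loss (imported above but unused) scores the predicted class probabilities directly and penalizes confident mistakes. A toy sketch; in the notebook the probabilities would come from lgb_best.predict_proba(X_all_test_std):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy three-class probabilities (rows sum to 1) against true labels.
y_true = [0, 1, 2, 2]
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.1, 0.3, 0.6],
])

# Mean negative log-probability assigned to the true class.
print(round(log_loss(y_true, proba), 4))  # 0.5179
```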
fig, axes = plt.subplots(1, 2, figsize = (15, 5))

# confusion matrix for the lgbm with default hyperparameters

lgbm_default_display = ConfusionMatrixDisplay(confusion_matrix = lgbm_default_cm, 
                                              display_labels = le.inverse_transform(lgbm_default.classes_))

# confusion matrix for the lgbm with the best hyperparameters

lgbm_tuned_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                                 le.inverse_transform(lgb_best_pred))
lgbm_best_display = ConfusionMatrixDisplay(confusion_matrix = lgbm_tuned_cm, 
                                           display_labels = le.inverse_transform(lgb_best.classes_))

lgbm_default_display.plot(ax = axes[0])
axes[0].set_title("LightGBM before tuning", fontsize = 15)

lgbm_best_display.plot(ax = axes[1])
axes[1].set_title("LightGBM after tuning", fontsize = 15)

plt.tight_layout()

png

print("< LightGBM before tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(lgbm_default_pred)))

print("")
print("< LightGBM after tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(lgb_best_pred)))
< LightGBM before tuning >

              precision    recall  f1-score   support

    away_win       0.49      0.48      0.49       801
        draw       0.30      0.07      0.11       659
    home_win       0.54      0.79      0.64      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.41      2621
weighted avg       0.46      0.51      0.46      2621


< LightGBM after tuning >

              precision    recall  f1-score   support

    away_win       0.49      0.51      0.50       801
        draw       0.00      0.00      0.00       659
    home_win       0.53      0.82      0.65      1161

    accuracy                           0.52      2621
   macro avg       0.34      0.44      0.38      2621
weighted avg       0.39      0.52      0.44      2621
  • Results for away_win and home_win have improved, but the tuned LightGBM also fails to predict any draws.

5. Feature importance

  • Let’s check the feature importance of the tuned random forest model using permutation importance.
rf_best_params = {
    'n_estimators': 300,
    'max_depth': 5,
    'max_features': 'sqrt',
    'min_samples_leaf': 7,
    'min_samples_split': 4
}
rf_best = RandomForestClassifier(**rf_best_params)
rf_best.fit(X_all_train_std, y_all_train_encd)
rf_best_pred = rf_best.predict(X_all_test_std)
rf_best_accuracy = np.mean(rf_best_pred == y_all_test_encd)
result = permutation_importance(
    rf_best, X_all_test_std, y_all_test_encd, n_repeats=10, random_state=42, n_jobs=-1
)
rf_feature_imp_permutation =  pd.DataFrame(sorted(zip(result.importances_mean, col_names)), columns=['Value','Feature'])
plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data = rf_feature_imp_permutation.sort_values("Value", ascending = False).head(50))
plt.title('Random Forest Features')
plt.tight_layout()
plt.show()

png

  • The plot above shows the top 50 most important features for predicting match results.
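permutation_importance also reports the per-repeat standard deviation (importances_std), which helps separate genuinely informative features from permutation noise. A self-contained toy sketch; the 2×std threshold is a common heuristic, not part of the original analysis:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only the first feature carries signal.
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Keep features whose mean importance clearly exceeds its permutation noise.
informative = result.importances_mean > 2 * result.importances_std
print(informative)
```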

  • Let’s compare the distribution of the feature importance between different variable sets.

    • Variable set 1: player attributes PC features
    • Variable set 2: betting information features
    • Variable set 3: team attributes features
    • Variable set 4: goal and win percentage rolling features
    • Variable set 5: each team’s Elo rating
player_attr_pc_vars = df_match_player_attr_pcs.columns
bet_stat_vars = df_match_betting_stat.columns
team_attr_vars = df_match_team_num_attr.columns
team_rolling_vars = df_team_win_goal_rolling_features.columns
elo_vars = df_match_elo.columns

rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(player_attr_pc_vars), "feature_set"] = f"Player attribute PC variables (#: {len(player_attr_pc_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(bet_stat_vars), "feature_set"] = f"Betting odds statistics variables (#: {len(bet_stat_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(team_attr_vars), "feature_set"] = f"Team attribute variables (#: {len(team_attr_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(team_rolling_vars), "feature_set"] = f"Team's recent average goal and win percentage variables (#: {len(team_rolling_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(elo_vars), "feature_set"] = f"Team's recent Elo variables (#: {len(elo_vars) - 1})"

plt.figure(figsize = (12, 5))
sns.boxplot(data = rf_feature_imp_permutation, x = "Value", y = "feature_set")
plt.xlabel("Feature importance", fontsize = 12)
plt.ylabel("Feature sets", fontsize = 12)
plt.title("Feature importance distribution from different feature sets", fontsize = 15)
Text(0.5, 1.0, 'Feature importance distribution from different feature sets')

png

  • The betting odds statistics variables show the highest importance among the feature sets.
  • The team attribute variables have the lowest importance.
  • The remaining three variable sets show similar importance.

5.1. Betting odds statistics variables

  • Betting odds statistics can be subdivided into:
    • home win, away win, and draw
    • mean, max, min, std
betting_importance = rf_feature_imp_permutation[rf_feature_imp_permutation.feature_set == "Betting odds statistics variables (#: 13)"]
betting_importance["home_away"] = betting_importance.Feature.str.split("_").str[0]
betting_importance["statistics"] = betting_importance.Feature.str.split("_").str[2]

betting_importance.sort_values("Value", ascending = False)
Value Feature feature_set home_away statistics
234 0.004159 H_odd_mean Betting odds statistics variables (#: 13) H mean
233 0.003892 A_odd_mean Betting odds statistics variables (#: 13) A mean
232 0.003853 A_odd_max Betting odds statistics variables (#: 13) A max
231 0.003014 H_odd_max Betting odds statistics variables (#: 13) H max
230 0.002251 H_odd_min Betting odds statistics variables (#: 13) H min
229 0.001831 H_odd_std Betting odds statistics variables (#: 13) H std
228 0.001831 A_odd_min Betting odds statistics variables (#: 13) A min
227 0.001145 A_odd_std Betting odds statistics variables (#: 13) A std
226 0.001106 D_odd_std Betting odds statistics variables (#: 13) D std
224 0.000954 D_odd_max Betting odds statistics variables (#: 13) D max
221 0.000801 D_odd_min Betting odds statistics variables (#: 13) D min
204 0.000496 D_odd_mean Betting odds statistics variables (#: 13) D mean
  • The variables for home and away wins were important, while the draw-related variables were not.
  • Among the summary statistics, the mean and max were the most important.
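The home/away/statistic breakdown above can also be summarized with a groupby, mirroring what section 5.2 does for the Elo variables. A sketch on a toy importance table with the same naming scheme (the values here are illustrative, not the real ones):

```python
import pandas as pd

# Toy importance table using the document's H/A/D + statistic naming scheme.
betting_importance = pd.DataFrame({
    "Feature": ["H_odd_mean", "A_odd_mean", "D_odd_mean",
                "H_odd_std", "A_odd_std", "D_odd_std"],
    "Value": [0.0042, 0.0039, 0.0005, 0.0018, 0.0011, 0.0011],
})
betting_importance["home_away"] = betting_importance.Feature.str.split("_").str[0]
betting_importance["statistics"] = betting_importance.Feature.str.split("_").str[2]

# Average importance per outcome side and per summary statistic.
print(betting_importance.groupby("home_away").Value.mean())
print(betting_importance.groupby("statistics").Value.mean())
```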

5.2. Team’s recent Elo variables

  • Elo rating related variables can be subdivided into:
    • home team, away team
    • average, std
    • recent 1, 3, 5, 10, 20, 30, 60, 90 matches
elo_importance = rf_feature_imp_permutation[rf_feature_imp_permutation.feature_set == "Team's recent Elo variables (#: 34)"]

elo_importance["home_away"] = elo_importance.Feature.str.split("_").str[0]
elo_importance["statistics"] = elo_importance.Feature.str.split("_").str[2]
elo_importance["matches"] = elo_importance.Feature.str.split("_").str[5]
elo_importance.groupby("home_away").Value.mean().reset_index()
home_away Value
0 away 0.000057
1 elo 0.000458
2 home 0.000210