3 - 1) Train machine learning models to predict match results (In progress)

53 minute read

import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt 
import matplotlib.patches as mpatches
import seaborn as sns 
import missingno as msno 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_validate
from sklearn import preprocessing

from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis 
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import tree
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb

import optuna
from optuna.integration import LightGBMPruningCallback
from optuna.integration import XGBoostPruningCallback
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report

from sklearn.inspection import permutation_importance

import imblearn
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
  • We have 5 kinds of variable sets:
    • Variable set 1: player attributes PC features
    • Variable set 2: betting information features
    • Variable set 3: team attributes features
    • Variable set 4: goal and win percentage rolling features
    • Variable set 5: each team’s Elo rating
df_match_basic = pd.read_csv("../data/df_match_basic.csv")
df_match_player_attr_pcs = pd.read_csv("../data/df_match_player_attr_pcs.csv")
df_match_betting_stat = pd.read_csv("../data/df_match_betting_stat.csv")
df_match_team_num_attr = pd.read_csv("../data/df_match_team_num_attr.csv")
df_team_win_goal_rolling_features = pd.read_csv("../data/df_team_win_goal_rolling_features.csv")
df_match_elo = pd.read_csv("../data/df_match_elo.csv")
  • First, let’s predict the match result with each variable set separately and compare the results.

1. Train test split

  • Use the last season (2015/2016) as the test set and all earlier seasons as the train set.
target_bool = (df_match_basic.match_api_id.isin(df_match_player_attr_pcs.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_match_betting_stat.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_match_team_num_attr.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_team_win_goal_rolling_features.match_api_id)) & \
              (df_match_basic.match_api_id.isin(df_match_elo.match_api_id))
    
target_matches = df_match_basic[target_bool]
test_match_api_id = target_matches[target_matches.season == "2015/2016"].match_api_id
train_match_api_id = target_matches[target_matches.season != "2015/2016"].match_api_id
print(len(train_match_api_id), len(test_match_api_id))
16988 2621
  • There are 16,988 matches in the train set and 2,621 matches in the test set.

2. Baseline accuracy

df_match_basic[df_match_basic.match_api_id.isin(train_match_api_id)].match_result.value_counts()
home_win    7840
away_win    4855
draw        4293
Name: match_result, dtype: int64
sns.countplot(x = df_match_basic[df_match_basic.match_api_id.isin(train_match_api_id)].match_result)
<AxesSubplot:xlabel='match_result', ylabel='count'>

(figure: count plot of match results in the train set)

  • About 46% (7,840 / 16,988) of the training matches were won by the home team.
  • That is, if we predicted every match as a home-team win, we would achieve about 46% accuracy; this serves as our baseline accuracy.
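This majority-class baseline can also be computed with scikit-learn's `DummyClassifier`; a minimal sketch on toy labels (the arrays below are illustrative, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels mimicking the class imbalance (illustrative only).
y = np.array(["home_win"] * 46 + ["away_win"] * 29 + ["draw"] * 25)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit ("home_win").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # fraction of the majority class, 0.46 here
```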

  • Let’s check the baseline accuracy in the test data set.
df_match_basic[df_match_basic.match_api_id.isin(test_match_api_id)].match_result.value_counts()
home_win    1161
away_win     801
draw         659
Name: match_result, dtype: int64
sns.countplot(x = df_match_basic[df_match_basic.match_api_id.isin(test_match_api_id)].match_result)
<AxesSubplot:xlabel='match_result', ylabel='count'>

(figure: count plot of match results in the test set)

  • The baseline accuracy on the test set is about 44% (1,161 / 2,621).

3. Modeling with all variable sets

3.1. Variable set 1: Player attributes PC features

df_match_player_attr_pcs = df_match_player_attr_pcs.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
df_match_player_attr_pcs = df_match_player_attr_pcs.set_index("match_api_id")
df_match_player_attr_pcs

(wide DataFrame preview truncated: 21,374 rows × 111 columns — 5 principal-component features for each of the 22 starters, columns home_player_1_pc_1 … away_player_11_pc_5, plus the match_result label)

  • Split the table into train and test set.
train_bool = df_match_player_attr_pcs.reset_index().match_api_id.isin(train_match_api_id)
test_bool = df_match_player_attr_pcs.reset_index().match_api_id.isin(test_match_api_id) 
df_pc_train = df_match_player_attr_pcs.reset_index()[train_bool].set_index("match_api_id")
df_pc_test = df_match_player_attr_pcs.reset_index()[test_bool].set_index("match_api_id")
X_pc_train = df_pc_train.drop("match_result", axis = 1)
y_pc_train = df_pc_train.match_result 

X_pc_test = df_pc_test.drop("match_result", axis = 1)
y_pc_test = df_pc_test.match_result
print("Number of train data: ", X_pc_train.shape[0])
print("Number of test data: ", X_pc_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

le = preprocessing.LabelEncoder()
le.fit(y_pc_train)

y_pc_train_encd = le.transform(y_pc_train)
y_pc_test_encd = le.transform(y_pc_test)
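`LabelEncoder` assigns integer labels in alphabetical order of the class names, so here away_win → 0, draw → 1, home_win → 2. A quick sketch of that mapping:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["home_win", "away_win", "draw", "home_win"])

# Classes are sorted alphabetically before being numbered.
print(le.classes_)                         # ['away_win' 'draw' 'home_win']
print(le.transform(["home_win", "draw"]))  # [2 1]
```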
names = ["KNN", 
         "LDA", 
         "QDA", 
         "Naive Bayes",
         "Logistic regression",
         "Decision tree", 
         "Random Forest",  
         "AdaBoost",
         "XGBoost",
         "Polynomial kernel SVM",
         "Radial kernel SVM",
         "GBM",
         "LightGBM"
        ]

classifiers = [
    KNeighborsClassifier(3),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    GaussianNB(), 
    LogisticRegression(),
    DecisionTreeClassifier(random_state = 42),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    xgb.XGBClassifier(),
    SVC(kernel = "poly", probability = True),
    SVC(kernel = "rbf", probability = True),
    GradientBoostingClassifier(),
    lgb.LGBMClassifier()
    ]
result_accuracy = pd.DataFrame(names, columns = ["model_name"])
# baseline accuracy

y_pred_baseline = le.transform(["home_win"])
baseline_accuracy = np.mean(y_pred_baseline == y_pc_test_encd)
result_accuracy["Baseline accuracy"] = baseline_accuracy
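Note that `y_pred_baseline` is a length-1 array, so the comparison above relies on NumPy broadcasting: the single encoded label is compared element-wise against every test label. A small sketch of the same mechanism:

```python
import numpy as np

y_pred = np.array([2])            # a single predicted label
y_true = np.array([2, 0, 1, 2])   # four true labels

# The length-1 array broadcasts across y_true, giving one boolean per element.
matches = (y_pred == y_true)
print(matches)          # [ True False False  True]
print(matches.mean())   # 0.5
```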
y_pred_dict = {}

for name, clf in zip(names, classifiers):
    clf.fit(X_pc_train, y_pc_train_encd)
    
    y_pred = clf.predict(X_pc_test)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_pc_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Player PC Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables
0 KNN 0.442961 41.320
1 LDA 0.442961 49.943
2 QDA 0.442961 42.350
3 Naive Bayes 0.442961 47.272
4 Logistic regression 0.442961 50.172
5 Decision tree 0.442961 38.077
6 Random Forest 0.442961 49.790
7 AdaBoost 0.442961 50.362
8 XGBoost 0.442961 47.196
9 Polynomial kernel SVM 0.442961 50.515
10 Radial kernel SVM 0.442961 50.897
11 GBM 0.442961 50.820
12 LightGBM 0.442961 49.866
  • Except for the KNN, QDA, and decision tree models, all models beat the baseline accuracy when using the player PC features.

3.2. Variable set 2: Betting information features

df_match_betting_stat = df_match_betting_stat.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_match_betting_stat.match_api_id.isin(train_match_api_id)
test_bool = df_match_betting_stat.match_api_id.isin(test_match_api_id) 
df_bet_train = df_match_betting_stat[train_bool].set_index("match_api_id")
df_bet_test = df_match_betting_stat[test_bool].set_index("match_api_id")
X_bet_train = df_bet_train.drop("match_result", axis = 1)
y_bet_train = df_bet_train.match_result 

X_bet_test = df_bet_test.drop("match_result", axis = 1)
y_bet_test = df_bet_test.match_result
print("Number of train data: ", X_bet_train.shape[0])
print("Number of test data: ", X_bet_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess variables before modeling.
# Transform the match_result class to numerical labels.
y_bet_train_encd = le.transform(y_bet_train)
y_bet_test_encd = le.transform(y_bet_test)
# Standardize features
col_names = X_bet_train.columns

scaler = StandardScaler()
scaler.fit(X_bet_train)

X_bet_train_std = pd.DataFrame(scaler.transform(X_bet_train), columns = col_names)
X_bet_test_std = pd.DataFrame(scaler.transform(X_bet_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_bet_train_std, y_bet_train_encd)
    
    y_pred = clf.predict(X_bet_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_bet_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Betting Statistics Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables
0 KNN 0.442961 41.320 44.220
1 LDA 0.442961 49.943 51.469
2 QDA 0.442961 42.350 40.557
3 Naive Bayes 0.442961 47.272 42.198
4 Logistic regression 0.442961 50.172 51.545
5 Decision tree 0.442961 38.077 43.304
6 Random Forest 0.442961 49.790 48.798
7 AdaBoost 0.442961 50.362 51.698
8 XGBoost 0.442961 47.196 50.439
9 Polynomial kernel SVM 0.442961 50.515 48.760
10 Radial kernel SVM 0.442961 50.897 51.393
11 GBM 0.442961 50.820 52.079
12 LightGBM 0.442961 49.866 52.041
  • Except for the KNN, QDA, Naive Bayes, and decision tree models, all models beat the baseline accuracy when using the betting statistics features.
  • Overall, accuracies are higher with the betting information than with the player PC features.

3.3 Variable set 3: Team attribute features

df_match_team_num_attr = df_match_team_num_attr.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_match_team_num_attr.match_api_id.isin(train_match_api_id)
test_bool = df_match_team_num_attr.match_api_id.isin(test_match_api_id) 
df_team_train = df_match_team_num_attr[train_bool].set_index("match_api_id")
df_team_test = df_match_team_num_attr[test_bool].set_index("match_api_id")
X_team_train = df_team_train.drop("match_result", axis = 1)
y_team_train = df_team_train.match_result 

X_team_test = df_team_test.drop("match_result", axis = 1)
y_team_test = df_team_test.match_result
print("Number of train data: ", X_team_train.shape[0])
print("Number of test data: ", X_team_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.
y_team_train_encd = le.transform(y_team_train)
y_team_test_encd = le.transform(y_team_test)
# fill the missing values with 0
X_team_train.fillna(0, inplace = True)
X_team_test.fillna(0, inplace = True)
# Standardize features
col_names = X_team_train.columns

scaler = StandardScaler()
scaler.fit(X_team_train)

X_team_train_std = pd.DataFrame(scaler.transform(X_team_train), columns = col_names)
X_team_test_std = pd.DataFrame(scaler.transform(X_team_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_team_train_std, y_team_train_encd)
    
    y_pred = clf.predict(X_team_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_team_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Team attribute Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables
0 KNN 0.442961 41.320 44.220 39.412
1 LDA 0.442961 49.943 51.469 45.670
2 QDA 0.442961 42.350 40.557 45.784
3 Naive Bayes 0.442961 47.272 42.198 46.280
4 Logistic regression 0.442961 50.172 51.545 45.555
5 Decision tree 0.442961 38.077 43.304 38.001
6 Random Forest 0.442961 49.790 48.798 45.326
7 AdaBoost 0.442961 50.362 51.698 46.814
8 XGBoost 0.442961 47.196 50.439 43.037
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601
10 Radial kernel SVM 0.442961 50.897 51.393 44.868
11 GBM 0.442961 50.820 52.079 47.310
12 LightGBM 0.442961 49.866 52.041 47.119
  • Except for the KNN, XGBoost, and decision tree models, all models beat the baseline accuracy when using the team attribute features.
  • Overall, accuracies with the team attribute features are lower than with the other variable sets.

3.4. Variable set 4: Goal and win percentage rolling features

df_team_win_goal_rolling_features = df_team_win_goal_rolling_features.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_team_win_goal_rolling_features.reset_index().match_api_id.isin(train_match_api_id)
test_bool = df_team_win_goal_rolling_features.reset_index().match_api_id.isin(test_match_api_id) 
df_rolling_train = df_team_win_goal_rolling_features[train_bool].set_index("match_api_id")
df_rolling_test = df_team_win_goal_rolling_features[test_bool].set_index("match_api_id")
X_rolling_train = df_rolling_train.drop("match_result", axis = 1)
y_rolling_train = df_rolling_train.match_result 

X_rolling_test = df_rolling_test.drop("match_result", axis = 1)
y_rolling_test = df_rolling_test.match_result
print("Number of train data: ", X_rolling_train.shape[0])
print("Number of test data: ", X_rolling_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

y_rolling_train_encd = le.transform(y_rolling_train)
y_rolling_test_encd = le.transform(y_rolling_test)
# fill missing values with 0

X_rolling_train.fillna(0, inplace = True)
X_rolling_test.fillna(0, inplace = True)
# Standardize features

col_names = X_rolling_train.columns

scaler = StandardScaler()
scaler.fit(X_rolling_train)

X_rolling_train_std = pd.DataFrame(scaler.transform(X_rolling_train), columns = col_names)
X_rolling_test_std = pd.DataFrame(scaler.transform(X_rolling_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_rolling_train_std, y_rolling_train_encd)
    
    y_pred = clf.predict(X_rolling_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_rolling_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Team's goal and win percentage rolling Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables Team's goal and win percentage rolling Variables
0 KNN 0.442961 41.320 44.220 39.412 43.342
1 LDA 0.442961 49.943 51.469 45.670 49.676
2 QDA 0.442961 42.350 40.557 45.784 45.059
3 Naive Bayes 0.442961 47.272 42.198 46.280 46.929
4 Logistic regression 0.442961 50.172 51.545 45.555 49.790
5 Decision tree 0.442961 38.077 43.304 38.001 39.489
6 Random Forest 0.442961 49.790 48.798 45.326 49.447
7 AdaBoost 0.442961 50.362 51.698 46.814 50.019
8 XGBoost 0.442961 47.196 50.439 43.037 48.607
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601 48.264
10 Radial kernel SVM 0.442961 50.897 51.393 44.868 49.828
11 GBM 0.442961 50.820 52.079 47.310 49.752
12 LightGBM 0.442961 49.866 52.041 47.119 48.989
  • Except for the KNN and decision tree models, all models beat the baseline accuracy when using the goal and win percentage rolling features.
  • Overall, the models perform reasonably well with these rolling features, close to the player PC results.

3.5. Variable set 5: each team’s Elo rating

df_match_elo = df_match_elo.merge(df_match_basic[["match_api_id", "match_result"]], how = "left", on = "match_api_id")
  • Split the data into train and test set.
train_bool = df_match_elo.reset_index().match_api_id.isin(train_match_api_id)
test_bool = df_match_elo.reset_index().match_api_id.isin(test_match_api_id) 
df_elo_train = df_match_elo[train_bool].set_index("match_api_id")
df_elo_test = df_match_elo[test_bool].set_index("match_api_id")
X_elo_train = df_elo_train.drop("match_result", axis = 1)
y_elo_train = df_elo_train.match_result 

X_elo_test = df_elo_test.drop("match_result", axis = 1)
y_elo_test = df_elo_test.match_result
print("Number of train data: ", X_elo_train.shape[0])
print("Number of test data: ", X_elo_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

y_elo_train_encd = le.transform(y_elo_train)
y_elo_test_encd = le.transform(y_elo_test)
# fill missing values with 0

X_elo_train.fillna(0, inplace = True)
X_elo_test.fillna(0, inplace = True)
# Standardize features

col_names = X_elo_train.columns

scaler = StandardScaler()
scaler.fit(X_elo_train)

X_elo_train_std = pd.DataFrame(scaler.transform(X_elo_train), columns = col_names)
X_elo_test_std = pd.DataFrame(scaler.transform(X_elo_test), columns = col_names)

for name, clf in zip(names, classifiers):
    clf.fit(X_elo_train_std, y_elo_train_encd)
    
    y_pred = clf.predict(X_elo_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_elo_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "Team's Elo rating related Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables Team's goal and win percentage rolling Variables Team's Elo rating related Variables
0 KNN 0.442961 41.320 44.220 39.412 43.342 40.710
1 LDA 0.442961 49.943 51.469 45.670 49.676 50.630
2 QDA 0.442961 42.350 40.557 45.784 45.059 38.878
3 Naive Bayes 0.442961 47.272 42.198 46.280 46.929 48.607
4 Logistic regression 0.442961 50.172 51.545 45.555 49.790 50.630
5 Decision tree 0.442961 38.077 43.304 38.001 39.489 38.573
6 Random Forest 0.442961 49.790 48.798 45.326 49.447 49.142
7 AdaBoost 0.442961 50.362 51.698 46.814 50.019 51.011
8 XGBoost 0.442961 47.196 50.439 43.037 48.607 48.150
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601 48.264 48.874
10 Radial kernel SVM 0.442961 50.897 51.393 44.868 49.828 50.439
11 GBM 0.442961 50.820 52.079 47.310 49.752 50.591
12 LightGBM 0.442961 49.866 52.041 47.119 48.989 49.561
  • Except for the KNN, QDA, and decision tree models, all models achieve higher accuracy than the baseline when each team’s Elo-rating-related features are used.
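The baseline accuracy of 0.442961 is simply the share of the majority class (home wins, 1161 of 2621 test matches), i.e. the accuracy of always predicting home_win. A minimal sketch, with toy labels standing in for y_elo_test:

```python
import pandas as pd

def baseline_accuracy(y):
    """Majority-class baseline: always predict the most frequent label."""
    return y.value_counts(normalize=True).max()

# Toy labels; in the notebook this would be baseline_accuracy(y_elo_test).
y_toy = pd.Series(["home_win"] * 46 + ["away_win"] * 31 + ["draw"] * 23)
print(baseline_accuracy(y_toy))  # 0.46
```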

3.6. Use all variables

  • Merge all feature tables.
df_all = df_match_player_attr_pcs.merge(df_match_betting_stat.drop("match_result", axis = 1), how = "left", on = ["match_api_id"]) \
                                 .merge(df_match_team_num_attr.drop("match_result", axis = 1), how = "left", on = ["match_api_id"]) \
                                 .merge(df_team_win_goal_rolling_features.drop("match_result", axis = 1), how = "left", on = ["match_api_id"])  \
                                 .merge(df_match_elo.drop("match_result", axis = 1), how = "left", on = ["match_api_id"])
                                 
  • Split the data into train and test set.
train_bool = df_all.match_api_id.isin(train_match_api_id)
test_bool = df_all.match_api_id.isin(test_match_api_id) 
df_all_train = df_all[train_bool].set_index("match_api_id")
df_all_test = df_all[test_bool].set_index("match_api_id")
X_all_train = df_all_train.drop("match_result", axis = 1)
y_all_train = df_all_train.match_result 

X_all_test = df_all_test.drop("match_result", axis = 1)
y_all_test = df_all_test.match_result
print("Number of train data: ", X_all_train.shape[0])
print("Number of test data: ", X_all_test.shape[0])
Number of train data:  16988
Number of test data:  2621
  • Preprocess the data before modeling.
# Transform the match_result class to numerical labels.

y_all_train_encd = le.transform(y_all_train)
y_all_test_encd = le.transform(y_all_test)
# fill missing values with 0

X_all_train.fillna(0, inplace = True)
X_all_test.fillna(0, inplace = True)
# Standardize features

col_names = X_all_train.columns

scaler = StandardScaler()
scaler.fit(X_all_train)

X_all_train_std = pd.DataFrame(scaler.transform(X_all_train), columns = col_names)
X_all_test_std = pd.DataFrame(scaler.transform(X_all_test), columns = col_names)

  • Save the tables.
df_all.to_csv("../data/df_all.csv", index = False)

train_match_api_id.to_csv("../data/train_match_api_id.csv", index = False)
test_match_api_id.to_csv("../data/test_match_api_id.csv", index = False)

X_all_train.to_csv("../data/X_all_train.csv", index = False)
X_all_test.to_csv("../data/X_all_test.csv", index = False)

X_all_train_std.to_csv("../data/X_all_train_std.csv", index = False)
X_all_test_std.to_csv("../data/X_all_test_std.csv", index = False)

y_all_train.to_csv("../data/y_all_train.csv", index = False)
y_all_test.to_csv("../data/y_all_test.csv", index = False)
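If these tables are reloaded in a later session, a quick round-trip check guards against silent schema or dtype changes. A minimal sketch using a temporary file rather than the real ../data paths:

```python
import os
import tempfile

import pandas as pd

# Toy table; writing then reading back should preserve shape, columns, dtypes.
df = pd.DataFrame({"match_api_id": [1, 2], "feature": [0.5, 1.5]})
path = os.path.join(tempfile.mkdtemp(), "df_demo.csv")  # hypothetical path
df.to_csv(path, index=False)
df_reloaded = pd.read_csv(path)
print(df_reloaded.equals(df))  # True
```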

for name, clf in zip(names, classifiers):
    clf.fit(X_all_train_std, y_all_train_encd)
    
    y_pred = clf.predict(X_all_test_std)
    y_pred_dict[name] = y_pred
    
    accuracy = np.mean(y_pred == y_all_test_encd)
    
    result_accuracy.loc[result_accuracy.model_name == name, "All Variables"] = round(accuracy * 100, 3)
result_accuracy
model_name Baseline accuracy Player PC Variables Betting Statistics Variables Team attribute Variables Team's goal and win percentage rolling Variables Team's Elo rating related Variables All Variables
0 KNN 0.442961 41.320 44.220 39.412 43.342 40.710 43.686
1 LDA 0.442961 49.943 51.469 45.670 49.676 50.630 50.706
2 QDA 0.442961 42.350 40.557 45.784 45.059 38.878 46.051
3 Naive Bayes 0.442961 47.272 42.198 46.280 46.929 48.607 45.937
4 Logistic regression 0.442961 50.172 51.545 45.555 49.790 50.630 51.316
5 Decision tree 0.442961 38.077 43.304 38.001 39.489 38.573 41.892
6 Random Forest 0.442961 49.790 48.798 45.326 49.447 49.142 52.003
7 AdaBoost 0.442961 50.362 51.698 46.814 50.019 51.011 51.278
8 XGBoost 0.442961 47.196 50.439 43.037 48.607 48.150 49.447
9 Polynomial kernel SVM 0.442961 50.515 48.760 44.601 48.264 48.874 48.913
10 Radial kernel SVM 0.442961 50.897 51.393 44.868 49.828 50.439 51.240
11 GBM 0.442961 50.820 52.079 47.310 49.752 50.591 51.736
12 LightGBM 0.442961 49.866 52.041 47.119 48.989 49.561 51.164
  • When all variables are used, the random forest achieves the highest accuracy, 52.003%.
  • So, let’s tune the hyperparameters of the random forest.
  • Also, among the models with accuracy above 50%, the LightGBM is the fastest to tune, so let’s tune it as well.

  • Before tuning the hyperparameters, let’s check the confusion matrices of the random forest and the LightGBM.

Default Random Forest confusion matrix

rf_default = RandomForestClassifier()
rf_default.fit(X_all_train_std, y_all_train_encd)
rf_default_pred = rf_default.predict(X_all_test_std)
le.inverse_transform(y_all_test_encd)
array(['home_win', 'home_win', 'home_win', ..., 'home_win', 'draw',
       'home_win'], dtype=object)
rf_default_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                                 le.inverse_transform(rf_default_pred))
cm_display = ConfusionMatrixDisplay(confusion_matrix = rf_default_cm, 
                                    display_labels = le.inverse_transform(rf_default.classes_))
cm_display.plot();    

png

print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(rf_default_pred)))
              precision    recall  f1-score   support

    away_win       0.49      0.49      0.49       801
        draw       0.31      0.05      0.09       659
    home_win       0.53      0.77      0.63      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.40      2621
weighted avg       0.46      0.51      0.45      2621
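Because home_win dominates the test set, plain accuracy flatters a majority-leaning classifier; balanced_accuracy_score (imported at the top of the notebook but unused in the code shown) averages per-class recall instead. A toy illustration of how it penalizes an always-majority predictor; in the notebook it could be applied as balanced_accuracy_score(y_all_test_encd, rf_default_pred):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Toy labels: class 0 is the majority, and the classifier always predicts it.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0])

# Balanced accuracy = mean of per-class recalls = (1 + 0 + 0) / 3.
print(balanced_accuracy_score(y_true, y_pred))  # 0.3333...
```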

Default LightGBM confusion matrix

lgbm_default = lgb.LGBMClassifier()
lgbm_default.fit(X_all_train_std, y_all_train_encd)
lgbm_default_pred = lgbm_default.predict(X_all_test_std)
lgbm_default_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                                   le.inverse_transform(lgbm_default_pred))
cm_display = ConfusionMatrixDisplay(confusion_matrix = lgbm_default_cm, 
                                    display_labels = le.inverse_transform(lgbm_default.classes_))
cm_display.plot();    

png

print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(lgbm_default_pred)))
              precision    recall  f1-score   support

    away_win       0.49      0.48      0.49       801
        draw       0.30      0.07      0.11       659
    home_win       0.54      0.79      0.64      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.41      2621
weighted avg       0.46      0.51      0.46      2621

4. Hyperparameter tuning

4.1. Random forest

  • Candidate hyperparameters are as follows:
    • n_estimators: 100, 300, 500, 1000
    • max_depth: 3 ~ 20 with step = 2
    • max_features: sqrt, log2
    • min_samples_leaf: 1 ~ 10
    • min_samples_split: 2 ~ 10
def rf_objective(trial, X, y):
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [100, 300, 500, 1000]),
        "max_depth": trial.suggest_int("max_depth", 3, 20, step = 2),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
    }
    
    cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
    cv_scores = np.empty(5)
    
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        model = RandomForestClassifier(**param_grid, n_jobs = -1)
        model.fit(X_train, y_train)
          
        pred = model.predict(X_test)
        cv_scores[idx] = np.mean(pred == y_test)

    return np.mean(cv_scores)
rf_study = optuna.create_study(direction = "maximize", study_name = "RandomForest Classifier")
func = lambda trial: rf_objective(trial, X_all_train_std, y_all_train_encd)
rf_study.optimize(func, n_trials = 20)
  • Best parameters are as follows:
rf_study.best_params
{'n_estimators': 300,
 'max_depth': 5,
 'max_features': 'sqrt',
 'min_samples_leaf': 7,
 'min_samples_split': 4}
  • Let’s check the test accuracy with the best hyperparameter set.
rf_best = RandomForestClassifier(**rf_study.best_params)
rf_best.fit(X_all_train_std, y_all_train_encd)
rf_best_pred = rf_best.predict(X_all_test_std)
rf_best_accuracy = np.mean(rf_best_pred == y_all_test_encd)
print("Accuracy before tuning the hyperparameters: ", result_accuracy[result_accuracy.model_name == "Random Forest"]["All Variables"].values[0])
print("Accuracy after tuning the hyperparameters: ", rf_best_accuracy * 100)
Accuracy before tuning the hyperparameters:  52.003
Accuracy after tuning the hyperparameters:  52.04120564669973
  • Let’s check the confusion matrix of the tuned random forest model.
fig, axes = plt.subplots(1, 2, figsize = (15, 5))

# confusion matrix for the random forest with default hyperparameters

rf_default_display = ConfusionMatrixDisplay(confusion_matrix = rf_default_cm, 
                                            display_labels = le.inverse_transform(rf_default.classes_))

# confusion matrix for the random forest with the best hyperparameters

rf_tuned_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                               le.inverse_transform(rf_best_pred))
rf_best_display = ConfusionMatrixDisplay(confusion_matrix = rf_tuned_cm, 
                                    display_labels = le.inverse_transform(rf_best.classes_))

rf_default_display.plot(ax = axes[0])
axes[0].set_title("Random Forest before tuning", fontsize = 15)

rf_best_display.plot(ax = axes[1])
axes[1].set_title("Random Forest after tuning", fontsize = 15)

plt.tight_layout()

png

print("< Random Forest before tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(rf_default_pred)))

print("")
print("< Random Forest after tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(rf_best_pred)))
< Random Forest before tuning >

              precision    recall  f1-score   support

    away_win       0.49      0.49      0.49       801
        draw       0.31      0.05      0.09       659
    home_win       0.53      0.77      0.63      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.40      2621
weighted avg       0.46      0.51      0.45      2621


< Random Forest after tuning >

              precision    recall  f1-score   support

    away_win       0.50      0.50      0.50       801
        draw       0.00      0.00      0.00       659
    home_win       0.53      0.83      0.65      1161

    accuracy                           0.52      2621
   macro avg       0.34      0.44      0.38      2621
weighted avg       0.39      0.52      0.44      2621
  • Results for away_win and home_win have improved, but the tuned model no longer predicts any draws.

4.2. LightGBM

  • Candidate hyperparameters are as follows:
    • learning_rate: 0.01 ~ 0.3
    • num_leaves: 20 ~ 3000 with step = 20
    • max_depth: 3 ~ 12
    • min_data_in_leaf: 200 ~ 10000 with step = 100
    • max_bin: 200 ~ 300
    • lambda_l1: 0 ~ 100 with step = 5
    • lambda_l2: 0 ~ 100 with step = 5
    • min_gain_to_split: 0 ~ 15
    • bagging_fraction: 0.2 ~ 0.95 with step = 0.1
    • feature_fraction: 0.2 ~ 0.95 with step = 0.1
def lgbm_objective(trial, X, y):
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [10000]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step = 20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 200, 10000, step = 100),
        "max_bin": trial.suggest_int("max_bin", 200, 300),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step = 5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step = 5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": trial.suggest_float(
            "bagging_fraction", 0.2, 0.95, step = 0.1
        ),
        "bagging_freq": trial.suggest_categorical("bagging_freq", [1]),
        "feature_fraction": trial.suggest_float(
            "feature_fraction", 0.2, 0.95, step = 0.1
        ),
        "silent": 1,
        "verbose": -1
    }
    
    cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
    cv_scores = np.empty(5)
    
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model = lgb.LGBMClassifier(objective = "multiclass", num_class = 3, **param_grid, n_jobs = -1)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            #eval_metric = accuracy_score,
            early_stopping_rounds = 100,
            # callbacks=[
            #     LightGBMPruningCallback(trial, accuracy_score)
            # ],  # Add a pruning callback
            verbose = -1
        )
        preds = model.predict(X_test)
        accuracy = np.mean(y_test == preds)
        cv_scores[idx] = accuracy

    return np.mean(cv_scores)
lgbm_study = optuna.create_study(direction = "maximize", study_name = "LightGBM Classifier")
func = lambda trial: lgbm_objective(trial, X_all_train_std, y_all_train_encd)
lgbm_study.optimize(func, n_trials = 100)
  • Best parameters are as follows:
lgbm_study.best_params
{'n_estimators': 10000,
 'learning_rate': 0.29341244351241397,
 'num_leaves': 1560,
 'max_depth': 12,
 'min_data_in_leaf': 1800,
 'max_bin': 205,
 'lambda_l1': 20,
 'lambda_l2': 0,
 'min_gain_to_split': 12.558014144849205,
 'bagging_fraction': 0.8,
 'bagging_freq': 1,
 'feature_fraction': 0.30000000000000004}
lgb_best = lgb.LGBMClassifier(**lgbm_study.best_params, n_jobs = -1)
lgb_best.fit(X_all_train_std, y_all_train_encd)
lgb_best_pred = lgb_best.predict(X_all_test_std)
lgbm_best_accuracy = np.mean(lgb_best_pred == y_all_test_encd)
print("Accuracy before tuning the hyperparameters: ", result_accuracy[result_accuracy.model_name == "LightGBM"]["All Variables"].values[0])
print("Accuracy after tuning the hyperparameters: ", lgbm_best_accuracy * 100)
Accuracy before tuning the hyperparameters:  51.164
Accuracy after tuning the hyperparameters:  52.003052270125906
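Accuracy alone hides how well-calibrated the model's confidence is; log_loss (imported above but unused) scores the predicted class probabilities directly and penalizes confident mistakes. A toy sketch; in the notebook the probabilities would come from lgb_best.predict_proba(X_all_test_std):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy three-class probabilities (rows sum to 1) against true labels.
y_true = [0, 1, 2, 2]
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.1, 0.3, 0.6],
])

# Mean negative log-probability assigned to the true class.
print(round(log_loss(y_true, proba), 4))  # 0.5179
```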
fig, axes = plt.subplots(1, 2, figsize = (15, 5))

# confusion matrix for the lgbm with default hyperparameters

lgbm_default_display = ConfusionMatrixDisplay(confusion_matrix = lgbm_default_cm, 
                                              display_labels = le.inverse_transform(lgbm_default.classes_))

# confusion matrix for the lgbm with the best hyperparameters

lgbm_tuned_cm = confusion_matrix(le.inverse_transform(y_all_test_encd), 
                                 le.inverse_transform(lgb_best_pred))
lgbm_best_display = ConfusionMatrixDisplay(confusion_matrix = lgbm_tuned_cm, 
                                           display_labels = le.inverse_transform(lgb_best.classes_))

lgbm_default_display.plot(ax = axes[0])
axes[0].set_title("LightGBM before tuning", fontsize = 15)

lgbm_best_display.plot(ax = axes[1])
axes[1].set_title("LightGBM after tuning", fontsize = 15)

plt.tight_layout()

png

print("< LightGBM before tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(lgbm_default_pred)))

print("")
print("< LightGBM after tuning >")
print("")
print(classification_report(le.inverse_transform(y_all_test_encd), 
                            le.inverse_transform(lgb_best_pred)))
< LightGBM before tuning >

              precision    recall  f1-score   support

    away_win       0.49      0.48      0.49       801
        draw       0.30      0.07      0.11       659
    home_win       0.54      0.79      0.64      1161

    accuracy                           0.51      2621
   macro avg       0.44      0.44      0.41      2621
weighted avg       0.46      0.51      0.46      2621


< LightGBM after tuning >

              precision    recall  f1-score   support

    away_win       0.49      0.51      0.50       801
        draw       0.00      0.00      0.00       659
    home_win       0.53      0.82      0.65      1161

    accuracy                           0.52      2621
   macro avg       0.34      0.44      0.38      2621
weighted avg       0.39      0.52      0.44      2621
  • Results for away_win and home_win have improved, but the tuned LightGBM also fails to predict any draws.

5. Feature importance

  • Let’s check the feature importance of the tuned random forest model using permutation importance.
rf_best_params = {
    'n_estimators': 300,
    'max_depth': 5,
    'max_features': 'sqrt',
    'min_samples_leaf': 7,
    'min_samples_split': 4
}
rf_best = RandomForestClassifier(**rf_best_params)
rf_best.fit(X_all_train_std, y_all_train_encd)
rf_best_pred = rf_best.predict(X_all_test_std)
rf_best_accuracy = np.mean(rf_best_pred == y_all_test_encd)
result = permutation_importance(
    rf_best, X_all_test_std, y_all_test_encd, n_repeats=10, random_state=42, n_jobs=-1
)
rf_feature_imp_permutation =  pd.DataFrame(sorted(zip(result.importances_mean, col_names)), columns=['Value','Feature'])
plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data = rf_feature_imp_permutation.sort_values("Value", ascending = False).head(50))
plt.title('Random Forest Features')
plt.tight_layout()
plt.show()

png

  • The plot above shows the top 50 most important features for predicting match results.
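permutation_importance also reports the per-repeat standard deviation (importances_std), which helps separate genuinely informative features from permutation noise. A self-contained toy sketch; the 2×std threshold is a common heuristic, not part of the original analysis:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only the first feature carries signal.
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Keep features whose mean importance clearly exceeds its permutation noise.
informative = result.importances_mean > 2 * result.importances_std
print(informative)
```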

  • Let’s compare the distribution of the feature importance between different variable sets.

    • Variable set 1: player attributes PC features
    • Variable set 2: betting information features
    • Variable set 3: team attributes features
    • Variable set 4: goal and win percentage rolling features
    • Variable set 5: each team’s Elo rating
player_attr_pc_vars = df_match_player_attr_pcs.columns
bet_stat_vars = df_match_betting_stat.columns
team_attr_vars = df_match_team_num_attr.columns
team_rolling_vars = df_team_win_goal_rolling_features.columns
elo_vars = df_match_elo.columns

rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(player_attr_pc_vars), "feature_set"] = f"Player attribute PC variables (#: {len(player_attr_pc_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(bet_stat_vars), "feature_set"] = f"Betting odds statistics variables (#: {len(bet_stat_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(team_attr_vars), "feature_set"] = f"Team attribute variables (#: {len(team_attr_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(team_rolling_vars), "feature_set"] = f"Team's recent average goal and win percentage variables (#: {len(team_rolling_vars) - 1})"
rf_feature_imp_permutation.loc[rf_feature_imp_permutation.Feature.isin(elo_vars), "feature_set"] = f"Team's recent Elo variables (#: {len(elo_vars) - 1})"

plt.figure(figsize = (12, 5))
sns.boxplot(data = rf_feature_imp_permutation, x = "Value", y = "feature_set")
plt.xlabel("Feature importance", fontsize = 12)
plt.ylabel("Feature sets", fontsize = 12)
plt.title("Feature importance distribution from different feature sets", fontsize = 15)
Text(0.5, 1.0, 'Feature importance distribution from different feature sets')

png

  • The betting odds statistics variables show the highest importance among the feature sets.
  • The team attribute variables have the lowest importance.
  • The remaining three variable sets show similar importance.

5.1. Betting odds statistics variables

  • Betting odds statistics can be subdivided into:
    • home win, away win, and draw
    • mean, max, min, std
betting_importance = rf_feature_imp_permutation[rf_feature_imp_permutation.feature_set == "Betting odds statistics variables (#: 13)"]
betting_importance["home_away"] = betting_importance.Feature.str.split("_").str[0]
betting_importance["statistics"] = betting_importance.Feature.str.split("_").str[2]

betting_importance.sort_values("Value", ascending = False)
Value Feature feature_set home_away statistics
234 0.004159 H_odd_mean Betting odds statistics variables (#: 13) H mean
233 0.003892 A_odd_mean Betting odds statistics variables (#: 13) A mean
232 0.003853 A_odd_max Betting odds statistics variables (#: 13) A max
231 0.003014 H_odd_max Betting odds statistics variables (#: 13) H max
230 0.002251 H_odd_min Betting odds statistics variables (#: 13) H min
229 0.001831 H_odd_std Betting odds statistics variables (#: 13) H std
228 0.001831 A_odd_min Betting odds statistics variables (#: 13) A min
227 0.001145 A_odd_std Betting odds statistics variables (#: 13) A std
226 0.001106 D_odd_std Betting odds statistics variables (#: 13) D std
224 0.000954 D_odd_max Betting odds statistics variables (#: 13) D max
221 0.000801 D_odd_min Betting odds statistics variables (#: 13) D min
204 0.000496 D_odd_mean Betting odds statistics variables (#: 13) D mean
  • The variables for home and away wins were important, while the draw-related variables were not.
  • Among the summary statistics, the mean and max were the most important.
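The home/away/statistic breakdown above can also be summarized with a groupby, mirroring what section 5.2 does for the Elo variables. A sketch on a toy importance table with the same naming scheme (the values here are illustrative, not the real ones):

```python
import pandas as pd

# Toy importance table using the document's H/A/D + statistic naming scheme.
betting_importance = pd.DataFrame({
    "Feature": ["H_odd_mean", "A_odd_mean", "D_odd_mean",
                "H_odd_std", "A_odd_std", "D_odd_std"],
    "Value": [0.0042, 0.0039, 0.0005, 0.0018, 0.0011, 0.0011],
})
betting_importance["home_away"] = betting_importance.Feature.str.split("_").str[0]
betting_importance["statistics"] = betting_importance.Feature.str.split("_").str[2]

# Average importance per outcome side and per summary statistic.
print(betting_importance.groupby("home_away").Value.mean())
print(betting_importance.groupby("statistics").Value.mean())
```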

5.2. Team’s recent Elo variables

  • Elo rating related variables can be subdivided into:
    • home team, away team
    • average, std
    • recent 1, 3, 5, 10, 20, 30, 60, 90 matches
elo_importance = rf_feature_imp_permutation[rf_feature_imp_permutation.feature_set == "Team's recent Elo variables (#: 34)"]

elo_importance["home_away"] = elo_importance.Feature.str.split("_").str[0]
elo_importance["statistics"] = elo_importance.Feature.str.split("_").str[2]
elo_importance["matches"] = elo_importance.Feature.str.split("_").str[5]
elo_importance.groupby("home_away").Value.mean().reset_index()
home_away Value
0 away 0.000057
1 elo 0.000458
2 home 0.000210