HW1. KNN

20 minute read

Topics: EDA (Exploratory Data Analysis), KNN


import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Problem: KNN

Use the k-nearest neighbor classifier on the diabetes dataset. In particular, consider k = 1, 2, ..., 30. Show both the training and test errors for each choice and report your findings. Hint: 1) Note the predictor/input variables are of different units and scales, so standardization is necessary before applying the KNN method. 2) Exploratory data analysis (EDA) is an important step to get familiar with the data and better understand it. Make sure you do it with every data analysis project.
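To see why standardization matters for KNN, here is a small toy illustration (the two patients below are made up, not rows from the dataset): without scaling, a wide-ranging feature such as Insulin dominates the Euclidean distance that KNN relies on.

import numpy as np

# Two hypothetical patients described by (Glucose, BMI, Insulin).
a = np.array([100.0, 33.0, 200.0])
b = np.array([140.0, 25.0, 50.0])

d = np.sqrt(np.sum((a - b) ** 2))
print(round(d, 1))                      # ~155.4
print(round(abs(a[2] - b[2]) / d, 2))   # the Insulin gap alone is ~96% of the total distance

After standardization, each feature contributes on a comparable scale, so no single choice of units dominates the neighbor search.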

1. EDA

Missing values

df_train = pd.read_csv("diabetes_train.csv")
df_test = pd.read_csv("diabetes_test.csv")
df_train.isna().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
df_test.isna().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

It looks like there are no explicitly missing (NaN) values in either the train or the test data.

df_train.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 428.000000 | 428.000000 | 428.000000 | 428.000000 | 428.000000 | 428.000000 | 428.000000 | 428.000000 | 428.000000 |
| mean | 4.053738 | 124.752336 | 69.672897 | 20.072430 | 84.067757 | 32.549065 | 0.502308 | 34.329439 | 0.478972 |
| std | 3.538270 | 32.822486 | 19.135913 | 16.555687 | 124.157706 | 7.669440 | 0.347304 | 11.926841 | 0.500142 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 103.000000 | 64.000000 | 0.000000 | 0.000000 | 27.875000 | 0.253750 | 25.000000 | 0.000000 |
| 50% | 3.000000 | 123.000000 | 72.000000 | 22.500000 | 0.000000 | 32.500000 | 0.402500 | 31.000000 | 0.000000 |
| 75% | 7.000000 | 145.000000 | 80.000000 | 32.000000 | 130.000000 | 36.800000 | 0.675000 | 41.250000 | 1.000000 |
| max | 17.000000 | 199.000000 | 114.000000 | 99.000000 | 846.000000 | 59.400000 | 2.420000 | 81.000000 | 1.000000 |

However, the basic statistics show that several variables have a minimum value of 0, which is physiologically impossible: a person cannot have a glucose concentration, insulin level, blood pressure, skin thickness, or BMI of 0. Let's examine these variables more closely.

target_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

fig, axes = plt.subplots(1, len(target_columns), figsize = (15, 5))

for i, col in enumerate(target_columns):
    sns.boxplot(df_train.loc[:, col], ax = axes[i])
    axes[i].set_xlabel(col)
    axes[i].set_ylabel("Value")

plt.tight_layout()

[figure: box plots of Glucose, BloodPressure, SkinThickness, Insulin, and BMI]

  • Glucose, BloodPressure, BMI: given their distributions, a value of 0 is clearly abnormal, so we can conclude that 0 encodes a missing value in these variables.
  • SkinThickness, Insulin: these are harder to judge, so let's take a closer look below.
fig, axes = plt.subplots(1, 2, figsize = (15, 5))

# Count occurrences of each low value and plot index vs. counts directly;
# this avoids the pandas-version-dependent column names of value_counts().reset_index().
st_counts = df_train.loc[df_train["SkinThickness"] < 15, "SkinThickness"].value_counts().sort_values()
sns.barplot(x = st_counts.index, y = st_counts.values, ax = axes[0])
axes[0].set_xlabel("SkinThickness")
axes[0].set_ylabel("Count")

ins_counts = df_train.loc[df_train["Insulin"] < 50, "Insulin"].value_counts().sort_values()
sns.barplot(x = ins_counts.index, y = ins_counts.values, ax = axes[1])
axes[1].set_xlabel("Insulin")
axes[1].set_ylabel("Count")

[figure: bar plots of low-value counts for SkinThickness and Insulin]

The bar plots above show the distribution of values near zero. Compared with the non-zero data, a count spike at exactly 0 is abnormal in both variables, so we can conclude that 0 encodes a missing value here as well.

Let’s replace 0 with NaN in the Glucose, BloodPressure, SkinThickness, Insulin, and BMI variables.

target_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

for col in target_columns:
    df_train.loc[df_train[col] == 0, col] = np.nan
    df_test.loc[df_test[col] == 0, col] = np.nan
print("The number of nan values in training data:")
for col in target_columns:
    print(f"- {col}: ", np.sum(df_train[col].isna()))

print(" ")
print("The number of nan values in test data:")
for col in target_columns:
    print(f"- {col}: ", np.sum(df_test[col].isna()))
The number of nan values in training data:
- Glucose:  3
- BloodPressure:  19
- SkinThickness:  139
- Insulin:  218
- BMI:  6
 
The number of nan values in test data:
- Glucose:  1
- BloodPressure:  9
- SkinThickness:  32
- Insulin:  51
- BMI:  1
fig, axes = plt.subplots(1, 2, figsize = (15, 8))
msno.bar(df_train, ax = axes[0])
msno.bar(df_test, ax = axes[1])
axes[0].set_title("Train data", fontsize = 25)
axes[1].set_title("Test data", fontsize = 25)
plt.tight_layout()

[figure: missingno bar charts of missing values in the train and test data]

The printed counts and bar plots above show how many values are missing for each variable in the train and test data.

  • Glucose, BMI, BloodPressure: very few missing values (under 5% of the train data and under 10% of the test data).
  • SkinThickness, Insulin: many missing values; in the training data roughly a third of SkinThickness and about half of Insulin entries are missing (the snippet below computes the exact percentages).
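These percentages can be computed directly rather than read off the plots; a quick illustrative check:

# Percentage of missing values per variable.
print((df_train.isna().mean() * 100).round(1))
print((df_test.isna().mean() * 100).round(1))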
fig, axes = plt.subplots(1, 2, figsize = (20, 8))
msno.matrix(df_train, figsize = (12, 4), ax = axes[0])
msno.matrix(df_test, figsize = (12, 4), ax = axes[1])
axes[0].set_title("Train data", fontsize = 25)
axes[1].set_title("Test data", fontsize = 25)
plt.tight_layout()

[figure: missingno matrix plots of missingness patterns in the train and test data]

The matrix plots above reveal the pattern of missingness in each dataset.

  • Train data: whenever Glucose is missing, the other four variables appear to be missing as well (verified in the snippet below).
  • Test data: there is no obvious pattern of missingness among the five variables.
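That train-data pattern is easy to verify programmatically; a small sketch of one way to do it:

# Among training rows where Glucose is missing, count how many of the
# other four affected variables are missing as well.
glucose_missing = df_train[df_train["Glucose"].isna()]
print(glucose_missing[["BloodPressure", "SkinThickness", "Insulin", "BMI"]].isna().sum())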

Distribution

sns.pairplot(df_train, diag_kind = "kde", hue = "Outcome")

[figure: pairplot of the training variables, colored by Outcome]

  • The KDE plot of Glucose shows a larger separation between the Outcome = 1 and Outcome = 0 distributions than any other variable.
  • Across the pairplot, the panels involving Glucose separate Outcome = 1 from Outcome = 0 better than the other variable pairs.
df_train_boxplot = df_train.copy()

px.box(df_train_boxplot.melt(id_vars = ['Outcome'], var_name = "col"), x = "Outcome", y = 'value',
       color = 'Outcome', facet_col = 'col').update_yaxes(matches = None)
  • As in the pairplot KDEs, Glucose shows a clear difference between the Outcome = 0 and Outcome = 1 distributions.
  • For SkinThickness and Insulin, which have many missing values, there is no clear difference between the Outcome = 0 and Outcome = 1 distributions (the group medians below quantify this).
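To put rough numbers on these visual impressions, an illustrative group-wise summary:

# Median of selected variables within each Outcome group; a large gap for
# Glucose and small gaps for SkinThickness/Insulin would match the box plots.
print(df_train.groupby("Outcome")[["Glucose", "SkinThickness", "Insulin"]].median())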

Correlation

corr = df_train.corr() 
mask = np.triu(np.ones_like(corr, dtype=bool))

sns.heatmap(corr, mask = mask, annot = True, cmap = "BrBG", vmin = -1, vmax = 1, linewidths = 0.5, cbar_kws = {"shrink" : 0.5})

[figure: lower-triangle correlation heatmap of the training data]

  • Most variable pairs show positive correlations.
  • There are fairly high correlations between Insulin & Glucose, BMI & SkinThickness, and Age & Pregnancies.
  • Outcome has its highest correlation, 0.47, with Glucose, consistent with the observations above (extracted in the snippet below).
  • SkinThickness and Insulin have relatively low correlations with Outcome.
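The Outcome correlations can be read off the corr matrix computed above; a short illustrative extraction:

# Correlation of each predictor with Outcome, sorted in descending order.
print(corr["Outcome"].drop("Outcome").sort_values(ascending = False).round(2))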

To summarize: SkinThickness and Insulin have many missing values and carry little information about Outcome. Moreover, SkinThickness is highly correlated with BMI, and Insulin with Glucose, so the information they carry should be largely covered by BMI and Glucose respectively. Let's therefore proceed with the analysis without the SkinThickness and Insulin variables.

df_train = df_train.drop(["SkinThickness", "Insulin"], axis = 1)
df_test = df_test.drop(["SkinThickness", "Insulin"], axis = 1)

2. KNN without SkinThickness & Insulin

def my_knn(df_train, df_test, k_range, method, verbose = False, plot = False):
    # (0) Make base dataset 
    if method == "dropna":
        df_train = df_train.dropna()
        df_test = df_test.dropna()
    
    X_train = df_train.drop(['Outcome'], axis=1)
    y_train = df_train['Outcome']
    X_test = df_test.drop(['Outcome'], axis=1)
    y_test = df_test['Outcome'] 
    
    columns = X_train.columns
    
    if method == "mean":
        imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')
         
    elif method == "median":
        imp = SimpleImputer(missing_values=np.nan, strategy = 'median')
        
    elif method == "mice":
        imp = IterativeImputer(max_iter=10, random_state=42)
    
    if method != "dropna":    
        X_train = imp.fit_transform(X_train)
        X_test = imp.fit_transform(X_test)

    if verbose:
        print("Method: ", method) 
        print("Datasets are ready:")
        print("-- X_train: ", X_train.shape)
        print("-- y_train: ", y_train.shape)
        print("-- X_test: ", X_test.shape)
        print("-- y_test: ", y_test.shape)
        print("")
    
    # (1) Standardization
    scaler = StandardScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    if verbose:
        print("Data are standardized:")
        print("-- X_train: ")
        print(pd.DataFrame(X_train, columns = columns).describe())
        print("-- X_test: ")
        print(pd.DataFrame(X_test, columns = columns).describe())
        print("")

    # (2) KNN with varying k
    train_error = []
    test_error = []

    for k in k_range:
        # setup a knn classifier with k neighbors
        knn = KNeighborsClassifier(n_neighbors=k)

        # fit the model
        knn.fit(X_train, y_train)

        # train error rate
        pred_train = knn.predict(X_train)
        train_error.append(np.mean(pred_train != y_train))

        # test error rate
        pred_test = knn.predict(X_test)
        test_error.append(np.mean(pred_test != y_test))

    # (3) Plotting error rate curves
    if plot:
        plt.plot(1/k_range, train_error, label='Training Error')
        plt.plot(1/k_range, test_error, label='Test Error')
        plt.legend()
        plt.title('Training and test error rate for KNN')
        plt.xlabel('Model complexity (1 / k)')
        plt.ylabel('Error Rate')
        plt.show()
    
    return train_error, test_error
k_range = np.arange(1, 31)
fig, axes = plt.subplots(1, 4, figsize = (15, 5))

for i, method in enumerate(["dropna", "mean", "median", "mice"]):
    train_error, test_error = my_knn(df_train, df_test, k_range, method = method) 
    
    axes[i].plot(1/k_range, train_error, label = "Training Error")
    axes[i].plot(1/k_range, test_error, label = "Test Error")
    
    axes[i].set_xlabel('Model complexity (1 / k)')
    axes[i].set_ylabel('Error Rate')
    
    if method == "dropna": axes[i].set_title("Drop missing values")
    elif method == "mean": axes[i].set_title("Impute missing values with mean")
    elif method == "median": axes[i].set_title("Impute missing values with median")
    elif method == "mice": axes[i].set_title("Impute missing values with MICE")

plt.tight_layout()

[figure: training and test error curves for each missing-value strategy]

The line graphs above show the training and test errors for each way of handling missing values. The minimum test error rate, about 0.24, occurs at 1/k ≈ 0.2 (i.e., k = 5) when missing values are imputed with the mean.
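The exact minima can also be pulled out of the error lists rather than read off the plots; a follow-up sketch using the my_knn function above:

# Report the best k and its test error for each missing-value strategy.
for method in ["dropna", "mean", "median", "mice"]:
    train_error, test_error = my_knn(df_train, df_test, k_range, method = method)
    best = int(np.argmin(test_error))
    print(f"{method}: min test error = {test_error[best]:.3f} at k = {k_range[best]}")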

3. KNN with Glucose & one other variable

As seen in the EDA above, Glucose showed a noticeable difference between the distributions of Outcome = 0 and Outcome = 1.

fig, axes = plt.subplots(1, 5, figsize = (15, 5))

target_column = ["Pregnancies", "BloodPressure", "BMI", "DiabetesPedigreeFunction", "Age"]

for i, col in enumerate(target_column):
    sns.scatterplot(data = df_train[["Outcome", "Glucose", col]], x = "Glucose", y = col, hue = "Outcome", ax = axes[i])
plt.tight_layout()

[figure: scatter plots of Glucose against each remaining variable, colored by Outcome]

The plots above are scatter plots of Glucose against each of the other variables. In several of them the Outcome = 0 and Outcome = 1 groups can be distinguished by eye to some extent. So let's apply KNN using only Glucose and one other variable and compare the results.

fig, axes = plt.subplots(4, 5, figsize = (15, 10))

for i, method in enumerate(["dropna", "mean", "median", "mice"]):
    if method == "dropna": y_label = "Drop NA"
    elif method == "mean": y_label = "Impute by mean"
    elif method == "median": y_label = "Impute by median"
    elif method == "mice": y_label = "Impute by MICE"
    
    axes[i, 0].set_ylabel(y_label, fontsize = 15)
    
    for j, col in enumerate(["Pregnancies", "BloodPressure", "BMI", "DiabetesPedigreeFunction", "Age"]):
        df_train_subset = df_train[["Outcome", "Glucose", col]]
        df_test_subset = df_test[["Outcome", "Glucose", col]]
        
        train_error, test_error = my_knn(df_train_subset, df_test_subset, k_range, method = method) 

        axes[i, j].plot(1/k_range, train_error, label = "Training Error")
        axes[i, j].plot(1/k_range, test_error, label = "Test Error")

        if i == 0: axes[i, j].set_title(f'Glucose & {col}', fontsize = 15)

plt.tight_layout()

[figure: error curves for each imputation method and Glucose-plus-one-variable pair]

As the training and test error curves above show, using only two variables (Glucose & BloodPressure, or Glucose & Age) can achieve a lower test error than using all of the remaining variables together.
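To make that comparison concrete, the minimum test error of every (imputation method, second variable) combination can be collected into one small table; an illustrative summary sketch:

# Minimum test error for each imputation method and Glucose-plus-one-variable pair.
results = {}
for method in ["dropna", "mean", "median", "mice"]:
    for col in ["Pregnancies", "BloodPressure", "BMI", "DiabetesPedigreeFunction", "Age"]:
        sub_train = df_train[["Outcome", "Glucose", col]]
        sub_test = df_test[["Outcome", "Glucose", col]]
        _, test_error = my_knn(sub_train, sub_test, k_range, method = method)
        results[(method, col)] = np.min(test_error)
print(pd.Series(results).unstack().round(3))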

4. Compare results

train_error_all, test_error_all = my_knn(df_train, df_test, k_range, method = "mean")

df_train_subset = df_train[["Outcome", "Glucose", "Age"]]
df_test_subset = df_test[["Outcome", "Glucose", "Age"]]
train_error_age, test_error_age = my_knn(df_train_subset, df_test_subset, k_range, method = "median")

df_train_subset = df_train[["Outcome", "Glucose", "BloodPressure"]]
df_test_subset = df_test[["Outcome", "Glucose", "BloodPressure"]]
train_error_bp, test_error_bp = my_knn(df_train_subset, df_test_subset, k_range, method = "mice")
 
plt.figure(figsize = (10, 5))
plt.plot(1/k_range, test_error_age, label = "Case 1: Glucose & Age / Median imputation")
plt.plot(1/k_range, test_error_bp, label = "Case 2: Glucose & BloodPressure / MICE imputation")
plt.plot(1/k_range, test_error_all, label = "Case 3: All variables / Mean imputation")
# Dashed lines mark each curve's minimum, in the matching default-cycle color.
plt.vlines(1/k_range[np.argmin(test_error_age)], 0.15, 0.4, linestyles = "dashed", colors = "C0", alpha = 0.5)
plt.vlines(1/k_range[np.argmin(test_error_bp)], 0.15, 0.4, linestyles = "dashed", colors = "C1", alpha = 0.5)
plt.vlines(1/k_range[np.argmin(test_error_all)], 0.15, 0.4, linestyles = "dashed", colors = "C2", alpha = 0.5)

plt.xlabel('Model complexity (1 / k)')
plt.ylabel('Error Rate')
plt.title("Test Errors")

plt.legend(loc = 2)




[figure: test error curves for the three cases, with minima marked]

print("When we use")
print(f"- (case 1) Glocose & Age with median imputation, we can obtain minimum teset error = {round(np.min(test_error_age), 3)} when k = {k_range[np.argmin(test_error_age)]}")
print(f"- (case 2) Glocose & BloodPressure with MICE imputation, we can obtain minimum teset error = {round(np.min(test_error_bp), 3)} when k = {k_range[np.argmin(test_error_bp)]}")
print(f"- (case 3) all variables except SkinThickness and Insulin with mean imputation, we can obtain minimum teset error = {round(np.min(test_error_all), 3)} when k = {k_range[np.argmin(test_error_all)]}")
When we use
- (case 1) Glocose & Age with median imputation, we can obtain minimum teset error = 0.194 when k = 17
- (case 2) Glocose & BloodPressure with MICE imputation, we can obtain minimum teset error = 0.204 when k = 11
- (case 3) all variables except SkinThickness and Insulin with mean imputation, we can obtain minimum teset error = 0.231 when k = 6
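As a closing check, the confusion_matrix and ConfusionMatrixDisplay imports at the top can be put to use on the best configuration (case 1). The sketch below refits it with the same imputation and scaling steps as my_knn, with k = 17 taken from the result above:

# Refit case 1 (Glucose & Age, median imputation, k = 17) and plot the
# confusion matrix of its test-set predictions.
imp = SimpleImputer(missing_values = np.nan, strategy = "median")
X_train = imp.fit_transform(df_train[["Glucose", "Age"]])
X_test = imp.transform(df_test[["Glucose", "Age"]])

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors = 17).fit(X_train, df_train["Outcome"])
cm = confusion_matrix(df_test["Outcome"], knn.predict(X_test))
ConfusionMatrixDisplay(cm).plot()
plt.show()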