HW2. LDA, QDA, Naive Bayes
Topics: EDA (Exploratory Data Analysis), LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), Naive Bayes
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
Problem: LDA, QDA, Naive Bayes
In this problem, you will develop models to predict the wine type based on the Wine data set.
(a) EDA
Explore the data graphically in order to investigate the association between Type and the other features. Which of the other features seem most likely to be useful in predicting Type? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
df_train = pd.read_csv("wine_train.csv")
df_test = pd.read_csv("wine_test.csv")
print(df_train.shape, df_test.shape)
(123, 14) (55, 14)
- train data: n = 123
- test data: n = 55
df_train.Type.value_counts()
2 49
1 41
3 33
Name: Type, dtype: int64
df_train.dtypes
Type int64
Alcohol float64
Malic float64
Ash float64
Alcalinity float64
Magnesium int64
Phenols float64
Flavanoids float64
Nonflavanoids float64
Proanthocyanins float64
Color float64
Hue float64
Dilution float64
Proline int64
dtype: object
- Response (dependent) variable: Type
  - There are 3 types (labeled 1, 2, 3).
  - The classes are reasonably balanced: 41 observations of type 1, 49 of type 2, and 33 of type 3.
- Predictor (independent) variables: the remaining 13 variables
  - All of them are numeric.
The response variable Type is stored as an integer, so let's convert it to a string (categorical) type.
df_train["Type"] = df_train["Type"].astype("string")
df_test["Type"] = df_test["Type"].astype("string")
df_train.isna().sum()
Type 0
Alcohol 0
Malic 0
Ash 0
Alcalinity 0
Magnesium 0
Phenols 0
Flavanoids 0
Nonflavanoids 0
Proanthocyanins 0
Color 0
Hue 0
Dilution 0
Proline 0
dtype: int64
df_test.isna().sum()
Type 0
Alcohol 0
Malic 0
Ash 0
Alcalinity 0
Magnesium 0
Phenols 0
Flavanoids 0
Nonflavanoids 0
Proanthocyanins 0
Color 0
Hue 0
Dilution 0
Proline 0
dtype: int64
There are no missing values in either the train or the test data.
df_train.describe()
| | Type | Alcohol | Malic | Ash | Alcalinity | Magnesium | Phenols | Flavanoids | Nonflavanoids | Proanthocyanins | Color | Hue | Dilution | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 | 123.000000 |
| mean | 1.934959 | 13.045285 | 2.387154 | 2.377398 | 19.604065 | 99.105691 | 2.293496 | 2.040163 | 0.362033 | 1.572439 | 5.122276 | 0.949561 | 2.614146 | 745.341463 |
| std | 0.776075 | 0.817379 | 1.111320 | 0.283956 | 3.605492 | 12.958201 | 0.629254 | 1.019045 | 0.123308 | 0.556818 | 2.329248 | 0.224467 | 0.732045 | 328.719693 |
| min | 1.000000 | 11.450000 | 0.890000 | 1.360000 | 10.600000 | 78.000000 | 0.980000 | 0.340000 | 0.140000 | 0.410000 | 1.280000 | 0.540000 | 1.270000 | 278.000000 |
| 25% | 1.000000 | 12.370000 | 1.655000 | 2.225000 | 17.050000 | 88.000000 | 1.770000 | 1.095000 | 0.270000 | 1.235000 | 3.260000 | 0.775000 | 1.890000 | 495.000000 |
| 50% | 2.000000 | 13.050000 | 1.900000 | 2.380000 | 19.500000 | 97.000000 | 2.400000 | 2.110000 | 0.340000 | 1.550000 | 4.900000 | 0.960000 | 2.780000 | 650.000000 |
| 75% | 3.000000 | 13.725000 | 3.170000 | 2.600000 | 21.550000 | 106.500000 | 2.800000 | 2.895000 | 0.430000 | 1.955000 | 6.250000 | 1.120000 | 3.205000 | 1002.500000 |
| max | 3.000000 | 14.830000 | 5.800000 | 3.230000 | 30.000000 | 139.000000 | 3.880000 | 5.080000 | 0.660000 | 2.960000 | 13.000000 | 1.420000 | 4.000000 | 1547.000000 |
From the summary statistics, we can also confirm that there are no suspicious values (such as zeros or negative numbers) that might represent encoded missing values.
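As an additional check (not required by the problem), per-type averages give a quick quantitative view of which features differ across the three classes; a minimal sketch, assuming df_train is loaded as above:

# Per-class feature means: large differences between types hint at useful predictors.
group_means = df_train.groupby("Type").mean(numeric_only = True)
print(group_means[["Alcohol", "Flavanoids", "Color", "Proline"]])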
sns.pairplot(df_train, diag_kind = "kde", hue = "Type")
<seaborn.axisgrid.PairGrid at 0x7fbece8d2470>

The full pairplot is hard to read because there are so many variables, so let's focus on a few of them. In particular, the Flavanoids and Color variables seem important for classifying Type.
fig, axes = plt.subplots(3, 4, figsize = (15, 10))
column_list = ['Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols',
'Nonflavanoids', 'Proanthocyanins', 'Color', 'Hue',
'Dilution', 'Proline']
for i, col in enumerate(column_list):
    sns.scatterplot(df_train, x = col, y = "Flavanoids", hue = "Type", ax = axes[i // 4, i % 4])
plt.tight_layout()

The plots above show Flavanoids against each of the other variables.
- The types can be separated fairly well in almost every scatter plot.
- In the scatter plots against Color or Proline in particular, the types are even easier to separate.
fig, axes = plt.subplots(3, 4, figsize = (15, 10))
column_list = ['Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols',
'Nonflavanoids', 'Proanthocyanins', 'Flavanoids', 'Hue',
'Dilution', 'Proline']
for i, col in enumerate(column_list):
    sns.scatterplot(df_train, x = col, y = "Color", hue = "Type", ax = axes[i // 4, i % 4])
plt.tight_layout()

The plots above show Color against each of the other variables.
- In the scatter plots against Flavanoids, Proline, or Dilution, the types separate clearly.
fig, axes = plt.subplots(3, 5, figsize = (15, 10))
column_list = ['Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols',
'Flavanoids', 'Nonflavanoids', 'Proanthocyanins', 'Color', 'Hue',
'Dilution', 'Proline']
for i, col in enumerate(column_list):
    sns.boxplot(df_train, x = "Type", y = col, ax = axes[i // 5, i % 5])
plt.tight_layout()

The boxplots make it easier than the scatter plots to see which variables are important for classifying Type. We can see clear distributional differences between the types in the Alcohol, Phenols, Flavanoids, Proanthocyanins, and Color variables.
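To complement the visual impression from the boxplots, one rough way to rank the features is a one-way ANOVA F-statistic per feature; a minimal sketch using sklearn's f_classif (an extra check, not part of the assignment):

from sklearn.feature_selection import f_classif

# One-way ANOVA F-statistic per feature: larger values suggest stronger
# mean differences across the three wine types.
F, p = f_classif(df_train.drop(columns = "Type"), df_train["Type"])
feature_scores = pd.Series(F, index = df_train.columns.drop("Type")).sort_values(ascending = False)
print(feature_scores.head(6))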
(b) Modeling
Perform LDA, QDA and Naive Bayes on the training data in order to predict Type. What are the test errors of the models obtained?
X_train = df_train.drop(["Type"], axis = 1)
y_train = df_train["Type"]
X_test = df_test.drop(["Type"], axis = 1)
y_test = df_test["Type"]
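Each model below is evaluated the same way: fit on the training data, then compute the misclassification rate on the train and test sets. As a side sketch, this repeated computation could be wrapped in a small helper (the function name is my own; the cells below keep the original step-by-step form):

def fit_and_score(model, X_tr, y_tr, X_te, y_te):
    # Fit a classifier and return (train error, test error) as misclassification rates.
    model.fit(X_tr, y_tr)
    train_err = np.mean(model.predict(X_tr) != y_tr)
    test_err = np.mean(model.predict(X_te) != y_te)
    return train_err, test_err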
LDA
model_lda = LinearDiscriminantAnalysis()
model_lda.fit(X_train, y_train)
LinearDiscriminantAnalysis()
y_train_pred_lda = model_lda.predict(X_train)
train_error_lda = np.mean(y_train_pred_lda != y_train)
y_test_pred_lda = model_lda.predict(X_test)
test_error_lda = np.mean(y_test_pred_lda != y_test)
print("Train error of LDA: ", train_error_lda)
print("Test error of LDA: ", test_error_lda)
Train error of LDA: 0.0
Test error of LDA: 0.01818181818181818
Train error of LDA is 0.0 but test error of LDA is 0.018.
cm_test_lda = confusion_matrix(y_test, y_test_pred_lda)
cm_display = ConfusionMatrixDisplay(confusion_matrix = cm_test_lda, display_labels = model_lda.classes_)
cm_display.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fbea0fcc6d8>

Only one observation from Type 3 is misclassified as Type 2.
X_train_lda = model_lda.transform(X_train)
plt.figure()
colors = ['red', 'green', 'blue']
for color, class_name in zip(colors, model_lda.classes_):
    mask = np.array(y_train == class_name)
    plt.scatter(X_train_lda[mask, 0], X_train_lda[mask, 1],
                alpha = .8, color = color, label = f"Type: {class_name}")
plt.legend(loc='best')
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.show()

We can see that the data points projected onto the 2-D plane by LDA are well separated by type.
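As a side note, the share of between-class variance captured by each discriminant direction can be read from the fitted sklearn LDA model; a minimal sketch:

# Proportion of between-class variance explained by LD1 and LD2
# (with 3 classes there are at most 2 discriminant directions).
print(model_lda.explained_variance_ratio_)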
QDA
model_qda = QuadraticDiscriminantAnalysis()
model_qda.fit(X_train, y_train)
QuadraticDiscriminantAnalysis()
y_train_pred_qda = model_qda.predict(X_train)
train_error_qda = np.mean(y_train_pred_qda != y_train)
y_test_pred_qda = model_qda.predict(X_test)
test_error_qda = np.mean(y_test_pred_qda != y_test)
print("Train error of QDA: ", train_error_qda)
print("Test error of QDA: ", test_error_qda)
Train error of QDA: 0.0
Test error of QDA: 0.03636363636363636
The train error of QDA is 0.0, but the test error is 0.036, which is higher than that of LDA.
cm_test_qda = confusion_matrix(y_test, y_test_pred_qda)
cm_display = ConfusionMatrixDisplay(confusion_matrix = cm_test_qda, display_labels = model_qda.classes_)
cm_display.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fbec0168828>

One observation from Type 1 is misclassified as Type 2, and one observation from Type 2 is misclassified as Type 1.
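QDA estimates a separate covariance matrix for each class, which can be unstable with only about 30-50 training observations per class and 13 features. If this were a concern, sklearn's QDA offers a reg_param argument that shrinks each class covariance toward a scaled identity; a hedged sketch (the value 0.1 is an arbitrary illustration, not a tuned choice):

# Regularized QDA: reg_param shrinks each class covariance estimate.
model_qda_reg = QuadraticDiscriminantAnalysis(reg_param = 0.1)  # illustrative value
model_qda_reg.fit(X_train, y_train)
print("Test error with reg_param = 0.1:", np.mean(model_qda_reg.predict(X_test) != y_test))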
Naive Bayes
model_nb = GaussianNB()
model_nb.fit(X_train, y_train)
GaussianNB()
y_train_pred_nb = model_nb.predict(X_train)
train_error_nb = np.mean(y_train_pred_nb != y_train)
y_test_pred_nb = model_nb.predict(X_test)
test_error_nb = np.mean(y_test_pred_nb != y_test)
print("Train error of Naive Bayes: ", train_error_nb)
print("Test error of Naive Bayes: ", test_error_nb)
Train error of Naive Bayes: 0.016260162601626018
Test error of Naive Bayes: 0.03636363636363636
- The train error is 0.016, higher than that of LDA and QDA.
- The test error is 0.036, higher than that of LDA and the same as QDA.
cm_test_nb = confusion_matrix(y_test, y_test_pred_nb)
cm_display = ConfusionMatrixDisplay(confusion_matrix = cm_test_nb, display_labels = model_nb.classes_)
cm_display.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fbebf5bcb00>

Two observations from Type 1 are misclassified as Type 2.
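GaussianNB has a single smoothing hyperparameter, var_smoothing, which adds a fraction of the largest feature variance to every per-feature variance for numerical stability. If desired, it could be tuned by cross-validation on the training set; a minimal sketch (the grid below is an arbitrary illustrative choice):

from sklearn.model_selection import GridSearchCV

# Cross-validate the var_smoothing parameter of GaussianNB on the training data.
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}  # illustrative grid
search = GridSearchCV(GaussianNB(), param_grid, cv = 5)
search.fit(X_train, y_train)
print(search.best_params_, "CV error:", 1 - search.best_score_)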
Compare the results
train_error = [train_error_lda, train_error_qda, train_error_nb]
test_error = [test_error_lda, test_error_qda, test_error_nb]
pd.DataFrame({"Train Error" : train_error,
"Test Error" : test_error}, index = ["LDA", "QDA", "Naive Bayes"])
| | Train Error | Test Error |
|---|---|---|
| LDA | 0.00000 | 0.018182 |
| QDA | 0.00000 | 0.036364 |
| Naive Bayes | 0.01626 | 0.036364 |
- The train error is 0 for both LDA and QDA; the train error of Naive Bayes is 0.016.
- The test error of LDA is the lowest of the three methods at 0.018. QDA and Naive Bayes have the same test error, 0.036 (a per-class breakdown for the LDA predictions is sketched below).
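For a per-class view of the best model, sklearn's classification_report gives precision, recall, and F1 for each type; a minimal sketch for the LDA test predictions computed above:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 for the LDA predictions on the test set.
print(classification_report(y_test, y_test_pred_lda))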