HW7. Machine Learning 2: Classification
Topics: K-NN, Linear SVM, RBF SVM, Gaussian process classifier, Decision tree classifier, Random forest classifier, Neural network, AdaBoost classifier, Gaussian naive Bayes classifier, PCA, t-SNE
This is, perhaps, one of the most exciting homework assignments that you have encountered in this course!
You are going to try your hand at a Kaggle competition to predict Titanic survivorship. (Recall that we’ve played with Titanic data earlier in this course – this data set is slightly different.)
(NOTE: if you prefer to not submit your work to the Kaggle competition that’s fine – just contact Chris via email (cteplovs@umich.edu) and we will work out an alternative.)
To start with, make sure you have a Kaggle account, then navigate to the Titanic: Machine Learning from Disaster project page.
We’ll view the introductory video together in class.
The basic steps for this assignment are outlined in the video:
- Accept the rules and join the competition
- Download the data (from the data tab of the competition page)
- Understand the problem
- EDA (Exploratory Data Analysis)
- Train, tune, and ensemble (!) your machine learning models
- Upload your prediction as a submission on Kaggle and receive an accuracy score
Additionally, you will:
- Upload your final notebook to Canvas and report your best accuracy score.
Note that class grades are not entirely dependent on your accuracy score.
All models that achieve 75% accuracy will receive full points for
the accuracy component of this assignment.
Rubric:
- (20 points) EDA
- (60 points) Train, tune, and ensemble machine learning models
- (10 points) Accuracy score based on Kaggle submission report (or alternative, see NOTE above).
- (10 points) PEP-8, grammar, spelling, style, etc.
Some additional notes:
- If you use another notebook, code, or approach, be sure to reference the original work. (Note that we recommend you study existing Kaggle notebooks before starting your own work.)
- You can help each other but in the end you must submit your own work, both to Kaggle and to Canvas.
Some additional resources:
- “ensemble” your models with a VotingClassifier (a minimal sketch appears below, just before the imports)
- a good primer on feature engineering
- There are a lot of good notebooks to study (check the number of upvotes to help guide your exploration)
(and don’t cheat)
One final note: Your submission should be a self-contained notebook that is NOT based on this one. Studying the existing Kaggle competition notebooks should give you a sense of what makes a “good” notebook.
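For reference, here is a minimal, self-contained sketch of the “tune and ensemble” step: one base model is tuned with GridSearchCV, and several models are then combined with a VotingClassifier. The make_classification data is only a placeholder; in the assignment you would fit on your engineered Titanic features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # placeholder data, not the Titanic set
# tune one base model with cross-validated grid search
knn_grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
knn_grid.fit(X, y)
# soft voting averages the predicted class probabilities of the member models
ensemble = VotingClassifier(
    estimators=[("knn", knn_grid.best_estimator_),
                ("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft")
ensemble.fit(X, y)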
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
import plotly.graph_objects as gp
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import statistics
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process.kernels import DotProduct, Matern, RationalQuadratic, WhiteKernel
import warnings
warnings.filterwarnings("ignore")
sns.set(style = "darkgrid")
1. EDA
titanic = pd.read_csv("./data/titanic_train.csv")
titanic.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
There are 12 columns in total, but since PassengerId is just an identifier, let’s drop the column.
titanic_id = titanic["PassengerId"]
titanic.drop("PassengerId", axis = 1, inplace = True)
titanic.head()
| Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
There are 10 independent variables and 1 dependent variable.
- Dependent variable: Survived
- Independent variables:
- Pclass: Ticket class. (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name
- Sex
- Age
- SibSp: # of siblings / spouses aboard the Titanic
- Parch: # of parents / children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
And we can categorize independent variables into 3 different categories:
- Ticket related columns: Pclass, Ticket, Fare, Cabin, Embarked
- Person related columns: Name, Sex, Age
- Family related columns: SibSp, Parch
1.0. Change column names
For convenience, let’s change the column names.
- Dependent variable: Survived -> is_survived
- Independent variables:
- Pclass: Ticket class. (1 = 1st, 2 = 2nd, 3 = 3rd) -> p_class
- Name -> name
- Sex -> sex
- Age -> age
- SibSp: # of siblings / spouses aboard the Titanic -> num_sb_sp
- Parch: # of parents / children aboard the Titanic -> num_pr_ch
- Ticket: Ticket number -> ticket_number
- Fare: Passenger fare -> ticket_fare
- Cabin: Cabin number -> cabin_number
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) -> embark_port
titanic.rename(columns = {"Survived" : "is_survived",
"Pclass" : "p_class",
"Name" : "name",
"Sex" : "sex",
"Age" : "age",
"SibSp" : "num_sb_sp",
"Parch" : "num_pr_ch",
"Ticket" : "ticket_number",
"Fare" : "ticket_fare",
"Cabin" : "cabin_number",
"Embarked" : "embark_port"}, inplace = True)
titanic.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1.1. Check missing values
np.sum(titanic.isnull(), axis = 0).sort_values(ascending = False)
cabin_number 687
age 177
embark_port 2
is_survived 0
p_class 0
name 0
sex 0
num_sb_sp 0
num_pr_ch 0
ticket_number 0
ticket_fare 0
dtype: int64
- Most rows (687 of 891) are missing a cabin_number value.
- age is the next most missing column.
- embark_port only has 2 missing values.
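The same summary can be expressed as percentages (a one-line alternative, shown here as a sketch):
(titanic.isnull().mean() * 100).round(1).sort_values(ascending = False)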
1.2. Ticket related columns: p_class, ticket_number, ticket_fare, cabin_number, embark_port
p_class
titanic["p_class"].value_counts()
3 491
1 216
2 184
Name: p_class, dtype: int64
- 1st class passengers: 216
- 2nd class passengers: 184
- 3rd class passengers: 491
np.sum(titanic["p_class"].isnull())
0
- There is no missing value in the p_class column
p_class ~ is_survived
Let’s check the relationship between p_class and is_survived.
p_class_is_survived = titanic.groupby(["p_class","is_survived"]).count().name.unstack().reset_index()
p_class_is_survived = p_class_is_survived.rename_axis(None, axis = 1)
p_class_is_survived["total"] = p_class_is_survived[0] + p_class_is_survived[1]
p_class_is_survived["ratio"] = np.round(p_class_is_survived[1] / p_class_is_survived.total, 2)
p_class_is_survived
| p_class | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | 1 | 80 | 136 | 216 | 0.63 |
| 1 | 2 | 97 | 87 | 184 | 0.47 |
| 2 | 3 | 372 | 119 | 491 | 0.24 |
plt.figure(figsize = (14, 8))
# bar graph for total passengers
color = "darkblue"
ax1 = sns.barplot(x = "p_class", y = "total", color = color, alpha = 0.8, \
data = p_class_is_survived)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for survived passengers
color = "lightblue"
ax2 = sns.barplot(x = "p_class", y = 1, color = color, alpha = 0.8, \
data = p_class_is_survived)
ax2.set_xlabel("Passenger class", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of survived passengers')
# ratio
plt.text(s = "63%", x = -0.05, y = 110, fontsize = 16)
plt.text(s = "47%", x = 0.95, y = 60, fontsize = 16)
plt.text(s = "24%", x = 1.95, y = 92, fontsize = 16)
plt.legend(handles=[top_bar, low_bar])
plt.show()

- Survival rate rises sharply with class: 24% in 3rd class, 47% in 2nd, and 63% in 1st.
-> Passenger class is important in predicting survival
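To put a number on this association, a chi-square test of independence can be run on the class-by-survival contingency table (a sketch; scipy is not imported above, but it is available wherever scikit-learn is installed):
from scipy.stats import chi2_contingency

# contingency table of passenger class vs. survival
contingency = pd.crosstab(titanic["p_class"], titanic["is_survived"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")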
ticket_number
np.sum(titanic["ticket_number"].isnull())
0
- There is no missing value in the ticket_number column
titanic.ticket_number
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
...
886 211536
887 112053
888 W./C. 6607
889 111369
890 370376
Name: ticket_number, Length: 891, dtype: object
- Ticket numbers take the form of an alphabetic prefix plus a number, or just a number.
-> Let’s split the ticket number into ticket_number_alphabet and ticket_number_number.
titanic.loc[titanic["ticket_number"].str.split(" ").str[1].isnull() == False, "ticket_number_alphabet"] = titanic[titanic["ticket_number"].str.split(" ").str[1].isnull() == False].ticket_number.str.split(" ").str[0]
titanic["ticket_number_alphabet"] = titanic.ticket_number_alphabet.fillna("non")
titanic.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | A/5 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | STON/O2. |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | non |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | non |
no_prefix = titanic.ticket_number_alphabet == "non"
titanic.loc[no_prefix, "ticket_number_number"] = titanic.loc[no_prefix, "ticket_number"].str.split(" ").str[0]
titanic.loc[~no_prefix, "ticket_number_number"] = titanic.loc[~no_prefix, "ticket_number"].str.split(" ").str[1]
# a few tickets contain two prefix tokens; for those, the number is the third token
three_tokens = titanic.ticket_number.str.split(" ").str[2].notnull()
titanic.loc[three_tokens, "ticket_number_number"] = titanic.loc[three_tokens, "ticket_number"].str.split(" ").str[2]
titanic[["ticket_number", "ticket_number_alphabet", "ticket_number_number"]]
| ticket_number | ticket_number_alphabet | ticket_number_number | |
|---|---|---|---|
| 0 | A/5 21171 | A/5 | 21171 |
| 1 | PC 17599 | PC | 17599 |
| 2 | STON/O2. 3101282 | STON/O2. | 3101282 |
| 3 | 113803 | non | 113803 |
| 4 | 373450 | non | 373450 |
| ... | ... | ... | ... |
| 886 | 211536 | non | 211536 |
| 887 | 112053 | non | 112053 |
| 888 | W./C. 6607 | W./C. | 6607 |
| 889 | 111369 | non | 111369 |
| 890 | 370376 | non | 370376 |
891 rows × 3 columns
Also, if we sort the data by ticket number,
titanic.sort_values("ticket_number")
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 504 | 1 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.500 | B79 | S | non | 110152 |
| 257 | 1 | 1 | Cherry, Miss. Gladys | female | 30.0 | 0 | 0 | 110152 | 86.500 | B77 | S | non | 110152 |
| 759 | 1 | 1 | Rothes, the Countess. of (Lucy Noel Martha Dye... | female | 33.0 | 0 | 0 | 110152 | 86.500 | B77 | S | non | 110152 |
| 262 | 0 | 1 | Taussig, Mr. Emil | male | 52.0 | 1 | 1 | 110413 | 79.650 | E67 | S | non | 110413 |
| 558 | 1 | 1 | Taussig, Mrs. Emil (Tillie Mandelbaum) | female | 39.0 | 1 | 1 | 110413 | 79.650 | E67 | S | non | 110413 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 0 | 3 | Harknett, Miss. Alice Phoebe | female | NaN | 0 | 0 | W./C. 6609 | 7.550 | NaN | S | W./C. | 6609 |
| 92 | 0 | 1 | Chaffee, Mr. Herbert Fuller | male | 46.0 | 1 | 0 | W.E.P. 5734 | 61.175 | E31 | S | W.E.P. | 5734 |
| 219 | 0 | 2 | Harris, Mr. Walter | male | 30.0 | 0 | 0 | W/C 14208 | 10.500 | NaN | S | W/C | 14208 |
| 540 | 1 | 1 | Crosby, Miss. Harriet R | female | 36.0 | 0 | 2 | WE/P 5735 | 71.000 | B22 | S | WE/P | 5735 |
| 745 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.000 | B22 | S | WE/P | 5735 |
891 rows × 13 columns
- then it seems that rows with the same ticket number have the same ticket fare. Let’s check this hypothesis for all rows.
merged_by_ticket_number = titanic.merge(titanic, on = "ticket_number", how = "left")
merged_by_ticket_number[merged_by_ticket_number.ticket_fare_x != merged_by_ticket_number.ticket_fare_y]
| is_survived_x | p_class_x | name_x | sex_x | age_x | num_sb_sp_x | num_pr_ch_x | ticket_number | ticket_fare_x | cabin_number_x | ... | name_y | sex_y | age_y | num_sb_sp_y | num_pr_ch_y | ticket_fare_y | cabin_number_y | embark_port_y | ticket_number_alphabet_y | ticket_number_number_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 248 | 0 | 3 | Osen, Mr. Olaf Elon | male | 16.0 | 0 | 0 | 7534 | 9.2167 | NaN | ... | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 9.8458 | NaN | S | non | 7534 |
| 1570 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | ... | Osen, Mr. Olaf Elon | male | 16.0 | 0 | 0 | 9.2167 | NaN | S | non | 7534 |
2 rows × 25 columns
- Only 2 people share a ticket number but have different ticket fares. This looks like an isolated anomaly, so we can assume that rows with the same ticket number have the same ticket fare.
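A compact cross-check of the same hypothesis (a sketch using only the columns defined above) is to count the distinct fares per ticket number:
# number of distinct fares observed per ticket number
fares_per_ticket = titanic.groupby("ticket_number")["ticket_fare"].nunique()
fares_per_ticket[fares_per_ticket > 1]  # only ticket 7534 is expected to appear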
If we check some cases that have the same ticket number,
merged_by_ticket_number[["name_x", "name_y", "ticket_fare_x", "ticket_fare_y", "ticket_number"]].sort_values(["ticket_number", "name_x"]).head(20)
| name_x | name_y | ticket_fare_x | ticket_fare_y | ticket_number | |
|---|---|---|---|---|---|
| 460 | Cherry, Miss. Gladys | Cherry, Miss. Gladys | 86.50 | 86.50 | 110152 |
| 461 | Cherry, Miss. Gladys | Maioni, Miss. Roberta | 86.50 | 86.50 | 110152 |
| 462 | Cherry, Miss. Gladys | Rothes, the Countess. of (Lucy Noel Martha Dye... | 86.50 | 86.50 | 110152 |
| 889 | Maioni, Miss. Roberta | Cherry, Miss. Gladys | 86.50 | 86.50 | 110152 |
| 890 | Maioni, Miss. Roberta | Maioni, Miss. Roberta | 86.50 | 86.50 | 110152 |
| 891 | Maioni, Miss. Roberta | Rothes, the Countess. of (Lucy Noel Martha Dye... | 86.50 | 86.50 | 110152 |
| 1345 | Rothes, the Countess. of (Lucy Noel Martha Dye... | Cherry, Miss. Gladys | 86.50 | 86.50 | 110152 |
| 1346 | Rothes, the Countess. of (Lucy Noel Martha Dye... | Maioni, Miss. Roberta | 86.50 | 86.50 | 110152 |
| 1347 | Rothes, the Countess. of (Lucy Noel Martha Dye... | Rothes, the Countess. of (Lucy Noel Martha Dye... | 86.50 | 86.50 | 110152 |
| 1026 | Taussig, Miss. Ruth | Taussig, Mr. Emil | 79.65 | 79.65 | 110413 |
| 1027 | Taussig, Miss. Ruth | Taussig, Mrs. Emil (Tillie Mandelbaum) | 79.65 | 79.65 | 110413 |
| 1028 | Taussig, Miss. Ruth | Taussig, Miss. Ruth | 79.65 | 79.65 | 110413 |
| 473 | Taussig, Mr. Emil | Taussig, Mr. Emil | 79.65 | 79.65 | 110413 |
| 474 | Taussig, Mr. Emil | Taussig, Mrs. Emil (Tillie Mandelbaum) | 79.65 | 79.65 | 110413 |
| 475 | Taussig, Mr. Emil | Taussig, Miss. Ruth | 79.65 | 79.65 | 110413 |
| 987 | Taussig, Mrs. Emil (Tillie Mandelbaum) | Taussig, Mr. Emil | 79.65 | 79.65 | 110413 |
| 988 | Taussig, Mrs. Emil (Tillie Mandelbaum) | Taussig, Mrs. Emil (Tillie Mandelbaum) | 79.65 | 79.65 | 110413 |
| 989 | Taussig, Mrs. Emil (Tillie Mandelbaum) | Taussig, Miss. Ruth | 79.65 | 79.65 | 110413 |
| 842 | Clifford, Mr. George Quincy | Porter, Mr. Walter Chamberlain | 52.00 | 52.00 | 110465 |
| 843 | Clifford, Mr. George Quincy | Clifford, Mr. George Quincy | 52.00 | 52.00 | 110465 |
- We can infer that people who share a ticket number are companions who were traveling together. If the companions are not family members, this information is not captured in the num_sb_sp and num_pr_ch columns.
-> Let’s make a new column that counts how many companions each passenger had, based on ticket number.
titanic = titanic.merge(titanic["ticket_number"].value_counts().reset_index().rename(columns = {"index" : "ticket_number", "ticket_number" : "num_cmp_by_ticket"}), \
on = "ticket_number", how = "left")
titanic["num_cmp_by_ticket"] = titanic["num_cmp_by_ticket"] - 1 # only one passenger with no companion has to have value 0
titanic.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp_by_ticket | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | A/5 | 21171 | 0 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC | 17599 | 0 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | STON/O2. | 3101282 | 0 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | non | 113803 | 1 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | non | 373450 | 0 |
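The same companion count can be computed more directly with a groupby transform; this sketch verifies that it matches the merge-based column above:
# companions per ticket = size of the ticket group minus the passenger themselves
alt_num_cmp = titanic.groupby("ticket_number")["ticket_number"].transform("count") - 1
assert (alt_num_cmp == titanic["num_cmp_by_ticket"]).all()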
- But this can differ from the sum of num_sb_sp and num_pr_ch.
-> Let’s calculate num_cmp_by_sb_sp_pr_ch and compare it to num_cmp_by_ticket.
titanic["num_cmp_by_sb_sp_pr_ch"] = titanic["num_sb_sp"] + titanic["num_pr_ch"]
titanic[titanic.num_cmp_by_ticket != titanic.num_cmp_by_sb_sp_pr_ch]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp_by_ticket | num_cmp_by_sb_sp_pr_ch | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | A/5 | 21171 | 0 | 1 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC | 17599 | 0 | 1 |
| 7 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | non | 349909 | 3 | 4 |
| 10 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | PP | 9549 | 1 | 2 |
| 16 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q | non | 382652 | 4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 866 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C | SC/PARIS | 2149 | 0 | 1 |
| 871 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S | non | 11751 | 1 | 2 |
| 876 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S | non | 7534 | 1 | 0 |
| 885 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q | non | 382652 | 4 | 5 |
| 888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S | W./C. | 6607 | 1 | 3 |
288 rows × 15 columns
- There are 288 cases where the number of companions based on ticket number is different from the number of companions based on num_sb_sp and num_pr_ch.
-> Let’s make num_cmp = max(num_cmp_by_ticket, num_cmp_by_sb_sp_pr_ch)
titanic["num_cmp"] = titanic[["num_cmp_by_ticket", "num_cmp_by_sb_sp_pr_ch"]].max(axis = 1)
titanic.drop(["num_cmp_by_ticket", "num_cmp_by_sb_sp_pr_ch"], axis = 1, inplace = True)
titanic.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | A/5 | 21171 | 1 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC | 17599 | 1 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | STON/O2. | 3101282 | 0 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | non | 113803 | 1 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | non | 373450 | 0 |
titanic.shape
(891, 14)
ticket_number_alphabet
len(titanic.ticket_number_alphabet.unique())
43
titanic.ticket_number_alphabet.value_counts()
non 665
PC 60
C.A. 27
STON/O 12
A/5 10
W./C. 9
CA. 8
SOTON/O.Q. 8
SOTON/OQ 7
A/5. 7
CA 6
STON/O2. 6
C 5
F.C.C. 5
S.O.C. 5
SC/PARIS 5
SC/Paris 4
S.O./P.P. 3
PP 3
A/4. 3
A/4 3
SC/AH 3
A./5. 2
SOTON/O2 2
A.5. 2
WE/P 2
S.C./PARIS 2
P/PP 2
F.C. 1
SC 1
S.W./PP 1
A/S 1
Fa 1
SCO/W 1
SW/PP 1
W/C 1
S.C./A.4. 1
S.O.P. 1
A4. 1
W.E.P. 1
SO/C 1
S.P. 1
C.A./SOTON 1
Name: ticket_number_alphabet, dtype: int64
- There are 42 distinct prefixes, and most ticket numbers have no prefix at all. The prefixes are hard to interpret, and they show no clear relationship with other columns.
-> Do not use ticket_number_alphabet
ticket_number_number ~ p_class, is_survived
Let’s check if there is a relationship between ticket_number_number and p_class
plt.figure(figsize = (14, 8))
sns.boxplot(data = pd.concat([titanic.p_class, titanic[titanic.ticket_number_number != "LINE"].ticket_number_number.astype("int32")], axis = 1), x = "p_class", y = "ticket_number_number")
plt.xlabel("Passenger class", fontsize = 14)
plt.ylabel("Ticket number part", fontsize = 14)
plt.show()

- It is hard to find a relationship between ticket_number_number and p_class.
Let’s check if there is a relationship between ticket_number_number and is_survived
plt.figure(figsize = (14, 8))
sns.boxplot(data = pd.concat([titanic.is_survived, titanic[titanic.ticket_number_number != "LINE"].ticket_number_number.astype("int32")], axis = 1), x = "is_survived", y = "ticket_number_number")
plt.xlabel("Is survived", fontsize = 14)
plt.ylabel("Ticket number part", fontsize = 14)
plt.show()

- It is hard to find a relationship between ticket_number_number and is_survived either.
-> Do not use ticket_number_number
ticket_fare
np.sum(titanic["ticket_fare"].isnull())
0
- There is no missing value in the ticket_fare column
titanic["ticket_fare"].describe()
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: ticket_fare, dtype: float64
plt.figure(figsize = (14, 8))
sns.histplot(titanic.ticket_fare)
plt.xlabel("Ticket fare", fontsize = 14)
plt.ylabel("Count", fontsize = 14)
plt.show()

- Minimum value is 0 and maximum value is 512.33.
- 75% of ticket fares are under 31.
- These numerical values and the histogram show that the standard deviation is very large.
-> Let’s check the histogram of ticket fares below the 95th percentile.
ticket_fares_95 = np.percentile(titanic["ticket_fare"], 95)
ticket_fares_95
112.07915
fig, (ax_box, ax_hist) = plt.subplots(2, sharex = True, gridspec_kw = {"height_ratios": (.2, .8)}, figsize = (10, 7))
sns.boxplot(x = titanic.ticket_fare, ax = ax_box, showfliers = False)
sns.histplot(x = titanic[titanic["ticket_fare"] <= ticket_fares_95].ticket_fare, ax = ax_hist)
plt.xlabel("Ticket Fare", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
ax_box.set_xlabel("")
plt.show()

- 50% of ticket fares are under 14, and most are under 30.
Now, let’s check outliers
titanic[titanic["ticket_fare"] == 0]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | 0 | 3 | Leonard, Mr. Lionel | male | 36.0 | 0 | 0 | LINE | 0.0 | NaN | S | non | LINE | 3 |
| 263 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | non | 112059 | 0 |
| 271 | 1 | 3 | Tornquist, Mr. William Henry | male | 25.0 | 0 | 0 | LINE | 0.0 | NaN | S | non | LINE | 3 |
| 277 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S | non | 239853 | 2 |
| 302 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.0 | 0 | 0 | LINE | 0.0 | NaN | S | non | LINE | 3 |
| 413 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S | non | 239853 | 2 |
| 466 | 0 | 2 | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S | non | 239853 | 2 |
| 481 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0 | NaN | S | non | 239854 | 0 |
| 597 | 0 | 3 | Johnson, Mr. Alfred | male | 49.0 | 0 | 0 | LINE | 0.0 | NaN | S | non | LINE | 3 |
| 633 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0 | NaN | S | non | 112052 | 0 |
| 674 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0 | NaN | S | non | 239856 | 0 |
| 732 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0 | NaN | S | non | 239855 | 0 |
| 806 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | non | 112050 | 0 |
| 815 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0 | B102 | S | non | 112058 | 0 |
| 822 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.0 | 0 | 0 | 19972 | 0.0 | NaN | S | non | 19972 | 0 |
- There are only 15 rows with a ticket fare of 0.0, so 0 likely indicates a missing value.
-> In the ticket_number EDA above, we found that rows with the same ticket number have the same ticket fare. So let’s check whether any other rows share a ticket number with these zero-fare rows but have a non-zero fare.
for tn in titanic[titanic["ticket_fare"] == 0].ticket_number.unique():
print(tn, " :", titanic[(titanic["ticket_fare"] != 0) & (titanic.ticket_number == tn)].shape)
LINE : (0, 14)
112059 : (0, 14)
239853 : (0, 14)
239854 : (0, 14)
112052 : (0, 14)
239856 : (0, 14)
239855 : (0, 14)
112050 : (0, 14)
112058 : (0, 14)
19972 : (0, 14)
- No zero-fare row shares a ticket number with a row that has a non-zero fare, so the missing fares cannot be imputed from ticket number information.
-> Since there are no missing values in p_class, let’s use p_class to impute the missing values in the ticket_fare column.
ticket_fare ~ p_class
Let’s check the relationship between ticket_fare and p_class
titanic.groupby("p_class").ticket_fare.median().reset_index().rename({"ticket_fare" : "ticket_fare_median"}, axis = 1) \
.merge(titanic.groupby("p_class").ticket_fare.mean().reset_index().rename({"ticket_fare" : "ticket_fare_mean"}, axis = 1), on = "p_class", how = "left")
| p_class | ticket_fare_median | ticket_fare_mean | |
|---|---|---|---|
| 0 | 1 | 60.2875 | 84.154687 |
| 1 | 2 | 14.2500 | 20.662183 |
| 2 | 3 | 8.0500 | 13.675550 |
plt.figure(figsize = (14, 8))
sns.boxplot(data = titanic, x = "p_class", y = "ticket_fare")
plt.xlabel("Passenger class", fontsize = 14)
plt.ylabel("Ticket fare", fontsize = 14)
plt.show()

- There are meaningful differences in the mean and median ticket fares between passenger classes.
-> Since p_class has no missing values, imputing the missing ticket fares with the per-class mean or median is a reasonable approach. The box plot above shows some extreme values in class 1, and because the mean is strongly affected by extreme values, let’s impute with the median instead.
p_class_fare_median = titanic.groupby("p_class").ticket_fare.median().reset_index().rename({"ticket_fare" : "ticket_fare_median"}, axis = 1)
p_class_fare_median
| p_class | ticket_fare_median | |
|---|---|---|
| 0 | 1 | 60.2875 |
| 1 | 2 | 14.2500 |
| 2 | 3 | 8.0500 |
# impute each zero fare with the median fare of its passenger class
for pc in [1, 2, 3]:
    class_median = p_class_fare_median.loc[p_class_fare_median.p_class == pc, "ticket_fare_median"].values[0]
    titanic.loc[(titanic.ticket_fare == 0) & (titanic.p_class == pc), "ticket_fare"] = class_median
titanic.shape
(891, 14)
titanic[titanic.ticket_fare == 0]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 14 columns
titanic[titanic.ticket_fare.isnull()]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 14 columns
ticket_fare ~ is_survived
plt.figure(figsize = (14, 8))
sns.boxplot(data = titanic, x = "is_survived", y = "ticket_fare", showfliers = False)
plt.xlabel("Is survived", fontsize = 14)
plt.ylabel("Ticket fare", fontsize = 14)
plt.show()

- Passengers who paid higher fares appear more likely to have survived.
-> Let’s cut ticket_fare into 3 categories and compare the survival rate of each category.
titanic['ticket_fare_category'] = pd.qcut(titanic.ticket_fare, 3)
titanic.ticket_fare_category.value_counts()
(8.676, 26.25] 300
(4.010999999999999, 8.676] 297
(26.25, 512.329] 294
Name: ticket_fare_category, dtype: int64
ticket_fare_category_is_survived = pd.pivot_table(index = "ticket_fare_category", columns = "is_survived", aggfunc = len, fill_value = 0, data = titanic[["ticket_fare_category", "is_survived"]])
ticket_fare_category_is_survived = ticket_fare_category_is_survived.reset_index().rename_axis(None, axis = 1)
ticket_fare_category_is_survived["total"] = ticket_fare_category_is_survived[0] + ticket_fare_category_is_survived[1]
ticket_fare_category_is_survived["ratio"] = np.round(ticket_fare_category_is_survived[1] / ticket_fare_category_is_survived.total, 2)
ticket_fare_category_is_survived
| ticket_fare_category | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | (4.010999999999999, 8.676] | 236 | 61 | 297 | 0.21 |
| 1 | (8.676, 26.25] | 180 | 120 | 300 | 0.40 |
| 2 | (26.25, 512.329] | 133 | 161 | 294 | 0.55 |
plt.figure(figsize = (14, 8))
# bar graph for total passengers
color = "darkblue"
ax1 = sns.barplot(x = "ticket_fare_category", y = "total", color = color, alpha = 0.8, \
data = ticket_fare_category_is_survived)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for survived passengers
color = "lightblue"
ax2 = sns.barplot(x = "ticket_fare_category", y = 1, color = color, alpha = 0.8, \
data = ticket_fare_category_is_survived)
ax2.set_xlabel("Ticket fare category", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of survived passengers')
# ratio
plt.text(s = "20%", x = -0.05, y = 35, fontsize = 16)
plt.text(s = "41%", x = 0.95, y = 95, fontsize = 16)
plt.text(s = "56%", x = 1.95, y = 136, fontsize = 16)
plt.legend(handles=[top_bar, low_bar])
plt.show()

- The categories that paid higher fares show higher survival rates.
-> Use the ticket_fare_category column.
cabin_number
titanic["cabin_number"]
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: cabin_number, Length: 891, dtype: object
- cabin_number takes the form of a letter followed by a number
-> Let’s extract the letter part into a new column
titanic["cabin_alphabet"] = titanic.cabin_number.str[0]
titanic["cabin_alphabet"] = titanic["cabin_alphabet"].fillna("n")
titanic.cabin_alphabet.value_counts()
n 687
C 59
B 47
D 33
E 32
A 15
F 13
G 4
T 1
Name: cabin_alphabet, dtype: int64
- n denotes a missing value. There are too many missing values in the cabin_alphabet column
-> If it is not related to is_survived or p_class, do not use cabin_number and cabin_alphabet
cabin_alphabet ~ is_survived, p_class
Let’s check the relationship between cabin_alphabet and p_class
pt = pd.pivot_table(index = "p_class", columns = "cabin_alphabet", aggfunc = len, fill_value = 0, data = titanic[["p_class", "cabin_alphabet"]])
pt
| cabin_alphabet | A | B | C | D | E | F | G | T | n |
|---|---|---|---|---|---|---|---|---|---|
| p_class | |||||||||
| 1 | 15 | 47 | 59 | 29 | 25 | 0 | 0 | 1 | 40 |
| 2 | 0 | 0 | 0 | 4 | 4 | 8 | 0 | 0 | 168 |
| 3 | 0 | 0 | 0 | 0 | 3 | 5 | 4 | 0 | 479 |
- There are too many missing values in passenger classes 2 and 3
Let’s check the relationship between cabin_alphabet and is_survived.
pt = pd.pivot_table(index = "is_survived", columns = "cabin_alphabet", aggfunc = len, fill_value = 0, data = titanic[["is_survived", "cabin_alphabet"]])
pt
| cabin_alphabet | A | B | C | D | E | F | G | T | n |
|---|---|---|---|---|---|---|---|---|---|
| is_survived | |||||||||
| 0 | 8 | 12 | 24 | 8 | 8 | 5 | 2 | 1 | 481 |
| 1 | 7 | 35 | 35 | 25 | 24 | 8 | 2 | 0 | 206 |
- Likewise, there are too many missing values in both survival groups
-> Do not use cabin_number and cabin_alphabet
embark_port
titanic.embark_port.value_counts().reset_index().rename(columns = {"index" : "embark_port", "embark_port" : "count"})
| embark_port | count | |
|---|---|---|
| 0 | S | 644 |
| 1 | C | 168 |
| 2 | Q | 77 |
plt.figure(figsize = (14, 8))
sns.barplot(data = titanic.embark_port.value_counts().reset_index().rename(columns = {"index" : "embark_port", "embark_port" : "count"}),
x = "embark_port",
y = "count")
plt.xlabel("Embark port", fontsize = 14)
plt.ylabel("Count", fontsize = 14)
plt.show()

- Most passengers boarded at Southampton (S)
- The fewest passengers boarded at Queenstown (Q)
titanic[titanic.embark_port.isnull()]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | ticket_fare_category | cabin_alphabet | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | non | 113572 | 1 | (26.25, 512.329] | B |
| 829 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | non | 113572 | 1 | (26.25, 512.329] | B |
- There are only 2 rows where embark_port is missing.
-> Fill the missing values with the most common embark_port value: S
titanic["embark_port"] = titanic["embark_port"].fillna("S")
np.sum(titanic.embark_port.isnull())
0
embark_port ~ p_class
Let’s check the relationship between embark_port and p_class
pd.pivot_table(index = "p_class", columns = "embark_port", aggfunc = len, fill_value = 0, data = titanic[titanic.ticket_fare != 0][["p_class", "embark_port"]])
| embark_port | C | Q | S |
|---|---|---|---|
| p_class | |||
| 1 | 85 | 2 | 129 |
| 2 | 17 | 3 | 164 |
| 3 | 66 | 72 | 353 |
- Almost everyone who boarded at Q was in passenger class 3.
-> Since embark_port is associated with passenger class, let’s check the relationship between embark_port and is_survived
embark_port ~ is_survived
embark_port_is_survived = titanic.groupby(["embark_port","is_survived"]).count().name.unstack().reset_index()
embark_port_is_survived = embark_port_is_survived.rename_axis(None, axis = 1)
embark_port_is_survived["total"] = embark_port_is_survived[0] + embark_port_is_survived[1]
embark_port_is_survived["ratio"] = np.round(embark_port_is_survived[1] / embark_port_is_survived.total, 2)
embark_port_is_survived
| embark_port | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | C | 75 | 93 | 168 | 0.55 |
| 1 | Q | 47 | 30 | 77 | 0.39 |
| 2 | S | 427 | 219 | 646 | 0.34 |
plt.figure(figsize = (14, 8))
# bar graph for total students
color = "darkblue"
ax1 = sns.barplot(x = "embark_port", y = "total", color = color, alpha = 0.8, \
data = embark_port_is_survived)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for students have research experience
color = "lightblue"
ax2 = sns.barplot(x = "embark_port", y = 1, color = color, alpha = 0.8, \
data = embark_port_is_survived)
ax2.set_xlabel("Port of Embarkation", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of survived passengers')
# ratio
plt.text(s = "55%", x = -0.05, y = 68, fontsize = 16)
plt.text(s = "39%", x = 0.95, y = 5, fontsize = 16)
plt.text(s = "34%", x = 1.95, y = 194, fontsize = 16)
plt.legend(handles=[top_bar, low_bar])
plt.show()

- Passengers who boarded at Q and S have similar survival rates: about 39% and 34%.
- Passengers who boarded at C have a higher survival rate than those from the other ports.
-> Use embark_port column
1.3. Person related columns: name, sex, age
name
np.sum(titanic.name.isnull())
0
- There is no missing value in the name column.
len(titanic.name.unique())
891
- All 891 rows have different name values.
-> Let’s derive more general features from name.
titanic.name.head(10)
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
7 Palsson, Master. Gosta Leonard
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 Nasser, Mrs. Nicholas (Adele Achem)
Name: name, dtype: object
- Names follow the pattern: last name, title, first name.
-> Let’s extract the title from the name
titanic["name_title"] = titanic.name.str.extract(' ([A-Za-z]+)\.', expand=False)
titanic.name_title.value_counts()
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
Countess 1
Capt 1
Ms 1
Sir 1
Lady 1
Mme 1
Don 1
Jonkheer 1
Name: name_title, dtype: int64
- Many titles appear only a few times.
-> Let’s replace the uncommon titles with a single label.
titanic['name_title'] = titanic['name_title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', \
'Rev', 'Sir', 'Jonkheer', 'Dona'], 'uncommon')
titanic['name_title'] = titanic['name_title'].replace('Mlle', 'Miss')
titanic['name_title'] = titanic['name_title'].replace('Ms', 'Miss')
titanic['name_title'] = titanic['name_title'].replace('Mme', 'Mrs')
titanic.name_title.value_counts()
Mr 517
Miss 185
Mrs 126
Master 40
uncommon 23
Name: name_title, dtype: int64
name_title ~ is_survived
Let’s check the relationship between name_title and is_survived.
name_title_is_survived = titanic.groupby(["name_title","is_survived"]).count().name.unstack().reset_index()
name_title_is_survived = name_title_is_survived.rename_axis(None, axis = 1)
name_title_is_survived["total"] = name_title_is_survived[0] + name_title_is_survived[1]
name_title_is_survived["ratio"] = np.round(name_title_is_survived[1] / name_title_is_survived.total, 2)
name_title_is_survived
| name_title | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | Master | 17 | 23 | 40 | 0.57 |
| 1 | Miss | 55 | 130 | 185 | 0.70 |
| 2 | Mr | 436 | 81 | 517 | 0.16 |
| 3 | Mrs | 26 | 100 | 126 | 0.79 |
| 4 | uncommon | 15 | 8 | 23 | 0.35 |
plt.figure(figsize = (14, 8))
# bar graph for total passengers
color = "darkblue"
ax1 = sns.barplot(x = "name_title", y = "total", color = color, alpha = 0.8, \
data = name_title_is_survived)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for survived passengers
color = "lightblue"
ax2 = sns.barplot(x = "name_title", y = 1, color = color, alpha = 0.8, \
data = name_title_is_survived)
ax2.set_xlabel("Name title", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of survived passengers')
# ratio
for n in name_title_is_survived.index:
plt.text(s = f"{np.round(name_title_is_survived.loc[n].ratio * 100, 2)} %", x = n - 0.1, y = name_title_is_survived.loc[n][1] + 5, color = "white")
plt.legend(handles=[top_bar, low_bar])
plt.show()

- Passengers with Miss and Mrs titles had overwhelmingly high survival rates of over 70%
-> Use name_title instead of name
sex
np.sum(titanic.sex.isnull())
0
- There is no missing value in the sex column.
titanic.sex.value_counts()
male 577
female 314
Name: sex, dtype: int64
plt.figure(figsize = (14, 8))
sns.barplot(data = titanic.sex.value_counts().reset_index().rename(columns = {"index" : "sex", "sex" : "count"}),
x = "sex",
y = "count")
plt.xlabel("Sex", fontsize = 14)
plt.ylabel("Count", fontsize = 14)
plt.show()

- There are about twice as many male passengers as female passengers.
sex ~ p_class
Let’s check the relationship between sex and p_class.
sex_p_class = pd.pivot_table(index = "p_class", columns = "sex", aggfunc = len, fill_value = 0, data = titanic[["p_class", "sex"]])
sex_p_class = sex_p_class.reset_index().rename_axis(None, axis = 1)
sex_p_class["total"] = sex_p_class["female"] + sex_p_class["male"]
sex_p_class["ratio"] = np.round(sex_p_class["male"] / sex_p_class.total, 2)
sex_p_class
| p_class | female | male | total | ratio | |
|---|---|---|---|---|---|
| 0 | 1 | 94 | 122 | 216 | 0.56 |
| 1 | 2 | 76 | 108 | 184 | 0.59 |
| 2 | 3 | 144 | 347 | 491 | 0.71 |
plt.figure(figsize = (14, 8))
# bar graph for total passengers
color = "darkblue"
ax1 = sns.barplot(x = "p_class", y = "total", color = color, alpha = 0.8, \
data = sex_p_class)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for male passengers
color = "lightblue"
ax2 = sns.barplot(x = "p_class", y = "male", color = color, alpha = 0.8, \
data = sex_p_class)
ax2.set_xlabel("Passenger class", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of male passengers')
# ratio
plt.text(s = "56%", x = -0.05, y = 97, fontsize = 16)
plt.text(s = "59%", x = 0.95, y = 83, fontsize = 16)
plt.text(s = "71%", x = 1.95, y = 322, fontsize = 16)
plt.legend(handles=[top_bar, low_bar])
plt.show()

- Classes 1 and 2 have similar proportions of male passengers: about 56% and 59%
- Class 3 has a higher proportion of male passengers than the other classes.
sex ~ is_survived
sex_is_survived = pd.pivot_table(index = "sex", columns = "is_survived", aggfunc = len, fill_value = 0, data = titanic[["is_survived", "sex"]])
sex_is_survived = sex_is_survived.reset_index().rename_axis(None, axis = 1)
sex_is_survived["total"] = sex_is_survived[0] + sex_is_survived[1]
sex_is_survived["ratio"] = np.round(sex_is_survived[1] / sex_is_survived.total, 2)
sex_is_survived
| sex | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | female | 81 | 233 | 314 | 0.74 |
| 1 | male | 468 | 109 | 577 | 0.19 |
plt.figure(figsize = (14, 8))
# bar graph for total passengers
color = "darkblue"
ax1 = sns.barplot(x = "sex", y = "total", color = color, alpha = 0.8, \
data = sex_is_survived)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for survived passengers
color = "lightblue"
ax2 = sns.barplot(x = "sex", y = 1, color = color, alpha = 0.8, \
data = sex_is_survived)
ax2.set_xlabel("Sex", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of survived passengers')
# ratio
plt.text(s = "74%", x = -0.05, y = 208, fontsize = 16)
plt.text(s = "19%", x = 0.95, y = 84, fontsize = 16)
plt.legend(handles=[top_bar, low_bar])
plt.show()

- We can see that sex had a huge impact on survival. Only 19% of the male passengers survived, while 74% of the female passengers survived.
-> Use sex column
age
np.sum(titanic.age.isnull())
177
- There are 177 missing values in the age column
-> We need to think about how to impute these missing values.
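Before choosing an imputation strategy, it helps to see which name titles the missing ages belong to (a one-line sketch; name_title is created in the name section above):
# number of missing ages per name title
titanic.loc[titanic.age.isnull(), "name_title"].value_counts()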
titanic.age.describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: age, dtype: float64
fig, (ax_box, ax_hist) = plt.subplots(2, sharex = True, gridspec_kw = {"height_ratios": (.2, .8)}, figsize = (10, 7))
sns.boxplot(x = titanic.age, ax = ax_box, showfliers = False)
sns.histplot(x = titanic.age, ax = ax_hist)
plt.xlabel("Age", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
ax_box.set_xlabel("")
plt.show()

- age shows a roughly bell-shaped distribution
For further analysis, let’s make a new column of age categories.
titanic["age_category"] = pd.cut(titanic.age, 10)
titanic.age_category.value_counts()
(16.336, 24.294] 177
(24.294, 32.252] 169
(32.252, 40.21] 118
(40.21, 48.168] 70
(0.34, 8.378] 54
(8.378, 16.336] 46
(48.168, 56.126] 45
(56.126, 64.084] 24
(64.084, 72.042] 9
(72.042, 80.0] 2
Name: age_category, dtype: int64
p_class ~ age, age_category
Let’s check the relationship between p_class and age.
plt.figure(figsize = (14, 8))
sns.boxplot(data = titanic, x = "p_class", y = "age")
plt.xlabel("Passenger class", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.show()

- The higher the class, the older its passengers tend to be.
Let’s see the relationship of passenger class and age by age category.
pt = pd.pivot_table(index = "p_class", columns = "age_category", aggfunc = len, fill_value = 0, data = titanic[["p_class", "age_category"]])
plt.figure(figsize = (14, 8))
sns.heatmap(pt, annot = True, cmap = 'BrBG');
plt.xlabel("Age category", fontsize = 14)
plt.ylabel("Passenger class", fontsize = 14)
plt.show()

- In class 3, the proportion of young passengers is notably high.
- In classes 1 and 2, the proportion of middle-aged passengers is high.
-> There seems to be some relationship between age and class, so let’s look at the relationship between age and survival rate.
is_survived ~ age, age_category
Let’s check the relationship between is_survived and age.
plt.figure(figsize = (14, 8))
sns.boxplot(data = titanic, x = "is_survived", y = "age")
plt.xlabel("Is survived", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.show()

- There appears to be no difference in the distribution of age between the survived and non-survived groups.
age_category_is_survived = pd.pivot_table(index = "age_category", columns = "is_survived", aggfunc = len, fill_value = 0, data = titanic[["is_survived", "age_category"]])
age_category_is_survived = age_category_is_survived.reset_index().rename_axis(None, axis = 1)
age_category_is_survived["total"] = age_category_is_survived[0] + age_category_is_survived[1]
age_category_is_survived["ratio"] = np.round(age_category_is_survived[1] / age_category_is_survived.total, 2)
age_category_is_survived
| age_category | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | (0.34, 8.378] | 18 | 36 | 54 | 0.67 |
| 1 | (8.378, 16.336] | 27 | 19 | 46 | 0.41 |
| 2 | (16.336, 24.294] | 114 | 63 | 177 | 0.36 |
| 3 | (24.294, 32.252] | 104 | 65 | 169 | 0.38 |
| 4 | (32.252, 40.21] | 66 | 52 | 118 | 0.44 |
| 5 | (40.21, 48.168] | 46 | 24 | 70 | 0.34 |
| 6 | (48.168, 56.126] | 24 | 21 | 45 | 0.47 |
| 7 | (56.126, 64.084] | 15 | 9 | 24 | 0.38 |
| 8 | (64.084, 72.042] | 9 | 0 | 9 | 0.00 |
| 9 | (72.042, 80.0] | 1 | 1 | 2 | 0.50 |
plt.figure(figsize = (14, 8))
# bar graph for total passengers
color = "darkblue"
ax1 = sns.barplot(x = "age_category", y = "total", color = color, alpha = 0.8, \
data = age_category_is_survived)
top_bar = mpatches.Patch(color = color, label = 'Num of total passengers')
# bar graph for survived passengers
color = "lightblue"
ax2 = sns.barplot(x = "age_category", y = 1, color = color, alpha = 0.8, \
data = age_category_is_survived)
ax2.set_xlabel("Age category", fontsize = 16)
ax2.set_ylabel("Number of passengers", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of survived passengers')
# ratio
for n in age_category_is_survived.index:
plt.text(s = f"{age_category_is_survived.loc[n].ratio * 100} %", x = n - 0.2, y = age_category_is_survived.loc[n][1] + 5, color = "lightblue")
plt.legend(handles=[top_bar, low_bar])
plt.show()

- The survival rate is highest for young children, drops through the teens and twenties, and then fluctuates in the older groups.
-> Use age_category and age. Since age has 177 missing values, we have to consider how to impute them.
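One possible strategy (a sketch of an option, not the choice made below for the “Master” group) is to impute each missing age with the median age of passengers sharing the same name_title, computed into a separate variable so the original column stays untouched:
# median-age-by-title imputation, computed on a copy for inspection
age_by_title = titanic.groupby("name_title")["age"].transform(lambda s: s.fillna(s.median()))
age_by_title.isnull().sum()  # 0 if every title group has at least one known age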
age ~ sex
Let’s check the relationship between age and sex.
plt.figure(figsize = (14, 8))
sns.boxplot(data = titanic, x = "sex", y = "age")
plt.xlabel("Sex", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.show()

- There appears to be no difference in the distribution of age between the male and female groups.
Let’s check the distribution in more detail using age category.
age_sex = titanic.groupby(["age_category"]).sex.value_counts().unstack().fillna(0)
age_sex = age_sex.reset_index().rename_axis(None, axis = 1)
age_sex
| age_category | female | male | |
|---|---|---|---|
| 0 | (0.34, 8.378] | 26.0 | 28.0 |
| 1 | (8.378, 16.336] | 23.0 | 23.0 |
| 2 | (16.336, 24.294] | 68.0 | 109.0 |
| 3 | (24.294, 32.252] | 52.0 | 117.0 |
| 4 | (32.252, 40.21] | 44.0 | 74.0 |
| 5 | (40.21, 48.168] | 24.0 | 46.0 |
| 6 | (48.168, 56.126] | 16.0 | 29.0 |
| 7 | (56.126, 64.084] | 8.0 | 16.0 |
| 8 | (64.084, 72.042] | 0.0 | 9.0 |
| 9 | (72.042, 80.0] | 0.0 | 2.0 |
#define plot parameters
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(20, 8))
#specify background color and plot title
#fig.patch.set_facecolor('xkcd:light grey')
plt.figtext(.5,.9,"Population Pyramid ", fontsize=16, ha='center')
#define male and female bars
axes[0].barh(range(0, len(age_sex)), age_sex.male, align='center', color='darkblue')
axes[0].set(title='Males')
axes[1].barh(range(0, len(age_sex)), age_sex.female, align='center', color='darkred')
axes[1].set(title='Females')
#adjust grid parameters and specify labels for y-axis
axes[1].grid()
axes[0].set(yticks = range(0, len(age_sex)), yticklabels = age_sex['age_category'])
axes[0].invert_xaxis()
axes[0].grid()
#display plot
plt.show()

- The data show a typical demographic distribution: very few passengers in the youngest and oldest groups and many in the middle-aged groups.
-> Let’s check the survival rate by sex and age_category.
sex ~ age ~ is_survived
male_age_sex = pd.DataFrame(age_sex.age_category.unique(), columns = ["age_category"])
male_age_sex = male_age_sex.merge(pd.pivot_table(index = "age_category", columns = "is_survived", aggfunc = len, fill_value = 0,
data = titanic[titanic.sex == "male"][["is_survived", "age_category"]]) \
.reset_index().rename_axis(None, axis = 1),
on = "age_category", how = "left")
male_age_sex["total"] = male_age_sex[0] + male_age_sex[1]
male_age_sex["ratio"] = np.round(male_age_sex[1] / male_age_sex.total, 2)
male_age_sex[[0, 1, "total", "ratio"]] = male_age_sex[[0, 1, "total", "ratio"]].fillna(0)
male_age_sex
| age_category | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | (0.34, 8.378] | 11 | 17 | 28 | 0.61 |
| 1 | (8.378, 16.336] | 18 | 5 | 23 | 0.22 |
| 2 | (16.336, 24.294] | 98 | 11 | 109 | 0.10 |
| 3 | (24.294, 32.252] | 88 | 29 | 117 | 0.25 |
| 4 | (32.252, 40.21] | 61 | 13 | 74 | 0.18 |
| 5 | (40.21, 48.168] | 37 | 9 | 46 | 0.20 |
| 6 | (48.168, 56.126] | 23 | 6 | 29 | 0.21 |
| 7 | (56.126, 64.084] | 14 | 2 | 16 | 0.12 |
| 8 | (64.084, 72.042] | 9 | 0 | 9 | 0.00 |
| 9 | (72.042, 80.0] | 1 | 1 | 2 | 0.50 |
female_age_sex = pd.DataFrame(age_sex.age_category.unique(), columns = ["age_category"])
female_age_sex = female_age_sex.merge(pd.pivot_table(index = "age_category", columns = "is_survived", aggfunc = len, fill_value = 0,
data = titanic[titanic.sex == "female"][["is_survived", "age_category"]]) \
.reset_index().rename_axis(None, axis = 1),
on = "age_category", how = "left")
female_age_sex["total"] = female_age_sex[0] + female_age_sex[1]
female_age_sex["ratio"] = np.round(female_age_sex[1] / female_age_sex.total, 2)
female_age_sex[[0, 1, "total", "ratio"]] = female_age_sex[[0, 1, "total", "ratio"]].fillna(0)
female_age_sex
| age_category | 0 | 1 | total | ratio | |
|---|---|---|---|---|---|
| 0 | (0.34, 8.378] | 7.0 | 19.0 | 26.0 | 0.73 |
| 1 | (8.378, 16.336] | 9.0 | 14.0 | 23.0 | 0.61 |
| 2 | (16.336, 24.294] | 16.0 | 52.0 | 68.0 | 0.76 |
| 3 | (24.294, 32.252] | 16.0 | 36.0 | 52.0 | 0.69 |
| 4 | (32.252, 40.21] | 5.0 | 39.0 | 44.0 | 0.89 |
| 5 | (40.21, 48.168] | 9.0 | 15.0 | 24.0 | 0.62 |
| 6 | (48.168, 56.126] | 1.0 | 15.0 | 16.0 | 0.94 |
| 7 | (56.126, 64.084] | 1.0 | 7.0 | 8.0 | 0.88 |
| 8 | (64.084, 72.042] | 0.0 | 0.0 | 0.0 | 0.00 |
| 9 | (72.042, 80.0] | 0.0 | 0.0 | 0.0 | 0.00 |
#define plot parameters
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(20, 8))
#specify background color and plot title
#fig.patch.set_facecolor('xkcd:light grey')
plt.figtext(.5,.9,"Population Pyramid ", fontsize=16, ha='center')
#define male and female bars
axes[0].barh(range(0, len(age_sex)), male_age_sex.total, align='center', color='darkblue')
axes[0].barh(range(0, len(age_sex)), male_age_sex[1], align='center', color='lightblue')
axes[0].set(title='Males')
top_bar = mpatches.Patch(color = "darkblue", label = 'Num of total passengers')
low_bar = mpatches.Patch(color = "lightblue", label = 'Num of survived passengers')
axes[0].legend(handles = [top_bar, low_bar])
axes[1].barh(range(0, len(age_sex)), female_age_sex.total, align='center', color='darkred')
axes[1].barh(range(0, len(age_sex)), female_age_sex[1], align='center', color='pink')
axes[1].set(title='Females')
top_bar = mpatches.Patch(color = "darkred", label = 'Num of total passengers')
low_bar = mpatches.Patch(color = "pink", label = 'Num of survived passengers')
axes[1].legend(handles = [top_bar, low_bar])
#adjust grid parameters and specify labels for y-axis
axes[1].grid()
axes[0].set(yticks = range(0, len(age_sex)), yticklabels = age_sex['age_category'])
axes[0].invert_xaxis()
axes[0].grid()
#display plot
plt.show()

- In every age group, the survival rate of women is far higher than that of men.
- Children under about 8 have a particularly high survival rate for both males and females.
name_title ~ age
Let’s check the relationship between name_title and age.
plt.figure(figsize = (14, 8))
sns.boxplot(data = titanic, x = "name_title", y = "age")
plt.xlabel("Name title", fontsize = 14)
plt.ylabel("Age", fontsize = 14)
plt.show()

- There is a meaningful difference in the age distribution across name titles.
titanic[titanic.name_title == "Master"][["p_class", "name_title", "sex", "age", "age_category", "num_sb_sp", "num_pr_ch", "num_cmp"]].sort_values("age")
| p_class | name_title | sex | age | age_category | num_sb_sp | num_pr_ch | num_cmp | |
|---|---|---|---|---|---|---|---|---|
| 803 | 3 | Master | male | 0.42 | (0.34, 8.378] | 0 | 1 | 1 |
| 755 | 2 | Master | male | 0.67 | (0.34, 8.378] | 1 | 1 | 2 |
| 831 | 2 | Master | male | 0.83 | (0.34, 8.378] | 1 | 1 | 2 |
| 78 | 2 | Master | male | 0.83 | (0.34, 8.378] | 0 | 2 | 2 |
| 305 | 1 | Master | male | 0.92 | (0.34, 8.378] | 1 | 2 | 3 |
| 827 | 2 | Master | male | 1.00 | (0.34, 8.378] | 0 | 2 | 2 |
| 164 | 3 | Master | male | 1.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 788 | 3 | Master | male | 1.00 | (0.34, 8.378] | 1 | 2 | 3 |
| 183 | 2 | Master | male | 1.00 | (0.34, 8.378] | 2 | 1 | 3 |
| 386 | 3 | Master | male | 1.00 | (0.34, 8.378] | 5 | 2 | 7 |
| 7 | 3 | Master | male | 2.00 | (0.34, 8.378] | 3 | 1 | 4 |
| 16 | 3 | Master | male | 2.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 824 | 3 | Master | male | 2.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 340 | 2 | Master | male | 2.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 407 | 2 | Master | male | 3.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 348 | 3 | Master | male | 3.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 261 | 3 | Master | male | 3.00 | (0.34, 8.378] | 4 | 2 | 6 |
| 193 | 2 | Master | male | 3.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 63 | 3 | Master | male | 4.00 | (0.34, 8.378] | 3 | 2 | 5 |
| 445 | 1 | Master | male | 4.00 | (0.34, 8.378] | 0 | 2 | 2 |
| 171 | 3 | Master | male | 4.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 850 | 3 | Master | male | 4.00 | (0.34, 8.378] | 4 | 2 | 6 |
| 869 | 3 | Master | male | 4.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 751 | 3 | Master | male | 6.00 | (0.34, 8.378] | 0 | 1 | 1 |
| 278 | 3 | Master | male | 7.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 50 | 3 | Master | male | 7.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 549 | 2 | Master | male | 8.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 787 | 3 | Master | male | 8.00 | (0.34, 8.378] | 4 | 1 | 5 |
| 480 | 3 | Master | male | 9.00 | (8.378, 16.336] | 5 | 2 | 7 |
| 489 | 3 | Master | male | 9.00 | (8.378, 16.336] | 1 | 1 | 2 |
| 165 | 3 | Master | male | 9.00 | (8.378, 16.336] | 0 | 2 | 2 |
| 182 | 3 | Master | male | 9.00 | (8.378, 16.336] | 4 | 2 | 6 |
| 819 | 3 | Master | male | 10.00 | (8.378, 16.336] | 3 | 2 | 5 |
| 802 | 1 | Master | male | 11.00 | (8.378, 16.336] | 1 | 2 | 3 |
| 59 | 3 | Master | male | 11.00 | (8.378, 16.336] | 5 | 2 | 7 |
| 125 | 3 | Master | male | 12.00 | (8.378, 16.336] | 1 | 0 | 1 |
| 65 | 3 | Master | male | NaN | NaN | 1 | 1 | 2 |
| 159 | 3 | Master | male | NaN | NaN | 8 | 2 | 10 |
| 176 | 3 | Master | male | NaN | NaN | 3 | 1 | 4 |
| 709 | 3 | Master | male | NaN | NaN | 1 | 1 | 2 |
- Passengers who have “Master” as their name_title are all aged 12 or younger.
-> Fill the missing ages of “Master” passengers with the mean age of passengers holding that title.
titanic.loc[(titanic.name_title == "Master") & (titanic.age.isnull()), "age"] = np.mean(titanic[titanic.name_title == "Master"].age)
titanic[(titanic.name_title == "Master") & (titanic.age.isnull())]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | ticket_fare_category | cabin_alphabet | name_title | age_category |
|---|
titanic[(titanic.age < 15) & (titanic.sex == "female")][["p_class", "name_title", "sex", "age", "age_category", "num_sb_sp", "num_pr_ch", "num_cmp"]].sort_values("age")
| p_class | name_title | sex | age | age_category | num_sb_sp | num_pr_ch | num_cmp | |
|---|---|---|---|---|---|---|---|---|
| 644 | 3 | Miss | female | 0.75 | (0.34, 8.378] | 2 | 1 | 3 |
| 469 | 3 | Miss | female | 0.75 | (0.34, 8.378] | 2 | 1 | 3 |
| 172 | 3 | Miss | female | 1.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 381 | 3 | Miss | female | 1.00 | (0.34, 8.378] | 0 | 2 | 2 |
| 479 | 3 | Miss | female | 2.00 | (0.34, 8.378] | 0 | 1 | 1 |
| 642 | 3 | Miss | female | 2.00 | (0.34, 8.378] | 3 | 2 | 5 |
| 297 | 1 | Miss | female | 2.00 | (0.34, 8.378] | 1 | 2 | 3 |
| 530 | 2 | Miss | female | 2.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 205 | 3 | Miss | female | 2.00 | (0.34, 8.378] | 0 | 1 | 1 |
| 119 | 3 | Miss | female | 2.00 | (0.34, 8.378] | 4 | 2 | 6 |
| 374 | 3 | Miss | female | 3.00 | (0.34, 8.378] | 3 | 1 | 4 |
| 43 | 2 | Miss | female | 3.00 | (0.34, 8.378] | 1 | 2 | 3 |
| 184 | 3 | Miss | female | 4.00 | (0.34, 8.378] | 0 | 2 | 2 |
| 750 | 2 | Miss | female | 4.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 691 | 3 | Miss | female | 4.00 | (0.34, 8.378] | 0 | 1 | 1 |
| 10 | 3 | Miss | female | 4.00 | (0.34, 8.378] | 1 | 1 | 2 |
| 618 | 2 | Miss | female | 4.00 | (0.34, 8.378] | 2 | 1 | 3 |
| 233 | 3 | Miss | female | 5.00 | (0.34, 8.378] | 4 | 2 | 6 |
| 58 | 2 | Miss | female | 5.00 | (0.34, 8.378] | 1 | 2 | 3 |
| 777 | 3 | Miss | female | 5.00 | (0.34, 8.378] | 0 | 0 | 1 |
| 448 | 3 | Miss | female | 5.00 | (0.34, 8.378] | 2 | 1 | 3 |
| 720 | 2 | Miss | female | 6.00 | (0.34, 8.378] | 0 | 1 | 2 |
| 813 | 3 | Miss | female | 6.00 | (0.34, 8.378] | 4 | 2 | 6 |
| 535 | 2 | Miss | female | 7.00 | (0.34, 8.378] | 0 | 2 | 2 |
| 237 | 2 | Miss | female | 8.00 | (0.34, 8.378] | 0 | 2 | 2 |
| 24 | 3 | Miss | female | 8.00 | (0.34, 8.378] | 3 | 1 | 4 |
| 541 | 3 | Miss | female | 9.00 | (8.378, 16.336] | 4 | 2 | 6 |
| 147 | 3 | Miss | female | 9.00 | (8.378, 16.336] | 2 | 2 | 4 |
| 634 | 3 | Miss | female | 9.00 | (8.378, 16.336] | 3 | 2 | 5 |
| 852 | 3 | Miss | female | 9.00 | (8.378, 16.336] | 1 | 1 | 2 |
| 419 | 3 | Miss | female | 10.00 | (8.378, 16.336] | 0 | 2 | 2 |
| 542 | 3 | Miss | female | 11.00 | (8.378, 16.336] | 4 | 2 | 6 |
| 780 | 3 | Miss | female | 13.00 | (8.378, 16.336] | 0 | 0 | 0 |
| 446 | 2 | Miss | female | 13.00 | (8.378, 16.336] | 0 | 1 | 1 |
| 9 | 2 | Mrs | female | 14.00 | (8.378, 16.336] | 1 | 0 | 1 |
| 39 | 3 | Miss | female | 14.00 | (8.378, 16.336] | 1 | 0 | 1 |
| 14 | 3 | Miss | female | 14.00 | (8.378, 16.336] | 0 | 0 | 0 |
| 435 | 1 | Miss | female | 14.00 | (8.378, 16.336] | 1 | 2 | 3 |
| 111 | 3 | Miss | female | 14.50 | (8.378, 16.336] | 1 | 0 | 1 |
- Almost every female passenger under 15 has the name title “Miss”, but the age range of “Miss” passengers is too broad for a title-based fill.
-> Let’s use linear regression to impute the remaining missing ages.
First, we have to convert our custom categorical variable (ticket_fare_category) to a numerical one.
# Make an ordinal variable from the ticket_fare_category column
titanic.loc[titanic['ticket_fare'] <= 8.676, 'ticket_fare_category_order'] = 1
titanic.loc[(titanic['ticket_fare'] > 8.676) & (titanic['ticket_fare'] <= 26.25), 'ticket_fare_category_order'] = 2
titanic.loc[titanic['ticket_fare'] > 26.25, 'ticket_fare_category_order'] = 3
titanic.ticket_fare_category.value_counts()
(8.676, 26.25] 300
(4.010999999999999, 8.676] 297
(26.25, 512.329] 294
Name: ticket_fare_category, dtype: int64
titanic.ticket_fare_category_order.value_counts()
2.0 300
1.0 297
3.0 294
Name: ticket_fare_category_order, dtype: int64
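(Since ticket_fare_category is an ordered categorical from the earlier binning, an equivalent sketch derives the same 1/2/3 encoding directly from its category codes:)
# Sketch: the same 1/2/3 encoding via the ordered categorical's codes
# (codes start at 0; missing values would code as -1 and need handling).
titanic.ticket_fare_category.cat.codes + 1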
X = titanic[titanic.age.isnull() == False][["p_class", "num_sb_sp", "num_pr_ch", "ticket_fare", "num_cmp", "ticket_fare_category_order", "sex", "embark_port", "name_title"]]
y = titanic[titanic.age.isnull() == False].age
num_features = ["p_class", "num_sb_sp", "num_pr_ch", "ticket_fare", "num_cmp", "ticket_fare_category_order"]
nonnum_features = ["sex", "embark_port", "name_title"]
full_pipeline = ColumnTransformer([
("num", StandardScaler(), num_features),
("nonnum", OneHotEncoder(), nonnum_features),
])
X_prep = full_pipeline.fit_transform(X)
X_prep
array([[ 0.9065961 , 0.48553535, -0.51099538, ..., 1. ,
0. , 0. ],
[-1.48215986, 0.48553535, -0.51099538, ..., 0. ,
1. , 0. ],
[ 0.9065961 , -0.54282565, -0.51099538, ..., 0. ,
0. , 0. ],
...,
[-1.48215986, -0.54282565, -0.51099538, ..., 0. ,
0. , 0. ],
[-1.48215986, -0.54282565, -0.51099538, ..., 1. ,
0. , 0. ],
[ 0.9065961 , -0.54282565, -0.51099538, ..., 1. ,
0. , 0. ]])
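The 16 transformed columns are the 6 scaled numeric features plus the one-hot categories (2 for sex, 3 for embark_port, 5 for name_title). To inspect their names, something like the following should work, assuming scikit-learn 1.0+ (where ColumnTransformer exposes get_feature_names_out):
# Sketch: list the names of the transformed feature columns
full_pipeline.get_feature_names_out()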
X_train, X_test, y_train, y_test = train_test_split(X_prep, y, test_size = 0.1, random_state = 42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((646, 16), (72, 16), (646,), (72,))
lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression()
result = cross_validate(lm, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
-np.mean(result["test_score"]), np.std(result["test_score"])
(11.290133827852916, 0.5883957467368258)
- The cross-validated RMSE is about 11.3 years. Since age ranges from 0 to 80 and RMSE measures the average distance between observed and predicted values, an RMSE of roughly 11 suggests the model is reasonably accurate.
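To put that number in context, a quick sanity check (a sketch, not run here) is to compare it against a constant baseline that always predicts the mean training age; the baseline’s cross-validated RMSE should be noticeably higher than 11.
from sklearn.dummy import DummyRegressor

# Baseline: always predict the mean age seen in training.
baseline = DummyRegressor(strategy = "mean")
baseline_cv = cross_validate(baseline, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
-np.mean(baseline_cv["test_score"])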
-> Next, let’s compare this against simply predicting age with the median age of each name_title.
test_prediction = lm.predict(X_test)
test_mse = mean_squared_error(y_test, test_prediction)
test_rmse = np.sqrt(test_mse)
test_rmse
11.065128385616424
y_test.reset_index()
| index | age | |
|---|---|---|
| 0 | 148 | 36.5 |
| 1 | 406 | 51.0 |
| 2 | 53 | 29.0 |
| 3 | 796 | 49.0 |
| 4 | 646 | 19.0 |
| ... | ... | ... |
| 67 | 352 | 15.0 |
| 68 | 743 | 24.0 |
| 69 | 829 | 62.0 |
| 70 | 536 | 45.0 |
| 71 | 827 | 1.0 |
72 rows × 2 columns
pd.DataFrame(test_prediction.reshape(-1))
| 0 | |
|---|---|
| 0 | 33.711484 |
| 1 | 28.799848 |
| 2 | 35.226230 |
| 3 | 42.334043 |
| 4 | 28.799088 |
| ... | ... |
| 67 | 25.547320 |
| 68 | 27.214570 |
| 69 | 42.273286 |
| 70 | 49.972924 |
| 71 | 10.913917 |
72 rows × 1 columns
predict_result = pd.concat([y_test.reset_index(), pd.DataFrame(test_prediction.reshape(-1))], axis = 1) \
.rename({0 : "predict_by_lm"}, axis = 1)
predict_result["name_title"] = titanic.loc[predict_result["index"]].name_title.values
predict_result = predict_result.merge(titanic.groupby("name_title").age.median().reset_index().rename(columns = {"age" : "predict_by_median_name_title"}),
on = "name_title", how = "left")
predict_result
| index | age | predict_by_lm | name_title | predict_by_median_name_title | |
|---|---|---|---|---|---|
| 0 | 148 | 36.5 | 33.711484 | Mr | 30.0 |
| 1 | 406 | 51.0 | 28.799848 | Mr | 30.0 |
| 2 | 53 | 29.0 | 35.226230 | Mrs | 35.0 |
| 3 | 796 | 49.0 | 42.334043 | uncommon | 48.5 |
| 4 | 646 | 19.0 | 28.799088 | Mr | 30.0 |
| ... | ... | ... | ... | ... | ... |
| 67 | 352 | 15.0 | 25.547320 | Mr | 30.0 |
| 68 | 743 | 24.0 | 27.214570 | Mr | 30.0 |
| 69 | 829 | 62.0 | 42.273286 | Mrs | 35.0 |
| 70 | 536 | 45.0 | 49.972924 | uncommon | 48.5 |
| 71 | 827 | 1.0 | 10.913917 | Master | 4.0 |
72 rows × 5 columns
fig, ax = plt.subplots(1, 2, figsize = (20, 8))
top_bar = mpatches.Patch(color = "darkblue", label = 'Age actual values')
middle_bar = mpatches.Patch(color = "red", label = 'Age predicted values by linear model')
low_bar = mpatches.Patch(color = "green", label = 'Age predicted values by median values of each name title')
sns.lineplot(y = predict_result.age, x = predict_result.index, ax = ax[0], color = "darkblue", alpha = 0.8)
sns.lineplot(y = predict_result.predict_by_lm, x = predict_result.index, ax = ax[0], color = "red", alpha = 0.8)
sns.lineplot(y = predict_result.predict_by_median_name_title, x = predict_result.index, ax = ax[0], color = "green", alpha = 0.8)
ax[0].set_xlabel("Index", fontsize = 14)
ax[0].set_ylabel("Age", fontsize = 14)
ax[0].set_title("Actual values vs. Predicted values (Not sorted)", fontsize = 18)
ax[0].legend(handles=[top_bar, middle_bar, low_bar])
sns.lineplot(y = predict_result.sort_values("age").age, x = predict_result.index, ax = ax[1], color = "darkblue", alpha = 0.8)
sns.lineplot(y = predict_result.sort_values("age").predict_by_lm, x = predict_result.index, ax = ax[1], color = "red", alpha = 0.8)
sns.lineplot(y = predict_result.sort_values("age").predict_by_median_name_title, x = predict_result.index, ax = ax[1], color = "green", alpha = 0.8)
ax[1].set_xlabel("Index", fontsize = 14)
ax[1].set_ylabel("Age", fontsize = 14)
ax[1].set_title("Actual values vs. Predicted values (Sorted)", fontsize = 18)
ax[1].legend(handles=[top_bar, middle_bar, low_bar])
plt.show()

- Predicting with the median age of each title is somewhat accurate, but the linear model’s predictions are clearly closer to the actual values.
In the right-hand graph, where ages are sorted, the linear model is noticeably better for children and the elderly, the age groups that most strongly influence survival rate.
-> Let’s impute missing values in age with the linear model
X_missing_age = titanic[titanic.age.isnull()][["p_class", "num_sb_sp", "num_pr_ch", "ticket_fare", "num_cmp", "ticket_fare_category_order", "sex", "embark_port", "name_title"]]
X_missing_age
| p_class | num_sb_sp | num_pr_ch | ticket_fare | num_cmp | ticket_fare_category_order | sex | embark_port | name_title | |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 3 | 0 | 0 | 8.4583 | 0 | 1.0 | male | Q | Mr |
| 17 | 2 | 0 | 0 | 13.0000 | 0 | 2.0 | male | S | Mr |
| 19 | 3 | 0 | 0 | 7.2250 | 0 | 1.0 | female | C | Mrs |
| 26 | 3 | 0 | 0 | 7.2250 | 0 | 1.0 | male | C | Mr |
| 28 | 3 | 0 | 0 | 7.8792 | 0 | 1.0 | female | Q | Miss |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 859 | 3 | 0 | 0 | 7.2292 | 0 | 1.0 | male | C | Mr |
| 863 | 3 | 8 | 2 | 69.5500 | 10 | 3.0 | female | S | Miss |
| 868 | 3 | 0 | 0 | 9.5000 | 0 | 2.0 | male | S | Mr |
| 878 | 3 | 0 | 0 | 7.8958 | 0 | 1.0 | male | S | Mr |
| 888 | 3 | 1 | 2 | 23.4500 | 3 | 2.0 | female | S | Miss |
173 rows × 9 columns
X_missing_age_prep = full_pipeline.transform(X_missing_age)
X_missing_age_prep
array([[ 0.9065961 , -0.54282565, -0.51099538, ..., 1. ,
0. , 0. ],
[-0.28778188, -0.54282565, -0.51099538, ..., 1. ,
0. , 0. ],
[ 0.9065961 , -0.54282565, -0.51099538, ..., 0. ,
1. , 0. ],
...,
[ 0.9065961 , -0.54282565, -0.51099538, ..., 1. ,
0. , 0. ],
[ 0.9065961 , -0.54282565, -0.51099538, ..., 1. ,
0. , 0. ],
[ 0.9065961 , 0.48553535, 1.83337959, ..., 0. ,
0. , 0. ]])
missing_age_prediction = lm.predict(X_missing_age_prep)
titanic.loc[titanic.age.isnull(), "age"] = missing_age_prediction
titanic.shape
(891, 19)
titanic[titanic.age.isnull()]
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | ticket_fare_category | cabin_alphabet | name_title | age_category | ticket_fare_category_order |
|---|
2. Data preparation
In the EDA we did some feature engineering, so let’s apply the same process to the test data set.
titanic.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | ticket_fare_category | cabin_alphabet | name_title | age_category | ticket_fare_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | A/5 | 21171 | 1 | (4.010999999999999, 8.676] | n | Mr | (16.336, 24.294] | 1.0 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC | 17599 | 1 | (26.25, 512.329] | C | Mrs | (32.252, 40.21] | 3.0 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | STON/O2. | 3101282 | 0 | (4.010999999999999, 8.676] | n | Miss | (24.294, 32.252] | 1.0 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | non | 113803 | 1 | (26.25, 512.329] | C | Mrs | (32.252, 40.21] | 3.0 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | non | 373450 | 0 | (4.010999999999999, 8.676] | n | Mr | (32.252, 40.21] | 1.0 |
- Dependent variable: is_survived
- Independent variables:
  - p_class
  - name
    - name itself: do not use
    - name_title: use
  - sex: use
  - age
    - Fill missing values for name_title “Master” with the mean age of passengers with that title
    - Then impute the remaining missing ages with the linear model
    - age_category: change to the ordinal variable age_category_order and use it instead of age_category
  - num_sb_sp
  - num_pr_ch
  - ticket_number
    - ticket_number itself: do not use
    - make num_cmp from it
  - ticket_fare
    - ticket_fare itself: use
    - ticket_fare_category: change to the ordinal variable ticket_fare_category_order and use it instead of ticket_fare_category
  - cabin_number: do not use
  - embark_port: impute with mode value
test = pd.read_csv("./data/titanic_test.csv")
# check missing values in test data set
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Save passenger id separately
test_passenger_id = test["PassengerId"]
test.drop("PassengerId", axis = 1, inplace = True)
test.head()
| Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
# Set the column names to match the train data.
test.rename(columns = {"Survived" : "is_survived",
"Pclass" : "p_class",
"Name" : "name",
"Sex" : "sex",
"Age" : "age",
"SibSp" : "num_sb_sp",
"Parch" : "num_pr_ch",
"Ticket" : "ticket_number",
"Fare" : "ticket_fare",
"Cabin" : "cabin_number",
"Embarked" : "embark_port"}, inplace = True)
test.head()
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
# Extract name_title
test["name_title"] = test.name.str.extract(' ([A-Za-z]+)\.', expand=False)
test['name_title'] = test['name_title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', \
'Rev', 'Sir', 'Jonkheer', 'Dona'], 'uncommon')
test['name_title'] = test['name_title'].replace('Mlle', 'Miss')
test['name_title'] = test['name_title'].replace('Ms', 'Miss')
test['name_title'] = test['name_title'].replace('Mme', 'Mrs')
test.name_title.value_counts()
Mr 240
Miss 79
Mrs 72
Master 21
uncommon 6
Name: name_title, dtype: int64
# Make num_cmp (number of companions) from the ticket number
test = test.merge(test["ticket_number"].value_counts().reset_index().rename(columns = {"index" : "ticket_number", "ticket_number" : "num_cmp_by_ticket"}), \
on = "ticket_number", how = "left")
test["num_cmp_by_ticket"] = test["num_cmp_by_ticket"] - 1
test["num_cmp_by_sb_sp_pr_ch"] = test["num_sb_sp"] + test["num_pr_ch"]
test["num_cmp"] = test[["num_cmp_by_ticket", "num_cmp_by_sb_sp_pr_ch"]].max(axis = 1)
test.drop(["num_cmp_by_ticket", "num_cmp_by_sb_sp_pr_ch"], axis = 1, inplace = True)
test.head()
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | Mr | 0 |
| 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | Mrs | 1 |
| 2 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | Mr | 0 |
| 3 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | Mr | 0 |
| 4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | Mrs | 2 |
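(Since this mirrors the num_cmp feature engineering done on the training frame during the EDA, a small helper, sketched here, would avoid duplicating the code:)
# Sketch: reusable num_cmp helper
def add_num_cmp(df):
    # Number of companions: the larger of (ticket-group size - 1)
    # and (siblings/spouses + parents/children).
    by_ticket = df.groupby("ticket_number")["ticket_number"].transform("count") - 1
    by_family = df["num_sb_sp"] + df["num_pr_ch"]
    df["num_cmp"] = np.maximum(by_ticket, by_family)
    return df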
# Impute 0 ticket fare with median value of each p_class
test.loc[(test.ticket_fare == 0) & (test.p_class == 1), "ticket_fare"] = p_class_fare_median[p_class_fare_median.p_class == 1].ticket_fare_median.values[0]
test.loc[(test.ticket_fare == 0) & (test.p_class == 2), "ticket_fare"] = p_class_fare_median[p_class_fare_median.p_class == 2].ticket_fare_median.values[0]
test.loc[(test.ticket_fare == 0) & (test.p_class == 3), "ticket_fare"] = p_class_fare_median[p_class_fare_median.p_class == 3].ticket_fare_median.values[0]
test[test.ticket_fare == 0]
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp |
|---|
test[test.ticket_fare.isnull()]
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | NaN | NaN | S | Mr | 0 |
test["ticket_fare"] = test.ticket_fare.fillna(p_class_fare_median[p_class_fare_median.p_class == 3].ticket_fare_median.values[0])
test[test.ticket_fare.isnull()]
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp |
|---|
# Make ticket fare category order
titanic.ticket_fare_category.unique()
[(4.010999999999999, 8.676], (26.25, 512.329], (8.676, 26.25]]
Categories (3, interval[float64, right]): [(4.010999999999999, 8.676] < (8.676, 26.25] < (26.25, 512.329]]
test.loc[test['ticket_fare'] <= 8.676, 'ticket_fare_category_order'] = 1
test.loc[(test['ticket_fare'] > 8.676) & (test['ticket_fare'] <= 26.25), 'ticket_fare_category_order'] = 2
test.loc[test['ticket_fare'] > 26.25, 'ticket_fare_category_order'] = 3
test.ticket_fare_category_order.value_counts()
1.0 145
2.0 140
3.0 133
Name: ticket_fare_category_order, dtype: int64
np.sum(test.ticket_fare_category_order.value_counts())
418
np.sum(test.ticket_fare_category_order.isnull())
0
# Check embark_port for missing values (we would impute with the mode, S, if any were missing)
test[test.embark_port.isnull()]
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp | ticket_fare_category_order |
|---|
titanic.embark_port.unique()
array(['S', 'C', 'Q'], dtype=object)
# Fill missing ages for name_title "Master" with the mean "Master" age from the training set
test.loc[(test.name_title == "Master") & (test.age.isnull()), "age"] = np.mean(titanic[titanic.name_title == "Master"].age)
test[(test.name_title == "Master") & (test.age.isnull())]
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp | ticket_fare_category_order |
|---|
# Impute missing values in age with trained linear model
X_missing_age = test[test.age.isnull()][["p_class", "num_sb_sp", "num_pr_ch", "ticket_fare", "num_cmp", "ticket_fare_category_order", "sex", "embark_port", "name_title"]]
X_missing_age_prep = full_pipeline.transform(X_missing_age)
missing_age_prediction = lm.predict(X_missing_age_prep)
test.loc[test.age.isnull(), "age"] = missing_age_prediction
np.sum(test.age.isnull())
0
test.shape
(418, 13)
# Make age category order
titanic.age_category.value_counts()
(16.336, 24.294] 177
(24.294, 32.252] 169
(32.252, 40.21] 118
(40.21, 48.168] 70
(0.34, 8.378] 54
(8.378, 16.336] 46
(48.168, 56.126] 45
(56.126, 64.084] 24
(64.084, 72.042] 9
(72.042, 80.0] 2
Name: age_category, dtype: int64
test.loc[test['age'] <= 8.378, 'age_category_order'] = 1
test.loc[(test['age'] > 8.378) & (test['age'] <= 16.336), 'age_category_order'] = 2
test.loc[(test['age'] > 16.336) & (test['age'] <= 24.294), 'age_category_order'] = 3
test.loc[(test['age'] > 24.294) & (test['age'] <= 32.252), 'age_category_order'] = 4
test.loc[(test['age'] > 32.252) & (test['age'] <= 40.21), 'age_category_order'] = 5
test.loc[(test['age'] > 40.21) & (test['age'] <= 48.168), 'age_category_order'] = 6
test.loc[(test['age'] > 48.168) & (test['age'] <= 56.126), 'age_category_order'] = 7
test.loc[(test['age'] > 56.126) & (test['age'] <= 64.084), 'age_category_order'] = 8
test.loc[(test['age'] > 64.084) & (test['age'] <= 72.042), 'age_category_order'] = 9
test.loc[test['age'] > 72.042, 'age_category_order'] = 10
titanic.loc[titanic['age'] <= 8.378, 'age_category_order'] = 1
titanic.loc[(titanic['age'] > 8.378) & (titanic['age'] <= 16.336), 'age_category_order'] = 2
titanic.loc[(titanic['age'] > 16.336) & (titanic['age'] <= 24.294), 'age_category_order'] = 3
titanic.loc[(titanic['age'] > 24.294) & (titanic['age'] <= 32.252), 'age_category_order'] = 4
titanic.loc[(titanic['age'] > 32.252) & (titanic['age'] <= 40.21), 'age_category_order'] = 5
titanic.loc[(titanic['age'] > 40.21) & (titanic['age'] <= 48.168), 'age_category_order'] = 6
titanic.loc[(titanic['age'] > 48.168) & (titanic['age'] <= 56.126), 'age_category_order'] = 7
titanic.loc[(titanic['age'] > 56.126) & (titanic['age'] <= 64.084), 'age_category_order'] = 8
titanic.loc[(titanic['age'] > 64.084) & (titanic['age'] <= 72.042), 'age_category_order'] = 9
titanic.loc[titanic['age'] > 72.042, 'age_category_order'] = 10
test.age_category_order.value_counts()
4.0 122
3.0 112
5.0 59
6.0 47
1.0 23
7.0 20
8.0 17
2.0 16
9.0 1
10.0 1
Name: age_category_order, dtype: int64
np.sum(test.age_category_order.value_counts())
418
np.sum(test.age_category_order.isnull())
0
np.sum(titanic.age_category_order.value_counts())
891
np.sum(titanic.age_category_order.isnull())
0
# Drop unused columns
test.head()
| p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | name_title | num_cmp | ticket_fare_category_order | age_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | Mr | 0 | 1.0 | 5.0 |
| 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | Mrs | 1 | 1.0 | 6.0 |
| 2 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | Mr | 0 | 2.0 | 8.0 |
| 3 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | Mr | 0 | 1.0 | 4.0 |
| 4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | Mrs | 2 | 2.0 | 3.0 |
test.drop(["name", "ticket_number", "cabin_number"], axis = 1, inplace = True)
test.head()
| p_class | sex | age | num_sb_sp | num_pr_ch | ticket_fare | embark_port | name_title | num_cmp | ticket_fare_category_order | age_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q | Mr | 0 | 1.0 | 5.0 |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S | Mrs | 1 | 1.0 | 6.0 |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q | Mr | 0 | 2.0 | 8.0 |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S | Mr | 0 | 1.0 | 4.0 |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S | Mrs | 2 | 2.0 | 3.0 |
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 p_class 418 non-null int64
1 sex 418 non-null object
2 age 418 non-null float64
3 num_sb_sp 418 non-null int64
4 num_pr_ch 418 non-null int64
5 ticket_fare 418 non-null float64
6 embark_port 418 non-null object
7 name_title 418 non-null object
8 num_cmp 418 non-null int64
9 ticket_fare_category_order 418 non-null float64
10 age_category_order 418 non-null float64
dtypes: float64(4), int64(4), object(3)
memory usage: 39.2+ KB
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 is_survived 891 non-null int64
1 p_class 891 non-null int64
2 name 891 non-null object
3 sex 891 non-null object
4 age 891 non-null float64
5 num_sb_sp 891 non-null int64
6 num_pr_ch 891 non-null int64
7 ticket_number 891 non-null object
8 ticket_fare 891 non-null float64
9 cabin_number 204 non-null object
10 embark_port 891 non-null object
11 ticket_number_alphabet 891 non-null object
12 ticket_number_number 891 non-null object
13 num_cmp 891 non-null int64
14 ticket_fare_category 891 non-null category
15 cabin_alphabet 891 non-null object
16 name_title 891 non-null object
17 age_category 714 non-null category
18 ticket_fare_category_order 891 non-null float64
19 age_category_order 891 non-null float64
dtypes: category(2), float64(4), int64(5), object(9)
memory usage: 166.9+ KB
3. Modeling
3.1. Baseline model
Let’s make a baseline model with the original columns.
titanic_original = pd.read_csv("./data/titanic_train.csv")
titanic_original.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Drop PassengerId, which will not be used in modeling
titanic_original.drop("PassengerId", axis = 1, inplace = True)
# Rename columns for convenience
titanic_original.rename(columns = {"Survived" : "is_survived",
"Pclass" : "p_class",
"Name" : "name",
"Sex" : "sex",
"Age" : "age",
"SibSp" : "num_sb_sp",
"Parch" : "num_pr_ch",
"Ticket" : "ticket_number",
"Fare" : "ticket_fare",
"Cabin" : "cabin_number",
"Embarked" : "embark_port"}, inplace = True)
titanic_original.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Drop name, ticket_number, cabin_number columns that will not be used in modeling
titanic_original.drop(["name", "ticket_number", "cabin_number"], axis = 1, inplace = True)
titanic_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 is_survived 891 non-null int64
1 p_class 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 num_sb_sp 891 non-null int64
5 num_pr_ch 891 non-null int64
6 ticket_fare 891 non-null float64
7 embark_port 889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
# Fill missing ages simply with the overall mean age
titanic_original["age"] = titanic_original.age.fillna(np.mean(titanic_original.age))
np.sum(titanic_original.age.isnull())
0
# Fill missing embark_port values simply with the mode
titanic_original["embark_port"] = titanic_original.embark_port.fillna(statistics.mode(titanic_original.embark_port))
np.sum(titanic_original.embark_port.isnull())
0
X = titanic_original.drop("is_survived", axis = 1)
y = titanic_original["is_survived"]
num_features = ["p_class", "age", "num_sb_sp", "num_pr_ch", "ticket_fare"]
nonnum_features = ["sex", "embark_port"]
full_pipeline = ColumnTransformer([
("num", StandardScaler(), num_features),
("nonnum", OneHotEncoder(), nonnum_features),
])
X_original_prep = full_pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_original_prep, y, test_size = 0.1, random_state = 42)
names = ["Nearest Neighbors",
"Linear SVM",
"RBF SVM",
"Gaussian Process",
"Decision Tree",
"Random Forest",
"Neural Net",
"AdaBoost",
"Naive Bayes"
]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1, max_iter=1000),
AdaBoostClassifier(),
GaussianNB(),
]
result_accuracy = pd.DataFrame(names, columns = ["model_name"])
for name, clf in zip(names, classifiers):
    # ColumnTransformer output can be a sparse matrix, and a few of these
    # classifiers (e.g. GaussianNB, GaussianProcessClassifier) only accept
    # dense arrays, so fall back to .toarray() when the sparse input fails.
    try:
        clf.fit(X_train, y_train)
    except:
        clf.fit(X_train.toarray(), y_train)
    try:
        y_pred = clf.predict(X_test)
    except:
        y_pred = clf.predict(X_test.toarray())
    # evaluate predictions on the held-out split
    try:
        accuracy = clf.score(X_test, y_test)
    except:
        accuracy = clf.score(X_test.toarray(), y_test)
    result_accuracy.loc[result_accuracy.model_name == name, "base_accuracy"] = round(accuracy * 100, 3)
    #print(f"{name} Accuracy: {round(accuracy * 100, 3)}")
result_accuracy
| model_name | base_accuracy | |
|---|---|---|
| 0 | Nearest Neighbors | 83.333 |
| 1 | Linear SVM | 81.111 |
| 2 | RBF SVM | 81.111 |
| 3 | Gaussian Process | 83.333 |
| 4 | Decision Tree | 76.667 |
| 5 | Random Forest | 82.222 |
| 6 | Neural Net | 76.667 |
| 7 | AdaBoost | 82.222 |
| 8 | Naive Bayes | 78.889 |
3.2. Dimension reduction + Classification
PCA
Let’s run PCA and use the principal component (PC) columns instead of the original columns.
X_original_prep.shape
(891, 10)
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components = 10, random_state = 42)),
])
X_pca = pipe.fit_transform(X_original_prep)
plt.figure(figsize = (14, 8))
plt.bar(range(0,len(pipe.named_steps.pca.explained_variance_ratio_)), pipe.named_steps.pca.explained_variance_ratio_, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(np.cumsum(pipe.named_steps.pca.explained_variance_ratio_))), np.cumsum(pipe.named_steps.pca.explained_variance_ratio_), where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio', fontsize = 14)
plt.xlabel('Principal component index', fontsize = 14)
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

- The first 4 principal components explain almost 80% of the total variance.
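(A quick programmatic check of that claim, sketched below, finds the smallest number of components whose cumulative explained variance reaches 80%:)
# Sketch: first k with cumulative explained variance >= 0.8
evr = pipe.named_steps.pca.explained_variance_ratio_
int(np.argmax(np.cumsum(evr) >= 0.8) + 1)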
X_pca[:, :2].shape
(891, 2)
X_pca_scatter = pd.concat([pd.DataFrame(X_pca[:, :2], columns = ["PC1", "PC2"]), pd.DataFrame(titanic_original.is_survived, columns = ["is_survived"])], axis = 1)
X_pca_scatter.head()
| PC1 | PC2 | is_survived | |
|---|---|---|---|
| 0 | -1.549197 | -0.628395 | 0 |
| 1 | 3.156084 | 1.732145 | 1 |
| 2 | 0.455716 | -1.295524 | 1 |
| 3 | 1.562384 | -0.519725 | 1 |
| 4 | -1.695137 | 0.025692 | 0 |
plt.figure(figsize = (14, 8))
sns.scatterplot(data = X_pca_scatter, x = "PC1", y = "PC2", hue = "is_survived")
plt.xlabel("PC1", fontsize = 14)
plt.ylabel("PC2", fontsize = 14)
plt.show()

- The two classes separate reasonably well in the PC1/PC2 plane.
-> Let’s run the classifiers on PC1 and PC2 only.
X_pca[:, :2].shape
(891, 2)
X_pca_prep = X_pca[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X_pca_prep, y, test_size = 0.1, random_state = 42)
names = ["Nearest Neighbors",
"Linear SVM",
"RBF SVM",
"Gaussian Process",
"Decision Tree",
"Random Forest",
"Neural Net",
"AdaBoost",
"Naive Bayes"
]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1, max_iter=1000),
AdaBoostClassifier(),
GaussianNB(),
]
for name, clf in zip(names, classifiers):
try:
clf.fit(X_train, y_train)
except:
clf.fit(X_train.toarray(), y_train)
try:
y_pred = clf.predict(X_test)
except:
y_pred = clf.predict(X_test.toarray())
# evaluate predictions
try:
accuracy = clf.score(X_test, y_test)
except:
accuracy = clf.score(X_test.toarray(), y_test)
result_accuracy.loc[result_accuracy.model_name == name, "pca_accuracy"] = round(accuracy * 100, 3)
#print(f"{name} Accuracy: {round(accuracy * 100, 3)}")
result_accuracy
| model_name | base_accuracy | pca_accuracy | |
|---|---|---|---|
| 0 | Nearest Neighbors | 83.333 | 81.111 |
| 1 | Linear SVM | 81.111 | 80.000 |
| 2 | RBF SVM | 81.111 | 82.222 |
| 3 | Gaussian Process | 83.333 | 80.000 |
| 4 | Decision Tree | 76.667 | 82.222 |
| 5 | Random Forest | 82.222 | 81.111 |
| 6 | Neural Net | 76.667 | 78.889 |
| 7 | AdaBoost | 82.222 | 78.889 |
| 8 | Naive Bayes | 78.889 | 81.111 |
- Using only PC1 and PC2, we obtain accuracy comparable to the baseline models.
- For some models (RBF SVM, Decision Tree, Neural Net, Naive Bayes), the PCA variables even achieve higher accuracy than the baseline.
t-SNE
Let’s run t-SNE and use its embedding variables instead of the original columns.
X_original_prep.shape
(891, 10)
from sklearn.manifold import TSNE

fig, ax = plt.subplots(3, 2, figsize = (15, 20))
for i, perplexity in enumerate([1, 5, 10, 15, 25, 35]):
tsne = TSNE(n_components = 2, random_state = 42, perplexity = perplexity)
X_2d = tsne.fit_transform(X_original_prep)
tsne_labelled = pd.concat([pd.DataFrame(X_2d, columns = ["d1", "d2"]), titanic_original[["is_survived"]].astype(str)], axis = 1)
sns.scatterplot(data = tsne_labelled, x = "d1", y = "d2", hue = "is_survived", ax = ax[i // 2, i % 2])
ax[i // 2, i % 2].set_title(f"Perplexity = {perplexity}", fontsize = 14)

- From perplexity 15 onward, some distinct clusters appear, and some of them have a higher survival rate.
-> Let’s try higher perplexities.
fig, ax = plt.subplots(3, 2, figsize = (15, 20))
for i, perplexity in enumerate([15, 25, 30, 35, 40, 45]):
tsne = TSNE(n_components = 2, random_state = 42, perplexity = perplexity)
X_2d = tsne.fit_transform(X_original_prep)
tsne_labelled = pd.concat([pd.DataFrame(X_2d, columns = ["d1", "d2"]), titanic_original[["is_survived"]].astype(str)], axis = 1)
sns.scatterplot(data = tsne_labelled, x = "d1", y = "d2", hue = "is_survived", ax = ax[i // 2, i % 2])
ax[i // 2, i % 2].set_title(f"Perplexity = {perplexity}", fontsize = 14)

- From perplexity 15 onward, the results are quite similar to each other.
- At perplexity 25, the separation between clusters is clearest.
-> Let’s use perplexity = 25.
tsne = TSNE(n_components = 2, random_state = 42, perplexity = 25)
X_tsne_prep = tsne.fit_transform(X_original_prep)
X_train, X_test, y_train, y_test = train_test_split(X_tsne_prep, y, test_size = 0.1, random_state = 42)
names = ["Nearest Neighbors",
"Linear SVM",
"RBF SVM",
"Gaussian Process",
"Decision Tree",
"Random Forest",
"Neural Net",
"AdaBoost",
"Naive Bayes"
]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1, max_iter=1000),
AdaBoostClassifier(),
GaussianNB(),
]
for name, clf in zip(names, classifiers):
try:
clf.fit(X_train, y_train)
except:
clf.fit(X_train.toarray(), y_train)
try:
y_pred = clf.predict(X_test)
except:
y_pred = clf.predict(X_test.toarray())
# evaluate predictions
try:
accuracy = clf.score(X_test, y_test)
except:
accuracy = clf.score(X_test.toarray(), y_test)
result_accuracy.loc[result_accuracy.model_name == name, "tsne_accuracy"] = round(accuracy * 100, 3)
#print(f"{name} Accuracy: {round(accuracy * 100, 3)}")
result_accuracy
| model_name | base_accuracy | pca_accuracy | tsne_accuracy | |
|---|---|---|---|---|
| 0 | Nearest Neighbors | 83.333 | 81.111 | 81.111 |
| 1 | Linear SVM | 81.111 | 80.000 | 73.333 |
| 2 | RBF SVM | 81.111 | 82.222 | 81.111 |
| 3 | Gaussian Process | 83.333 | 80.000 | 82.222 |
| 4 | Decision Tree | 76.667 | 82.222 | 83.333 |
| 5 | Random Forest | 82.222 | 81.111 | 83.333 |
| 6 | Neural Net | 76.667 | 78.889 | 76.667 |
| 7 | AdaBoost | 82.222 | 78.889 | 76.667 |
| 8 | Naive Bayes | 78.889 | 81.111 | 73.333 |
- Using only the 2 t-SNE variables, we obtain accuracy comparable to the baseline models.
- For Decision Tree and Random Forest, the t-SNE variables achieve higher accuracy than either the baseline or the PCA variables. (Note that scikit-learn’s TSNE has no transform method, so this embedding cannot be produced for new data such as the Kaggle test set.)
3.3. EDA variables
Now, let’s use the variables from the EDA and compare their accuracy with the methods used before.
titanic.head()
| is_survived | p_class | name | sex | age | num_sb_sp | num_pr_ch | ticket_number | ticket_fare | cabin_number | embark_port | ticket_number_alphabet | ticket_number_number | num_cmp | ticket_fare_category | cabin_alphabet | name_title | age_category | ticket_fare_category_order | age_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | A/5 | 21171 | 1 | (4.010999999999999, 8.676] | n | Mr | (16.336, 24.294] | 1.0 | 3.0 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC | 17599 | 1 | (26.25, 512.329] | C | Mrs | (32.252, 40.21] | 3.0 | 5.0 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | STON/O2. | 3101282 | 0 | (4.010999999999999, 8.676] | n | Miss | (24.294, 32.252] | 1.0 | 4.0 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | non | 113803 | 1 | (26.25, 512.329] | C | Mrs | (32.252, 40.21] | 3.0 | 5.0 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | non | 373450 | 0 | (4.010999999999999, 8.676] | n | Mr | (32.252, 40.21] | 1.0 | 5.0 |
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 is_survived 891 non-null int64
1 p_class 891 non-null int64
2 name 891 non-null object
3 sex 891 non-null object
4 age 891 non-null float64
5 num_sb_sp 891 non-null int64
6 num_pr_ch 891 non-null int64
7 ticket_number 891 non-null object
8 ticket_fare 891 non-null float64
9 cabin_number 204 non-null object
10 embark_port 891 non-null object
11 ticket_number_alphabet 891 non-null object
12 ticket_number_number 891 non-null object
13 num_cmp 891 non-null int64
14 ticket_fare_category 891 non-null category
15 cabin_alphabet 891 non-null object
16 name_title 891 non-null object
17 age_category 714 non-null category
18 ticket_fare_category_order 891 non-null float64
19 age_category_order 891 non-null float64
dtypes: category(2), float64(4), int64(5), object(9)
memory usage: 166.9+ KB
X = titanic.drop(["is_survived", "name", "ticket_number", "cabin_number", "ticket_number_alphabet", \
"ticket_number_number", "cabin_alphabet", "ticket_fare_category", "age_category"], axis = 1)
y = titanic["is_survived"]
X
| p_class | sex | age | num_sb_sp | num_pr_ch | ticket_fare | embark_port | num_cmp | name_title | ticket_fare_category_order | age_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.000000 | 1 | 0 | 7.2500 | S | 1 | Mr | 1.0 | 3.0 |
| 1 | 1 | female | 38.000000 | 1 | 0 | 71.2833 | C | 1 | Mrs | 3.0 | 5.0 |
| 2 | 3 | female | 26.000000 | 0 | 0 | 7.9250 | S | 0 | Miss | 1.0 | 4.0 |
| 3 | 1 | female | 35.000000 | 1 | 0 | 53.1000 | S | 1 | Mrs | 3.0 | 5.0 |
| 4 | 3 | male | 35.000000 | 0 | 0 | 8.0500 | S | 0 | Mr | 1.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | male | 27.000000 | 0 | 0 | 13.0000 | S | 0 | uncommon | 2.0 | 4.0 |
| 887 | 1 | female | 19.000000 | 0 | 0 | 30.0000 | S | 0 | Miss | 3.0 | 3.0 |
| 888 | 3 | female | 15.685263 | 1 | 2 | 23.4500 | S | 3 | Miss | 2.0 | 2.0 |
| 889 | 1 | male | 26.000000 | 0 | 0 | 30.0000 | C | 0 | Mr | 3.0 | 4.0 |
| 890 | 3 | male | 32.000000 | 0 | 0 | 7.7500 | Q | 0 | Mr | 1.0 | 4.0 |
891 rows × 11 columns
num_features = ["p_class", "age", "num_sb_sp", "num_pr_ch", "ticket_fare", "num_cmp", "ticket_fare_category_order", "age_category_order"]
nonnum_features = ["sex", "embark_port", "name_title"]
full_pipeline = ColumnTransformer([
("num", StandardScaler(), num_features),
("nonnum", OneHotEncoder(), nonnum_features),
])
X_eda_prep = full_pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_eda_prep, y, test_size = 0.1, random_state = 42)
names = ["Nearest Neighbors",
"Linear SVM",
"RBF SVM",
"Gaussian Process",
"Decision Tree",
"Random Forest",
"Neural Net",
"AdaBoost",
"Naive Bayes"
]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1, max_iter=1000),
AdaBoostClassifier(),
GaussianNB(),
]
for name, clf in zip(names, classifiers):
try:
clf.fit(X_train, y_train)
except:
clf.fit(X_train.toarray(), y_train)
try:
y_pred = clf.predict(X_test)
except:
y_pred = clf.predict(X_test.toarray())
# evaluate predictions
try:
accuracy = clf.score(X_test, y_test)
except:
accuracy = clf.score(X_test.toarray(), y_test)
result_accuracy.loc[result_accuracy.model_name == name, "eda_variables"] = round(accuracy * 100, 3)
#print(f"{name} Accuracy: {round(accuracy * 100, 3)}")
result_accuracy
| model_name | base_accuracy | pca_accuracy | tsne_accuracy | eda_variables | |
|---|---|---|---|---|---|
| 0 | Nearest Neighbors | 83.333 | 81.111 | 81.111 | 84.444 |
| 1 | Linear SVM | 81.111 | 80.000 | 73.333 | 82.222 |
| 2 | RBF SVM | 81.111 | 82.222 | 81.111 | 80.000 |
| 3 | Gaussian Process | 83.333 | 80.000 | 82.222 | 84.444 |
| 4 | Decision Tree | 76.667 | 82.222 | 83.333 | 83.333 |
| 5 | Random Forest | 82.222 | 81.111 | 83.333 | 80.000 |
| 6 | Neural Net | 76.667 | 78.889 | 76.667 | 80.000 |
| 7 | AdaBoost | 82.222 | 78.889 | 76.667 | 82.222 |
| 8 | Naive Bayes | 78.889 | 81.111 | 73.333 | 78.889 |
- For Nearest Neighbors, Linear SVM, Gaussian Process, and AdaBoost, the EDA variables achieve the highest accuracy of all the methods.
-> Let’s tune Nearest Neighbors, Linear SVM, Gaussian Process, Decision Tree, and Random Forest, and ensemble these models.
3.4. Hyperparameter Tuning
Let’s tune the hyperparameters for each of the 5 models that I will use: Nearest Neighbors, Linear SVM, Gaussian Process, Decision Tree, and Random Forest.
Nearest Neighbors
from sklearn.model_selection import GridSearchCV

param_grid = {
'leaf_size': [1, 3, 5, 7, 9],
'n_neighbors': [1, 3, 5, 7, 9],
'p' : [1,2]
}
model = KNeighborsClassifier()
eda_knn_grid = GridSearchCV(model, param_grid, cv = 5, n_jobs = -1)
eda_knn_grid.fit(X_eda_prep, y)
print(eda_knn_grid.best_estimator_)
print(eda_knn_grid.best_score_)
KNeighborsClassifier(leaf_size=1, n_neighbors=7)
0.8237838177138912
- Best parameters: leaf_size = 1, n_neighbors = 7
- Best score: about 82%
Linear SVM
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [1, 0.1, 0.01],
}
model = SVC(kernel = "linear")
eda_svm_grid = GridSearchCV(model, param_grid, cv = 5, n_jobs = -1)
eda_svm_grid.fit(X_eda_prep, y)
print(eda_svm_grid.best_estimator_)
print(eda_svm_grid.best_score_)
SVC(C=0.1, gamma=1, kernel='linear')
0.8248948590797817
- Best parameters: C = 0.1 (gamma is ignored when the kernel is linear)
- Best score: about 82%
Gaussian Process
from sklearn.gaussian_process.kernels import DotProduct, Matern, RationalQuadratic, WhiteKernel

param_grid = {
"kernel": [1 * RBF(), 1 * DotProduct(), 1 * Matern(), 1 * RationalQuadratic(), 1 * WhiteKernel()]
}
model = GaussianProcessClassifier()
eda_gaussian_grid = GridSearchCV(model, param_grid, cv = 5, n_jobs = -1)
eda_gaussian_grid.fit(X_eda_prep, y)
print(eda_gaussian_grid.best_estimator_)
print(eda_gaussian_grid.best_score_)
/opt/anaconda3/lib/python3.9/site-packages/sklearn/gaussian_process/kernels.py:411: ConvergenceWarning: The optimal value found for dimension 0 of parameter k2__alpha is close to the specified upper bound 100000.0. Increasing the bound and calling fit again may find a better value.
warnings.warn("The optimal value found for "
/opt/anaconda3/lib/python3.9/site-packages/sklearn/gaussian_process/kernels.py:411: ConvergenceWarning: The optimal value found for dimension 0 of parameter k2__alpha is close to the specified upper bound 100000.0. Increasing the bound and calling fit again may find a better value.
warnings.warn("The optimal value found for "
GaussianProcessClassifier(kernel=1**2 * RBF(length_scale=1))
0.8271608813006089
- Best parameters: kernel = 1 ** 2 * RBF(length_scale = 1)
- Best score: about 83%
Decision tree
param_grid = {
"splitter":["best","random"],
"max_depth" : [1,3,5,7,9],
"min_samples_leaf":[1,2,3],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None, 20, 40, 60]
}
model = DecisionTreeClassifier()
eda_decision_tree_grid = GridSearchCV(model, param_grid, cv = 5, n_jobs = -1)
eda_decision_tree_grid.fit(X_eda_prep, y)
print(eda_decision_tree_grid.best_estimator_)
print(eda_decision_tree_grid.best_score_)
DecisionTreeClassifier(max_depth=9, max_features='auto', max_leaf_nodes=40,
min_samples_leaf=2, min_weight_fraction_leaf=0.2)
0.7968740192078337
- Best parameters: max_depth = 9, max_features = “auto”, max_leaf_nodes = 40, min_samples_leaf = 2, min_weight_fraction_leaf = 0.2 (splitter = “best”, the default)
- Best score: about 80%
Random forest
param_grid = {
'bootstrap': [True, False],
'max_depth': [10, 30, 50],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [5, 10, 15, 20]
}
model = RandomForestClassifier()
eda_random_forest_grid = GridSearchCV(model, param_grid, cv = 5, n_jobs = -1)
eda_random_forest_grid.fit(X_eda_prep, y)
print(eda_random_forest_grid.best_estimator_)
print(eda_random_forest_grid.best_score_)
RandomForestClassifier(max_depth=30, max_features='sqrt', min_samples_leaf=2,
min_samples_split=5, n_estimators=20)
0.8439834285355596
- Best parameters: bootstrap = True (the default), max_depth = 30, max_features = “sqrt”, min_samples_leaf = 2, min_samples_split = 5, n_estimators = 20
- Best score: about 84%
3.5. Ensemble Models
Now, let’s build the tuned classifiers and ensemble them.
Nearest Neighbors, Linear SVM, Gaussian Process, Decision Tree, Random Forest
X = titanic.drop(["is_survived", "name", "ticket_number", "cabin_number", "ticket_number_alphabet", \
"ticket_number_number", "cabin_alphabet", "ticket_fare_category", "age_category"], axis = 1)
y = titanic["is_survived"]
num_features = ["p_class", "age", "num_sb_sp", "num_pr_ch", "ticket_fare", "num_cmp", "ticket_fare_category_order", "age_category_order"]
nonnum_features = ["sex", "embark_port", "name_title"]
full_pipeline = ColumnTransformer([
("num", StandardScaler(), num_features),
("nonnum", OneHotEncoder(), nonnum_features),
])
X_eda_prep = full_pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_eda_prep, y, test_size = 0.1, random_state = 42)
clf_knn = KNeighborsClassifier(leaf_size = eda_knn_grid.best_params_["leaf_size"], n_neighbors = eda_knn_grid.best_params_["n_neighbors"])
clf_svc = SVC(kernel = "linear", C = eda_svm_grid.best_params_["C"], gamma = eda_svm_grid.best_params_["gamma"])
clf_gaussian_process = GaussianProcessClassifier(kernel = eda_gaussian_grid.best_params_["kernel"])
clf_decision_tree = DecisionTreeClassifier(
max_depth = eda_decision_tree_grid.best_params_["max_depth"],
max_features = eda_decision_tree_grid.best_params_["max_features"],
max_leaf_nodes = eda_decision_tree_grid.best_params_["max_leaf_nodes"],
min_samples_leaf = eda_decision_tree_grid.best_params_["min_samples_leaf"],
min_weight_fraction_leaf = eda_decision_tree_grid.best_params_["min_weight_fraction_leaf"],
splitter = eda_decision_tree_grid.best_params_["splitter"],
)
clf_random_forest = RandomForestClassifier(
bootstrap = eda_random_forest_grid.best_params_["bootstrap"],
max_depth = eda_random_forest_grid.best_params_["max_depth"],
max_features = eda_random_forest_grid.best_params_["max_features"],
min_samples_leaf = eda_random_forest_grid.best_params_["min_samples_leaf"],
min_samples_split = eda_random_forest_grid.best_params_["min_samples_split"],
n_estimators = eda_random_forest_grid.best_params_["n_estimators"],
)
clf_ensemble = VotingClassifier(
estimators = [("knn", clf_knn), ("svc", clf_svc), ("gp", clf_gaussian_process),
("dt", clf_decision_tree), ("rf", clf_random_forest)],
voting = "hard"
)
clf_ensemble = clf_ensemble.fit(X_train, y_train)
y_pred = clf_ensemble.predict(X_test)
accuracy = clf_ensemble.score(X_test, y_test)
round(accuracy * 100, 3)
82.222
- The ensemble model achieves about 82% accuracy on the held-out test split.
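As a design note, GridSearchCV refits the best configuration on the full data by default (refit = True), so the tuned models can also be pulled out via best_estimator_ instead of being re-built from best_params_. A sketch of an equivalent, terser construction:
# Sketch: build the same ensemble from the fitted grid searches.
# VotingClassifier clones and refits its estimators, so passing the
# already-fitted best_estimator_ objects is safe.
clf_ensemble = VotingClassifier(
    estimators = [("knn", eda_knn_grid.best_estimator_),
                  ("svc", eda_svm_grid.best_estimator_),
                  ("gp", eda_gaussian_grid.best_estimator_),
                  ("dt", eda_decision_tree_grid.best_estimator_),
                  ("rf", eda_random_forest_grid.best_estimator_)],
    voting = "hard",
).fit(X_train, y_train)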
4. Prepare submission
X.head()
| p_class | sex | age | num_sb_sp | num_pr_ch | ticket_fare | embark_port | num_cmp | name_title | ticket_fare_category_order | age_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | 1 | Mr | 1.0 | 3.0 |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 1 | Mrs | 3.0 | 5.0 |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | 0 | Miss | 1.0 | 4.0 |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | 1 | Mrs | 3.0 | 5.0 |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | 0 | Mr | 1.0 | 5.0 |
test = test[X.columns]
test.head()
| p_class | sex | age | num_sb_sp | num_pr_ch | ticket_fare | embark_port | num_cmp | name_title | ticket_fare_category_order | age_category_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q | 0 | Mr | 1.0 | 5.0 |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S | 1 | Mrs | 1.0 | 6.0 |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q | 0 | Mr | 2.0 | 8.0 |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S | 0 | Mr | 1.0 | 4.0 |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S | 2 | Mrs | 2.0 | 3.0 |
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 p_class 418 non-null int64
1 sex 418 non-null object
2 age 418 non-null float64
3 num_sb_sp 418 non-null int64
4 num_pr_ch 418 non-null int64
5 ticket_fare 418 non-null float64
6 embark_port 418 non-null object
7 num_cmp 418 non-null int64
8 name_title 418 non-null object
9 ticket_fare_category_order 418 non-null float64
10 age_category_order 418 non-null float64
dtypes: float64(4), int64(4), object(3)
memory usage: 39.2+ KB
X_submission_prep = full_pipeline.transform(test)
y_pred = clf_ensemble.predict(X_submission_prep)
submission = pd.DataFrame({
"PassengerId": test_passenger_id,
"Survived": y_pred
})
submission.head()
| PassengerId | Survived | |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
#submission.to_csv('./data/titanic_submission.csv', index=False)
