HW4. Visualization, Correlation, and Linear Models
Topics: Data visualization, Correlation, Linear models
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.patches as mpatches
warnings.filterwarnings('ignore')
MY_UNIQNAME = 'yjwoo'
We will be using two different datasets for the two different parts of this homework. Download the data from:
YouTube provides a list of trending videos on its site, determined by user interaction metrics such as likes, comments, and views. This dataset includes months of daily trending videos across five different regions: the United States (“US”), Canada (“CA”), Great Britain (“GB”), Germany (“DE”), and France (“FR”).
- https://www.kaggle.com/abcsds/pokemon
This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed.
Part 1: Answer the questions below based on the YouTube dataset
- Write Python code that can answer the following questions, and
- Explain your answers in plain English.
Q1. For 15 Points: Compare the distributions of comments, views, likes, and dislikes:
- Plot histograms for these metrics for Canada. What can you say about them?
- Try to apply a log transformation, and plot the histograms again. How do they look now?
- Create a pairplot for Canada, as we did in this week’s class. Do you see anything interesting?
- Create additional pairplots for the other four regions. Do they look similar?
df_youtube_canada = pd.read_csv("./data/CAvideos.csv")
df_youtube_canada.head()
| video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n1WpP7iowLc | 17.14.11 | Eminem - Walk On Water (Audio) ft. Beyoncé | EminemVEVO | 10 | 2017-11-10T17:00:03.000Z | Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... | 17158579 | 787425 | 43420 | 125882 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | False | False | False | Eminem's new track Walk on Water ft. Beyoncé i... |
| 1 | 0dBIkQ4Mz1M | 17.14.11 | PLUSH - Bad Unboxing Fan Mail | iDubbbzTV | 23 | 2017-11-13T17:00:00.000Z | plush|"bad unboxing"|"unboxing"|"fan mail"|"id... | 1014651 | 127794 | 1688 | 13030 | https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg | False | False | False | STill got a lot of packages. Probably will las... |
| 2 | 5qpjK5DgCt4 | 17.14.11 | Racist Superman | Rudy Mancuso, King Bach & Le... | Rudy Mancuso | 23 | 2017-11-12T19:05:24.000Z | racist superman|"rudy"|"mancuso"|"king"|"bach"... | 3191434 | 146035 | 5339 | 8181 | https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg | False | False | False | WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http... |
| 3 | d380meD0W0M | 17.14.11 | I Dare You: GOING BALD!? | nigahiga | 24 | 2017-11-12T18:01:41.000Z | ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... | 2095828 | 132239 | 1989 | 17518 | https://i.ytimg.com/vi/d380meD0W0M/default.jpg | False | False | False | I know it's been a while since we did this sho... |
| 4 | 2Vv-BfVoq4g | 17.14.11 | Ed Sheeran - Perfect (Official Music Video) | Ed Sheeran | 10 | 2017-11-09T11:04:14.000Z | edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... | 33523622 | 1634130 | 21082 | 85067 | https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg | False | False | False | 🎧: https://ad.gt/yt-perfect\n💰: https://atlant... |
df_youtube_canada.shape
(40881, 16)
The youtube_canada data consists of a total of 40881 rows and 16 columns.
df_youtube_canada.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40881 entries, 0 to 40880
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 video_id 40881 non-null object
1 trending_date 40881 non-null object
2 title 40881 non-null object
3 channel_title 40881 non-null object
4 category_id 40881 non-null int64
5 publish_time 40881 non-null object
6 tags 40881 non-null object
7 views 40881 non-null int64
8 likes 40881 non-null int64
9 dislikes 40881 non-null int64
10 comment_count 40881 non-null int64
11 thumbnail_link 40881 non-null object
12 comments_disabled 40881 non-null bool
13 ratings_disabled 40881 non-null bool
14 video_error_or_removed 40881 non-null bool
15 description 39585 non-null object
dtypes: bool(3), int64(5), object(8)
memory usage: 4.2+ MB
You can check that there are no missing values except for the description column.
Views
df_youtube_canada["views"].describe()
count 4.088100e+04
mean 1.147036e+06
std 3.390913e+06
min 7.330000e+02
25% 1.439020e+05
50% 3.712040e+05
75% 9.633020e+05
max 1.378431e+08
Name: views, dtype: float64
sns.set(style = "darkgrid")
plt.figure(figsize = (8, 7))
sns.histplot(x = df_youtube_canada["views"])
plt.xlabel("Views", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.show()

The views column has a very large spread, with a minimum of 733 and a maximum of about 138 million, and a correspondingly large standard deviation of about 3.4 million. As a result, the raw histogram above reveals little about the shape of the distribution. To see it in more detail, let’s look only at the data within the 95th percentile.
percentile_95_views = np.percentile(df_youtube_canada["views"], 95)
percentile_95_views
4090835.0
The 95th percentile of the views column is about 4 million.
sns.set(style="darkgrid")
fig, (ax_box, ax_hist) = plt.subplots(2, sharex = True, gridspec_kw = {"height_ratios": (.2, .8)}, figsize = (10, 7))
sns.boxplot(x = df_youtube_canada.views, ax = ax_box, showfliers = False)
sns.histplot(x = df_youtube_canada[df_youtube_canada["views"] <= percentile_95_views].views, ax = ax_hist)
plt.xlabel("Views", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
ax_box.set_xlabel("")
plt.show()

Looking only at the data below the 95th percentile, the distribution is right-skewed with a median of about 370,000.
np.log(df_youtube_canada["views"]).describe()
count 40881.000000
mean 12.810707
std 1.508807
min 6.597146
25% 11.876888
50% 12.824507
75% 13.778122
max 18.741627
Name: views, dtype: float64
sns.set(style="darkgrid")
plt.figure(figsize = (8, 7))
sns.histplot(x = np.log(df_youtube_canada["views"]))
plt.xlabel("Views", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.show()

After the log transformation, the histogram has a rough bell shape, suggesting an approximately normal distribution. However, since the shape of the histogram alone is not conclusive, let’s also check the lag plot, QQ plot, and run sequence plot.
def checkPlots(data, column, isLogTransform):
    # Draw four diagnostic plots (histogram, lag plot, QQ plot, run sequence)
    # for one column, optionally after a log transformation.
    if isLogTransform:
        series = np.log(data[column])
    else:
        series = data[column]
    fig, axs = plt.subplots(2, 2, figsize = (15, 10))
    plt.tight_layout(pad = 0.4, w_pad = 4, h_pad = 1.0)
    # Histogram
    ax = sns.histplot(series, ax = axs[0, 0])
    if isLogTransform:
        ax.set_xlabel(f"log({column})", fontsize = 12)
    else:
        ax.set_xlabel(column, fontsize = 12)
    ax.set_ylabel("Count", fontsize = 12)
    ax.set_title("Histogram", fontsize = 16)
    # Lag plot: y_{i-1} against y_i; no visible structure suggests independence
    lag = series.copy()
    lag = np.array(lag[:-1])
    current = series[1:]
    ax = sns.regplot(x = current, y = lag, fit_reg = False, ax = axs[0, 1])
    ax.set_ylabel("y_i-1", fontsize = 12)
    ax.set_xlabel("y_i", fontsize = 12)
    ax.set_title("Lag plot", fontsize = 16)
    # QQ plot: points close to the diagonal suggest normality
    qntls, xr = stats.probplot(series, fit = False)
    ax = sns.regplot(x = xr, y = qntls, ax = axs[1, 0])
    ax.set_title("QQ plot", fontsize = 16)
    # Run sequence plot: values in observation order; no trend suggests stability
    ax = sns.regplot(x = np.arange(len(series)), y = series, ax = axs[1, 1])
    ax.set_ylabel("val", fontsize = 12)
    ax.set_xlabel("i", fontsize = 12)
    ax.set_title("Run sequence plot", fontsize = 16)
    plt.tight_layout()
    plt.show()
checkPlots(data = df_youtube_canada, column = "views", isLogTransform = True)

The lag plot and run sequence plot should show no systematic pattern, and in the QQ plot, the closer the points lie to the diagonal, the closer the distribution is to a bell shape. Since the log-transformed views satisfies all of these conditions, it appears close to a normal distribution.
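These visual checks can be complemented with a quantitative test. Below is a minimal sketch using scipy.stats.normaltest (D'Agostino-Pearson); note that with roughly 40,000 observations even tiny departures from normality produce near-zero p-values, so the result should be read as a rough sanity check rather than a verdict.
# Quantitative complement to the visual checks above.
log_views = np.log(df_youtube_canada["views"])
stat, p = stats.normaltest(log_views)
print(f"normaltest statistic = {stat:.1f}, p-value = {p:.3g}")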
Likes
df_youtube_canada["likes"].describe()
count 4.088100e+04
mean 3.958269e+04
std 1.326895e+05
min 0.000000e+00
25% 2.191000e+03
50% 8.780000e+03
75% 2.871700e+04
max 5.053338e+06
Name: likes, dtype: float64
sns.set(style="darkgrid")
plt.figure(figsize = (8, 7))
sns.histplot(x = df_youtube_canada["likes"])
plt.xlabel("Likes", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.show()

The likes column has a very large spread, with a minimum of 0 and a maximum of about 5 million, and a correspondingly large standard deviation of about 133,000. As a result, the raw histogram above reveals little about the shape of the distribution. To see it in more detail, let’s look only at the data within the 95th percentile.
percentile_95_likes = np.percentile(df_youtube_canada["likes"], 95)
percentile_95_likes
165252.0
The 95th percentile of the likes column is about 165,000.
sns.set(style="darkgrid")
fig, (ax_box, ax_hist) = plt.subplots(2, sharex = True, gridspec_kw = {"height_ratios": (.2, .8)}, figsize = (10, 7))
sns.boxplot(x = df_youtube_canada.likes, ax = ax_box, showfliers = False)
sns.histplot(x = df_youtube_canada[df_youtube_canada["likes"] <= percentile_95_likes].likes, ax = ax_hist)
plt.xlabel("Likes", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
ax_box.set_xlabel("")
plt.show()

Looking only at the data below the 95th percentile, the distribution is right-skewed with a median of about 8,780. Since likes contains zeros, for which the log is undefined, they are excluded before computing the log summary below.
np.log(df_youtube_canada[df_youtube_canada["likes"] > 0].likes).describe()
count 40597.000000
mean 8.951208
std 1.969017
min 0.000000
25% 7.727535
50% 9.095154
75% 10.275051
max 15.435560
Name: likes, dtype: float64
checkPlots(data = df_youtube_canada, column = "likes", isLogTransform = True)

In the log-transformed plots, the lag plot and run sequence plot show no pattern. Although the QQ plot deviates slightly from the diagonal, the histogram is close to bell-shaped. Therefore, the log-transformed likes approximates a normal distribution.
Dislikes
df_youtube_canada["dislikes"].describe()
count 4.088100e+04
mean 2.009195e+03
std 1.900837e+04
min 0.000000e+00
25% 9.900000e+01
50% 3.030000e+02
75% 9.500000e+02
max 1.602383e+06
Name: dislikes, dtype: float64
sns.set(style="darkgrid")
plt.figure(figsize = (8, 7))
sns.histplot(x = df_youtube_canada["dislikes"])
plt.xlabel("Dislikes", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.show()

The dislikes column has a very large spread, with a minimum of 0 and a maximum of about 1.6 million, and a correspondingly large standard deviation of about 19,000. As a result, the raw histogram above reveals little about the shape of the distribution. To see it in more detail, let’s look only at the data within the 95th percentile.
percentile_95_dislikes = np.percentile(df_youtube_canada["dislikes"], 95)
percentile_95_dislikes
6479.0
The 95th percentile of the dislikes column is 6,479.
sns.set(style="darkgrid")
fig, (ax_box, ax_hist) = plt.subplots(2, sharex = True, gridspec_kw = {"height_ratios": (.2, .8)}, figsize = (10, 7))
sns.boxplot(x = df_youtube_canada.dislikes, ax = ax_box, showfliers = False)
sns.histplot(x = df_youtube_canada[df_youtube_canada["dislikes"] <= percentile_95_dislikes].dislikes, ax = ax_hist)
plt.xlabel("Dislikes", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
ax_box.set_xlabel("")
plt.show()

Looking only at the data below the 95th percentile, the distribution is right-skewed with a median of about 303. As with likes, zeros are excluded before computing the log summary below.
np.log(df_youtube_canada[df_youtube_canada["dislikes"] > 0].dislikes).describe()
count 40488.000000
mean 5.761168
std 1.796411
min 0.000000
25% 4.634729
50% 5.733341
75% 6.870313
max 14.287002
Name: dislikes, dtype: float64
checkPlots(data = df_youtube_canada, column = "dislikes", isLogTransform = True)

In the log-transformed plots, the lag plot and run sequence plot show no pattern, and the QQ plot is almost diagonal. Therefore, the log-transformed dislikes approximates a normal distribution.
Comment counts
df_youtube_canada["comment_count"].describe()
count 4.088100e+04
mean 5.042975e+03
std 2.157902e+04
min 0.000000e+00
25% 4.170000e+02
50% 1.301000e+03
75% 3.713000e+03
max 1.114800e+06
Name: comment_count, dtype: float64
sns.set(style="darkgrid")
plt.figure(figsize = (8, 7))
sns.histplot(x = df_youtube_canada["comment_count"])
plt.xlabel("Comment counts", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.show()

The comment count column has a very large spread, with a minimum of 0 and a maximum of 1,114,800, and a correspondingly large standard deviation of about 21,600. As a result, the raw histogram above reveals little about the shape of the distribution. To see it in more detail, let’s look only at the data within the 95th percentile.
percentile_95_comments = np.percentile(df_youtube_canada["comment_count"], 95)
percentile_95_comments
19210.0
The 95th percentile of the comment count column is 19,210.
sns.set(style="darkgrid")
fig, (ax_box, ax_hist) = plt.subplots(2, sharex = True, gridspec_kw = {"height_ratios": (.2, .8)}, figsize = (10, 7))
sns.boxplot(x = df_youtube_canada.comment_count, ax = ax_box, showfliers = False)
sns.histplot(x = df_youtube_canada[df_youtube_canada["comment_count"] <= percentile_95_comments].comment_count, ax = ax_hist)
plt.xlabel("Comment counts", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
ax_box.set_xlabel("")
plt.show()

Looking only at the data below the 95th percentile, the distribution is right-skewed with a median of about 1,301. Again, zeros are excluded before computing the log summary below.
np.log(df_youtube_canada[df_youtube_canada["comment_count"] > 0].comment_count).describe()
count 40235.000000
mean 7.116182
std 1.752948
min 0.000000
25% 6.102559
50% 7.202661
75% 8.238801
max 13.924186
Name: comment_count, dtype: float64
checkPlots(data = df_youtube_canada, column = "comment_count", isLogTransform = True)

In the log-transformed plots, the lag plot and run sequence plot show no pattern. Although the QQ plot deviates slightly from the diagonal, the histogram is close to bell-shaped. Therefore, the log-transformed comment counts approximate a normal distribution.
Pair plot
# pairplot creates its own figure, so a preceding plt.figure(figsize=...) only
# leaves an empty figure behind; the height parameter controls the size instead.
sns.pairplot(df_youtube_canada, vars = ['views', 'likes', 'dislikes', 'comment_count'], height = 5)
plt.show()

Above is a pair plot of views, likes, dislikes, and comment_count. A few extreme values make the overall distribution hard to see, so let’s plot only the data below the 95th percentile of each column. Also, with this many points a plain scatter plot is hard to read, so let’s check the distribution with a KDE plot.
df_youtube_canada["country"] = "Canada"
df_youtube_canada_under_percentile = df_youtube_canada[(df_youtube_canada.views <= np.percentile(df_youtube_canada.views, 95)) & \
(df_youtube_canada.likes <= np.percentile(df_youtube_canada.likes, 95)) & \
(df_youtube_canada.dislikes <= np.percentile(df_youtube_canada.dislikes, 95)) & \
(df_youtube_canada.comment_count <= np.percentile(df_youtube_canada.comment_count, 95))]
df_youtube_canada.shape, df_youtube_canada_under_percentile.shape
((40881, 17), (37131, 17))
When only the rows below the 95th percentile of every column are kept, about 3,750 of the 40,881 rows are excluded, so the overall distribution is not affected significantly.
sns.pairplot(df_youtube_canada_under_percentile, vars = ['views', 'likes', 'dislikes', 'comment_count'], kind = "kde", height = 5)
plt.show()

The above shows a KDE pair plot of these columns. However, the KDE plot takes too long to compute on the full data, so let’s sample 3,000 rows and draw it again.
np.random.seed(0)
sns.pairplot(df_youtube_canada_under_percentile.sample(3000), vars = ['views', 'likes', 'dislikes', 'comment_count'], kind = "kde", height = 5)
plt.show()

The KDE plot of the full 37,131 rows and the KDE plot of the 3,000-row sample look almost identical. It would of course be better to compare distributions using the entire data, but since that is too slow to run, let’s sample 3,000 rows per country and compare the KDE plots.
# read France data
df_youtube_france = pd.read_csv("./data/FRvideos.csv")
df_youtube_france["country"] = "France"
df_youtube_france_under_percentile = df_youtube_france[(df_youtube_france.views <= np.percentile(df_youtube_france.views, 95)) & \
(df_youtube_france.likes <= np.percentile(df_youtube_france.likes, 95)) & \
(df_youtube_france.dislikes <= np.percentile(df_youtube_france.dislikes, 95)) & \
(df_youtube_france.comment_count <= np.percentile(df_youtube_france.comment_count, 95))]
# read US data
df_youtube_us = pd.read_csv("./data/USvideos.csv")
df_youtube_us["country"] = "United States"
df_youtube_us_under_percentile = df_youtube_us[(df_youtube_us.views <= np.percentile(df_youtube_us.views, 95)) & \
(df_youtube_us.likes <= np.percentile(df_youtube_us.likes, 95)) & \
(df_youtube_us.dislikes <= np.percentile(df_youtube_us.dislikes, 95)) & \
(df_youtube_us.comment_count <= np.percentile(df_youtube_us.comment_count, 95))]
# read Germany data
df_youtube_germany = pd.read_csv("./data/DEvideos.csv")
df_youtube_germany["country"] = "Germany"
df_youtube_germany_under_percentile = df_youtube_germany[(df_youtube_germany.views <= np.percentile(df_youtube_germany.views, 95)) & \
(df_youtube_germany.likes <= np.percentile(df_youtube_germany.likes, 95)) & \
(df_youtube_germany.dislikes <= np.percentile(df_youtube_germany.dislikes, 95)) & \
(df_youtube_germany.comment_count <= np.percentile(df_youtube_germany.comment_count, 95))]
# read Great Britain data
df_youtube_gb = pd.read_csv("./data/GBvideos.csv")
df_youtube_gb["country"] = "Great Britain"
df_youtube_gb_under_percentile = df_youtube_gb[(df_youtube_gb.views <= np.percentile(df_youtube_gb.views, 95)) & \
(df_youtube_gb.likes <= np.percentile(df_youtube_gb.likes, 95)) & \
(df_youtube_gb.dislikes <= np.percentile(df_youtube_gb.dislikes, 95)) & \
(df_youtube_gb.comment_count <= np.percentile(df_youtube_gb.comment_count, 95))]
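As an aside, the four load-and-filter blocks above repeat the same logic. A compact, behavior-equivalent helper could look like the sketch below (load_region and METRICS are names introduced here, not part of the original code):
# Hypothetical helper consolidating the repeated per-country steps above.
METRICS = ["views", "likes", "dislikes", "comment_count"]

def load_region(path, country):
    """Read one region's CSV; also return the rows at or below the 95th
    percentile of every metric, mirroring the blocks above."""
    df = pd.read_csv(path)
    df["country"] = country
    mask = np.ones(len(df), dtype = bool)
    for col in METRICS:
        mask &= df[col] <= np.percentile(df[col], 95)
    return df, df[mask]

# Example: df_youtube_gb, df_youtube_gb_under_percentile = load_region("./data/GBvideos.csv", "Great Britain")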
g = sns.pairplot(df_youtube_france_under_percentile.sample(3000), \
vars = ['views', 'likes', 'dislikes', 'comment_count'], \
plot_kws = {'alpha':0.5}, corner = True, height = 5)
g.map_lower(sns.kdeplot, color = "darkBlue")
plt.show()

g = sns.pairplot(df_youtube_us_under_percentile.sample(3000), \
vars = ['views', 'likes', 'dislikes', 'comment_count'], \
plot_kws = {'alpha':0.5}, corner = True, height = 5)
g.map_lower(sns.kdeplot, color = "darkBlue")
plt.show()

g = sns.pairplot(df_youtube_germany_under_percentile.sample(3000), \
vars = ['views', 'likes', 'dislikes', 'comment_count'], \
plot_kws = {'alpha':0.5}, corner = True, height = 5)
g.map_lower(sns.kdeplot, color = "darkBlue")
plt.show()

g = sns.pairplot(df_youtube_gb_under_percentile.sample(3000), \
vars = ['views', 'likes', 'dislikes', 'comment_count'], \
plot_kws = {'alpha':0.5}, corner = True, height = 5)
g.map_lower(sns.kdeplot, color = "darkBlue")
plt.show()

The pairplots of the other four regions all show a similar pattern: the mass is concentrated near (0, 0) and gradually spreads out as the values on both axes increase.
Q2. For 10 Points: Create a heatmap of correlations between the variables for a region of your choice
A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.
Seaborn makes it easy to create a heatmap with seaborn.heatmap()
- Create a correlation matrix for your numeric variables using Pandas with DataFrame.corr(). That is, if your dataframe is called df, use df.corr().
- Pass in your correlation matrix to seaborn.heatmap(), and annotate it with the parameter annot=True.
- Experiment with colormaps that are different from the default one and choose one that you think is best. Comment on why you think so.
- Are there any interesting correlations? What are they?
us_corr = df_youtube_us.corr()
us_corr
| category_id | views | likes | dislikes | comment_count | comments_disabled | ratings_disabled | video_error_or_removed | |
|---|---|---|---|---|---|---|---|---|
| category_id | 1.000000 | -0.168231 | -0.173921 | -0.033547 | -0.076307 | 0.048949 | -0.013506 | -0.030011 |
| views | -0.168231 | 1.000000 | 0.849177 | 0.472213 | 0.617621 | 0.002677 | 0.015355 | -0.002256 |
| likes | -0.173921 | 0.849177 | 1.000000 | 0.447186 | 0.803057 | -0.028918 | -0.020888 | -0.002641 |
| dislikes | -0.033547 | 0.472213 | 0.447186 | 1.000000 | 0.700184 | -0.004431 | -0.008230 | -0.001853 |
| comment_count | -0.076307 | 0.617621 | 0.803057 | 0.700184 | 1.000000 | -0.028277 | -0.013819 | -0.003725 |
| comments_disabled | 0.048949 | 0.002677 | -0.028918 | -0.004431 | -0.028277 | 1.000000 | 0.319230 | -0.002970 |
| ratings_disabled | -0.013506 | 0.015355 | -0.020888 | -0.008230 | -0.013819 | 0.319230 | 1.000000 | -0.001526 |
| video_error_or_removed | -0.030011 | -0.002256 | -0.002641 | -0.001853 | -0.003725 | -0.002970 | -0.001526 | 1.000000 |
The above table shows the correlation matrix of numerical variables
# Flatten the correlation matrix into (var1, var2, correlation) rows
df_corr = us_corr.unstack().reset_index().rename(columns = {"level_0" : "var1", "level_1" : "var2", 0 : "correlation"})
# Drop self-pairs and mirrored duplicates (frozenset ignores pair order)
mask_dups = (df_corr[['var1', 'var2']].apply(frozenset, axis=1).duplicated()) | (df_corr['var1'] == df_corr['var2'])
df_corr = df_corr[~mask_dups]
df_corr.sort_values("correlation", key = abs, ascending = False).head(5)
| var1 | var2 | correlation | |
|---|---|---|---|
| 10 | views | likes | 0.849177 |
| 20 | likes | comment_count | 0.803057 |
| 28 | dislikes | comment_count | 0.700184 |
| 12 | views | comment_count | 0.617621 |
| 11 | views | dislikes | 0.472213 |
The five most strongly correlated pairs are shown above. Views and likes, and likes and comment counts, both have correlations of 0.8 or more, indicating strong linear relationships between those variables.
sns.heatmap(us_corr, annot = True, linewidths = .5, center = 0)
<AxesSubplot:>

The above graph is a heatmap of the correlation matrix. Let’s try with colormaps that are different from the default one.
sns.heatmap(us_corr, annot = True, linewidths = .5, cmap="YlGnBu", center = 0)
<AxesSubplot:>

sns.heatmap(us_corr, annot = True, linewidths = .5, cmap="Blues", center = 0)
<AxesSubplot:>

sns.heatmap(us_corr, annot = True, linewidths = .5, cmap="BuPu", center = 0)
<AxesSubplot:>

sns.heatmap(us_corr, annot = True, linewidths = .5, cmap="Greens", center = 0)
<AxesSubplot:>

sns.heatmap(us_corr, annot = True, linewidths = .5, cmap="PiYG", center = 0)
<AxesSubplot:>

sns.heatmap(us_corr, annot = True, linewidths = .5, cmap="BrBG", center = 0)
<AxesSubplot:>

I think BrBG is the best because the distinction between high and low values is clear.
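As an optional refinement (not required by the question), since the correlation matrix is symmetric, the upper triangle can be masked so that each correlation appears only once:
# Optional: hide the redundant upper triangle of the symmetric matrix.
mask = np.triu(np.ones_like(us_corr, dtype = bool))
sns.heatmap(us_corr, mask = mask, annot = True, linewidths = .5, cmap = "BrBG", center = 0)
plt.show()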
Although the correlation was computed over all numeric columns, category_id, comments_disabled, ratings_disabled, and video_error_or_removed are really categorical or boolean flags, so their correlations carry little meaning. Only views, likes, dislikes, and comment_count are genuinely numeric, and these are exactly the columns with high correlation values. Among them, the linear relationships between views and likes, likes and comment_count, and dislikes and comment_count were the strongest.
Q3. For 15 points: Create and compare OLS models using variables of your choice, for a region of your choice
- Use statsmodels to perform an ANOVA (categorical regression) of a variable of your choice as the dependent variable (for example, views) and the video category as the independent variable. Note that you need to use a categorical variable as your independent variable.
- Provide your interpretation of the results.
- Create two different regression models where the dependent variable is the same, and the independent variables are different. Note that your independent variables need to be continuous numerical variables. What does your interpretation say about the two models?
sns.boxplot(x = "category_id", y = "dislikes", data = df_youtube_us, showfliers = False)
plt.show()

The boxplot of dislikes by category_id for the US region is shown above. Categories 1, 10, and 20 appear to have noticeably higher dislike counts than the others. Let’s check this numerically through an ANOVA analysis.
lm0 = smf.ols("dislikes ~ category_id", data = df_youtube_us)
res0 = lm0.fit()
print(res0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: dislikes R-squared: 0.001
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 46.13
Date: Tue, 15 Feb 2022 Prob (F-statistic): 1.12e-11
Time: 15:46:37 Log-Likelihood: -4.7888e+05
No. Observations: 40949 AIC: 9.578e+05
Df Residuals: 40947 BIC: 9.578e+05
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 6281.3987 404.626 15.524 0.000 5488.323 7074.474
category_id -128.6773 18.945 -6.792 0.000 -165.809 -91.545
==============================================================================
Omnibus: 118799.008 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6797851742.784
Skew: 40.289 Prob(JB): 0.00
Kurtosis: 1997.416 Cond. No. 60.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
When the ANOVA analysis is performed, the F-statistic’s p-value is almost 0. This means the null hypothesis that there is no significant difference between categories can be rejected, so there is a statistically significant difference in dislikes across category_id. Note that because category_id is stored as an integer, the formula above actually fits it as a single numeric slope; wrapping it as C(category_id) treats it as categorical.
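As a cross-check, a minimal sketch of the explicitly categorical model, together with the classic ANOVA table from statsmodels’ anova_lm (not part of the original cells):
# Treat category_id explicitly as categorical; C() dummy-codes it.
lm_cat = smf.ols("dislikes ~ C(category_id)", data = df_youtube_us)
res_cat = lm_cat.fit()
print(sm.stats.anova_lm(res_cat, typ = 2))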
tukeyhsd_res0 = pairwise_tukeyhsd(df_youtube_us["dislikes"], df_youtube_us["category_id"])
tukeyhsd_res0.summary()
| group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|
| 1 | 2 | -1957.8429 | 0.9 | -7401.0862 | 3485.4004 | False |
| 1 | 10 | 5317.0763 | 0.001 | 2933.8638 | 7700.2887 | True |
| 1 | 15 | -2017.4434 | 0.9 | -5863.9747 | 1829.0879 | False |
| 1 | 17 | -229.3424 | 0.9 | -3173.1743 | 2714.4894 | False |
| 1 | 19 | -1743.8481 | 0.9 | -7081.3482 | 3593.652 | False |
| 1 | 20 | 8651.015 | 0.001 | 4634.1106 | 12667.9194 | True |
| 1 | 22 | 583.1195 | 0.9 | -2102.9121 | 3269.151 | False |
| 1 | 23 | -499.1596 | 0.9 | -3144.3733 | 2146.0541 | False |
| 1 | 24 | 1723.6163 | 0.4009 | -545.8102 | 3993.0428 | False |
| 1 | 25 | -909.9219 | 0.9 | -3756.0023 | 1936.1585 | False |
| 1 | 26 | -1270.3971 | 0.9 | -3825.2314 | 1284.4373 | False |
| 1 | 27 | -1774.2732 | 0.8543 | -4948.0451 | 1399.4986 | False |
| 1 | 28 | -696.3033 | 0.9 | -3567.0137 | 2174.4071 | False |
| 1 | 29 | 55486.1782 | 0.001 | 42231.4548 | 68740.9016 | True |
| 1 | 43 | -2160.7165 | 0.9 | -15415.4399 | 11094.0068 | False |
| 2 | 10 | 7274.9192 | 0.001 | 2081.6175 | 12468.2209 | True |
| 2 | 15 | -59.6005 | 0.9 | -6066.8032 | 5947.6022 | False |
| 2 | 17 | 1728.5005 | 0.9 | -3744.7826 | 7201.7835 | False |
| 2 | 19 | 213.9948 | 0.9 | -6841.4704 | 7269.46 | False |
| 2 | 20 | 10608.8579 | 0.001 | 4491.1621 | 16726.5537 | True |
| 2 | 22 | 2540.9624 | 0.9 | -2798.0868 | 7880.0116 | False |
| 2 | 23 | 1458.6833 | 0.9 | -3859.9478 | 6777.3144 | False |
| 2 | 24 | 3681.4592 | 0.5062 | -1460.6199 | 8823.5384 | False |
| 2 | 25 | 1047.921 | 0.9 | -4373.4123 | 6469.2543 | False |
| 2 | 26 | 687.4458 | 0.9 | -4586.8181 | 5961.7097 | False |
| 2 | 27 | 183.5697 | 0.9 | -5416.7436 | 5783.883 | False |
| 2 | 28 | 1261.5396 | 0.9 | -4172.7643 | 6695.8436 | False |
| 2 | 29 | 57444.0211 | 0.001 | 43409.1227 | 71478.9195 | True |
| 2 | 43 | -202.8736 | 0.9 | -14237.772 | 13832.0247 | False |
| 10 | 15 | -7334.5197 | 0.001 | -10818.3808 | -3850.6586 | True |
| 10 | 17 | -5546.4187 | 0.001 | -7997.4655 | -3095.3719 | True |
| 10 | 19 | -7060.9244 | 0.001 | -12143.2853 | -1978.5635 | True |
| 10 | 20 | 3333.9387 | 0.1255 | -337.1655 | 7005.0429 | False |
| 10 | 22 | -4733.9568 | 0.001 | -6868.4943 | -2599.4193 | True |
| 10 | 23 | -5816.2359 | 0.001 | -7899.1762 | -3733.2956 | True |
| 10 | 24 | -3593.46 | 0.001 | -5171.9977 | -2014.9222 | True |
| 10 | 25 | -6226.9982 | 0.001 | -8559.7344 | -3894.2619 | True |
| 10 | 26 | -6587.4734 | 0.001 | -8554.3652 | -4620.5815 | True |
| 10 | 27 | -7091.3495 | 0.001 | -9814.273 | -4368.426 | True |
| 10 | 28 | -6013.3795 | 0.001 | -8376.1032 | -3650.6559 | True |
| 10 | 29 | 50169.1019 | 0.001 | 37015.0464 | 63323.1574 | True |
| 10 | 43 | -7477.7928 | 0.8334 | -20631.8483 | 5676.2627 | False |
| 15 | 17 | 1788.101 | 0.9 | -2100.8233 | 5677.0253 | False |
| 15 | 19 | 273.5953 | 0.9 | -5637.9607 | 6185.1512 | False |
| 15 | 20 | 10668.4584 | 0.001 | 5915.2377 | 15421.6792 | True |
| 15 | 22 | 2600.5629 | 0.5342 | -1097.0515 | 6298.1772 | False |
| 15 | 23 | 1518.2838 | 0.9 | -2149.7868 | 5186.3544 | False |
| 15 | 24 | 3741.0597 | 0.0158 | 334.0254 | 7148.0941 | True |
| 15 | 25 | 1107.5215 | 0.9 | -2707.9418 | 4922.9848 | False |
| 15 | 26 | 747.0463 | 0.9 | -2856.3916 | 4350.4843 | False |
| 15 | 27 | 243.1702 | 0.9 | -3822.591 | 4308.9314 | False |
| 15 | 28 | 1321.1401 | 0.9 | -2512.7306 | 5155.0108 | False |
| 15 | 29 | 57503.6216 | 0.001 | 44007.5008 | 70999.7424 | True |
| 15 | 43 | -143.2731 | 0.9 | -13639.394 | 13352.8477 | False |
| 17 | 19 | -1514.5057 | 0.9 | -6882.6372 | 3853.6259 | False |
| 17 | 20 | 8880.3574 | 0.001 | 4822.8397 | 12937.8752 | True |
| 17 | 22 | 812.4619 | 0.9 | -1933.9347 | 3558.8586 | False |
| 17 | 23 | -269.8172 | 0.9 | -2976.3065 | 2436.6722 | False |
| 17 | 24 | 1952.9588 | 0.2363 | -387.6022 | 4293.5197 | False |
| 17 | 25 | -680.5795 | 0.9 | -3583.6989 | 2222.54 | False |
| 17 | 26 | -1041.0546 | 0.9 | -3659.2807 | 1577.1714 | False |
| 17 | 27 | -1544.9308 | 0.9 | -4769.9512 | 1680.0896 | False |
| 17 | 28 | -466.9608 | 0.9 | -3394.2304 | 2460.3087 | False |
| 17 | 29 | 55715.5206 | 0.001 | 42448.4328 | 68982.6085 | True |
| 17 | 43 | -1931.3741 | 0.9 | -15198.4619 | 11335.7138 | False |
| 19 | 20 | 10394.8631 | 0.001 | 4371.0594 | 16418.6669 | True |
| 19 | 22 | 2326.9676 | 0.9 | -2904.2327 | 7558.1679 | False |
| 19 | 23 | 1244.6885 | 0.9 | -3965.671 | 6455.048 | False |
| 19 | 24 | 3467.4644 | 0.5652 | -1562.5442 | 8497.4731 | False |
| 19 | 25 | 833.9262 | 0.9 | -4481.228 | 6149.0804 | False |
| 19 | 26 | 473.451 | 0.9 | -4691.6113 | 5638.5134 | False |
| 19 | 27 | -30.4251 | 0.9 | -5528.0172 | 5467.1669 | False |
| 19 | 28 | 1047.5448 | 0.9 | -4280.8385 | 6375.9282 | False |
| 19 | 29 | 57230.0263 | 0.001 | 43235.7996 | 71224.253 | True |
| 19 | 43 | -416.8684 | 0.9 | -14411.0952 | 13577.3583 | False |
| 20 | 22 | -8067.8955 | 0.001 | -11942.4368 | -4193.3543 | True |
| 20 | 23 | -9150.1746 | 0.001 | -12996.5313 | -5303.8179 | True |
| 20 | 24 | -6927.3987 | 0.001 | -10525.6762 | -3329.1212 | True |
| 20 | 25 | -9560.9369 | 0.001 | -13548.1011 | -5573.7727 | True |
| 20 | 26 | -9921.4121 | 0.001 | -13706.182 | -6136.6422 | True |
| 20 | 27 | -10425.2882 | 0.001 | -14652.5961 | -6197.9803 | True |
| 20 | 28 | -9347.3183 | 0.001 | -13352.1007 | -5342.5358 | True |
| 20 | 29 | 46835.1632 | 0.001 | 33289.4998 | 60380.8265 | True |
| 20 | 43 | -10811.7315 | 0.3086 | -24357.3949 | 2733.9318 | False |
| 22 | 23 | -1082.2791 | 0.9 | -3505.8517 | 1341.2935 | False |
| 22 | 24 | 1140.4968 | 0.8337 | -866.2033 | 3147.1969 | False |
| 22 | 25 | -1493.0414 | 0.8405 | -4134.39 | 1148.3072 | False |
| 22 | 26 | -1853.5166 | 0.3103 | -4178.1084 | 471.0753 | False |
| 22 | 27 | -2357.3927 | 0.3306 | -5348.9436 | 634.1581 | False |
| 22 | 28 | -1279.4228 | 0.9 | -3947.2921 | 1388.4466 | False |
| 22 | 29 | 54903.0587 | 0.001 | 41690.7826 | 68115.3348 | True |
| 22 | 43 | -2743.836 | 0.9 | -15956.1121 | 10468.4401 | False |
| 23 | 24 | 2222.7759 | 0.0094 | 271.0497 | 4174.5022 | True |
| 23 | 25 | -410.7623 | 0.9 | -3010.5916 | 2189.067 | False |
| 23 | 26 | -771.2375 | 0.9 | -3048.5423 | 1506.0674 | False |
| 23 | 27 | -1275.1136 | 0.9 | -4230.0699 | 1679.8426 | False |
| 23 | 28 | -197.1437 | 0.9 | -2823.913 | 2429.6256 | False |
| 23 | 29 | 55985.3378 | 0.001 | 42781.2994 | 69189.3762 | True |
| 23 | 43 | -1661.5569 | 0.9 | -14865.5953 | 11542.4815 | False |
| 24 | 25 | -2633.5382 | 0.0048 | -4849.8986 | -417.1778 | True |
| 24 | 26 | -2994.0134 | 0.001 | -4821.3772 | -1166.6496 | True |
| 24 | 27 | -3497.8896 | 0.001 | -6121.8003 | -873.9788 | True |
| 24 | 28 | -2419.9196 | 0.0207 | -4667.8204 | -172.0188 | True |
| 24 | 29 | 53762.5619 | 0.001 | 40628.6451 | 66896.4787 | True |
| 24 | 43 | -3884.3329 | 0.9 | -17018.2497 | 9249.584 | False |
| 25 | 26 | -360.4752 | 0.9 | -2868.2901 | 2147.3397 | False |
| 25 | 27 | -864.3513 | 0.9 | -4000.3973 | 2271.6947 | False |
| 25 | 28 | 213.6186 | 0.9 | -2615.3273 | 3042.5645 | False |
| 25 | 29 | 56396.1001 | 0.001 | 43150.3593 | 69641.8409 | True |
| 25 | 43 | -1250.7946 | 0.9 | -14496.5354 | 11994.9461 | False |
| 26 | 27 | -503.8762 | 0.9 | -3378.2091 | 2370.4567 | False |
| 26 | 28 | 574.0938 | 0.9 | -1961.6388 | 3109.8264 | False |
| 26 | 29 | 56756.5753 | 0.001 | 43570.3456 | 69942.805 | True |
| 26 | 43 | -890.3195 | 0.9 | -14076.5491 | 12295.9102 | False |
| 27 | 28 | 1077.97 | 0.9 | -2080.4456 | 4236.3856 | False |
| 27 | 29 | 57260.4514 | 0.001 | 43940.4551 | 70580.4478 | True |
| 27 | 43 | -386.4433 | 0.9 | -13706.4396 | 12933.553 | False |
| 28 | 29 | 56182.4815 | 0.001 | 42931.4267 | 69433.5362 | True |
| 28 | 43 | -1464.4133 | 0.9 | -14715.468 | 11786.6415 | False |
| 29 | 43 | -57646.8947 | 0.001 | -76168.1573 | -39125.6322 | True |
The above shows the Tukey HSD table for dislikes between category_id values. When the reject column is True, there is a statistically significant difference between the two groups. As the table shows, many category pairs differ significantly in dislikes.
Next, after merging the data from all five regions, let’s check whether there is a significant difference in dislike counts by region.
df_youtube = pd.concat([df_youtube_canada, df_youtube_france, df_youtube_us, df_youtube_germany,
df_youtube_gb], axis = 0, ignore_index = True)
sns.boxplot(x = "country", y = "dislikes", data=df_youtube, showfliers = False)
<AxesSubplot:xlabel='country', ylabel='dislikes'>

Looking at the boxplot, the United States and Great Britain regions seem to have particularly high dislikes. Let’s check this numerically through an ANOVA analysis.
lm0 = smf.ols("dislikes ~ country", data = df_youtube)
res0 = lm0.fit()
print(res0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: dislikes R-squared: 0.007
Model: OLS Adj. R-squared: 0.007
Method: Least Squares F-statistic: 365.5
Date: Tue, 15 Feb 2022 Prob (F-statistic): 3.44e-314
Time: 15:52:38 Log-Likelihood: -2.3623e+06
No. Observations: 202310 AIC: 4.725e+06
Df Residuals: 202305 BIC: 4.725e+06
Df Model: 4
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
Intercept 2009.1954 140.942 14.256 0.000 1732.953 2285.438
country[T.France] -1194.2331 199.514 -5.986 0.000 -1585.275 -803.191
country[T.Germany] -612.0595 199.372 -3.070 0.002 -1002.823 -221.296
country[T.Great Britain] 5603.3645 201.822 27.764 0.000 5207.798 5998.931
country[T.United States] 1702.2054 199.239 8.544 0.000 1311.702 2092.709
==============================================================================
Omnibus: 609442.615 Durbin-Watson: 1.976
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50644814857.182
Skew: 44.672 Prob(JB): 0.00
Kurtosis: 2452.490 Cond. No. 5.80
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
When the ANOVA analysis is performed, the F-statistic’s p-value is almost 0. This means the null hypothesis that there is no significant difference between regions can be rejected, so there is a statistically significant difference in dislikes among the five regions.
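Since country is a string column, statsmodels dummy-codes it automatically (the country[T.France], country[T.Germany], ... rows above). If a standalone ANOVA table is wanted, a minimal sketch:
# Classic one-way ANOVA table for dislikes across the five regions.
print(sm.stats.anova_lm(res0, typ = 2))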
tukeyhsd_res0 = pairwise_tukeyhsd(df_youtube["dislikes"], df_youtube["country"])
tukeyhsd_res0.summary()
| group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|
| Canada | France | -1194.2331 | 0.001 | -1738.4618 | -650.0043 | True |
| Canada | Germany | -612.0595 | 0.0182 | -1155.9009 | -68.2181 | True |
| Canada | Great Britain | 5603.3645 | 0.001 | 5052.839 | 6153.8901 | True |
| Canada | United States | 1702.2054 | 0.001 | 1158.7263 | 2245.6846 | True |
| France | Germany | 582.1735 | 0.0291 | 37.8085 | 1126.5386 | True |
| France | Great Britain | 6797.5976 | 0.001 | 6246.5548 | 7348.6404 | True |
| France | United States | 2896.4385 | 0.001 | 2352.4353 | 3440.4417 | True |
| Germany | Great Britain | 6215.4241 | 0.001 | 5664.7638 | 6766.0843 | True |
| Germany | United States | 2314.265 | 0.001 | 1770.6493 | 2857.8806 | True |
| Great Britain | United States | -3901.1591 | 0.001 | -4451.4617 | -3350.8565 | True |
The above shows the Tukey HSD table for dislikes between regions. When the reject column is True, there is a statistically significant difference between the two groups. As the table shows, every pair of regions differs significantly in dislike counts.
Part 2: Answer the questions below based on the Pokémon dataset
- Write Python code that can answer the following questions, and
- Explain your answers in plain English.
df_pokemon = pd.read_csv("./data/Pokemon.csv")
df_pokemon.head()
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Delete the “#” column because it is an index.
df_pokemon.drop("#", axis = 1, inplace = True)
df_pokemon.head()
| Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
df_pokemon.shape
(800, 12)
Pokemon data consists of a total of 800 rows and 12 columns.
df_pokemon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 800 non-null object
1 Type 1 800 non-null object
2 Type 2 414 non-null object
3 Total 800 non-null int64
4 HP 800 non-null int64
5 Attack 800 non-null int64
6 Defense 800 non-null int64
7 Sp. Atk 800 non-null int64
8 Sp. Def 800 non-null int64
9 Speed 800 non-null int64
10 Generation 800 non-null int64
11 Legendary 800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB
It can be seen that only the Type 2 column has missing values.
Q4. For 10 Points: Plot the pairs of different ability points (HP, Attack, Sp. Attack, Defense, etc.).
- Which pairs have the most/least correlation coefficients?
sns.pairplot(data = df_pokemon, vars = ["Total", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"], \
kind = "reg", plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.5}}, corner = True)
<seaborn.axisgrid.PairGrid at 0x7f77fbf8e9d0>

The graph above is a pair plot of the different abilities. The red regression line in each scatter plot indicates how strong the linear relationship between the two variables is. Visually, the relationships of Total with Sp. Atk and of HP with Attack look strongest, and the relationship between Defense and Speed looks weakest. Let’s check this numerically.
pokemon_ability_corr = df_pokemon[["Total", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].corr()
pokemon_ability_corr
| Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | |
|---|---|---|---|---|---|---|---|
| Total | 1.000000 | 0.618748 | 0.736211 | 0.612787 | 0.747250 | 0.717609 | 0.575943 |
| HP | 0.618748 | 1.000000 | 0.422386 | 0.239622 | 0.362380 | 0.378718 | 0.175952 |
| Attack | 0.736211 | 0.422386 | 1.000000 | 0.438687 | 0.396362 | 0.263990 | 0.381240 |
| Defense | 0.612787 | 0.239622 | 0.438687 | 1.000000 | 0.223549 | 0.510747 | 0.015227 |
| Sp. Atk | 0.747250 | 0.362380 | 0.396362 | 0.223549 | 1.000000 | 0.506121 | 0.473018 |
| Sp. Def | 0.717609 | 0.378718 | 0.263990 | 0.510747 | 0.506121 | 1.000000 | 0.259133 |
| Speed | 0.575943 | 0.175952 | 0.381240 | 0.015227 | 0.473018 | 0.259133 | 1.000000 |
sns.heatmap(pokemon_ability_corr, annot = True, linewidths = .5, cmap="BrBG", center = 0)
<AxesSubplot:>

Looking at the table and heat map above, Total has a strong linear relationship with all the other abilities. Given its name, we can infer that Total is simply the sum of the other abilities.
np.sum(df_pokemon.Total != df_pokemon["HP"] + df_pokemon["Attack"] + df_pokemon["Defense"] + \
df_pokemon["Sp. Atk"] + df_pokemon["Sp. Def"] + df_pokemon["Speed"])
0
As expected, Total is exactly the sum of the remaining six ability values for every Pokemon (the count of mismatches is 0). Therefore, let’s check the linear relationships among the other abilities, excluding Total.
pokemon_ability_corr = df_pokemon[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].corr()
pokemon_ability_corr
| HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | |
|---|---|---|---|---|---|---|
| HP | 1.000000 | 0.422386 | 0.239622 | 0.362380 | 0.378718 | 0.175952 |
| Attack | 0.422386 | 1.000000 | 0.438687 | 0.396362 | 0.263990 | 0.381240 |
| Defense | 0.239622 | 0.438687 | 1.000000 | 0.223549 | 0.510747 | 0.015227 |
| Sp. Atk | 0.362380 | 0.396362 | 0.223549 | 1.000000 | 0.506121 | 0.473018 |
| Sp. Def | 0.378718 | 0.263990 | 0.510747 | 0.506121 | 1.000000 | 0.259133 |
| Speed | 0.175952 | 0.381240 | 0.015227 | 0.473018 | 0.259133 | 1.000000 |
sns.heatmap(pokemon_ability_corr, annot = True, linewidths = .5, cmap="BrBG", center = 0)
<AxesSubplot:>

With Total excluded, Defense and Sp. Def, and Sp. Atk and Sp. Def, show the highest correlations at about 0.51. Also, since the correlation between Speed and Defense is 0.015, there is, as seen in the earlier pair plot, almost no linear relationship between those two abilities.
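As a numeric cross-check, the same unstack-and-deduplicate approach used on the US YouTube correlations in Q2 can rank the ability pairs; a small sketch:
# Rank ability pairs by |correlation|, dropping self-pairs and duplicates.
pairs = pokemon_ability_corr.unstack().reset_index()
pairs.columns = ["var1", "var2", "correlation"]
dup = (pairs[["var1", "var2"]].apply(frozenset, axis = 1).duplicated()) | (pairs["var1"] == pairs["var2"])
pairs = pairs[~dup].sort_values("correlation", key = abs, ascending = False)
print(pairs.head(2))  # strongest pairs
print(pairs.tail(1))  # weakest pair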
Q5. For 15 Points: Plot the distribution of ability points per Pokémon type
- How would you describe each Pokémon type with different ability points?
Each Pokemon has up to two types, recorded in the Type 1 and Type 2 columns.
df_pokemon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 800 non-null object
1 Type 1 800 non-null object
2 Type 2 414 non-null object
3 Total 800 non-null int64
4 HP 800 non-null int64
5 Attack 800 non-null int64
6 Defense 800 non-null int64
7 Sp. Atk 800 non-null int64
8 Sp. Def 800 non-null int64
9 Speed 800 non-null int64
10 Generation 800 non-null int64
11 Legendary 800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB
Looking at the table above, all Pokemon have Type1, but about half do not have Type2.
print(sorted(df_pokemon["Type 1"].unique()))
['Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting', 'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Normal', 'Poison', 'Psychic', 'Rock', 'Steel', 'Water']
There are a total of 18 types in Type1 as shown in the list above.
print(sorted(df_pokemon[df_pokemon["Type 2"].isnull() == False]["Type 2"].unique()))
['Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting', 'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Normal', 'Poison', 'Psychic', 'Rock', 'Steel', 'Water']
As shown in the list above, Type 2 also contains 18 types, and comparing the two lists confirms that the sets of types in Type 1 and Type 2 are identical. Reportedly, the order of Type 1 and Type 2 does not matter in the actual Pokemon games (https://forums.serebii.net/threads/does-type-order-really-matter.316547/).
# Create data by combining type1 and type2 columns as one Type column. Dual type Pokemon will have 2 rows.
df_pokemon_type_indifference = df_pokemon[['Name', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', \
'Sp. Def', 'Speed', 'Generation', 'Legendary']] \
.merge(pd.concat([df_pokemon[["Name", "Type 1"]].rename(columns = {"Type 1" : "Type"}), \
df_pokemon[["Name", "Type 2"]].rename(columns = {"Type 2" : "Type"})], \
axis = 0, ignore_index = True), \
on = "Name", how = "left") \
[["Name", "Type", "Total", "HP", "Attack", 'Defense', 'Sp. Atk', \
'Sp. Def', 'Speed', 'Generation', 'Legendary']]
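As an aside, an equivalent and arguably simpler construction uses pd.melt to stack the two type columns; a sketch (like the merge above, it leaves NaN Type rows for single-type Pokemon, which value_counts and groupby silently drop):
# Equivalent construction via melt: stack Type 1 / Type 2 into one Type column.
df_alt = df_pokemon.melt(
    id_vars = ["Name", "Total", "HP", "Attack", "Defense", "Sp. Atk",
               "Sp. Def", "Speed", "Generation", "Legendary"],
    value_vars = ["Type 1", "Type 2"],
    value_name = "Type").drop(columns = "variable")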
type_count = df_pokemon_type_indifference.Type.value_counts().reset_index().rename(columns = {"index" : "Type", "Type" : "Total Count"}) \
.merge(df_pokemon[df_pokemon["Type 2"].isnull()]["Type 1"].value_counts().reset_index() \
.rename(columns = {"index" : "Type", "Type 1" : "Single-type Count"}), on = "Type", how = "left")
type_count["Single-type Ratio"] = np.round(type_count["Single-type Count"] / type_count["Total Count"] * 100, 2)
type_count
| Type | Total Count | Single-type Count | Single-type Ratio | |
|---|---|---|---|---|
| 0 | Water | 126 | 59 | 46.83 |
| 1 | Normal | 102 | 61 | 59.80 |
| 2 | Flying | 101 | 2 | 1.98 |
| 3 | Grass | 95 | 33 | 34.74 |
| 4 | Psychic | 90 | 38 | 42.22 |
| 5 | Bug | 72 | 17 | 23.61 |
| 6 | Ground | 67 | 13 | 19.40 |
| 7 | Fire | 64 | 28 | 43.75 |
| 8 | Poison | 62 | 15 | 24.19 |
| 9 | Rock | 58 | 9 | 15.52 |
| 10 | Fighting | 53 | 20 | 37.74 |
| 11 | Dark | 51 | 10 | 19.61 |
| 12 | Dragon | 50 | 11 | 22.00 |
| 13 | Electric | 50 | 27 | 54.00 |
| 14 | Steel | 49 | 5 | 10.20 |
| 15 | Ghost | 46 | 10 | 21.74 |
| 16 | Fairy | 40 | 15 | 37.50 |
| 17 | Ice | 38 | 13 | 34.21 |
plt.figure(figsize = (15, 7))
# bar graph for total pokemon count
color = "darkblue"
ax1 = sns.barplot(x = "Type", y = "Total Count", color = color, alpha = 0.8, \
data = type_count)
ax1.set_xlabel("Type", fontsize = 16)
top_bar = mpatches.Patch(color = color, label = 'Num of Total Pokemon')
# bar graph for total non-type2 pokemon count
color = "lightblue"
ax2 = sns.barplot(x = "Type", y = "Single-type Count", color = color, alpha = 0.8, \
data = type_count)
ax2.set_ylabel("Num of Pokemon", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of single-type Pokemon')
plt.legend(handles=[top_bar, low_bar])
plt.show()

The table and graph above show the total number of Pokemon and the number of single-type Pokemon for each type. By total count, the most common type is Water, followed by Normal, Flying, Grass, and Psychic; the rarest are Fairy and Ice. There are more than three times as many Water-type Pokemon as Ice-type Pokemon. Looking at the single-type ratio, roughly 20-50% of the Pokemon of most types are single-type. However, the ratio for Flying is 1.98% and for Steel 10.2%, far lower than the other types; in other words, most Flying and Steel Pokemon are dual-type. Conversely, Electric (54%) and Normal (59.8%) have noticeably higher single-type ratios than the rest, so about half of Electric and Normal Pokemon have no second type.
df_pokemon_type_indifference[df_pokemon_type_indifference.Legendary == True] \
.groupby(["Type", "Legendary"]).count().Name.sort_values(ascending = False) \
.reset_index().rename(columns = {"Name" : "Count"})
| Type | Legendary | Count | |
|---|---|---|---|
| 0 | Psychic | True | 19 |
| 1 | Dragon | True | 16 |
| 2 | Flying | True | 15 |
| 3 | Fire | True | 8 |
| 4 | Electric | True | 5 |
| 5 | Ground | True | 5 |
| 6 | Ice | True | 5 |
| 7 | Steel | True | 5 |
| 8 | Water | True | 5 |
| 9 | Fighting | True | 4 |
| 10 | Rock | True | 4 |
| 11 | Dark | True | 3 |
| 12 | Fairy | True | 3 |
| 13 | Ghost | True | 3 |
| 14 | Grass | True | 3 |
| 15 | Normal | True | 2 |
plt.figure(figsize = (15, 7))
sns.barplot(x = "Type", y = "Count", color = "darkblue", alpha = 0.8, \
data = df_pokemon_type_indifference[df_pokemon_type_indifference.Legendary == True] \
.groupby(["Type", "Legendary"]).count().Name.sort_values(ascending = False) \
.reset_index().rename(columns = {"Name" : "Count"}))
plt.xlabel("Type", fontsize = 16)
plt.ylabel("Number of Legendary Pokemon", fontsize = 16)
plt.show()

The table and graph above show the number of legendary Pokemon for each type. While most types have five or fewer legendary Pokemon, the Psychic, Dragon, and Flying types have 15 or more each.
sns.pairplot(data = df_pokemon_type_indifference, vars = ["Total", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"], \
plot_kws = {'alpha': 0.5}, corner = True, hue = "Legendary")
plt.show()

As shown above, legendary Pokemon have higher overall abilities than ordinary Pokemon. When comparing the abilities of each Pokemon type, the general characteristics of each type are therefore easier to see with the legendary Pokemon excluded, so the following type comparison leaves them out.
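Before dropping them, a quick numeric check of that visual impression (a sketch, not in the original cells):
# Mean ability values, legendary vs. non-legendary Pokemon.
print(df_pokemon_type_indifference.groupby("Legendary")[
    ["Total", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].mean().round(1))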
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df_pokemon_type_indifference_non_legend = df_pokemon_type_indifference[df_pokemon_type_indifference.Legendary == False].copy()
Now let’s compare the abilities of each Pokemon type.
plt.figure(figsize = (15, 7))
sns.boxplot(data = df_pokemon_type_indifference_non_legend, x = "Type", y = "Total")
plt.xlabel("Type", fontsize = 16)
plt.ylabel("Total", fontsize = 16)
plt.show()

First, comparing the Total, the dragon type seems a bit higher than the other types.
Besides Total, a Pokemon’s stats are HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed. Sp. Atk means special attack and Sp. Def means special defense, and the Pokemon with the higher Speed attacks first. Among the six detailed stats, HP, Defense, and Sp. Def are therefore defense-related, while Attack, Sp. Atk, and Speed are attack-related. First, let’s look at the difference between the attack-related and defense-related stats for each type.
df_pokemon_type_indifference_non_legend["Total_attack"] = df_pokemon_type_indifference_non_legend["Attack"] \
+ df_pokemon_type_indifference_non_legend["Sp. Atk"] \
+ df_pokemon_type_indifference_non_legend["Speed"]
df_pokemon_type_indifference_non_legend["Total_defense"] = df_pokemon_type_indifference_non_legend["Defense"] \
+ df_pokemon_type_indifference_non_legend["Sp. Def"] \
+ df_pokemon_type_indifference_non_legend["HP"]
sns.set(style="darkgrid")
attack_by_type = df_pokemon_type_indifference_non_legend.groupby("Type").mean()["Total_attack"]
attack_by_type = pd.DataFrame((attack_by_type - np.mean(attack_by_type)) / np.std(attack_by_type)).reset_index()
attack_by_type["color"] = ['red' if x < 0 else 'green' for x in attack_by_type['Total_attack']]
attack_by_type.sort_values('Total_attack', inplace = True)
plt.figure(figsize = (12,12))
plt.hlines(y = attack_by_type.Type, xmin = 0, xmax = attack_by_type.Total_attack)
for x, y, tex in zip(attack_by_type.Total_attack, attack_by_type.Type, attack_by_type.Total_attack):
    t = plt.text(x, y, round(tex, 2),
                 horizontalalignment = 'right' if x < 0 else 'left',
                 verticalalignment = 'center',
                 fontdict = {'color':'red' if x < 0 else 'green', 'size' : 14})
plt.yticks(attack_by_type.Type, attack_by_type.Type, fontsize = 12)
plt.ylabel("Type", fontsize = 16)
plt.xlabel("Total attack ability comparison", fontsize = 16)
plt.xlim(-3, 3)
plt.show()

The mean Total_attack of each type was standardized using the overall mean and standard deviation, to compare which types have relatively high or low values. Dragon shows a higher total attack ability than the other types, while Bug and Fairy show low total attack ability.
sns.set(style = "darkgrid")
defense_by_type = df_pokemon_type_indifference_non_legend.groupby("Type").mean()["Total_defense"]
defense_by_type = pd.DataFrame((defense_by_type - np.mean(defense_by_type)) / np.std(defense_by_type)).reset_index()
defense_by_type["color"] = ['red' if x < 0 else 'green' for x in defense_by_type['Total_defense']]
defense_by_type.sort_values('Total_defense', inplace = True)
plt.figure(figsize = (12,12))
plt.hlines(y = defense_by_type.Type, xmin = 0, xmax = defense_by_type.Total_defense)
for x, y, tex in zip(defense_by_type.Total_defense, defense_by_type.Type, defense_by_type.Total_defense):
    t = plt.text(x, y, round(tex, 2),
                 horizontalalignment = 'right' if x < 0 else 'left',
                 verticalalignment = 'center',
                 fontdict = {'color':'red' if x < 0 else 'green', 'size' : 14})
plt.yticks(defense_by_type.Type, defense_by_type.Type, fontsize = 12)
plt.ylabel("Type", fontsize = 16)
plt.xlabel("Total defense ability comparison", fontsize = 16)
plt.xlim(-3, 3)
plt.show()

The mean Total_defense of each type was standardized the same way. Steel and Rock show higher total defense ability than the other types, while Bug and Poison show low total defense ability. Dragon, which had the highest total attack, also has above-average total defense, which explains why its Total appeared higher than the other types in the earlier boxplot.
Comparing the overall attack ability and the overall defense ability together (a small programmatic cross-check follows this list):
- Above-average attack & above-average defense: Dragon, Fighting, Ice
- Above-average attack & below-average defense: Dark, Fire, Electric, Flying
- Below-average attack & above-average defense: Steel, Psychic, Ground, Rock, Fairy
- Below-average attack & below-average defense: Ghost, Water, Poison, Grass, Normal, Bug
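A small sketch to reproduce this grouping programmatically from the standardized scores computed above:
# Cross-check the four quadrants from the standardized per-type scores.
quad = attack_by_type[["Type", "Total_attack"]].merge(
    defense_by_type[["Type", "Total_defense"]], on = "Type")
quad["attack"] = np.where(quad["Total_attack"] >= 0, "above", "below")
quad["defense"] = np.where(quad["Total_defense"] >= 0, "above", "below")
print(quad.sort_values(["attack", "defense"])[["Type", "attack", "defense"]])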
Q6. For 15 Points: Explore how the Pokémon in each generation differ from each other?
- Do you think designers of Pokémon tried to address different distributions of ability points in each generation?
sns.set(style = "ticks")
plt.figure(figsize = (15, 7))
# bar plot for value count of each generation
color = "tab:blue"
ax1 = sns.barplot(x = "Generation", y = "Count", color = color, alpha = 0.5, \
data = df_pokemon.Generation.value_counts().reset_index().rename(columns = {"index" : "Generation", "Generation" : "Count"}))
ax1.set_xlabel("Generation", fontsize = 16)
ax1.set_ylabel("Number of Pokemon", color = color, fontsize = 16)
# line plot for number of Type1 of each generation
ax2 = ax1.twinx()
color = "tab:green"
ax2 = sns.lineplot(data = df_pokemon_type_indifference.groupby("Generation")["Type"].nunique().values, color = color, linewidth = 3, alpha = 0.5)
ax2.set_ylabel("Number of distinct type", color = color, fontsize = 16)
plt.yticks([18])
plt.show()

In the graph above, the blue bars show the number of Pokemon per generation, and the green line shows the number of distinct types per generation. Generation 1 has the most Pokemon, followed by Generations 3 and 5; Generation 6 has the fewest, at about 80. All 18 types are present in every generation from 1 to 6.
Next, let’s look at the number of Pokemon of each type by each generation.
generation_type = pd.DataFrame(columns = ["generation", "type"])
for i in range(1, 7):
    # stack Type 1 and Type 2 so dual-type Pokemon are counted under both types
    temp = pd.DataFrame(pd.concat([df_pokemon[df_pokemon["Generation"] == i]["Type 1"], df_pokemon[df_pokemon["Generation"] == i]["Type 2"]],
                                  ignore_index = True)).rename(columns = {0 : "type"})
    temp["generation"] = i
    generation_type = pd.concat([generation_type, temp])
generation_type["values"] = 0 # just a dummy column for counting in pivot_table
generation_type.pivot_table(index = "generation", columns = "type", values = "values", aggfunc = 'count')
| Generation | Bug | Dark | Dragon | Electric | Fairy | Fighting | Fire | Flying | Ghost | Grass | Ground | Ice | Normal | Poison | Psychic | Rock | Steel | Water |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14 | 1 | 4 | 9 | 5 | 9 | 14 | 23 | 4 | 15 | 14 | 5 | 24 | 36 | 18 | 12 | 2 | 35 |
| 2 | 12 | 8 | 2 | 9 | 8 | 4 | 11 | 19 | 1 | 10 | 11 | 5 | 15 | 4 | 10 | 8 | 6 | 18 |
| 3 | 14 | 13 | 15 | 5 | 8 | 9 | 9 | 14 | 8 | 18 | 16 | 7 | 18 | 5 | 28 | 12 | 12 | 31 |
| 4 | 11 | 7 | 8 | 12 | 2 | 10 | 6 | 16 | 9 | 17 | 12 | 8 | 18 | 8 | 10 | 7 | 12 | 15 |
| 5 | 18 | 16 | 12 | 12 | 3 | 17 | 16 | 21 | 9 | 20 | 12 | 9 | 19 | 7 | 16 | 10 | 12 | 18 |
| 6 | 3 | 6 | 9 | 3 | 14 | 4 | 8 | 8 | 15 | 15 | 2 | 4 | 8 | 2 | 8 | 9 | 5 | 9 |
plt.figure(figsize = (15, 7))
sns.heatmap(generation_type.pivot_table(index = "generation", columns = "type", values = "values", aggfunc = 'count'), \
            annot = True, linewidths = .5, cmap = "Blues")
plt.xlabel("Type", fontsize = 16)
plt.ylabel("Generation", fontsize = 16)
plt.show()

Looking at the number of Pokémon belonging to each type by generation, the Flying, Grass, and Normal types are spread fairly evenly across generations 1 to 6. In contrast, Poison-type Pokémon are heavily concentrated in generation 1 (36 of them).
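To make "evenly distributed" concrete, one could compute the coefficient of variation of each type's per-generation counts. This small check is not part of the original analysis; it reuses the generation_type frame built above.
counts = generation_type.pivot_table(index = "generation", columns = "type", values = "values", aggfunc = 'count')
cv = (counts.std() / counts.mean()).sort_values()  # lower = more even spread across generations
print(cv.round(2))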
plt.figure(figsize = (15, 7))
sns.barplot(x = "Generation", y = "Count", color = "darkblue", alpha = 0.8, \
            data = df_pokemon[df_pokemon.Legendary == True] \
                   .groupby(["Generation", "Legendary"]).count().Name.sort_values(ascending = False) \
                   .reset_index().rename(columns = {"Name" : "Count"}))
plt.xlabel("Generation", fontsize = 16)
plt.ylabel("Number of Legendary Pokemon", fontsize = 16)
plt.show()

The graph above shows the number of legendary Pokémon per generation. It was lowest in generations 1 and 2, more than doubled across generations 3, 4, and 5, and then fell again in generation 6.
As in the comparison by type, legendary Pokémon are excluded when comparing abilities by generation, and for the same reason.
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df_pokemon_non_legend = df_pokemon[df_pokemon.Legendary == False].copy()
df_pokemon_non_legend["Total_attack"] = df_pokemon_non_legend["Attack"] \
                                        + df_pokemon_non_legend["Sp. Atk"] \
                                        + df_pokemon_non_legend["Speed"]
df_pokemon_non_legend["Total_defense"] = df_pokemon_non_legend["Defense"] \
                                         + df_pokemon_non_legend["Sp. Def"] \
                                         + df_pokemon_non_legend["HP"]
column_list = ["Total", "Total_attack", "Total_defense"]
fig, axes = plt.subplots(1, 3, figsize = (30,10))
for i, column in enumerate(column_list):
    sns.boxplot(ax = axes[i%3], x = "Generation", y = column, data = df_pokemon_non_legend)
    axes[i%3].set_xlabel("Generation", fontsize = 16)
    axes[i%3].set_ylabel(column, fontsize = 16)

As in the analysis by type, the Total_attack column was created by summing Attack, Sp. Atk, and Speed, and the Total_defense column by summing Defense, Sp. Def, and HP. Total, Total_attack, and Total_defense were then compared by generation. Generation 4 tends to have somewhat higher Total_attack and Total_defense than the other generations, which would explain its apparently higher Total. To check this more precisely, let's run an ANOVA-style analysis.
lm_total = smf.ols("Total ~ Generation", data = df_pokemon_non_legend)
lm_total_res = lm_total.fit()
print(lm_total_res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Total R-squared: 0.000
Model: OLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 0.1754
Date: Tue, 15 Feb 2022 Prob (F-statistic): 0.675
Time: 14:38:35 Log-Likelihood: -4475.2
No. Observations: 735 AIC: 8954.
Df Residuals: 733 BIC: 8964.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 413.9727 8.684 47.673 0.000 396.925 431.020
Generation 0.9868 2.356 0.419 0.675 -3.639 5.612
==============================================================================
Omnibus: 46.635 Durbin-Watson: 2.123
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.358
Skew: -0.030 Prob(JB): 0.000170
Kurtosis: 2.250 Cond. No. 8.60
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The p-value of the F-statistic is very high (0.675), so the null hypothesis that Total does not vary with Generation cannot be rejected: there is no statistical difference in Total between generations. Note that the formula enters Generation as a numeric predictor (Df Model = 1), so strictly speaking this tests for a linear trend across generations rather than for any between-group difference.
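For a full one-way ANOVA, Generation can be wrapped in C(...) so that patsy dummy-codes it as a categorical factor. This is a minimal sketch; its output is not reproduced here, but given the boxplots above the conclusion should be the same.
# One-way ANOVA with Generation as a categorical factor rather than a number
lm_total_cat = smf.ols("Total ~ C(Generation)", data = df_pokemon_non_legend).fit()
print(sm.stats.anova_lm(lm_total_cat, typ = 1))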
tukeyhsd_total = pairwise_tukeyhsd(df_pokemon_non_legend["Total"], df_pokemon_non_legend["Generation"])
tukeyhsd_total.summary()
| group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|
| 1 | 2 | -9.6467 | 0.9 | -48.3976 | 29.1041 | False |
| 1 | 3 | -8.6761 | 0.9 | -43.8306 | 26.4784 | False |
| 1 | 4 | 19.9359 | 0.6427 | -18.0373 | 57.9091 | False |
| 1 | 5 | -1.3238 | 0.9 | -35.978 | 33.3305 | False |
| 1 | 6 | -3.8492 | 0.9 | -46.7152 | 39.0169 | False |
| 2 | 3 | 0.9706 | 0.9 | -38.7193 | 40.6605 | False |
| 2 | 4 | 29.5826 | 0.3419 | -12.6242 | 71.7894 | False |
| 2 | 5 | 8.323 | 0.9 | -30.9245 | 47.5705 | False |
| 2 | 6 | 5.7976 | 0.9 | -40.8602 | 52.4554 | False |
| 3 | 4 | 28.612 | 0.2885 | -10.319 | 67.543 | False |
| 3 | 5 | 7.3524 | 0.9 | -28.3488 | 43.0536 | False |
| 3 | 6 | 4.827 | 0.9 | -38.8898 | 48.5438 | False |
| 4 | 5 | -21.2596 | 0.5975 | -59.7395 | 17.2203 | False |
| 4 | 6 | -23.785 | 0.6561 | -69.799 | 22.2289 | False |
| 5 | 6 | -2.5254 | 0.9 | -45.841 | 40.7902 | False |
The table above shows the Tukey HSD results for Total between generations. Where the reject column is True, the two groups differ significantly; as the table shows, no pair of generations differs significantly in Total.
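The same result object can also draw the simultaneous confidence intervals for the generation means, which gives a visual version of the table (figure not reproduced here).
# Overlapping intervals are consistent with the all-False reject column above
tukeyhsd_total.plot_simultaneous(xlabel = "Total", ylabel = "Generation")
plt.show()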
lm_total_attack = smf.ols("Total_attack ~ Generation", data = df_pokemon_non_legend)
lm_total_attack_res = lm_total_attack.fit()
print(lm_total_attack_res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Total_attack R-squared: 0.000
Model: OLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 0.03249
Date: Tue, 15 Feb 2022 Prob (F-statistic): 0.857
Time: 14:42:14 Log-Likelihood: -4105.9
No. Observations: 735 AIC: 8216.
Df Residuals: 733 BIC: 8225.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 210.4234 5.254 40.053 0.000 200.109 220.737
Generation -0.2569 1.425 -0.180 0.857 -3.055 2.542
==============================================================================
Omnibus: 7.219 Durbin-Watson: 1.953
Prob(Omnibus): 0.027 Jarque-Bera (JB): 7.081
Skew: 0.210 Prob(JB): 0.0290
Kurtosis: 2.767 Cond. No. 8.60
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
tukeyhsd_total_attack = pairwise_tukeyhsd(df_pokemon_non_legend["Total_attack"], df_pokemon_non_legend["Generation"])
tukeyhsd_total_attack.summary()
| group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|
| 1 | 2 | -20.3667 | 0.1282 | -43.7456 | 3.0121 | False |
| 1 | 3 | -6.365 | 0.9 | -27.5741 | 14.8442 | False |
| 1 | 4 | 3.7588 | 0.9 | -19.1509 | 26.6685 | False |
| 1 | 5 | -4.9142 | 0.9 | -25.8215 | 15.9932 | False |
| 1 | 6 | -12.2064 | 0.73 | -38.068 | 13.6552 | False |
| 2 | 3 | 14.0017 | 0.5444 | -9.9436 | 37.9471 | False |
| 2 | 4 | 24.1255 | 0.0752 | -1.3383 | 49.5894 | False |
| 2 | 5 | 15.4525 | 0.427 | -8.226 | 39.131 | False |
| 2 | 6 | 8.1603 | 0.9 | -19.9889 | 36.3095 | False |
| 3 | 4 | 10.1238 | 0.7975 | -13.3638 | 33.6113 | False |
| 3 | 5 | 1.4508 | 0.9 | -20.0882 | 22.9898 | False |
| 3 | 6 | -5.8415 | 0.9 | -32.2163 | 20.5334 | False |
| 4 | 5 | -8.673 | 0.8921 | -31.8883 | 14.5424 | False |
| 4 | 6 | -15.9652 | 0.5602 | -43.726 | 11.7956 | False |
| 5 | 6 | -7.2923 | 0.9 | -33.4251 | 18.8406 | False |
The same goes for the Total_attack ability: there is no statistically significant difference in Total_attack between generations.
lm_total_defense = smf.ols("Total_defense ~ Generation", data = df_pokemon_non_legend)
lm_total_defense_res = lm_total_defense.fit()
print(lm_total_defense_res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Total_defense R-squared: 0.001
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.8683
Date: Tue, 15 Feb 2022 Prob (F-statistic): 0.352
Time: 14:43:28 Log-Likelihood: -4057.5
No. Observations: 735 AIC: 8119.
Df Residuals: 733 BIC: 8128.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 203.5493 4.919 41.378 0.000 193.892 213.207
Generation 1.2437 1.335 0.932 0.352 -1.377 3.864
==============================================================================
Omnibus: 15.901 Durbin-Watson: 1.958
Prob(Omnibus): 0.000 Jarque-Bera (JB): 16.396
Skew: 0.363 Prob(JB): 0.000275
Kurtosis: 3.095 Cond. No. 8.60
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
tukeyhsd_total_defense = pairwise_tukeyhsd(df_pokemon_non_legend["Total_defense"], df_pokemon_non_legend["Generation"])
tukeyhsd_total_defense.summary()
| group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|
| 1 | 2 | 10.72 | 0.7022 | -11.2058 | 32.6458 | False |
| 1 | 3 | -2.3112 | 0.9 | -22.2021 | 17.5798 | False |
| 1 | 4 | 16.1771 | 0.2622 | -5.3087 | 37.6629 | False |
| 1 | 5 | 3.5904 | 0.9 | -16.0175 | 23.1983 | False |
| 1 | 6 | 8.3573 | 0.9 | -15.897 | 32.6115 | False |
| 2 | 3 | -13.0312 | 0.5517 | -35.4883 | 9.426 | False |
| 2 | 4 | 5.4571 | 0.9 | -18.4242 | 29.3383 | False |
| 2 | 5 | -7.1296 | 0.9 | -29.3364 | 15.0773 | False |
| 2 | 6 | -2.3627 | 0.9 | -28.7624 | 24.037 | False |
| 3 | 4 | 18.4883 | 0.1582 | -3.5395 | 40.516 | False |
| 3 | 5 | 5.9016 | 0.9 | -14.2987 | 26.1019 | False |
| 3 | 6 | 10.6684 | 0.797 | -14.0672 | 35.4041 | False |
| 4 | 5 | -12.5867 | 0.5553 | -34.3592 | 9.1859 | False |
| 4 | 6 | -7.8198 | 0.9 | -33.8552 | 18.2156 | False |
| 5 | 6 | 4.7668 | 0.9 | -19.7418 | 29.2755 | False |
The same goes for Total_defense. There is no statistically significant difference in Total_defense between generations.
Comparing Total, Total_attack, and Total_defense by generation thus shows no statistically significant differences. Next, let's compare the six individual stats.
column_list = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
fig, axes = plt.subplots(2, 3, figsize = (20,10))
for i, column in enumerate(column_list):
    sns.boxplot(ax = axes[i//3, i%3], x = "Generation", y = column, data = df_pokemon_non_legend)

The individual stats likewise show little variation by generation. Therefore, there appears to be no significant difference in Pokémon abilities across generations.
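As a distribution-free complement to the boxplots, a Kruskal-Wallis test could be run per stat. This sketch was not part of the original analysis, and its results are not shown here.
# Kruskal-Wallis test of each stat across the six generations
for column in column_list:
    groups = [g[column].values for _, g in df_pokemon_non_legend.groupby("Generation")]
    h, p = stats.kruskal(*groups)
    print(f"{column}: H = {h:.2f}, p = {p:.3f}")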
Next, let's check whether the distributions differ even though the mean ability values do not.
# pairplot creates its own figure, so a preceding plt.figure() call would only
# produce an empty "Figure ... with 0 Axes"; size is set via the height parameter
g = sns.pairplot(data = df_pokemon_non_legend, vars = ["Total", "Total_attack", "Total_defense"], \
                 plot_kws = {'alpha': 0.5}, hue = "Generation", palette = 'Dark2', height = 4)
g.map_lower(sns.kdeplot, levels = 4, color = ".2")
plt.show()

First, let's compare the distributions of Total, Total_attack, and Total_defense by generation. With six generations the plot is crowded, but the histograms on the diagonal show almost no difference between generations.
# as above, let pairplot manage its own figure and size
g = sns.pairplot(data = df_pokemon_non_legend, vars = ["Attack", "Sp. Atk", "Speed", "Defense", "Sp. Def", "HP"], \
                 plot_kws = {'alpha': 0.5}, hue = "Generation", palette = 'Dark2', height = 3)
g.map_lower(sns.kdeplot, levels = 4, color = ".2")
plt.show()

Even when comparing the more detailed stats, it is hard to find any distributional difference by generation.
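One way to quantify "no distributional difference" is a pairwise two-sample Kolmogorov-Smirnov test on Total. This is only a sketch, not run in the original; with 15 pairs, a multiple-comparison correction would be needed before trusting any single p-value.
from itertools import combinations
# KS test of Total between every pair of generations
for g1, g2 in combinations(sorted(df_pokemon_non_legend["Generation"].unique()), 2):
    a = df_pokemon_non_legend.loc[df_pokemon_non_legend["Generation"] == g1, "Total"]
    b = df_pokemon_non_legend.loc[df_pokemon_non_legend["Generation"] == g2, "Total"]
    d, p = stats.ks_2samp(a, b)
    print(f"Gen {g1} vs Gen {g2}: D = {d:.3f}, p = {p:.3f}")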
In other words, although the number of legendary Pokémon and the mix of types differ by generation, neither the values nor the distributions of the stats show meaningful differences between generations.