HW3. Data Visualization

42 minute read

Topics: Data visualization

Background

This homework assignment focuses on the visual analysis of the COVID-19 data avaiable here: https://covid19datahub.io/articles/api/python.html. A description of the dataset can be found at https://covid19datahub.io/articles/doc/data.html

Your main task in this assignment is to explore the data using the data manipulation, analysis, and visualization methods we covered in class as well as those in the assigned readings. You may need to consult pandas, matplotlib and seaborn documentation, as well as Stack Overflow, or other online resources.

MY_UNIQNAME = 'yjwoo' # please fill in your uniqname

Getting the data

The following cell should install the most up-to-date version of the COVID-19 datahub. Alternatively, you can consult the datahub documentation to download the data files directly.

pip install --upgrade covid19dh

Requirement already satisfied: covid19dh in /opt/anaconda3/lib/python3.9/site-packages (2.3.0)
Requirement already satisfied: requests in /opt/anaconda3/lib/python3.9/site-packages (from covid19dh) (2.26.0)
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.9/site-packages (from covid19dh) (1.3.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas->covid19dh) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas->covid19dh) (2021.3)
Requirement already satisfied: numpy>=1.17.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas->covid19dh) (1.20.3)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->covid19dh) (1.16.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (1.26.7)
Note: you may need to restart the kernel to use updated packages.

Restart the kernel to import the module and access the data

from covid19dh import covid19

Answer the questions below.
For each question, you should

Write code that can help you answer the following questions, and
Explain your answers in plain English. You should use complete sentences that would be understood by an educated professional who is not necessarily a data scientist (like a product manager).

# Load all the modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.dates import datestr2num

import warnings
warnings.filterwarnings('ignore')

Q1 How many different countries are represented in the country-level data set?

Refer to the documentation to call the covid19() function with appropriate parameters (https://covid19datahub.io/articles/api/python.html)

df_covid_country_level, src = covid19(level = 1) 
df_covid_country_level.head()

We have invested a lot of time and effort in creating COVID-19 Data Hub, please cite the following when using it:

	[1mGuidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.[0m

A BibTeX entry for LaTeX users is

	@Article{,
		title = {COVID-19 Data Hub},
		year = {2020},
		doi = {10.21105/joss.02376},
		author = {Emanuele Guidotti and David Ardia},
		journal = {Journal of Open Source Software},
		volume = {5},
		number = {51},
		pages = {2376},
	}

[33mTo hide this message use 'verbose = False'.[0m

	id	date	confirmed	deaths	recovered	tests	vaccines	people_vaccinated	people_fully_vaccinated	hosp	...	iso_alpha_3	iso_alpha_2	iso_numeric	iso_currency	key_local	key_google_mobility	key_apple_mobility	key_jhu_csse	key_nuts	key_gadm
87758	0094b645	2020-01-22	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	LCA	LC	662.0	XCD	NaN	NaN	NaN	LC	NaN	LCA
87759	0094b645	2020-01-23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	LCA	LC	662.0	XCD	NaN	NaN	NaN	LC	NaN	LCA
87760	0094b645	2020-01-24	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	LCA	LC	662.0	XCD	NaN	NaN	NaN	LC	NaN	LCA
87761	0094b645	2020-01-25	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	LCA	LC	662.0	XCD	NaN	NaN	NaN	LC	NaN	LCA
87762	0094b645	2020-01-26	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	LCA	LC	662.0	XCD	NaN	NaN	NaN	LC	NaN	LCA

5 rows × 47 columns

df_covid_country_level.shape

(167987, 47)

There is a total of 168,987 rows and 47 columns in the country-level data set.

len(df_covid_country_level["administrative_area_level_1"].unique())

We can check that there is a total of 236 countries in the country-level data set. It can be seen that almost all countries are included in the data.

df_covid_country_level.groupby(["administrative_area_level_1"]).nunique().date.sort_values()

administrative_area_level_1
Grand Princess      10
Niue                43
Costa Atlantica     81
Pitcairn            85
Tokelau            114
                  ... 
United Kingdom     758
Argentina          766
Thailand           766
China              767
Mexico             768
Name: date, Length: 236, dtype: int64

df_covid_country_level.groupby(["administrative_area_level_1"]).nunique().date.sort_values().plot.hist(bins = 10)
plt.show()

png

Most countries have more than 600 days of data, but some like Grand Princess, Niue, Costa Atlantica have less than 100 days of data.

Q2 Create a line chart that shows the total number of cases over time.

df_confirmed_case = df_covid_country_level[["date", "confirmed", "administrative_area_level_1"]].sort_values(["administrative_area_level_1", "date"]).rename(columns = {"confirmed" : "cum_confirmed", "administrative_area_level_1" : "country"})

df_confirmed_case["daily_confirmed"] = df_confirmed_case.groupby("country").cum_confirmed.diff().fillna(df_confirmed_case['cum_confirmed'])
df_confirmed_case = df_confirmed_case[["date", "country", "daily_confirmed", "cum_confirmed"]]

df_confirmed_case['year_month'] = df_confirmed_case['date'].dt.strftime('%Y-%m')

df_confirmed_case_monthly = df_confirmed_case[df_confirmed_case.daily_confirmed > 0].groupby("year_month").daily_confirmed.sum() \
                                .cumsum().reset_index().rename({"daily_confirmed" : "cum_confirmed"}, axis = 1) \
                                .merge(df_confirmed_case[df_confirmed_case.daily_confirmed > 0].groupby("year_month").daily_confirmed.sum(), on = "year_month", how = "left") \
                                .rename(columns = {"daily_confirmed" : "monthly_confirmed"})[["year_month", "monthly_confirmed", "cum_confirmed"]]
df_confirmed_case_monthly

	year_month	monthly_confirmed	cum_confirmed
0	2020-01	9863.0	9863.0
1	2020-02	76682.0	86545.0
2	2020-03	794611.0	881156.0
3	2020-04	2365557.0	3246713.0
4	2020-05	2901122.0	6147835.0
5	2020-06	4325821.0	10473656.0
6	2020-07	7163395.0	17637051.0
7	2020-08	7945496.0	25582547.0
8	2020-09	8590723.0	34173270.0
9	2020-10	12270240.0	46443510.0
10	2020-11	17166584.0	63610094.0
11	2020-12	20545820.0	84155914.0
12	2021-01	19477426.0	103633340.0
13	2021-02	11261962.0	114895302.0
14	2021-03	14932777.0	129828079.0
15	2021-04	22558324.0	152386403.0
16	2021-05	19702300.0	172088703.0
17	2021-06	14202707.0	186291410.0
18	2021-07	15836422.0	202127832.0
19	2021-08	20077425.0	222205257.0
20	2021-09	15994751.0	238200008.0
21	2021-10	12989702.0	251189710.0
22	2021-11	15805885.0	266995595.0
23	2021-12	25519483.0	292515078.0
24	2022-01	80602125.0	373117203.0
25	2022-02	12923794.0	386040997.0

fig, ax1 = plt.subplots(figsize = (30, 8))

# cumulative line chart
color = "tab:green"
ax1 = sns.lineplot(x = "year_month", y = "cum_confirmed", color = color, linewidth = 3, \
                   data = df_confirmed_case_monthly)
ax1.set_xlabel("Year-Month", fontsize = 16)
ax1.set_ylabel("Cumulative cases", color = color, fontsize = 16)

# monthly bar chart
ax2 = ax1.twinx()
color = "tab:blue"
ax2 = sns.barplot(x = "year_month", y = "monthly_confirmed", color = color, alpha = 0.5, \
                  data = df_confirmed_case_monthly)
ax2.set_ylabel("Monthly cases", color = color, fontsize = 16)
plt.show()

png

The table and graph above show the total number of confirmed cases by monthly and cumulatively. In January 2020, there were a total of 9863 cases, and it steadily increased, and in February of 22, there were approximately 380 million cases. The number of monthly cases steadily increased and then remained at around 15 to 20 million, and then surged to 80 million per month in January 2022.

Q3 Use the country-level data set to create a histogram to:

Show the distribution of values for the number of hospitalizations per 1000 people.
Draw a red vertical line that shows the median value on the histogram.

df_hospitalization = df_covid_country_level[["date", "hosp", "population", "administrative_area_level_1"]].dropna()
df_hospitalization = df_hospitalization[df_hospitalization.hosp > 0]
df_hospitalization["hosp/pop"] = df_hospitalization.hosp / df_hospitalization.population * 1000

df_hospitalization["hosp/pop"].describe()

count    28379.000000
mean         0.306831
std          5.254355
min          0.000024
25%          0.029287
50%          0.088823
75%          0.235695
max        239.165329
Name: hosp/pop, dtype: float64

75% of the data are below 0.23, while the maximum is 239. In other words, we think that there is an outlier on the large value side.

df_hospitalization.sort_values("hosp/pop", ascending = False).head(10)

	date	hosp	population	administrative_area_level_1	hosp/pop
205	2020-05-07	149.0	623.0	Costa Atlantica	239.165329
203	2020-05-05	149.0	623.0	Costa Atlantica	239.165329
206	2020-05-08	149.0	623.0	Costa Atlantica	239.165329
204	2020-05-06	149.0	623.0	Costa Atlantica	239.165329
194	2020-04-26	148.0	623.0	Costa Atlantica	237.560193
195	2020-04-27	148.0	623.0	Costa Atlantica	237.560193
196	2020-04-28	148.0	623.0	Costa Atlantica	237.560193
197	2020-04-29	148.0	623.0	Costa Atlantica	237.560193
198	2020-04-30	148.0	623.0	Costa Atlantica	237.560193
199	2020-05-01	148.0	623.0	Costa Atlantica	237.560193

If you look at the cases where the number is abnormally large, you can see that it is a country called Costa Atlantica, and since the population is very small, 623, this large number comes out. Therefore, let’s look at only data up to 99.5 quantiles.

percentile_99 = np.percentile(df_hospitalization["hosp/pop"], 99.5)
percentile_99

1.4660090896687556

The 99.5% percentile of the number of hospitalizations per 1000 people is about 1.47.

hospital_median = df_hospitalization["hosp/pop"].median()

plt.figure(figsize = (30, 8))
sns.distplot(df_hospitalization[(df_hospitalization["hosp/pop"] <= percentile_99)]["hosp/pop"])
plt.axvline(x = hospital_median, color = "red")
plt.annotate(f'Median = {round(hospital_median, 4)}', xy=(hospital_median + 0.01, 8), fontsize = 15, color = "red")
plt.show()

png

The histogram above shows the distribution of the number of hospitalizations per 1000 population excluding values above 99.5 percentile. More than half show a figure of 0.1 per 1000 population. The median number of hospitalizations per 1000 people is about 0.088.

Q4 Create a visualization that shows the number of tests per day in the United States and indicate the median value on your visualization.

US = df_covid_country_level[df_covid_country_level.administrative_area_level_1 == "United States"].sort_values("date")

US_test = US[["date", "tests"]].rename(columns = {"tests" : "cum_tests"})

US_test["daily_tests"] = US_test.cum_tests.diff().fillna(US_test['cum_tests'])
US_test = US_test[["date", "daily_tests", "cum_tests"]].dropna()

US_test["daily_tests_moving_avg"] = US_test.daily_tests.rolling(7).mean()

US_test

	date	daily_tests	cum_tests	daily_tests_moving_avg
159337	2020-03-01	348.0	348.0	NaN
159338	2020-03-02	514.0	862.0	NaN
159339	2020-03-03	622.0	1484.0	NaN
159340	2020-03-04	887.0	2371.0	NaN
159341	2020-03-05	1201.0	3572.0	NaN
...	...	...	...	...
160035	2022-01-28	1725137.0	783594943.0	1.650536e+06
160036	2022-01-29	1121147.0	784716090.0	1.571547e+06
160037	2022-01-30	689304.0	785405394.0	1.514444e+06
160038	2022-01-31	1068185.0	786473579.0	1.460625e+06
160039	2022-02-01	1347217.0	787820796.0	1.397621e+06

703 rows × 4 columns

us_daily_test_median = US_test.daily_tests.median()

plt.figure(figsize = (30, 8))
sns.lineplot(x = "date", y = "daily_tests_moving_avg", color = "blue", linewidth = 2, \
             data = US_test)
sns.lineplot(x = "date", y = "daily_tests", color = "tab:blue", alpha = 0.3, \
             data = US_test)
plt.axhline(y = us_daily_test_median, color = "red")
plt.annotate(f'Median = {us_daily_test_median}', xy=(datestr2num("2020-03-01"), us_daily_test_median + 100000), fontsize = 15, color = "red")
plt.xlabel("Date", fontsize = 16)
plt.ylabel("The number of tests", fontsize = 16)
plt.show()

png

In the graph above, light blue is the number of tests per day, and dark blue is the smoothed graph of the light blue graph using the 7-day rolling average. We can check the overall trend by looking at the smoothed graph. It shows an increasing trend until January 2021 and then shows a decreasing trend until July 2021. And again, it shows an increasing trend from July 2021, then sharply increases on January 22, and has since decreased. The median number of tests per day in the United States is 1,040,682.

For questions below:

You will have to call the covid19() function again with parameters specific to Canada.
Set the parameter level = 2 in the call to covid19().

canada,src = covid19("CAN",level = 2)

We have invested a lot of time and effort in creating COVID-19 Data Hub, please cite the following when using it:

	[1mGuidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.[0m

A BibTeX entry for LaTeX users is

	@Article{,
		title = {COVID-19 Data Hub},
		year = {2020},
		doi = {10.21105/joss.02376},
		author = {Emanuele Guidotti and David Ardia},
		journal = {Journal of Open Source Software},
		volume = {5},
		number = {51},
		pages = {2376},
	}

[33mTo hide this message use 'verbose = False'.[0m

canada.tail()

	id	date	confirmed	deaths	recovered	tests	vaccines	people_vaccinated	people_fully_vaccinated	hosp	...	iso_alpha_3	iso_alpha_2	iso_numeric	iso_currency	key_local	key_google_mobility	key_apple_mobility	key_jhu_csse	key_nuts	key_gadm
496252	eef40c88	2022-02-04	6550.0	17.0	5581.0	39811.0	96926.0	NaN	NaN	NaN	...	CAN	CA	124.0	CAD	61	ChIJDcHTs_Q4EVERjVnGRNguMhk	Northwest Territories	CANT	NaN	CAN.6_1
496253	eef40c88	2022-02-05	6550.0	17.0	5581.0	39852.0	96926.0	NaN	NaN	NaN	...	CAN	CA	124.0	CAD	61	ChIJDcHTs_Q4EVERjVnGRNguMhk	Northwest Territories	CANT	NaN	CAN.6_1
496254	eef40c88	2022-02-06	6550.0	17.0	5581.0	39852.0	96926.0	NaN	NaN	NaN	...	CAN	CA	124.0	CAD	61	ChIJDcHTs_Q4EVERjVnGRNguMhk	Northwest Territories	CANT	NaN	CAN.6_1
496255	eef40c88	2022-02-07	6846.0	17.0	5925.0	39878.0	97798.0	NaN	NaN	NaN	...	CAN	CA	124.0	CAD	61	ChIJDcHTs_Q4EVERjVnGRNguMhk	Northwest Territories	CANT	NaN	CAN.6_1
496256	eef40c88	2022-02-08	NaN	NaN	NaN	NaN	97798.0	NaN	NaN	NaN	...	CAN	CA	124.0	CAD	61	ChIJDcHTs_Q4EVERjVnGRNguMhk	Northwest Territories	CANT	NaN	CAN.6_1

5 rows × 47 columns

Q5 Create a bar plot to show the number of tests conducted in each province.

Order the provinces from west to east (use your best judgement for cases where the order is unclear). Which Canadian province that has conducted the most tests?

canada.columns

Index(['id', 'date', 'confirmed', 'deaths', 'recovered', 'tests', 'vaccines',
       'people_vaccinated', 'people_fully_vaccinated', 'hosp', 'icu', 'vent',
       'school_closing', 'workplace_closing', 'cancel_events',
       'gatherings_restrictions', 'transport_closing',
       'stay_home_restrictions', 'internal_movement_restrictions',
       'international_movement_restrictions', 'information_campaigns',
       'testing_policy', 'contact_tracing', 'facial_coverings',
       'vaccination_policy', 'elderly_people_protection',
       'government_response_index', 'stringency_index',
       'containment_health_index', 'economic_support_index',
       'administrative_area_level', 'administrative_area_level_1',
       'administrative_area_level_2', 'administrative_area_level_3',
       'latitude', 'longitude', 'population', 'iso_alpha_3', 'iso_alpha_2',
       'iso_numeric', 'iso_currency', 'key_local', 'key_google_mobility',
       'key_apple_mobility', 'key_jhu_csse', 'key_nuts', 'key_gadm'],
      dtype='object')

canada.administrative_area_level_2.unique()

array(['Prince Edward Island', 'Manitoba', 'Yukon', 'Nunavut', 'Ontario',
       'Quebec', 'Nova Scotia', 'British Columbia',
       'Newfoundland and Labrador', 'New Brunswick', 'Saskatchewan',
       'Alberta', 'Northwest Territories'], dtype=object)

canada_test_conducted = canada[["id", "date", "tests", "administrative_area_level_2", "latitude", "longitude", "population"]].rename(columns = {"administrative_area_level_2" : "province"})
canada_test_conducted.dropna(inplace = True)

canada_test_conducted.sort_values(["province","date"], ascending = [True, False])

	id	date	tests	province	latitude	longitude	population
475439	e61d6191	2022-02-07	6763655.0	Alberta	54.500614	-115.002842	4413146
475438	e61d6191	2022-02-06	6746196.0	Alberta	54.500614	-115.002842	4413146
475437	e61d6191	2022-02-05	6746196.0	Alberta	54.500614	-115.002842	4413146
475436	e61d6191	2022-02-04	6746196.0	Alberta	54.500614	-115.002842	4413146
475435	e61d6191	2022-02-03	6739970.0	Alberta	54.500614	-115.002842	4413146
...	...	...	...	...	...	...	...
124171	38791b01	2020-03-15	49.0	Yukon	64.819450	-136.804579	41078
124170	38791b01	2020-03-14	37.0	Yukon	64.819450	-136.804579	41078
124169	38791b01	2020-03-13	36.0	Yukon	64.819450	-136.804579	41078
124168	38791b01	2020-03-12	34.0	Yukon	64.819450	-136.804579	41078
124167	38791b01	2020-03-11	23.0	Yukon	64.819450	-136.804579	41078

9128 rows × 7 columns

canada_test_conducted["date_order"] = canada_test_conducted.sort_values(["province","date"], ascending = [True, False]).groupby("province").cumcount() + 1

canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"])

	id	date	tests	province	latitude	longitude	population	date_order
124865	38791b01	2022-02-07	9129.0	Yukon	64.819450	-136.804579	41078	1
407259	c229681f	2022-02-07	5495428.0	British Columbia	54.499851	-124.993506	5110917	1
496255	eef40c88	2022-02-07	39878.0	Northwest Territories	65.280365	-121.562220	44904	1
475439	e61d6191	2022-02-07	6763655.0	Alberta	54.500614	-115.002842	4413146	1
474047	e4c07903	2022-02-07	1452358.0	Saskatchewan	54.500038	-105.927063	1181666	1
264912	7fc88543	2022-02-07	32179.0	Nunavut	66.001041	-100.263618	38780	1
99343	2a9fd65a	2022-02-07	1434252.0	Manitoba	54.510344	-97.212207	1377517	1
276313	83fc0fa9	2022-02-07	22606233.0	Ontario	49.269156	-87.166464	14711827	1
347935	a7ce33b9	2022-02-07	16015314.0	Quebec	53.889046	-73.288937	8537674	1
443254	d177e539	2022-02-07	704856.0	New Brunswick	46.551245	-66.411970	779993	1
390749	b91ff4d1	2022-02-07	1728364.0	Nova Scotia	44.727022	-64.602949	977457	1
4180	015d95fc	2022-02-07	255565.0	Prince Edward Island	46.503836	-63.615584	158158	1
441298	d07806cb	2022-02-07	551866.0	Newfoundland and Labrador	49.120554	-56.692576	521365	1

Since the test column is a cumulative number, if you check the data of the last date in each province, you can check the number of tests conducted in each province.

plt.figure(figsize = (30, 8))
sns.barplot(x = "province", y = "tests", \
            data = canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"]))
plt.ylabel("The number of tests conducted")
plt.xlabel("Province")
plt.show()

png

In the above bar chart, provinces are sorted in the order from west to east by sorting longitude first, and then latitude values. Looking at the bar chart above, it can be seen that Ontario and Quebec tested overwhelmingly, and there were more tests in the west than in the east. However, since this is an interpretation that does not take into account the population of each province, it is necessary to consider the population as well.

Q6 Create a bar plot that shows the number of tests conducted per 1000 people for each province in Canada.

How would you interpret the results of this bar plot given the results of bar plot in question 5.

canada_test_conducted.loc[canada_test_conducted.date_order == 1, "tests/pop"] = canada_test_conducted[canada_test_conducted.date_order == 1].tests \
                                                                            / canada_test_conducted[canada_test_conducted.date_order == 1].population \
                                                                            * 1000

canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"])

	id	date	tests	province	latitude	longitude	population	date_order	tests/pop
124865	38791b01	2022-02-07	9129.0	Yukon	64.819450	-136.804579	41078	1	222.235747
407259	c229681f	2022-02-07	5495428.0	British Columbia	54.499851	-124.993506	5110917	1	1075.233270
496255	eef40c88	2022-02-07	39878.0	Northwest Territories	65.280365	-121.562220	44904	1	888.072332
475439	e61d6191	2022-02-07	6763655.0	Alberta	54.500614	-115.002842	4413146	1	1532.615282
474047	e4c07903	2022-02-07	1452358.0	Saskatchewan	54.500038	-105.927063	1181666	1	1229.076575
264912	7fc88543	2022-02-07	32179.0	Nunavut	66.001041	-100.263618	38780	1	829.783394
99343	2a9fd65a	2022-02-07	1434252.0	Manitoba	54.510344	-97.212207	1377517	1	1041.186425
276313	83fc0fa9	2022-02-07	22606233.0	Ontario	49.269156	-87.166464	14711827	1	1536.602694
347935	a7ce33b9	2022-02-07	16015314.0	Quebec	53.889046	-73.288937	8537674	1	1875.840422
443254	d177e539	2022-02-07	704856.0	New Brunswick	46.551245	-66.411970	779993	1	903.669648
390749	b91ff4d1	2022-02-07	1728364.0	Nova Scotia	44.727022	-64.602949	977457	1	1768.225098
4180	015d95fc	2022-02-07	255565.0	Prince Edward Island	46.503836	-63.615584	158158	1	1615.884116
441298	d07806cb	2022-02-07	551866.0	Newfoundland and Labrador	49.120554	-56.692576	521365	1	1058.502201

plt.figure(figsize = (30, 8))
sns.barplot(x = "province", y = "tests/pop", \
            data = canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"]))
plt.ylabel("The number of tests conducted pero 1000 people")
plt.xlabel("Province")
plt.show()

png

In the graph in Q5, it came out that Ontario and Quebec had an overwhelming number of tests, but if you check the graph in Q6, you can confirm that this was because the population of the two provinces was overwhelmingly large. Comparing the number of tests per 1000 population, it can be seen that Ontario is lower than Nova Scotia and Prince Edward Island. In addition, it can be seen that the number of tests per 1000 population is slightly higher in the east than in the west.

Q7 Create a visualization that shows which Canadian province has the highest recovery rate.

Recovery rate is calculated as the number of recovered cases divided by the number of confirmed cases.

canada_recovery = canada[["id", "date", "confirmed", "recovered", "administrative_area_level_2", "latitude", "longitude"]].rename(columns = {"administrative_area_level_2" : "province"})
canada_recovery.dropna(inplace = True)

canada_recovery["date_order"] = canada_recovery.sort_values(["province","date"], ascending = [True, False]).groupby("province").cumcount() + 1

canada_recovery.loc[canada_recovery.date_order == 1, "recovery_rate"] = canada_recovery[canada_recovery.date_order == 1].recovered \
                                                                            / canada_recovery[canada_recovery.date_order == 1].confirmed

canada_recovery.loc[canada_recovery.date_order == 1].sort_values("recovery_rate", ascending = False)

	id	date	confirmed	recovered	province	latitude	longitude	date_order	recovery_rate
124865	38791b01	2022-02-07	3235.0	3105.0	Yukon	64.819450	-136.804579	1	0.959815
276313	83fc0fa9	2022-02-07	1056149.0	1010878.0	Ontario	49.269156	-87.166464	1	0.957136
347935	a7ce33b9	2022-02-07	883192.0	834633.0	Quebec	53.889046	-73.288937	1	0.945019
475439	e61d6191	2022-02-07	508051.0	474284.0	Alberta	54.500614	-115.002842	1	0.933536
407259	c229681f	2022-02-07	333925.0	306419.0	British Columbia	54.499851	-124.993506	1	0.917628
474047	e4c07903	2022-02-07	123258.0	113023.0	Saskatchewan	54.500038	-105.927063	1	0.916963
441298	d07806cb	2022-02-07	18740.0	17156.0	Newfoundland and Labrador	49.120554	-56.692576	1	0.915475
390749	b91ff4d1	2022-02-07	40767.0	37300.0	Nova Scotia	44.727022	-64.602949	1	0.914956
443254	d177e539	2022-02-07	31017.0	27298.0	New Brunswick	46.551245	-66.411970	1	0.880098
496255	eef40c88	2022-02-07	6846.0	5925.0	Northwest Territories	65.280365	-121.562220	1	0.865469
99343	2a9fd65a	2022-02-07	123739.0	103595.0	Manitoba	54.510344	-97.212207	1	0.837206
4180	015d95fc	2022-02-07	9104.0	7268.0	Prince Edward Island	46.503836	-63.615584	1	0.798330
264912	7fc88543	2022-02-07	1989.0	1444.0	Nunavut	66.001041	-100.263618	1	0.725993

plt.figure(figsize = (30, 8))
sns.barplot(x = "province", y = "recovery_rate", \
            data = canada_recovery[canada_recovery.date_order == 1].sort_values(["longitude", "latitude"]))
plt.ylabel("Recovery rate = # of recovered cases / # of confirmed cases")
plt.xlabel("Province")
plt.show()

png

Yukon has the highest recovery rate of 0.96, Nunavut has the lowest 0.73. Yukon had the lowest number of tests per 1000 population but the highest recovery rate. In Nunavut, the number of tests per 1000 people was 829, which was much lower than in other provinces, but the recovery rate was also the lowest.

Q8 Create visualizations that show the impacts of _at least_ three policy measures on mortality or infection rates.

See https://covid19datahub.io/articles/doc/data.html for descriptions of the available policy measures. You are not limited to histograms and bar charts. Remember that you can use subplots!

Gatherings restrictions

0 - no restrictions
1 - restrictions on very large gatherings (the limit is above 1000 people)
2 - restrictions on gatherings between 101-1000 people
3 - restrictions on gatherings between 11-100 people
4 - restrictions on gatherings of 10 people or less

If gathering restrictions are targeted to a specific geographical region, then they have a negative value. Otherwise, if they are a general policy that is applied across the whole country/territory, then they have positive values. Since I am only interested in the impacts of the general policy on infection rates, I will focus on the data that have the positive gathering_restirctions values.

canada_gatherings_restriction = canada[["date", "confirmed", "population", "gatherings_restrictions", "administrative_area_level_2"]].rename(columns = {"administrative_area_level_2" : "province", "confirmed" : "cum_confirmed"})
canada_gatherings_restriction = canada_gatherings_restriction[canada_gatherings_restriction.cum_confirmed.isnull() == False]
canada_gatherings_restriction = canada_gatherings_restriction[canada_gatherings_restriction.gatherings_restrictions >= 0].sort_values(["province", "date"])
canada_gatherings_restriction["daily_confirmed"] = canada_gatherings_restriction.sort_values(["province", "date"]).groupby("province").cum_confirmed.diff().fillna(canada_gatherings_restriction.cum_confirmed)
canada_gatherings_restriction = canada_gatherings_restriction[canada_gatherings_restriction.daily_confirmed >= 0]
canada_gatherings_restriction = canada_gatherings_restriction[["date", "province", "population", "gatherings_restrictions", "daily_confirmed", "cum_confirmed"]]

canada_gatherings_restriction_infection_rate = canada_gatherings_restriction.groupby(["gatherings_restrictions", "date"]).sum()[["daily_confirmed", "population"]].reset_index() 
canada_gatherings_restriction_infection_rate["infection_rate"] = canada_gatherings_restriction_infection_rate.daily_confirmed / canada_gatherings_restriction_infection_rate.population * 100
canada_gatherings_restriction_infection_rate

	gatherings_restrictions	date	daily_confirmed	population	infection_rate
0	0.0	2020-01-31	4.0	19822744	0.000020
1	0.0	2020-02-08	3.0	19822744	0.000015
2	0.0	2020-02-16	1.0	19822744	0.000005
3	0.0	2020-02-21	1.0	19822744	0.000005
4	0.0	2020-02-24	1.0	19822744	0.000005
...	...	...	...	...	...
1854	4.0	2022-01-31	2073.0	4499128	0.046076
1855	4.0	2022-02-01	2188.0	4458050	0.049080
1856	4.0	2022-02-02	3172.0	4458050	0.071152
1857	4.0	2022-02-03	2512.0	4458050	0.056348
1858	4.0	2022-02-04	2273.0	4458050	0.050986

1859 rows × 5 columns

plt.figure(figsize = (20, 8))
sns.boxplot(data = canada_gatherings_restriction_infection_rate, x = "gatherings_restrictions", y = "infection_rate", showfliers = False)
plt.xlabel("Gathering restrictions", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

png

The graph above shows a box plot of the daily infection rate for each gathering restriction policy. It can be seen that the infection rate was high when there was no restriction, but when the restriction was level 1 to 3, the infection rate was significantly reduced. However, when the restriction was the highest at level 4, the infection rate was rather high.

canada_gatherings_restriction_infection_rate["infection_rate_moving_avg"] = canada_gatherings_restriction_infection_rate.sort_values(["gatherings_restrictions","date"]).groupby("gatherings_restrictions").infection_rate.rolling(7).mean().reset_index().infection_rate

plt.figure(figsize = (30, 8))
sns.lineplot(data = canada_gatherings_restriction_infection_rate, x = "date", y = "infection_rate_moving_avg", hue = "gatherings_restrictions")
plt.xlabel("date", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

png

It is a graph showing the infection rate for each restriction level over time. All five levels show a similar trend, but it can be seen that the infection rate is particularly high in the 4th level of restriction area from October 2020 to June 2021. In addition, the infection rate was particularly high in the level 0 restricted area from July 2021 to December 2021.

Stay home restrictions

0 - no measures
1 - recommend not leaving house
2 - require not leaving house with exceptions for daily exercise, grocery shopping, and ‘essential’ trips
3 - require not leaving house with minimal exceptions (eg allowed to leave once a week, or only one person can leave at a time, etc)

If stay-home restrictions are targeted to a specific geographical region, then they have a negative value. Otherwise, if they are a general policy that is applied across the whole country/territory, then they have positive values. Since I am only interested in the impacts of the general policy on infection rates, I will focus on the data that have positive stay-home restrictions values.

canada_stayhome_restriction = canada[["date", "confirmed", "population", "stay_home_restrictions", "administrative_area_level_2"]].rename(columns = {"administrative_area_level_2" : "province", "confirmed" : "cum_confirmed"})
canada_stayhome_restriction = canada_stayhome_restriction[canada_stayhome_restriction.cum_confirmed.isnull() == False]
canada_stayhome_restriction = canada_stayhome_restriction[canada_stayhome_restriction.stay_home_restrictions >= 0].sort_values(["province", "date"])
canada_stayhome_restriction["daily_confirmed"] = canada_stayhome_restriction.sort_values(["province", "date"]).groupby("province").cum_confirmed.diff().fillna(canada_stayhome_restriction.cum_confirmed)
canada_stayhome_restriction = canada_stayhome_restriction[canada_stayhome_restriction.daily_confirmed >= 0]
canada_stayhome_restriction = canada_stayhome_restriction[["date", "province", "population", "stay_home_restrictions", "daily_confirmed", "cum_confirmed"]]

canada_stayhome_restriction_infection_rate = canada_stayhome_restriction.groupby(["stay_home_restrictions", "date"]).sum()[["daily_confirmed", "population"]].reset_index() 
canada_stayhome_restriction_infection_rate["infection_rate"] = canada_stayhome_restriction_infection_rate.daily_confirmed / canada_stayhome_restriction_infection_rate.population * 100
canada_stayhome_restriction_infection_rate

	stay_home_restrictions	date	daily_confirmed	population	infection_rate
0	0.0	2020-01-31	4.0	19822744	0.000020
1	0.0	2020-02-08	3.0	19822744	0.000015
2	0.0	2020-02-16	1.0	19822744	0.000005
3	0.0	2020-02-21	1.0	19822744	0.000005
4	0.0	2020-02-24	1.0	19822744	0.000005
...	...	...	...	...	...
1389	2.0	2022-01-12	8351.0	8537674	0.097814
1390	2.0	2022-01-13	8793.0	8537674	0.102991
1391	2.0	2022-01-14	7382.0	8537674	0.086464
1392	2.0	2022-01-15	6705.0	8537674	0.078534
1393	2.0	2022-01-16	5946.0	8537674	0.069644

1394 rows × 5 columns

plt.figure(figsize = (20, 8))
sns.boxplot(data = canada_stayhome_restriction_infection_rate, x = "stay_home_restrictions", y = "infection_rate", showfliers = False)
plt.xlabel("Stay home restrictions", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

png

The graph above shows a box plot of the daily infection rate for each stay-home restriction. Contrary to our common sense, it can be seen that the higher the restriction level, the higher the infection rate.

canada_stayhome_restriction_infection_rate["infection_rate_moving_avg"] = canada_stayhome_restriction_infection_rate.sort_values(["stay_home_restrictions","date"]).groupby("stay_home_restrictions").infection_rate.rolling(7).mean().reset_index().infection_rate

plt.figure(figsize = (30, 8))
sns.lineplot(data = canada_stayhome_restriction_infection_rate, x = "date", y = "infection_rate_moving_avg", hue = "stay_home_restrictions")
plt.xlabel("date", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

png

The overall trend is similar, but in the case of the second-stage restriction, it can be seen that the infection rate was particularly high from April 2020 to January 2021.

Transport closing

0 - no measures
1 - recommend closing (or significantly reduce volume/route/means of transport available)
2 - require closing (or prohibit most citizens from using it)

If transport closing restrictions are targeted to a specific geographical region, then they have a negative value. Otherwise, if they are a general policy that is applied across the whole country/territory, then they have positive values. Since I am only interested in the impacts of the general policy on infection rates, I will focus on the data that have positive transport closing values.

canada_transport_closing = canada[["date", "confirmed", "population", "transport_closing", "administrative_area_level_2"]].rename(columns = {"administrative_area_level_2" : "province", "confirmed" : "cum_confirmed"})
canada_transport_closing = canada_transport_closing[canada_transport_closing.cum_confirmed.isnull() == False]
canada_transport_closing = canada_transport_closing[canada_transport_closing.transport_closing >= 0].sort_values(["province", "date"])
canada_transport_closing["daily_confirmed"] = canada_transport_closing.sort_values(["province", "date"]).groupby("province").cum_confirmed.diff().fillna(canada_transport_closing.cum_confirmed)
canada_transport_closing = canada_transport_closing[canada_transport_closing.daily_confirmed >= 0]
canada_transport_closing = canada_transport_closing[["date", "province", "population", "transport_closing", "daily_confirmed", "cum_confirmed"]]

canada_transport_closing_infection_rate = canada_transport_closing.groupby(["transport_closing", "date"]).sum()[["daily_confirmed", "population"]].reset_index() 
canada_transport_closing_infection_rate["infection_rate"] = canada_transport_closing_infection_rate.daily_confirmed / canada_transport_closing_infection_rate.population * 100
canada_transport_closing_infection_rate

	transport_closing	date	daily_confirmed	population	infection_rate
0	0.0	2020-01-31	4.0	19822744	0.000020
1	0.0	2020-02-08	3.0	19822744	0.000015
2	0.0	2020-02-16	1.0	19822744	0.000005
3	0.0	2020-02-21	1.0	19822744	0.000005
4	0.0	2020-02-24	1.0	19822744	0.000005
...	...	...	...	...	...
1256	1.0	2022-01-27	378.0	521365	0.072502
1257	1.0	2022-01-28	265.0	521365	0.050828
1258	1.0	2022-01-29	208.0	521365	0.039895
1259	1.0	2022-01-30	210.0	521365	0.040279
1260	1.0	2022-01-31	183.0	521365	0.035100

1261 rows × 5 columns

plt.figure(figsize = (20, 8))
sns.boxplot(data = canada_transport_closing_infection_rate, x = "transport_closing", y = "infection_rate", showfliers = False)
plt.xlabel("Transport closing", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

png

The graph above shows a box plot of the daily infection rate for each transport closing restriction. There was no level 2 restriction in Canada. It can be seen that the first level restriction, which only recommends transport closing, has little effect on the infection rate.

canada_transport_closing_infection_rate["infection_rate_moving_avg"] = canada_transport_closing_infection_rate.sort_values(["transport_closing","date"]).groupby("transport_closing").infection_rate.rolling(7).mean().reset_index().infection_rate

plt.figure(figsize = (30, 8))
sns.lineplot(data = canada_transport_closing_infection_rate, x = "date", y = "infection_rate_moving_avg", hue = "transport_closing")
plt.xlabel("date", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

png

Level 0 and Level 1 restrictions show almost no difference in the infection rate over time.

Youngjun Woo