HW3. Data Visualization

Topics: Data visualization


Background

This homework assignment focuses on the visual analysis of the COVID-19 data available here: https://covid19datahub.io/articles/api/python.html. A description of the dataset can be found at https://covid19datahub.io/articles/doc/data.html

Your main task in this assignment is to explore the data using the data manipulation, analysis, and visualization methods we covered in class as well as those in the assigned readings. You may need to consult pandas, matplotlib and seaborn documentation, as well as Stack Overflow, or other online resources.

MY_UNIQNAME = 'yjwoo' # please fill in your uniqname

Getting the data

The following cell should install the most up-to-date version of the COVID-19 datahub. Alternatively, you can consult the datahub documentation to download the data files directly.

pip install --upgrade covid19dh
Requirement already satisfied: covid19dh in /opt/anaconda3/lib/python3.9/site-packages (2.3.0)
Requirement already satisfied: requests in /opt/anaconda3/lib/python3.9/site-packages (from covid19dh) (2.26.0)
Requirement already satisfied: pandas in /opt/anaconda3/lib/python3.9/site-packages (from covid19dh) (1.3.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas->covid19dh) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas->covid19dh) (2021.3)
Requirement already satisfied: numpy>=1.17.3 in /opt/anaconda3/lib/python3.9/site-packages (from pandas->covid19dh) (1.20.3)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas->covid19dh) (1.16.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/anaconda3/lib/python3.9/site-packages (from requests->covid19dh) (1.26.7)
Note: you may need to restart the kernel to use updated packages.

Restart the kernel to import the module and access the data

from covid19dh import covid19

Answer the questions below.
For each question, you should

  1. Write code that can help you answer the following questions, and
  2. Explain your answers in plain English. You should use complete sentences that would be understood by an educated professional who is not necessarily a data scientist (like a product manager).
# Load all the modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.dates import datestr2num
import warnings
warnings.filterwarnings('ignore')

Q1 How many different countries are represented in the country-level data set?

  • Refer to the documentation to call the covid19() function with appropriate parameters (https://covid19datahub.io/articles/api/python.html)
df_covid_country_level, src = covid19(level = 1) 
df_covid_country_level.head()
We have invested a lot of time and effort in creating COVID-19 Data Hub, please cite the following when using it:

	Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

A BibTeX entry for LaTeX users is

	@Article{,
		title = {COVID-19 Data Hub},
		year = {2020},
		doi = {10.21105/joss.02376},
		author = {Emanuele Guidotti and David Ardia},
		journal = {Journal of Open Source Software},
		volume = {5},
		number = {51},
		pages = {2376},
	}

To hide this message use 'verbose = False'.
id date confirmed deaths recovered tests vaccines people_vaccinated people_fully_vaccinated hosp ... iso_alpha_3 iso_alpha_2 iso_numeric iso_currency key_local key_google_mobility key_apple_mobility key_jhu_csse key_nuts key_gadm
87758 0094b645 2020-01-22 NaN NaN NaN NaN NaN NaN NaN NaN ... LCA LC 662.0 XCD NaN NaN NaN LC NaN LCA
87759 0094b645 2020-01-23 NaN NaN NaN NaN NaN NaN NaN NaN ... LCA LC 662.0 XCD NaN NaN NaN LC NaN LCA
87760 0094b645 2020-01-24 NaN NaN NaN NaN NaN NaN NaN NaN ... LCA LC 662.0 XCD NaN NaN NaN LC NaN LCA
87761 0094b645 2020-01-25 NaN NaN NaN NaN NaN NaN NaN NaN ... LCA LC 662.0 XCD NaN NaN NaN LC NaN LCA
87762 0094b645 2020-01-26 NaN NaN NaN NaN NaN NaN NaN NaN ... LCA LC 662.0 XCD NaN NaN NaN LC NaN LCA

5 rows × 47 columns

df_covid_country_level.shape
(167987, 47)

There is a total of 167,987 rows and 47 columns in the country-level data set.

len(df_covid_country_level["administrative_area_level_1"].unique())
236

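The administrative_area_level_1 column contains 236 distinct values, so 236 entities are represented in the country-level data set. Not every entry is a country in the strict sense: the list also includes cruise ships such as Grand Princess and Costa Atlantica. Even so, almost all countries are included in the data.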
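
As a small illustration of the counting step, the sketch below uses a made-up frame (not the real data) to show how `nunique()` counts distinct entries and how non-country entries could be set aside. The `non_countries` list is a hypothetical helper introduced only for this example; it names the two cruise ships that appear in this data set.

```python
import pandas as pd

# Toy stand-in for the country-level data (values are invented).
toy = pd.DataFrame({
    "administrative_area_level_1": ["Canada", "Canada", "Mexico", "Grand Princess"],
    "date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-22", "2020-02-01"]),
})

# Number of distinct entries in the column.
n_entities = toy["administrative_area_level_1"].nunique()

# Hypothetical exclusion list: cruise ships listed alongside countries.
non_countries = ["Grand Princess", "Costa Atlantica"]
is_country = ~toy["administrative_area_level_1"].isin(non_countries)
n_countries = toy.loc[is_country, "administrative_area_level_1"].nunique()

print(n_entities, n_countries)
```

Whether to exclude such entries depends on how "country" is defined for the question at hand.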

df_covid_country_level.groupby("administrative_area_level_1").date.nunique().sort_values()
administrative_area_level_1
Grand Princess      10
Niue                43
Costa Atlantica     81
Pitcairn            85
Tokelau            114
                  ... 
United Kingdom     758
Argentina          766
Thailand           766
China              767
Mexico             768
Name: date, Length: 236, dtype: int64
df_covid_country_level.groupby(["administrative_area_level_1"]).nunique().date.sort_values().plot.hist(bins = 10)
plt.show()

[Figure: histogram of the number of days of data per country]

Most countries have more than 600 days of data, but a few entries, such as Grand Princess, Niue, Costa Atlantica, and Pitcairn, have fewer than 100 days.

Q2 Create a line chart that shows the total number of cases over time.

df_confirmed_case = df_covid_country_level[["date", "confirmed", "administrative_area_level_1"]].sort_values(["administrative_area_level_1", "date"]).rename(columns = {"confirmed" : "cum_confirmed", "administrative_area_level_1" : "country"})
df_confirmed_case["daily_confirmed"] = df_confirmed_case.groupby("country").cum_confirmed.diff().fillna(df_confirmed_case['cum_confirmed'])
df_confirmed_case = df_confirmed_case[["date", "country", "daily_confirmed", "cum_confirmed"]]
df_confirmed_case['year_month'] = df_confirmed_case['date'].dt.strftime('%Y-%m')
# Monthly totals of the positive daily counts, then a running total across months
monthly_confirmed = df_confirmed_case[df_confirmed_case.daily_confirmed > 0].groupby("year_month").daily_confirmed.sum()
df_confirmed_case_monthly = monthly_confirmed.rename("monthly_confirmed").reset_index()
df_confirmed_case_monthly["cum_confirmed"] = df_confirmed_case_monthly.monthly_confirmed.cumsum()
df_confirmed_case_monthly
year_month monthly_confirmed cum_confirmed
0 2020-01 9863.0 9863.0
1 2020-02 76682.0 86545.0
2 2020-03 794611.0 881156.0
3 2020-04 2365557.0 3246713.0
4 2020-05 2901122.0 6147835.0
5 2020-06 4325821.0 10473656.0
6 2020-07 7163395.0 17637051.0
7 2020-08 7945496.0 25582547.0
8 2020-09 8590723.0 34173270.0
9 2020-10 12270240.0 46443510.0
10 2020-11 17166584.0 63610094.0
11 2020-12 20545820.0 84155914.0
12 2021-01 19477426.0 103633340.0
13 2021-02 11261962.0 114895302.0
14 2021-03 14932777.0 129828079.0
15 2021-04 22558324.0 152386403.0
16 2021-05 19702300.0 172088703.0
17 2021-06 14202707.0 186291410.0
18 2021-07 15836422.0 202127832.0
19 2021-08 20077425.0 222205257.0
20 2021-09 15994751.0 238200008.0
21 2021-10 12989702.0 251189710.0
22 2021-11 15805885.0 266995595.0
23 2021-12 25519483.0 292515078.0
24 2022-01 80602125.0 373117203.0
25 2022-02 12923794.0 386040997.0
fig, ax1 = plt.subplots(figsize = (30, 8))

# cumulative line chart
color = "tab:green"
ax1 = sns.lineplot(x = "year_month", y = "cum_confirmed", color = color, linewidth = 3, \
                   data = df_confirmed_case_monthly)
ax1.set_xlabel("Year-Month", fontsize = 16)
ax1.set_ylabel("Cumulative cases", color = color, fontsize = 16)

# monthly bar chart
ax2 = ax1.twinx()
color = "tab:blue"
ax2 = sns.barplot(x = "year_month", y = "monthly_confirmed", color = color, alpha = 0.5, \
                  data = df_confirmed_case_monthly)
ax2.set_ylabel("Monthly cases", color = color, fontsize = 16)
plt.show()

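[Figure: cumulative confirmed cases (green line) and monthly confirmed cases (blue bars) by month]

The table and graph above show the number of confirmed cases per month and cumulatively. There were 9,863 confirmed cases in January 2020, and the cumulative total rose steadily to about 386 million by February 2022. Monthly cases increased through 2020, stayed between roughly 11 and 26 million per month from October 2020 through December 2021, and then surged to about 80 million in January 2022. The February 2022 figure covers only the first days of the month, because the data end in early February 2022.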
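
The cumulative-to-monthly conversion used for this chart can be sketched on a tiny invented data set. This is a minimal illustration under the assumption that the cumulative column never decreases; the frame `toy` and its values are made up and only the column names follow the code above.

```python
import pandas as pd

# Invented cumulative counts for two countries (not real data).
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"] * 2),
    "cum_confirmed": [1, 3, 6, 10, 0, 2, 2, 5],
}).sort_values(["country", "date"])

# Day-over-day difference within each country; the first row of each
# country has no predecessor, so it falls back to the cumulative value.
toy["daily_confirmed"] = (
    toy.groupby("country")["cum_confirmed"].diff().fillna(toy["cum_confirmed"])
)

# Sum the daily counts per calendar month, then accumulate across months.
toy["year_month"] = toy["date"].dt.strftime("%Y-%m")
monthly = (
    toy.groupby("year_month")["daily_confirmed"].sum()
    .rename("monthly_confirmed")
    .to_frame()
)
monthly["cum_confirmed"] = monthly["monthly_confirmed"].cumsum()
print(monthly)
```

The last cumulative value equals the sum of the final per-country totals, which is a quick sanity check on the conversion.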

Q3 Use the country-level data set to create a histogram to:

  • Show the distribution of values for the number of hospitalizations per 1000 people.
  • Draw a red vertical line that shows the median value on the histogram.
df_hospitalization = df_covid_country_level[["date", "hosp", "population", "administrative_area_level_1"]].dropna()
df_hospitalization = df_hospitalization[df_hospitalization.hosp > 0]
df_hospitalization["hosp/pop"] = df_hospitalization.hosp / df_hospitalization.population * 1000
df_hospitalization["hosp/pop"].describe()
count    28379.000000
mean         0.306831
std          5.254355
min          0.000024
25%          0.029287
50%          0.088823
75%          0.235695
max        239.165329
Name: hosp/pop, dtype: float64

75% of the values are below about 0.24, while the maximum is about 239. This gap suggests that there are extreme outliers at the high end.

df_hospitalization.sort_values("hosp/pop", ascending = False).head(10)
date hosp population administrative_area_level_1 hosp/pop
205 2020-05-07 149.0 623.0 Costa Atlantica 239.165329
203 2020-05-05 149.0 623.0 Costa Atlantica 239.165329
206 2020-05-08 149.0 623.0 Costa Atlantica 239.165329
204 2020-05-06 149.0 623.0 Costa Atlantica 239.165329
194 2020-04-26 148.0 623.0 Costa Atlantica 237.560193
195 2020-04-27 148.0 623.0 Costa Atlantica 237.560193
196 2020-04-28 148.0 623.0 Costa Atlantica 237.560193
197 2020-04-29 148.0 623.0 Costa Atlantica 237.560193
198 2020-04-30 148.0 623.0 Costa Atlantica 237.560193
199 2020-05-01 148.0 623.0 Costa Atlantica 237.560193

The abnormally large values all belong to Costa Atlantica, which is a cruise ship rather than a country. Because its population is very small (623), even about 150 hospitalizations produce a very large rate per 1000 people. Therefore, let's look only at data up to the 99.5th percentile.

percentile_99 = np.percentile(df_hospitalization["hosp/pop"], 99.5)
percentile_99
1.4660090896687556

The 99.5th percentile of the number of hospitalizations per 1000 people is about 1.47.

hospital_median = df_hospitalization["hosp/pop"].median()
plt.figure(figsize = (30, 8))
sns.histplot(df_hospitalization[(df_hospitalization["hosp/pop"] <= percentile_99)]["hosp/pop"], stat = "density", kde = True)  # histplot replaces the deprecated distplot
plt.axvline(x = hospital_median, color = "red")
plt.annotate(f'Median = {round(hospital_median, 4)}', xy=(hospital_median + 0.01, 8), fontsize = 15, color = "red")
plt.show()

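[Figure: histogram of hospitalizations per 1000 people, excluding values above the 99.5th percentile, with the median marked by a red vertical line]

The histogram above shows the distribution of the number of hospitalizations per 1000 people, excluding values above the 99.5th percentile. More than half of the observations are below 0.1 per 1000 people. The median, computed on all observations before trimming, is about 0.089.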
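
To illustrate why the median is a reasonable summary here while the plot is trimmed, the sketch below applies the same two steps (percentile cutoff, median) to a handful of invented rates with one extreme value. The numbers are made up for illustration only.

```python
import numpy as np

# Invented rates per 1000 people, with one extreme value (not real data).
rates = np.array([0.02, 0.05, 0.08, 0.09, 0.10, 0.20, 0.30, 239.0])

# Value below which 99.5% of the data fall.
cutoff = np.percentile(rates, 99.5)

# Drop only the extreme tail, as is done for plotting.
trimmed = rates[rates <= cutoff]

# The median barely reacts to the outlier; the mean is pulled far upward.
median_all = np.median(rates)
mean_all = rates.mean()
print(len(trimmed), median_all, mean_all)
```

Trimming affects only what is drawn; the median is still taken over the full data, which matches how `hospital_median` is computed above.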

Q4 Create a visualization that shows the number of tests per day in the United States and indicate the median value on your visualization.

US = df_covid_country_level[df_covid_country_level.administrative_area_level_1 == "United States"].sort_values("date")
US_test = US[["date", "tests"]].rename(columns = {"tests" : "cum_tests"})
US_test["daily_tests"] = US_test.cum_tests.diff().fillna(US_test['cum_tests'])
US_test = US_test[["date", "daily_tests", "cum_tests"]].dropna()
US_test["daily_tests_moving_avg"] = US_test.daily_tests.rolling(7).mean()
US_test
date daily_tests cum_tests daily_tests_moving_avg
159337 2020-03-01 348.0 348.0 NaN
159338 2020-03-02 514.0 862.0 NaN
159339 2020-03-03 622.0 1484.0 NaN
159340 2020-03-04 887.0 2371.0 NaN
159341 2020-03-05 1201.0 3572.0 NaN
... ... ... ... ...
160035 2022-01-28 1725137.0 783594943.0 1.650536e+06
160036 2022-01-29 1121147.0 784716090.0 1.571547e+06
160037 2022-01-30 689304.0 785405394.0 1.514444e+06
160038 2022-01-31 1068185.0 786473579.0 1.460625e+06
160039 2022-02-01 1347217.0 787820796.0 1.397621e+06

703 rows × 4 columns

us_daily_test_median = US_test.daily_tests.median()
plt.figure(figsize = (30, 8))
sns.lineplot(x = "date", y = "daily_tests_moving_avg", color = "blue", linewidth = 2, \
             data = US_test)
sns.lineplot(x = "date", y = "daily_tests", color = "tab:blue", alpha = 0.3, \
             data = US_test)
plt.axhline(y = us_daily_test_median, color = "red")
plt.annotate(f'Median = {us_daily_test_median}', xy=(datestr2num("2020-03-01"), us_daily_test_median + 100000), fontsize = 15, color = "red")
plt.xlabel("Date", fontsize = 16)
plt.ylabel("The number of tests", fontsize = 16)
plt.show()

[Figure: daily tests in the United States (light blue), 7-day rolling average (dark blue), and median (red horizontal line)]

In the graph above, the light blue line is the number of tests per day, and the dark blue line is its 7-day rolling average. The smoothed line makes the overall trend easier to see: testing increases until January 2021, decreases until July 2021, increases again from July 2021, rises sharply in January 2022, and has decreased since. The median number of tests per day in the United States is 1,040,682.
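
The effect of the 7-day rolling average can be seen on a small invented series with a weekly dip. This is only a sketch; the values and dates are made up and stand in for the `daily_tests` column.

```python
import pandas as pd

# Invented daily test counts with a weekend dip, repeated for two weeks.
daily = pd.Series(
    [100, 110, 120, 130, 140, 60, 40] * 2,
    index=pd.date_range("2020-03-01", periods=14, freq="D"),
)

# 7-day rolling mean: each value averages the current day and the 6 before it.
smooth = daily.rolling(7).mean()

# The first 6 days have no complete window, so they are NaN.
n_missing = int(smooth.isna().sum())

print(n_missing)
print(smooth.dropna().unique(), daily.median())
```

Because every 7-day window contains exactly one full week, the weekly dip cancels out and the smoothed series is flat, which is why the rolling average exposes the trend more clearly than the raw daily counts.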

For questions below:

  • You will have to call the covid19() function again with parameters specific to Canada.
  • Set the parameter level = 2 in the call to covid19().
canada,src = covid19("CAN",level = 2)
canada.tail()
id date confirmed deaths recovered tests vaccines people_vaccinated people_fully_vaccinated hosp ... iso_alpha_3 iso_alpha_2 iso_numeric iso_currency key_local key_google_mobility key_apple_mobility key_jhu_csse key_nuts key_gadm
496252 eef40c88 2022-02-04 6550.0 17.0 5581.0 39811.0 96926.0 NaN NaN NaN ... CAN CA 124.0 CAD 61 ChIJDcHTs_Q4EVERjVnGRNguMhk Northwest Territories CANT NaN CAN.6_1
496253 eef40c88 2022-02-05 6550.0 17.0 5581.0 39852.0 96926.0 NaN NaN NaN ... CAN CA 124.0 CAD 61 ChIJDcHTs_Q4EVERjVnGRNguMhk Northwest Territories CANT NaN CAN.6_1
496254 eef40c88 2022-02-06 6550.0 17.0 5581.0 39852.0 96926.0 NaN NaN NaN ... CAN CA 124.0 CAD 61 ChIJDcHTs_Q4EVERjVnGRNguMhk Northwest Territories CANT NaN CAN.6_1
496255 eef40c88 2022-02-07 6846.0 17.0 5925.0 39878.0 97798.0 NaN NaN NaN ... CAN CA 124.0 CAD 61 ChIJDcHTs_Q4EVERjVnGRNguMhk Northwest Territories CANT NaN CAN.6_1
496256 eef40c88 2022-02-08 NaN NaN NaN NaN 97798.0 NaN NaN NaN ... CAN CA 124.0 CAD 61 ChIJDcHTs_Q4EVERjVnGRNguMhk Northwest Territories CANT NaN CAN.6_1

5 rows × 47 columns

Q5 Create a bar plot to show the number of tests conducted in each province.

Order the provinces from west to east (use your best judgement for cases where the order is unclear). Which Canadian province has conducted the most tests?

canada.columns
Index(['id', 'date', 'confirmed', 'deaths', 'recovered', 'tests', 'vaccines',
       'people_vaccinated', 'people_fully_vaccinated', 'hosp', 'icu', 'vent',
       'school_closing', 'workplace_closing', 'cancel_events',
       'gatherings_restrictions', 'transport_closing',
       'stay_home_restrictions', 'internal_movement_restrictions',
       'international_movement_restrictions', 'information_campaigns',
       'testing_policy', 'contact_tracing', 'facial_coverings',
       'vaccination_policy', 'elderly_people_protection',
       'government_response_index', 'stringency_index',
       'containment_health_index', 'economic_support_index',
       'administrative_area_level', 'administrative_area_level_1',
       'administrative_area_level_2', 'administrative_area_level_3',
       'latitude', 'longitude', 'population', 'iso_alpha_3', 'iso_alpha_2',
       'iso_numeric', 'iso_currency', 'key_local', 'key_google_mobility',
       'key_apple_mobility', 'key_jhu_csse', 'key_nuts', 'key_gadm'],
      dtype='object')
canada.administrative_area_level_2.unique()
array(['Prince Edward Island', 'Manitoba', 'Yukon', 'Nunavut', 'Ontario',
       'Quebec', 'Nova Scotia', 'British Columbia',
       'Newfoundland and Labrador', 'New Brunswick', 'Saskatchewan',
       'Alberta', 'Northwest Territories'], dtype=object)
canada_test_conducted = canada[["id", "date", "tests", "administrative_area_level_2", "latitude", "longitude", "population"]].rename(columns = {"administrative_area_level_2" : "province"})
canada_test_conducted.dropna(inplace = True)
canada_test_conducted.sort_values(["province","date"], ascending = [True, False])
id date tests province latitude longitude population
475439 e61d6191 2022-02-07 6763655.0 Alberta 54.500614 -115.002842 4413146
475438 e61d6191 2022-02-06 6746196.0 Alberta 54.500614 -115.002842 4413146
475437 e61d6191 2022-02-05 6746196.0 Alberta 54.500614 -115.002842 4413146
475436 e61d6191 2022-02-04 6746196.0 Alberta 54.500614 -115.002842 4413146
475435 e61d6191 2022-02-03 6739970.0 Alberta 54.500614 -115.002842 4413146
... ... ... ... ... ... ... ...
124171 38791b01 2020-03-15 49.0 Yukon 64.819450 -136.804579 41078
124170 38791b01 2020-03-14 37.0 Yukon 64.819450 -136.804579 41078
124169 38791b01 2020-03-13 36.0 Yukon 64.819450 -136.804579 41078
124168 38791b01 2020-03-12 34.0 Yukon 64.819450 -136.804579 41078
124167 38791b01 2020-03-11 23.0 Yukon 64.819450 -136.804579 41078

9128 rows × 7 columns

canada_test_conducted["date_order"] = canada_test_conducted.sort_values(["province","date"], ascending = [True, False]).groupby("province").cumcount() + 1
canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"])
id date tests province latitude longitude population date_order
124865 38791b01 2022-02-07 9129.0 Yukon 64.819450 -136.804579 41078 1
407259 c229681f 2022-02-07 5495428.0 British Columbia 54.499851 -124.993506 5110917 1
496255 eef40c88 2022-02-07 39878.0 Northwest Territories 65.280365 -121.562220 44904 1
475439 e61d6191 2022-02-07 6763655.0 Alberta 54.500614 -115.002842 4413146 1
474047 e4c07903 2022-02-07 1452358.0 Saskatchewan 54.500038 -105.927063 1181666 1
264912 7fc88543 2022-02-07 32179.0 Nunavut 66.001041 -100.263618 38780 1
99343 2a9fd65a 2022-02-07 1434252.0 Manitoba 54.510344 -97.212207 1377517 1
276313 83fc0fa9 2022-02-07 22606233.0 Ontario 49.269156 -87.166464 14711827 1
347935 a7ce33b9 2022-02-07 16015314.0 Quebec 53.889046 -73.288937 8537674 1
443254 d177e539 2022-02-07 704856.0 New Brunswick 46.551245 -66.411970 779993 1
390749 b91ff4d1 2022-02-07 1728364.0 Nova Scotia 44.727022 -64.602949 977457 1
4180 015d95fc 2022-02-07 255565.0 Prince Edward Island 46.503836 -63.615584 158158 1
441298 d07806cb 2022-02-07 551866.0 Newfoundland and Labrador 49.120554 -56.692576 521365 1

Since the tests column is cumulative, the row for the latest date in each province gives the total number of tests conducted in that province.
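
An alternative way to pick the latest row per province, shown here on invented data, is to sort by date and keep the last row of each group. This is a sketch under the assumption that the cumulative column is complete on the latest date; it is not the approach used in the cells above, which rely on the `date_order` column.

```python
import pandas as pd

# Invented cumulative test counts for two provinces (not real data).
toy = pd.DataFrame({
    "province": ["X", "X", "X", "Y", "Y"],
    "date": pd.to_datetime(
        ["2022-02-05", "2022-02-06", "2022-02-07", "2022-02-06", "2022-02-07"]
    ),
    "tests": [10.0, 15.0, 18.0, 7.0, 9.0],
})

# Keep the most recent row per province; for a cumulative column this
# row holds the running total.
latest = toy.sort_values("date").groupby("province").tail(1).set_index("province")
print(latest["tests"])
```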

plt.figure(figsize = (30, 8))
sns.barplot(x = "province", y = "tests", \
            data = canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"]))
plt.ylabel("The number of tests conducted")
plt.xlabel("Province")
plt.show()

[Figure: bar plot of the total number of tests conducted in each province, ordered from west to east]

In the bar chart above, provinces are ordered from west to east by sorting on longitude first and then latitude. Ontario has conducted the most tests (about 22.6 million), followed by Quebec (about 16.0 million). Outside these two provinces, the western provinces conducted more tests than the eastern ones. However, this comparison does not take the population of each province into account, so population needs to be considered as well.

Q6 Create a bar plot that shows the number of tests conducted per 1000 people for each province in Canada.

How would you interpret the results of this bar plot given the results of bar plot in question 5.

canada_test_conducted.loc[canada_test_conducted.date_order == 1, "tests/pop"] = canada_test_conducted[canada_test_conducted.date_order == 1].tests \
                                                                            / canada_test_conducted[canada_test_conducted.date_order == 1].population \
                                                                            * 1000
canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"])
id date tests province latitude longitude population date_order tests/pop
124865 38791b01 2022-02-07 9129.0 Yukon 64.819450 -136.804579 41078 1 222.235747
407259 c229681f 2022-02-07 5495428.0 British Columbia 54.499851 -124.993506 5110917 1 1075.233270
496255 eef40c88 2022-02-07 39878.0 Northwest Territories 65.280365 -121.562220 44904 1 888.072332
475439 e61d6191 2022-02-07 6763655.0 Alberta 54.500614 -115.002842 4413146 1 1532.615282
474047 e4c07903 2022-02-07 1452358.0 Saskatchewan 54.500038 -105.927063 1181666 1 1229.076575
264912 7fc88543 2022-02-07 32179.0 Nunavut 66.001041 -100.263618 38780 1 829.783394
99343 2a9fd65a 2022-02-07 1434252.0 Manitoba 54.510344 -97.212207 1377517 1 1041.186425
276313 83fc0fa9 2022-02-07 22606233.0 Ontario 49.269156 -87.166464 14711827 1 1536.602694
347935 a7ce33b9 2022-02-07 16015314.0 Quebec 53.889046 -73.288937 8537674 1 1875.840422
443254 d177e539 2022-02-07 704856.0 New Brunswick 46.551245 -66.411970 779993 1 903.669648
390749 b91ff4d1 2022-02-07 1728364.0 Nova Scotia 44.727022 -64.602949 977457 1 1768.225098
4180 015d95fc 2022-02-07 255565.0 Prince Edward Island 46.503836 -63.615584 158158 1 1615.884116
441298 d07806cb 2022-02-07 551866.0 Newfoundland and Labrador 49.120554 -56.692576 521365 1 1058.502201
plt.figure(figsize = (30, 8))
sns.barplot(x = "province", y = "tests/pop", \
            data = canada_test_conducted[canada_test_conducted.date_order == 1].sort_values(["longitude", "latitude"]))
plt.ylabel("The number of tests conducted per 1000 people")
plt.xlabel("Province")
plt.show()

[Figure: bar plot of tests conducted per 1000 people in each province, ordered from west to east]

The graph in Q5 shows that Ontario and Quebec conducted far more tests than any other province, but the graph in Q6 shows that this is largely because these two provinces have by far the largest populations. In tests per 1000 people, Quebec is the highest (about 1,876), and Ontario (about 1,537) is lower than Nova Scotia (about 1,768) and Prince Edward Island (about 1,616). In addition, the number of tests per 1000 people is slightly higher in the east than in the west.
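
The change in ranking can be checked with plain arithmetic. The sketch below uses the totals and populations for three provinces copied from the table above (as of 2022-02-07); the dictionary names are introduced only for this example.

```python
# Totals and populations copied from the table above (as of 2022-02-07).
tests = {"Ontario": 22606233, "Quebec": 16015314, "Nova Scotia": 1728364}
population = {"Ontario": 14711827, "Quebec": 8537674, "Nova Scotia": 977457}

# Tests per 1000 people.
per_1000 = {p: tests[p] / population[p] * 1000 for p in tests}

# Rank by raw total and by population-adjusted rate.
rank_total = sorted(tests, key=tests.get, reverse=True)
rank_per_1000 = sorted(per_1000, key=per_1000.get, reverse=True)

print(rank_total)
print(rank_per_1000)
```

Ontario leads on raw totals but falls behind both Quebec and Nova Scotia once population is taken into account.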

Q7 Create a visualization that shows which Canadian province has the highest recovery rate.

Recovery rate is calculated as the number of recovered cases divided by the number of confirmed cases.

canada_recovery = canada[["id", "date", "confirmed", "recovered", "administrative_area_level_2", "latitude", "longitude"]].rename(columns = {"administrative_area_level_2" : "province"})
canada_recovery.dropna(inplace = True)
canada_recovery["date_order"] = canada_recovery.sort_values(["province","date"], ascending = [True, False]).groupby("province").cumcount() + 1
canada_recovery.loc[canada_recovery.date_order == 1, "recovery_rate"] = canada_recovery[canada_recovery.date_order == 1].recovered \
                                                                            / canada_recovery[canada_recovery.date_order == 1].confirmed
canada_recovery.loc[canada_recovery.date_order == 1].sort_values("recovery_rate", ascending = False)
id date confirmed recovered province latitude longitude date_order recovery_rate
124865 38791b01 2022-02-07 3235.0 3105.0 Yukon 64.819450 -136.804579 1 0.959815
276313 83fc0fa9 2022-02-07 1056149.0 1010878.0 Ontario 49.269156 -87.166464 1 0.957136
347935 a7ce33b9 2022-02-07 883192.0 834633.0 Quebec 53.889046 -73.288937 1 0.945019
475439 e61d6191 2022-02-07 508051.0 474284.0 Alberta 54.500614 -115.002842 1 0.933536
407259 c229681f 2022-02-07 333925.0 306419.0 British Columbia 54.499851 -124.993506 1 0.917628
474047 e4c07903 2022-02-07 123258.0 113023.0 Saskatchewan 54.500038 -105.927063 1 0.916963
441298 d07806cb 2022-02-07 18740.0 17156.0 Newfoundland and Labrador 49.120554 -56.692576 1 0.915475
390749 b91ff4d1 2022-02-07 40767.0 37300.0 Nova Scotia 44.727022 -64.602949 1 0.914956
443254 d177e539 2022-02-07 31017.0 27298.0 New Brunswick 46.551245 -66.411970 1 0.880098
496255 eef40c88 2022-02-07 6846.0 5925.0 Northwest Territories 65.280365 -121.562220 1 0.865469
99343 2a9fd65a 2022-02-07 123739.0 103595.0 Manitoba 54.510344 -97.212207 1 0.837206
4180 015d95fc 2022-02-07 9104.0 7268.0 Prince Edward Island 46.503836 -63.615584 1 0.798330
264912 7fc88543 2022-02-07 1989.0 1444.0 Nunavut 66.001041 -100.263618 1 0.725993
plt.figure(figsize = (30, 8))
sns.barplot(x = "province", y = "recovery_rate", \
            data = canada_recovery[canada_recovery.date_order == 1].sort_values(["longitude", "latitude"]))
plt.ylabel("Recovery rate = # of recovered cases / # of confirmed cases")
plt.xlabel("Province")
plt.show()

[Figure: bar plot of recovery rate by province, ordered from west to east]

Yukon has the highest recovery rate (0.96) and Nunavut has the lowest (0.73). Yukon had the lowest number of tests per 1000 people but the highest recovery rate. Nunavut had about 830 tests per 1000 people, the second-lowest figure among the provinces and territories, and also had the lowest recovery rate.

Q8 Create visualizations that show the impacts of _at least_ three policy measures on mortality or infection rates.

See https://covid19datahub.io/articles/doc/data.html for descriptions of the available policy measures. You are not limited to histograms and bar charts. Remember that you can use subplots!

Gatherings restrictions

0 - no restrictions
1 - restrictions on very large gatherings (the limit is above 1000 people)
2 - restrictions on gatherings between 101-1000 people
3 - restrictions on gatherings between 11-100 people
4 - restrictions on gatherings of 10 people or less

If gathering restrictions are targeted to a specific geographical region, they are coded as negative values. If they are a general policy applied across the whole country or territory, they are coded as positive values. Since I am only interested in the impact of general policies on infection rates, I will keep only the rows with non-negative gatherings_restrictions values (level 0, no restrictions, is kept as the baseline).

canada_gatherings_restriction = canada[["date", "confirmed", "population", "gatherings_restrictions", "administrative_area_level_2"]].rename(columns = {"administrative_area_level_2" : "province", "confirmed" : "cum_confirmed"})
canada_gatherings_restriction = canada_gatherings_restriction[canada_gatherings_restriction.cum_confirmed.notnull()]
canada_gatherings_restriction = canada_gatherings_restriction[canada_gatherings_restriction.gatherings_restrictions >= 0].sort_values(["province", "date"])
canada_gatherings_restriction["daily_confirmed"] = canada_gatherings_restriction.sort_values(["province", "date"]).groupby("province").cum_confirmed.diff().fillna(canada_gatherings_restriction.cum_confirmed)
canada_gatherings_restriction = canada_gatherings_restriction[canada_gatherings_restriction.daily_confirmed >= 0]
canada_gatherings_restriction = canada_gatherings_restriction[["date", "province", "population", "gatherings_restrictions", "daily_confirmed", "cum_confirmed"]]
canada_gatherings_restriction_infection_rate = canada_gatherings_restriction.groupby(["gatherings_restrictions", "date"])[["daily_confirmed", "population"]].sum().reset_index()
canada_gatherings_restriction_infection_rate["infection_rate"] = canada_gatherings_restriction_infection_rate.daily_confirmed / canada_gatherings_restriction_infection_rate.population * 100
canada_gatherings_restriction_infection_rate
gatherings_restrictions date daily_confirmed population infection_rate
0 0.0 2020-01-31 4.0 19822744 0.000020
1 0.0 2020-02-08 3.0 19822744 0.000015
2 0.0 2020-02-16 1.0 19822744 0.000005
3 0.0 2020-02-21 1.0 19822744 0.000005
4 0.0 2020-02-24 1.0 19822744 0.000005
... ... ... ... ... ...
1854 4.0 2022-01-31 2073.0 4499128 0.046076
1855 4.0 2022-02-01 2188.0 4458050 0.049080
1856 4.0 2022-02-02 3172.0 4458050 0.071152
1857 4.0 2022-02-03 2512.0 4458050 0.056348
1858 4.0 2022-02-04 2273.0 4458050 0.050986

1859 rows × 5 columns

plt.figure(figsize = (20, 8))
sns.boxplot(data = canada_gatherings_restriction_infection_rate, x = "gatherings_restrictions", y = "infection_rate", showfliers = False)
plt.xlabel("Gathering restrictions", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

[Figure: box plot of the daily infection rate (%) by gatherings restriction level]

The graph above shows a box plot of the daily infection rate for each gatherings restriction level. The infection rate was high when there was no restriction and considerably lower at levels 1 to 3. However, at the strictest level, 4, the infection rate was high again. One possible explanation is that the strictest restrictions tend to be imposed when cases are already surging, so this plot alone does not show that level 4 restrictions increase infections.
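
The pooled infection rate behind this box plot can be sketched on a few invented province-day records. The frame `toy` and its values are made up; only the column names follow the code above. The point of the example is that provinces sharing a restriction level on the same day are pooled, so the population in the denominator changes as provinces move between levels.

```python
import pandas as pd

# Invented province-day records (not real data).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"] * 3 + ["2021-01-02"] * 3),
    "province": ["X", "Y", "Z"] * 2,
    "gatherings_restrictions": [4, 4, -3, 4, 2, -3],
    "daily_confirmed": [50.0, 30.0, 5.0, 40.0, 10.0, 6.0],
    "population": [1000, 3000, 500, 1000, 3000, 500],
})

# Keep general (non-negative) policies only, mirroring the filter above.
general = toy[toy["gatherings_restrictions"] >= 0]

# Pool provinces that share a level on the same day, then divide the
# pooled cases by the pooled population.
pooled = (
    general.groupby(["gatherings_restrictions", "date"])[["daily_confirmed", "population"]]
    .sum()
    .reset_index()
)
pooled["infection_rate"] = pooled["daily_confirmed"] / pooled["population"] * 100
print(pooled)
```

In this toy example, province Y drops from level 4 to level 2 on the second day, so the level 4 denominator shrinks from 4000 to 1000 people.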

# 7-row rolling mean within each restriction level; transform keeps the result aligned with the original rows
canada_gatherings_restriction_infection_rate["infection_rate_moving_avg"] = canada_gatherings_restriction_infection_rate.sort_values(["gatherings_restrictions","date"]).groupby("gatherings_restrictions").infection_rate.transform(lambda s: s.rolling(7).mean())
plt.figure(figsize = (30, 8))
sns.lineplot(data = canada_gatherings_restriction_infection_rate, x = "date", y = "infection_rate_moving_avg", hue = "gatherings_restrictions")
plt.xlabel("date", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

[Figure: moving average of the infection rate (%) over time, one line per gatherings restriction level]

The graph above shows the infection rate for each restriction level over time. All five levels show a similar trend, but the infection rate is particularly high in areas under level 4 restrictions from October 2020 to June 2021, and in areas with no restrictions (level 0) from July 2021 to December 2021.
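
A per-level moving average like the one plotted here can be sketched with `groupby(...).transform(...)`, which returns values aligned to the original rows. The example below uses invented rates and a 3-row window to keep it short; the frame `toy` and the column `rate_ma3` are made up for illustration.

```python
import pandas as pd

# Invented daily rates for two restriction levels (not real data).
toy = pd.DataFrame({
    "level": [0] * 4 + [4] * 4,
    "date": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "rate": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# transform() returns a result aligned to the original rows, so the
# rolling window never mixes values from different levels.
toy["rate_ma3"] = (
    toy.sort_values(["level", "date"])
    .groupby("level")["rate"]
    .transform(lambda s: s.rolling(3).mean())
)
print(toy)
```

Note that the window counts rows, not calendar days, so gaps in the dates for a given level make the window span more than the nominal number of days.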

Stay home restrictions

0 - no measures
1 - recommend not leaving house
2 - require not leaving house with exceptions for daily exercise, grocery shopping, and ‘essential’ trips
3 - require not leaving house with minimal exceptions (eg allowed to leave once a week, or only one person can leave at a time, etc)

If stay-home restrictions are targeted to a specific geographical region, they are coded as negative values. If they are a general policy applied across the whole country or territory, they are coded as positive values. Since I am only interested in the impact of general policies on infection rates, I will keep only the rows with non-negative stay_home_restrictions values (level 0, no measures, is kept as the baseline).

canada_stayhome_restriction = canada[["date", "confirmed", "population", "stay_home_restrictions", "administrative_area_level_2"]].rename(columns = {"administrative_area_level_2" : "province", "confirmed" : "cum_confirmed"})
canada_stayhome_restriction = canada_stayhome_restriction[canada_stayhome_restriction.cum_confirmed.notnull()]
canada_stayhome_restriction = canada_stayhome_restriction[canada_stayhome_restriction.stay_home_restrictions >= 0].sort_values(["province", "date"])
canada_stayhome_restriction["daily_confirmed"] = canada_stayhome_restriction.sort_values(["province", "date"]).groupby("province").cum_confirmed.diff().fillna(canada_stayhome_restriction.cum_confirmed)
canada_stayhome_restriction = canada_stayhome_restriction[canada_stayhome_restriction.daily_confirmed >= 0]
canada_stayhome_restriction = canada_stayhome_restriction[["date", "province", "population", "stay_home_restrictions", "daily_confirmed", "cum_confirmed"]]
canada_stayhome_restriction_infection_rate = canada_stayhome_restriction.groupby(["stay_home_restrictions", "date"]).sum()[["daily_confirmed", "population"]].reset_index() 
canada_stayhome_restriction_infection_rate["infection_rate"] = canada_stayhome_restriction_infection_rate.daily_confirmed / canada_stayhome_restriction_infection_rate.population * 100
canada_stayhome_restriction_infection_rate
stay_home_restrictions date daily_confirmed population infection_rate
0 0.0 2020-01-31 4.0 19822744 0.000020
1 0.0 2020-02-08 3.0 19822744 0.000015
2 0.0 2020-02-16 1.0 19822744 0.000005
3 0.0 2020-02-21 1.0 19822744 0.000005
4 0.0 2020-02-24 1.0 19822744 0.000005
... ... ... ... ... ...
1389 2.0 2022-01-12 8351.0 8537674 0.097814
1390 2.0 2022-01-13 8793.0 8537674 0.102991
1391 2.0 2022-01-14 7382.0 8537674 0.086464
1392 2.0 2022-01-15 6705.0 8537674 0.078534
1393 2.0 2022-01-16 5946.0 8537674 0.069644

1394 rows × 5 columns

plt.figure(figsize = (20, 8))
sns.boxplot(data = canada_stayhome_restriction_infection_rate, x = "stay_home_restrictions", y = "infection_rate", showfliers = False)
plt.xlabel("Stay home restrictions", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

[Figure: box plot of daily infection rate (%) by stay-home restriction level]

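The graph above shows a box plot of the daily infection rate for each stay-home restriction level. Counterintuitively, the higher the restriction level, the higher the infection rate. This does not mean that restrictions raise infections; a plausible explanation is that stricter measures tend to be imposed when infections are already high.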
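The comparison read off the box plot can also be checked numerically. The sketch below uses made-up rates (not the real Canadian values) to show how the quartiles that a box plot draws could be tabulated per restriction level; the `rates` and `summary` names are assumptions for illustration.

```python
import pandas as pd

# Made-up daily infection rates (%) for three restriction levels.
rates = pd.DataFrame({
    "stay_home_restrictions": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "infection_rate": [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.010, 0.020, 0.030],
})

# Count, quartiles and median per level: the same quantities the boxes represent.
summary = (
    rates.groupby("stay_home_restrictions")["infection_rate"]
    .describe()[["count", "25%", "50%", "75%"]]
)
print(summary)
```

Applied to `canada_stayhome_restriction_infection_rate`, the same call would also show how many days fall under each level, which helps judge how much weight each box deserves.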

canada_stayhome_restriction_infection_rate["infection_rate_moving_avg"] = canada_stayhome_restriction_infection_rate.sort_values(["stay_home_restrictions","date"]).groupby("stay_home_restrictions").infection_rate.transform(lambda s: s.rolling(7).mean())
plt.figure(figsize = (30, 8))
sns.lineplot(data = canada_stayhome_restriction_infection_rate, x = "date", y = "infection_rate_moving_avg", hue = "stay_home_restrictions")
plt.xlabel("date", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

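[Figure: line plot of the 7-point moving average of infection rate (%) over time, by stay-home restriction level]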

The overall trend is similar across levels, but the level 2 restriction shows a particularly high infection rate from April 2020 to January 2021.
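One caveat about the smoothing: `rolling(7)` averages over seven rows, and a restriction level does not have a row for every calendar day, so seven rows can span more than a week. A minimal sketch of a calendar-based alternative follows, assuming `date` is a datetime column; the toy values and the `rate_7d_avg` column name are hypothetical.

```python
import pandas as pd

# Toy data with a gap: level 0 has no rows between 2020-01-02 and 2020-01-20.
toy = pd.DataFrame({
    "stay_home_restrictions": [0, 0, 0, 1, 1],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-20", "2020-01-01", "2020-01-03"]
    ),
    "infection_rate": [1.0, 3.0, 10.0, 2.0, 4.0],
})

parts = []
for _, grp in toy.sort_values("date").groupby("stay_home_restrictions"):
    # A "7D" window covers the 7 calendar days ending at each row's date.
    by_date = grp.set_index("date")["infection_rate"]
    smoothed = by_date.rolling("7D").mean()
    # Re-attach the original row labels so the assignment below aligns by index.
    parts.append(pd.Series(smoothed.to_numpy(), index=grp.index))

toy["rate_7d_avg"] = pd.concat(parts)
print(toy)
```

With a time-based window, the value on 2020-01-20 is averaged only over rows from the preceding seven days, instead of reaching back to early January.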

Transport closing

0 - no measures
1 - recommend closing (or significantly reduce volume/route/means of transport available)
2 - require closing (or prohibit most citizens from using it)

If a transport closing restriction targets a specific geographical region, it is recorded as a negative value. If it is a general policy applied across the whole country/territory, it is recorded as a positive value. Since I am only interested in the impact of general policies on infection rates, I keep only the rows with non-negative transport closing values, which includes level 0 (no measures).
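The filtering and aggregation steps are the same for every policy column, so they could be wrapped in one function. The sketch below is a hypothetical helper (`infection_rate_by_policy` is not from the original notebook) that assumes the columns have already been renamed to `province` and `cum_confirmed`; it is demonstrated on made-up data.

```python
import pandas as pd

def infection_rate_by_policy(df, policy_col):
    """Hypothetical helper: daily infection rate (%) per policy level and date."""
    cols = ["date", "province", "population", "cum_confirmed", policy_col]
    # Keep rows with a case count and a general (non-negative) policy value.
    out = df.loc[df["cum_confirmed"].notna() & (df[policy_col] >= 0), cols]
    out = out.sort_values(["province", "date"])
    # The first row of each province has no previous row, so it keeps its cumulative count.
    out["daily_confirmed"] = (
        out.groupby("province")["cum_confirmed"].diff().fillna(out["cum_confirmed"])
    )
    out = out[out["daily_confirmed"] >= 0]
    # Sum cases and population over provinces sharing a level on a given date.
    agg = out.groupby([policy_col, "date"], as_index=False)[
        ["daily_confirmed", "population"]
    ].sum()
    agg["infection_rate"] = agg["daily_confirmed"] / agg["population"] * 100
    return agg

# Made-up example: province B switches to a targeted (negative) policy on day 2.
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "province": ["A", "A", "B", "B"],
    "population": [1000, 1000, 500, 500],
    "cum_confirmed": [10, 30, 5, 5],
    "transport_closing": [0, 1, 0, -1],
})
result = infection_rate_by_policy(toy, "transport_closing")
print(result)
```

Calling it once per policy column would replace the repeated cells and keep the preprocessing identical across policies.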

# Keep only the columns needed and give them shorter names
canada_transport_closing = canada[["date", "confirmed", "population", "transport_closing", "administrative_area_level_2"]].rename(columns = {"administrative_area_level_2" : "province", "confirmed" : "cum_confirmed"})
# Drop rows without a cumulative case count
canada_transport_closing = canada_transport_closing[canada_transport_closing.cum_confirmed.notnull()]
# Keep general (non-negative) policies only, ordered by province and date
canada_transport_closing = canada_transport_closing[canada_transport_closing.transport_closing >= 0].sort_values(["province", "date"])
# Daily cases = change in cumulative cases within each province; the first row of a province keeps its cumulative count
canada_transport_closing["daily_confirmed"] = canada_transport_closing.groupby("province").cum_confirmed.diff().fillna(canada_transport_closing.cum_confirmed)
# Drop rows where the cumulative count decreased
canada_transport_closing = canada_transport_closing[canada_transport_closing.daily_confirmed >= 0]
canada_transport_closing = canada_transport_closing[["date", "province", "population", "transport_closing", "daily_confirmed", "cum_confirmed"]]
# Sum cases and population over the provinces that share a restriction level on a given date
canada_transport_closing_infection_rate = canada_transport_closing.groupby(["transport_closing", "date"])[["daily_confirmed", "population"]].sum().reset_index()
canada_transport_closing_infection_rate["infection_rate"] = canada_transport_closing_infection_rate.daily_confirmed / canada_transport_closing_infection_rate.population * 100
canada_transport_closing_infection_rate
transport_closing date daily_confirmed population infection_rate
0 0.0 2020-01-31 4.0 19822744 0.000020
1 0.0 2020-02-08 3.0 19822744 0.000015
2 0.0 2020-02-16 1.0 19822744 0.000005
3 0.0 2020-02-21 1.0 19822744 0.000005
4 0.0 2020-02-24 1.0 19822744 0.000005
... ... ... ... ... ...
1256 1.0 2022-01-27 378.0 521365 0.072502
1257 1.0 2022-01-28 265.0 521365 0.050828
1258 1.0 2022-01-29 208.0 521365 0.039895
1259 1.0 2022-01-30 210.0 521365 0.040279
1260 1.0 2022-01-31 183.0 521365 0.035100

1261 rows × 5 columns

plt.figure(figsize = (20, 8))
sns.boxplot(data = canada_transport_closing_infection_rate, x = "transport_closing", y = "infection_rate", showfliers = False)
plt.xlabel("Transport closing", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

[Figure: box plot of daily infection rate (%) by transport closing level]

The graph above shows a box plot of the daily infection rate for each transport closing level. Level 2 does not appear in the filtered Canadian data. The infection rate under level 1, which only recommends closing transport, differs little from the rate under level 0.
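Whether a level is absent can be confirmed directly by counting rows per level. A small sketch on made-up values follows; `days_per_level` and `missing_levels` are illustrative names.

```python
import pandas as pd

# Made-up policy values: levels 0 and 1 occur, plus one targeted (negative) entry.
toy = pd.DataFrame({"transport_closing": [0, 0, 1, 1, 1, -1]})

# Count rows per general (non-negative) level.
days_per_level = (
    toy.loc[toy["transport_closing"] >= 0, "transport_closing"]
    .value_counts()
    .sort_index()
)

# Compare against the levels defined in the codebook (0, 1, 2).
missing_levels = sorted({0, 1, 2} - set(days_per_level.index))
print(days_per_level)
print("Levels with no rows:", missing_levels)
```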

canada_transport_closing_infection_rate["infection_rate_moving_avg"] = canada_transport_closing_infection_rate.sort_values(["transport_closing","date"]).groupby("transport_closing").infection_rate.transform(lambda s: s.rolling(7).mean())
plt.figure(figsize = (30, 8))
sns.lineplot(data = canada_transport_closing_infection_rate, x = "date", y = "infection_rate_moving_avg", hue = "transport_closing")
plt.xlabel("date", fontsize = 16)
plt.ylabel("Infection rate (%) = confirmed cases / population * 100", fontsize = 16)
plt.show()

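[Figure: line plot of the 7-point moving average of infection rate (%) over time, by transport closing level]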

Levels 0 and 1 show almost no difference in infection rate over time.