HW2. More Data Manipulation
Topics: Data manipulation, EDA (Exploratory Data Analysis)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
Background
You’re a Data Science Consultant for an eCommerce retail company, and they’ve asked you to analyze their sales database. Unfortunately, they did nothing to prepare or clean their data, only exporting their 3 database tables as JSON files. It’s up to you to clean their data, analyze it, and answer questions to help drive business value!
The below files have been provided via the URLs shown:
- invoices.json https://github.com/umsi-data-science/data/raw/main/invoices.json
- items.json https://github.com/umsi-data-science/data/raw/main/items.json
- purchases.json https://github.com/umsi-data-science/data/raw/main/purchases.json
They provided this data dictionary:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
A few notes from the company:
- If the InvoiceNo starts with the letter ‘c’, it indicates a cancellation. When conducting this analysis we only want to analyze invoices that were shipped (i.e., not canceled).
- The datasets should be able to be merged, each row in the invoice table corresponds to multiple rows in the purchases table.
- To find out the description or unit cost of an item in the purchase table, the StockCode should be used to match up the product in the items table.
- They mentioned that they’ve been having a difficult time lately joining the items and purchases table, maybe there’s something wrong with the columns?
Answer the questions below.
- Write your Python code that can answer the following questions,
- and explain ALL your answers in plain English.
- you can use as many code and markdown cells as you need for each question (i.e. don’t limit yourself to just one of each if you feel you need more).
MY_UNIQNAME = 'yjwoo' # replace this with your uniqname
Q1. [5 points] Describe the dataset.
- Load the data.
- How many total invoices have been placed?
- How many unique customers are there?
- What is the total number of unique items?
- Are there any columns with null values?
- Thinking ahead, how do you think you would join the different tables? Please share 2-3 sentences about your approach.
Load the data
df_invoice = pd.read_json("https://github.com/umsi-data-science/data/raw/main/invoices.json")
df_invoice.head()
| InvoiceNo | InvoiceDate | CustomerID | Country | |
|---|---|---|---|---|
| 0 | 536365 | 12/1/10 8:26 | 17850.0 | United Kingdom |
| 1 | 536366 | 12/1/10 8:28 | 17850.0 | United Kingdom |
| 2 | 536367 | 12/1/10 8:34 | 13047.0 | United Kingdom |
| 3 | 536368 | 12/1/10 8:34 | 13047.0 | United Kingdom |
| 4 | 536369 | 12/1/10 8:35 | 13047.0 | United Kingdom |
df_item = pd.read_json("https://github.com/umsi-data-science/data/raw/main/items.json")
df_item.head()
| StockCode | Description | UnitPrice | |
|---|---|---|---|
| 0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 |
| 1 | 71053 | WHITE METAL LANTERN | 3.39 |
| 2 | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 |
| 3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 |
| 4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 |
df_purchase = pd.read_json("https://github.com/umsi-data-science/data/raw/main/purchases.json")
df_purchase.head()
| InvoiceNo | StockCodeSC | Quantity | |
|---|---|---|---|
| 0 | 536365 | SC85123A | 6 |
| 1 | 536365 | SC71053 | 6 |
| 2 | 536365 | SC84406B | 8 |
| 3 | 536365 | SC84029G | 6 |
| 4 | 536365 | SC84029E | 6 |
With the code above, we have loaded the invoice, item, and purchase data.
Invoice data
df_invoice.shape
(25943, 4)
The original invoice data has 25943 rows and 4 columns.
Invoice number
Since we are only interested in the invoices that were shipped, let’s exclude the invoices whose InvoiceNo starts with ‘C’.
np.sum(df_invoice.InvoiceNo.str.startswith('C'))
3837
A total of 3837 rows correspond to canceled invoices, i.e. invoices that were not shipped.
df_invoice_shipped = df_invoice[df_invoice.InvoiceNo.str.startswith('C') == False]
df_invoice_shipped.shape
(22106, 4)
df_invoice_shipped contains only the shipped invoices and has a total of 22106 rows.
df_invoice_shipped[df_invoice_shipped.InvoiceNo.str.startswith('5') == False]
| InvoiceNo | InvoiceDate | CustomerID | Country | |
|---|---|---|---|---|
| 15466 | A563185 | 8/12/11 14:50 | NaN | United Kingdom |
| 15467 | A563186 | 8/12/11 14:51 | NaN | United Kingdom |
| 15468 | A563187 | 8/12/11 14:52 | NaN | United Kingdom |
Three invoice numbers differ from the rest in that they start with “A”. Let’s check whether the purchase data also contains these three invoice numbers.
df_purchase[df_purchase.InvoiceNo.str.startswith('A')]
| InvoiceNo | StockCodeSC | Quantity | |
|---|---|---|---|
| 299982 | A563185 | SCB | 1 |
| 299983 | A563186 | SCB | 1 |
| 299984 | A563187 | SCB | 1 |
Since the purchase data contains the corresponding rows for these three unusual invoice numbers, the analysis proceeds without deleting them.
Customer ID
df_invoice_shipped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22106 entries, 0 to 25942
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 22106 non-null object
1 InvoiceDate 22106 non-null object
2 CustomerID 18566 non-null float64
3 Country 22106 non-null object
dtypes: float64(1), object(3)
memory usage: 863.5+ KB
np.sum(df_invoice_shipped.CustomerID.isnull())
3540
We can see that there are 3540 missing values in the “CustomerID” column. (Q1 - 5)
df_invoice_shipped[df_invoice_shipped.CustomerID.isnull()]
| InvoiceNo | InvoiceDate | CustomerID | Country | |
|---|---|---|---|---|
| 46 | 536414 | 12/1/10 11:52 | NaN | United Kingdom |
| 89 | 536544 | 12/1/10 14:32 | NaN | United Kingdom |
| 90 | 536545 | 12/1/10 14:32 | NaN | United Kingdom |
| 91 | 536546 | 12/1/10 14:33 | NaN | United Kingdom |
| 92 | 536547 | 12/1/10 14:33 | NaN | United Kingdom |
| ... | ... | ... | ... | ... |
| 25859 | 581435 | 12/8/11 16:14 | NaN | United Kingdom |
| 25863 | 581439 | 12/8/11 16:30 | NaN | United Kingdom |
| 25911 | 581492 | 12/9/11 10:03 | NaN | United Kingdom |
| 25916 | 581497 | 12/9/11 10:23 | NaN | United Kingdom |
| 25917 | 581498 | 12/9/11 10:26 | NaN | United Kingdom |
3540 rows × 4 columns
Even though df_invoice_shipped has missing values in the “CustomerID” column, we can still retrieve the purchase information for each transaction via the invoice number. Therefore, the analysis proceeds without deleting rows with a missing CustomerID.
Country
df_invoice_shipped.Country.value_counts()
United Kingdom 20161
Germany 457
France 393
EIRE 289
Belgium 98
...
Czech Republic 2
Saudi Arabia 1
Brazil 1
Lebanon 1
RSA 1
Name: Country, Length: 38, dtype: int64
The output above shows how many rows exist for each country. There are 13 rows with “Unspecified” in the “Country” column, which is effectively a missing value.
df_invoice_shipped[df_invoice_shipped.Country == "Unspecified"]
| InvoiceNo | InvoiceDate | CustomerID | Country | |
|---|---|---|---|---|
| 7594 | 549687 | 4/11/11 13:29 | 12363.0 | Unspecified |
| 9350 | 552695 | 5/10/11 15:31 | 16320.0 | Unspecified |
| 10041 | 553857 | 5/19/11 13:30 | NaN | Unspecified |
| 12173 | 557499 | 6/20/11 15:25 | 16320.0 | Unspecified |
| 13339 | 559521 | 7/8/11 16:26 | NaN | Unspecified |
| ... | ... | ... | ... | ... |
| 15911 | 563947 | 8/22/11 10:18 | 12363.0 | Unspecified |
| 15946 | 564051 | 8/22/11 13:32 | 14265.0 | Unspecified |
| 16621 | 565303 | 9/2/11 12:17 | NaN | Unspecified |
| 23151 | 576646 | 11/16/11 10:18 | NaN | Unspecified |
| 24289 | 578539 | 11/24/11 14:55 | NaN | Unspecified |
13 rows × 4 columns
However, since the purchase information for each transaction is still available via the invoice number, the analysis proceeds without deleting the “Unspecified” rows in the “Country” column.
Invoice date
np.sum(df_invoice_shipped.InvoiceDate.isnull())
0
There are no missing values in the invoice date. (Q1 - 5)
In principle, one invoice number should have one invoice date.
df_invoice_shipped = df_invoice_shipped.merge(
    df_invoice_shipped.groupby("InvoiceNo").count().InvoiceDate.reset_index().rename({"InvoiceDate" : "InvoiceDateCount"}, axis = 1),
    on = "InvoiceNo", how = "left")
df_invoice_shipped.head()
| InvoiceNo | InvoiceDate | CustomerID | Country | InvoiceDateCount | |
|---|---|---|---|---|---|
| 0 | 536365 | 12/1/10 8:26 | 17850.0 | United Kingdom | 1 |
| 1 | 536366 | 12/1/10 8:28 | 17850.0 | United Kingdom | 1 |
| 2 | 536367 | 12/1/10 8:34 | 13047.0 | United Kingdom | 1 |
| 3 | 536368 | 12/1/10 8:34 | 13047.0 | United Kingdom | 1 |
| 4 | 536369 | 12/1/10 8:35 | 13047.0 | United Kingdom | 1 |
df_invoice_shipped.InvoiceDateCount.sort_values()
0 1
14736 1
14735 1
14734 1
14733 1
..
6708 2
6863 2
6862 2
12573 2
5328 2
Name: InvoiceDateCount, Length: 22106, dtype: int64
From the above data, we can see that several invoice numbers have two different invoice dates.
pd.set_option('display.max_rows', df_invoice_shipped.shape[0]+1)
df_invoice_shipped.loc[df_invoice_shipped[df_invoice_shipped.InvoiceDateCount > 1].sort_values(by = ["InvoiceNo", "InvoiceDate", "CustomerID"]).index]
| InvoiceNo | InvoiceDate | CustomerID | Country | InvoiceDateCount | |
|---|---|---|---|---|---|
| 130 | 536591 | 12/1/10 16:57 | 14606.0 | United Kingdom | 2 |
| 131 | 536591 | 12/1/10 16:58 | 14606.0 | United Kingdom | 2 |
| 1789 | 540185 | 1/5/11 13:40 | 14653.0 | United Kingdom | 2 |
| 1790 | 540185 | 1/5/11 13:41 | 14653.0 | United Kingdom | 2 |
| 2374 | 541596 | 1/19/11 16:18 | 17602.0 | United Kingdom | 2 |
| 2375 | 541596 | 1/19/11 16:19 | 17602.0 | United Kingdom | 2 |
| 2389 | 541631 | 1/20/11 10:47 | 12637.0 | France | 2 |
| 2390 | 541631 | 1/20/11 10:48 | 12637.0 | France | 2 |
| 2454 | 541809 | 1/21/11 14:58 | NaN | United Kingdom | 2 |
| 2455 | 541809 | 1/21/11 14:59 | NaN | United Kingdom | 2 |
| 2462 | 541816 | 1/21/11 15:56 | 17799.0 | United Kingdom | 2 |
| 2463 | 541816 | 1/21/11 15:57 | 17799.0 | United Kingdom | 2 |
| 2481 | 541849 | 1/23/11 13:33 | 13230.0 | United Kingdom | 2 |
| 2482 | 541849 | 1/23/11 13:34 | 13230.0 | United Kingdom | 2 |
| 2640 | 542217 | 1/26/11 12:35 | 14606.0 | United Kingdom | 2 |
| 2641 | 542217 | 1/26/11 12:36 | 14606.0 | United Kingdom | 2 |
| 2942 | 542806 | 2/1/11 11:19 | 12836.0 | United Kingdom | 2 |
| 2943 | 542806 | 2/1/11 11:20 | 12836.0 | United Kingdom | 2 |
| 3116 | 543171 | 2/4/11 9:10 | NaN | United Kingdom | 2 |
| 3117 | 543171 | 2/4/11 9:11 | NaN | United Kingdom | 2 |
| 3124 | 543179 | 2/4/11 10:31 | 12754.0 | Japan | 2 |
| 3125 | 543179 | 2/4/11 10:32 | 12754.0 | Japan | 2 |
| 3128 | 543182 | 2/4/11 10:39 | NaN | United Kingdom | 2 |
| 3129 | 543182 | 2/4/11 10:40 | NaN | United Kingdom | 2 |
| 3408 | 543777 | 2/11/11 16:19 | 15406.0 | United Kingdom | 2 |
| 3409 | 543777 | 2/11/11 16:20 | 15406.0 | United Kingdom | 2 |
| 3597 | 544186 | 2/16/11 15:55 | NaN | United Kingdom | 2 |
| 3598 | 544186 | 2/16/11 15:56 | NaN | United Kingdom | 2 |
| 3825 | 544667 | 2/22/11 15:09 | 17511.0 | United Kingdom | 2 |
| 3826 | 544667 | 2/22/11 15:10 | 17511.0 | United Kingdom | 2 |
| 3956 | 544926 | 2/24/11 17:50 | 13468.0 | United Kingdom | 2 |
| 3957 | 544926 | 2/24/11 17:51 | 13468.0 | United Kingdom | 2 |
| 4212 | 545460 | 3/2/11 17:32 | 13230.0 | United Kingdom | 2 |
| 4213 | 545460 | 3/2/11 17:33 | 13230.0 | United Kingdom | 2 |
| 4369 | 545713 | 3/7/11 10:11 | NaN | United Kingdom | 2 |
| 4370 | 545713 | 3/7/11 10:12 | NaN | United Kingdom | 2 |
| 4660 | 546388 | 3/11/11 13:42 | NaN | United Kingdom | 2 |
| 4661 | 546388 | 3/11/11 13:43 | NaN | United Kingdom | 2 |
| 4957 | 546986 | 3/18/11 12:55 | 14194.0 | United Kingdom | 2 |
| 4958 | 546986 | 3/18/11 12:56 | 14194.0 | United Kingdom | 2 |
| 5328 | 547690 | 3/24/11 14:55 | 16063.0 | United Kingdom | 2 |
| 5329 | 547690 | 3/24/11 14:56 | 16063.0 | United Kingdom | 2 |
| 5591 | 548203 | 3/29/11 16:40 | NaN | United Kingdom | 2 |
| 5592 | 548203 | 3/29/11 16:41 | NaN | United Kingdom | 2 |
| 6148 | 549245 | 4/7/11 11:59 | 15005.0 | United Kingdom | 2 |
| 6149 | 549245 | 4/7/11 12:00 | 15005.0 | United Kingdom | 2 |
| 6308 | 549524 | 4/8/11 15:41 | NaN | United Kingdom | 2 |
| 6309 | 549524 | 4/8/11 15:42 | NaN | United Kingdom | 2 |
| 6708 | 550320 | 4/17/11 12:37 | 12748.0 | United Kingdom | 2 |
| 6709 | 550320 | 4/17/11 12:38 | 12748.0 | United Kingdom | 2 |
| 6722 | 550333 | 4/17/11 14:05 | 15410.0 | United Kingdom | 2 |
| 6723 | 550333 | 4/17/11 14:06 | 15410.0 | United Kingdom | 2 |
| 6858 | 550641 | 4/19/11 15:53 | NaN | United Kingdom | 2 |
| 6859 | 550641 | 4/19/11 15:54 | NaN | United Kingdom | 2 |
| 6862 | 550645 | 4/19/11 16:30 | 16098.0 | United Kingdom | 2 |
| 6863 | 550645 | 4/19/11 16:31 | 16098.0 | United Kingdom | 2 |
| 7550 | 552000 | 5/5/11 15:55 | NaN | United Kingdom | 2 |
| 7551 | 552000 | 5/5/11 15:56 | NaN | United Kingdom | 2 |
| 8183 | 553199 | 5/15/11 15:13 | 17758.0 | United Kingdom | 2 |
| 8184 | 553199 | 5/15/11 15:14 | 17758.0 | United Kingdom | 2 |
| 8242 | 553375 | 5/16/11 14:52 | 14911.0 | EIRE | 2 |
| 8243 | 553375 | 5/16/11 14:53 | 14911.0 | EIRE | 2 |
| 8352 | 553556 | 5/17/11 16:48 | 17530.0 | United Kingdom | 2 |
| 8353 | 553556 | 5/17/11 16:49 | 17530.0 | United Kingdom | 2 |
| 8671 | 554116 | 5/22/11 15:11 | 14514.0 | United Kingdom | 2 |
| 8672 | 554116 | 5/22/11 15:12 | 14514.0 | United Kingdom | 2 |
| 10560 | 558086 | 6/26/11 11:58 | 16728.0 | United Kingdom | 2 |
| 10561 | 558086 | 6/26/11 11:59 | 16728.0 | United Kingdom | 2 |
| 12182 | 561369 | 7/26/11 16:21 | NaN | United Kingdom | 2 |
| 12183 | 561369 | 7/26/11 16:22 | NaN | United Kingdom | 2 |
| 12572 | 562128 | 8/3/11 9:06 | 16150.0 | United Kingdom | 2 |
| 12573 | 562128 | 8/3/11 9:07 | 16150.0 | United Kingdom | 2 |
| 13124 | 563245 | 8/15/11 10:53 | 16272.0 | United Kingdom | 2 |
| 13125 | 563245 | 8/15/11 10:54 | 16272.0 | United Kingdom | 2 |
| 14981 | 567183 | 9/18/11 15:32 | 14769.0 | United Kingdom | 2 |
| 14982 | 567183 | 9/18/11 15:33 | 14769.0 | United Kingdom | 2 |
| 17238 | 571735 | 10/19/11 10:03 | 14229.0 | United Kingdom | 2 |
| 17239 | 571735 | 10/19/11 10:04 | 14229.0 | United Kingdom | 2 |
| 18394 | 574076 | 11/2/11 15:37 | NaN | United Kingdom | 2 |
| 18395 | 574076 | 11/2/11 15:38 | NaN | United Kingdom | 2 |
| 19373 | 576057 | 11/13/11 15:05 | 15861.0 | United Kingdom | 2 |
| 19374 | 576057 | 11/13/11 15:06 | 15861.0 | United Kingdom | 2 |
| 20668 | 578548 | 11/24/11 15:02 | 17345.0 | United Kingdom | 2 |
| 20669 | 578548 | 11/24/11 15:03 | 17345.0 | United Kingdom | 2 |
For each of these invoice numbers, the two invoice dates differ by only one minute and all other columns are identical, so I will keep only the earlier of the two dates.
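As a cross-check, the same deduplication can be done more compactly with drop_duplicates. This is only a sketch, assuming df_invoice_shipped as built above; the tmp and df_dedup names are illustrative.
# Sketch: keep only the earliest-dated row per invoice.
# Parse the dates first so the sort is chronological rather than lexicographic.
tmp = df_invoice_shipped.copy()
tmp["InvoiceDate"] = pd.to_datetime(tmp["InvoiceDate"])
df_dedup = tmp.sort_values(["InvoiceNo", "InvoiceDate"]).drop_duplicates(subset = "InvoiceNo", keep = "first")
df_dedup.shape  # expected: one row per invoice (22064 rows), matching the row-number approach below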
pd.set_option('display.max_rows', 10)
df_invoice_shipped
| InvoiceNo | InvoiceDate | CustomerID | Country | InvoiceDateCount | |
|---|---|---|---|---|---|
| 0 | 536365 | 12/1/10 8:26 | 17850.0 | United Kingdom | 1 |
| 1 | 536366 | 12/1/10 8:28 | 17850.0 | United Kingdom | 1 |
| 2 | 536367 | 12/1/10 8:34 | 13047.0 | United Kingdom | 1 |
| 3 | 536368 | 12/1/10 8:34 | 13047.0 | United Kingdom | 1 |
| 4 | 536369 | 12/1/10 8:35 | 13047.0 | United Kingdom | 1 |
| ... | ... | ... | ... | ... | ... |
| 22101 | 581583 | 12/9/11 12:23 | 13777.0 | United Kingdom | 1 |
| 22102 | 581584 | 12/9/11 12:25 | 13777.0 | United Kingdom | 1 |
| 22103 | 581585 | 12/9/11 12:31 | 15804.0 | United Kingdom | 1 |
| 22104 | 581586 | 12/9/11 12:49 | 13113.0 | United Kingdom | 1 |
| 22105 | 581587 | 12/9/11 12:50 | 12680.0 | France | 1 |
22106 rows × 5 columns
df_invoice_shipped["partition_row_number"] = df_invoice_shipped.sort_values(by = ["InvoiceNo", "InvoiceDate"]).groupby("InvoiceNo").cumcount() + 1
df_invoice_shipped.loc[df_invoice_shipped[df_invoice_shipped.InvoiceDateCount > 1].sort_values(by = ["InvoiceNo", "InvoiceDate", "CustomerID"]).index]
| InvoiceNo | InvoiceDate | CustomerID | Country | InvoiceDateCount | partition_row_number | |
|---|---|---|---|---|---|---|
| 130 | 536591 | 12/1/10 16:57 | 14606.0 | United Kingdom | 2 | 1 |
| 131 | 536591 | 12/1/10 16:58 | 14606.0 | United Kingdom | 2 | 2 |
| 1789 | 540185 | 1/5/11 13:40 | 14653.0 | United Kingdom | 2 | 1 |
| 1790 | 540185 | 1/5/11 13:41 | 14653.0 | United Kingdom | 2 | 2 |
| 2374 | 541596 | 1/19/11 16:18 | 17602.0 | United Kingdom | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 18395 | 574076 | 11/2/11 15:38 | NaN | United Kingdom | 2 | 2 |
| 19373 | 576057 | 11/13/11 15:05 | 15861.0 | United Kingdom | 2 | 1 |
| 19374 | 576057 | 11/13/11 15:06 | 15861.0 | United Kingdom | 2 | 2 |
| 20668 | 578548 | 11/24/11 15:02 | 17345.0 | United Kingdom | 2 | 1 |
| 20669 | 578548 | 11/24/11 15:03 | 17345.0 | United Kingdom | 2 | 2 |
84 rows × 6 columns
df_invoice_shipped = df_invoice_shipped[df_invoice_shipped.partition_row_number == 1]
df_invoice_shipped.drop(["InvoiceDateCount", "partition_row_number"], axis = 1, inplace = True)
df_invoice_shipped.head()
/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(
| InvoiceNo | InvoiceDate | CustomerID | Country | |
|---|---|---|---|---|
| 0 | 536365 | 12/1/10 8:26 | 17850.0 | United Kingdom |
| 1 | 536366 | 12/1/10 8:28 | 17850.0 | United Kingdom |
| 2 | 536367 | 12/1/10 8:34 | 13047.0 | United Kingdom |
| 3 | 536368 | 12/1/10 8:34 | 13047.0 | United Kingdom |
| 4 | 536369 | 12/1/10 8:35 | 13047.0 | United Kingdom |
df_invoice_shipped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22064 entries, 0 to 22105
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 22064 non-null object
1 InvoiceDate 22064 non-null object
2 CustomerID 18536 non-null float64
3 Country 22064 non-null object
dtypes: float64(1), object(3)
memory usage: 861.9+ KB
The “InvoiceDate” column is still of object (string) type, so it needs to be converted to the datetime type.
df_invoice_shipped["InvoiceDate"] = pd.to_datetime(df_invoice_shipped["InvoiceDate"])
df_invoice_shipped.head()
/var/folders/kb/9bgdwxjn0b751yc9w59v1ll00000gn/T/ipykernel_53930/964488366.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_invoice_shipped["InvoiceDate"] = pd.to_datetime(df_invoice_shipped["InvoiceDate"])
| InvoiceNo | InvoiceDate | CustomerID | Country | |
|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom |
| 1 | 536366 | 2010-12-01 08:28:00 | 17850.0 | United Kingdom |
| 2 | 536367 | 2010-12-01 08:34:00 | 13047.0 | United Kingdom |
| 3 | 536368 | 2010-12-01 08:34:00 | 13047.0 | United Kingdom |
| 4 | 536369 | 2010-12-01 08:35:00 | 13047.0 | United Kingdom |
df_invoice_shipped.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22064 entries, 0 to 22105
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 22064 non-null object
1 InvoiceDate 22064 non-null datetime64[ns]
2 CustomerID 18536 non-null float64
3 Country 22064 non-null object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 861.9+ KB
Invoice data after preprocessing
df_invoice_shipped.shape
(22064, 4)
After completing all the preprocessing for the invoice data, the invoice data has a total of 22064 rows and 4 columns.
len(df_invoice_shipped.InvoiceNo.unique())
22064
22064 unique invoices have been placed. (Q1-2)
len(df_invoice_shipped.CustomerID.unique())
4340
The count of 4340 includes the NaN placeholder, so there is a total of 4339 unique customers excluding missing values. (Q1-3)
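Alternatively, nunique() ignores missing values by default and gives the 4339 figure directly (a minimal sketch):
df_invoice_shipped.CustomerID.nunique()  # 4339 unique customers, NaN excluded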
Item data
df_item.head()
| StockCode | Description | UnitPrice | |
|---|---|---|---|
| 0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 |
| 1 | 71053 | WHITE METAL LANTERN | 3.39 |
| 2 | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 |
| 3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 |
| 4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 |
df_item.shape
(4070, 3)
The original item data has 4070 rows and 3 columns.
Stock code
df_item.loc[df_item.StockCode.sort_values().index]
| StockCode | Description | UnitPrice | |
|---|---|---|---|
| 31 | 10002 | INFLATABLE POLITICAL GLOBE | 0.85 |
| 3147 | 10080 | GROOVY CACTUS INFLATABLE | 0.85 |
| 1636 | 10120 | DOGGY RUBBER | 0.21 |
| 1635 | 10123C | HEARTS WRAPPING TAPE | 0.65 |
| 3308 | 10123G | None | 0.00 |
| ... | ... | ... | ... |
| 2843 | gift_0001_20 | Dotcomgiftshop Gift Voucher £20.00 | 17.02 |
| 2842 | gift_0001_30 | Dotcomgiftshop Gift Voucher £30.00 | 25.53 |
| 2774 | gift_0001_40 | Dotcomgiftshop Gift Voucher £40.00 | 34.04 |
| 2815 | gift_0001_50 | Dotcomgiftshop Gift Voucher £50.00 | 42.55 |
| 2791 | m | Manual | 2.55 |
4070 rows × 3 columns
In the output above, the stock code in the last row is an unusual value, “m”. However, since it has a valid description (“Manual”) and unit price, it can be treated as a valid stock code.
np.sum(df_item.StockCode.isnull())
0
There are no missing values in the stock code column. (Q1 - 5)
I think in principle there should be one description and price for each stock code.
df_item = df_item.merge(
    df_item.groupby("StockCode").count().reset_index().rename({"Description" : "description_count", "UnitPrice" : "unit_price_count"}, axis = 1),
    on = "StockCode", how = "left")
df_item
| StockCode | Description | UnitPrice | description_count | unit_price_count | |
|---|---|---|---|---|---|
| 0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 1 | 1 |
| 1 | 71053 | WHITE METAL LANTERN | 3.39 | 1 | 1 |
| 2 | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 1 | 1 |
| 3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 1 | 1 |
| 4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 1 | 1 |
| ... | ... | ... | ... | ... | ... |
| 4065 | 85179a | GREEN BITTY LIGHT CHAIN | 2.46 | 1 | 1 |
| 4066 | 23617 | SET 10 CARDS SWIRLY XMAS TREE 17104 | 2.91 | 1 | 1 |
| 4067 | 90214U | LETTER "U" BLING KEY RING | 0.29 | 1 | 1 |
| 4068 | 47591b | SCOTTIES CHILDRENS APRON | 4.13 | 1 | 1 |
| 4069 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 2.08 | 1 | 1 |
4070 rows × 5 columns
df_item.description_count.sort_values()
3303 0
2913 0
2399 0
2400 0
2401 0
..
1363 1
1364 1
1365 1
1352 1
4069 1
Name: description_count, Length: 4070, dtype: int64
df_item.unit_price_count.sort_values()
0 1
2705 1
2706 1
2707 1
2708 1
..
1362 1
1363 1
1364 1
1351 1
4069 1
Name: unit_price_count, Length: 4070, dtype: int64
From the two results above, we can see that each stock code has at most one description and one price value, as initially expected, so no further cleaning of the stock code is needed.
df_item.drop(["description_count", "unit_price_count"], axis = 1, inplace = True)
len(df_item.StockCode.unique())
4070
There is a total of 4070 unique items. (Q1 - 4)
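The uniqueness of the stock codes can also be confirmed in one line (a minimal sketch):
df_item.StockCode.is_unique  # True: every StockCode appears exactly once in the item table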
Unit price
df_item.loc[df_item.UnitPrice.sort_values().index]
| StockCode | Description | UnitPrice | |
|---|---|---|---|
| 3260 | 21614 | None | 0.00 |
| 3283 | 21274 | None | 0.00 |
| 3282 | 84845D | None | 0.00 |
| 3280 | 84876C | None | 0.00 |
| 3279 | 84875A | None | 0.00 |
| ... | ... | ... | ... |
| 190 | 22827 | RUSTIC SEVENTEEN DRAWER SIDEBOARD | 165.00 |
| 2541 | 22826 | LOVE SEAT ANTIQUE WHITE METAL | 175.00 |
| 1591 | 22655 | VINTAGE RED KITCHEN CABINET | 295.00 |
| 952 | DOT | DOTCOM POSTAGE | 569.77 |
| 3753 | B | Adjust bad debt | 11062.06 |
4070 rows × 3 columns
df_item[df_item.UnitPrice == 0].shape
(215, 3)
The minimum unit price is 0 and the maximum is 11062.06. A total of 215 stock codes have a unit price of 0.
df_item[df_item.UnitPrice == 0].Description.unique()
array([None, '?', 'check', 'FRENCH BLUE METAL DOOR SIGN 4',
'CHILDS GARDEN TROWEL BLUE ', 'CHILDS GARDEN RAKE BLUE',
'LUNCH BOX WITH CUTLERY FAIRY CAKES ',
'TEATIME FUNKY FLOWER BACKPACK FOR 2', 'damages', 'Given away',
'thrown away', 'throw away', "thrown away-can't sell.",
"thrown away-can't sell", 'mailout ', 'mailout',
'Thrown away-rusty', 'wet damaged', 'Damaged',
'TRAVEL CARD WALLET DOTCOMGIFTSHOP', 'ebay', 'found', 'adjustment',
'Found by jackie', 'Unsaleable, destroyed.'], dtype=object)
The list above shows the unique descriptions of stock codes with a zero unit price. Some descriptions are None, but a unit price of 0 is hard to interpret as a missing value; many of these entries look like damaged, given-away, or adjustment items. So no further cleaning of the unit price is needed.
Description
df_item[df_item.Description.isnull()]
| StockCode | Description | UnitPrice | |
|---|---|---|---|
| 1041 | 21134 | None | 0.0 |
| 1042 | 22145 | None | 0.0 |
| 1043 | 37509 | None | 0.0 |
| 1047 | 85226A | None | 0.0 |
| 1048 | 85044 | None | 0.0 |
| ... | ... | ... | ... |
| 3723 | 37477C | None | 0.0 |
| 3839 | 35592T | None | 0.0 |
| 3860 | 35598A | None | 0.0 |
| 3958 | 23702 | None | 0.0 |
| 4050 | 84971L | None | 0.0 |
176 rows × 3 columns
There are 176 missing values in the description column. (Q1 - 5)
Purchase data
df_purchase.head()
| InvoiceNo | StockCodeSC | Quantity | |
|---|---|---|---|
| 0 | 536365 | SC85123A | 6 |
| 1 | 536365 | SC71053 | 6 |
| 2 | 536365 | SC84406B | 8 |
| 3 | 536365 | SC84029G | 6 |
| 4 | 536365 | SC84029E | 6 |
df_purchase.shape
(541909, 3)
The original purchase data has 541909 rows and 3 columns.
Invoice number
np.sum(df_purchase.InvoiceNo.isnull())
0
There is no missing value in the invoice number. (Q1 - 5)
Since we are not interested in the invoice numbers that were canceled, let’s delete these rows.
df_purchase_shipped = df_purchase[df_purchase.InvoiceNo.str.startswith("C") == False]
df_purchase_shipped.head()
| InvoiceNo | StockCodeSC | Quantity | |
|---|---|---|---|
| 0 | 536365 | SC85123A | 6 |
| 1 | 536365 | SC71053 | 6 |
| 2 | 536365 | SC84406B | 8 |
| 3 | 536365 | SC84029G | 6 |
| 4 | 536365 | SC84029E | 6 |
Stock code
np.sum(df_purchase_shipped.StockCodeSC.isnull())
0
There is no missing value in the stock code. (Q1 - 5)
In the item data, the stock code column is named “StockCode”, while the purchase data uses “StockCodeSC”, so let’s unify it as “StockCode”.
np.sum(df_purchase_shipped.StockCodeSC.str.startswith("SC") == False)
0
All “StockCodeSC” values start with “SC”, so the “SC” prefix needs to be removed.
df_purchase_shipped["StockCode"] = df_purchase_shipped.StockCodeSC.str.split("SC").str[1]
/var/folders/kb/9bgdwxjn0b751yc9w59v1ll00000gn/T/ipykernel_53930/3471317207.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_purchase_shipped["StockCode"] = df_purchase_shipped.StockCodeSC.str.split("SC").str[1]
df_purchase_shipped.drop("StockCodeSC", inplace = True, axis = 1)
/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(
df_purchase_shipped.head()
| InvoiceNo | Quantity | StockCode | |
|---|---|---|---|
| 0 | 536365 | 6 | 85123A |
| 1 | 536365 | 6 | 71053 |
| 2 | 536365 | 8 | 84406B |
| 3 | 536365 | 6 | 84029G |
| 4 | 536365 | 6 | 84029E |
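An equivalent, warning-free way to build df_purchase_shipped is to work on an explicit copy and strip the prefix by position instead of splitting on “SC”. This is only a sketch of an alternative to the cells above, not what was actually run.
# Sketch: same result as above, but on an explicit copy (no SettingWithCopyWarning)
# and with a positional slice, so a second "SC" inside a code cannot affect the result.
df_purchase_shipped = df_purchase[~df_purchase.InvoiceNo.str.startswith("C")].copy()
df_purchase_shipped["StockCode"] = df_purchase_shipped["StockCodeSC"].str[2:]
df_purchase_shipped = df_purchase_shipped.drop(columns = ["StockCodeSC"])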
Quantity
np.sum(df_purchase_shipped.Quantity.isnull())
0
There is no missing value in the quantity column. (Q1 - 5)
df_purchase_shipped.Quantity.sort_values()
225530 -9600
225529 -9600
225528 -9058
115818 -5368
431381 -4830
...
421632 4800
74614 5568
502122 12540
61619 74215
540421 80995
Name: Quantity, Length: 532621, dtype: int64
There are some negative values in the quantity column. Let’s check them.
df_purchase_shipped[df_purchase_shipped.Quantity <= 0]
| InvoiceNo | Quantity | StockCode | |
|---|---|---|---|
| 2406 | 536589 | -10 | 21777 |
| 4347 | 536764 | -38 | 84952C |
| 7188 | 536996 | -20 | 22712 |
| 7189 | 536997 | -20 | 22028 |
| 7190 | 536998 | -6 | 85067 |
| ... | ... | ... | ... |
| 535333 | 581210 | -26 | 23395 |
| 535335 | 581212 | -1050 | 22578 |
| 535336 | 581213 | -30 | 22576 |
| 536908 | 581226 | -338 | 23090 |
| 538919 | 581422 | -235 | 23169 |
1336 rows × 3 columns
It is not yet clear what a negative quantity means, so these rows are left in place for now. Later, when the quantity column is actually needed, these negative values will be handled appropriately.
Merge
First, I extracted only the shipped records from the invoice data and will merge the other two tables onto it. Joining the purchase data on InvoiceNo attaches, to each invoice, which items were traded and in what quantity. Joining the item data on StockCode then attaches the description and unit price of each traded item. Since the tables can only be joined on InvoiceNo and StockCode, I first build a merged dataset that keeps all rows even if other columns contain missing values; those missing values will be handled as appropriate in the later analysis. (Q1 - 6)
df_merged = df_invoice_shipped.merge(df_purchase_shipped, on = "InvoiceNo", how = "left").merge(df_item, on = "StockCode", how = "left")
df_merged
| InvoiceNo | InvoiceDate | CustomerID | Country | Quantity | StockCode | Description | UnitPrice | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 6 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 6 | 71053 | WHITE METAL LANTERN | 3.39 |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 8 | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 6 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 6 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 532616 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 12 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 1.66 |
| 532617 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 6 | 22899 | CHILDREN'S APRON DOLLY GIRL | 2.10 |
| 532618 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 4 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4.15 |
| 532619 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 4 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4.15 |
| 532620 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 3 | 22138 | BAKING SET 9 PIECE RETROSPOT | 4.95 |
532621 rows × 8 columns
df_merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 532621 entries, 0 to 532620
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 532621 non-null object
1 InvoiceDate 532621 non-null datetime64[ns]
2 CustomerID 397924 non-null float64
3 Country 532621 non-null object
4 Quantity 532621 non-null int64
5 StockCode 532621 non-null object
6 Description 531204 non-null object
7 UnitPrice 532621 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.6+ MB
Since CustomerID is an identifier rather than a numeric quantity, the object type is more appropriate than float, so I will change CustomerID to the object type.
df_merged = df_merged.astype({"CustomerID" : object})
df_merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 532621 entries, 0 to 532620
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 532621 non-null object
1 InvoiceDate 532621 non-null datetime64[ns]
2 CustomerID 397924 non-null object
3 Country 532621 non-null object
4 Quantity 532621 non-null int64
5 StockCode 532621 non-null object
6 Description 531204 non-null object
7 UnitPrice 532621 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 36.6+ MB
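An alternative (only a sketch, not applied to the data used below) is pandas’ nullable integer dtype, which keeps the missing IDs as <NA> and drops the trailing “.0” in the display; customer_id_int is just an illustrative name.
# Sketch: represent CustomerID as a nullable integer ("Int64", capital I) instead of object.
# pd.to_numeric turns the object column back into floats; "Int64" then stores NaN as <NA>.
customer_id_int = pd.to_numeric(df_merged["CustomerID"]).astype("Int64")
customer_id_int.head()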
df_merged = df_merged[["InvoiceNo", "InvoiceDate", "CustomerID", "Country", "StockCode", "Description", "UnitPrice", "Quantity"]]
df_merged
| InvoiceNo | InvoiceDate | CustomerID | Country | StockCode | Description | UnitPrice | Quantity | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 6 |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 71053 | WHITE METAL LANTERN | 3.39 | 6 |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 8 |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 6 |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 532616 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22613 | PACK OF 20 SPACEBOY NAPKINS | 1.66 | 12 |
| 532617 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22899 | CHILDREN'S APRON DOLLY GIRL | 2.10 | 6 |
| 532618 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4.15 | 4 |
| 532619 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4.15 | 4 |
| 532620 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22138 | BAKING SET 9 PIECE RETROSPOT | 4.95 | 3 |
532621 rows × 8 columns
Q2. [10 points] Invoice Analysis
- For each customer calculate how many total invoices they have placed. List the top 10 customers who have placed an invoice in descending order.
- Perform a similar calculation but instead of the number of invoices, calculate the total quantity of items ordered for each customer. List the top 10 customers in descending order.
- Compare the top 10 customers, does it appear that the more invoices a customer have, the greater the total quantity of items? Explain your reasoning.
Hint: For 2.2, you may need to join two datasets together to answer the question.
(Q2 - 1)
pd.DataFrame(df_merged.groupby("CustomerID").InvoiceNo.nunique().sort_values(ascending = False).head(10).reset_index().rename({"InvoiceNo" : "total_invoice_number"}, axis = 1)) \
.merge(df_merged.groupby("CustomerID").InvoiceDate.min().reset_index().rename({"InvoiceDate" : "min_invoice_date"}, axis = 1), on = "CustomerID", how = "left") \
.merge(df_merged.groupby("CustomerID").InvoiceDate.max().reset_index().rename({"InvoiceDate" : "max_invoice_date"}, axis = 1), on = "CustomerID", how = "left")
| CustomerID | total_invoice_number | min_invoice_date | max_invoice_date | |
|---|---|---|---|---|
| 0 | 12748.0 | 210 | 2010-12-01 12:48:00 | 2011-12-09 12:20:00 |
| 1 | 14911.0 | 201 | 2010-12-01 14:05:00 | 2011-12-08 15:54:00 |
| 2 | 17841.0 | 124 | 2010-12-01 14:41:00 | 2011-12-08 12:07:00 |
| 3 | 13089.0 | 97 | 2010-12-05 10:27:00 | 2011-12-07 09:02:00 |
| 4 | 14606.0 | 93 | 2010-12-01 16:57:00 | 2011-12-08 19:28:00 |
| 5 | 15311.0 | 91 | 2010-12-01 09:41:00 | 2011-12-09 12:00:00 |
| 6 | 12971.0 | 86 | 2010-12-02 16:42:00 | 2011-12-06 12:20:00 |
| 7 | 14646.0 | 74 | 2010-12-20 10:09:00 | 2011-12-08 12:12:00 |
| 8 | 16029.0 | 63 | 2010-12-01 09:57:00 | 2011-11-01 10:27:00 |
| 9 | 13408.0 | 62 | 2010-12-01 10:39:00 | 2011-12-08 09:05:00 |
The table above lists the top 10 customers by number of invoices placed, in descending order. The largest number of invoices is 210, more than three times the 62 invoices of the customer in 10th place. All of the top 10 customers placed orders over roughly the same one-year period.
(Q2 - 2)
pd.DataFrame(df_merged.groupby("CustomerID").Quantity.sum().sort_values(ascending = False).reset_index().rename({"Quantity" : "total_quantity"}, axis = 1)).head(10) \
.merge(df_merged.groupby("CustomerID").InvoiceDate.min().reset_index().rename({"InvoiceDate" : "min_invoice_date"}, axis = 1), on = "CustomerID", how = "left") \
.merge(df_merged.groupby("CustomerID").InvoiceDate.max().reset_index().rename({"InvoiceDate" : "max_invoice_date"}, axis = 1), on = "CustomerID", how = "left")
| CustomerID | total_quantity | min_invoice_date | max_invoice_date | |
|---|---|---|---|---|
| 0 | 14646.0 | 197491 | 2010-12-20 10:09:00 | 2011-12-08 12:12:00 |
| 1 | 16446.0 | 80997 | 2011-05-18 09:52:00 | 2011-12-09 09:15:00 |
| 2 | 14911.0 | 80515 | 2010-12-01 14:05:00 | 2011-12-08 15:54:00 |
| 3 | 12415.0 | 77670 | 2011-01-06 11:12:00 | 2011-11-15 14:22:00 |
| 4 | 12346.0 | 74215 | 2011-01-18 10:01:00 | 2011-01-18 10:01:00 |
| 5 | 17450.0 | 69993 | 2010-12-07 09:23:00 | 2011-12-01 13:29:00 |
| 6 | 17511.0 | 64549 | 2010-12-01 10:19:00 | 2011-12-07 10:12:00 |
| 7 | 18102.0 | 64124 | 2010-12-07 16:42:00 | 2011-12-09 11:50:00 |
| 8 | 13694.0 | 63312 | 2010-12-01 12:12:00 | 2011-12-06 09:32:00 |
| 9 | 14298.0 | 58343 | 2010-12-14 12:59:00 | 2011-12-01 13:12:00 |
The table above shows the top 10 customers with the highest total quantity. The largest total quantity is about 200,000, more than three times the roughly 58,000 of the customer in 10th place. Looking at the ordering period as well, customer 12346 ordered about 74,000 units in a single day. In other words, customer 14646 has the largest total quantity overall, while customer 12346 has the largest total quantity relative to the ordering period.
(Q2 - 3)
Let’s consider the number of invoices and the total quantity together.
df_total_invoice_quantity = pd.DataFrame(df_merged.groupby("CustomerID").InvoiceNo.nunique().reset_index().rename({"InvoiceNo" : "total_invoice_number"}, axis = 1)) \
.merge(df_merged.groupby("CustomerID").Quantity.sum().reset_index().rename({"Quantity" : "total_quantity"}, axis = 1))
df_total_invoice_quantity.sort_values("total_invoice_number", ascending = False).head(10)
| CustomerID | total_invoice_number | total_quantity | |
|---|---|---|---|
| 326 | 12748.0 | 210 | 25748 |
| 1880 | 14911.0 | 201 | 80515 |
| 4011 | 17841.0 | 124 | 23071 |
| 562 | 13089.0 | 97 | 31070 |
| 1662 | 14606.0 | 93 | 6224 |
| 2177 | 15311.0 | 91 | 38194 |
| 481 | 12971.0 | 86 | 9289 |
| 1690 | 14646.0 | 74 | 197491 |
| 2703 | 16029.0 | 63 | 40208 |
| 796 | 13408.0 | 62 | 16232 |
df_total_invoice_quantity[["total_invoice_number","total_quantity"]].corr()
| total_invoice_number | total_quantity | |
|---|---|---|
| total_invoice_number | 1.000000 | 0.558005 |
| total_quantity | 0.558005 | 1.000000 |
First, the correlation between the total invoice number and the total quantity across all customers is about 0.56. The correlation coefficient measures how linearly related two variables are and takes a value between -1 and 1: a positive value indicates a positive linear relationship, a negative value a negative one, and the closer the absolute value is to 1, the stronger the linear relationship. A correlation of about 0.56 therefore indicates a moderately positive linear relationship between the total invoice number and the total quantity.
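For reference, the Pearson coefficient that .corr() reports can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. A minimal numpy sketch:
# Sketch: Pearson correlation computed by hand, to compare with the value from .corr().
x = df_total_invoice_quantity["total_invoice_number"].to_numpy(dtype = float)
y = df_total_invoice_quantity["total_quantity"].to_numpy(dtype = float)
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
r  # approximately 0.558, as reported above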
df_total_invoice_quantity[["total_invoice_number","total_quantity"]].sort_values("total_invoice_number", ascending = False).head(10).corr()
| total_invoice_number | total_quantity | |
|---|---|---|
| total_invoice_number | 1.000000 | -0.038673 |
| total_quantity | -0.038673 | 1.000000 |
However, when only the top 10 customers by invoice count are considered, the correlation between the total invoice number and the total quantity is essentially zero, at about -0.04.
sns.scatterplot(x = "total_invoice_number", y = "total_quantity", data = df_total_invoice_quantity[["total_invoice_number","total_quantity"]].sort_values("total_invoice_number", ascending = False).head(10),)
<AxesSubplot:xlabel='total_invoice_number', ylabel='total_quantity'>

The scatter plot of total invoice number against total quantity for the top 10 customers also shows almost no linear relationship between the two variables.
Q3. [10 points] Item Analysis
- What is the average item-unit price?
- What % of items are under $25?
- Generate a histogram of the unit prices. Select reasonable min/max values for the x-axis. Why did you pick those values? What do you notice about the histogram?
(Q3 - 1) & (Q3 - 3)
df_item.UnitPrice.mean()
6.905277886977952
A simple calculation gives an average item unit price of about 6.9 sterling.
df_item[df_item.UnitPrice == 0].shape
(215, 3)
df_item_positive_price = df_item[df_item.UnitPrice > 0]
df_item_positive_price.UnitPrice.mean()
7.290397146562974
However, about 5% of the items have a unit price of 0. Excluding those, the average unit price is about 7.3 sterling.
df_item_positive_price.describe()
| UnitPrice | |
|---|---|
| count | 3855.000000 |
| mean | 7.290397 |
| std | 178.548626 |
| min | 0.001000 |
| 25% | 1.250000 |
| 50% | 2.510000 |
| 75% | 4.950000 |
| max | 11062.060000 |
Also, since the mean is strongly affected by outliers, it is worth checking for them when reporting an average. In the table above, about 75% of items cost less than 5 sterling, yet the average computed above was about 7.3 sterling. This is because of extreme prices such as the maximum value of 11062.06.
sns.distplot(np.log(df_item_positive_price.UnitPrice))
/opt/anaconda3/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='UnitPrice', ylabel='Density'>

If the unit price is log-transformed, it roughly follows a normal distribution. Therefore, let’s reduce the influence of outliers by keeping only the data within two standard deviations of the mean of the log unit price, roughly 95% of the data.
log_price = np.log(df_item_positive_price.UnitPrice)
lower_bound = log_price.mean() - (2 * log_price.std())
lower_bound
-1.1411714381940192
upper_bound = log_price.mean() + (2 * log_price.std())
upper_bound
2.9566302011417886
df_item_positive_price_except_outlier = df_item_positive_price[(np.log(df_item_positive_price.UnitPrice) > lower_bound) & (np.log(df_item_positive_price.UnitPrice) < upper_bound)]
df_item_positive_price_except_outlier.describe()
| UnitPrice | |
|---|---|
| count | 3716.000000 |
| mean | 3.580312 |
| std | 3.244735 |
| min | 0.320000 |
| 25% | 1.250000 |
| 50% | 2.510000 |
| 75% | 4.250000 |
| max | 19.130000 |
sns.distplot(df_item_positive_price_except_outlier.UnitPrice)
/opt/anaconda3/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='UnitPrice', ylabel='Density'>

Looking only at the roughly 95% of the data with outliers removed, the minimum unit price is 0.32 sterling and the maximum is about 19 sterling, which is far more reasonable. On this reduced data, the average item unit price is about 3.58 sterling.
(Q3 - 2)
To find items that are under $25, we have to convert the unit price from sterling to dollars. I will apply an exchange rate of 1 pound sterling to 1.13 US dollars.
df_item["unit_price_dollar"] = df_item.UnitPrice * 1.13
df_item
| StockCode | Description | UnitPrice | unit_price_dollar | |
|---|---|---|---|---|
| 0 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 2.8815 |
| 1 | 71053 | WHITE METAL LANTERN | 3.39 | 3.8307 |
| 2 | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 3.1075 |
| 3 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 3.8307 |
| 4 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 3.8307 |
| ... | ... | ... | ... | ... |
| 4065 | 85179a | GREEN BITTY LIGHT CHAIN | 2.46 | 2.7798 |
| 4066 | 23617 | SET 10 CARDS SWIRLY XMAS TREE 17104 | 2.91 | 3.2883 |
| 4067 | 90214U | LETTER "U" BLING KEY RING | 0.29 | 0.3277 |
| 4068 | 47591b | SCOTTIES CHILDRENS APRON | 4.13 | 4.6669 |
| 4069 | 23843 | PAPER CRAFT , LITTLE BIRDIE | 2.08 | 2.3504 |
4070 rows × 4 columns
(df_item.unit_price_dollar < 25).mean()
0.9867321867321868
Including zero-price items, about 98.7% of items are under $25.
(df_item.loc[df_item_positive_price.index].unit_price_dollar < 25).mean()
0.9859922178988327
Excluding zero-price items, about 98.6% of items are under $25.
(df_item.loc[df_item_positive_price_except_outlier.index].unit_price_dollar < 25).mean()
1.0
After excluding outliers as above, all remaining items are under $25.
Q4. [25 points] Order Trends
- What are the top 10 most ordered items? Describe them. Do you see any trends?
- What are the top 5 invoices that generated the most revenue? (Revenue is calculated by “marking up” the unit price by 25%.)
- Do the top 5 invoices contain any of the top 10 most ordered items?
Hint: When calculating the revenue we suggest adding a new column on the dataframe.
(Q4 - 1)
There are several ways to measure the most ordered items. If the measure is based on quantity, the ranking can be skewed by single transactions in which one item is traded in very large volume, so I will measure the most ordered items by the number of distinct invoices instead. Also, since we only want to count items that were clearly ordered, rows with a negative quantity are excluded.
df_merged_positive_quantity = df_merged[df_merged.Quantity > 0]
most_ordered_items = df_merged_positive_quantity.groupby("StockCode").InvoiceNo.nunique().sort_values(ascending = False).head(10).reset_index().rename({"InvoiceNo" : "invoice_count"}, axis = 1) \
.merge(df_item[["StockCode", "Description"]], on = "StockCode", how = "left")
most_ordered_items
| StockCode | invoice_count | Description | |
|---|---|---|---|
| 0 | 85123A | 2203 | WHITE HANGING HEART T-LIGHT HOLDER |
| 1 | 85099B | 2092 | JUMBO BAG RED RETROSPOT |
| 2 | 22423 | 1989 | REGENCY CAKESTAND 3 TIER |
| 3 | 47566 | 1686 | PARTY BUNTING |
| 4 | 20725 | 1565 | LUNCH BAG RED RETROSPOT |
| 5 | 84879 | 1455 | ASSORTED COLOUR BIRD ORNAMENT |
| 6 | 22197 | 1392 | SMALL POPCORN HOLDER |
| 7 | 22720 | 1387 | SET OF 3 CAKE TINS PANTRY DESIGN |
| 8 | 21212 | 1320 | PACK OF 72 RETROSPOT CAKE CASES |
| 9 | 22383 | 1285 | LUNCH BAG SUKI DESIGN |
The table above shows the top 10 most ordered items. Overall, party-related items occupy the upper ranks. The 3rd, 8th, and 9th places are cake-related items, accounting for about 4700 invoices in total; the 1st, 4th, 6th, and 7th places are party-related decorations, with about 6700 invoices; and the 2nd, 5th, and 10th places are bags, with about 4900 invoices.
(Q4 - 2)
Revenue is calculated by “marking up” the unit price by 25%, so let’s create a new revenue column calculated as unit price * 1.25 * quantity.
df_merged_positive_quantity["revenue"] = df_merged_positive_quantity.UnitPrice * 1.25 * df_merged_positive_quantity.Quantity
df_merged_positive_quantity
/var/folders/kb/9bgdwxjn0b751yc9w59v1ll00000gn/T/ipykernel_53930/1210467572.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_merged_positive_quantity["revenue"] = df_merged_positive_quantity.UnitPrice * 1.25 * df_merged_positive_quantity.Quantity
| InvoiceNo | InvoiceDate | CustomerID | Country | StockCode | Description | UnitPrice | Quantity | revenue | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 6 | 19.1250 |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 71053 | WHITE METAL LANTERN | 3.39 | 6 | 25.4250 |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 8 | 27.5000 |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 6 | 25.4250 |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 6 | 25.4250 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 532616 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22613 | PACK OF 20 SPACEBOY NAPKINS | 1.66 | 12 | 24.9000 |
| 532617 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22899 | CHILDREN'S APRON DOLLY GIRL | 2.10 | 6 | 15.7500 |
| 532618 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4.15 | 4 | 20.7500 |
| 532619 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4.15 | 4 | 20.7500 |
| 532620 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22138 | BAKING SET 9 PIECE RETROSPOT | 4.95 | 3 | 18.5625 |
531285 rows × 9 columns
invoice_top_revenue = np.round(df_merged_positive_quantity.groupby("InvoiceNo").sum()[["revenue", "Quantity"]].sort_values("revenue", ascending = False).head(5)) \
.merge(df_merged_positive_quantity.groupby("InvoiceNo").nunique().StockCode, on = "InvoiceNo", how = "left") \
.merge(df_invoice_shipped[["InvoiceNo", "Country", "CustomerID", "InvoiceDate"]], on = "InvoiceNo", how = "left") \
.rename({"revenue" : "total_revenue", "Quantity" : "total_quantity", "StockCode" : "unique_stockcode_count"}, axis = 1)
invoice_top_revenue
| InvoiceNo | total_revenue | total_quantity | unique_stockcode_count | Country | CustomerID | InvoiceDate | |
|---|---|---|---|---|---|---|---|
| 0 | 547966 | 712212.0 | 1000 | 1 | United Kingdom | NaN | 2011-03-28 15:49:00 |
| 1 | 581483 | 210587.0 | 80995 | 1 | United Kingdom | 16446.0 | 2011-12-09 09:15:00 |
| 2 | 541431 | 96480.0 | 74215 | 1 | United Kingdom | 12346.0 | 2011-01-18 10:01:00 |
| 3 | 556255 | 37305.0 | 3600 | 12 | United Kingdom | 18102.0 | 2011-06-09 17:27:00 |
| 4 | 556917 | 34546.0 | 15049 | 138 | Australia | 12415.0 | 2011-06-15 13:37:00 |
The table above shows the top 5 invoices by revenue, in sterling. The top invoice generated about 712,000 sterling, more than three times the roughly 211,000 sterling of the second invoice, and it consisted of only 1,000 units of a single stock code. Compared with the other invoices, which traded roughly 3,600, 15,000, 74,000, and 81,000 units, the largest revenue came from a very small quantity. Of the 5 invoices, one was ordered from Australia and the remaining four from the United Kingdom; four of the five were placed in the first half of 2011, and only one at the very end of 2011.
(Q4 - 3)
most_ordered_items["is_most_ordered"] = 1
invoice_top_revenue_with_item = df_merged_positive_quantity[["InvoiceNo","StockCode"]].merge(invoice_top_revenue["InvoiceNo"], on = "InvoiceNo", how = "inner") \
.merge(most_ordered_items[["StockCode", "Description", "is_most_ordered"]], on = "StockCode", how = "left").fillna(0)
invoice_top_revenue_with_item
| InvoiceNo | StockCode | Description | is_most_ordered | |
|---|---|---|---|---|
| 0 | 541431 | 23166 | 0 | 0.0 |
| 1 | 547966 | DOT | 0 | 0.0 |
| 2 | 556255 | 48138 | 0 | 0.0 |
| 3 | 556255 | 48173C | 0 | 0.0 |
| 4 | 556255 | 48188 | 0 | 0.0 |
| ... | ... | ... | ... | ... |
| 148 | 556917 | 22364 | 0 | 0.0 |
| 149 | 556917 | 22363 | 0 | 0.0 |
| 150 | 556917 | 21115 | 0 | 0.0 |
| 151 | 556917 | 22308 | 0 | 0.0 |
| 152 | 581483 | 23843 | 0 | 0.0 |
153 rows × 4 columns
invoice_top_revenue_with_item[invoice_top_revenue_with_item.is_most_ordered == 1]
| InvoiceNo | StockCode | Description | is_most_ordered | |
|---|---|---|---|---|
| 61 | 556917 | 20725 | LUNCH BAG RED RETROSPOT | 1.0 |
| 89 | 556917 | 85099B | JUMBO BAG RED RETROSPOT | 1.0 |
| 100 | 556917 | 22423 | REGENCY CAKESTAND 3 TIER | 1.0 |
| 107 | 556917 | 21212 | PACK OF 72 RETROSPOT CAKE CASES | 1.0 |
| 110 | 556917 | 22720 | SET OF 3 CAKE TINS PANTRY DESIGN | 1.0 |
The is_most_ordered column is 1 when an item on an invoice is one of the top 10 most ordered items. From the table above, only invoice “556917” among the top 5 invoices contains any of the top 10 most ordered items, and it contains 5 of them: all three cake-related items and two of the bag items.
Q5. [30 points] Customer Analysis
- Classify customers into segments based on the total revenue they have generated for the company.
- low value: less than $1500
- medium value: between 1500 and 8000 dollars
- high value: greater than $8000
- How many customers are in each segment?
- Using the pivot table function, create a table that displays the average order quantity of each stock code for a given segment.
- Are the items with the highest average order quantity generally the same across segments? Explain your reasoning.
- Choose three items and discuss any trends/differences you notice across the three segments.
Hint: When calculating the segment, we suggest constructing a new dataframe as an intermediary step with the columns: CustomerID, Revenue, Segment.
Since the customer segments are divided based on revenue calculated in dollars, let’s convert revenue into dollars.
df_merged_positive_quantity["revenue_dollar"] = df_merged_positive_quantity.revenue * 1.3
df_merged_positive_quantity.head()
/var/folders/kb/9bgdwxjn0b751yc9w59v1ll00000gn/T/ipykernel_53930/3486501612.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_merged_positive_quantity["revenue_dollar"] = df_merged_positive_quantity.revenue * 1.3
| InvoiceNo | InvoiceDate | CustomerID | Country | StockCode | Description | UnitPrice | Quantity | revenue | revenue_dollar | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 6 | 19.125 | 24.8625 |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 71053 | WHITE METAL LANTERN | 3.39 | 6 | 25.425 | 33.0525 |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 8 | 27.500 | 35.7500 |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 6 | 25.425 | 33.0525 |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 6 | 25.425 | 33.0525 |
customer_revenue = df_merged_positive_quantity.groupby("CustomerID").sum().revenue_dollar.reset_index().rename({"revenue_dollar" : "total_revenue_dollar"}, axis = 1)
customer_revenue
| CustomerID | total_revenue_dollar | |
|---|---|---|
| 0 | 12346.0 | 125423.35000 |
| 1 | 12347.0 | 8837.46500 |
| 2 | 12348.0 | 2909.01000 |
| 3 | 12349.0 | 3131.31000 |
| 4 | 12350.0 | 663.06500 |
| ... | ... | ... |
| 4334 | 18280.0 | 412.19750 |
| 4335 | 18281.0 | 109.65500 |
| 4336 | 18282.0 | 377.60125 |
| 4337 | 18283.0 | 3451.56500 |
| 4338 | 18287.0 | 4446.48750 |
4339 rows × 2 columns
low_segment_mask = customer_revenue.total_revenue_dollar < 1500
medium_segment_mask = (customer_revenue.total_revenue_dollar >= 1500) & (customer_revenue.total_revenue_dollar < 8000)
high_segment_mask = customer_revenue.total_revenue_dollar >= 8000
customer_revenue.loc[low_segment_mask, "revenue_segment"] = "low"
customer_revenue.loc[medium_segment_mask, "revenue_segment"] = "medium"
customer_revenue.loc[high_segment_mask, "revenue_segment"] = "high"
customer_revenue
| CustomerID | total_revenue_dollar | revenue_segment | |
|---|---|---|---|
| 0 | 12346.0 | 125423.35000 | high |
| 1 | 12347.0 | 8837.46500 | high |
| 2 | 12348.0 | 2909.01000 | medium |
| 3 | 12349.0 | 3131.31000 | medium |
| 4 | 12350.0 | 663.06500 | low |
| ... | ... | ... | ... |
| 4334 | 18280.0 | 412.19750 | low |
| 4335 | 18281.0 | 109.65500 | low |
| 4336 | 18282.0 | 377.60125 | low |
| 4337 | 18283.0 | 3451.56500 | medium |
| 4338 | 18287.0 | 4446.48750 | medium |
4339 rows × 3 columns
Based on the criteria above, three segments (high, medium, and low) were created.
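As an aside, the same segmentation can be expressed in one step with pd.cut; a minimal sketch (segment_via_cut is just an illustrative name) whose bins mirror the thresholds above:
# Equivalent segmentation in one step: bins are [-inf, 1500), [1500, 8000), [8000, inf)
segment_via_cut = pd.cut(
    customer_revenue["total_revenue_dollar"],
    bins=[-np.inf, 1500, 8000, np.inf],
    labels=["low", "medium", "high"],
    right=False,  # left-inclusive, so exactly $1500 falls in "medium", matching the masks above
)
(segment_via_cut.astype(str) == customer_revenue["revenue_segment"]).all()  # sanity check, should be True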
(Q5 - 1)
customer_revenue.groupby("revenue_segment").count().CustomerID.reset_index() \
.merge(round(customer_revenue.groupby("revenue_segment").count().CustomerID / np.sum(customer_revenue.groupby("revenue_segment").count().CustomerID) * 100, 1), on = "revenue_segment", how = "left") \
.rename({"CustomerID_x" : "total_customer_count", "CustomerID_y" : "total_customer_count_ratio"}, axis = 1) \
.loc[[0, 2, 1]]
| revenue_segment | total_customer_count | total_customer_count_ratio | |
|---|---|---|---|
| 0 | high | 356 | 8.2 |
| 2 | medium | 1656 | 38.2 |
| 1 | low | 2327 | 53.6 |
There are 2,327 customers in the low segment, which is about 54% of all customers. About 38% of customers (1,656) belong to the medium segment, and about 8% (356) belong to the high segment.
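The chained groupby/merge above can also be condensed with value_counts; a minimal equivalent sketch:
# Count customers per segment and compute each segment's share of all customers
segment_counts = customer_revenue["revenue_segment"].value_counts()
segment_ratios = (customer_revenue["revenue_segment"]
                  .value_counts(normalize=True)
                  .mul(100)
                  .round(1))
pd.DataFrame({"total_customer_count": segment_counts,
              "total_customer_count_ratio": segment_ratios})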
(Q5 - 2)
df_merged_positive_quantity = df_merged_positive_quantity.merge(customer_revenue, on = "CustomerID", how = "left")
df_merged_positive_quantity
| InvoiceNo | InvoiceDate | CustomerID | Country | StockCode | Description | UnitPrice | Quantity | revenue | revenue_dollar | total_revenue_dollar | revenue_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 6 | 19.1250 | 24.86250 | 8821.16625 | high |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 71053 | WHITE METAL LANTERN | 3.39 | 6 | 25.4250 | 33.05250 | 8821.16625 | high |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 8 | 27.5000 | 35.75000 | 8821.16625 | high |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 6 | 25.4250 | 33.05250 | 8821.16625 | high |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 6 | 25.4250 | 33.05250 | 8821.16625 | high |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 531280 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22613 | PACK OF 20 SPACEBOY NAPKINS | 1.66 | 12 | 24.9000 | 32.37000 | 1568.79125 | medium |
| 531281 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22899 | CHILDREN'S APRON DOLLY GIRL | 2.10 | 6 | 15.7500 | 20.47500 | 1568.79125 | medium |
| 531282 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4.15 | 4 | 20.7500 | 26.97500 | 1568.79125 | medium |
| 531283 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4.15 | 4 | 20.7500 | 26.97500 | 1568.79125 | medium |
| 531284 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22138 | BAKING SET 9 PIECE RETROSPOT | 4.95 | 3 | 18.5625 | 24.13125 | 1568.79125 | medium |
531285 rows × 12 columns
First, the customer segments obtained above were merged back into the transaction-level table.
df_merged_positive_quantity.Quantity.describe()
count 531285.000000
mean 10.655262
std 156.830323
min 1.000000
25% 1.000000
50% 3.000000
75% 10.000000
max 80995.000000
Name: Quantity, dtype: float64
Looking at the distribution of Quantity, there are clear outliers: the maximum is 80,995 even though 75% of rows have a quantity of 10 or less. The average quantity we want to compute is strongly affected by such outliers, so we will calculate the average excluding them. The minimum quantity is 1, which cannot be regarded as an outlier, so only outliers at the high end will be removed.
percentile_99 = np.percentile(df_merged_positive_quantity.Quantity, 99.5)
percentile_99
168.0
The 99.5th percentile of Quantity is 168, which is a reasonable upper bound for an order quantity. Therefore, the analysis will be performed using only the bottom 99.5% of the data to reduce the influence of quantity outliers.
df_merged_positive_quantity_except_outlier = df_merged_positive_quantity[df_merged_positive_quantity.Quantity <= percentile_99]
segment_stockcode_average_quantity = df_merged_positive_quantity_except_outlier[["CustomerID", "revenue_segment", "InvoiceNo", "StockCode", "Quantity"]] \
.pivot_table(index = "revenue_segment", columns = "StockCode", values = "Quantity", aggfunc = np.mean).fillna(0)
segment_stockcode_average_quantity
| StockCode | 10002 | 10080 | 10120 | 10123C | 10124A | 10124G | 10125 | 10133 | 10135 | 11001 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| revenue_segment | |||||||||||||||||||||
| high | 27.450000 | 12.5000 | 7.714286 | 0.0 | 0.0 | 0.0 | 36.857143 | 14.424242 | 18.342857 | 15.368421 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0000 | 1.0 | 7.089888 | 1.0 | 3.518519 |
| low | 6.000000 | 24.0000 | 5.500000 | 2.0 | 3.5 | 4.5 | 12.750000 | 17.285714 | 14.571429 | 11.692308 | ... | 0.0 | 0.0 | 48.0 | 0.0 | 1.0 | 1.0625 | 0.0 | 6.040000 | 1.0 | 1.839286 |
| medium | 10.347826 | 13.5625 | 6.461538 | 1.0 | 3.0 | 4.0 | 9.645161 | 22.071429 | 12.070175 | 13.677419 | ... | 12.0 | 12.0 | 12.0 | 12.0 | 1.0 | 1.0000 | 0.0 | 6.184874 | 1.0 | 2.678119 |
3 rows × 3663 columns
The table above is the pivot table of the average order quantity of each stock code for each segment. Since there are over 3,000 stock codes, it is difficult to take in at a glance, so let’s look at only the 3 stock codes with the highest average order quantity in each segment.
segment_stockcode_average_quantity_unstacked = segment_stockcode_average_quantity.unstack().reset_index() \
.rename({0 : "average_quantity"}, axis = 1)[["revenue_segment","StockCode","average_quantity"]] \
.sort_values(["revenue_segment", "average_quantity"], ascending = False)
segment_stockcode_average_quantity_unstacked
| revenue_segment | StockCode | average_quantity | |
|---|---|---|---|
| 101 | medium | 16033 | 120.000000 |
| 107 | medium | 16045 | 100.000000 |
| 284 | medium | 17084R | 79.200000 |
| 7202 | medium | 35001W | 78.000000 |
| 224 | medium | 16259 | 74.714286 |
| ... | ... | ... | ... |
| 10941 | high | 90214O | 0.000000 |
| 10953 | high | 90214T | 0.000000 |
| 10956 | high | 90214U | 0.000000 |
| 10962 | high | 90214W | 0.000000 |
| 10968 | high | 90214Z | 0.000000 |
10989 rows × 3 columns
segment_stockcode_average_quantity_unstacked["average_quantity_rank"] = segment_stockcode_average_quantity_unstacked.groupby("revenue_segment").cumcount() + 1
segment_stockcode_average_quantity_unstacked = segment_stockcode_average_quantity_unstacked.merge(df_item[["StockCode","Description"]], on = "StockCode", how = "left")
segment_stockcode_average_quantity_unstacked[segment_stockcode_average_quantity_unstacked.average_quantity_rank <= 3] \
.set_index(["revenue_segment", "StockCode"]).loc[["high", "medium", "low"]]
| average_quantity | average_quantity_rank | Description | ||
|---|---|---|---|---|
| revenue_segment | StockCode | |||
| high | 23440 | 144.000000 | 1 | PAINT YOUR OWN EGGS IN CRATE |
| 85025B | 144.000000 | 2 | EAU DE NILE HEART SHAPE PHOTO FRAME | |
| 75178 | 108.250000 | 3 | ASSTD COL BUTTERFLY/CRYSTAL W/CHIME | |
| medium | 16033 | 120.000000 | 1 | MINI HIGHLIGHTER PENS |
| 16045 | 100.000000 | 2 | POPART WOODEN PENCILS ASST | |
| 17084R | 79.200000 | 3 | ASSORTED INCENSE PACK | |
| low | 16045 | 100.000000 | 1 | POPART WOODEN PENCILS ASST |
| 16259 | 99.666667 | 2 | PIECE OF CAMO STATIONERY SET | |
| 17084R | 96.000000 | 3 | ASSORTED INCENSE PACK |
The table above shows the 3 stock codes with the highest average order quantity in each revenue segment.
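As a quick cross-check, the same top-3 lists can be read directly off the pivot table with Series.nlargest; a minimal sketch:
# For each segment, take the 3 stock codes with the largest average order quantity
for segment in ["high", "medium", "low"]:
    print(segment)
    print(segment_stockcode_average_quantity.loc[segment].nlargest(3), "\n")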
(Q5 - 3)
The top 3 items in the high segment are all decorative or ornament-type items, while the top 3 items in the medium and low segments are mostly stationery-type items. In particular, the medium and low segments share two items, “16045” and “17084R”, in their top three. So the items with the highest average order quantity are largely the same across the medium and low segments, while the high segment differs somewhat from the other two.
(Q5 - 4)
To look for differences or trends in order quantity across the segments, I will pick the 3 items that have been traded most consistently over the months covered by the data.
df_merged_positive_quantity_except_outlier["invoice_year_month"] = pd.to_datetime(df_merged_positive_quantity_except_outlier['InvoiceDate']).dt.to_period('M')
/var/folders/kb/9bgdwxjn0b751yc9w59v1ll00000gn/T/ipykernel_53930/2851248984.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_merged_positive_quantity_except_outlier["invoice_year_month"] = pd.to_datetime(df_merged_positive_quantity_except_outlier['InvoiceDate']).dt.to_period('M')
df_merged_positive_quantity_except_outlier
| InvoiceNo | InvoiceDate | CustomerID | Country | StockCode | Description | UnitPrice | Quantity | revenue | revenue_dollar | total_revenue_dollar | revenue_segment | invoice_year_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2.55 | 6 | 19.1250 | 24.86250 | 8821.16625 | high | 2010-12 |
| 1 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 71053 | WHITE METAL LANTERN | 3.39 | 6 | 25.4250 | 33.05250 | 8821.16625 | high | 2010-12 |
| 2 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84406B | CREAM CUPID HEARTS COAT HANGER | 2.75 | 8 | 27.5000 | 35.75000 | 8821.16625 | high | 2010-12 |
| 3 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 3.39 | 6 | 25.4250 | 33.05250 | 8821.16625 | high | 2010-12 |
| 4 | 536365 | 2010-12-01 08:26:00 | 17850.0 | United Kingdom | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 3.39 | 6 | 25.4250 | 33.05250 | 8821.16625 | high | 2010-12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 531280 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22613 | PACK OF 20 SPACEBOY NAPKINS | 1.66 | 12 | 24.9000 | 32.37000 | 1568.79125 | medium | 2011-12 |
| 531281 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22899 | CHILDREN'S APRON DOLLY GIRL | 2.10 | 6 | 15.7500 | 20.47500 | 1568.79125 | medium | 2011-12 |
| 531282 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4.15 | 4 | 20.7500 | 26.97500 | 1568.79125 | medium | 2011-12 |
| 531283 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4.15 | 4 | 20.7500 | 26.97500 | 1568.79125 | medium | 2011-12 |
| 531284 | 581587 | 2011-12-09 12:50:00 | 12680.0 | France | 22138 | BAKING SET 9 PIECE RETROSPOT | 4.95 | 3 | 18.5625 | 24.13125 | 1568.79125 | medium | 2011-12 |
528632 rows × 13 columns
A year-month column was added to check the monthly order quantity.
monthly_segment_quantity = df_merged_positive_quantity_except_outlier.groupby(["StockCode","invoice_year_month", "revenue_segment"]).sum().Quantity
monthly_segment_quantity.reset_index().groupby("StockCode").invoice_year_month.nunique().sort_values(ascending = False)
StockCode
POST 13
21916 13
21914 13
21913 13
21912 13
..
90102 1
84614A 1
90104 1
21777 1
37491B 1
Name: invoice_year_month, Length: 3663, dtype: int64
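As an aside, the same monthly totals could be computed without adding invoice_year_month at all, by grouping on a pd.Grouper over InvoiceDate; a minimal sketch (monthly_segment_quantity_alt is just an illustrative name), assuming InvoiceDate parses cleanly as a datetime:
# Group by stock code, calendar month, and segment without a separate year-month column.
# Note the month key becomes a month-end Timestamp rather than a Period.
monthly_segment_quantity_alt = (
    df_merged_positive_quantity_except_outlier
    .assign(InvoiceDate=lambda d: pd.to_datetime(d["InvoiceDate"]))
    .groupby(["StockCode", pd.Grouper(key="InvoiceDate", freq="M"), "revenue_segment"])["Quantity"]
    .sum()
)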
# Use the items table to look up the descriptions of the stock codes above
df_item
Let’s analyze “POST”, “21916”, and “21914”, which have been traded steadily in all 13 months covered by the data.
monthly_segment_quantity_top3 = monthly_segment_quantity.loc[["POST","21916","21914"]]
POST : POSTAGE
plt.figure(figsize = (15,8))
low_med_high = monthly_segment_quantity_top3.loc["POST"].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar1 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low_med_high, color = "darkblue")
low_med = monthly_segment_quantity_top3.loc["POST"].loc[:, ["medium", "low"]].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar2 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low_med, color = "lightblue")
low = monthly_segment_quantity_top3.loc["POST"].loc[:, "low"].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar3 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low, color = "blue")
top_bar = mpatches.Patch(color='darkblue', label='high')
mid_bar = mpatches.Patch(color='lightblue', label='medium')
low_bar = mpatches.Patch(color='blue', label='low')
plt.legend(handles=[top_bar, mid_bar, low_bar])
<matplotlib.legend.Legend at 0x7fa0309bf0d0>
(Figure: monthly order quantity for POST as a stacked bar chart, broken down by revenue segment)
Except for December 2011, POST shows a generally increasing order quantity. December 2011 may appear low simply because the data for that month is incomplete, so let’s check this.
df_invoice_shipped.InvoiceDate.max()
Timestamp('2011-12-09 12:50:00')
Since the last invoice date in our data is December 9, 2011, the data for December 2011 is indeed incomplete. Therefore, for the remaining items, let’s proceed with the analysis excluding December 2011.
For POST, most orders come from the medium and high segments, and few come from the low segment. In addition, the order quantity from the medium segment in particular increases over time.
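The same overlay-plotting pattern is repeated for each of the remaining items, so it could be wrapped in a small helper; a minimal sketch (plot_segment_quantity is a hypothetical helper, not part of the original notebook, and it relies on the imports and on monthly_segment_quantity_top3 defined above):
def plot_segment_quantity(stock_code, drop_month=None):
    # Three overlaid bar plots fake a stacked chart: the total of all segments in dark blue,
    # medium+low in light blue, and low alone in blue, so the visible bands read as
    # high / medium / low from top to bottom.
    data = monthly_segment_quantity_top3.loc[stock_code]
    if drop_month is not None:
        data = data.drop(drop_month, axis=0)  # e.g. drop the incomplete "2011-12"
    plt.figure(figsize=(15, 8))
    total = data.reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
    sns.barplot(x="invoice_year_month", y="Quantity", data=total, color="darkblue")
    low_med = data.loc[:, ["medium", "low"]].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
    sns.barplot(x="invoice_year_month", y="Quantity", data=low_med, color="lightblue")
    low = data.loc[:, "low"].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
    sns.barplot(x="invoice_year_month", y="Quantity", data=low, color="blue")
    plt.legend(handles=[mpatches.Patch(color="darkblue", label="high"),
                        mpatches.Patch(color="lightblue", label="medium"),
                        mpatches.Patch(color="blue", label="low")])

plot_segment_quantity("21916", drop_month="2011-12")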
21916 : SET 12 RETRO WHITE CHALK STICKS
monthly_segment_quantity_top3.loc["21916"].drop("2011-12", axis = 0)
invoice_year_month revenue_segment
2010-12 high 47
low 5
medium 90
2011-01 high 48
low 24
...
2011-10 high 250
medium 112
2011-11 high 87
low 28
medium 98
Name: Quantity, Length: 34, dtype: int64
plt.figure(figsize = (15,8))
low_med_high = monthly_segment_quantity_top3.loc["21916"].drop("2011-12", axis = 0).reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar1 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low_med_high, color = "darkblue")
low_med = monthly_segment_quantity_top3.loc["21916"].drop("2011-12", axis = 0).loc[:, ["medium", "low"]].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar2 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low_med, color = "lightblue")
low = monthly_segment_quantity_top3.loc["21916"].drop("2011-12", axis = 0).loc[:, "low"].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar3 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low, color = "blue")
top_bar = mpatches.Patch(color='darkblue', label='high')
mid_bar = mpatches.Patch(color='lightblue', label='medium')
low_bar = mpatches.Patch(color='blue', label='low')
plt.legend(handles=[top_bar, mid_bar, low_bar])
<matplotlib.legend.Legend at 0x7fa030c713a0>
(Figure: monthly order quantity for 21916 SET 12 RETRO WHITE CHALK STICKS as a stacked bar chart, broken down by revenue segment)
Chalk sticks showed an increasing trend from December 2010 to March 2011, dipped only in April 2011, and then settled into a fairly steady transaction volume. As with POST, most chalk-stick orders come from the high and medium segments, with fewer orders from the low segment.
21914 : BLUE HARMONICA IN BOX
plt.figure(figsize = (15,8))
low_med_high = monthly_segment_quantity_top3.loc["21914"].drop("2011-12", axis = 0).reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar1 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low_med_high, color = "darkblue")
low_med = monthly_segment_quantity_top3.loc["21914"].drop("2011-12", axis = 0).loc[:, ["medium", "low"]].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar2 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low_med, color = "lightblue")
low = monthly_segment_quantity_top3.loc["21914"].drop("2011-12", axis = 0).loc[:, "low"].reset_index().groupby("invoice_year_month")["Quantity"].sum().reset_index()
bar3 = sns.barplot(x = "invoice_year_month", y = "Quantity", data = low, color = "blue")
top_bar = mpatches.Patch(color='darkblue', label='high')
mid_bar = mpatches.Patch(color='lightblue', label='medium')
low_bar = mpatches.Patch(color='blue', label='low')
plt.legend(handles=[top_bar, mid_bar, low_bar])
<matplotlib.legend.Legend at 0x7fa027d6ed90>
(Figure: monthly order quantity for 21914 BLUE HARMONICA IN BOX as a stacked bar chart, broken down by revenue segment)
The blue harmonica maintained a fairly constant order quantity through August 2011 and then increased sharply from September to November 2011. The increase appears across all three segments: low, medium, and high.