Understanding Customer Retention Rates with Cohort Analysis using Python and Seaborn

Important note: This was created as part of my own personal learning process for data science in Python. I find it extremely helpful to write this down to help me learn better and faster. This was part of a course on DataCamp, and the code is based on the course and other online resources I used. Please visit DataCamp for the original syllabus. I am not affiliated with DataCamp in any form and only use their resources as a learner.

It is important for a business to understand whether its products are selling well. It is arguably more important to understand whether customers enjoyed the product enough to make a repeat purchase. One way to look at this is through customer retention rates, average quantity purchased, and average price, by segmenting customers into cohorts and analyzing them.

What is a cohort, and how do we analyse one?

Cohorts are mutually exclusive segments of customers, grouped by time of acquisition, similar behaviours, or size. We can use cohort analysis to compare business and marketing metrics across the product and customer lifecycles of the business. One of the easiest ways to construct a cohort analysis is with pivot tables.

Cohort Pivot Table

Above, the ‘CohortMonth’ column is our dataframe index. The pivot columns represent the number of months between a customer’s first activity and each subsequent activity.
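To make this structure concrete, here is a minimal, hypothetical sketch of such a pivot table built by hand. The numbers are made up purely for illustration and are not from the data set: each row is an acquisition cohort, and each column counts activity n months after the first purchase.

# Toy cohort pivot table with made-up numbers, purely to illustrate the shape
import pandas as pd

toy = pd.DataFrame({1: [100, 120], 2: [45, 50], 3: [30, 38]},
                   index=pd.to_datetime(['2010-12-01', '2011-01-01']))
toy.index.name = 'CohortMonth'
toy.columns.name = 'CohortIndex'
# rows = cohorts, columns = months since first purchase (1 = the cohort month itself)
print(toy)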

Let’s get into some code.

Data Set — Online Retail Data from UK-based online store

We will be using the UCI Online Retail Data Set, which contains over 0.5 million transactions between Dec 2010 and Dec 2011.

Importing Libraries

# Import libraries
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt

These are the libraries we will be using for our cohort analysis.

Let’s load our data set in.

online = pd.read_csv('data.csv', encoding="ISO-8859-1")

# Convert InvoiceDate from object to datetime format
online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'])

Now let’s group our customer acquisition cohorts based on the month they made their first purchase.

# --Group customers in acquisition cohorts based on the month they made their first purchase--
# Create function to truncate a given date to the first day of its month
def get_month(x):
    return dt.datetime(x.year, x.month, 1)

# Apply function to InvoiceDate to create the InvoiceMonth column
online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)

# Each customer's cohort month is the month of their first purchase
grouping1 = online.groupby('CustomerID')['InvoiceMonth']
online['CohortMonth'] = grouping1.transform('min')
print(online.describe())

From the terminal output, the describe() summary indicates that only 406,829 rows have ‘CustomerID’ values. We will drop the rows with NaN values.
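The code for this cleaning step isn’t shown above, so here is a minimal sketch of one way to do it, assuming we only want transactions with a known CustomerID:

# Keep only rows that have a CustomerID (a sketch; the exact cleaning step may differ)
online = online.dropna(subset=['CustomerID'])
print(online.shape)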

Next, we will create our cohort index column. First, let’s calculate the date difference between the first transaction and each subsequent activity. We will create a function and apply it to our dataframe.

def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

invoice_year, invoice_month, _ = get_date_int(online, 'InvoiceMonth')
cohort_year, cohort_month, _ = get_date_int(online, 'CohortMonth')

Next let’s get the value of the year and month difference between the transactions.

years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month

We will now construct our cohort index, the number of months since the first transaction, counting the cohort month itself as 1. For example, a purchase made in March 2011 by a customer acquired in December 2010 gets a cohort index of (2011 - 2010) * 12 + (3 - 12) + 1 = 4.

online['CohortIndex'] = years_diff * 12 + months_diff + 1
online.head()

Our dataframe is looking closer to what we need in order to create our pivot tables for customer retention.

Onto the next step!

# Count monthly active customers from each cohort
grouping_count = online.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping_count['CustomerID'].apply(pd.Series.nunique)
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot(index='CohortMonth',
                                  columns='CohortIndex',
                                  values='CustomerID')
print(cohort_counts.head())

Here, we count the monthly active customers from each cohort who went on to make a purchase in a later month.

Great! This screen grab looks close to what we are trying to achieve. Now we need to express the customers retained each month as a percentage of the total size of their cohort.

# --Calculate Retention Rate--
# Cohort size is the number of customers in the first month of each cohort
cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0)
# Print the table as percentages; retention itself stays as fractions so the
# heatmap below can format it with fmt='.0%'
print(retention.round(3) * 100)
retention.index = retention.index.strftime('%m-%Y')

While Spyder’s interface provides us with some decent conditional formatting in each table cell, we can do better by visualizing the results in seaborn. But before that, let’s calculate our average quantity and price by cohort.

# --Calculate Average Quantity--
grouping_qty = online.groupby(['CohortMonth', 'CohortIndex'])
cohort_data_qty = grouping_qty['Quantity'].mean()
cohort_data_qty = cohort_data_qty.reset_index()
average_quantity = cohort_data_qty.pivot(index='CohortMonth',
                                         columns='CohortIndex',
                                         values='Quantity')
average_quantity.index = average_quantity.index.strftime('%m-%Y')

# --Calculate Average Price--
grouping_price = online.groupby(['CohortMonth', 'CohortIndex'])
cohort_data_price = grouping_price['UnitPrice'].mean()
cohort_data_price = cohort_data_price.reset_index()
average_price = cohort_data_price.pivot(index='CohortMonth',
                                        columns='CohortIndex',
                                        values='UnitPrice')
average_price.index = average_price.index.strftime('%m-%Y')

Time to do some visualization

Now, let’s make our analysis presentable to a wider audience. For that we will create a heatmap with seaborn.

# --Plot heatmap using seaborn--
# Plot retention rates
plt.figure(figsize=(10, 8))
plt.title('Retention rates')
sns.heatmap(data=retention, annot=True, fmt='.0%', vmin=0.0, vmax=0.5, cmap='BuGn')
plt.show()
# Plot average quantity
plt.figure(figsize=(10, 8))
plt.title('Average Quantity')
sns.heatmap(data = average_quantity, annot=True, cmap='Blues')
plt.show()
# Plot average price
plt.figure(figsize=(10, 8))
plt.title('Average Price')
sns.heatmap(data = average_price, annot=True, cmap='Blues')
plt.show()

Customer Retention Rate

Note that the first column will always be 100%, as it represents the total size of each cohort month. The distinct diagonal shape of the cohort heatmap shows that at the time of the data download (Dec 2011), only the first cohort (Dec 2010) could have made a purchase in its 13th month since the first transaction. This logic extends to the rest of the cohorts. Notice that in our Dec 2010 cohort there is a spike in the retention rate (50%), which could indicate the seasonal shopping crowd returning.

Average Quantity Purchased per Cohort

Here, we can observe the average quantity purchased by each cohort. Interestingly, the customers that are retained in subsequent months purchase more products from the online store. That’s a good sign!

Average Price of Products Purchased by Cohort

While the average price of items purchased remains consistent or even declines over time, there are some anomalies in our heatmap that are worth exploring. Could it be that a particular marketing campaign launched during that period sold higher-priced products to our customers? If so, we should invest more in those campaigns!

Conclusion

Cohort analysis is a powerful way to understand your customer retention and purchasing behavior over time. As seen, it doesn’t require any machine-learning algorithms, just some simple pivot tables. While this was done in Python, the logic can easily be extended to Excel or Google Sheets to create similar heatmaps from your data. Also, if you’re using Google Analytics, cohort analysis is built into the platform! Here is an article on how you can leverage the power of cohort analysis in Google Analytics. Next time, we will take a look at another customer segmentation model, RFM. Stay tuned!


