Customer Segmentation Movement in SQL
Customer segmentation problem in SQL using BigQuery public dataset
A few days ago I explored the BigQuery public dataset and I came to one of them. The problem that I tried to solve is customer segmentation movement. There are two sub-problems that I found quite interesting to solve.
There is a database called
new_york_citibike in the BigQuery public dataset. This database contains two tables
citibike_trips. The table that we will focus for this problem is the
citibike_trips. It has 16 fields — tripduration, starttime, stoptime, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bikeid, usertype, birth_year, gender, customer_plan which each row represents one trip of a particular customer from a station to its stop station. It looks like this:
If you want to follow along with me you can access this dataset in BigQuery, free of charge.
- To see the number of monthly users in each segment
- To see the number of users moving from one segment to another from a month to the next month for each month
Note: I used the terms person, customer, and user interchangeably here. Which refers to the same thing.
Get the trips with starttime from 2017–10–01 onwards. Then combine usertype, birth_year and gender into 1. Assume every unique combination represents one user (One broader type of user). Include user with missing usertype, birth_year or gender.
Here comes the query:
The next requirement is to classify users into segments based on their trips data that month:
- 0 distinct start_station_name (or you can use start_station_id too, since both mapped 1 to 1) as ‘inactive’
- 1–10 distinct start_station_name as ‘casual’
- > 10 distinct start_station_name as ‘power’
From here we might want to know how many each person has trips in a particular month right? Below is the query to do that:
Let’s take a moment to see and think about the requirements and the data we already have. Actually, this part is the tricky part. From the data, we can just do it directly map a user based on their
num_trips. But, we will lose info for the users that is inactive or users that has 0 trips in a particular month. Since our current query only gives data on a user that has info in a particular month.
So, here is the solution to that:
We need to first, before we mapped to their segment based on requirements, impute a user that does not have transaction history in that month with 0. For example, if user X has info in months 3 and 4 but not 1, then we need to impute month 1 with 0.
Here are the steps to do that:
1. We will need to get a table that has the column
year_month from 2017–10–01 onwards.
2018–05–31 is the biggest date of starttime in our citibike_trips table.
2. We will need to get a table that has the column of all our unique person
3. Then we need to get the complete combination (Cartesian product) of both year_month from
To do that we can use syntax cross join or we can just put both after keyword from by separating with comma:
4. Then we need to do left join from the query in step number 3 with our
base_query_2.sql and impute the null value with 0
Since now we already imputed 0 for a person that does not have a single trip info in a month we can safely apply the segmenting rule using the query below:
At this point we can answer the first goal already with this query (the complete query):
And the following is the answer to our first question / goal:
Now, it’s time to answer the second goal. Let me rewrite a bit about our second goal here:
We will compute the movements of users between segments for the next month. For example: from January 2018 to February 2018, how many casual users stayed as casual, became power, or became inactive? And we will do it for every month.
To do that is actually pretty simple. We will just need to tweak a little bit the query from our previous goal. Like so -
From the query we know that we need to exclude where next_month_segment is null, which is pretty obvious because there is no next month for our last month (May 2018).
And here is the result:
That is it, what an outstanding utilization of SQL! From the data, we can know that inactive users are most likely to stay and power users becoming inactive is so unlikely.
To get a clearer insight we can make a visualization of the data that we have now. But, let’s leave it for my next article’s topic lol :D.
Alright, that’s all. Should you have any questions or anything to discuss feel free to leave them in the comment section!
And lastly, let’s connect on LinkedIn! :D