This project was completed as a part of Udacity's Machine Learning Nanodegree.
Introduction
In this project, we will analyze a dataset containing data on various customers' annual spending amounts (reported in monetary units) of diverse product categories for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.
The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with the focus instead on the six product categories recorded for customers.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualizations code visuals.py
! wget https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/customer_segments/visuals.py
import visuals as vs
import pickle
# Pretty display for notebooks
%matplotlib inline
# Load the wholesale customers dataset
try:
    data = pd.read_csv("https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/customer_segments/customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print(f"Wholesale customers dataset has {data.shape[0]} samples with {data.shape[1]} features each.")
except Exception:
    print("Dataset could not be loaded. Is the dataset missing?")
In this section, we will begin exploring the data through visualizations and code to understand how each feature is related to the others.
The dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. The code block below produces a statistical summary for each of the above product categories.
# Display a description of the dataset
display(data.describe())
To get a better understanding of the customers and how their data will transform through the analysis, let's select a few sample data points and explore them in more detail.
indices = [26,176,392]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
Consider the total purchase cost of each product category and the statistical description of the dataset above for our sample customers. What kind of establishment (customer) could each of the three samples we've chosen represent?
Looking at the total purchase of each product category above and comparing them with the medians of the distributions, we can guess that:
One interesting question is whether one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then scoring how well that model can predict the removed feature.
Let's do this for the 'Milk' feature.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
new_data = data.drop(['Milk'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(new_data,data['Milk'],test_size=0.25,random_state=101)
regressor = DecisionTreeRegressor(random_state=101).fit(X_train,y_train)
score = regressor.score(X_test,y_test)
print(score)
We tried to predict the 'Milk' feature (i.e. annual spending on milk products), based on the other features in the dataset (annual spending on other product categories).
The R2 score on the held-out test set was 0.2957. R2 is at most 1 and can be negative when a model does worse than always predicting the mean, so a score this low means the model explains only a small part of the variation in customers' milk spending, although it is possible that there's some correlation between this feature and others.
The 'Milk' feature therefore appears to be necessary for identifying customers' spending habits, because spending on Milk cannot be predicted well from spending on the other product categories. We can say that the 'Milk' feature adds extra (and maybe key) information to the data which a model cannot easily infer from the other features alone.
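The same leave-one-feature-out check can be repeated for every column rather than only 'Milk'. The sketch below is a minimal illustration on synthetic data, not the notebook's dataset: the `toy` frame, its coefficients, and the `feature_r2` helper are assumptions made up for the example. One column is deliberately built from another, so it should be easy to predict, while an independent column should not be.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the spending data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 400
grocery = rng.lognormal(mean=8.0, sigma=0.5, size=n)
toy = pd.DataFrame({
    "Grocery": grocery,
    # Built from Grocery plus multiplicative noise, so it is largely redundant
    "Detergents_Paper": 0.4 * grocery * rng.lognormal(0.0, 0.1, size=n),
    # Independent of the other two columns
    "Fresh": rng.lognormal(mean=9.0, sigma=0.5, size=n),
})

def feature_r2(df, target, seed=0):
    """Hypothetical helper: test-set R2 when predicting `target` from the other columns."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)
    return model.score(X_test, y_test)

scores = {col: feature_r2(toy, col) for col in toy.columns}
for col, r2 in scores.items():
    print(f"{col}: R2 = {r2:.3f}")
```

A high score suggests a column is largely redundant given the others, while a low or negative score suggests it carries information the other columns do not.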
To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If it is found that the feature we attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if we believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
Looking at the plot above, there are a few pairs of features that exhibit some degree of correlation. They include:
Since we tried to predict the 'Milk' feature earlier, this supports the suspicion that Milk isn't correlated with most of the features in the dataset, although it shows a mild correlation with 'Grocery' and 'Detergents_Paper'.
The distributions of all the features appear similar: each is strongly right-skewed, with most data points falling in the first few intervals. Judging by the summary statistics calculated earlier, especially the mean and maximum values, we can expect some outliers in each distribution. This is consistent with the significant difference between the mean and the median of each feature.
In this section, we will preprocess the data to create a better representation of customers by scaling the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in ensuring that the results you obtain from your analysis are significant and meaningful.
If data is not normally distributed, especially if the mean and median differ significantly (indicating a large skew), it is often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which estimates the power transformation of the data that best reduces skewness. A simpler approach which can work in most cases is to apply the natural logarithm.
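To make the comparison between the two scalings concrete, here is a small sketch using `scipy.stats.boxcox` on synthetic right-skewed data. The `spending` array is an assumption generated for illustration. It is drawn from a log-normal distribution, for which the fitted Box-Cox exponent should land near zero, and an exponent of zero is the case in which Box-Cox reduces to the natural logarithm.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "spending" values (assumption: not the real dataset)
rng = np.random.default_rng(0)
spending = rng.lognormal(mean=8.0, sigma=1.0, size=500)

# With no lambda supplied, boxcox returns the transformed data and the fitted lambda
boxcox_values, fitted_lambda = stats.boxcox(spending)
log_values = np.log(spending)

skew_raw = stats.skew(spending)
skew_boxcox = stats.skew(boxcox_values)
skew_log = stats.skew(log_values)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness raw: {skew_raw:.3f}, Box-Cox: {skew_boxcox:.3f}, log: {skew_log:.3f}")
```

Box-Cox requires strictly positive inputs, as does the logarithm, so zero-valued spending entries would need separate handling.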
# Scale the data using the natural logarithm
log_data = data.apply(lambda x: np.log(x))
# Scale the sample data using the natural logarithm
log_samples = samples.apply(lambda x: np.log(x))
# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal.
Let's check out our log transformed samples.
# Display the log-transformed sample data
display(log_samples)
Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point whose value for a feature lies more than one outlier step below Q1 or above Q3 for that feature is considered abnormal.
# Select the indices for data points you wish to remove
outliers = []
# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature],25)
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature],75)
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = (Q3-Q1) * 1.5
    # Display the outliers
    print(f"Data points considered outliers for the feature '{feature}':")
    out = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    display(out)
    outliers = outliers + list(out.index.values)
# Keep only the data points flagged as outliers for more than one feature
outliers = list(set([x for x in outliers if outliers.count(x) > 1]))
print(f"Outliers: {outliers}")
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print(f"The good dataset now has {len(good_data)} observations after removing outliers.")
Upon quick inspection, our sample doesn't contain any of the outlier values.
There were 5 data points that were considered outliers for more than one feature based on our definition above. So, instead of removing all outliers (which would result in us losing a lot of information), only outliers that occur for more than one feature are removed.
We can also analyse these outliers independently to answer questions about how or when they occur (root cause analysis), but they might not be suitable for an aggregate analysis.
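The rule used above (flag values beyond 1.5 times the IQR, then keep only rows flagged in more than one feature) can also be written in vectorized form. The sketch below is an illustration on a tiny hand-made frame. The `toy` data and the `tukey_outlier_mask` helper are assumptions for the example and are not part of the notebook.

```python
import pandas as pd

def tukey_outlier_mask(df, k=1.5):
    """Hypothetical helper: True where a value lies more than k * IQR outside [Q1, Q3]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    step = k * (q3 - q1)
    # Comparisons broadcast column by column against the per-feature bounds
    return (df < q1 - step) | (df > q3 + step)

# Tiny made-up frame: row 7 is extreme in "a" and "b", row 1 only in "c"
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, -50.0],
    "c": [5.0, 500.0, 5.0, 6.0, 6.0, 7.0, 7.0, 6.0],
})

mask = tukey_outlier_mask(toy)
flag_counts = mask.sum(axis=1)  # number of features in which each row is an outlier
multi_feature_outliers = flag_counts[flag_counts > 1].index.tolist()

print(flag_counts.to_dict())
print("Rows flagged in more than one feature:", multi_feature_outliers)
```

Counting flags per row makes the removal threshold explicit, so changing the rule from "more than one feature" to some other count is a one-line change.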
In this section we will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.
Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can apply PCA to good_data to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension: how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA().fit(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
The first and second principal components, in total, explain approx. 70.8% of the variance in our data.
The first four principal components, in total, explain approx. 93.11% of the variance.
In terms of customer spending,
Let's see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions.
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
When using principal component analysis, one of the main goals is to reduce the dimensionality of the data and, in effect, the complexity of the problem. Dimensionality reduction comes at a cost: using fewer dimensions means that less of the total variance in the data is explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
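One way to read the cumulative explained variance ratio is to find the smallest number of components that reaches a chosen threshold. The sketch below illustrates this on synthetic data. The six-column matrix `X`, the two hidden factors behind it, and the 0.90 `threshold` are all assumptions picked for the example, not values taken from this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic six-feature data driven by two hidden factors plus a little noise
# (assumption: this only mimics the shape of the problem, not the real data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 6))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.90  # illustrative cut-off
# Index of the first cumulative value >= threshold, converted to a component count
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", np.round(cumulative, 4))
print(f"Components needed to reach {threshold:.0%}: {n_needed}")
```

The threshold itself is a judgment call. A lower value keeps fewer dimensions and makes plotting easier, at the price of discarding more variance.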
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)
# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
Let's see how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions.
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.
Run the code cell below to produce a biplot of the reduced-dimension data.
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.
In this section, we will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data. We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.
From what we know of both models:
Advantages of K-Means clustering:
Advantages of Gaussian Mixture Model clustering:
Given what we know about the wholesale customer data so far, we'll choose Gaussian Mixture Model clustering over K-Means. This is because there might be some hidden patterns in the data that we may miss by assigning only one cluster to each data point. For example, take the case of the supermarket customer in our sample: after PCA, it had similar, high positive weights for multiple dimensions, i.e. it didn't belong to one dimension over the other. So a supermarket may be a combination of a fresh produce store, a grocery store and a frozen goods store.
We'll choose GMM so that we don't miss cases like these.
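The soft assignment that motivates this choice is exposed in scikit-learn through `GaussianMixture.predict_proba`, which returns one membership probability per component instead of a single label. The sketch below shows this on two synthetic blobs. The data and the two probe points are assumptions made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic, partly overlapping blobs (assumption: illustrative data only)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(150, 2))
blob_b = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(150, 2))
X = np.vstack([blob_a, blob_b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# One probe point between the blobs, one deep inside a single blob
boundary_point = np.array([[0.0, 0.0]])
far_point = np.array([[-4.0, 0.0]])

proba_boundary = gmm.predict_proba(boundary_point)[0]
proba_far = gmm.predict_proba(far_point)[0]

print("Membership probabilities between the blobs:", np.round(proba_boundary, 3))
print("Membership probabilities inside one blob:  ", np.round(proba_far, 3))
```

A point lying between the components receives split probabilities, which is information a hard K-Means assignment would discard.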
Depending on the problem, the number of clusters that we expect to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.
n_clusters = [8,6,4,3,2]
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
for n in n_clusters:
    # Apply your clustering algorithm of choice to the reduced data
    clusterer = GaussianMixture(n_components=n).fit(reduced_data.values)
    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data.values)
    # Save the fitted model; the file is overwritten on each pass, so it ends up holding the last model (n = 2)
    with open('customer_segments.pkl', 'wb') as f:
        pickle.dump(clusterer, f)
    # Find the cluster centers
    centers = clusterer.means_
    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)
    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data,preds)
    print(f"The silhouette_score for {n} clusters is {score}")
Of the several cluster numbers tried, 2 clusters had the best silhouette score.
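For intuition about what the silhouette score averages, the coefficient of a single point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. The sketch below computes this by hand on six made-up points and compares the result with scikit-learn's `silhouette_samples`. The points, labels, and the `manual_silhouette` helper are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Six made-up points forming two tight, well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(X, labels, i):
    """Hypothetical helper: silhouette coefficient of point i from its definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster average
    a = dists[same].mean()
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = np.array([manual_silhouette(X, labels, i) for i in range(len(X))])
reference = silhouette_samples(X, labels)

print("Manual:   ", np.round(manual, 4))
print("Reference:", np.round(reference, 4))
print("Mean silhouette:", round(float(manual.mean()), 4))
```

Values near 1 indicate points sitting well inside their cluster, values near 0 indicate points on a boundary, and negative values indicate points that are closer to another cluster.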
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)
Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.
# Inverse transform the centers
log_centers = pca.inverse_transform(centers)
# Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = [f'Segment {i}' for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
An interesting observation here could be, considering the total purchase cost of each product category for the representative data points above, and referencing the statistical description of the dataset at the beginning of this project, what set of establishments could each of the customer segments represent?
Taking an educated guess,
Segment 0: This segment best represents supermarkets. They spend more than the median on Milk, Grocery, Detergents_Paper and Delicatessen, all of which need to be stocked in such places.
Segment 1: This segment best represents restaurants. Their spend on Fresh and Frozen is higher than the median, and their spend on Delicatessen is lower than the median but still close to it. Their spend on Milk, Grocery and Detergents_Paper is lower than the median, which supports our assessment.
Let's find which cluster each sample point is predicted to be.
# Display the predictions
for i, pred in enumerate(sample_preds):
print("Sample point", i, "predicted to be in Cluster", pred)
Our guesses for sample points 0, 1 and 2 were restaurant, supermarket and cafe. It seems that we were close on the predictions for sample points 0 and 2, but incorrect, or rather inconsistent, on sample point 1. Looking at the cluster visualization in the previous section, it could be that sample 1 is the point close to the boundary between the two clusters.
In this final section, we will investigate ways that you can make use of the clustered data. First, we will consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme. Then, we will consider how giving a label to each customer (which segment that customer belongs to) can provide for additional features about the customer data.
Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.
Making the change to the delivery service means that products will be delivered fewer times in a week.
The wholesale distributor can use the clusters to decide whom to run the A/B test on, but the test should be run on one cluster at a time. The two clusters represent different types of customers, so their delivery needs, and therefore their reactions to the change, are likely to differ. In other words, the control and experiment groups for any given test should both be drawn from the same cluster.
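Drawing the control and experiment groups from within each cluster can be sketched as a random split performed separately per segment. The example below uses a made-up table of 20 customers. The `customers` frame, its column names, and the `assign_groups` helper are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Made-up customers with an already assigned segment label (assumption)
customers = pd.DataFrame({
    "customer_id": np.arange(20),
    "segment": [0] * 12 + [1] * 8,
})

def assign_groups(segment_frame, rng):
    """Hypothetical helper: randomly split one segment into two equal-sized groups."""
    order = rng.permutation(len(segment_frame))
    half = len(segment_frame) // 2
    groups = np.where(order < half, "control", "treatment")
    return pd.Series(groups, index=segment_frame.index)

customers["group"] = "unassigned"
for segment_id, frame in customers.groupby("segment"):
    # Randomize within the segment so both groups share the same customer type
    customers.loc[frame.index, "group"] = assign_groups(frame, rng)

counts = customers.groupby(["segment", "group"]).size()
print(counts)
```

Each segment's reaction to the 3-day schedule would then be compared only against the control group from the same segment.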
Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can consider 'customer segment' as an engineered feature for the data. Assume the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a customer segment to determine the most appropriate delivery service.
To label the new customers, the distributor will first need to build and train a supervised learner on the data that we labeled through clustering. The data to fit will be the estimated spends, and the target variable will be the customer segment, i.e. 0 or 1 (supermarket or restaurant). They can then use the classifier to predict segments for new incoming data.
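A minimal sketch of this workflow follows, assuming synthetic two-feature data in place of the real log-spending values and logistic regression as the supervised learner. Both are illustrative choices rather than the project's prescribed method, and `new_customers` holds invented estimates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

# Synthetic log-spending on two product categories for two customer types
# (assumption: stands in for the real, log-transformed customer data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[8.0, 6.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[6.0, 9.0], scale=0.5, size=(100, 2)),
])

# Step 1: clustering produces the segment labels (the engineered feature)
segments = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Step 2: train a supervised learner with the segments as the target
classifier = LogisticRegression().fit(X, segments)

# Step 3: predict segments for new customers from their estimated spending
new_customers = np.array([[8.1, 6.2], [5.9, 9.1]])  # invented estimates
new_segments = classifier.predict(new_customers)

print("Training accuracy against cluster labels:", classifier.score(X, segments))
print("Predicted segments for new customers:", new_segments)
```

New estimates would need the same preprocessing as the training data (log transform and, if used, the PCA projection) before being passed to the classifier.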