Grocery Item Clustering

In this project I used the K-means algorithm to cluster grocery items based on their transaction data.

Items that are often purchased together can be placed in the same aisle or in aisles close to each other, which can help increase sales!

This used to be done by human experts, and it took many years of industry experience to narrow things down. However, with the rise of big data and machine learning, why not let AI do the hard work for you?

OK, let’s get started. The first step is to gather the dataset. Here I downloaded the Instacart dataset from Kaggle and filtered out items that don’t yet have a sufficient purchase history in the dataset (purchased fewer than 100 times), because they may not carry enough information to be clustered correctly (i.e. they may end up forming weird 1-item categories).

# Load dependencies
import pandas as pd
from scipy import sparse
from sklearn import metrics
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 30)

# Load the Instacart data and keep only items that have been purchased at least 100 times
df = pd.read_csv('data/order_products__train.csv')
df = df.drop(['add_to_cart_order', 'reordered'], axis=1)
df = df.groupby('product_id').filter(lambda x:len(x) >= 100)

df_products = pd.read_csv('data/products.csv')
df_aisles = pd.read_csv('data/aisles.csv')
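
The groupby(...).filter(lambda ...) call above runs a Python function once per product, which can get slow on large tables. A possible vectorized alternative, sketched here on a made-up toy frame (the threshold of 2 stands in for the 100 used on the real data), relies on transform('size'):

```python
import pandas as pd

# Toy stand-in for the order/product pairs (values are made up)
toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3],
    'product_id': [10, 20, 10, 30, 10, 20],
})

min_purchases = 2  # the post uses 100 on the real data

# Number of rows per product, broadcast back onto every row
counts = toy.groupby('product_id')['product_id'].transform('size')

# Keep only rows whose product reaches the threshold
toy_filtered = toy[counts >= min_purchases]

print(toy_filtered)
```

Product 30 appears only once in the toy data, so it is the only one dropped.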

The main table is pretty straightforward: classic order_id - product_id pairs.

df.head()

	order_id	product_id
1	1	11109
2	1	10246
3	1	49683
5	1	13176
6	1	47209

Now let’s transform the order data into a matrix where:

each row is an order
each column is a product
a value of 1 indicates that the product was purchased in that order (0 otherwise)

df['filler'] = 1
df = df.pivot(index='order_id', columns='product_id', values='filler').fillna(0)

df.head()

product_id	10	34	45	79	95	116	117	130	141	160	...	49481	49517	49520	49533	49585	49605	49610	49621	49628	49683
order_id
1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
36	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
38	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
96	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
98	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 2457 columns

print('Number of Transactions:', df.shape[0])
print('Number of Items:', df.shape[1])

Number of Transactions: 125956
Number of Items: 2457
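
The pivot above materializes a dense 125956 × 2457 table before anything is converted to a sparse matrix. If memory becomes a problem, one option is to build the order-by-product matrix in sparse form directly. This is only a sketch on made-up toy data, and it assumes each (order_id, product_id) pair appears once (duplicate pairs would be summed rather than kept at 1):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy order/product pairs (made up); each pair appears once
toy = pd.DataFrame({
    'order_id':   [1, 1, 36, 38, 38],
    'product_id': [10, 34, 34, 10, 45],
})

# Map ids to consecutive row/column positions (categories are sorted)
orders = toy['order_id'].astype('category')
products = toy['product_id'].astype('category')

# One entry of 1 per purchased (order, product) pair
basket = sparse.csr_matrix(
    (np.ones(len(toy)),
     (orders.cat.codes.to_numpy(), products.cat.codes.to_numpy())),
    shape=(len(orders.cat.categories), len(products.cat.categories)),
)

print(basket.toarray())
print(list(products.cat.categories))  # product_id behind each column
```

The category lists keep the mapping from matrix positions back to the original ids, which plays the role of df.columns later on.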

Cool, now that everything has been set up, let’s transform the data into a similarity matrix. (Here I used cosine similarity, but other metrics should work as well.)

Essentially, items that always appear in the same orders will have a similarity of 1, whereas those that have never been purchased together will have a similarity of 0. Items that are often, but not always, purchased together land somewhere in between.

# Save memory by converting to a sparse matrix
data = df.to_numpy()
data_sparse = sparse.csr_matrix(data)
data_clustering = metrics.pairwise.cosine_similarity(data_sparse.T)
data_clustering = sparse.csr_matrix(data_clustering)
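
To make the similarity values concrete, here is a small sketch on a made-up 4-order, 4-product matrix (the products A to D are hypothetical). It uses the same transpose trick as above so that each product becomes a row:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Rows are orders, columns are products A, B, C, D (made-up data)
toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])

# Transpose so each product becomes a row, then compare products pairwise
sim = cosine_similarity(sparse.csr_matrix(toy).T)

print(np.round(sim, 2))
```

A and B are always bought together and score 1. C shows up in one of the three orders containing A and scores about 0.58 against it. D never shares an order with the others and scores 0.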

Then, time to apply the K-means algorithm using the similarity matrix we just obtained!

How do we choose K though? In practice it usually depends on the business requirements.

However, what we can do here is try a range of Ks and look at the inertia plot.

A range between 100 and 200 is probably a good start since we don’t want to make things too complicated (with 2457 items, this works out to roughly 12 to 25 types of item per aisle).

cluster_inertia = []
for i in range(100, 220, 20):
    model = KMeans(n_clusters=i)
    model.fit(data_clustering)
    cluster_inertia.append(model.inertia_)

plt.plot(range(100, 220, 20), cluster_inertia)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

[Figure: inertia versus number of clusters]

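According to the plot, the line is still going down at 200, so 200 seems to be a good choice within this range. Keep in mind that inertia generally decreases as K grows, so this is a judgment call rather than a strict optimum.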
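
When reading an inertia plot, it helps to remember that inertia generally keeps shrinking as K grows, so the usual thing to look for is an elbow where the drops flatten out. A minimal sketch on synthetic data (make_blobs with four groups, not the Instacart matrix; all names here are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups (not the Instacart data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Relative drop in inertia when going from k-1 to k clusters
for k, prev, cur in zip(ks[1:], inertias, inertias[1:]):
    print(f'K={k}: inertia {cur:.1f}, drop {(prev - cur) / prev:.0%}')
```

On data like this the relative drops are expected to be large up to the true number of groups and small afterwards. On real transaction data the elbow is often much less obvious, which is why business constraints end up playing a role.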

Let’s plug it in!

kmeans = KMeans(n_clusters=200)

clusters = kmeans.fit_predict(data_clustering)
final_clusters = pd.DataFrame({'cluster':clusters,
                               'product_id':df.columns})
df_cluster = final_clusters.sort_values('cluster')
df_cluster = pd.merge(df_cluster, df_products, how='left', on='product_id')
df_cluster = df_cluster.drop(['aisle_id', 'department_id'], axis=1)

df_cluster

	cluster	product_id	product_name
0	0	31964	Passionfruit Sparkling Water
1	0	49520	Orange Sparkling Water
2	0	49191	Cran Raspberry Sparkling Water
3	1	5883	Organic Strawberry Lemonade
4	1	8929	Organic Chicken Pot Pie
...	...	...	...
2452	199	32747	Low Fat 1% Milk
2453	199	45116	Potatoes Sweet
2454	199	40354	California Clementines
2455	199	7419	Sweet Red Grape Tomatoes
2456	199	11265	100% Natural Diced Tomatoes

2457 rows × 3 columns

Let’s get some basic stats on our clusters.

# distplot is deprecated in recent seaborn versions; histplot is the axes-level replacement
sns.histplot(df_cluster['cluster'].value_counts(), kde=True)
plt.show()

[Figure: distribution of the number of products per cluster]

The group counts seem quite reasonable. Let’s inspect some of the categories manually!

df_cluster.query('cluster == 1')

	cluster	product_id	product_name
3	1	5883	Organic Strawberry Lemonade
4	1	8929	Organic Chicken Pot Pie
5	1	28769	Organic Apple Chicken Sausage
6	1	29370	Organic White Cheddar Popcorn
7	1	16363	Gluten Free Breaded Chicken Breast Tenders
8	1	37971	Organic Cranberry Pomegranate Juice
9	1	35383	Classic White Bread
10	1	45957	Peach Mango Salsa
11	1	30967	Organic Frosted Flakes Cereal
12	1	13640	Asian Pears
13	1	18200	Aged White Cheddar Gluten-Free Baked Rice And ...
14	1	40878	Macaroni And Cheese
15	1	37464	Apple Cinnamon Instant Oatmeal
16	1	39160	Kefir Cultured Strawberry Milk Drink

df_cluster.query('cluster == 2')

	cluster	product_id	product_name
17	2	17429	Jalapeno Hummus
18	2	39475	Total Greek Strained Yogurt
19	2	5479	Italian Sparkling Mineral Water
20	2	10017	Tilapia Filet
21	2	21543	Organic Quick Oats
22	2	16521	Walnut Halves & Pieces
23	2	16083	Organic Large Brown Eggs
24	2	45965	Steel Cut Oats
25	2	31869	Organic Edamame
26	2	19660	Spring Water
27	2	20842	Total 0% Greek Yogurt
28	2	5134	Organic Thompson Seedless Raisins
29	2	20734	Organic Medjool Dates
30	2	7503	Whole Almonds
31	2	17872	Total 2% Lowfat Plain Greek Yogurt
32	2	44422	Organic Old Fashioned Rolled Oats
33	2	18027	Ezekiel 4:9 Bread Organic Sprouted Whole Grain

df_cluster.query('cluster == 3')

	cluster	product_id	product_name
34	3	27398	Genuine Chocolate Flavor Syrup
35	3	12099	Honey Greek Yogurt
36	3	36994	Organic Graham Crunch Cereal
37	3	38563	Mint Chocolate Chip Ice Cream
38	3	32156	Cranberry Juice Cocktail
39	3	40486	Chicken Tenders
40	3	35199	100% Apple Juice
41	3	30597	French Vanilla Coffee Creamer
42	3	15780	Breakfast Blend Medium Roast Ground Coffee
43	3	43889	Dark Chocolate Covered Banana
44	3	17334	Coffee Ice Cream
45	3	3479	Classic Whipped Cream
46	3	19511	Half And Half
47	3	38548	Gala Apple
48	3	45061	Natural Vanilla Ice Cream
49	3	18171	Natural Sunflower Spread

Voila!

For cluster 1 I can’t find an obvious category, but apparently these items are often bought together!
Cluster 2 seems to be healthy/organic food.
Cluster 3 seems to be sweets, breakfast and coffee items.
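
Eyeballing clusters doesn’t scale to 200 of them. One possible way to summarize them, not part of the original analysis, is to look up the most common existing aisle in each cluster. The sketch below uses made-up toy tables; it assumes products.csv carries an aisle_id column (dropped in the code above) and that aisles.csv has aisle_id and aisle columns, so check the column names against your copy of the data:

```python
import pandas as pd

# Toy stand-ins for the cluster assignments and lookup tables (made up)
toy_clusters = pd.DataFrame({'cluster': [0, 0, 0, 1, 1],
                             'product_id': [1, 2, 3, 4, 5]})
toy_products = pd.DataFrame({'product_id': [1, 2, 3, 4, 5],
                             'aisle_id': [7, 7, 8, 9, 9]})
# Column name 'aisle' is an assumption about aisles.csv
toy_aisles = pd.DataFrame({'aisle_id': [7, 8, 9],
                           'aisle': ['water', 'juice', 'yogurt']})

merged = (toy_clusters
          .merge(toy_products, on='product_id', how='left')
          .merge(toy_aisles, on='aisle_id', how='left'))

# Most common aisle per cluster and the share of the cluster it covers
summary = merged.groupby('cluster')['aisle'].agg(
    top_aisle=lambda s: s.value_counts().idxmax(),
    top_share=lambda s: s.value_counts(normalize=True).max(),
)

print(summary)
```

A cluster with a high top_share mostly mirrors an existing aisle, while a low one (like cluster 1 above, presumably) mixes items from many aisles and may be the more interesting finding.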
