How to Build a Market Basket Analysis Dashboard with SQL

Zettaflow blog. Figure 1. Market Basket Analysis Dashboard
Figure 1. Market Basket Analysis Dashboard. (Follow the link for the actual dashboard).

This article is about market basket analysis, how to perform it using SQL, and how to present the result in the form of a real-time dashboard. 

For the purpose of the article, we use a retail dataset of a global superstore: Superstore Sales. This is. We will work with data for 2014. 

If you find this article useful, please give it a boost by sharing it on your social media. 

Index:

Application of Market Basket Analysis

Market Basket Analysis Dashboard

Customer Characteristics

Order Characteristics

Item Popularity

Store

Time

Association Analysis

Support

Confidence

Lift

Good and bad findings

Conclusion

Application of Market Basket Analysis

Market basket analysis plays a crucial role in the analysis of retail businesses. The contents of the shopping carts of the customers may provide valuable information and help design promotions, ensure better stock management, identify the drivers of the purchase, build cross-sale models, and boost sales. 

The most commonly used technique for market basket analysis is association analysis, which generates association rules. 

Although market basket analysis and association rules originate from the analysis of retail transactions they may be applied to other business domains. For instance: 

  • Electronic medical record (EMR) systems may suggest to the physicians the possible need for a back MRI for a patient who has had this procedure in the past; or remind the physician when prescribing antibiotics also to prescribe probiotics.
  • Which other parts of a website are explored by the visitors that came from a particular landing page?
  • If a passenger entered the Mall of the Emirates metro station and exited the same station two hours later paying the minimum fare, was there system abuse?

Market basket analysis should not be limited to the application of association analysis. 

It should also not be limited to the items in the shopping cart. 

Since e-commerce with its user registration, as well as loyalty cards implemented by retail stores, allow customer identification and tracking of his or her purchases over time, the customer characteristics must be definitely considered.

  • Why does a particular customer always buy oil filters but never buy engine oil?

And you should not miss the store entity. 

  • Why is this beautiful reproduction of the Sagrada Familia a bestseller in Barcelona’s Eixample, but not a good seller in El Poblenou?

However, the most important entity to watch is the order or the transaction. It includes the following characteristics in our dataset: order_id, customer_id, product_id, order_date, ship_mode, shipping_cost, order_priority, ship_date, sales, discount, and profit. 

Zettaflow blog. Figure 2. Four entities of Market Basket Analysis
Figure 2. Four entities of Market Basket Analysis connected to each other via the order

Market Basket Analysis Dashboard

Market basket analysis dashboard may be used to make decisions about product promotions, visual merchandising, store segmentation, customer targeting, and more. 

To build the market basket analysis dashboard we use the Overfetch platform. We connect it to the dataset and use the platform to fetch the data with SQL queries and visualize the result.

We start with some basic measures that do not require association rules. 

Customer Characteristics:

Proportion of customers and average purchase size of the customers who purchase the most popular products (100 best-selling items)

To find out the proportion we use the following query: 

WITH 
orders AS 
(
    SELECT 
        * 
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
best_products AS 
(
    SELECT
        product_id,
        SUM(quantity) AS quantity
    FROM 
        orders
    GROUP BY 
        product_id
    ORDER BY 
        quantity DESC
    LIMIT 100
),
customers AS 
(
    SELECT  
        customer_id
    FROM 
        orders
    GROUP BY 
        customer_id
),
selected_customers AS 
(
    SELECT  
        customer_id
    FROM 
        orders
    WHERE 
        product_id IN (SELECT product_id FROM best_products)
    GROUP BY 
        customer_id
),
result AS 
(
    SELECT
        COUNT(customer_id) AS yes,
        ROUND((SELECT COUNT(customer_id) FROM customers),0) AS total,
        ROUND(((SELECT COUNT(customer_id) FROM customers) - COUNT(customer_id)),0) AS no
        FROM selected_customers
),
pie AS 
(
    SELECT
        yes AS values,
        'Yes' as labels
    FROM result
    UNION
    SELECT  
        no AS values,
        'No' as labels
    FROM result
)
SELECT * FROM pie

For the average purchase size:

WITH 
orders AS 
(
    SELECT * 
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
best_products AS (
    SELECT
        product_id,
        SUM(quantity) AS quantity
    FROM 
        orders
    GROUP BY 
        product_id
    ORDER BY 
        quantity DESC
    LIMIT 100
),
customers AS 
(
    SELECT  
        customer_id,
        SUM(sales) AS sales
    FROM 
        orders
    GROUP BY 
        customer_id
),
selected_customers AS 
(
    SELECT  
        customer_id,
        SUM(sales) AS sales
    FROM 
        orders
    WHERE 
        product_id IN (SELECT product_id FROM best_products)
    GROUP BY 
        customer_id
),
result AS 
(
    SELECT
        ROUND(AVG(sales),2) AS avg_order_size 
    FROM 
        selected_customers
)
SELECT * FROM result

Proportion of customers and average purchase size of customers who purchase the least popular products (100 worst selling items).

To find out the proportion we use this query: 

WITH 
orders AS 
(
    SELECT * 
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
worst_products AS 
(
    SELECT
        product_id,
        SUM(quantity) AS quantity
    FROM 
        orders
    GROUP BY 
        product_id
    ORDER BY 
        quantity
    LIMIT 100
),
customers AS 
(
    SELECT  
        customer_id
    FROM 
        orders
    GROUP BY 
    customer_id
),
selected_customers AS 
(
    SELECT  
        customer_id
    FROM 
        orders
    WHERE 
        product_id IN (SELECT product_id FROM worst_products)
    GROUP BY 
        customer_id
),
result AS 
(
    SELECT
        COUNT(customer_id) AS yes,
        ROUND((SELECT COUNT(customer_id) FROM customers),0) AS total,
        ROUND(((SELECT COUNT(customer_id) FROM customers) - COUNT(customer_id)),0) AS no
        FROM selected_customers
),
pie AS 
(
    SELECT
        yes AS values,
        'Yes' as labels
    FROM 
        result
    UNION
    SELECT  
        no AS values,
        'No' as labels
    FROM 
        result
)
SELECT * FROM pie

For the average purchase size:

WITH 
orders AS 
(
    SELECT * 
        FROM superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
worst_products AS 
(
    SELECT
        product_id,
        SUM(quantity) AS quantity
    FROM 
        orders
    GROUP BY 
        product_id
    ORDER BY 
        quantity
    LIMIT 100
),
customers AS 
(
    SELECT  
        customer_id,
        SUM(sales) AS sales
    FROM 
        orders
    GROUP BY 
        customer_id
),
selected_customers AS 
(
    SELECT  
        customer_id,
        SUM(sales) AS sales
    FROM 
        orders
    WHERE 
        product_id IN (SELECT product_id FROM worst_products)
    GROUP BY 
        customer_id
),
result AS 
(
    SELECT
        ROUND(AVG(sales),2) AS avg_order_size 
    FROM 
        selected_customers
)
SELECT * FROM result

To visualize the proportions I used the Pie Chart, and for the average order size – just a Scaled-up Number. As a result, I got such a visual block: 

Zettaflow blog. Figure 3. Customers that buy 100 best and 100 worst selling items
Figure 3. Customers that buy 100 best-selling and 100 worst-selling products.
  • Would replacing some of those worst-selling items with better-selling ones threaten the customer relationship?

The breadth of the customer relationship (the number of unique items ever purchased) by the depth of the relationship (the number of orders) for customers who purchased more than one item: 

WITH t AS 
(
    SELECT
        customer_id,
        COUNT(DISTINCT product_id) AS unique_items,
        COUNT(DISTINCT order_id) AS unique_orders
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
    GROUP BY 
        customer_id
    ORDER BY 
        customer_id
)
SELECT
    customer_id,
    unique_items,
    unique_orders
FROM t
WHERE 
    unique_items > 1
Zettaflow blog. Figure 4. Customer Relationship.
Figure 4. Customer Relationship.

The result of the query is visualized using a scatter. The concentration of the data points represents the number of customers who have that particular breadth and depth in their relationship. The largest concentration is on the lower left, and the smallest is on the higher up to the right. 

Quite a large number of customers in our case are buying little and rear, although better customers tend to purchase a greater diversity of goods and more often. 

  • Who are those customers on the higher right? 
  • What products do they buy?
  • How can we get more customers of this kind?

Orders per customer distribution 1Y:

SELECT 
    customer_name, 
    COUNT(order_id) AS TotalOrders
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date)= 2014
GROUP BY 
    customer_name

The average number of orders per customer per month for the last month:

SELECT 
    COUNT(DISTINCT order_id) * 1.0 / 
    NULLIF(COUNT(DISTINCT customer_id), 0) AS orders_per_customer 
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date)= 2014
    AND EXTRACT(MONTH FROM order_date) = 12

Change in the average number of orders per customer per month over the last year: 

WITH t AS 
(
    SELECT 
        (
            date_part('year', order_date) * 12 + 
            date_part('month', order_date) 
        ) 
        AS year_month,
        order_id,
        customer_id
    FROM
        superstore_orders
)
SELECT 
    year_month, COUNT(DISTINCT order_id) * 1.0 / 
    NULLIF(COUNT(DISTINCT customer_id), 0) AS orders_per_customer
FROM
    t
GROUP BY
    year_month
ORDER BY
    year_month DESC
LIMIT 12
Zettaflow blog. Figure 5. Orders per customer.
Figure 5. Orders per customer.

Order Characteristics:

Unique items per order distribution 1Y:

SELECT 
    order_id, 
    COUNT(DISTINCT product_id) AS uniqueitems 
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date)= 2014
GROUP BY 
    order_id

The average number of unique items per order per month for the last month:

WITH t AS 
(
    SELECT
        order_id,
        COUNT(DISTINCT product_id) as product_count,
        date_part('year', order_date) * 12 + 
        date_part('month', order_date) as year_month
    FROM
        superstore_orders
    GROUP BY
        order_id, year_month    
)
SELECT
    year_month,
    COUNT(order_id) as order_count,
    SUM(product_count) AS product_count,
    SUM(product_count) / COUNT(order_id) as products_per_order
FROM
    t    
WHERE
    year_month = (SELECT MAX(year_month) FROM t)
GROUP BY
    year_month

Change in the average number of unique items per order over the last year:

WITH t AS (
    SELECT
        order_id,
        COUNT(DISTINCT product_id) as product_count,
        date_part('year', order_date) * 12 + 
        date_part('month', order_date) as year_month
    FROM
        superstore_orders
    GROUP BY
        order_id, year_month    
)
SELECT
    year_month,
    COUNT(order_id) as order_count,
    SUM(product_count) AS product_count,
    SUM(product_count) / COUNT(order_id) as products_per_order
FROM
    t    
GROUP BY
    year_month    
order by
    year_month DESC
LIMIT 12

Average order size for the last month:

SELECT 
    AVG(sales)
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date)= 2014
    AND EXTRACT(MONTH FROM order_date) = 12

Changes in average order size over the last year:

WITH t AS 
(
    SELECT 
        (
            date_part('year', order_date) * 12 + 
            date_part('month', order_date) 
        ) 
        AS year_month,
        sales
    FROM
        superstore_orders
)
SELECT 
    year_month, AVG(sales) as avg_sales
FROM
    t
GROUP BY
    year_month
ORDER BY
    year_month DESC
LIMIT 12

Average order size by market 1Y (you may opt for any other important characteristics):

SELECT 
    market,
    AVG(sales)
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date) = 2014 
GROUP BY 
    market
Zettaflow blog. Figure 6. Order Characteristics.
Figure 6. Order Characteristics.

The above visualizations give some insight into the business. We may see a great number of customers buy only one product per order.

  • How to improve cross-selling in the sales process? 

Item Popularity: 

It can be helpful to find out the most common item found in a one-item order. (As we don’t have such orders in the dataset, we go for two-item orders).

WITH orders AS 
(
    SELECT * 
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
selected_orders AS 
(
    SELECT 
        order_id, 
        SUM(quantity) AS total_quantity 
    FROM 
        orders 
    GROUP BY 
        order_id
    HAVING 
        SUM(quantity) = 2
),
selected_products AS 
(
    SELECT 
        product_name, 
        SUM(quantity) AS quantity
    FROM 
        orders
    WHERE order_id IN (SELECT order_id FROM selected_orders)
    GROUP BY 
        product_name 
)
SELECT * FROM selected_products ORDER BY quantity DESC
LIMIT 3

What is the most common item found in a multi-item order? 

WITH orders AS 
(
    SELECT * 
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
selected_orders AS 
(
    SELECT 
        order_id, 
        SUM(quantity) AS total_quantity 
    FROM 
        orders 
    GROUP BY 
        order_id
    HAVING 
        SUM(quantity) > 2
),
selected_products AS 
(
    SELECT 
        product_name, 
        SUM(quantity) AS quantity
    FROM 
        orders
    WHERE 
        order_id IN (SELECT order_id FROM selected_orders)
    GROUP BY 
        product_name 
)
SELECT * FROM selected_products ORDER BY quantity DESC
LIMIT 3

And the most common item found among customers who repeated purchasers:

WITH orders AS 
(
    SELECT * 
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date)= 2014
),
selected_customers AS 
(
    SELECT 
        customer_id, 
        COUNT(order_id) AS total 
    FROM 
        orders 
    GROUP BY 
        customer_id
    HAVING 
        COUNT(order_id) > 10
),
selected_products AS 
(
    SELECT 
        product_name, 
        SUM(quantity) AS quantity
    FROM 
        orders
    WHERE 
        customer_id IN (SELECT customer_id FROM selected_customers)
    GROUP BY 
        product_name 
)
SELECT * FROM selected_products ORDER BY quantity DESC
LIMIT 3
Zettaflow blog. Figure 7. Popular Items.
Figure 7. Popular Items.

These visualizations may promote ideas for growing customer relationships. 

Store:

For the offline stores, it might be good to check how the popularity changes across the different stores. Different stores may have different sales models due to local management decisions, local advertising, demographics, regional trends, and even the region’s climate.

Time

Time is one of the most important attributes of market basket data. It may point to the exact marketing conditions at the time of the sale (promotion, special campaign, PR event, etc.) 

Numerous questions about customer behavior can be answered by looking at sales over time. 

  • What is the best-selling item in this quarter? What about the previous quarter? 
  • How did a particular event affect the sales? 

Association Analysis

When building association rules it is important to choose the proper item set. Figure 8 demonstrates partial product hierarchies in the supermarket. You may see how the level of detail increases from top to bottom.

Zettaflow blog. Figure 8. Product hierarchies.
Figure 8. Product hierarchies.

For the stakeholders of a chain of ice cream parlors, it might be interesting not only the combination of the ice cream flavors, but also toppings as well as types: such as cup, stick, cone, etc., and even whether the ice cream eaten in the ice cream parlor or taken away. 

But when analyzing a large supermarket, probably the initial level of detail for the Ice Cream will be Frozen Desserts. Because as the number of items used in the analysis increases, the number of combinations to consider will grow very quickly. 

After the initial analysis, we can repeat rule generation to hone in on more specific items. And Because the analysis focuses on more specific items, we use only the subset of the transactions that contain those items.

Market basket analysis works better when the items appear in approximately the same number of transactions. This prevents rules from being dominated by the most common items. That is why we may roll up rare items to higher levels in the hierarchy, so they become more frequent. Such generalized items also help find rules with sufficient support. There are more transactions supporting a given rule at higher levels of the hierarchy.

But because some items are generalized does not mean that all items must be moved to the same level up. For instance, in the example shown in Figure 8, we have chosen to generalize all the types of ice creams as Frozen Desserts for the analysis done for a large supermarket, however, expensive items in the same supermarket, such as home appliances, may remain at a low level.

  • What is more important: brand, size, or flavor of tea?
  • Will we consider different volumes of shampoo as one product?

The dataset that we are using to build our dashboard contains three levels of product hierarchies:

Zettaflow blog. Figure 9. Superstore’s product hierarchies.
Figure 9. Superstore’s product hierarchies.

For the purposes of the dashboard that is designed to work in real-time, we choose sub_category,  

which hopefully, will be able to find some useful rules, or at least, promote the ideas that you can confirm with more deep analysis. 

So, we select the following transactions:

SELECT
    order_id AS order,
    sub_category AS items,
    quantity
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date)= 2014
Zettaflow blog. Figure 10. Transactions for Association Analysis.
Figure 10. Transactions for Association Analysis.

There were almost 30,000 transactions in 2014. 

We may group these transactions to see which items are purchased together. 

SELECT
    order_id AS orders,
    string_agg(DISTINCT sub_category, ', ') AS items,
    SUM(quantity)
FROM 
    superstore_orders
WHERE 
    EXTRACT(YEAR FROM order_date)= 2014
GROUP BY 
    order_id
Zettaflow blog. Figure 11. Online orders.
Figure 11. Online orders.

Even in this fragment of the data, you may see that items “Art” and “Binders” are often sold together.

We may count how many times each item was bought.

WITH 
orders AS
(
    SELECT
        order_id,
        sub_category AS item
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS 
(
    SELECT 
        COUNT(DISTINCT order_id) AS value
    FROM
        orders
),
item_orders AS 
(
    SELECT
        item,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders
    GROUP BY 
        item
)
SELECT * FROM item_orders 
ORDER BY order_count DESC 
LIMIT 100
Zettaflow blog. Figure 12. Item sales.
Figure 12. Item sales.

Let’s associate items that are bought together and then count the number of such transactions.

WITH 
orders AS
(
    SELECT
        order_id,
        sub_category AS item
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS 
(
    SELECT 
        COUNT(DISTINCT order_id) AS value
    FROM
        orders
),
associated_orders AS 
(
    SELECT
        o1.item AS item1,
        o2.item AS item2,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders AS o1
    INNER JOIN
        orders AS o2 ON 
            o1.order_id = o2.order_id AND 
            o1.item < o2.item
    GROUP BY 
        item1, item2
)
SELECT * FROM associated_orders
ORDER BY order_count DESC
LIMIT 150
Zettaflow blog. Figure 13. Association Rules.
Figure 13. Association Rules.

As we can see in Figure 13, there are 136 combinations that might be considered. Such a combination of items is called an item set. We do not necessarily consider the quantity of the items in a particular transaction. Item sets are interesting themselves. 

  • Why are those two items purchased together?

Since we analyze online sales where customers purchase relatively few items at one time, two-item sets might be sufficient. We will not waste our efforts searching for the rules with three, four, or more items. However, in some cases, such as with supermarkets, where the average transaction is much larger, more complex rules will be useful.

“Art” and “Binders” were bought together 1,688 times, and are more likely to be purchased together than any other two items. The combination of “Art” and “Fasteners” appears only 680 times. Does it mean that the rule “If Art, then Binders” is better than “If Art, then Fasteners”? We may learn from Figure 12 that the total sales of the Fasteners are only 1588, while the total sales of the “Binders” are 4048.

So, how good is a particular rule?

To find this out we use the following methods: support, confidence, and lift

Support

The most common way to measure support is to calculate the proportion of transactions that contain all the items in the rule.

For the rule “If Art, then Binders” the Support = Number of transactions when “Art” and “Binders” are bought together / Total number of transactions. 

1688 / 7794= 0.2166 or 21.65%

Let’s check Support for all the combinations:

WITH 
orders AS
(
    SELECT
        order_id,
        sub_category AS item
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS 
(
    SELECT 
        COUNT(DISTINCT order_id) AS value
    FROM
        orders
),
associated_orders AS 
(
    SELECT
        o1.item AS item1,
        o2.item AS item2,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders AS o1
    INNER JOIN
        orders AS o2 ON 
            o1.order_id = o2.order_id AND 
            o1.item < o2.item
    GROUP BY 
        item1, item2
),
associated_support AS 
(
    SELECT
        item1,
        item2,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        associated_orders
)
SELECT * FROM associated_support 
ORDER BY support DESC 
LIMIT 100
Zettaflow blog. Figure 14. Support.
Figure 14. Support.

Confidence

Confidence helps to understand how good a rule is at predicting what is on the right-hand side. 

It is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items on the left-hand side. Alternatively, we may count the ratio of all the items in the rule support to the support of the item on the left-hand side. 

For the rule “If Art, then Binders” the Confidence = Support (Number of transactions when “Art” and “Binders” are bought together) / Support (Number of transactions when “Art” is bought). 

0.21658 / 0.41827 = 0.51782 or 51.78%

It means that, when “Art” appears in an order, there is a 51.78% chance that “Binders” also appears in it.

Let’s check Confidence for all the combinations:

WITH 
orders AS
(
    SELECT
        order_id,
        sub_category AS item
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS 
(
    SELECT 
        COUNT(DISTINCT order_id) AS value
    FROM
        orders
),
item_orders AS 
(
    SELECT
        item,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders
    GROUP BY 
        item
),
item_support AS 
(
    SELECT
        item,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        item_orders
),
associated_orders AS 
(
    SELECT
        o1.item AS item1,
        o2.item AS item2,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders AS o1
    INNER JOIN
        orders AS o2 ON 
            o1.order_id = o2.order_id AND 
            o1.item < o2.item
    GROUP BY 
        item1, item2
),
associated_support AS 
(
    SELECT
        item1,
        item2,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        associated_orders
),
associated_confidence AS 
(
    SELECT
        tas.item1,
        tas.item2,
        tas.support AS associated_support,
        tis1.support AS item1_support,
        ROUND(tas.support / tis1.support, 5) as confidence
    FROM
        associated_support AS tas
    INNER JOIN
        item_support AS tis1
        ON tis1.item = tas.item1
)
SELECT * FROM associated_confidence
ORDER BY associated_support DESC 
LIMIT 100
Zettaflow blog. Figure 15. Confidence.
Figure 15. Confidence.

Lift

Lift measures the power of the rule by comparing the full rule to randomly guessing the right-hand side. 

If the item on the right-hand side already is very common, the rule is not telling us anything. 

For the rule “If Art, then Binders” the Lift = Support (Number of transactions when “Art” and “Binders” are bought together) / (Support (Number of transactions when “Art” is bought) * Support (Number of transactions when “Binders” are bought)). 

0.21658 / (0.41827 * 0.51937) = 0.99698

Lift greater than 1 means that the product on the left-hand side increases the chance of buying the product on the right-hand side. 

When lift is less than 1, the rule is doing worse than informed guessing. So is true for the rule “If Art, then Binders”.

Let’s find out Lift for all the rules:

WITH 
orders AS
(
    SELECT
        order_id,
        sub_category AS item
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS 
(
    SELECT 
        COUNT(DISTINCT order_id) AS value
    FROM
        orders
),
item_orders AS 
(
    SELECT
        item,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders
    GROUP BY 
        item
),
item_support AS 
(
    SELECT
        item,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        item_orders
),
associated_orders AS 
(
    SELECT
        o1.item AS item1,
        o2.item AS item2,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders AS o1
    INNER JOIN
        orders AS o2 ON 
            o1.order_id = o2.order_id AND 
            o1.item < o2.item
    GROUP BY 
        item1, item2
),
associated_support AS 
(
    SELECT
        item1,
        item2,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        associated_orders
),
associated_lift AS 
(
    SELECT
        tas.item1,
        tas.item2,
        tas.support AS associated_support,
        tis1.support AS item1_support,
        tis2.support AS item2_support,
        ROUND(tas.support / (tis1.support * tis2.support), 5) as lift
    FROM
        associated_support AS tas
    INNER JOIN
        item_support AS tis1
        ON tis1.item = tas.item1
    INNER JOIN
        item_support AS tis2
        ON tis2.item = tas.item2
)
SELECT * FROM associated_lift
ORDER BY associated_support DESC 
LIMIT 100
Zettaflow blog. Figure 16. Lift.
Figure 16. Lift.

For the dashboard I opted to display only the rules with support greater than 5%, confidence greater than 20% and lift greater than 1.

WITH 
orders AS
(
    SELECT
        order_id,
        sub_category AS item
    FROM 
        superstore_orders
    WHERE 
        EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS 
(
    SELECT 
        COUNT(DISTINCT order_id) AS value
    FROM
        orders
),
item_orders AS 
(
    SELECT
        item,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders
    GROUP BY 
        item
),
item_support AS 
(
    SELECT
        item,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        item_orders
),
associated_orders AS 
(
    SELECT
        o1.item AS item1,
        o2.item AS item2,
        COUNT(*) AS order_count,
        (SELECT value FROM total_orders) AS total_order_count
    FROM 
        orders AS o1
    INNER JOIN
        orders AS o2 ON 
            o1.order_id = o2.order_id AND 
            o1.item < o2.item
    GROUP BY 
        item1, item2
),
associated_support AS 
(
    SELECT
        item1,
        item2,
        order_count,
        total_order_count,
        ROUND((order_count::numeric / total_order_count), 5) AS support
    FROM 
        associated_orders
),
associated_confidence AS 
(
    SELECT
        tas.item1,
        tas.item2,
        tas.support AS associated_support,
        tis1.support AS item1_support,
        ROUND(tas.support / tis1.support, 5) as confidence
    FROM
        associated_support AS tas
    INNER JOIN
        item_support AS tis1
        ON tis1.item = tas.item1
),
associated_lift AS 
(
    SELECT
        tas.item1,
        tas.item2,
        tas.support AS associated_support,
        tis1.support AS item1_support,
        tis2.support AS item2_support,
        ROUND(tas.support / (tis1.support * tis2.support), 5) as lift
    FROM
        associated_support AS tas
    INNER JOIN
        item_support AS tis1
        ON tis1.item = tas.item1
    INNER JOIN
        item_support AS tis2
        ON tis2.item = tas.item2
),
summary AS 
(
    SELECT
        tao.item1,
        tao.item2,
        tao.order_count,
        tas.support AS support,
        tas.support * 100 AS support_pct,
        tac.confidence AS confidence,
        tal.lift AS lift
    FROM
        associated_orders AS tao
    INNER JOIN
        associated_support AS tas
        ON tas.item1 = tao.item1 AND tas.item2 = tao.item2
    INNER JOIN
        associated_confidence AS tac 
        ON tac.item1 = tao.item1 AND tac.item2 = tao.item2
    INNER JOIN
        associated_lift AS tal
        ON tal.item1 = tao.item1 AND tal.item2 = tao.item2
)
SELECT 
    concat(item1, ' + ', item2) AS items, *  
FROM 
    summary
WHERE 
    support > 0.05 AND 
    confidence > 0.2 AND 
    lift > 1
ORDER BY 
    order_count DESC 
LIMIT 100
Zettaflow blog. Figure 17. Selected association rules.
Figure 17. Selected association rules.
Zettaflow blog. Figure 18. Association rules visualization.
Figure 18. Association rules visualization.

Good and bad findings

Association analysis generates association rules automatically. It is a type of undirected data mining that finds patterns in the data where the target is not specified beforehand. Therefore it is our task to decide whether the patterns make sense.

Good findings are actionable. They help make decisions, such as creating promotional discounts on certain product bundles to improve cross-selling.

Beware of trivial and inexplicable rules.

Trivial rules reproduce common knowledge about the business.

  • The customer who has bought an extended warranty is likely to buy an expensive household appliance.

Inexplicable rules may not have explanations and recommendations for action. They just don’t make sense. 

  • The bookstore’s most popular product among American Express credit card shoppers is Barack Obama’s Promised Land.

Conclusion

In this article, we have briefly discussed market basket analysis and association rules. 

We went through the process of analyzing data at three levels: an item, an order, and a customer who, if identified, connects all the orders together. 

Also, we have described how to select the right level of detail in product hierarchies, and how many items to include in the association rule. 

We ran an association analysis with SQL and described how to evaluate the rules using such measures as support, confidence, and lift. 

And finally, we built a visual dashboard that represents our findings. 

Hope this was helpful!

Leave a Reply