
This article is about market basket analysis, how to perform it using SQL, and how to present the result in the form of a real-time dashboard.
For the purpose of the article, we use a retail dataset of a global superstore: Superstore Sales. This is. We will work with data for 2014.
If you find this article useful, please give it a boost by sharing it on your social media.
Index:
Application of Market Basket Analysis
Market Basket Analysis Dashboard
Application of Market Basket Analysis
Market basket analysis plays a crucial role in the analysis of retail businesses. The contents of the shopping carts of the customers may provide valuable information and help design promotions, ensure better stock management, identify the drivers of the purchase, build cross-sale models, and boost sales.
The most commonly used technique for market basket analysis is association analysis, which generates association rules.
Although market basket analysis and association rules originate from the analysis of retail transactions they may be applied to other business domains. For instance:
- Electronic medical record (EMR) systems may suggest to the physicians the possible need for a back MRI for a patient who has had this procedure in the past; or remind the physician when prescribing antibiotics also to prescribe probiotics.
- Which other parts of a website are explored by the visitors that came from a particular landing page?
- If a passenger entered the Mall of the Emirates metro station and exited the same station two hours later paying the minimum fare, was there system abuse?
Market basket analysis should not be limited to the application of association analysis.
It should also not be limited to the items in the shopping cart.
Since e-commerce with its user registration, as well as loyalty cards implemented by retail stores, allow customer identification and tracking of his or her purchases over time, the customer characteristics must be definitely considered.
- Why does a particular customer always buy oil filters but never buy engine oil?
And you should not miss the store entity.
- Why is this beautiful reproduction of the Sagrada Familia a bestseller in Barcelona’s Eixample, but not a good seller in El Poblenou?
However, the most important entity to watch is the order or the transaction. It includes the following characteristics in our dataset: order_id, customer_id, product_id, order_date, ship_mode, shipping_cost, order_priority, ship_date, sales, discount, and profit.

Market Basket Analysis Dashboard
Market basket analysis dashboard may be used to make decisions about product promotions, visual merchandising, store segmentation, customer targeting, and more.
To build the market basket analysis dashboard we use the Tepegraph platform. We connect it to the dataset and use the platform to fetch the data with SQL queries and visualize the result.
We start with some basic measures that do not require association rules.
Customer Characteristics:
Proportion of customers and average purchase size of the customers who purchase the most popular products (100 best-selling items).
To find out the proportion we use the following query:
WITH
orders AS
(
SELECT
*
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
best_products AS
(
SELECT
product_id,
SUM(quantity) AS quantity
FROM
orders
GROUP BY
product_id
ORDER BY
quantity DESC
LIMIT 100
),
customers AS
(
SELECT
customer_id
FROM
orders
GROUP BY
customer_id
),
selected_customers AS
(
SELECT
customer_id
FROM
orders
WHERE
product_id IN (SELECT product_id FROM best_products)
GROUP BY
customer_id
),
result AS
(
SELECT
COUNT(customer_id) AS yes,
ROUND((SELECT COUNT(customer_id) FROM customers),0) AS total,
ROUND(((SELECT COUNT(customer_id) FROM customers) - COUNT(customer_id)),0) AS no
FROM selected_customers
),
pie AS
(
SELECT
yes AS values,
'Yes' as labels
FROM result
UNION
SELECT
no AS values,
'No' as labels
FROM result
)
SELECT * FROM pieFor the average purchase size:
WITH
orders AS
(
SELECT *
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
best_products AS (
SELECT
product_id,
SUM(quantity) AS quantity
FROM
orders
GROUP BY
product_id
ORDER BY
quantity DESC
LIMIT 100
),
customers AS
(
SELECT
customer_id,
SUM(sales) AS sales
FROM
orders
GROUP BY
customer_id
),
selected_customers AS
(
SELECT
customer_id,
SUM(sales) AS sales
FROM
orders
WHERE
product_id IN (SELECT product_id FROM best_products)
GROUP BY
customer_id
),
result AS
(
SELECT
ROUND(AVG(sales),2) AS avg_order_size
FROM
selected_customers
)
SELECT * FROM resultProportion of customers and average purchase size of customers who purchase the least popular products (100 worst selling items).
To find out the proportion we use this query:
WITH
orders AS
(
SELECT *
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
worst_products AS
(
SELECT
product_id,
SUM(quantity) AS quantity
FROM
orders
GROUP BY
product_id
ORDER BY
quantity
LIMIT 100
),
customers AS
(
SELECT
customer_id
FROM
orders
GROUP BY
customer_id
),
selected_customers AS
(
SELECT
customer_id
FROM
orders
WHERE
product_id IN (SELECT product_id FROM worst_products)
GROUP BY
customer_id
),
result AS
(
SELECT
COUNT(customer_id) AS yes,
ROUND((SELECT COUNT(customer_id) FROM customers),0) AS total,
ROUND(((SELECT COUNT(customer_id) FROM customers) - COUNT(customer_id)),0) AS no
FROM selected_customers
),
pie AS
(
SELECT
yes AS values,
'Yes' as labels
FROM
result
UNION
SELECT
no AS values,
'No' as labels
FROM
result
)
SELECT * FROM pieFor the average purchase size:
WITH
orders AS
(
SELECT *
FROM superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
worst_products AS
(
SELECT
product_id,
SUM(quantity) AS quantity
FROM
orders
GROUP BY
product_id
ORDER BY
quantity
LIMIT 100
),
customers AS
(
SELECT
customer_id,
SUM(sales) AS sales
FROM
orders
GROUP BY
customer_id
),
selected_customers AS
(
SELECT
customer_id,
SUM(sales) AS sales
FROM
orders
WHERE
product_id IN (SELECT product_id FROM worst_products)
GROUP BY
customer_id
),
result AS
(
SELECT
ROUND(AVG(sales),2) AS avg_order_size
FROM
selected_customers
)
SELECT * FROM resultTo visualize the proportions I used the Pie Chart, and for the average order size – just a Scaled-up Number. As a result, I got such a visual block:

- Would replacing some of those worst-selling items with better-selling ones threaten the customer relationship?
The breadth of the customer relationship (the number of unique items ever purchased) by the depth of the relationship (the number of orders) for customers who purchased more than one item:
WITH t AS
(
SELECT
customer_id,
COUNT(DISTINCT product_id) AS unique_items,
COUNT(DISTINCT order_id) AS unique_orders
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
GROUP BY
customer_id
ORDER BY
customer_id
)
SELECT
customer_id,
unique_items,
unique_orders
FROM t
WHERE
unique_items > 1

The result of the query is visualized using a scatter. The concentration of the data points represents the number of customers who have that particular breadth and depth in their relationship. The largest concentration is on the lower left, and the smallest is on the higher up to the right.
Quite a large number of customers in our case are buying little and rear, although better customers tend to purchase a greater diversity of goods and more often.
- Who are those customers on the higher right?
- What products do they buy?
- How can we get more customers of this kind?
Orders per customer distribution 1Y:
SELECT
customer_name,
COUNT(order_id) AS TotalOrders
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
GROUP BY
customer_nameThe average number of orders per customer per month for the last month:
SELECT
COUNT(DISTINCT order_id) * 1.0 /
NULLIF(COUNT(DISTINCT customer_id), 0) AS orders_per_customer
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
AND EXTRACT(MONTH FROM order_date) = 12
Change in the average number of orders per customer per month over the last year:
WITH t AS
(
SELECT
(
date_part('year', order_date) * 12 +
date_part('month', order_date)
)
AS year_month,
order_id,
customer_id
FROM
superstore_orders
)
SELECT
year_month, COUNT(DISTINCT order_id) * 1.0 /
NULLIF(COUNT(DISTINCT customer_id), 0) AS orders_per_customer
FROM
t
GROUP BY
year_month
ORDER BY
year_month DESC
LIMIT 12
Order Characteristics:
Unique items per order distribution 1Y:
SELECT
order_id,
COUNT(DISTINCT product_id) AS uniqueitems
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
GROUP BY
order_idThe average number of unique items per order per month for the last month:
WITH t AS
(
SELECT
order_id,
COUNT(DISTINCT product_id) as product_count,
date_part('year', order_date) * 12 +
date_part('month', order_date) as year_month
FROM
superstore_orders
GROUP BY
order_id, year_month
)
SELECT
year_month,
COUNT(order_id) as order_count,
SUM(product_count) AS product_count,
SUM(product_count) / COUNT(order_id) as products_per_order
FROM
t
WHERE
year_month = (SELECT MAX(year_month) FROM t)
GROUP BY
year_monthChange in the average number of unique items per order over the last year:
WITH t AS (
SELECT
order_id,
COUNT(DISTINCT product_id) as product_count,
date_part('year', order_date) * 12 +
date_part('month', order_date) as year_month
FROM
superstore_orders
GROUP BY
order_id, year_month
)
SELECT
year_month,
COUNT(order_id) as order_count,
SUM(product_count) AS product_count,
SUM(product_count) / COUNT(order_id) as products_per_order
FROM
t
GROUP BY
year_month
order by
year_month DESC
LIMIT 12Average order size for the last month:
SELECT
AVG(sales)
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
AND EXTRACT(MONTH FROM order_date) = 12Changes in average order size over the last year:
WITH t AS
(
SELECT
(
date_part('year', order_date) * 12 +
date_part('month', order_date)
)
AS year_month,
sales
FROM
superstore_orders
)
SELECT
year_month, AVG(sales) as avg_sales
FROM
t
GROUP BY
year_month
ORDER BY
year_month DESC
LIMIT 12Average order size by market 1Y (you may opt for any other important characteristics):
SELECT
market,
AVG(sales)
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
GROUP BY
market
The above visualizations give some insight into the business. We may see a great number of customers buy only one product per order.
- How to improve cross-selling in the sales process?
Item Popularity:
It can be helpful to find out the most common item found in a one-item order. (As we don’t have such orders in the dataset, we go for two-item orders).
WITH orders AS
(
SELECT *
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
selected_orders AS
(
SELECT
order_id,
SUM(quantity) AS total_quantity
FROM
orders
GROUP BY
order_id
HAVING
SUM(quantity) = 2
),
selected_products AS
(
SELECT
product_name,
SUM(quantity) AS quantity
FROM
orders
WHERE order_id IN (SELECT order_id FROM selected_orders)
GROUP BY
product_name
)
SELECT * FROM selected_products ORDER BY quantity DESC
LIMIT 3What is the most common item found in a multi-item order?
WITH orders AS
(
SELECT *
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
selected_orders AS
(
SELECT
order_id,
SUM(quantity) AS total_quantity
FROM
orders
GROUP BY
order_id
HAVING
SUM(quantity) > 2
),
selected_products AS
(
SELECT
product_name,
SUM(quantity) AS quantity
FROM
orders
WHERE
order_id IN (SELECT order_id FROM selected_orders)
GROUP BY
product_name
)
SELECT * FROM selected_products ORDER BY quantity DESC
LIMIT 3
And the most common item found among customers who repeated purchasers:
WITH orders AS
(
SELECT *
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
),
selected_customers AS
(
SELECT
customer_id,
COUNT(order_id) AS total
FROM
orders
GROUP BY
customer_id
HAVING
COUNT(order_id) > 10
),
selected_products AS
(
SELECT
product_name,
SUM(quantity) AS quantity
FROM
orders
WHERE
customer_id IN (SELECT customer_id FROM selected_customers)
GROUP BY
product_name
)
SELECT * FROM selected_products ORDER BY quantity DESC
LIMIT 3
These visualizations may promote ideas for growing customer relationships.
Store:
For the offline stores, it might be good to check how the popularity changes across the different stores. Different stores may have different sales models due to local management decisions, local advertising, demographics, regional trends, and even the region’s climate.
Time
Time is one of the most important attributes of market basket data. It may point to the exact marketing conditions at the time of the sale (promotion, special campaign, PR event, etc.)
Numerous questions about customer behavior can be answered by looking at sales over time.
- What is the best-selling item in this quarter? What about the previous quarter?
- How did a particular event affect the sales?
Association Analysis
When building association rules it is important to choose the proper item set. Figure 8 demonstrates partial product hierarchies in the supermarket. You may see how the level of detail increases from top to bottom.

For the stakeholders of a chain of ice cream parlors, it might be interesting not only the combination of the ice cream flavors, but also toppings as well as types: such as cup, stick, cone, etc., and even whether the ice cream eaten in the ice cream parlor or taken away.
But when analyzing a large supermarket, probably the initial level of detail for the Ice Cream will be Frozen Desserts. Because as the number of items used in the analysis increases, the number of combinations to consider will grow very quickly.
After the initial analysis, we can repeat rule generation to hone in on more specific items. And Because the analysis focuses on more specific items, we use only the subset of the transactions that contain those items.
Market basket analysis works better when the items appear in approximately the same number of transactions. This prevents rules from being dominated by the most common items. That is why we may roll up rare items to higher levels in the hierarchy, so they become more frequent. Such generalized items also help find rules with sufficient support. There are more transactions supporting a given rule at higher levels of the hierarchy.
But because some items are generalized does not mean that all items must be moved to the same level up. For instance, in the example shown in Figure 8, we have chosen to generalize all the types of ice creams as Frozen Desserts for the analysis done for a large supermarket, however, expensive items in the same supermarket, such as home appliances, may remain at a low level.
- What is more important: brand, size, or flavor of tea?
- Will we consider different volumes of shampoo as one product?
The dataset that we are using to build our dashboard contains three levels of product hierarchies:

For the purposes of the dashboard that is designed to work in real-time, we choose sub_category,
which hopefully, will be able to find some useful rules, or at least, promote the ideas that you can confirm with more deep analysis.
So, we select the following transactions:
SELECT
order_id AS order,
sub_category AS items,
quantity
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014

There were almost 30,000 transactions in 2014.
We may group these transactions to see which items are purchased together.
SELECT
order_id AS orders,
string_agg(DISTINCT sub_category, ', ') AS items,
SUM(quantity)
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date)= 2014
GROUP BY
order_id

Even in this fragment of the data, you may see that items “Art” and “Binders” are often sold together.
We may count how many times each item was bought.
WITH
orders AS
(
SELECT
order_id,
sub_category AS item
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS
(
SELECT
COUNT(DISTINCT order_id) AS value
FROM
orders
),
item_orders AS
(
SELECT
item,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders
GROUP BY
item
)
SELECT * FROM item_orders
ORDER BY order_count DESC
LIMIT 100
Let’s associate items that are bought together and then count the number of such transactions.
WITH
orders AS
(
SELECT
order_id,
sub_category AS item
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS
(
SELECT
COUNT(DISTINCT order_id) AS value
FROM
orders
),
associated_orders AS
(
SELECT
o1.item AS item1,
o2.item AS item2,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders AS o1
INNER JOIN
orders AS o2 ON
o1.order_id = o2.order_id AND
o1.item < o2.item
GROUP BY
item1, item2
)
SELECT * FROM associated_orders
ORDER BY order_count DESC
LIMIT 150
As we can see in Figure 13, there are 136 combinations that might be considered. Such a combination of items is called an item set. We do not necessarily consider the quantity of the items in a particular transaction. Item sets are interesting themselves.
- Why are those two items purchased together?
Since we analyze online sales where customers purchase relatively few items at one time, two-item sets might be sufficient. We will not waste our efforts searching for the rules with three, four, or more items. However, in some cases, such as with supermarkets, where the average transaction is much larger, more complex rules will be useful.
“Art” and “Binders” were bought together 1,688 times, and are more likely to be purchased together than any other two items. The combination of “Art” and “Fasteners” appears only 680 times. Does it mean that the rule “If Art, then Binders” is better than “If Art, then Fasteners”? We may learn from Figure 12 that the total sales of the Fasteners are only 1588, while the total sales of the “Binders” are 4048.
So, how good is a particular rule?
To find this out we use the following methods: support, confidence, and lift.
Support
The most common way to measure support is to calculate the proportion of transactions that contain all the items in the rule.
For the rule “If Art, then Binders” the Support = Number of transactions when “Art” and “Binders” are bought together / Total number of transactions.
1688 / 7794= 0.2166 or 21.65%
Let’s check Support for all the combinations:
WITH
orders AS
(
SELECT
order_id,
sub_category AS item
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS
(
SELECT
COUNT(DISTINCT order_id) AS value
FROM
orders
),
associated_orders AS
(
SELECT
o1.item AS item1,
o2.item AS item2,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders AS o1
INNER JOIN
orders AS o2 ON
o1.order_id = o2.order_id AND
o1.item < o2.item
GROUP BY
item1, item2
),
associated_support AS
(
SELECT
item1,
item2,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
associated_orders
)
SELECT * FROM associated_support
ORDER BY support DESC
LIMIT 100
Confidence
Confidence helps to understand how good a rule is at predicting what is on the right-hand side.
It is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items on the left-hand side. Alternatively, we may count the ratio of all the items in the rule support to the support of the item on the left-hand side.
For the rule “If Art, then Binders” the Confidence = Support (Number of transactions when “Art” and “Binders” are bought together) / Support (Number of transactions when “Art” is bought).
0.21658 / 0.41827 = 0.51782 or 51.78%
It means that, when “Art” appears in an order, there is a 51.78% chance that “Binders” also appears in it.
Let’s check Confidence for all the combinations:
WITH
orders AS
(
SELECT
order_id,
sub_category AS item
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS
(
SELECT
COUNT(DISTINCT order_id) AS value
FROM
orders
),
item_orders AS
(
SELECT
item,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders
GROUP BY
item
),
item_support AS
(
SELECT
item,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
item_orders
),
associated_orders AS
(
SELECT
o1.item AS item1,
o2.item AS item2,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders AS o1
INNER JOIN
orders AS o2 ON
o1.order_id = o2.order_id AND
o1.item < o2.item
GROUP BY
item1, item2
),
associated_support AS
(
SELECT
item1,
item2,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
associated_orders
),
associated_confidence AS
(
SELECT
tas.item1,
tas.item2,
tas.support AS associated_support,
tis1.support AS item1_support,
ROUND(tas.support / tis1.support, 5) as confidence
FROM
associated_support AS tas
INNER JOIN
item_support AS tis1
ON tis1.item = tas.item1
)
SELECT * FROM associated_confidence
ORDER BY associated_support DESC
LIMIT 100
Lift
Lift measures the power of the rule by comparing the full rule to randomly guessing the right-hand side.
If the item on the right-hand side already is very common, the rule is not telling us anything.
For the rule “If Art, then Binders” the Lift = Support (Number of transactions when “Art” and “Binders” are bought together) / (Support (Number of transactions when “Art” is bought) * Support (Number of transactions when “Binders” are bought)).
0.21658 / (0.41827 * 0.51937) = 0.99698
Lift greater than 1 means that the product on the left-hand side increases the chance of buying the product on the right-hand side.
When lift is less than 1, the rule is doing worse than informed guessing. So is true for the rule “If Art, then Binders”.
Let’s find out Lift for all the rules:
WITH
orders AS
(
SELECT
order_id,
sub_category AS item
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS
(
SELECT
COUNT(DISTINCT order_id) AS value
FROM
orders
),
item_orders AS
(
SELECT
item,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders
GROUP BY
item
),
item_support AS
(
SELECT
item,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
item_orders
),
associated_orders AS
(
SELECT
o1.item AS item1,
o2.item AS item2,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders AS o1
INNER JOIN
orders AS o2 ON
o1.order_id = o2.order_id AND
o1.item < o2.item
GROUP BY
item1, item2
),
associated_support AS
(
SELECT
item1,
item2,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
associated_orders
),
associated_lift AS
(
SELECT
tas.item1,
tas.item2,
tas.support AS associated_support,
tis1.support AS item1_support,
tis2.support AS item2_support,
ROUND(tas.support / (tis1.support * tis2.support), 5) as lift
FROM
associated_support AS tas
INNER JOIN
item_support AS tis1
ON tis1.item = tas.item1
INNER JOIN
item_support AS tis2
ON tis2.item = tas.item2
)
SELECT * FROM associated_lift
ORDER BY associated_support DESC
LIMIT 100
For the dashboard I opted to display only the rules with support greater than 5%, confidence greater than 20% and lift greater than 1.
WITH
orders AS
(
SELECT
order_id,
sub_category AS item
FROM
superstore_orders
WHERE
EXTRACT(YEAR FROM order_date) = 2014
),
total_orders AS
(
SELECT
COUNT(DISTINCT order_id) AS value
FROM
orders
),
item_orders AS
(
SELECT
item,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders
GROUP BY
item
),
item_support AS
(
SELECT
item,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
item_orders
),
associated_orders AS
(
SELECT
o1.item AS item1,
o2.item AS item2,
COUNT(*) AS order_count,
(SELECT value FROM total_orders) AS total_order_count
FROM
orders AS o1
INNER JOIN
orders AS o2 ON
o1.order_id = o2.order_id AND
o1.item < o2.item
GROUP BY
item1, item2
),
associated_support AS
(
SELECT
item1,
item2,
order_count,
total_order_count,
ROUND((order_count::numeric / total_order_count), 5) AS support
FROM
associated_orders
),
associated_confidence AS
(
SELECT
tas.item1,
tas.item2,
tas.support AS associated_support,
tis1.support AS item1_support,
ROUND(tas.support / tis1.support, 5) as confidence
FROM
associated_support AS tas
INNER JOIN
item_support AS tis1
ON tis1.item = tas.item1
),
associated_lift AS
(
SELECT
tas.item1,
tas.item2,
tas.support AS associated_support,
tis1.support AS item1_support,
tis2.support AS item2_support,
ROUND(tas.support / (tis1.support * tis2.support), 5) as lift
FROM
associated_support AS tas
INNER JOIN
item_support AS tis1
ON tis1.item = tas.item1
INNER JOIN
item_support AS tis2
ON tis2.item = tas.item2
),
summary AS
(
SELECT
tao.item1,
tao.item2,
tao.order_count,
tas.support AS support,
tas.support * 100 AS support_pct,
tac.confidence AS confidence,
tal.lift AS lift
FROM
associated_orders AS tao
INNER JOIN
associated_support AS tas
ON tas.item1 = tao.item1 AND tas.item2 = tao.item2
INNER JOIN
associated_confidence AS tac
ON tac.item1 = tao.item1 AND tac.item2 = tao.item2
INNER JOIN
associated_lift AS tal
ON tal.item1 = tao.item1 AND tal.item2 = tao.item2
)
SELECT
concat(item1, ' + ', item2) AS items, *
FROM
summary
WHERE
support > 0.05 AND
confidence > 0.2 AND
lift > 1
ORDER BY
order_count DESC
LIMIT 100

Good and bad findings
Association analysis generates association rules automatically. It is a type of undirected data mining that finds patterns in the data where the target is not specified beforehand. Therefore it is our task to decide whether the patterns make sense.
Good findings are actionable. They help make decisions, such as creating promotional discounts on certain product bundles to improve cross-selling.
Beware of trivial and inexplicable rules.
Trivial rules reproduce common knowledge about the business.
- The customer who has bought an extended warranty is likely to buy an expensive household appliance.
Inexplicable rules may not have explanations and recommendations for action. They just don’t make sense.
- The bookstore’s most popular product among American Express credit card shoppers is Barack Obama’s Promised Land.
Conclusion
In this article, we have briefly discussed market basket analysis and association rules.
We went through the process of analyzing data at three levels: an item, an order, and a customer who, if identified, connects all the orders together.
Also, we have described how to select the right level of detail in product hierarchies, and how many items to include in the association rule.
We ran an association analysis with SQL and described how to evaluate the rules using such measures as support, confidence, and lift.
And finally, we built a visual dashboard that represents our findings.
Hope this was helpful!