In this post, I will give an alternative to popular techniques in market basket analysis that can help practitioners find high-value patterns instead of just the most frequent ones. We will gain some intuition about different pattern mining problems and see a real-world example. The full code can be found here. All images are created by the author.
I have already written an introductory article about pattern mining; if you're not familiar with some of the concepts used here, feel free to check that one out first.
In short, pattern mining tries to find patterns in data (duuh). Most of the time, this data comes in the form of (multi)sets or sequences. In my last article, for example, I looked at the sequence of actions a user performs on a website. In that case, we did care about the order of the items.
In other cases, such as the one we will discuss below, we do not worry about the order of the items. We just list all the items that were in the transaction and the frequency with which they appeared.
For example, transaction 1 contained one item three times and another item once. As we can see, we lose information about the order of the items, but in many scenarios (like the one we will analyze below), there is no logical order to begin with. This is similar to a bag of words in NLP.
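As a small illustration (with made-up item names), this is how we can go from an ordered list of purchases to a simple bag of item counts:

from collections import Counter

# a hypothetical transaction; the item names are made up
transaction_1 = ["bread", "bread", "milk", "bread"]

# multiset / bag-of-items view: we keep the counts but drop the order
print(Counter(transaction_1))  # Counter({'bread': 3, 'milk': 1})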
Market basket analysis (MBA) is a data analysis technique commonly used in retail and marketing to uncover relationships between products that customers tend to buy together. It aims to identify patterns in customers' shopping baskets or transactions by analyzing their purchasing behavior. The core idea is to understand the co-occurrence of items in purchase transactions, which helps companies optimize their product placement, cross-selling, and targeted marketing campaign strategies.
Frequent itemset mining (FIM) is the process of finding frequent patterns in transaction databases. We can look at the frequency of a pattern (i.e., a set of items) by calculating its support. In other words, the support of a pattern X is the number of transactions T in the database D that contain X. That is, we are simply looking at how often the pattern X appears in the database.
In FIM, we then want to find all itemsets whose support exceeds some threshold (often called minsup, for minimum support). If the support of an itemset is greater than minsup, it is considered frequent.
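To make this concrete, here is a minimal sketch of computing support on a toy transaction database (the items and the minsup value are made up for illustration):

# toy transaction database: each transaction is a set of items
database = [
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a", "b", "d"},
]

def support(pattern: set, database: list) -> int:
    """Number of transactions that contain every item of the pattern."""
    return sum(pattern <= transaction for transaction in database)

minsup = 3
pattern = {"a", "b"}
print(support(pattern, database))            # 3
print(support(pattern, database) >= minsup)  # True -> {'a', 'b'} is frequent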
Limitations
In FIM, we only look at whether an item exists in a transaction. That is, it doesn't matter if an item appears twice or 200 times; we just represent it once. But we often have cases (like MBA) where it's not just the existence of an item in a transaction that's relevant, but also how many times it appeared in the transaction.
Another problem is that frequency does not always imply relevance. In that sense, FIM assumes that all items in a transaction are equally important. However, it is reasonable to assume that someone who buys caviar may be more important to a company than someone who buys bread, since caviar is potentially a high-ROI, high-profit item.
These limitations lead us directly to High Utility Itemset Mining (HUIM) and High Utility Quantitative Itemset Mining (HUQIM), which are generalizations of FIM that attempt to address some of these problems.
Our first generalization is that items can appear more than once in a transaction (i.e. we have a multiset instead of a simple set). As said before, in normal itemset mining, we transform the transaction into a set and just look at whether the item exists in the transaction or not. So for example, the two transactions below would have the same representation.
t1 = (a,a,a,a,a,b) # repr. as {a,b} in FIM
t2 = (a,b) # repr. as {a,b} in FIM
Above, both transactions would be represented as {a,b} in normal FIM. We quickly see that in some cases we might miss important details. For example, if a and b were items in a customer's shopping cart, it could matter a lot whether a (e.g. a loaf of bread) was bought five times or just once. Therefore, we represent the transaction as a multiset in which we note how many times each item appeared.
# multiset representation
t1_ms = {(a,5),(b,1)}
t2_ms = {(a,1),(b,1)}
This is also more efficient if an item can appear a very large number of times in a transaction (e.g. 100 or 1,000 times). In that case, we don't need to write out every a or b, just the frequency with which each appears.
The generalization that both the quantitative and non-quantitative methods make is to assign each item in a transaction a utility (e.g. profit or time). To do this, we have a table that assigns each possible item a unit profit.
We can then calculate the utility of a specific pattern (say, a pair of items with unit profits of $1 and $2) by summing the utilities of those items across the transactions that contain the full pattern. In our example, we would have:
(3 * $1 + 1 * $2) +
(1 * $1 + 2 * $2) = $10
Hence, this pattern has a utility of $10. With FIM, the task was to find frequent patterns. Now, the task is to find patterns with high utility, precisely because we assume that frequency does not imply importance. With regular FIM, we might miss rare (infrequent) patterns that provide high utility (e.g. a diamond), whereas HUIM is designed to capture them.
We also need to define the notion of transaction utility. This is simply the sum of the utilities of all the items in a transaction. For transaction 3 in our database, this would be:
1 * $1 + 2*$10 + 2*$2 = $25
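To make these two calculations concrete, here is a small sketch that reproduces the arithmetic above (the item names are made up; the unit profits and quantities follow the example):

# unit profit per item (made-up item names)
unit_profit = {"a": 1, "b": 2, "c": 10}

# each transaction maps an item to the quantity bought
t1 = {"a": 3, "b": 1}
t3 = {"a": 1, "c": 2, "b": 2}

def pattern_utility(pattern, transactions, unit_profit):
    """Sum the pattern items' utilities over all transactions containing the full pattern."""
    total = 0
    for t in transactions:
        if all(item in t for item in pattern):
            total += sum(t[item] * unit_profit[item] for item in pattern)
    return total

def transaction_utility(t, unit_profit):
    """Transaction utility: sum of the utilities of all items in one transaction."""
    return sum(qty * unit_profit[item] for item, qty in t.items())

print(pattern_utility({"a", "b"}, [t1, t3], unit_profit))  # 10
print(transaction_utility(t3, unit_profit))                # 25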
Note that solving this problem and finding all high-utility itemsets is harder than regular FIM. This is because utility does not follow the Apriori property.
The Apriori property
Let X and Y be two patterns occurring in a transaction database D. The Apriori property says that if X is a subset of Y, then the support of X must be at least as large as the support of Y.
This means that if a subset of Y is infrequent, then Y itself must be infrequent, since its support can only be smaller or equal. Say we have X = {a} and Y = {a,b}. If Y appears 4 times in our database, then X must appear at least 4 times, since X is a subset of Y. This makes sense because adding an item makes the pattern less general/more specific, so it can only match fewer (or the same number of) transactions. This property is exploited by most algorithms, since it implies that if {a} is infrequent, all of its supersets are also infrequent and can be removed from the search space (3).
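Here is a minimal sketch of this pruning idea, reusing the toy database and support function from the sketch above:

database = [
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a", "b", "d"},
]

def support(pattern, database):
    return sum(pattern <= t for t in database)

print(support({"a"}, database))       # 3
print(support({"a", "b"}, database))  # 3 -> never larger than support({'a'})
print(support({"c"}, database))       # 2 -> infrequent for minsup = 3 ...
print(support({"c", "d"}, database))  # 0 -> ... so all supersets of {'c'} can be pruned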
This property does not hold when we talk about utility. A superset Y of a pattern X might have more or less utility. If we take the example above, one of the single items has a utility of only $4 on its own. But that does not mean we can stop analyzing supersets of this pattern: the two-item superset we analyzed has a higher utility of $10. At the same time, a superset of a pattern will not always have more utility, since that superset might simply not appear very often in the database.
The idea behind HUIM
Since we cannot use the Apriori property for HUIM directly, we have to find another upper bound to narrow the search space. One such bound is called Transaction-Weighted Utilization (TWU). To calculate it, we sum the transaction utilities of all transactions that contain the pattern X of interest. Any superset Y of X cannot have a higher utility than the TWU of X. Let's clarify this with an example. The TWU of our two-item pattern is $30 ($5 from transaction 1 and $25 from transaction 3). When we look at a superset pattern Y, there is no way for it to have more utility, since every transaction that contains Y also contains X, and the utility of Y in a transaction can never exceed that transaction's total utility.
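A small sketch of the TWU calculation, repeating the example definitions from above so it runs on its own:

unit_profit = {"a": 1, "b": 2, "c": 10}
t1 = {"a": 3, "b": 1}
t3 = {"a": 1, "c": 2, "b": 2}

def transaction_utility(t, unit_profit):
    return sum(qty * unit_profit[item] for item, qty in t.items())

def twu(pattern, transactions, unit_profit):
    """TWU of a pattern: sum of the full transaction utilities of all transactions containing it."""
    return sum(
        transaction_utility(t, unit_profit)
        for t in transactions
        if all(item in t for item in pattern)
    )

print(twu({"a", "b"}, [t1, t3], unit_profit))  # 30 = 5 + 25, an upper bound for any superset of {a, b}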
There are currently several algorithms for solving HUIM. All of them take a minimum utility and produce the patterns that have at least that utility as output. In this case, I have used the EFIM algorithm because it is fast and efficient in its use of memory.
For this article, I will be working with the Market Basket Analysis Kaggle dataset (used with permission from the author of the original dataset).
Above, we can see the distribution of transaction values found in the data. There are a total of around 19,500 transactions, with an average transaction value of $526 and an average of 26 distinct items per transaction. In total, there are around 4,000 unique items. We can also perform an ABC analysis, where we put items into different categories based on their share of total revenue. Around 500 of the 4,000 items account for roughly 70% of revenue (A items). Then there is a long right tail of around 2,250 items that account for only about 5% of revenue (C items).
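A sketch of how such an ABC analysis can be done with pandas, assuming the raw dataframe (called data in the preprocessing code below) with its Itemname and Value columns; the 70%/95% cutoffs are illustrative:

import pandas as pd

# revenue per item, sorted from highest to lowest
item_revenue = data.groupby('Itemname')['Value'].sum().sort_values(ascending=False)

# cumulative share of total revenue
cum_share = item_revenue.cumsum() / item_revenue.sum()

# A items: first ~70% of revenue, B items: next ~25%, C items: the rest (illustrative cutoffs)
abc = pd.cut(cum_share, bins=[0, 0.70, 0.95, 1.0], labels=['A', 'B', 'C'])
print(abc.value_counts())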
Preprocessing
The initial data is in a long format, where each row is an item within an invoice. From the invoice number, we can see which transaction the item belongs to.
After preprocessing, we obtain the data in the format required by PAMI, the Python library that we are going to use to apply the EFIM algorithm.
import pandas as pd

data['item_id'] = pd.factorize(data.Itemname)[0].astype(str)  # map item names to ids
data['Value_Int'] = data['Value'].astype(int).astype(str)
data = data.loc[data.Value_Int != '0']  # exclude items w/o utility

transaction_db = data.groupby('BillNo').agg(
    items=('item_id', lambda x: ' '.join(list(x))),
    num_items=('item_id', 'count'),  # number of items per transaction
    total_value=('Value', lambda x: int(x.sum())),
    values=('Value_Int', lambda x: ' '.join(list(x))),
)

# filter out long transactions, only use a subset of transactions
transaction_db = transaction_db.loc[transaction_db.num_items < 10].iloc[:1000]
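The EFIM implementation in PAMI reads the transaction database from a file. Here is a minimal sketch of how one might write transaction_db to that file, assuming the "items : transaction utility : item utilities" layout used by PAMI's utility-mining algorithms (check the PAMI documentation for the exact format expected by your version):

# write one line per transaction: items, total utility, and per-item utilities
with open('tdb.csv', 'w') as f:
    for _, row in transaction_db.iterrows():
        f.write(f"{row['items']}:{row['total_value']}:{row['values']}\n")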
Then we can apply the EFIM algorithm.
import PAMI.highUtilityPattern.basic.EFIM as efim

obj = efim.EFIM('tdb.csv', minUtil=1000, sep=' ')
obj.startMine() #start the mining process
obj.save('out.txt') #store the patterns in file
results = obj.getPatternsAsDataFrame() #Get the patterns discovered into a dataframe
obj.printResults()
The algorithm then returns a list of patterns that meet this minimum utility criterion.