Mining Erasable Itemsets from a Product Database with the MEI algorithm (SPMF documentation)

This example explains how to run the MEI algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "MEI" algorithm, (2) select the input file "contextVME.txt", (3) set the output file name (e.g. "output.txt"), (4) set the threshold to 15%, and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run MEI contextVME.txt output.txt 15% in a folder containing spmf.jar and the example input file contextVME.txt.
If you are using the source code version of SPMF, launch the file "MainTestMEI.java" in the package ca.pfv.spmf.algorithms.frequentpatterns.mei.

What is the MEI algorithm?

MEI (Mining Erasable Itemsets) was proposed by Le & Vo (2014) as an efficient algorithm for mining erasable itemsets from a product database with profit information. Like VME, MEI uses a vertical tid-list representation and an Apriori-style level-by-level candidate generation. Its key improvement over VME is that it prunes unpromising candidates earlier and more aggressively, leading to reduced memory usage and faster execution on large databases.

What is the input?

MEI takes as input a product database and a threshold (a value between 0 and 100%). A product is defined as a set of items that are used to assemble the product. Moreover, each product is annotated with a profit (a positive integer) that indicates how much money this product generates for the company. For example, let's consider the following product database, consisting of 6 products and 7 items. Each product is annotated with profit information. For example, the first line indicates that product 1 generates a total profit of 50 $ for the company and that its assembly requires parts 2, 3, 4 and 6. This product database is provided in the file "contextVME.txt" of the SPMF distribution:

	profit	items
product 1	50$	{2, 3, 4, 6}
product 2	20$	{2, 5, 7}
product 3	50$	{1, 2, 3, 5}
product 4	800$	{1, 2, 4}
product 5	30$	{6, 7}
product 6	50$	{3, 4}

What is the output?

The output is the set of erasable itemsets whose loss of profit is lower than or equal to the user-specified threshold. The idea is to discover sets of items that the company could stop purchasing while keeping the resulting profit loss under control.

The loss of profit generated by an itemset is defined as the sum of the product profits for all products containing at least one item from this itemset. For example, the loss of profit of itemset {5, 6} is: 50 $ + 20 $ + 50 $ + 30 $ = 150 $. In this database the total profit is 50 + 20 + 50 + 800 + 30 + 50 = 1000 $, so the loss caused by {5, 6} can be expressed as 15%.

By running MEI with a threshold of 15%, we obtain the same 8 erasable itemsets as VME (having a profit loss less than or equal to 15% × 1000 $ = 150 $):

erasable itemsets	loss of profit
{3}	150.0
{5}	70.0
{6}	80.0
{7}	50.0
{5, 6}	150.0
{5, 7}	100.0
{6, 7}	100.0
{5, 6, 7}	150.0

This means that if the items from one of those erasable itemsets are no longer purchased, the loss of profit will be lower than or equal to 15%.

Input file format

The input file format of MEI is identical to that of VME. It is a text file where each line represents a transaction (product). Each line is composed of two sections:

First, the profit of the transaction is indicated by an integer number, followed by a single space.
Second, the items in the transaction are listed. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction.

For example, the input file for the previous example is defined as follows:

50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4

Consider the first line. It means that the transaction {2, 3, 4, 6} has a profit of 50 and contains the items 2, 3, 4 and 6. The following lines follow the same format.

Output file format

The output file format of MEI is identical to that of VME. It is a text file where each line represents an erasable itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer followed by a single space. After all the items, the keyword "#LOSS:" appears, followed by a numeric value indicating the loss of profit for that itemset.

3 #LOSS: 150.0
5 #LOSS: 70.0
6 #LOSS: 80.0
7 #LOSS: 50.0
5 6 #LOSS: 150.0
5 7 #LOSS: 100.0
6 7 #LOSS: 100.0
5 6 7 #LOSS: 150.0

For example, the first line indicates that the itemset {3} would generate a loss of profit of 150.0. The following lines follow the same format.

Performance

MEI improves upon VME by applying tighter pruning during candidate generation. It avoids storing candidate itemsets that cannot possibly be erasable, which reduces both memory usage and runtime. MEI is generally faster than VME, especially on databases with many items and products. For databases where the vertical representation is beneficial, MEI is recommended over VME.

Where can I get more information about the MEI algorithm?

Here is the article describing the MEI algorithm:

T. Le, B. Vo: MEI: an efficient algorithm for mining erasable itemsets. Engineering Applications of Artificial Intelligence, 2014, 27: 155–166.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.