Mining Low-cost High Utility Itemsets using the LCIM algorithm (SPMF documentation)
This example explains how to run the LCIM algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "LCIM" algorithm, (2) select the input file "DB_cost.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 10, the maximum cost to 10 and minsup to 0.30 (which means 30%) and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run LCIM DB_cost.txt output.txt 10 10 0.30
in a folder containing spmf.jar and the example input file DB_cost.txt. - If you are using the source code version of SPMF, launch the file "MainTestLCIM.java" in the package ca.pfv.SPMF.tests.
What is LCIM?
LCIM (Nawaz et al., 2022) is an algorithm for discovering low-cost high-utility itemsets in a transaction database containing utility and cost information.
So what does it mean? To explain this in a simple way, let's start by explaining what is the utility and cost. The utility means some desirable property of the data, while the cost means some undesirable property. Let's take an example. In the context of e-learning data, the utility may be the grades obtained by students after doing some learning activities. Obtaining higher grades is something desirable. And the cost may be the time spend for studying or the money spent on some learning activities. Thus, in that context, we may want to find ways to obtain high utility (grades) while reducing the cost (the time or money spent to be more efficient at studying). There are many times of data that can be expressed in terms of utility and cost. For example, another example is hospital data. In that context, the utility may be a measure of the benefits from receiving some medical treatments while the cost may be the money spent on those treatments. The LCIM is thus an algorithm to analyze data with utility and cost values.
The type of data that LCIM take as input is called a transaction database. It means a set of records containing items (symbols), where items and cost values are also included. The output of LCIM is some sets of values that often appear together (what is called an itemset) and have in general a low cost and a high utility. Finding these patterns can then help to understand the data. An example of input and output data is explained in more details below.
What is the input?
LCIM takes as input a transaction database with utility and cost information three parameters called the minimum utility threshold min_utility (a positive integer), maxcost (an integer), and minsup (a double value) . Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_cost.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Items | Transaction utility | Item cost values for this transaction | |
t1 | 3 5 1 2 4 6 | 40 | 1 3 5 10 6 5 |
t2 | 3 5 2 4 | 20 | 3 3 8 6 |
t3 | 3 1 4 | 8 | 1 5 2 |
t4 | 3 5 1 7 | 37 | 6 6 10 5 |
t5 | 3 5 2 7 | 21 | 2 3 4 2 |
Each line of the database is:
- a set of items (the second column of the table) representing some nominal values,
- a utility value (an integer) that is assigned to this transaction (the third column of the table),
- a cost value (integer) for each item of this transaction (the fourth column of the table).
Let's explain what this means with a real example. Let's say that this data is from e-learning. Each transaction (record) is a set of items that may represent the courses taken by students. Here the items are called 1,2,3,4,5,6,7, which means different courses. For example, the first transaction t1 indicates that a student took the courses 3, 5, 1, 2, 4 and 6 (without any specific order). Then, the utility of the transaction t1 is 40, which represents the grade that the student obtained after taking an exam after taking the courses. For example, the student described by transaction t1 received a utility (grade) of 40 at the final exam. Then, the last column indicates the cost of each item, which may represent the time spent on each course. For example, the transaction t1 has the cost values 1, 3, 5, 10, 6 and 5, which means that the student spent 1 time unit on course 3, 3 time units on course 5, 5 time units on course 1, 10 time units on course 2, 6 time units on course 4, and 5 time units on course 6. The database contains five transactions (t1, t2, t3, t4, and t5), which may indicate the information for five students.
It is to be noted that here we use the example of e-learning but other types of data could also be represented using that same format such as medical data, or shopping data.
What is the output?
The output of LCIM is the set of low-cost high utility itemsets. A low cost high utility itemset is a set of items that appear in several transactions and that generally has a high utility and a low cost. For example, in the context of e-learning, it can be a set of courses that lead to high grades and generally does not require a lot of time for studying.
To be more precise, a low-cost high utility itemset is an itemset that must satisfy three criteria:
- Its average utility must be no less than the min_utility threshold set by the user.
- Its average cost must be no greater than the max_cost threshold set by the user.
- Its support (occurrence frequency) must be no less than the minsup threshold set by the user.
To explain this more clearly, it is necessary to review some definitions. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contains the itemset. For example, the support of the itemset {7, 5} is 2 because the items 7 and 5 appears together in two transactions (t4 and t5). The support can be expressed as number like 2 indicating two transactions, or it can be expressed as a percentage. For instance, the support of itemset {7, 5} can be said to be 40% (or 0.40) because that itemset appears in two transactions from the five transactions in the input database (2 / 5 = 0.40).
The utility of itemset is the sum of the utility of transactions where that itemset appears. For example, the itemset {7, 5} appears in transactions t4 and t5, which have a utility of 37 and 21, respectively. Thus the utility of {7, 5} is said to be 37 + 21 = 58. The average utility of itemset is its utility divided by its support. For example, the average utility of itemset {7, 5} is 58 / 2 = 29. It is to be noted that this definition of average utility is not the same as in some other research papers.
The cost of an itemset is the sum of its cost values for the transactions where the itemset appears. For example, the itemset {7, 5} appears in transactions t4 and t5. In the transaction t4, the cost values of items 7 and 5 are 5 and 6, respectively. In the transaction t5, the cost values of items 7 and 5 are 2 and 3, respectively. Thus, the cost of itemset {7, 5} is 5 + 6 + 2 + 3 = 16. The average cost of an itemset is its cost divided by its support. For example, the average cost of itemset {7, 5} is 16 / 2 = 8.
For example, if we run LCIM with a min_utility = 10, max_cost = 10 and minsup = 0.3 , we obtain 14 low-cost high-utility itemsets:
itemsets | average utility | average cost | support |
{7} | 29 | 3.50 | 2 |
{7, 5} | 29 | 8.00 | 2 |
{7, 3} | 29 | 7.50 | 2 |
{1} | 31.66 | 6.66 | 3 |
{1, 4} | 29 | 9.00 | 2 |
{1, 3} | 31.66 | 9.33 | 3 |
{2} | 27 | 7.33 | 3 |
{2, 3} | 27 | 9.33 | 3 |
{4} | 26 | 4.66 | 3 |
{4, 5} | 30 | 9.00 | 2 |
{4, 3} | 26 | 6.33 | 3 |
{5 } | 29.5 | 3.75 | 4 |
{5, 3} | 29.5 | 6.75 | 4 |
{3} | 27.2 | 2.60 | 5 |
If the database is a transaction database about e-learning, we could interpret these results as all the groups of items that represent some efficient ways of learning, that is sets of courses that lead to high grades and do not require a high effort in terms of studying time.
Input file format
The input file format of LCIM is defined as follows. It is a text file. Each lines represents a transaction. Each line is composed of three sections, as follows.
- First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
- Second, the symbol ":" appears and is followed by the transaction utility (an integer).
- Third, the symbol ":" appears and is followed by the cost of each item in this transaction (an integer), separated by single spaces.
For example, for the previous example, the input file is defined as follows:
3 5 1 2 4 6:40:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:18:1 5 2
3 5 1 7:37:6 6 10 5
3 5 2 7:21:2 3 4 2
Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 40 and that items 3, 5, 1, 2, 4 and 6 respectively have a cost of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
Output file format
The output file format of LCIM is defined as follows. It is a text file, where each line represents a low-cost high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword " #AUTIL: " appears and is followed by the average utility of the itemset. After that, the keyword " #ACOST: " appears and is followed by the average cost of the itemset. Then, the keyword " #SUP: " appears and is followed by the support of the itemset.
For example, we show below the output file for this example.
7 #AUTIL: 29.0 #ACOST: 3.5 #SUP: 2
7 5 #AUTIL: 29.0 #ACOST: 8.0 #SUP: 2
7 3 #AUTIL: 29.0 #ACOST: 7.5 #SUP: 2
1 #AUTIL: 31.666666666666668 #ACOST: 6.666666666666667 #SUP: 3
1 4 #AUTIL: 29.0 #ACOST: 9.0 #SUP: 2
1 3 #AUTIL: 31.666666666666668 #ACOST: 9.333333333333334 #SUP: 3
2 #AUTIL: 27.0 #ACOST: 7.333333333333333 #SUP: 3
2 3 #AUTIL: 27.0 #ACOST: 9.333333333333334 #SUP: 3
4 #AUTIL: 26.0 #ACOST: 4.666666666666667 #SUP: 3
4 5 #AUTIL: 30.0 #ACOST: 9.0 #SUP: 2
4 3 #AUTIL: 26.0 #ACOST: 6.333333333333333 #SUP: 3
5 #AUTIL: 29.5 #ACOST: 3.75 #SUP: 4
5 3 #AUTIL: 29.5 #ACOST: 6.75 #SUP: 4
3 #AUTIL: 27.2 #ACOST: 2.6 #SUP: 5
For instance, the second line indicates that the itemset {7, 5} has an average utility of 29, an average cost of 8 and a support of 2 (40%). The following lines follows the same format.
Performance
The LCIM algorithm is the first algorithm for the problem of low-cost high utility itemset mining. This is the original implementation of the algorithm, containing all the optimizations described in the research paper.
Implementation details
Note that the input format is not exactly the same as described in the original article. But it is equivalent.
Where can I get more information about the algorithm?
The LCIM algorithm was presented in this paper:
Nawaz, M. S., Fournier-Viger, P., Alhusaini, N., He, Y., Wu, Y. and Bhattacharya, D. (2022). LCIM: Mining Low Cost High Utility Itemsets . Proc. of the 15th Multi-disciplinary International Conference on Artificial Intelligence (MIWAI 2022), 12 pages, Springer LNAI, to appear.
While the LCIM algorithm is designed for analyzing transactions, there exists similar algorithms that are designed for analyzing sequences with utility and cost value. For this type of data, the CEPB, CorCEPB and CEPN algorithms are available in SPMF. Those are presented in this paper:
Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2019). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier, 191: 105241
Besides, for a general overview of high utility itemset mining, you may read this survey paper.