Mining Cost-Efficient Sequential Patterns Using CEPB Algorithm (SPMF documentation)

This example explains how to run the CEPB algorithm using the SPMF open-source data mining library.

How to run this example?

What is CEPB?

CEPB is an algorithm proposed by Fournier-Viger, P., Li, Jiaxuan et al. (2020) for discovering cost-effective patterns in a sequence database (a set of sequences), where each event/activity in a sequence has a cost, and each sequence has an outcome called "utility" that is a binary class (good or bad outcome).

This type of data can be found in many domains. For example, it can model how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent for studying for each activity. Then, the utility represents the outcome of a sequence such whether the student passed or failed the course or exam (binary utility). Another example of cost-utility sequences is hospital data, where a sequence is a list of medicines taken by some patient, the cost indicates how much money or time was spent for these medical treaments, and the utility is whether a patient has cured or died.

The CEPB algorithm can discover patterns in such sequences that are said to be cost-efficient. This means that the pattern has a low cost and is likely to bring a high utility. In the above examples, some cost-effective patterns may be that studying some e-learning materials A and B typically require not much time but leadw to high scores, or that taking some given medicines has a low cost but a high possibility of success to cure a disease.

It is to be noted that another algorithm named CorCEPB is offered in SPMF, which can be viewed as an improvement of CEPB that also calculates the correlation between a pattern and a good outcome. Besides, another algorithm named CEPN is provided to handle the case where utility is a numeric value instead of a binary value for each sequence.

What is the input of CEPB?

CEPB takes as input a sequence database where each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value ( a boolean value indicating for example, a good or bad outcome). Moreover, the user must set two parameters: (1) a minimum support threshold minsup (a positive integer), (2) a maximum cost threshold maxcost (a positive integer), and a (3) minimum occupancy minoccupancy threshold (a value in the [0,1] interval). 

For example, consider the following sequence database, which is provided in the file example_CEPB.txt of the SPMF distribution:

1[4] -1 2[2] -1 5[4] -1 3[4] -1 4[5] -1 -2 SUtility:1
2[3] -1 3[2] -1 6[1] -1 4[1] -1 5[2] -1 -2 SUtility:1
1[2] -1 6[2] -1 5[1] -1 3[3] -1 4[5] -1 -2 SUtility:1
1[2] -1 2[2] -1 3[1] -1 6[2] -1 -2 SUtility:1

This database contains four lines. Each line is a sequence.

Moreover, each sequence (line) is an ordered list of events separated by -1.

An event is represented by a positive integer and it is followed by a cost value (e.g. spent time on the event) indicated between squared brackets [ ].

The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by 0 or 1, which respectively represent a negative or positive outcome.

For example, the first line indicates that event "1" had a cost of 4, was followed by event "2" with a cost of 2, followed by event "5" with a cost of 4, followed by event "3" with a cost of 4, and followed by event "4", with a cost of 5. Moreover, this sequence has a utility of 1, which means a positive outcome. The other sequences follow the same format.

This database could for example represents sequences of learning activities made by learners, where the events 1,2,3,4 and 5 are learning activities, cost values are the time spent on a learning activity and the utility is to pass or faill an exam

What is the output of CEPB?

The output of CEPB is the set of all cost-effective patterns meeting the criteria specified by the user.

A pattern is a subsequence of events that appear in some sequences of the input database.

A pattern is said to be cost-effective if it meets the following criteria:

Note: It is important to note that the CEPB algorithm only considers the sequences that have a positive utility (SUtility: 1). There is another algorithm called CorCEPB that considers both sequences with a positive and negative utility.

For example, if we set minsup = 2, maxcost = 50, minoccupancy = 0.1, we find 29 cost-effective patterns.

This is the output file, where each line is a cost-effective pattern:

1 -1 -2 #SUP: 3 #AVGCOST: 2.667 #OCCUP: 0.108
2 -1 -2 #SUP: 3 #AVGCOST: 2.333 #OCCUP: 0.108
5 -1 -2 #SUP: 3 #AVGCOST: 2.333 #OCCUP: 0.100
3 -1 -2 #SUP: 4 #AVGCOST: 2.500 #OCCUP: 0.106
4 -1 -2 #SUP: 3 #AVGCOST: 3.667 #OCCUP: 0.100
6 -1 -2 #SUP: 3 #AVGCOST: 1.667 #OCCUP: 0.108
1 -1 2 -1 -2 #SUP: 2 #AVGCOST: 5.000 #OCCUP: 0.450
1 -1 5 -1 -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.400
1 -1 3 -1 -2 #SUP: 3 #AVGCOST: 5.333 #OCCUP: 0.433
1 -1 4 -1 -2 #SUP: 2 #AVGCOST: 8.000 #OCCUP: 0.400
1 -1 6 -1 -2 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.450
2 -1 5 -1 -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.400
2 -1 3 -1 -2 #SUP: 3 #AVGCOST: 4.667 #OCCUP: 0.433
2 -1 4 -1 -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.400
2 -1 6 -1 -2 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.450
5 -1 3 -1 -2 #SUP: 2 #AVGCOST: 6.000 #OCCUP: 0.400
5 -1 4 -1 -2 #SUP: 2 #AVGCOST: 7.500 #OCCUP: 0.400
3 -1 4 -1 -2 #SUP: 3 #AVGCOST: 6.667 #OCCUP: 0.400
3 -1 6 -1 -2 #SUP: 2 #AVGCOST: 3.000 #OCCUP: 0.450
6 -1 5 -1 -2 #SUP: 2 #AVGCOST: 3.000 #OCCUP: 0.400
6 -1 4 -1 -2 #SUP: 2 #AVGCOST: 4.500 #OCCUP: 0.400
1 -1 2 -1 3 -1 -2 #SUP: 2 #AVGCOST: 7.500 #OCCUP: 0.675
1 -1 5 -1 3 -1 -2 #SUP: 2 #AVGCOST: 9.000 #OCCUP: 0.600
1 -1 5 -1 4 -1 -2 #SUP: 2 #AVGCOST: 10.500 #OCCUP: 0.600
1 -1 3 -1 4 -1 -2 #SUP: 2 #AVGCOST: 11.500 #OCCUP: 0.600
2 -1 3 -1 4 -1 -2 #SUP: 2 #AVGCOST: 8.500 #OCCUP: 0.600
2 -1 3 -1 6 -1 -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.675
5 -1 3 -1 4 -1 -2 #SUP: 2 #AVGCOST: 11.000 #OCCUP: 0.600
1 -1 5 -1 3 -1 4 -1 -2 #SUP: 2 #AVGCOST: 14.000 #OCCUP: 0.800

Consider the last line:
1 -1 5 -1 3 -1 4 -1 -2 #SUP: 2 #AVGCOST: 14.000 #OCCUP: 0.800

This line is the pattern where event "1" is followed by event "5", followed by event "3", and finally followed by event "4". On that line, there is #SUP:2, which indicates that the support (occurrence frequency) of this pattern is 2, and its average cost is 14.

The support of that pattern is 2 because this pattern appears in the first and the third sequences of the input database. Below, I have highlighted these two occurrences in blue, and the corresponding cost values in red:

1[4] -1 2[2] -1 5[4] -1 3[4] -1 4[5] -1 -2 SUtility:1
2[3] -1 3[2] -1 6[1] -1 4[1] -1 5[2] -1 -2 SUtility:1
1[2] -1 6[2] -1 5[1] -1 3[3] -1 4[5] -1 -2 SUtility:1
1[2] -1 2[2] -1 3[1] -1 6[2] -1 -2 SUtility:1

In that first sequence, the cost of that pattern is 4 + 4 + 4 + 5 = 17 (the sum of the red values). In the third sequence, the cost of that pattern is 2 + 1 + 3 + 5 = 11. Thus, the average cost of that pattern is : (17 + 11) / 2 = 14.

Besides, "#OCCUP: 0.800" indicates that this pattern has an occupancy of 0.8. The occupancy of a pattern represents how much a pattern covers the sequences in which it appears. The occupancy of the pattern in Sequence 1 is 0.8 because the pattern consists of 4 of the 5 events of that sequence. The occupancy of the pattern in Sequence 3 is also 0.8 because it consists of 4 of the 5 events in that sequence. The occupancy of that pattern for the whole database is the average of its occupancy in Sequence 1 and Sequence 3, that is 0.8.

Since the cost of that pattern is 14 <= maxcost, its support is 2 >= minsup, and its occupancy is 0.8 >= minoccupancy, this pattern is a cost-effective pattern, and it is output.

If the database is sequences of events from an e-learning website, we could interpret this pattern as follows. Two learners have some learning activity 1, followed by some learning activity 5, followed by the activities 3 and 4, and they have spent an average of 14 hours, and then passed the exam..

Note that if one does not wish to use the occupancy measure, it is possible to simply set minoccupancy = 0.

Optional parameter(s)

The CEPB implementation allows to specify an additional optional parameter:

These parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTestCEPB.java" provided in the source code of SPMF.

The parameter(s) can be also used in the command line with the Jar file. If you want to use these optional parameter(s) in the command line, it can be done as follows. Consider this example:
java -jar spmf.jar run CEPB example_CEPB .txt output.txt 2 50 0.1 10
This command means to apply the algorithm on the file "example_CEPB.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for minsup = 2, maxcost = 50, minoccupancy = 0.1, and patterns must have a maximum length of 10 items.

Implementation details

This is the original implementation of the algorithm.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "example_CEPB.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=homework
@ITEM=2=lesson
@ITEM=3=anotherlesson
@ITEM=4=groupwork
@ITEM=5=selfstudy
@ITEM=-1=selfstudy
1[4] -1 2[2] -1 5[4] -1 3[4] -1 4[5] -1 -2 SUtility:1
2[3] -1 3[2] -1 6[1] -1 4[1] -1 5[2] -1 -2 SUtility:1
1[2] -1 6[2] -1 5[1] -1 3[3] -1 4[5] -1 -2 SUtility:1
1[2] -1 2[2] -1 3[1] -1 6[2] -1 -2 SUtility:1

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "homework". The third line indicates that the item 2 is called "lesson".... The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file looks like this:

homework selfstudy -2 #SUP: 3 #AVGCOST: 2.667 #OCCUP: 0.108
lesson selfstudy -2 #SUP: 3 #AVGCOST: 2.333 #OCCUP: 0.108
selfstudy selfstudy -2 #SUP: 3 #AVGCOST: 2.333 #OCCUP: 0.100
anotherlesson selfstudy -2 #SUP: 4 #AVGCOST: 2.500 #OCCUP: 0.106
groupwork selfstudy -2 #SUP: 3 #AVGCOST: 3.667 #OCCUP: 0.100
6 selfstudy -2 #SUP: 3 #AVGCOST: 1.667 #OCCUP: 0.108
homework selfstudy lesson selfstudy -2 #SUP: 2 #AVGCOST: 5.000 #OCCUP: 0.450
homework selfstudy selfstudy selfstudy -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.400
homework selfstudy anotherlesson selfstudy -2 #SUP: 3 #AVGCOST: 5.333 #OCCUP: 0.433
homework selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 8.000 #OCCUP: 0.400
homework selfstudy 6 selfstudy -2 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.450
lesson selfstudy selfstudy selfstudy -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.400
lesson selfstudy anotherlesson selfstudy -2 #SUP: 3 #AVGCOST: 4.667 #OCCUP: 0.433
lesson selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.400
lesson selfstudy 6 selfstudy -2 #SUP: 2 #AVGCOST: 4.000 #OCCUP: 0.450
selfstudy selfstudy anotherlesson selfstudy -2 #SUP: 2 #AVGCOST: 6.000 #OCCUP: 0.400
selfstudy selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 7.500 #OCCUP: 0.400
anotherlesson selfstudy groupwork selfstudy -2 #SUP: 3 #AVGCOST: 6.667 #OCCUP: 0.400
anotherlesson selfstudy 6 selfstudy -2 #SUP: 2 #AVGCOST: 3.000 #OCCUP: 0.450
6 selfstudy selfstudy selfstudy -2 #SUP: 2 #AVGCOST: 3.000 #OCCUP: 0.400
6 selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 4.500 #OCCUP: 0.400
homework selfstudy lesson selfstudy anotherlesson selfstudy -2 #SUP: 2 #AVGCOST: 7.500 #OCCUP: 0.675
homework selfstudy selfstudy selfstudy anotherlesson selfstudy -2 #SUP: 2 #AVGCOST: 9.000 #OCCUP: 0.600
homework selfstudy selfstudy selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 10.500 #OCCUP: 0.600
homework selfstudy anotherlesson selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 11.500 #OCCUP: 0.600
lesson selfstudy anotherlesson selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 8.500 #OCCUP: 0.600
lesson selfstudy anotherlesson selfstudy 6 selfstudy -2 #SUP: 2 #AVGCOST: 5.500 #OCCUP: 0.675
selfstudy selfstudy anotherlesson selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 11.000 #OCCUP: 0.600
homework selfstudy selfstudy selfstudy anotherlesson selfstudy groupwork selfstudy -2 #SUP: 2 #AVGCOST: 14.000 #OCCUP: 0.800

Note that what is described above is the normal way of using item names in SPMF. But the CEPB algorithm also allows to directly replace the items like 1 and 2 with strings in the input file like this:

homework[4] -1 selfsttudy [2] -1 -2 SUtility:1

This will achieve a similar result

Where can I get more information about the CEPB algorithm?

The CEPB algorithm is described in this article:

Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2019). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier,

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.