Mining the Top-K Stable Periodic Frequent Patterns using the TSPIN algorithm (SPMF documentation)

This example explains how to run the TSPIN algorithm using the SPMF open-source data mining library.

How to run this example?

• If you are using the graphical interface, (1) choose the "TPIN" algorithm, (2) select the input file "contextTSPIN.txt", (3) set the output file name (e.g. "output.txt") (4) set the maximum periodicity, the parameter k, the maximum lability and the boolean parameter indicating that timestamps are provided, respectively to 5, 3, 1, false and (5) click "Run algorithm".
• If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TSPIN contextTSPIN.txt output.txt 5 3 1 false in a folder containing spmf.jar and the example input file contextTSPIN.txt.
• If you are using the source code version of SPMF, to run respectively TSPIN, launch the file "MainTestTSPIN.java"in the package ca.pfv.SPMF.tests.

What is TSPIN ?

TSPIN is an algorithm for discovering stable periodic frequent itemsets in a sequence of transactions (a transaction database) with or without timestamps. It was proposed by Fournier-Viger, P., Wang, Y., P. Yang, et al. (2020). TSPIN can discover patterns that have a stable periodic behavior in a sequence of transactions. Periodic pattern mining has many applications such as discovering periodic behavior of customers, and finding recurring events.

The first algorithm for discovering stable periodic patterns is SPP-Growth (2019). TSPIN is a variation of that algorithm that discovers the top-k most frequent stable periodic frequent itemsets. The advantage of TSPIN is that the user can directly set k, the number of patterns to be discovered.

What is the input of the TSPIN algorithm?

The input is a transaction database (a sequence of transactions) and four parameters that are set by the user:

• a maximum periodicity threshold maxper ( a positive integer)
• a parameter k( a positive integer) representing the number of patterns to be found
• a maximum lability threshold maxla ( a positive integer)
• a boolean parameter hasNoTimestamps, that must be set to true if the input dataset has no timestamps or otherwise false.

A transaction database is a set of transactions. Each transaction is a set of items (symbols) that has a timestamp. For example, consider the following transaction database. It contains 8 transactions (t1, t2, ..., t8) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 2, 3 and 5. This database is provided as the file contextTSPIN.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted according to some total order in each transaction such as the alphabetical order.

 Transaction id Items Timestamp t1 {2, 3, 5} 1 t2 {2, 3, 4, 5} 3 t3 {2, 3, 4, 5} 3 t4 {1, 2, 3, 4, 5} 5 t5 {1, 3, 5} 6 t6 {2,3,5} 7 t7 {1, 3, 4} 9 t8 {1, 3, 5} 10

What is the output of the algorithm?

TSPIN is an algorithm for discovering the top-k stable periodic frequent patterns, which are also called top-k stable periodic frequent itemsets. An itemset is a group of items. The TSPIN algorithm finds the k itemsets that appear periodically in a sequence of transactions and have the highest support. To measure if an itemset is periodic, the algorithm calculates its periods. To explain this in more details, it is necessary to introduce a few definitions.

The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {2, 3}. The set g({2,3}) is equal to {t1,t2, t3, t4, t6}. In other words, the itemset {2, 3} appears in the transactions t1 t2,t3, t4 and t6. It is also said that these are five occurrences of the itemset {2,3}.

Now, to assess the periodicity of an itemset X, its list of periods is calculated. A period is the time in terms of number of transactions between two occurrences of an itemset in the database (see the paper for the formal definition). For example, the periods of the itemset {2, 3} are {1,1,1,1,2,2}. The first period of {2,3} is 1 because {2,3} appears in the first transaction after the creation of the database. The second period of {2,3} is 1 because the itemset appears in transaction t2, which is one transactions after t1. The third period of {2,3} is 1 because the itemset appears in transactions t3, which is one transactions after t3. Then, the fourth period of {2,3} is 1 because the itemset appears in t4, which is one transaction after t3. Then, the fifth period of {2,3} is 2 because the itemset appears in t6, which is one transaction after t4. Finally, the fifth period of {2,3} is 2 because there is two transactions appearing after the last occurrence of {2,3} in the database (in t6).

The TSPIN algorithms utilize the list of periods of an itemset X to calculate its maximum periodicity and a measure called lability that measures by how far the periods deviates from the maxper threshold set by the user . The maximum periodicity of an itemset is the largest period among the periods of the itemset. The lability of an itemset X is the cummulative sum of the difference between each period length and maxper (see the paper for a formal definition and some detailed examples).

The TSPIN algorithm finds the k itemsets that have the highest support and have a maximum periodicity that is no greater than the maxper threshold, and a maximum lability that is no greater than the maxla threshold.

For example, if TSPIN is run on the previous transaction database with a maxper = 3 , maxla =1, and k = 3, the TSPIN algorithm finds 3 stable periodic frequent itemsets.

 itemset support (number of transactions where the itemset appear) maximum lability {3} 7 0 {3, 5} 7 1 {5} 7 1

How should I interpret the results?

Each stable periodic frequent itemset is annotated with its support (number of transactions where it appears) as well as its maximum lability. For example, the itemset {3, 5} has a support of 7 because it appears in 7 transactions (t1, t2, t3, t4, t5, t6 and t8. The lability of {3,5} is 1 which means that the periodic behavior of this pattern is quite stable through the database (a smaller lability value means a more stable periodic behavior).

Input file format

The input file format used by TSPIN is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line. Items are followed by the character "|" and then the timestamp of the transaction. Note that it is possible to run TSPIN on a database that has no timestamps. In that case, the Boolean parameter of TSPIN (the fourth parameter) should be set to true to indicate that the dataset has no timestamps.

In the previous example, the input file is defined as follows:

2 3 5|1
2 3 4 5|3
2 3 4 5|3
1 2 3 4 5|5
1 3 5|6
2 3 5|7
1 3 4|9
1 3 5|10

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a stable periodic frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. Then, the keyword #MAXLA: appears and is followed by a space, an integer indicating the maximum lability of the itemset.

For example, here is the output file for this example. The first line indicates that the itemset {2} is a frequent periodic itemset, having a support of 3 transactions, and a maximum lability of 1 transaction.

5 #SUP: 6 #MAXLA: 0
3 #SUP: 7 #MAXLA: 0
3 5 #SUP : 7 #MAXLA: 0

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextTSPIN.txt". Here we have modified the file to give a name to each item:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
2 3 5|1
2 3 4 5|3
2 3 4 5|3
1 2 3 4 5|5
1 3 5|6
2 3 5|7
1 3 4|9
1 3 5|10

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define eight transactions in the SPMF format.

Then, if we apply an algorithm on this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

tomato bread #SUP: 7 #MAXLA: 1
tomato #SUP: 7 #MAXLA: 0

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

TSPIN is currently the only algorithm designed to mine the top-k stable periodic frequent itemset in a sequence of transaction (a transaction database) with or without timestamps, offered in SPMF.

TSPIN is a variation of the SPP-Growth algorithm. The difference is that TSPIN let the user decide "k", the number of patterns to be found, while SPP-Growth requires that the user set a minimum support threshold instead. The performance of the two algorithm is similar but mining the top-k patterns is a more difficult problem.

This is the article proposing the TSPIN algorithm:

Fournier-Viger, P., Wang Y., Yang, P., Lin, J. C.-W., Yun, U. (2021). TSPIN: Mining Top-k Stable Periodic Patterns. Applied Intelligence, to appear.

Related to this, you can also view a PPT presentation of the SPP-Growth algorithm (which is similar to TSPIN) or the same presentation as a video.