Mining Frequent Itemsets using the MPFPS-BFS and MPFPS-DFS Algorithms (SPMF documentation)

This example explains how to run the MPFPS-BFS and MPFPS-DFS algorithms using the SPMF open-source data mining library.

How to run this example?

What are the MPFPS-BFS and MPFPS-DFS algorithms?

MPFPS-BFS and MPFPS-DFS (Fournier-Viger et al., 2019) are algorithms to find all periodic frequent patterns common to multiple sequences in a sequence database. The MPFPS-BFS algorithm applies a breadth-first search, while the MPFPS-DFS algorithms relies on a depth-first search. Periodic frequent patterns have several applications. For example, consider that we have a database of sequences of transactions made by shoppers in a retail store. Finding periodic frequent patterns in such data can reveal products that are regularly bought by several customers (e.g. every week or month) to promote the sale of groups of items.

What is the input of these algorithms?

The input of MPFPS-BFS and MPFPS-DFS algorithms is a sequence database (a set of sequences) and four thresholds, named minSup (an integer), maxStd (a double value), maxPr (an integer) and minRa (a double value), respectively.

A sequence database is a set of sequences where each sequence is a list of itemsets (also called transactions). An itemset is an unordered set of items. For example, the table shown below contains four sequences (S0, S1, S2 and S3). The first sequence, named S0, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, then followed by item 4, and then followed by items 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.

ID Sequences
S0 (1), (1 2 3), (1 3), (4), (3 6)
S1 (1 4), (3), (2 3), (1 5)
S2 (5 6), (1 2), (4 6), (3), (2)
S3 (5), (7), (1 6), (3), (2), (3)

Sequence databases are found in many fields. For example, the first sequence S0 could represents transactions made by a customer, while sequences S1, S2 and S3 could be transactions made by other customers.

What is the output of these algorithms?

MPFPS-BFS and MPFPS-DFS are algorithms for discovering itemsets (sets of items) that appear periodically in many input sequences. An itemset is called periodic frequent in a sequence if it appears in at least minsup transactions from that sequence, the periods (the intervals between two consecutive appearances) of that itemset in a sequence is never larger than a threshold, named maxPr, and the standard deviation of its periods is no larger than a threshold, named maxStd. Moreover, to mine periodic frequent patterns common to multiple sequences, there is a requirement that the mined patterns should be periodic and frequent in at least minRa sequences of the database (minRa is a ratio). For the same input, the MPFPS-BFS and MPFPS-DFS algorithms have the same output.

For example, if MPFPS-BFS and MPFPS-DFS are run on the previous sequence database with minSup = 2, maxPr = 3, maxStd = 1.0 and minRa = 0.4, MPFPS-BFS and MPFPS-DFS find that there are five periodic frequent patterns:

Itemset RA Occurrences
{1} 0.5 sequence S0 Itemsets: [0] [1] [2]
sequence S1 Itemsets: [0] [3]
{2} 0.25 sequence S2 Itemsets: [1] [4]
{3} 0.75

sequence S0 Itemsets: [1] [2] [4]
sequence S1 Itemsets: [1] [2]
sequence S3 Itemsets: [3] [5]

{6} 0.25 sequence S2 Itemsets: [0] [2]
{1,3} 0.25 sequence S0 Itemsets: [1] [2]

These results can be interpreted as follows.

Consider the first pattern {1}, presented in the second line of the above table. This pattern contains the item 1. This pattern is said to have a RA of 0.5 because {1} is a periodic frequent pattern in two sequences out of four sequences (i.e. 2 / 4 = 0.5). The third column of the table indicates that these two sequences are S0 and S1. Moreover, the third column indicates in which itemsets of these sequences the pattern {1} appeared. It appeared in the three first itemsets of sequence S0, denoted as 0, 1, 2. Moreover, the itemset {1} appeared in the first and third itemsets of the sequence S1, denoted as 0 and 3. These occurrences of the pattern {1} are highlighted in bold, below.

ID Sequences
S0 (1), (1 2 3), (1 3), (4), (3 6)
S1 (1 4), (3), (2 3), (1 5)
S2 (5 6), (1 2), (4 6), (3), (2)
S3 (5), (7), (1 6), (3), (2), (3)

Note that the pattern {1} also appears in sequence S2 and S3, but it is not periodic in those sequences. The other patterns can be understood in a similar way.

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a periodic frequent itemset. On a line, the items of the itemset are first listed. Each item is represented by an integer separated by a space. After, that the keyword #RA: appears followed by a double value indicating the RA value of the pattern. Then, the keyword #SIDOCC: appears followed by an integer representing a sequence identifier and the list of itemsets where the pattern appears between squared brackets.

For example, for the above example, the result is the following:

1 #RA:0.5 #SIDOCC: 0[0] [1] [2] 1[0] [3]
2 #RA:0.25 #SIDOCC: 2[1] [4]
3 #RA:0.75 #SIDOCC: 0[1] [2] [4] 1[1] [2] 3[3] [5]
6 #RA:0.25 #SIDOCC: 2[0] [2]
1 3 #RA:0.25 #SIDOCC: 0[1] [2]

Consider the last line. It indicates that the pattern {1,3} is a periodic frequent patterns with a RA of 0.25, and that this pattern is periodic in sequence S0 (the first sequence) and appears in the second and third itemsets of that sequences (denoted as [1] and [2]).

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPrefixSpan.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply the pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns having this format:

apple #RA:0.5 #SIDOCC: 0[0] [1] [2] 1[0] [3]
orange #RA:0.25 #SIDOCC: 2[1] [4]
tomato #RA:0.75 #SIDOCC: 0[1] [2] [4] 1[1] [2] 3[3] [5]
noodle #RA:0.25 #SIDOCC: 2[0] [2]
apple tomato #RA:0.25 #SIDOCC: 0[1] [2]

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

Performance

This is the original implementation of the MPFPS-DFS and MPFPS-BFS algorithms. MPFPS-BFS and MPFPS-DFS algorithms are designed to mine periodic frequent patterns common to multiple sequences, which is a new research topic. Thus, these two algorithms have been compared in the original for several parameter values with baseline versions designed to find all frequent patterns. The experimental evaluation have shown that they can output the correct result and are efficient. Both algorithms have the same input and output but apply different techniques to find the result. For this reason, their performance will not be the same, and one may be better than the other on different types of data. You may look at the original paper describing these algorithms (below) for a comparison of their performance, and a more detailed explanation of their difference.

Where can I get more information about the algorithm?

This is the paper describing the algorithm:

Fournier-Viger, P., Li, Z., Lin, J. C. W., Kiran, R. U., Fujita, H. (2019). Efficient Algorithms to Identify Periodic Patterns in Multiple Sequences. Information Sciences, Elsevier, 489: 205-226