Mining Rare Correlated Periodic Patterns Common to Multiple Sequence using the MRCPPS algorithm (SPMF documentation)

This example explains how to run the MRCPPS algorithm using the SPMF open-source data mining library.

How to run this example?

What is the MRCPPS algorithm?

MRCPPS (Fournier-Viger, P. Yang, P. et al., 2020) is an algorithm to find all rare correlated periodic frequent patterns common to multiple sequences (a sequence database). Periodic frequent patterns have several applications. For example, consider that we have a database of sequences of transactions made by shoppers in a retail store. Finding periodic frequent patterns in such data can reveal products that are regularly bought by several customers (e.g. every week or month) to promote the sale of groups of items. Rather than finding frequent periodic patterns as done by other algorithms like MPFPS, this algorithms looks for the rare patterns that have a high strong correlation..

What is the input of these algorithms?

The input is a sequence database and four parameters that are set by the user:

A sequence database is a set of sequences where each sequence is a list of itemsets (also called transactions). An itemset is an unordered set of items. For example, the table shown below contains four sequences (S0, S1, S2 and S3). The first sequence, named S0, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, then followed by item 4, and then followed by items 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.

ID Sequences
S0 (1), (1 2 3), (1 3), (4), (3 6)
S1 (1 4), (3), (2 3), (1 5)
S2 (5 6), (1 2), (4 6), (3), (2)
S3 (5), (7), (1 6), (3), (2), (3)

Sequence databases are found in many fields. For example, the first sequence S0 could represents transactions made by a customer, while sequences S1, S2 and S3 could be transactions made by other customers.

What is the output of these algorithms?

MRCPPS is an algorithm for discovering rare correlated periodic patterns (RCPPs) common to multiple sequences. The algorithm find itemsets (sets of items) that appear rarely, contains correlated items, and occur periodically in a sequence database.

An itemset is called rare periodic frequent pattern in a sequence if it appears in no more than maxsup  transactions from that sequence, the periods (the intervals between two consecutive appearances) of that itemset in a sequence is never larger than a threshold, named maxPr, and the standard deviation of its periods is no larger than a threshold, named maxStd. Moreover, to mine periodic frequent patterns common to multiple sequences, there is a requirement that the mined patterns should be periodic and frequent in at least minRa sequences of the database (minRa is a ratio), and that the correlation as measured by the bond measure is no less than a parameter minBond. (see the research paper for more details and a formal definition)

A rare pattern is an itemset having a support no larger than maxSup parameter provided by the user.
A correlated pattern is an itemset having a bond no less than minBond parameter provided by the user (the bond measure is: support of itemsets divided by its disjunctive support)
A periodic pattern is an itemset having a standard deviation of periods no larger than minStd parameter provided by the user.
A rare correlated periodic pattern common to multiple sequence is an itemset having rare, correlated and periodic behavior at many sequence (satisfy user-defined ratio, in other word, its sequence periodic ratio no less than minRa provided by the user )

For example, if MRCPPS is run on the previous sequence database with maxSup = 2, maxStd =1, minBond = 0.5 and minRa = 0.5, MRCPPS finds five RCPPs:

Itemset

RA Occurrences

{1}

0.5

sequence S2 Itemsets: [1]
sequence S3 Itemsets: [2]

{2}

0.5

sequence S0 Itemsets:[1]
sequence S1 Itemsets: [2]

{4}

0.5

sequence S1 Itemsets: [0]
sequence S2 Itemsets: [2]

{6}

0.5 sequence S2 Itemsets: [0] [2]
sequence S3 Itemsets: [2]

{2, 3}

0.5 sequence S0 Itemsets: [1]
sequence S1 Itemsets: [2]

These results can be interpreted as follows.

Consider the first pattern {1}, presented in the third line of the above table. This pattern contains the item 1. This pattern is said to have a RA of 0.5 because {1} is a periodic rare pattern in two sequences out of four sequences (i.e. 2 / 4 = 0.5). The third column of the table indicates that this pattern is periodic rare in sequences S2 and S3. Moreover, the third column indicates in which itemsets of these sequences the pattern {1} appeared for these calculations. It appeared in thethe second itemset of sequence S2, denoted as 1. Moreover, the itemset {1} appeared in the second itemset of the sequence S3, denoted as 2. The occurrences of the pattern {1} used for these calculations are highlighted in bold, below.

ID Sequences
S0 (1), (1 2 3), (1 3), (4), (3 6)
S1 (1 4), (3), (2 3), (1 5)
S2 (5 6), (1 2), (4 6), (3), (2)
S3 (5), (7), (1 6), (3), (2), (3)

The other patterns can be understood in a similar way.

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a RCPP. On a line, the items of the itemset are first listed. Each item is represented by an integer separated by a space. After, that the keyword #RA: appears followed by a double value indicating the RA value of the pattern. Then, the keyword #SIDOCC: appears followed by an integer representing a sequence identifier and the list of itemsets where the pattern appears between squared brackets.

For example, for the above example, the result is the following:
1 #RA: 0.5 #SIDOCC: 2[1] 3[2]
2 #RA: 0.5 #SIDOCC: 0[1] 1[2]
4 #RA: 0.5 #SIDOCC: 1[0] 2[2]
6 #RA: 0.5 #SIDOCC: 2[0] [2] 3[2]
2 3 #RA: 0.5 #SIDOCC: 0[1] 1[2]

Consider the last line. It indicates that the pattern {2,3} is a rare correlated periodic copatterns with a RA of 0.5, and that this pattern is periodic in sequence S0 (the first sequence) and appears in the second itemset of that sequence (denoted as [1]). Moreover, this pattern is periodic in sequence S1 (the second sequence) and appears in the third itemset of that sequence (denoted as [2]).

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPrefixSpan.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply the pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns having this format:

apple #RA:0.5 #SIDOCC: 2[1] 3[2]

...

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

Performance

This is the original implementation of the MRCPPS algorithm.

Where can I get more information about the algorithm?

This is the paper describing MRCPPS.

Fournier-Viger, P., Yang, P., Li, Z., Lin, J. C.-W., Kiran. R. U. (2019). Discovering Rare Correlated Periodic Patterns in Multiple Sequences. Data & Knowledge Engineering (DKE), to appear, DOI: 10.1016/j.datak.2019.101733.