# Mining Non-Redundant Periodic Frequent Patterns using the NPFPM algorithm (SPMF documentation)

This example explains how to run the NPFPM algorithm using the SPMF open-source data mining library.

## How to run this example?

• If you are using the graphical interface, (1) choose the "NPFPM" algorithm, (2) select the input file "contextPFPM.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum support treshold, periodicity threshold, and the difference threshold, respectively to 0.1, to 2, and to 1.5 and (5) click "Run algorithm".
• If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run NPFPM contextPFPM.txt output.txt 0.1 2 1.5
in a folder containing spmf.jar and the example input file contextPFPM.txt.
• If you are using the source code version of SPMF, to run respectively NPFPM, launch the file "MainTestNPFPM_saveToFile.java"in the package ca.pfv.SPMF.tests.

## What is NPFPM ?

NPFPM is an algorithm for discovering non-redundant periodic frequent itemsets in a sequence of transactions (a transaction database). Non-redundant periodic frequent patterns are those whose periodic occurences cannot be directl inferred from their subsets that are periodic. That is, they are periodic not because their proper subsets are periodic. It was proposed by Afriyie et. al. (2020), Afriyie et. al. (2021). Periodic pattern mining has many applications such as discovering periodic behavior of customers, and finding recurring events.

## What is the input of the NPFPM algorithm?

The input is a transaction database (a sequence of transactions) and two parameters that are set by the user:

• a minimum support treshold ( a positive double value representing a percentage)
• a periodicity threshold per ( a positive integer)
• a difference threshold diff ( a positive integer)

Note that two optional parameters are also offered to specify constraints on the minimum and maximum number of items that patterns should contain (positive integers).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 7 transactions (t1, t2, ..., t7) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1 and 3. This database is provided as the file contextPFPM.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

 Transaction id Items t1 {1, 3} t2 {5} t3 {1, 2, 3, 4, 5} t4 {2, 3, 4, 5} t5 {1, 3, 4} t6 {1,3,5} t7 {2, 3, 5}

## What is the output of the algorithm?

NPFPM is an algorithm for discovering non-redundant periodic frequent patterns whose periodic occurrences cannot be derived from the periodic occurrence of their proper subsets. An itemset is a group of items. The NPFPM algorithm finds itemsets that appears periodically in a sequence of transactions having similar periodicities. To measure if an itemset is periodic, the algorithm calculates its periods. To explain this in more details, it is necessary to introduce a few definitions.

The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {1, 3}. The set g({1,3}) is equal to {t1, t3, t5, t6}. In other words, the itemset {1, 3} appears in the transactions t1, t3, t5 and t6. It is also said that these are four occurrences of the itemset {1,3}.

Now, to assess the periodicity of an itemset X, its list of periods is calculated. A period is the time in terms of number of transactions between two occurences of an itemset in the database (see the paper for the formal definition). For example, the periods of the itemset {1, 3} are {1,2,2,1,1}. The first period of {1,3} is 1 because {1,3} appears in the first transaction after the creation of the database. The second period of {1,3} is 2 because the itemset appears in transaction t3, which is two transactions after t1. The third period of {1,3} is 2 because the itemset appears in transactions t5, which is two transactions after t3. Then, the fourth period of {1,3} is 1 because the itemset appears in t6, which is one transaction after t5. Finally, the fifth period of {1,3} is 1 because there is one transaction appearing after the last occurrence of {1,3} in the database (in t6).

The NPFPM algorithm utilize the list of periods of an itemset X to calculate its average periodicity, minimum periodicity, maximum periodicity and standard deviation among the set of periods. The average periodicity is calculated as the average of the periods of the itemset. The minimum periodicity is the smallest period among the periods of the itemset (see the paper for details). The maximum periodicity is the largest period among the periods of the itemset.

The NPFPM algorithm finds all the itemsets that have, (per-diff) less than or equal to (average period - standard deviation) and (per+diff) less than or equal to (average period + standard deviation).

Also note that NPFPM is designed to output only the non redundant itemsets. A periodic itemset is non redundant if it is not a proper subset of any other itemsets having the same support. It is said that non redundant itemsets may be more useful for decision making in some scenarios.

For example, if NPFPM is run on the previous transaction database with a per = 2, diff = 1.5, the NPFPM algorithm finds 8 non-redundant periodic frequent itemsets.

 itemset support (number of transactions where the itemset appear) minimum periodicity maximum periodicity average periodicity {4} 3 1 3 1.75 {4, 5} 2 1 3 2.333 {2, 4} 2 1 3 2.333 {1, 4} 2 2 3 2.333 {1, 4, 5} 1 3 3 3.0 {1, 2} 1 3 3 3.0 {1} 4 1 2 1.4 {1, 5} 2 1 3 2.333

## How should I interpret the results?

Each frequent periodic itemset is annotated with its support (number of transactions where it appears) as well as its minimum/maximum periodicity and average periodicity. For example, the itemset {1} has a support of 4 because it appears in four transactions (t1, t3, t5 and t6. The average periodicity of {1} is 1.4 because on average it appears every 1.4 transactions in terms of time. The smallest period of {1} (minimum periodicity) is 1 and the largest period of {1} is 2 transactions.This indicates that {1} appears quite periodically.

## Input file format

The input file format used by NPFPM is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

## Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent periodic itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. Then, the keyword #MINPER: appears and is followed by a space, an integer indicating the minimum periodicity of the itemset, and another space.Then, the keyword #MAXPER: appears and is followed by a space, an integer indicating the maximal periodicity of the itemset, and another space. Then, the keyword #AVGPER: appears and is followed by a space, a double value indicating the average periodicity of the itemset.

For example, here is the output file for this example. The first line indicates that the itemset {4} is a frequent periodic itemset, having a support of 3 transactions, a minimum periodicity of 1 transactions, a maximum periodicity of 3 transactions and an average periodicity f 1.75 transactions..

4 #SUP: 3 #MAXPER: 3 #MINPER: 1 #AVGPER: 1.75
4 5 #SUP: 2 #MAXPER: 3 #MINPER: 1 #AVGPER: 2.3333
2 4 #SUP: 2 #MAXPER: 3 #MINPER: 1 #AVGPER: 2.3333
1 4 #SUP: 2 #MAXPER: 3 #MINPER: 2 #AVGPER: 2.3333
1 4 5 #SUP: 1 #MAXPER: 3 #MINPER: 3 #AVGPER: 3.0
1 2 #SUP: 1 #MAXPER: 3 #MINPER: 3 #AVGPER: 3.0
1 #SUP: 4 #MAXPER: 2 #MINPER: 1 #AVGPER: 1.4
1 5 #SUP: 2 #MAXPER: 3 #MINPER: 1 #AVGPER: 2.3333

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

## Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPFPM.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
3 1
5
3 5 1 2 4
3 5 2 4
3 1 4
3 5 1
3 5 2

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define eight transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

milk #SUP: 3 #MAXPER: 3 #MINPER: 1 #AVGPER: 1.75
milk, bread #SUP: 2 #MAXPER: 3 #MINPER: 1 #AVPGER: 2.3333
orange, milk #SUP: 2 #MAXPER: 3 #MINPER: 1 #AVPGER: 2.3333
apple, milk #SUP: 2 #MAXPER: 3 #MINPER: 2 #AVPGER: 2.3333
apple, milk, bread #SUP: 1 #MAXPER: 3 #MINPER: 3 #AVPGER: 3.0

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

## Performance

This is the original implementation