Mining Nonoverlapping Sequential Patterns in One or Many Sequences Using the NOSEP Algorithm (SPMF documentation)

This example explains how to run the NOSEP algorithm using the SPMF open-source data mining library.

How to run this example?

What is NOSEP?

NOSEP is an algorithm for discovering nonoverlapping sequential patterns in one or multiple sequences of symbols (characters, events, etc.), proposed by Wu et al. (2001).

A particularity of this algorithm is that it can count the multiple occurrences of a pattern in a same sequence. This is different from traditional sequential pattern mining algorithms that only check if a pattern appear in a sequence but do not consider how many times they appear. Another great benefit of NOSEP is that it can be applied to a single sequence or to many sequences.

This is a Java version, converted from the original C++ implementation by Youxi Wu et al. (available at : https://github.com/wuc567/Pattern-Mining)

What is the input of NOSEP?

The input is a sequence of symbols (e.g. events or characters) and five parameters that are set by the user:

A sequence is defined as a list of symbols (characters). For example, the sequence A,A,G,T,A,C,G,A,C,G,C,A,T,C,T,A is a sequence of symbols from the set {A,C,G,T} that is a biological sequence. This sequence indicates that the symbol A is followed by A, followed by G, followed by T, then followed by A and so on.

To run the NOSEP algorithm, an input file must be given, which contains one or more sequences. In the input file format, the symbols are encoded as positive integer numbers. For example, lets say that 1 = A, 3 = C, 7 = G and 20 = T. Then the above sequence is encoded as follow according to the input file format:

1 -1 1 -1 7 -1 20 -1 1 -1 3 -1 7 -1 1 -1 3 -1 7 -1 3 -1 1 -1 20 -1 3 -1 20 -1 1 -1 -2

In that format, the symbol -1 is used as a separator between characters and the symbol -2 indicates the end of the sequence. The above sequence is an encoding of A,A,G,T,A,C,G,A,C,G,C,A,T,C,T,A that can be given to the NOSEP algorithm. This sequence is provided as the file contextNOSEP.txt in the SPMF distribution, and will be used for this example.

What is the output of NOSEP?

The output of NOSEP is all the subsequences that appear at least minsup times in the input sequence and respect the constraints set by the user. For example, in the above example, it can be intuitively observed that the subsequence A,C appears several times in the input sequence. Three occurrences of A,C are illustrated here in bold: A,A,G,T, A,C,G,A,C,G,C,A,T,C,T,A. It can be observed from this example that sometimes the occurrences will have some gaps (this means that some symbols can be skipped). For instance, the third occurrence of A,C skips one symbol T. Because the gaps can be interesting or not to the user, the NOSEP algorithm let the user set some constraints on the minimum and maximum lengths of gaps that are allowed. Moreover, rather than finding all possible occurrences of each pattern, the NOSEP algorithm will only look for the non-overlapping occurrences. NOSEP also allow to set constraints on the minimum and maximum lengths of patterns, to avoid finding patterns that are for example too long or too short.

For example, in the above example, if we set the parameters as:

It means that we are looking for patterns (subsequences) that have at least three non overlapping occurrences. Moreover, we set the constraints that these patterns must have from 1 to 20 symbols, and the gaps between symbols cannot be smaller than 0 and greater than 2. Then the ouput is like this:

Pattern (in SPMF format) Meaning Support (number of occurrences
1 -1 A 6
3 -1 C 4
7 -1 G 3
20 -1 T 3
1 -1 1 -1 A, A 3
1 -1 3 -1
A, C 3
1 -1 7 -1
A, G 3
3 -1 1 -1
C, A 3
3 -1 3 -1
C, C 3
7 -1 1 -1
G, A 3
7 -1 3 -1
G, C 3
1 -1 1 -1 7 -1
A, A, G 3
1 -1 3 -1 1 -1
A, C, A 3
1 -1 7 -1 1 -1
A, G, A 3
1 -1 7 -1 3 -1
A, G, C 3
7 -1 1 -1 3 -1
G, A, C 3
7 -1 3 -1 3 -1
G, C, C 3
1 -1 1 -1 7 -1 1 -1
A, A, G 3
1 -1 1 -1 7 -1 3 -1
A, A, G, C 3
1 -1 7 -1 1 -1 3 -1
A, G, A, C 3
1 -1 7 -1 3 -1 3 -1
A, G, C, C 3
7 -1 1 -1 3 -1 1 -1
G, A, C, A 3
1 -1 1 -1 7 -1 1 -1 3 -1
A, A, G, A, C 3
1 -1 1 -1 7 -1 3 -1 3 -1
A, A, G, C, C 3
1 -1 7 -1 1 -1 3 -1 1 -1
A, G, A, C, A 3
1 -1 1 -1 7 -1 1 -1 3 -1 1 -1 A, A, G, A, C, A 3

For example, consider the last pattern 1 -1 1 -1 7 -1 1 -1 3 -1 1 -1. This patterns represents the pattern A, A, G, A, C, A, that is that the symbol A is followed by A, and then followed by G, then by A, then by C and finally by A. This patterns has a support of 3 because it has three non overlapping occurrence satisfying the constraints in the example sequence.

Now, if you would like to have a more formal definition of the concept of patterns found by NOSEP, it is necessary to review some definitions.

NOSEP is an Apriori-based mining algorithm. It aims at finding patterns with gap constraints. It is based on these definitions (with more details in the research paper)

(Pattern With Gap Constraints and Sequence): Let there be a set of symbols (characters or event types) . A pattern p with gap constraints can be described as p1[ min_1, max_1 ]p2 ··· [ min_m−1, max_m−1 ]pm, where pj , minj and maxj are two nonnegative integers and represent the minimum gap constraint and maximum gap constraint, respectively, 1 ≤ j m.

(Occurrence): A group of m integers L = <l1, l2,... lm> is called an occurrence of p in a sequence s, if and only if 1 ≤l1 < l2 < ··· < lm ≤ n, min_jlj+1 − lj − 1 ≤ maxj, p1 = sl1 ,p2 = sl2 ,. . . , and pm = slm .

(Length Constraints):The length constraints can be written as len = [minlen, maxlen], where minlen and maxlen are the minimum length constraint and the maximum length constraint, respectively. If L = <l1, l2,... lm> satisfies minlenlm l1 + 1 ≤ maxlen, then L is an occurrence with length constraints.

(Non-Overlapping Occurrence Set and Support): Let L = <l1, l2,... lm> and L' = <l'1, l'2,... l'm> be two occurrences. If and only if ∀1 ≤ j m : lj = l'j, L and L' are two nooverlapping occurrences. If any two occurrences in a set are nonoverlapping, then the set is called nonoverlapping occurrence set. The support of p in s under the nonoverlapping condition, which is denoted by sup(p, s), is the size of the maximum nonoverlapping occurrence set.

(Frequent Pattern and NOS):If the support of pattern pin sequence sor in sequence database SDBis no less than the given minimum support threshold minsup, then pattern pis called the frequent pattern. The goal of NOS is to mine all the frequent patterns with the gap and length constraints in sequence sor in sequence database SDB.

(Nettree): Nettree is similar to a tree data structure, consisting of root, leaf, level, parent, child, and so on. Nevertheless, Nettree has three characteristics that are evidently different from the tree structure.
1) A Nettree may have n roots, where n > 1.
2) To describe a node effectively, node i in the jth level is denoted by nij since the same node label can occur on different levels.
3) Any node except the root may have more than one parent and all its parents must be at the same level; that is the nonroot node nij (j > 1) may have multiple parents{ni1j−1, ni2j−1,..., nimj−1} (m ≥ 1), and thus there may be multiple paths from a node to a root node.

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence. An input file can contain one or more sequence.. Each symbol from a sequence is a positive integer and each symbol in a sequence is separated by " -1 ". The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextNOSEP.txt" contains one sequence (one line)

1 -1 1 -1 7 -1 20 -1 1 -1 3 -1 7 -1 1 -1 3 -1 7 -1 3 -1 1 -1 20 -1 3 -1 20 -1 1 -1 -2

This line indicates that the symbol 1 was followed by symbol 1, followed by 7, followed by 20, and so on.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent sequential pattern. Each symbol from a sequential pattern is a positive integer. The value "-1" is used to separate symbols . On each line, the sequential pattern is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern as a number of occurrences. For example, a few lines from the output file from the previous example are shown below:

1 -1 #SUP: 6
3 -1 #SUP: 4
7 -1 #SUP: 3
20 -1 #SUP: 3
1 -1 1 -1 #SUP: 3
1 -1 3 -1 #SUP: 3
1 -1 7 -1 #SUP: 3
3 -1 1 -1 #SUP: 3
3 -1 3 -1 #SUP: 3
7 -1 1 -1 #SUP: 3
7 -1 3 -1 #SUP: 3
1 -1 1 -1 7 -1 #SUP: 3
1 -1 3 -1 1 -1 #SUP: 3
1 -1 7 -1 1 -1 #SUP: 3
1 -1 7 -1 3 -1 #SUP: 3
7 -1 1 -1 3 -1 #SUP: 3
7 -1 3 -1 3 -1 #SUP: 3
1 -1 1 -1 7 -1 1 -1 #SUP: 3
1 -1 1 -1 7 -1 3 -1 #SUP: 3
1 -1 7 -1 1 -1 3 -1 #SUP: 3
1 -1 7 -1 3 -1 3 -1 #SUP: 3
7 -1 1 -1 3 -1 1 -1 #SUP: 3
1 -1 1 -1 7 -1 1 -1 3 -1 #SUP: 3
1 -1 1 -1 7 -1 3 -1 3 -1 #SUP: 3
1 -1 7 -1 1 -1 3 -1 1 -1 #SUP: 3
1 -1 1 -1 7 -1 1 -1 3 -1 1 -1 #SUP: 3

The fifth line indicates that the frequent sequential pattern consisting of the symbol 1 followed by the symbol 1 has a support of 3 (three occurrences in the input sequence).

Optional feature: giving names to symbols

Some users have requested the feature of given names to symbols instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of symbols in your file. For example, consider the example database "contextNOSEP.txt". Here we have modified the file to give names to the symbols:

@CONVERTED_FROM_TEXT
@symbol=1=A
@symbol=3=C
@symbol=7=G
@symbol=20=T
@symbol=-1=|
1 -1 1 -1 7 -1 20 -1 1 -1 3 -1 7 -1 1 -1 3 -1 7 -1 3 -1 1 -1 20 -1 3 -1 20 -1 1 -1 -2

In this file, the first line indicates, that it is a file where names are given to symbols. Then, the second line indicates that the symbol 1 is called "A". The third line indicates that the symbol 3 is called "C". The fourth line indicates that the symbol 7 represents the letter G, and the fifth line indicates that the symbol 20 represents the letter T. Finally, the 6th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define the input sequence for the example

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

A | #SUP: 6
C | #SUP: 4
G | #SUP: 3
T | #SUP: 3
A | A | #SUP: 3
A | C | #SUP: 3
A | G | #SUP: 3
C | A | #SUP: 3
C | C | #SUP: 3
G | A | #SUP: 3
G | C | #SUP: 3
A | A | G | #SUP: 3
A | C | A | #SUP: 3
A | G | A | #SUP: 3
A | G | C | #SUP: 3
G | A | C | #SUP: 3
G | C | C | #SUP: 3
A | A | G | A | #SUP: 3
A | A | G | C | #SUP: 3
A | G | A | C | #SUP: 3
A | G | C | C | #SUP: 3
G | A | C | A | #SUP: 3
A | A | G | A | C | #SUP: 3
A | A | G | C | C | #SUP: 3
A | G | A | C | A | #SUP: 3
A | A | G | A | C | A | #SUP: 3

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided about how to do it.

Applying NOSEP on text file

Note that it is also possible to use a text file containing a text (containing one ore more sentences such as a book) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an symbol. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Performance

NOSEP not only meets the Apriori property but also is a complete algorithm. It employs an effective algorithm to completely calculate the support and also adopts an effective pattern growth approach to effectively reduce the candidate patterns. (see paper for more detail on experimental result)

Where can I get more information about NOSEP?

The NOSEP algorithm is described in this article:

Youxi Wu, Yao Tong, Xingquan Zhu, Xindong Wu: NOSEP: Nonoverlapping Sequence Pattern Mining With Gap Constraints. IEEE Trans. Cybern. 48(10): 2809-2822 (2018)