Mining Frequent Itemsets from Uncertain Data with the UApriori Algorithm (SPMF documentation)

This example explains how to run the UApriori algorithm using the SPMF open-source data mining library.

How to run this example?

What is UApriori?

UApriori is an algorithm for mining frequent itemsets from a transaction database where the data is uncertain (contains probabilities). The UApriori algorithm was proposed by Chui et al. (2007).

This algorithm can have multiple applications such as in mining medical data or sensor data where observations may be uncertain.

What is the input?

UApriori takes as input a transaction database containing probabilities and a minimum expected support threshold (a value between 0 and 1). A transaction database is a set of transactions where each transaction is a set of items. In UApriori, we assume that each item in a transaction is annotated with an existential probability. For example, let's consider the following transaction database, consisting of 4 transactions (t1,t2,t3,t4) and 5 items (1,2,3,4,5). The transaction t1 contains item 1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:


1 2 3 4 5
t1 0.5 0.4
0.3 0.7
t2
0.5 0.4
0.4
t3 0.6 0.5
0.1 0.5
t4 0.7 0.4 0.3
0.9

What is the output?

The output of UApriori is the set of frequent itemsets. Note that the definition of a frequent itemset is here different from the definition used by the regular Apriori algorithm because we have to consider the existential probabilities.

The expected support of an itemset in a transaction is defined as the product of the existential probability of each item from the itemset in this transaction. It is a value between 0 and 1. For example, the expected support of itemset {1, 2} in transaction t1 is 0.5 x 0.4 = 0.2. The expected support of an itemset in a transaction database is the sum of its expected support in all transactions where it occurs. For example, the expected support of itemset {2, 3} is the sum of its expected support in t2 and t4 : 0.5 x 0.4 + 0.4 x 0.3 = 0.32. A frequent itemset is an itemset that has an expected support higher or equal to the minimum expected support set by the user. For example, by running UApriori with a minimum expected support of 0.10, we obtain 19 frequent itemsets, including:

itemsets expected support
{2 3 5} 0.188
{1 3 5} 0.189
{1 4 5} 0.135
{2 4 5} 0.109
{1 2 5} 0.542
{1 5} 1.28
{1 3} 0.21
{1 4} 0.21
{2 3} 0.32
{1 2} 0.78
... ...

Input file format

The input file format of UApriori is defined as follows. It is a text file. An item is represented by a positive integer. Each item is associated with a probability indicated as a double value between parentheses. A transaction is a line in the text file. In each line (transaction), each item is immediately followed by its probability between parentheses and a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line. Probabilities should be greater than 0 and not more than 1.

For example, for the previous example, the input file is defined as follows:

# This binary context contains uncertain data.
# Each line represents a transaction.
# For each item there is an existential probability.
1(0.5) 2(0.4) 4(0.3) 5(0.7)
2(0.5) 3(0.4) 5(0.4)
1(0.6) 2(0.5) 4(0.1) 5(0.5)
1(0.7) 2(0.4) 3(0.3) 5(0.9)

The first line represents the transaction {1, 2, 4, 5} where items 1, 2, 4 and 5 respectively have the existential probabilities 0.5, 0.4, 0.3 and 0.7.

Output file format

The output file format of UApriori is defined as follows. It is a text file where each line represents a frequent itemset. On each line, the items of the itemset are listed as space-separated integers, followed by the keyword "#SUP:" and a double value indicating the total expected support of the itemset across the database. For example, the output file for this example is shown below.

1 #SUP: 1.8
2 #SUP: 1.7999999999999998
3 #SUP: 0.7
4 #SUP: 0.4
5 #SUP: 2.5
1 2 #SUP: 0.78
1 3 #SUP: 0.21
1 4 #SUP: 0.21
1 5 #SUP: 1.2799999999999998
2 3 #SUP: 0.32
2 4 #SUP: 0.16999999999999998
2 5 #SUP: 1.09
3 5 #SUP: 0.43000000000000005
4 5 #SUP: 0.26
1 2 5 #SUP: 0.542
1 3 5 #SUP: 0.189
1 4 5 #SUP: 0.135
2 3 5 #SUP: 0.188
2 4 5 #SUP: 0.10899999999999999

For example, the last line indicates that the itemset {2, 4, 5} has a total expected support of approximately 0.109.

Optional parameter(s): constraints on the size of itemsets

Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify an optional parameter in the user interface of SPMF:

If you are using the command line interface of SPMF, it is also possible to use this optional parameter by adding it at the end of the command. For example:
java -jar spmf.jar run UApriori contextUncertain.txt output.txt 10% 2
means to run the above example to find only frequent itemsets having 2 items or less.

Performance

UApriori is not the most efficient algorithm for uncertain itemset mining but it is simple and it is the first algorithm designed for this task.

Where can I get more information about the UApriori algorithm?

Here is the article describing the UApriori algorithm:

C. Kit Chui, B. Kao, E. Hung: Mining Frequent Itemsets from Uncertain Data. PAKDD 2007: 47-58

For a good overview of frequent itemset mining algorithms, you may read this survey paper.