Hiding Sensitive Association Rules with the FHSAR algorithm (SPMF documentation)

This example explains how to run the FHSAR algorithm using the SPMF open-source data mining library.

How to run this example?

What is FHSAR?

FHSAR is an algorithm for hiding sensitive association rules in a transaction database.

What are the applications? For example, consider a company that want to release a transaction database to the public. But it does not want to disclose some sensitive associations between items that appear in the database and that could give a competitive advantage to their competitor. The FHSAR algorithm can hide these associations by modifying the database.

What is the input?

The FHSAR algorithm is designed to hide sensitive association rules in a transaction database so that they will not be found for a given minsup and minconf threshold generally used by association rule mining algorithms. The input are: minsup (a value in [0,1] that represents a percentage), minconf (a value in [0,1] that represents a percentage), a transaction database and some sensitive association rules to be hidden.

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

An association rule X==>Yis an association between two sets of items X and Y such that X and Y are disjoint. The support of an association rule X==>Yis the number of transactions that contains both X and Y divided by the total number of transactions. The confidence of an association rule X==>Y is the number of transactions that contains both X and Y divided by the number of transactions that contain X. For example, the rule {1 2} ==> {4 5} has a support of 50 % because it appears in 3 transactions out of 5. Furthermore, it has a confidence of 75 % because {1 2} appears in 4 transactions and {1, 2, 4, 5} appears in 3 transactions.

What is the output?

The output is a new transaction database such that the sensitive rules will not be found if an association rule mining algorithm is applied with minsup and minconf.

For example, we can apply FHSAR with the parameters minsup = 0.5 and minconf = 0.60 to hide the following association rules provided in the file "sar.txt":

the result is a new transaction database where these rules are hidden for the given thresholds minsup and minconf:

Transaction id Items
t1 {4, 5}
t2 {3, 5}
t3 {4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

Note that the result of the algorithm is not always the same because I use the HashSet data structure to represent transactions internally and this data structure do not keep the order. Therefore, the items that are removed may not be the same if the algorithm is run twice.

Input file format

This algorithm takes two files as input.

The first file is a text file containing transactions (a transaction database) (e.g. contextIGB.txt). Each line represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

This file contains six lines (six transactions). Consider the first line. It means that the first transaction is the itemset {1, 2, 4, 5}. The following lines follow the same format.

The second file is a text file containing sensitive association rules to be hidden (e.g. sar.txt). Each line is an association rule. First, the rule antecedent is written. It is an itemset, where each item is represented by a positive integer, and each item is separated from the following item by a single space. Note that it is assumed that items within an itemset cannot appear more than once and are sorted according to a total order. Then the keyword " ==> " appears followed by the rule consequent. The consequent is an itemset where each item is represented by a positive integer, and each item is separated from the following item by a single space. For example, consider the file sar.txt.

4 ==> 1
1 2 ==> 4 5
5 ==> 2

This file contains three lines (three association rules). The second line indicates that the rule {1, 2} ==> {4, 5} should be hidden by the FHSAR algorithm.

Output file format

The output file format is defined as follows. It is a text file representing a transaction database. Each line represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, an output file generated by the FHSAR algorithm is:

4 5
3 5
4 5
1 2 3 5
1 2 3 4 5
2 3 4

In this example, the first line represents the transaction {4, 5}. Other lines follow the same format.

Where can I get more information about the FHSAR algorithm?

This algorithm was proposed in this paper:

C.-C.Weng, S.-T. Chen, H.-C. Lo: A Novel Algorithm for Completely Hiding Sensitive Association Rules. ISDA (3) 2008: 202-208

For a good overview of itemset mining and association rule mining, you may read this survey paper.