This example explains how to run the **TKG algorithm** using the SPMF open-source data mining library.

How to run this example?

**If you are using the graphical interface,**(1) choose the**"TKG"**algorithm**,**(2) select the input file**"contextTKG.txt"**, (3) set the output file name (e.g. "output.txt") (4) set minsup to 90% and (5) click "Run algorithm".**If you want to execute this example from the command line**, then execute this command:

java -jar spmf.jar run TKG**contextTKG**.txt output.txt 3 in a folder containing spmf.jar and the example input file contextTKG.txt.**If you are using the source code version of SPMF,**launch the file**"MainTestTKG.java"**in the package**ca.pfv.SPMF.tests**.

What is this algorithm?

TKGis an algorithm for discovering the top-k most frequent subgraphs in a graph database. It was proposed by Fournier-Viger et al. (2019).This algorithm was proposed as an alternative to traditional frequent subgraph mining algorithms such as gSpan that requires to set a minimum support threshold. It is argued that setting the parameter

kis more intuitive since the user can directly select the number of patterns to be found.

What is the input of the algorithm?

The input is a

set oflabeledconnectedgraphsand a numberkof subgraph to be found (a positive integer). Moreover, a few optional parameters can be set, which will be described further down this page.To explain the input more clearly, some definitions are first introduced. A

labeled graphis a set ofverticesandedges, having somelabels. Let’s me illustrate this idea with an example. Consider the following graph:This graph contains

fourvertices(depicted as yellow circles). These vertices have labels such as “10” and “11”. These labels provide information about the vertices. For example, imagine that this graph is a chemical molecule. The label 10 and 11 could represent the chemical elements of Hydrogen and Oxygen, respectively. Labels do not need to be unique. In other words, the same labels may be used to describe several vertices in the same graph. For example, if the above graph represents a chemical molecule, the labels “10” and “11” could be used for all vertices representing Oxygen and Hydrogen, respectively.Now, besides vertices, a graph also contains

edges. The edges are the lines between the vertices, here represented by thick black lines. Edges also have some labels. In this example, four labels are used, which are 20, 21, 22 and 23. These labels represents different types of relationships between vertices. Edge labels do not need to be unique. A graph is aconnected graphif by following the edges, it is possible to go from any vertex to any other vertices.The goal of

frequent subgraph miningis to discoverinteresting subgraph(s) appearing in a set of graphs (a graph database). But how can we judge if a subgraph is interesting? This depends on the application. The interestingness can be defined in various ways. Traditionally, a subgraph has been considered as interesting if it appears multiple times in a set of graphs. In other words, we want to discover subgraphs that are common to multiple graphs. This can be useful for example to find association between chemical elements common to several chemical molecules.The algorithm takes a

graph database as inputand a parameterk.Then, the TKG subgraph mining algorithm enumerate as output thetop-k most frequent subgraphs. For example, let’s consider the following graph database containing three graphs, which is provided in the file contextTKG.txt of the SPMF distribution.This database contains three graphs, respectively called Graph 1, 2 and 3.

What is the output of the algorithm?

The output is the set of all subgraphs that appear in at least

minsuppercent of the graphs of the input graph database, and their support values (number of input graphs containing them). In this example, consider that thekparameter is set to 3 (which means that we want to discover the top-3 most frequent subgraphs). By applying theTKGalgorithm, we obtain the three following subgraphs:Consider the third subgraph (“Frequent subgraph 3”). This subgraph is frequent and is said to have a

support(a frequency) of 3 since it appears in three of the input graphs. These occurrences are highlighted in red, below:In this example "Frequent subgraph 1" and "Frequent subgraph 2" also have a support of 3. Note that the support of a graph can be also expressed as a percentage. Since these subgraphs appear in three input graphs and the input database contains three graphs, the support of these subgraphs expressed as a percentage is 3 / 3 = 1 (which means 100 %).

Note: The TKG can return less thanksubgraphs if there does not existkdifferent subgraphs in the database. Moreover, it is possible that TKG returns more thank subgraphsif many of them have exactly the same support. For example, ifk = 1,the three above subgraphs may still be returned by TKG because they have the same support.

Optional parameters

This implementation of TKG also has four optional parameters:

**maxNumberOfEdges**: the maximum number of edges that frequent subgraphs should contain. This can be used to reduce the search space and find less frequent subgraphs.**outputSingleFrequentVertices**: if this parameter is set to true (by default), then frequent subgraphs containing a single vertex will also be output. Otherwise, not.**outputDotFile:**if this parameter is set to true (by default : false), then a DOT file will be also created as output. A DOT is a GraphViz file that can be used to visualize the frequent subgraphs using GraphViz**outputGraphIds**: if this parameter is set to true (by default: false), then the output file will indicate for each frequent subgraph, the list of input graphs where the subgraph appears.

The optional parameter can be specified in the graphical user interface of SPMF, but also when running SPMF from the command line. For example, to call TKG from the command line with **k = 3, maxNumberOfEdges** = 2, **outputSingleFrequentVertices = true, outputDotFile = false and outputGraphIds = true**, the command can be written as follows:

java -jar spmf.jar run TKG

contextTKG.txt output.txt 3 2 true false true

Input file format

The

input file formatis defined as follows. It is a text file which contains one or more graphs. A graph is defined by a few lines of text that follow the following format:

t # NThis is the first line of a graph. It indicates that this is the N-th graph in the filev M LThis line defines the M-th vertex of the current graph, which has a label Le P Q LThis line defines an edge, which connects theP-thvertex with the Q-th vertex. This edge has the label LFor the above example, the input file is defined as follows:

t # 0

v 0 10

v 1 11

v 2 10

v 3 11

e 0 1 20

e 1 2 23

e 1 3 22t # 1

v 4 10

v 5 11

e 4 5 20t # 2

v 6 10

v 7 10

v 8 11

v 9 11

e 6 7 21

e 7 8 23

e 7 9 20

e 8 9 22

Output file format

The

output file formatis defined as follows. It is a text file, listing all the frequent subgraphs found in the input graph database. A frequent subgraph is defined by a few lines of text that follow the following format:

t # N * ZThis is the first line of a subgraph. It indicates that this is the N-th subgraph in the file and that its support is Z.v M LThis line defines the M-th vertex of the current subgraph, which has a label Le P Q LThis line defines an edge, which connects theP-thvertex with the Q-th vertex. This edge has the label L- x X1 X2 ... This lines lists the identifiers of all the graphs X1, X2 ... that contains the current subgraph.
t # 0 * 3

v 0 10

x 0 1 2t # 1 * 3

v 0 11

x 0 1 2t # 2 * 3

v 0 10

v 1 11

e 0 1 20

x 0 1 2

Performance

This is the original implementation of TKG. It was shown in the paper presenting TKG that it is an efficient algorithm.

Where can I get more information about the algorithm?

This is the article describing the algorithm:

Fournier-Viger, P., Cheng, C., Lin, J. C.-W., Yun, U., Iran, U. (2019).

TKG: Efficient Mining of Top-K Frequent Subgraphs. Proc. of 7th Intern. Conf. on Big Data Analytics (BDA 2019), Spr inger, 20 pages, to appear.

<< Return to table of contents of SPMF documentation

Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.