View a numeric cost sequence database file with the Cost Sequence Database Viewer (SPMF documentation)
Sequence databases with numeric cost information are a type of data taken as input by several data mining algorithms offered in SPMF such as CEPN.
SPMF offers a tool to view the content of a sequence database with numeric cost information. This tool is called the SPMF Cost Sequence Database Viewer.
This page explains how to use this tool with an example.
How to run this example?
- If you want to run this example from the graphical interface of SPMF, (1) choose the algorithm "Open_sequence_database_cost_numeric_utility_file_with_viewer", (2) choose the example_CEPN.txt file as input, and then (3) click "run algorithm".
- If you want to run this example from the source code of SPMF, run the file MainTestSequenceCostUtilityNumericViewer, which is located in the package ca.pfv.spmf.tests.
- If you want to execute this example from the command line interface of SPMF, then execute this command:
java -jar spmf.jar run Open_sequence_database_cost_numeric_utility_file_with_viewer example_CEPN.txt
in a folder containing spmf.jar and the file example_CEPN.txt which is included with SPMF.
What is displayed?
After running the example, the content of the file will be displayed by the tool. The picture below shows the user interface of this viewer.
The window A) shown in the picture below is the main window. It displays the cost sequence database using a table. The table has five rows in this example. Each row is a sequence from the sequence database.
Take the first row as an example. The cell in the first column of the first row indicates that the ID of this sequence is 0. The cell in the second column indicates that the first itemset of that sequence contains the item 1 with a cost value of 20. The cell in the third column indicates that the second itemset of that sequence contains the item 2 with a cost value of 40. The fourth cell in that row indicates that the third itemset contains the item 3 with a cost value of 50. The fifth cell indicates that the fourth itemset contains the item 4 with a cost value of 20. Lastly, the last column indicates that the utility of this whole sequence is 80.
The other sequences follow the same format.
This table view is useful for understanding the content of a cost sequence database file.
Besides, there are three buttons that provide additional features:
- By clicking on the button "View sequence length distribution", a new window is opened, presented as window B) in the picture below. This window displays the frequency histogram of the different sequence lengths in the current file. The sequence lengths are on the X axis and the number of sequences of each length is on the Y axis. There are some buttons in this window to export the data from the frequency histogram as a CSV file so that it can be imported in other software (e.g. Excel), or as a picture. In addition, some options are provided to adjust the bar width and the order in which the X axis is sorted.
- By clicking on the button "View item frequency distribution", a new window is opened, presented as window C) in the picture below. This window displays the frequency of the different items in the current file. The different items are displayed on the X axis and their frequencies (support or number of occurrences) are presented on the Y axis. There are some buttons in this window to export the data from the frequency histogram as a CSV file so that it can be imported in other software (e.g. Excel), or as a picture. In addition, some options are provided to adjust the bar width and the order in which the X axis is sorted.
- By clicking on the button "View item cost distribution", a new window is opened, presented as window D) in the picture below. This window displays the cost values of the different items in the current file. The different items are presented on the X axis and their cost values are displayed on the Y axis. There are some buttons in this window to export the displayed data as a CSV file so that it can be imported in other software (e.g. Excel), or as a picture. In addition, some options are provided to adjust the bar width and the order in which the X axis is sorted.
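The three distributions described above can also be approximated outside the viewer. The following is a minimal Java sketch, not the viewer's own code: the class CostDatabaseStats and its methods are hypothetical names introduced for illustration. It assumes that the length of a sequence is its number of events, that item frequency means support (the number of sequences containing the item), and that costs are summed per item. The viewer may aggregate costs differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper (not part of SPMF) that summarizes a cost sequence database. */
class CostDatabaseStats {

    /** The five lines of example_CEPN.txt as listed in this documentation. */
    static final List<String> EXAMPLE = List.of(
        "1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80",
        "2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60",
        "1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50",
        "1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40",
        "2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70");

    /** Returns the events of one line as {item, cost} pairs; -1, -2 and SUtility are skipped. */
    static List<int[]> events(String line) {
        List<int[]> result = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            int open = token.indexOf('[');
            if (open > 0 && token.endsWith("]")) {
                int item = Integer.parseInt(token.substring(0, open));
                int cost = Integer.parseInt(token.substring(open + 1, token.length() - 1));
                result.add(new int[] {item, cost});
            }
        }
        return result;
    }

    /** Maps each sequence length to the number of sequences having that length. */
    static Map<Integer, Integer> lengthDistribution(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(events(line).size(), 1, Integer::sum);
        }
        return counts;
    }

    /** Maps each item to its support: the number of sequences that contain it. */
    static Map<Integer, Integer> itemSupport(List<String> lines) {
        Map<Integer, Integer> support = new TreeMap<>();
        for (String line : lines) {
            Set<Integer> seen = new HashSet<>();
            for (int[] event : events(line)) {
                if (seen.add(event[0])) {
                    support.merge(event[0], 1, Integer::sum);
                }
            }
        }
        return support;
    }

    /** Maps each item to the sum of its costs over all occurrences (assumed aggregation). */
    static Map<Integer, Integer> totalCost(List<String> lines) {
        Map<Integer, Integer> total = new TreeMap<>();
        for (String line : lines) {
            for (int[] event : events(line)) {
                total.merge(event[0], event[1], Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Sequence lengths: " + lengthDistribution(EXAMPLE));
        System.out.println("Item support:     " + itemSupport(EXAMPLE));
        System.out.println("Item total cost:  " + totalCost(EXAMPLE));
    }
}
```

For the example file, two sequences have length 4 and three have length 3, and item 2 is the only item that appears in all five sequences.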
What is the input?
The tool takes as input a cost-event sequence database, as used by algorithms such as CEPN.
The database used in this example is provided in the text file "example_CEPN.txt" in the package ca.pfv.spmf.tests of the SPMF distribution, which follows the file format for CEPN.
In that format, a sequence database contains multiple sequences, and each sequence is an ordered list of events. Each event has a cost value (a positive integer), and each sequence has a utility value (a numeric value such that a high value indicates a better outcome or result).
The file example_CEPN.txt contains the following content:
1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80
2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60
1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50
1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40
2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70
Each line is a sequence. Moreover, each sequence (line) is an ordered list of events separated by -1.
An event is represented by a positive integer and is followed by a cost value (e.g. time spent on the event) indicated between square brackets [ ]. A cost is a positive integer.
The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by a positive integer which indicates how good the outcome of this sequence is (e.g. it could represent a final exam score).
For example, the first line indicates that event "1" had a cost of 20, was followed by event "2" with a cost of 40, followed by event "3" with a cost of 50, followed by event "4" with a cost of 20. Moreover, this sequence has a utility of 80, which means a quite good outcome (compared to other sequences in this database). The other sequences follow the same format.
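The format rules above can be illustrated with a small reader. The following is a minimal Java sketch under stated assumptions, not SPMF's own parser: the class CostSequenceParser and its nested types are hypothetical names, and the code assumes well-formed lines with one event per token, as in example_CEPN.txt.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader (not part of SPMF) for one line of a cost sequence file. */
class CostSequenceParser {

    /** One event and its cost, e.g. the token "2[40]" is item 2 with cost 40. */
    static final class Event {
        final int item;
        final int cost;

        Event(int item, int cost) {
            this.item = item;
            this.cost = cost;
        }
    }

    /** One sequence: its ordered events and the utility given after "SUtility:". */
    static final class CostSequence {
        final List<Event> events;
        final int utility;

        CostSequence(List<Event> events, int utility) {
            this.events = events;
            this.utility = utility;
        }
    }

    /** Parses one line; assumes the line is well formed and does no validation. */
    static CostSequence parseLine(String line) {
        final String utilityKeyword = "SUtility:";
        List<Event> events = new ArrayList<>();
        int utility = -1; // stays -1 if the line has no SUtility token
        for (String token : line.trim().split("\\s+")) {
            if (token.equals("-1") || token.equals("-2")) {
                continue; // -1 separates events, -2 marks the end of the sequence
            }
            if (token.startsWith(utilityKeyword)) {
                utility = Integer.parseInt(token.substring(utilityKeyword.length()));
            } else {
                int open = token.indexOf('[');
                int close = token.indexOf(']');
                int item = Integer.parseInt(token.substring(0, open));
                int cost = Integer.parseInt(token.substring(open + 1, close));
                events.add(new Event(item, cost));
            }
        }
        return new CostSequence(events, utility);
    }

    public static void main(String[] args) {
        CostSequence first = parseLine("1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80");
        for (Event event : first.events) {
            System.out.println("event " + event.item + " with cost " + event.cost);
        }
        System.out.println("utility " + first.utility);
    }
}
```

Running the sketch on the first line of the example lists events 1, 2, 3 and 4 with costs 20, 40, 50 and 20, followed by the utility 80, which matches the description above.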
This database could, for example, represent sequences of learning activities made by learners, where the events 1, 2, 3, 4 and 5 are learning activities, the cost values are the time spent on each learning activity, and the utility is the grade obtained at a final exam.