# Calculate Statistics for a Sequence Database with Cost and Binary Utility Information (SPMF documentation)

This example explains how to calculate statistics for a sequence database with cost and binary utility information using the SPMF open-source data mining library.

## How to run this example?

• If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_sequence_database_with_cost_binary_utility" algorithm, (2) choose the input file example_CorCEPB.txt, and (3) click "Run algorithm".
• If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the input file example_CorCEPB.txt:
java -jar spmf.jar run Calculate_stats_for_a_sequence_database_with_cost_binary_utility example_CorCEPB.txt no_output_file
• If you are using the source code version of SPMF, launch the file "MainTestSequenceDBCostUtilityStats.java" in the package ca.pfv.SPMF.tests.

## What is this tool?

This tool generates statistics about a sequence database with cost and binary utility information, the kind of database used by algorithms such as CEPB and CorCEPB.

## What is the input?

The input is a sequence database where each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value (a boolean value indicating, for example, a good or bad outcome). Moreover, the user must set three parameters: (1) a minimum support threshold minsup (a positive integer), (2) a maximum cost threshold maxcost (a positive integer), and (3) a minimum occupancy threshold minoccupancy (a value in the [0,1] interval).

For example, consider the following sequence database, which is provided in the file example_CorCEPB.txt of the SPMF distribution:

1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1
2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0
1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1
1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0
2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1

This database contains five lines. Each line is a sequence.

Moreover, each sequence (line) is an ordered list of events separated by -1.

An event is represented by a positive integer, followed by a cost value (e.g. the time spent on the event) given between square brackets [ ]. A cost is a positive integer.

The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by 0 or 1, which respectively represent a negative or positive outcome.

For example, the first line indicates that event "1" had a cost of 2, was followed by event "2" with a cost of 4, followed by event "3" with a cost of 9, followed by event "4" with a cost of 2. Moreover, this sequence has a utility of 1, which means a positive outcome. The other sequences follow the same format.
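The format described above can be read with a few lines of Java. The following is only a minimal sketch, not the parser used inside SPMF: the class name CostUtilitySequenceParser, the nested Sequence type and the parseLine method are hypothetical names chosen for illustration, and the code assumes one event per token of the form event[cost], as in the example file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; this is not an SPMF class.
final class CostUtilitySequenceParser {

    /** One parsed line: parallel lists of events and costs, plus the binary utility. */
    static final class Sequence {
        final List<Integer> events = new ArrayList<>();
        final List<Integer> costs = new ArrayList<>();
        boolean positiveOutcome;
    }

    /** Parses one line such as "1[2] -1 2[4] -1 -2 SUtility:1". */
    static Sequence parseLine(String line) {
        Sequence seq = new Sequence();
        for (String token : line.trim().split("\\s+")) {
            if (token.equals("-1") || token.equals("-2")) {
                continue; // -1 separates events, -2 marks the end of the sequence
            }
            if (token.startsWith("SUtility:")) {
                // 1 means a positive outcome, 0 a negative one
                seq.positiveOutcome = token.substring("SUtility:".length()).equals("1");
                continue;
            }
            // Any remaining token is assumed to have the form event[cost], e.g. "3[9]"
            int open = token.indexOf('[');
            int close = token.indexOf(']');
            seq.events.add(Integer.parseInt(token.substring(0, open)));
            seq.costs.add(Integer.parseInt(token.substring(open + 1, close)));
        }
        return seq;
    }

    public static void main(String[] args) {
        Sequence s = parseLine("1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1");
        // prints: [1, 2, 3, 4] [2, 4, 9, 2] true
        System.out.println(s.events + " " + s.costs + " " + s.positiveOutcome);
    }
}
```

Keeping events and costs in parallel lists preserves the order of the sequence, which matters because the database is sequential.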

This database could, for example, represent sequences of learning activities carried out by learners, where the events 1, 2, 3, 4 and 5 are learning activities, the cost values are the time spent on each learning activity, and the utility indicates whether the learner passed or failed an exam.

## What is the output?

The output consists of statistics about the database. For example, applying the tool to the example database given above yields the following statistics:

============ SEQUENCE COST UTILITY DATABASE STATS ==========
File size (MB): 0.00
Number of sequences : 5
Max item: 5
Average number of itemsets per sequence : 3.4 standard deviation: 0.4898979485566356 variance: 0.24
Average number of items per itemset : 1.0 standard deviation: 0.0 variance: 0.0
Average cost per item: 4.470588235294118 standard deviation: 3.2560829419054236 variance: 10.602076124567478
Average cost per sequence: 15.2 standard deviation: 5.670978751503131 variance: 32.160000000000004
Average utility per sequence: 0.6 standard deviation: 0.48989794855663565 variance: 0.24000000000000005
=========================================================
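The cost and utility figures above can be reproduced by hand. The sketch below is an independent recomputation, not the SPMF implementation: the class CostUtilityStatsSketch and all of its methods are hypothetical names. It assumes the variance is the population variance (dividing by the number of values), which is consistent with the values in the listing; for instance, the five sequence costs are 17, 24, 17, 9 and 9, giving a mean of 15.2 and a variance of 32.16.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration; this is not an SPMF class.
final class CostUtilityStatsSketch {

    // The five sequences of example_CorCEPB.txt, copied from this page.
    static final String[] EXAMPLE = {
        "1[2] -1 2[4] -1 3[9] -1 4[2] -1 -2 SUtility:1",
        "2[1] -1 4[12] -1 3[10] -1 5[1] -1 -2 SUtility:0",
        "1[5] -1 5[4] -1 2[8] -1 -2 SUtility:1",
        "1[3] -1 2[5] -1 4[1] -1 -2 SUtility:0",
        "2[3] -1 5[4] -1 3[2] -1 -2 SUtility:1"
    };

    private static final Pattern COST = Pattern.compile("\\[(\\d+)\\]");

    /** Extracts every cost written between square brackets on one line. */
    static List<Double> costsOf(String line) {
        List<Double> costs = new ArrayList<>();
        Matcher m = COST.matcher(line);
        while (m.find()) costs.add(Double.parseDouble(m.group(1)));
        return costs;
    }

    /** All event costs of the database, in reading order. */
    static List<Double> itemCosts(String[] db) {
        List<Double> all = new ArrayList<>();
        for (String line : db) all.addAll(costsOf(line));
        return all;
    }

    /** The total cost of each sequence. */
    static List<Double> sequenceCosts(String[] db) {
        List<Double> totals = new ArrayList<>();
        for (String line : db) {
            double total = 0;
            for (double c : costsOf(line)) total += c;
            totals.add(total);
        }
        return totals;
    }

    /** The utility of each sequence as 1.0 (positive) or 0.0 (negative). */
    static List<Double> utilities(String[] db) {
        List<Double> values = new ArrayList<>();
        for (String line : db) values.add(line.trim().endsWith("SUtility:1") ? 1.0 : 0.0);
        return values;
    }

    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    /** Population variance: divides by n, not by n - 1 (an assumption, see above). */
    static double variance(List<Double> values) {
        double m = mean(values);
        double sumSq = 0;
        for (double v : values) sumSq += (v - m) * (v - m);
        return sumSq / values.size();
    }

    static void report(String label, List<Double> values) {
        double var = variance(values);
        System.out.println(label + ": " + mean(values)
                + " standard deviation: " + Math.sqrt(var) + " variance: " + var);
    }

    public static void main(String[] args) {
        report("Average cost per item", itemCosts(EXAMPLE));
        report("Average cost per sequence", sequenceCosts(EXAMPLE));
        report("Average utility per sequence", utilities(EXAMPLE));
    }
}
```

The printed values should agree with the listing above up to floating-point rounding in the last digits, since the order of the arithmetic operations may differ from the one used by the tool.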