项目名称:分析和序列预测算法开发及应用 / Title: Algorithms, data structures and visualization for sequence analysis and prediction
起止时间:2013.01.01 到 2015.05.31 / Dates: from 2013 to 2015
金额:100,000 $ 加拿大元 = 大约 500,000 人民币 / Amount : 100,000 $ CAD
项目来源: 加拿大国家科学与工程研究委员会/ Source: National Science and Engineering Research Council of Canada
承担任务/角色:主要研究者 (PHILIPPE FOURNIER-VIGER)

Description (English)

This research project is a 5 years research program funded by the main research agency of the national Government of Canada. The main goal of this project was to develop new algorithms for sequence analysis and prediction that improve the speed, memory usage and prediction accuracy, compared to previous best algorithms and models. In this project, we developed several algorithms to analyze data represented as sequences of events or symbols. Some of the highlights of this project are described below. The contributions of this project are mostly fundamental contributions in the field of data mining \ machine learning.

The first major contribution is the proposal of algorithms called CM-SPAM, CM-SPADE and CM-Clasp for identifying all frequent subsequences that are common to a set of discrete sequences (sequence of symbols or events). We have compared the new algorithms with previous state-of-the-art algorithms and found that the proposed algorithms can be up to two orders of magnitude faster, thus greatly reducing the time required for analyzing large sequence databases [2]. Those algorithms have been included in the SPMF data mining software that I have founded [16] and have then been used in more than 100 research projects for various applications such as analyzing sequences of words in text, sequences of locations visited by cars in a city, and sequences of webpages clicked by users. A more detailed example of application is for market basket analysis. By applying the designed algorithms on customer data, we can find frequent sequence of purchases made by several customers. This information can then be very useful for taking business or marketing decisions or for providing personalized product recommendations to customers.

Another contribution of this project is about sequence prediction. Sequence prediction is the task of predicting the next symbol (or event) from a sequence of symbols. Several sequence prediction models have been proposed in the past. However, most of them are lossy, which means that they do not keep all the training information. Because some information is discarded when building these models, they sometimes have a sub-optimal accuracy for predictions. To address this issue, we proposed a compact tree-based model for sequence prediction named CPT [5], which keeps all the information from the training data in memory and stores it in a tree-structure using an indexing mechanism for fast access. Then, for each prediction, the CPT model always use all the information needed from the training data to perform predictions. We made several experiments, which have shown that the CPT model could outperform traditional sequence prediction models. The CPT model was used in several papers and various applications such as webpage prediction, GPS location prediction, and word prediction.

A third major contribution of this project has been the design of algorithms to identify a special type of patterns called “sequential rules” in a set of sequences. Two algorithms called RuleGrowth and TRuleGrowth [1] were designed. These algorithms find rules of the form X à Y indicating that if a set of events X appeared in any order, they were followed by a another set of events Y with a given probability. Finding this type of rules in data is not an easy problem. Thus, we have designed efficient algorithms for this problem using appropriate data structures and search space pruning strategies. We have shown in some experiments that the novel types of rules proposed in this project provided higher accuracy for sequence prediction applications such as webpage prefetching than other types of rules proposed in previous work [1]. The paper presenting the RuleGrowth and TRuleGrowth algorithms has been cited 95 times and used in numerous projects.

Besides, the above highlights, a dozen of other algorithms have been designed as part of this project for analyzing and predicting sequences.

项目简介
本研究项目是一个为期 5 年的研究计划,它被加拿大国家政府主要研究机构的资金所
资助。项目的主要目标是开发新的序列分析和预测算法,这种算法与以前的最佳算法和模型 相比,它具有更高的速度、内存使用率和预测精度。这个项目中,我们开发了几种的算法来 分析以事件或者符号序列来表示的数据。本项目的一些亮点如下所述。本项目的贡献是主要 对数据挖掘和机器学习领域的基础性贡献。
第一个主要贡献是提出了称为 CM-SPAM, CM-SPADE 和 CM-Clasp 的算法,用于识别 一组离散序列(符号或时间序列)的所有通用频率子序列。我们将新算法与之前最为先进的 算法比较,发现我们提出的算法可以快连个数量级,从而大大减少分析大型序列数据库所需 的时间. 这些算法已经在我创建的 SPMF 数据挖掘软件中包含并已经被用于 100 多个项目 研究中,应用与各种的应用,例如分析文本中的单词序列、汽车在城市中的位置序列以及在 用户点击中的网页序列。超市篮子分析是这个应用程序的一个详尽例子。将所设计的算法应 用于客户的数据,我们可以发现客户们的购买频率序列,这些信息可以非常有用的做出业务 和营销推断,为客户提供个性化的产品推荐。
这个项目的另一个贡献是序列预测。序列预测是从一个符号序列中预测下一个符号(或 事件)的任务。以前曾提出过几种序列预测模型。但是,大多数都是有缺陷的,这意味着它 们不会保存所有的训练信息。因为在建立这些模型时,一些信息会被丢弃,所以它们有时对 预测的准确性不是最佳的。为了解决这个问题,我们提出了一种紧凑的基于树的序列预测模 型,名为 CPT,它将训练数据中的所有信息保存在内存中,并使用索引机制将其存储在树结 构中,以便快速访问。然后,对于每一个预测,CPT 模型总是使用来自训练数据的所有信息 来进行预测。实验表明,CPT 模型优于传统的序列预测模型。CPT 模型在多篇论文和各种应 用中得到了应用,如网页预测、GPS 位置预测和单词预测。
这个项目的第三个主要贡献是设计了识别一组序列中称为“序列规则”的特殊模式类型 的算法。设计名为RuleGrowth 和TRuleGrowth 算法。这些算法找到形式为X à Y 的规则, 这表明,如果一组事件 X 以任意的顺序出现并给定一个概率,那么这些算法就可以得到另 一组事件Y。在数据中找到这种类型的规则并不是一个容易的问题。因此,我们利用适当的 数据结构和搜索空间剪枝策略来设计了有效的算法。我们在一些实验中已经表明,在这个项 目中提出的新类型规则在网页预取等序列预测应用中,比在之前的工作中提出的其他类型规 则提供了更高的准确性。提出RuleGrowth 和TRuleGrowth 算法的论文被引用95 次,并在众 多项目中得到应用。
除此之外,本课题还设计了十余种序列分析和预测算法。