**项目名称**：分析和序列预测算法开发及应用 / **Title**: Algorithms, data structures and visualization for sequence analysis and prediction

**起止时间**：2016.03 – 2021.03 / **Dates**: 2016.03 – 2021.03

**金额**：4,000,000 人民币 / **Amount**: 4,000,000 RMB

**项目来源**：国家自然科学基金，哈尔滨工业大学（深圳） / **Source**: National Natural Science Foundation of China, Harbin Institute of Technology (Shenzhen)

**承担任务/角色**：主要研究者（PHILIPPE FOURNIER-VIGER） / **Role**: Principal investigator (PHILIPPE FOURNIER-VIGER)

**Description (English)**

This project continues our previous work. It is an ongoing fundamental research project that focuses on developing novel techniques for analyzing data. As in our previous work, the project was originally proposed for sequence analysis and sequence prediction. However, in the last two years, we have extended its scope to other types of data such as graphs, dynamic graphs and customer transaction data.

The first part of this project is about sequence analysis. In previous research, we developed algorithms to identify various types of interesting patterns in sequences, such as sequential patterns and sequential rules. However, the type of data considered by these algorithms is still quite simple. To broaden their applicability, I have recently adapted some of these algorithms to handle more complex data, such as data where each symbol can have weights and values. This is useful in several applications, for example to represent customer shopping data that contains information about the profit yielded by the sale of products. I have proposed several algorithms that analyze sequences and transactions with weights and values to find interesting patterns. Finding patterns by considering weights and values is known as “high-utility pattern mining”. It has several applications, such as finding the sequences of purchases that yield the highest profit in a database of customer transactions. On this topic, I have made major contributions by proposing several algorithms for identifying novel types of patterns in data, as well as more efficient algorithms for existing data mining tasks. Moreover, I have organized three workshops on that topic (UDM 2018 at KDD 2018, UDML 2019 at ICDM 2019, UDML 2020 at ICDM 2020), edited a book, and organized a special issue in the IEEE Access journal.
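To make the notion of utility concrete, the following is a minimal sketch in Python. It assumes a toy transaction database in which the item names, purchase quantities and unit profits are all hypothetical, and it enumerates itemsets by brute force. It is meant only to illustrate the definition and is not one of the algorithms proposed in the project.

```python
from itertools import combinations

# Hypothetical unit profit of each item (illustrative values only).
UNIT_PROFIT = {"bread": 1, "milk": 2, "wine": 10}

# Hypothetical transactions: item -> purchased quantity.
TRANSACTIONS = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "wine": 1},
    {"bread": 3, "milk": 2, "wine": 1},
]


def utility(itemset, transactions, unit_profit):
    """Total profit of the itemset over the transactions that contain all its items."""
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Brute-force search: keep every itemset whose utility reaches min_utility."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset, transactions, unit_profit)
            if u >= min_utility:
                result[itemset] = u
    return result


if __name__ == "__main__":
    for itemset, u in high_utility_itemsets(TRANSACTIONS, UNIT_PROFIT, 15).items():
        print(itemset, u)
```

In this toy data, the itemset (milk, wine) has a utility below 15 while its superset (bread, milk, wine) reaches 17. Utility is therefore not monotonic with respect to itemset inclusion, which is one reason why the pruning strategies used for frequency-based mining cannot be applied directly and dedicated algorithms are needed.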

The second part of this project is about proposing novel techniques for visualizing patterns found in data. We are currently developing visualization tools that provide a relevant and summarized view of a set of patterns found in sequences. The aim is to let the user easily navigate through a potentially very large set of rules and rapidly identify interesting and relevant rules according to various criteria (support, confidence, length, time constraints, window size, etc.). To determine which visualizations and operations are appropriate for exploring rules, we are drawing inspiration from work on the visualization of sequential patterns, itemsets and association rules. The results of this part of the project are expected to be published in the coming year.

In the third part of this project, we are working on providing users with a summarized view of the patterns found in data. For this, we have designed algorithms for efficiently mining a compact and lossless set of patterns, so that only a small set of informative patterns is discovered and presented to the user.
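As an illustration of what a compact and lossless set of patterns can look like, the sketch below uses closed itemsets, a well-known representation of this kind. This choice is an assumption made for the example: the description above does not state which representation the project relies on. The toy transactions and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions (illustrative only).
TRANSACTIONS = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset, transactions)
            if s >= min_support:
                found[itemset] = s
    return found


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def recover_support(itemset, closed):
    """Support of a frequent itemset = largest support among its closed supersets."""
    return max(s for y, s in closed.items() if itemset <= y)


if __name__ == "__main__":
    frequent = frequent_itemsets(TRANSACTIONS, min_support=2)
    closed = closed_itemsets(frequent)
    print(len(frequent), "frequent itemsets,", len(closed), "closed itemsets")
```

On this toy data, four closed itemsets summarize seven frequent itemsets, and `recover_support` returns the exact support of each of the seven, which is what makes the representation lossless.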

In this project, we have also recently started developing algorithms to analyze other complex data types, such as dynamic attributed graphs and sequences with cost information. We have also designed algorithms to discover special types of patterns in data, such as periodic patterns (patterns that appear regularly in the data) and local and peak patterns (patterns that are important during specific time intervals rather than in the whole database).
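The idea of a periodic pattern can be sketched as follows. The example assumes a definition that is common in the literature, where the periods of a pattern are the gaps between its consecutive occurrences (including the gaps to the start and end of the database) and the pattern is periodic if its largest period does not exceed a user-defined threshold. The description above does not specify the exact definition used in the project, and the data and names below are hypothetical.

```python
# Hypothetical sequence of transactions, ordered by time (illustrative only).
TRANSACTIONS = [
    {"a", "b"},
    {"c"},
    {"a", "b", "c"},
    {"b"},
    {"a", "b"},
    {"a", "c"},
]


def periods(itemset, transactions):
    """Gaps between consecutive occurrences, counting the database start and end."""
    positions = [i for i, t in enumerate(transactions, start=1) if itemset <= t]
    boundaries = [0] + positions + [len(transactions)]
    return [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]


def is_periodic(itemset, transactions, max_periodicity):
    """True if the pattern never disappears for more than max_periodicity transactions."""
    return max(periods(itemset, transactions)) <= max_periodicity


if __name__ == "__main__":
    for pattern in ({"a"}, {"c"}):
        print(sorted(pattern), periods(pattern, TRANSACTIONS),
              is_periodic(pattern, TRANSACTIONS, max_periodicity=2))
```

Here {a} reappears at least every two transactions, while {c} is absent for three transactions in a row and is therefore not periodic under a threshold of 2.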

**项目简介**

该研究项目是先前工作的一个后续。该项目致力于开发用于分析数据的新技术，是一个正在进行的基础性研究。正如前面的工作任务，它是专门为分析序列和进行序列预测而提出的项目。但在过去的两年中，我们将项目范围扩大到其他类型的数据，例如图形、动态图形和客户交易数据等。

该项目的第一部分是有关序列分析的。在先前的研究中，我们已经开发了相应算法来识别序列中各种类型的模式，例如序列模式和序列规则，但这些算法考虑的数据类型仍然非常简单。为了拓宽这些算法的适用范围，我近期提出了一种改进算法以考虑更复杂的数据类型，例如每个符号可以具有权重和值的数据。该算法能够应用于一些应用中，例如代表包含有关产品销售利润信息的客户购物数据。我提出了几种算法来分析具有权重和值的序列和交易来查找模式，通过考虑权重和值来查找模式被称为“高效用模式挖掘”。它具有多个应用方向，例如在客户交易数据库中查找产生最高利润的购买顺序。在该项目上，我通过提出几种用于识别数据中新型模式的算法做出了重大贡献，并且为现有的数据挖掘任务提供了一些更有效的算法。此外，我还组织了关于该主题的三个研讨会（UDM 2018 at KDD 2018，UDML 2019 at ICDM 2019，UDML 2020 at ICDM 2020），并编辑了一本书，还在 IEEE Access 期刊中组织了特刊。

该项目的第二部分是关于提出一些新颖的技术来可视化数据中的模式。我们目前正在开发可视化工具，这些工具可以提供序列中找到的一组模式的相关摘要视图。目的是让用户轻松浏览可能非常大的一组规则，以根据各种标准（支持度，置信度，长度，时间限制，窗口大小等）快速识别有趣且相关的规则。为了确定适合进行规则探索的可视化和操作，我们从序列模式，项集和关联规则可视化的工作中汲取灵感。有关该项目这一部分的结果预计将在明年发布。

在该项目的第三部分中，我们正在努力为用户提供数据中发现的模式的摘要视图。为此，我们设计了一种算法来有效地挖掘紧凑而无损的模式集，从而仅一小部分有用信息的模式被发现并呈现给用户。

在这个项目中，我们最近还考虑开发算法来分析其他复杂数据类型，例如动态属性图和带有成本信息的序列。我们还设计了算法，以发现数据中模式的特殊类型，例如周期性模式（某些规律性地出现在数据中的模式）以及局部和峰值模式（在某些特定时间间隔内而不是在整个数据集上有重要意义的模式）。