Questions about Data Mining
(from the Pattern Mining Course)

Question 1: What are the main steps that should be carried out to analyze data?

The seven steps for data analysis are typically:

  1. data cleaning,
  2. data integration,
  3. data selection,
  4. data transformation,
  5. discovering patterns (data mining),
  6. evaluating the patterns,
  7. visualization.
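
The steps above can be sketched as a small pipeline. This is only an illustrative sketch on made-up data: the purchase records, the function names and the min_count threshold are all assumptions, and each step is reduced to its simplest possible form.

```python
from collections import Counter

# Hypothetical toy data: two sources of (customer, item) purchase records.
store_a = [("ann", "bread"), ("bob", "milk"), ("ann", None), ("cat", "bread")]
store_b = [("dan", "bread"), ("eve", "Milk "), ("bob", "eggs")]

def clean(records):
    # 1. Data cleaning: drop records with a missing item.
    return [(c, i) for c, i in records if i is not None]

def integrate(*sources):
    # 2. Data integration: combine several sources into one dataset.
    return [r for s in sources for r in s]

def select(records):
    # 3. Data selection: keep only the attribute relevant to the task (the item).
    return [i for _, i in records]

def transform(items):
    # 4. Data transformation: normalize item names to a common form.
    return [i.strip().lower() for i in items]

def mine(items):
    # 5. Pattern discovery: here, simply count how often each item occurs.
    return Counter(items)

def evaluate(counts, min_count=2):
    # 6. Pattern evaluation: keep items that occur at least min_count times.
    return {item: n for item, n in counts.items() if n >= min_count}

def visualize(patterns):
    # 7. Visualization: a text bar chart, most frequent item first.
    ordered = sorted(patterns.items(), key=lambda p: (-p[1], p[0]))
    return [f"{item:<6}{'#' * n}" for item, n in ordered]

data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data))))
for line in visualize(patterns):
    print(line)
```

In practice each step is far more involved, and the steps are often repeated several times rather than run once in sequence.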

Question 2: Why do we need data mining rather than just analyzing data by hand?

In general, analyzing data by hand is time-consuming and costly. It is also error-prone: we may miss important patterns in the data, and our own biases may distort the analysis.

Data mining is useful because we can analyze potentially very large datasets using tools that are automatic or semi-automatic. We can also find complex patterns or models that would be hard to find or build by hand.

Question 3: What is a data stream? What is the main challenge in analyzing data from a data stream?

A data stream is a continuous, high-speed flow of data that is potentially infinite. An example is data received from a satellite in real time, at high speed and in large volume. A key challenge in processing a data stream is that we may not be able to store the data on the computer, because the stream is potentially infinite. We must therefore design algorithms that analyze the data stream in real time. These algorithms may extract summaries of the data, detect changes, or provide updates on the state of the stream.
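
As a minimal sketch of extracting a summary without storing the stream, the class below keeps a running count, mean and variance in constant memory using Welford's online method. The sensor_stream generator and its parameters are hypothetical stand-ins for a real data source.

```python
import itertools
import random

class RunningSummary:
    """Constant-memory summary of a numeric stream (Welford's online method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Fold one new value into the summary; the value itself is not kept.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of all values seen so far.
        return self._m2 / self.n if self.n else 0.0

def sensor_stream(seed=0):
    # Hypothetical endless source of readings (assumed for illustration).
    rng = random.Random(seed)
    while True:
        yield rng.gauss(20.0, 2.0)

summary = RunningSummary()
# The stream never ends, so we only look at a finite prefix here.
for reading in itertools.islice(sensor_stream(), 10_000):
    summary.update(reading)
print(summary.n, round(summary.mean, 2), round(summary.variance, 2))
```

Each reading is processed once and then discarded, so memory use stays the same no matter how long the stream runs.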

Question 4: The goal of data mining is to find patterns (or build models) that are interesting. What does it mean for a pattern (or model) to be interesting?

The term "interesting" has several meanings. A pattern or model can be interesting if: (1) it is easy to understand, (2) it is valid for new data, (3) it is useful (for example, to make decisions, understand the data, explain the past, or predict the future), and (4) it reveals something novel or unexpected. There are many measures or functions to evaluate a pattern or model produced by data mining. Some are subjective (e.g. how interesting a pattern is to a person), while others are objective (e.g. how much money can be saved by using a given model produced by data mining).
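
As one illustration of an objective measure, the sketch below computes the confidence of a rule of the form "if a transaction contains X, it also contains Y", first on the data where the rule was found and then on new data, to check criterion (2). The rule, the transactions and the choice of confidence as the measure are assumptions made for this example; many other measures exist.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    covered = [t for t in transactions if antecedent <= t]
    if not covered:
        return 0.0  # the rule does not apply to any transaction
    return sum(1 for t in covered if consequent <= t) / len(covered)

# Hypothetical transactions: the data the rule was found in, and data collected later.
old_data = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
new_data = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "eggs"}, {"eggs"}]

rule = ({"bread"}, {"butter"})  # "if bread, then butter"
print("confidence on old data:", confidence(old_data, *rule))
print("confidence on new data:", confidence(new_data, *rule))
```

If the confidence drops sharply on the new data, the rule is probably not valid beyond the data it was extracted from.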