
Data Mining:

Concepts and Techniques


Jiawei Han, Micheline Kamber,

Chapter 1. Introduction

Why Data Mining?
What Is Data Mining?
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data Mining: On What Kinds of Data?
Classifications of Data Mining Systems
Major Challenges in Data Mining

Data Mining: Concepts and Techniques

April 12, 2012

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!


"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Evolution of Sciences

Before 1600: empirical science

1600-1950s: theoretical science

Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s: computational science

Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now: data science

The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Evolution of Database Technology

1960s:

Data collection, database creation, and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:

Data mining, data warehousing, multimedia databases, and Web databases

2000s:

Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems

Chapter 1. Introduction

Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data Mining: On What Kinds of Data?
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Structure and Network Analysis
Evaluation of Knowledge
Applications of Data Mining
Major Challenges in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary

What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Is everything data mining? No. The following, for example, are not:

Simple search and query processing
(Deductive) expert systems

Knowledge Discovery (KDD) Process

This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection & transformation → Task-relevant Data → Data Mining → Pattern Evaluation]

The KDD process, of which data mining is one step, generally involves:

Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
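
The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a prescribed implementation: the record layout (`customer`, `item`, `region`), the two store lists, and every function name are hypothetical, and the "mining" step is reduced to counting how many baskets contain each item.

```python
from collections import Counter

REQUIRED = ("customer", "item", "region")  # hypothetical record fields

def clean(records):
    """Data cleaning: drop records with a missing field."""
    return [r for r in records if all(r.get(k) for k in REQUIRED)]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, region):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]

def transform(records):
    """Data transformation: one lower-cased item set (basket) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"].lower())
    return baskets

def mine(baskets):
    """Data mining (toy version): count the baskets that contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep items whose support meets the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Knowledge presentation: format the surviving patterns for a reader."""
    return [f"{item}: {support:.0%}" for item, support in sorted(patterns.items())]

# Made-up input data from two hypothetical sources.
store_a = [
    {"customer": "c1", "item": "Milk", "region": "north"},
    {"customer": "c1", "item": "Bread", "region": "north"},
    {"customer": "c2", "item": "milk", "region": "north"},
    {"customer": None, "item": "Eggs", "region": "north"},   # removed by cleaning
]
store_b = [
    {"customer": "c3", "item": "Milk", "region": "north"},
    {"customer": "c3", "item": "Eggs", "region": "north"},
    {"customer": "c4", "item": "Bread", "region": "south"},  # removed by selection
]

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, "north"))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
```

Each function maps to one step in the list, so replacing the toy `mine` with a real mining algorithm would leave the rest of the chain unchanged.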

Data Mining in Business Intelligence


Layers, from highest to lowest potential to support business decisions:

Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from the top layers to the bottom: End User, Business Analyst, Data Analyst, DBA

KDD Process

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

This is a view from typical machine learning and statistics communities
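
As a rough illustration of why this view puts pre-processing ahead of mining, the sketch below applies min-max normalization before a one-nearest-neighbor classifier. The feature values, the "low"/"high" labels, and the function names are all made up for the example; it assumes Python 3.8+ (for `math.dist`) and that every column has at least two distinct values.

```python
import math

def min_max_normalize(rows):
    """Pre-processing: rescale each column to [0, 1].

    Returns the scaled rows and a function that scales new points the same way.
    Assumes every column has at least two distinct values (no division by zero).
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))

    return [scale(r) for r in rows], scale

def nearest_neighbor(train_rows, labels, query):
    """Data mining (classification): label of the closest training row."""
    distances = [math.dist(row, query) for row in train_rows]
    return labels[distances.index(min(distances))]

# Hypothetical training data: (age, yearly income) with a made-up class label.
features = [(25, 30000), (30, 32000), (50, 90000), (55, 95000)]
labels = ["low", "low", "high", "high"]
query = (52, 60000)

# Without normalization the income column dominates the distance.
raw_prediction = nearest_neighbor(features, labels, query)

# With normalization both columns contribute on the same scale.
scaled, scale = min_max_normalize(features)
scaled_prediction = nearest_neighbor(scaled, labels, scale(query))
```

On this toy data the raw prediction is "low" and the normalized one is "high", which shows that a pre-processing choice can change what the mining step reports.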

Example: Medical Data Mining

Health care and medical data mining often adopt this view from statistics and machine learning:

Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation

Data Mining: Confluence of Multiple Disciplines


[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]


Multi-Dimensional View of Data Mining

Knowledge to be mined (or: data mining functions)

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels

Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Why Confluence of Multiple Disciplines?

Tremendous amount of data

Algorithms must be highly scalable to handle terabytes of data

High dimensionality of data

Micro-arrays may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations

New and sophisticated applications

Data Mining Functionalities


The functionalities of a data mining system specify the kinds of patterns to be found in data mining tasks. These tasks fall into two categories:

1. Descriptive tasks
2. Predictive tasks

1. Descriptive tasks
1.1 Data characterization and discrimination
a. Identifying data
b. Selecting data
c. Identifying and selecting the data
1.2 Mining frequent patterns, associations, and correlations
a. Frequent itemsets
b. Frequent subsequences
c. Frequent substructures

2. Predictive tasks
2.1 Data classification and prediction
a. Decision trees
b. Neural networks
2.2 Cluster analysis
2.3 Outlier analysis
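
To make "frequent itemsets" concrete, here is a minimal sketch that counts the support of every itemset up to a given size and keeps those above a threshold. It is a brute-force enumeration, not an optimized algorithm such as Apriori, and the transactions, the `frequent_itemsets` name, and the `max_size` parameter are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of size <= max_size.

    Support is the fraction of transactions containing the whole itemset.
    Brute force: every candidate subset of every transaction is counted.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # sorted so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

# Made-up market-basket transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

frequent = frequent_itemsets(transactions, min_support=0.5)
```

With a minimum support of 0.5, the pair `("beer", "diapers")` is kept because it appears in two of the four transactions, while `("beer", "bread")` appears in only one and is dropped.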

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Advanced data sets and advanced applications


Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web

Classifications of data mining systems


Classification according to the kinds of databases mined
Classification according to the kinds of knowledge mined
Classification according to the kinds of techniques utilized
Classification according to the applications adapted
