
Data Mining: Concepts and Techniques
UNIT 1
Introduction


Chapter 1. Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Classification of data mining systems

Top-10 most popular data mining algorithms

Major issues in data mining

Overview of the course



Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, etc.

Science: remote sensing, bioinformatics, scientific simulation, etc.

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

Necessity is the mother of invention: data mining, the automated analysis of massive data sets

Evolution of Sciences

Before 1600, empirical science

1600-1950s, theoretical science

Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s, computational science

Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)

Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now, data science

The flood of data from new scientific instruments and simulations

The ability to economically store and manage petabytes of data online

The Internet and computing Grid that makes all these archives universally
accessible

Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online
Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Evolution of Database Technology

1960s:

Data collection, database creation, IMS and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:

Data mining, data warehousing, multimedia databases, and Web databases

2000s:

Stream data management and mining

Data mining and its applications

Web technology (XML, data integration) and global information systems


What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything data mining?

Simple search and query processing

(Deductive) expert systems



Knowledge Discovery (KDD) Process

Data mining: core of the knowledge discovery process

[Figure: KDD process flow. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
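The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative toy, not a prescribed implementation: the record layout, the function names (clean, integrate, select, mine, evaluate) and the "interestingness" threshold are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical raw records from two source "databases"; None marks a missing value.
store_a = [{"customer": "c1", "amount": 20.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c3", "amount": 310.0}, {"customer": "c1", "amount": 25.0}]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: merge several cleaned sources into one collection."""
    return [r for source in sources for r in source]

def select(warehouse, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in warehouse]

def mine(values):
    """Data mining (toy): summarize the values as a simple pattern."""
    return {"mean": mean(values), "max": max(values)}

def evaluate(pattern, threshold):
    """Pattern evaluation: call the pattern interesting if the maximum
    exceeds `threshold` times the mean (an assumed criterion)."""
    return pattern["max"] > threshold * pattern["mean"]

warehouse = integrate(clean(store_a), clean(store_b))
pattern = mine(select(warehouse, "amount"))
interesting = evaluate(pattern, threshold=2.0)
```

Each function corresponds to one box in the process flow, so a stage can be replaced (for example, a real mining algorithm in place of mine) without touching the others.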


Data Mining and Business Intelligence

[Figure: pyramid of layers; potential to support business decisions increases toward the top. The typical role at each layer is given in parentheses.]

Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

Why Not Traditional Data Analysis?

Tremendous amount of data

Algorithms must be highly scalable to handle data volumes such as terabytes

High-dimensionality of data

Micro-array may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data

Time-series data, temporal data, sequence data

Structure data, graphs, social networks and multi-linked data

Heterogeneous databases and legacy databases

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications



Multi-Dimensional View of Data Mining

Data to be mined

Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW

Knowledge to be mined

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Multiple/integrated functions and mining at multiple levels

Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes

General functionality (Data Mining Tasks)

Descriptive data mining

Predictive data mining

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted



Predictive Tasks

Objective: Predict the value of a specific attribute (target/dependent variable) based on the values of other attributes (explanatory variables).

Example: Judge whether a patient has a specific disease based on his/her medical test results.

Descriptive Tasks

Objective: Derive patterns (correlations, trends) that summarize the underlying relationships in the data.

Example: Identifying web pages that are accessed together (a human-interpretable pattern).
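A predictive task such as the patient example can be sketched with a 1-nearest-neighbor classifier. This is a minimal illustration only: the two numeric "test results", the training rows and the labels are invented for the example and carry no medical meaning.

```python
import math

# Hypothetical training data: (test_result_1, test_result_2) -> diagnosis label.
training = [
    ((1.0, 1.2), "healthy"),
    ((0.8, 1.0), "healthy"),
    ((3.1, 2.9), "disease"),
    ((2.8, 3.3), "disease"),
]

def predict(sample, labeled_points):
    """Return the label of the training point closest to the sample
    (1-nearest neighbor, Euclidean distance)."""
    nearest = min(labeled_points, key=lambda item: math.dist(sample, item[0]))
    return nearest[1]

new_patient = (3.0, 3.0)
diagnosis = predict(new_patient, training)
```

The explanatory attributes are the test results; the target attribute is the label returned by predict.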

Data Mining Models and Tasks

[Figure: tree of data mining techniques]

Data Mining Techniques

Descriptive: Clustering, Association, Sequential Analysis

Predictive: Classification (Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification), Regression

Basic Data Mining Tasks

Classification maps data into predefined groups or classes

Supervised learning
Pattern recognition
Prediction

Regression is used to map a data item to a real-valued prediction variable.

Clustering groups similar data together into clusters.

Unsupervised learning
Segmentation
Partitioning
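Regression and clustering can each be illustrated in a few lines. The sketch below assumes one-dimensional toy data; fit_line is an ordinary least-squares line fit and kmeans_1d is a basic k-means loop with caller-supplied starting centers. Both names and all data values are made up for the example.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def kmeans_1d(values, centers, iterations=10):
    """Clustering: assign each value to its nearest center, then move
    each center to the mean of its cluster; repeat."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
```

Regression needs known target values (supervised), while the clustering call receives no labels at all (unsupervised); it only groups values that lie close together.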


Basic Data Mining Tasks (contd.)

Summarization maps data into subsets with associated simple descriptions.

Characterization
Generalization

Link Analysis uncovers relationships among data.

Affinity Analysis
Association Rules

Sequential Analysis determines sequential patterns.
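Association rules are usually scored by support and confidence. The sketch below computes both over a hypothetical set of market-basket transactions; the items and the helper names are illustrative assumptions.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule "bread implies milk" holds in 3 of the 5 transactions overall, and in 3 of the 4 transactions that contain bread.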


Ex: Time Series Analysis

Example: Stock Market


Predict future values
Determine similar patterns over time
Classify behavior
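The three time-series tasks can each be reduced to a very small sketch. The choices below (moving-average forecast, Euclidean distance as similarity, first-versus-last comparison as classification) are simple stand-ins chosen for illustration, and the price values are invented.

```python
import math

def moving_average_forecast(series, window):
    """Predict the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

def distance(series_a, series_b):
    """Euclidean distance between two equal-length series;
    a smaller distance means more similar patterns."""
    return math.dist(series_a, series_b)

def classify_trend(series):
    """Classify behavior by comparing the last value with the first."""
    return "rising" if series[-1] > series[0] else "falling or flat"

prices = [10.0, 11.0, 12.0, 13.0]
forecast = moving_average_forecast(prices, window=2)
similarity_gap = distance([1, 2, 3], [1, 2, 5])
trend = classify_trend(prices)
```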


Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Advanced data sets and advanced applications

Data streams and sensor data

Time-series data, temporal data, sequence data (incl. biosequences)

Structure data, graphs, social networks and multi-linked data

Object-relational databases

Heterogeneous databases and legacy databases

Spatial data and spatiotemporal data

Multimedia database

Text databases

The World-Wide Web



Major Issues in Data Mining

Mining methodology

Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web

Performance: efficiency, effectiveness, and scalability

Pattern evaluation: the interestingness problem

Incorporation of background knowledge

Handling noise and incomplete data

Parallel, distributed and incremental mining methods


Integration of discovered knowledge with existing knowledge: knowledge fusion

User interaction

Data mining query languages and ad-hoc mining

Expression and visualization of data mining results

Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts

Domain-specific data mining & invisible data mining


Protection of data security, integrity, and privacy

Why Data Mining? Potential Applications

Data analysis and decision support

Market analysis and management

Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other Applications

Text mining (news group, email, documents) and Web mining

Stream data mining

Bioinformatics and bio-data analysis
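Detection of unusual patterns, as in the fraud detection application above, can be sketched with a z-score rule: flag values that lie far from the mean in units of standard deviation. The threshold of 2.0 and the transaction amounts are assumptions for the example; real fraud detection uses far richer models.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 20, 23, 500]
suspicious = find_outliers(amounts)
```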



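Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom. Graphical User Interface -> Pattern Evaluation -> Data Mining Engine -> Database or Data Warehouse Server, with a Knowledge Base alongside Pattern Evaluation and the Data Mining Engine. Data cleaning, integration, and selection feed the server from the underlying sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories]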

Understanding the term Data Warehousing

Data Warehouse:
Bill Inmon coined the term Data Warehouse in 1990 and defined it as follows: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

Subject Oriented:
Data that gives information about a particular subject instead of
about a company's ongoing operations.

Integrated:
Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.

Time-variant:
All data in the data warehouse is identified with a particular time
period.

Non-volatile:
Data is stable in a data warehouse. More data is added but data is
never removed. This enables management to gain a consistent
picture of the business.
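The time-variant and non-volatile properties can be illustrated with an append-only store in which every row carries a time period and nothing is ever updated or deleted. The Warehouse class, its method names and the sample figures are hypothetical and exist only for this sketch.

```python
class Warehouse:
    """Toy append-only store: rows are tagged with a subject and a time
    period, and there is deliberately no update or delete method."""

    def __init__(self):
        self._rows = []

    def load(self, subject, period, value):
        """Add a new row; existing rows are never changed (non-volatile)."""
        self._rows.append((subject, period, value))

    def history(self, subject):
        """All (period, value) rows for one subject, in load order."""
        return [(p, v) for s, p, v in self._rows if s == subject]

wh = Warehouse()
wh.load("sales", "2024-01", 100)
wh.load("sales", "2024-02", 120)
wh.load("inventory", "2024-01", 40)
sales_history = wh.history("sales")
```

Querying by subject rather than by operational transaction mirrors the subject-oriented property, and the period on each row is what makes the data time-variant.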

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996]

[Figure: Data transformed into Information]

Data Warehouse Architecture

[Figure: data warehouse architecture. Sources (Relational Databases, ERP Systems, Purchased Data, Legacy Data) -> Extraction and Cleansing -> Optimized Loader -> Data Warehouse Engine, supported by a Metadata Repository -> Analyze and Query]

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

Rules of a Data Warehouse

Data Warehouse and Operational Environments are Separated

Data is integrated

Contains historical data over a long period of time

Data is snapshot data captured at a given point in time

Data is subject-oriented

Rules of Data Warehouse

Mainly read-only with periodic batch updates

Development Life Cycle has a data-driven approach versus the traditional process-driven approach

Data contains several levels of detail

Current, Old, Lightly Summarized, Highly Summarized
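The levels of detail can be pictured as successive roll-ups of the same data. In this sketch the detailed rows are daily sales, the lightly summarized level is monthly totals and the highly summarized level is yearly totals; that mapping, the date strings and the summarize helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical detailed rows: (ISO date, sales amount).
daily_sales = [
    ("2024-01-05", 100),
    ("2024-01-20", 50),
    ("2024-02-03", 70),
    ("2023-12-31", 30),
]

def summarize(rows, key_length):
    """Total the amounts by the first `key_length` characters of the date:
    7 gives "YYYY-MM" (monthly), 4 gives "YYYY" (yearly)."""
    totals = defaultdict(int)
    for day, amount in rows:
        totals[day[:key_length]] += amount
    return dict(totals)

lightly_summarized = summarize(daily_sales, 7)
highly_summarized = summarize(daily_sales, 4)
```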

Rules of Data Warehouse

Environment is characterized by read-only transactions to very large data sets

System that traces data sources, transformations, and storage

Metadata is a critical component

Source, transformation, integration, storage, relationships, history, etc.

Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users

Warehousing on the Web (Web Warehousing)

Webhouse

Data Webhouse

Defined by Ralph Kimball

Two distinct focuses

Bringing the web to the warehouse

Clickstream data as a source of information

Bringing existing data warehouses to the web

Fully distributed environment

Required Capabilities

Capture clickstream logs and convert to tables for analysis

Merge customer demographic and account info with above

Interpret customer paths in website

Identify abandoned sessions

Use DW to drive customer responses appearing on your website

DW querying and reporting available through web browsers

Attach multimedia to DW

DW security
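Two of the required capabilities, interpreting customer paths and identifying abandoned sessions, can be sketched once click events are already in tabular form. Everything specific here is an assumption for the example: the event tuple layout, the 30-minute session timeout, the page paths, and the rule that a session is "abandoned" when it reaches the cart but never the checkout confirmation page.

```python
# Hypothetical parsed click events: (visitor_id, seconds_since_midnight, page).
events = [
    ("v1", 100, "/home"), ("v1", 160, "/cart"), ("v1", 200, "/checkout/done"),
    ("v2", 120, "/home"), ("v2", 180, "/cart"),
    ("v1", 5000, "/home"),
]
SESSION_TIMEOUT = 1800  # assumed: 30 minutes of inactivity ends a session

def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group each visitor's clicks into sessions; a gap longer than
    `timeout` seconds starts a new session."""
    sessions = []
    last_seen = {}
    # Sorting by visitor then time keeps each visitor's clicks contiguous,
    # so the most recent session always belongs to the current visitor.
    for visitor, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        if visitor not in last_seen or ts - last_seen[visitor] > timeout:
            sessions.append({"visitor": visitor, "pages": []})
        sessions[-1]["pages"].append(page)
        last_seen[visitor] = ts
    return sessions

def is_abandoned(session):
    """Assumed rule: the cart was visited but checkout never completed."""
    pages = session["pages"]
    return "/cart" in pages and "/checkout/done" not in pages

sessions = sessionize(events)
abandoned = [s for s in sessions if is_abandoned(s)]
```

The pages list of each session is the customer path through the site, in click order.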

Architecture: Web to Warehouse

Beyond a comprehensive snapshot of the business on a real-time basis, we also want knowledge of customer behavior

Extended design factors

Timeliness: real-time
Data volume: no upper limit
Response time: less than 10 seconds


Who are our users?

Traditional

Power users: need database connectivity

Analysts: want to manipulate existing data

Report viewers: view standardized reports

Web

Our customers
Our business partners
Our employees


Clickstreams
Clickstream as defined by the Internet Advertising Bureau (IAB):
The electronic path a user takes while navigating from site to site, and from page to page within a site. It is a comprehensive body of data describing the sequence of activity between a user's browser and any other Internet resource, such as a Web site or third-party ad server.

Clickstreams

Clickstream is not just another data source

Distributed nature leads to multiple data sources which require synchronization
Multiple parties
More than a dozen log file formats for capturing clickstream data
Search specification

Basic form of clickstream data is stateless

Log shows isolated page retrieval events

Clickstream data is anonymous

Clickstreams

Clickstream post-processor receives raw log data from the web server and normalizes it into a format which can be combined with application-derived data for insertion into the DW
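The normalization step of a clickstream post-processor can be sketched as parsing one raw web server log line into a structured row. The sketch assumes the widely used Common Log Format; the slides do not say which of the many log formats is in play, and the regular expression, field names and sample line are illustrative only.

```python
import re
from datetime import datetime

# Assumed input format: Common Log Format, e.g.
# host ident user [timestamp] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def normalize(line):
    """Turn one raw log line into a dict row, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "host": match["host"],
        "time": datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "path": match["path"],
        "status": int(match["status"]),
    }

sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
row = normalize(sample)
```

Rows in this shape can be joined with application-derived data (for example, a customer table keyed by login) before loading.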

Why Bring DW to Web?

Primary function of DW is to publish information: web is a good partner

Need for distributed DW: web provides universal connectivity

Universal front-end: web browser

Web Pushes Data Warehouse

User interface effectiveness is measurable

Queries and updates are mixed
Speed expected: 10-second rule
Global

24 x 7 availability expected
International characters, dates, addresses

Expanded multimedia

Animation, zoomable images, maps, video clips

Need material in digital form
Enterprise information portal will require items to be searchable

Web Pushes Data Warehouse

Mass customization

Fully distributed

Dynamically created web pages (XML)

Linking together all the data marts

Security and Privacy

Publish only to those who need to know


User profiles and access profiles defined in one
place
Full-time expert security person

Second Generation User Interface Guidelines

Near-instantaneous performance


Website Design

Design for lowest common denominator

Measure page performance on a continuous basis

Paint navigation buttons immediately

Disclose content progressively

Implement page caching

Cache data, reports

Improve web server bandwidth

Improve server throughput
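Page and report caching can be sketched with Python's standard functools.lru_cache decorator, which stores the result of a function call and returns the stored copy on repeat requests. The render_report function and the call counter are hypothetical stand-ins for an expensive report query.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive work really runs

@lru_cache(maxsize=128)
def render_report(report_id):
    """Pretend to build an expensive report page; cached per report_id."""
    global call_count
    call_count += 1
    return f"<html>report {report_id}</html>"

first = render_report(7)   # computed
second = render_report(7)  # served from the cache, no recomputation
```

A real deployment would also need a policy for invalidating cached pages when the underlying data is refreshed.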

Second Generation User Interface Guidelines

Data Webhouse design

Adapt all web design responses

Select appropriate DBMS software: dimensional models, OLAP

Use indexes, aggregations

Partition files

Increase RAM

Use parallel processing

Meet User Expectations

Website design

Site navigation choices

Help choices

Communication with various groups: response must be assured

Headlines serious and define content

Indicate off-screen material

Survey customer needs and wants

Meet User Expectations

Data Webhouse design

Report library

Folder of previous queries, reports

Dimension browser: viewing dimensions can assist report creation

Business metadata interface: understand the organization's data assets

Streamline Process

Business processes designed from the ground up to work seamlessly on the web

Website design

Reengineer to streamline process and make navigation easier, uniform interfaces

Remove barriers to reaching page

Minimize clicks and new windows

Allow interruption and return

Streamline Process

Data Webhouse design

Build an explicit value chain for reporting and analysis around the application suite using conformed dimensions and facts

Drill across functions

Single user interface for reporting against all parts of business

Master report library and FAQs

Single login and single console access to webhouse
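Drilling across relies on conformed dimensions: two separate fact summaries can be lined up side by side only because they share the same dimension keys. The sketch below assumes a shared "product" dimension and two made-up summaries (sales and returns); the drill_across helper is hypothetical.

```python
# Hypothetical fact summaries keyed by the same conformed "product" dimension.
sales_by_product = {"widget": 120, "gadget": 80}
returns_by_product = {"widget": 6, "gizmo": 2}

def drill_across(*fact_summaries):
    """Combine summaries from several fact tables on their shared
    dimension key; a key missing from one summary contributes 0."""
    keys = sorted(set().union(*fact_summaries))
    return {k: tuple(f.get(k, 0) for f in fact_summaries) for k in keys}

# Each value is (sales, returns) for that product.
combined = drill_across(sales_by_product, returns_by_product)
```

If the two summaries used different product codes for the same item, the rows would not line up, which is why the dimensions must be conformed first.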

Allow Problem Resolution

Website design

Allow backtracking, rollback, play forward


Keep old transactions
Easy error reporting
Acknowledge, track and follow-up all user inputs,
show wait time
Assist searching

Data Webhouse design

Provide adequate end user support


Show aggregates in use and available
Show system load and percent completed

Build Trust

Clearly state and observe the website's policies for using the customer's identity

Website design

Do not abuse privacy

Link to privacy statement

Use friendly pictures of people

Distinguish between ad content and editorial content

Build Trust

Data Webhouse design

Two-factor security

What you know: password

What you possess: token

Track changes in employee and contractor status

Create and enforce roles for employees, contractors and customers

Manage webhouse security directly
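Two-factor security combines something the user knows with something the user possesses. The sketch below checks a password against a salted hash and a short code derived from a secret held on a token. The code-derivation scheme is a toy invented for this example, not a standard one-time-password algorithm, and all function names are hypothetical.

```python
import hashlib
import hmac
import os

def hash_password(password, salt):
    """What you know: derive a salted hash of the password."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def token_code(secret, counter):
    """What you possess: toy code computed from the token's secret and a
    counter (illustrative only, not a standard OTP scheme)."""
    digest = hmac.new(secret, str(counter).encode(), hashlib.sha256).hexdigest()
    return digest[:6]

def authenticate(password, code, stored_hash, salt, secret, counter):
    """Grant access only if both factors check out."""
    knows = hmac.compare_digest(hash_password(password, salt), stored_hash)
    possesses = hmac.compare_digest(code, token_code(secret, counter))
    return knows and possesses

salt = os.urandom(16)
secret = os.urandom(32)
stored_hash = hash_password("correct horse", salt)
good_code = token_code(secret, counter=1)
```

Either factor alone is rejected: a correct password with a wrong code fails, and so does a correct code with a wrong password.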

Provide Communication Hooks

Website design

Provide useful links to others, internal and external

Remove links that invalidate the back button

Use copyable URLs

Use URL as medium of distribution
