
DATA MINING

INTRODUCTION

DB Vs VLDB


The world contains an unimaginably vast amount of digital information, which is getting ever vaster ever more rapidly. Despite the abundance of tools to capture, process and share all this information (sensors, computers, mobile phones, etc.), it already exceeds the available storage space.

Data Growth


The amount of digital information increases tenfold every five years. Moore's law says that the processing power and storage capacity of computer chips double, or their prices halve, roughly every 18 months. Data are becoming the new raw material of business: an economic input almost on par with capital and labour.

What is the use of VLDB?




Farecast, a part of Microsoft's search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down by examining 225 billion flight and price records.

Industry Need


In recent years Oracle, IBM, Microsoft and SAP spent more than $15 billion on buying software firms specialising in data management and analytics. This industry is estimated to be worth more than $100 billion and growing at almost 10% a year, roughly twice as fast as the software business as a whole.
Google's search engine is partly guided by the number of clicks on an item to help determine its relevance to a search query. If the eighth listing for a search term is the one most people go to, the algorithm puts it higher up.

Data Mining Professional




Chief information officers (CIOs) have become somewhat more prominent in the executive suite, and a new kind of professional has emerged: the data scientist, who combines the skills of software programmer, statistician and storyteller/artist.

Evolution of Database Technology




1960s:
  Data collection, database creation, IMS and network DBMS

1970s:
  Relational data model, relational DBMS implementation

1980s:
  RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s:
  Data mining and data warehousing, multimedia databases, and Web databases


Motivation: Necessity is the Mother of Invention




Data explosion problem
  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

A real-life scenario

A credit card company must determine whether to authorize a credit card purchase by a customer.
Each purchase is placed in one of the following classes:
  1) Authorize
  2) Ask for further identification
  3) Do not authorize
  4) Do not authorize, contact police

Why Mine Data? Commercial Viewpoint




Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
  Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint




Data collected and stored at enormous speeds (GB/hour)
  remote sensors on a satellite
  telescopes scanning the skies
  microarrays generating gene expression data
  scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
  in classifying and segmenting data
  in hypothesis formation

What Is Data Mining?




Data mining (knowledge discovery in databases):
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their inside stories:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

DATA MINING - Definition




Process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information repositories.

DM Definition contd.
"Data mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from databases that is used to make crucial business decisions." - Gregory Piatetsky-Shapiro, Editor, KDnuggets.com

What is (not) Data Mining?




What is not Data Mining?
  Look up a phone number in a phone directory
  Query a Web search engine for information about "Amazon"

What is Data Mining?
  Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
  Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Why Data Mining




Credit ratings/targeted marketing:
  Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  Identify likely responders to sales promotions

Fraud detection:
  Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management:
  Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data mining helps extract such information

Applications


Medicine: disease outcome, effectiveness of treatments
  analyze patient disease history: find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
  identify new galaxies by searching for sub-clusters
Web site/store design and promotion:
  find affinity of visitors to pages and modify layout




Data Mining: A KDD Process




Data mining: the core of knowledge discovery process.

[Figure: KDD process flow. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Steps of a KDD Process




Learning the application domain:
  relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation:
  find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining:
  summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation:
  visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
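
The KDD steps can be sketched as a small chain of functions. This is only an illustration under stated assumptions: the records, the function names (clean, select, mine, evaluate) and the minimum-count threshold are all hypothetical, and the "mining" step is reduced to a simple frequency count.

```python
from collections import Counter

# Hypothetical raw records: (customer, item) pairs with noise such as
# missing values and inconsistent casing.
raw_records = [
    ("alice", "Bread"), ("alice", "milk"), ("bob", "bread"),
    ("bob", None), ("carol", "MILK"), ("carol", "bread"),
    (None, "eggs"), ("dave", "bread"),
]

def clean(records):
    """Data cleaning: drop incomplete rows and normalize item names."""
    return [(c, i.strip().lower()) for c, i in records if c and i]

def select(records, items_of_interest):
    """Data selection: keep only the task-relevant items."""
    return [(c, i) for c, i in records if i in items_of_interest]

def mine(records):
    """Data mining: count how many distinct customers bought each item."""
    return Counter(i for _, i in set(records))

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {item: n for item, n in patterns.items() if n >= min_count}

cleaned = clean(raw_records)
target = select(cleaned, {"bread", "milk"})
patterns = mine(target)
interesting = evaluate(patterns, min_count=3)
print(interesting)  # {'bread': 4}
```

Each stage takes the previous stage's output, which mirrors the way cleaning and selection precede mining and evaluation in the process above.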

Data Mining and Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases toward the top]
Layers, from top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture of a Typical Data Mining System

[Figure: layered architecture, from top to bottom]
  Graphical user interface
  Pattern evaluation
  Data mining engine
  Database or data warehouse server
  Data cleaning & data integration, filtering
  Databases, Data Warehouse
Side component: Knowledge-base

Data Mining: Confluence of Multiple Disciplines


[Figure: Data Mining at the centre, drawing on the surrounding disciplines]
  Database Technology
  Statistics
  Machine Learning
  Visualization
  Information Science
  Other Disciplines

Data Mining: On What Kind of Data?


   

Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
  Object-oriented and object-relational databases
  Spatial databases
  Time-series data and temporal data
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  WWW

Data Mining Functionalities




Concept /class description:




Associating data with classes (class of items: computers and printers) and concepts (concept on customers: big spenders and budget spenders)
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Data characterization: summarizing the data of the class under study (target class)
Data discrimination: comparison of the target class with one or a set of comparative classes

Association
 

Mining frequent patterns
Frequent itemset: a set of items that frequently appear together in a transactional data set
Mining frequent patterns leads to the discovery of interesting associations and correlations within data
Threshold measures: support and confidence
Single-dimensional vs. multi-dimensional association
  contains(T, computer) -> contains(T, software) [support = 1%, confidence = 75%]
  age(X, 20..29) ^ income(X, 20..29K) -> buys(X, PC) [support = 2%, confidence = 60%]
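
Support and confidence for a rule such as computer -> software can be computed directly from a list of transactions. The sketch below uses a tiny, made-up transaction set; the function names support and confidence are illustrative and not taken from any particular library.

```python
# Hypothetical transactions: each one is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 60%, confidence = 75%
```

Here 3 of 5 transactions contain both items (support 60%), and 3 of the 4 transactions containing a computer also contain software (confidence 75%).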

   

Classification and Prediction




Finding models (or functions) that describe and distinguish classes or concepts, and using the model for future prediction
Derived model is based on training data
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation of model: decision tree, classification rule, neural network
Prediction: predict some unknown or missing numerical values, e.g., with regression analysis
Both should be preceded by relevance analysis: identifying attributes contributing to the classification or prediction process

 

Decision trees
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example decision tree. Root test: Salary < 1 M. Second-level tests: Prof = teacher (leaves: Good, Bad) and Age < 30 (leaves: Bad, Good).]
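
A decision tree of this shape amounts to nested if/else tests ending in class labels. The sketch below encodes the example tree under two assumptions that the slide does not state: "1 M" is read as one million, and the first listed leaf of each test is the branch taken when the test is true. The function name classify is hypothetical.

```python
def classify(salary, profession, age):
    """Return "Good" or "Bad" by walking a two-level decision tree.

    Assumption: the first child is taken when a test is true; the slide
    does not say which branch is which.
    """
    if salary < 1_000_000:
        # Internal node: test on profession.
        return "Good" if profession == "teacher" else "Bad"
    # Internal node: test on age.
    return "Bad" if age < 30 else "Good"

print(classify(500_000, "teacher", 40))   # Good
print(classify(2_000_000, "lawyer", 25))  # Bad
```

Every call follows exactly one root-to-leaf path, so the prediction can be explained by listing the tests along that path.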

Neural network


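Set of nodes connected by directed weighted edges

[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; a more typical NN with inputs x1, x2, x3, hidden nodes and output nodes]

Basic NN unit:

$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$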
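
The basic unit applies a sigmoid to the weighted sum of its inputs. A minimal sketch follows; the weights and inputs are made-up values, and the names sigmoid and unit_output are illustrative.

```python
import math

def sigmoid(y):
    """Logistic function: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def unit_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

# Weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.1*0.0 = 0.0, and sigmoid(0) = 0.5.
o = unit_output([0.5, -0.25, 0.1], [1.0, 2.0, 0.0])
print(o)  # 0.5
```

A network with hidden and output nodes chains many such units, feeding each layer's outputs to the next layer as inputs.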

Cluster analysis


Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
Facilitates taxonomy formation, i.e., organization of observations into a hierarchy of classes that group similar events together
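
As an illustration of grouping unlabeled data, the sketch below runs k-means on one-dimensional values. k-means is one common clustering algorithm chosen here for brevity; the slides do not name a specific method. The house prices and initial centroids are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's k-means on 1-D data with caller-supplied initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
    return centroids, clusters

prices = [100, 110, 120, 300, 310, 320]  # hypothetical house prices
centroids, clusters = kmeans_1d(prices, centroids=[100, 320])
print(centroids)  # [110.0, 310.0]
print(clusters)   # [[100, 110, 120], [300, 310, 320]]
```

Points within each resulting group are close to one another and far from the other group, which is the intraclass/interclass principle in its simplest form.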

Outlier analysis


Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or exception, but is quite useful in fraud detection and rare events analysis
Statistical methods: distribution model or distance measures
Deviation-based methods: examine the differences in the main characteristics of objects in a group
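
One simple statistical approach flags values that lie many standard deviations from the mean (a z-score test). The sketch below is a minimal example of that idea; the threshold of 2.0 and the transaction amounts are assumptions made for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

amounts = [20, 22, 19, 21, 23, 20, 500]  # hypothetical card transactions
print(zscore_outliers(amounts))  # [500]
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a modest threshold is needed; more robust statistics such as the median are a common alternative.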

Trend and evolution analysis

   

Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Time series data analysis
Similarity-based data analysis
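
A trend can be estimated by fitting a straight line to a time series with ordinary least squares. The sketch below assumes equally spaced observations indexed 0, 1, 2, ... and uses made-up sales figures; fit_trend is a hypothetical helper name.

```python
def fit_trend(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept). Needs at least two observations."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

sales = [10, 12, 14, 16, 18]  # hypothetical sales per period
slope, intercept = fit_trend(sales)
print(slope, intercept)  # 2.0 10.0
```

The difference between each observation and the fitted line is its deviation from the trend, which ties trend analysis back to the deviation analysis mentioned above.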

Discovered Patterns Interestingness




A data mining system/query may generate thousands of patterns; not all of them are interesting.
  Suggested approach: human-centered, query-based, focused mining
Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures:
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?




Find all the interesting patterns: Completeness
  Can a data mining system find all the interesting patterns?
  User-provided constraints and interestingness measures used to focus the search
  Ex: Association Rule Mining

Search for only interesting patterns: Optimization
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization

CLASSIFICATION OF DM SYSTEMS


Classification according to
  Kinds of databases mined (data models, types of data or applications)
  Kinds of knowledge mined (data mining functionalities)
  Kinds of techniques utilized (degree of user interaction involved or methods of data analysis employed)
  Applications adapted (like finance, stock markets, telecommunications)

DM Task Primitives
 

Each user will have a DM task in mind
The task can be specified to the DM system in the form of a DM query
A DM query is defined in terms of DM task primitives
The primitives allow interactive communication with the DM system to direct the mining process

DM Primitives


Task-relevant data to be mined: relevant DB attributes or DWH dimensions of interest
Kinds of knowledge to be mined: functionalities
Background knowledge: concept hierarchy
Interestingness measures: support & confidence
Knowledge presentation & visualization: form of display

   

DM Query Language
 

To incorporate DM task primitives
Foundation on which a user-friendly graphical interface can be built
Example for DMQL:
  Use database <dbname>
  Use hierarchy <type of hierarchy> for <attrib>
  Mine <functionality> as <name_of_pattern>
  In relevance to <relevant attributes>
  From <table names>
  Where <condition>
  Group by <attribute>
  Having <min threshold>
  Display as <visualization of result>

DM System Architecture
Coupling or integrating a DM system and a DB/DWH system
  No coupling (DM system will not utilise any function of a DB or DWH system)
  Loose coupling (some facilities used)
  Semi-tight coupling (a few DM primitives provided as part of the DB/DWH system)
  Tight coupling (DM system integrated into the DB/DWH system)

MAJOR ISSUES


Mining methodology and user interaction issues
Performance issues
Diversity of database types issues

Mining methodology & user interaction issues


 

    

Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
DM query languages and ad hoc data mining
Presentation & visualization of results
Handling noisy or incomplete data
Pattern evaluation: the interestingness problem

Performance Issues


Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining algorithms

Diversity of DB types issues




Handling of relational and complex types of data
Mining information from heterogeneous databases & global information systems

To conclude


Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

 
