Unit 2

Introduction to Data Mining


a. What is Data Mining?
b. Data Mining Functionalities: What kinds of Patterns can be Mined?
c. Classification of Data Mining Systems
d. Data Mining Task Primitives
e. Integration of Data Mining system with a Data Warehouse System
f. Major issues in Data Mining
g. Data Mining statistics: Guidelines for successful Data Mining
h. Applications and Trends in Data Mining
What Is Data Mining?

• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data

• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.

February 16, 2016 Data Mining: Concepts and Techniques 3


What Is Data Mining?

• Data mining is the extraction (mining) of knowledge from large amounts
of data. Mining gold from rocks or sand is called gold mining, not rock
mining or sand mining, so a more accurate name for data mining would be
"knowledge mining from data". Current business usage nevertheless calls
it data mining.
What is Data Mining?

• Efficient automated discovery of previously unknown patterns
in large volumes of data.
• Patterns must be valid, novel, useful and understandable.
• Businesses are mostly interested in discovering past patterns to
predict future behaviour.
• A data warehouse, to be discussed later, can be an enterprise’s
memory. Data mining can provide intelligence using that
memory.

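Data Mining Functionalities

• Functionalities specify the kinds of patterns to be found in data
mining tasks
• Data mining tasks fall into two categories
– Descriptive: characterize the general properties of the data
– Predictive: perform inference on current data to make predictions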
Data Mining Functionalities
• Multidimensional concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
• Frequent patterns, association, correlation vs. causality
– Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
– Predict some unknown or missing numerical values
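
The two numbers attached to an association rule such as Diaper → Beer are its support and confidence. A minimal sketch of how they can be computed is given here; the function name `rule_support_confidence` and the toy baskets are illustrative assumptions, not data from these slides.

```python
from typing import FrozenSet, Iterable, List, Tuple


def rule_support_confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of antecedent transactions that also contain
                 the consequent
    """
    left = frozenset(antecedent)
    both = left | frozenset(consequent)
    n_total = len(transactions)
    n_left = sum(1 for t in transactions if left <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_left if n_left else 0.0
    return support, confidence


# Hypothetical market baskets, invented for illustration only.
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer"}),
    frozenset({"milk"}),
    frozenset({"bread", "eggs"}),
]

support, confidence = rule_support_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={support:.3f}, confidence={confidence:.2f}")
```

In this toy data, 3 of 8 baskets contain both items and 3 of the 4 diaper baskets also contain beer. A high confidence only says the items co-occur; it does not by itself establish causality.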
Data Mining Functionalities (2)
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of
the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
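
As one possible illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is only one simple technique among many, and the threshold of 2.0, the function name, and the sample amounts are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_score_outliers(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts, invented for illustration.
amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = z_score_outliers(amounts)
print(flagged)
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) is a judgment the analyst still has to make.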



Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the center, drawing on database technology,
statistics, machine learning, visualization, pattern recognition,
algorithms, and other disciplines]
Classification of Data Mining Systems

[Figure: data mining systems classified by type of data source mined,
data model drawn, kind of knowledge discovered, mining technique, and
application adopted]
1) Classification according to type of data source mined:
– spatial database
– multimedia data
– time-series data
– text data
– World Wide Web
2) Classification according to data model drawn:
– relational database
– object-oriented database
– data warehouse
– transactional database
3) Classification according to kind of knowledge discovered
(data mining functionality):
– characterization
– association
– classification
– clustering
4) Classification according to mining technique (data analysis approach):
– machine learning
– neural network
– genetic algorithm
5) Classification according to the application adopted:
– finance
– telecommunication
– stock market
Major Issues in Data Mining

[Figure: major issues grouped into mining methodology and user
interaction issues (7), performance issues (2), and diverse data type
issues (2)]
A) Mining methodology and user interaction issues:
1) Mining different kinds of knowledge in databases
2) Interactive mining of knowledge at multiple levels of abstraction
3) Incorporation of background knowledge
4) Data mining query languages and ad hoc data mining
5) Presentation and visualization of data mining results
6) Handling noisy and incomplete data
7) Pattern evaluation: the interestingness problem

B) Performance issues:
1) Efficiency and scalability of data mining algorithms
2) Parallel, distributed and incremental mining algorithms

C) Issues relating to the diversity of database types:
1) Handling of complex and relational types of data
2) Mining information from heterogeneous databases and global
information systems
Primitives that Define a Data Mining Task

• Task-relevant data
– Database or data warehouse name
– Database tables or data warehouse cubes
– Condition for data selection
– Relevant attributes or dimensions
– Data grouping criteria
• Type of knowledge to be mined
– Characterization, discrimination, association, classification, prediction,
clustering, outlier analysis, other data mining tasks
• Background knowledge
• Pattern interestingness measurements
• Visualization/presentation of discovered patterns
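
One way to picture the five primitives is as the fields of a single task specification. The sketch below is a hypothetical Python record, not the syntax of any real data mining query language; the class name `MiningTask`, its field names, and the example values are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Knowledge types named on the slide; the slide also allows
# "other data mining tasks", so this set is not exhaustive.
KNOWLEDGE_TYPES = {
    "characterization", "discrimination", "association",
    "classification", "prediction", "clustering", "outlier analysis",
}


@dataclass
class MiningTask:
    # 1. Task-relevant data
    database: str
    tables: List[str]
    # 2. Type of knowledge to be mined
    knowledge_type: str
    selection_condition: Optional[str] = None
    relevant_attributes: List[str] = field(default_factory=list)
    grouping_criteria: List[str] = field(default_factory=list)
    # 3. Background knowledge, e.g. concept hierarchies
    background_knowledge: List[str] = field(default_factory=list)
    # 4. Pattern interestingness measurements (thresholds)
    interestingness: Dict[str, float] = field(default_factory=dict)
    # 5. Visualization/presentation of discovered patterns
    presentation: str = "table"

    def __post_init__(self) -> None:
        if self.knowledge_type not in KNOWLEDGE_TYPES:
            raise ValueError(f"unknown knowledge type: {self.knowledge_type!r}")


# Hypothetical task: mine association rules from 2015 purchases.
task = MiningTask(
    database="sales_dw",
    tables=["purchases"],
    knowledge_type="association",
    selection_condition="year = 2015",
    relevant_attributes=["customer_id", "item"],
    grouping_criteria=["customer_id"],
    background_knowledge=["item -> category -> department"],
    interestingness={"min_support": 0.005, "min_confidence": 0.75},
    presentation="rules",
)
print(task.knowledge_type, task.interestingness)
```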



Integration of Data Mining and Data Warehousing

• Coupling of data mining systems with DBMS and data warehouse systems
– No coupling, loose coupling, semi-tight coupling, tight coupling
• On-line analytical mining of data
– Integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge
– Necessity of mining knowledge and patterns at different levels of
abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
– Characterized classification, first clustering and then association
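
Rolling up to a higher level of abstraction can be sketched as re-aggregating a measure along a concept hierarchy. The example below is a minimal illustration, assuming a hypothetical city → country hierarchy and invented sales figures; a real OLAP engine would do this over data cubes rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}


def roll_up(sales: Dict[str, int], hierarchy: Dict[str, str]) -> Dict[str, int]:
    """Aggregate sales from a lower level (city) to a higher level (country)."""
    totals: Dict[str, int] = defaultdict(int)
    for low_level_key, amount in sales.items():
        totals[hierarchy[low_level_key]] += amount
    return dict(totals)


# Invented city-level sales figures.
city_sales = {"Chicago": 120, "New York": 300, "Toronto": 80, "Vancouver": 50}
country_sales = roll_up(city_sales, CITY_TO_COUNTRY)
print(country_sales)
```

Drilling down is the reverse direction: going from the country totals back to the city-level detail.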



Coupling Data Mining with DB/DW Systems

• No coupling: flat file processing, not recommended
• Loose coupling
– Fetching data from DB/DW
• Semi-tight coupling: enhanced DM performance
– Provide efficient implementations of a few data mining primitives in a
DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis,
multiway join, precomputation of some statistical functions
• Tight coupling: a uniform information processing environment
– DM is smoothly integrated into a DB/DW system; mining queries are
optimized based on mining query analysis, indexing, query processing
methods, etc.
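
The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module. In the loose version the mining code only fetches rows and aggregates them itself; in the semi-tight version the aggregation primitive is pushed into the database. The `sales` table, its contents, and both function names are assumptions made for this illustration, not part of any specific data mining system.

```python
import sqlite3
from collections import defaultdict
from typing import Dict

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0),
     ("south", 5.0), ("south", 15.0), ("south", 40.0)],
)


def totals_loose(db: sqlite3.Connection) -> Dict[str, float]:
    """Loose coupling: fetch raw rows, aggregate in the mining program."""
    totals: Dict[str, float] = defaultdict(float)
    for region, amount in db.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    return dict(totals)


def totals_semi_tight(db: sqlite3.Connection) -> Dict[str, float]:
    """Semi-tight coupling: let the DB system run the aggregation primitive."""
    rows = db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return dict(rows.fetchall())


print(totals_loose(conn))
print(totals_semi_tight(conn))
```

Both functions return the same totals; the second moves less data out of the database and can benefit from its indexing and query optimization.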
Guidelines for Successful Data Mining
• The data must be available
• The data must be relevant, adequate and clean
• There must be a well-defined problem
• The problem should not be solvable by means of ordinary
query or OLAP tools
• The results must be actionable

Guidelines for Successful Data Mining
1. Use a small team with strong internal integration and a
loose management style.
2. Carry out a small pilot project before a major data mining
project.
3. Identify a clear problem owner responsible for the project.
This could be someone in sales or marketing, which will benefit
the external integration.

4. Try to realize a positive return on investment within 6 to 12
months.
5. The whole data mining project should have the support of
the top management of the company.

Statistical Data Mining
• Some of the Statistical Data Mining Techniques are as follows −
• Regression − Regression methods are used to predict the value of the
response variable from one or more predictor variables where the
variables are numeric. Listed below are the forms of Regression −
– Linear
– Multiple
– Weighted
– Polynomial
– Nonparametric
– Robust
• Generalized Linear Models − These models allow a categorical response
variable to be related to a set of predictor variables in a manner similar
to the modelling of a numeric response variable using linear regression.
Generalized linear models include −
– Logistic Regression
– Poisson Regression
• Analysis of Variance − This technique analyzes experimental data for
two or more populations described by a numeric response variable and
one or more categorical variables (factors).
• Mixed-effect Models − These models are used for analyzing
grouped data. These models describe the relationship between a
response variable and some co-variates in the data grouped
according to one or more factors.
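• Factor Analysis − Factor analysis is used to determine which variables
combine to generate a given factor.
• Discriminant Analysis − Discriminant analysis is used to predict a
categorical response variable. This method assumes that the independent
variables follow a multivariate normal distribution.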
• Time Series Analysis − Following are the methods for analyzing
time-series data −
– Auto-regression Methods.
– Univariate ARIMA (AutoRegressive Integrated Moving Average)
Modeling.
– Long-memory time-series modeling.
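
To make the Regression item concrete, here is a minimal sketch of simple linear regression (one numeric predictor) fitted by ordinary least squares. The function name `fit_line` and the sample points are assumptions for illustration; the other regression forms listed above need more machinery than this.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# Predict the response for an unseen predictor value.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```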
Data Mining - Applications & Trends
• Data mining is widely used in diverse areas. A number of
commercial data mining systems are available today, yet
many challenges remain in this field. This section
discusses the applications and trends of data mining.
• Data Mining Applications
– Financial Data Analysis
– Retail Industry
– Telecommunication Industry
– Biological Data Analysis
– Other Scientific Applications
– Intrusion Detection
