
UNIT-1

INTRODUCTION TO DATA MINING

Q) Define data mining.


Data mining is the process of extracting or mining knowledge from large
amounts of data. It would more appropriately have been named knowledge
mining, a term that emphasizes the mining of knowledge from large amounts
of data.

Q) What motivated data mining? Why is it important?


The major reason that data mining has attracted a great deal of attention in
the information industry in recent years is the wide availability of huge
amounts of data and the imminent need for turning such data into useful
information and knowledge.

The information and knowledge gained can be used for applications ranging
from business management, production control, and market analysis, to
engineering design and science exploration.
The evolution of database technology has also contributed, by making the
collection and storage of such huge amounts of data feasible.
Q) Explain Knowledge Discovery in Databases(KDD). (or)
Explain Data mining architecture with a neat sketch.
Data mining is a process of extracting or mining knowledge from large
amounts of data. The knowledge discovery process involves the following
steps:

Fig: Data mining as a process of Knowledge discovery

Data Cleaning - In this step, noise and inconsistent data are removed.
Data Integration - In this step, multiple data sources are combined.
Data Selection - In this step, data relevant to the analysis task are retrieved
from the database.
Data Transformation - In this step, data are transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations.
Data Mining - In this step, intelligent methods are applied in order to extract
data patterns.
Pattern Evaluation - In this step, the truly interesting patterns representing
knowledge are identified based on interestingness measures.
Knowledge Presentation - In this step, the mined knowledge is presented to
the user in various forms such as charts and graphs. (A small end-to-end
sketch of these steps follows below.)
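These steps can be illustrated end to end in code. The following is a minimal
sketch only, assuming Python with the pandas library; the two sources and the
column names (cust_id, amount, age) are invented for illustration:

import pandas as pd

# Data integration: combine two hypothetical sources on a shared key.
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3],
                      "amount": [120.0, None, 80.0, 95.0]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3], "age": [34, 41, 29]})
data = sales.merge(profiles, on="cust_id")

# Data cleaning: remove rows with missing (noisy) measurements.
data = data.dropna(subset=["amount"])

# Data selection: keep only the attributes relevant to the analysis task.
relevant = data[["age", "amount"]]

# Data transformation: consolidate by aggregation.
summary = relevant.groupby("age")["amount"].sum()

# Data mining and pattern evaluation, trivially: the age with the highest total.
print(summary.idxmax(), summary.max())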
Fig: The architecture of a typical data mining system

• Database, data warehouse, or other information repository: one or a set
of databases, data warehouses, or other kinds of information repositories.
Data cleaning and data integration techniques may be applied to these data.

• Database or data warehouse server: fetches the relevant data from the
databases or data warehouses based on the user's data mining request.

• Knowledge base: includes the domain knowledge that is used to guide the
search and to evaluate the interestingness of the resulting patterns.

• Data mining engine: a set of functional modules that apply methods for
characterization, association, classification, clustering, outlier analysis, and
evolution analysis.

• Pattern evaluation module: applies interestingness measures such as
support and confidence to filter out the required patterns. To confine the
search, pattern interestingness may be pushed deep into the mining
process.

• Graphical user interface: the interface between the user and the data
mining system. It allows the user to interact with the system by posing
queries and setting thresholds that focus the mining.

Q) Explain on what kind of data, data mining can be performed.

Data mining is applicable to a wide variety of data repositories, including the following.

1. Relational Databases:

A database management system (DBMS) consists of a collection of
interrelated data, called a database. A relational database is a collection of
tables; each table consists of a set of attributes and stores a large set of
tuples. Each tuple represents an object identified by a unique key.
Eg. Relational schema for a relational database, AllElectronics.
customer (cust ID, name, address, age, occupation, annual income, credit
information, category )
item (item ID, brand, category, type, price, place made, supplier, cost )
employee (empl ID, name, category, group, salary, commission)
branch (branch ID, name, address)
purchases (trans ID, cust ID, empl ID, date, time, method paid, amount)
items sold (trans ID, item ID, qty)
works at (empl ID, branch ID)

2. Data Warehouses:

A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single site.
A data warehouse is modeled by a multidimensional database structure,
where each dimension corresponds to an attribute or a set of attributes in
the schema. Precomputation and fast access of summarized data are
possible with a multidimensional data cube.

Fig. Typical framework of a data warehouse for AllElectronics.


Fig. A multidimensional data cube, commonly used for data warehousing,
(a) showing summarized data for AllElectronics and (b) showing summarized
data resulting from drill-down and roll-up operations on the cube in (a). For
improved readability, only some of the cube cell values are shown.
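To make the idea concrete, a tiny two-dimensional view of such a cube can be
simulated with a pandas pivot table in Python. This is a sketch only; the
dimension values and sales figures below are invented:

import pandas as pd

# Toy fact data: dimensions (quarter, item) and one measure (dollars_sold).
facts = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item": ["TV", "computer", "TV", "computer"],
    "dollars_sold": [605.0, 825.0, 680.0, 952.0],
})

# Summarize dollars_sold by quarter and item; margins=True adds the
# precomputed "All" totals that a data cube would store.
cube = facts.pivot_table(index="quarter", columns="item",
                         values="dollars_sold", aggfunc="sum", margins=True)
print(cube)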

3. Transactional Database:

A transactional database consists of records in which each record
represents a transaction, with a transaction identity number and a list of
the items making up the transaction.
Eg.

Fig. Fragment of a transactional database for sales at AllElectronics.

Advanced Database Systems

4. Object oriented Databases:


These are based on the object-oriented programming paradigm, where each
entity is considered as an object. Each object is associated with a set of
variables that describe it. Objects communicate with each other using
messages, and a method holds the code used to implement a message.

Object relational databases:

These are constructed based on the object-relational data model, which
extends the relational model by providing a rich data type for handling
complex objects and object orientation. Class hierarchies and object
inheritance properties are added to the relational data model.

5. Spatial Database:

These databases contain spatially related information and include
geographic databases and medical and satellite image databases. A 2-D
satellite image may be represented as raster data, while maps are
represented in vector format.

6. Temporal databases and time series databases:

Both use time-related data. A temporal database usually stores relational
data that include time-related attributes. A time-series database stores
sequences of values that change with time, such as data collected regarding
the stock exchange.

7. Text Databases and Multimedia Database:

Text databases are databases that contain word descriptions for objects.
These descriptions may not be simple key words but rather long sentences
or paragraphs.
Multimedia databases store image, audio and video data. Specialized storage
and search techniques are required to access multimedia databases.

8. Heterogeneous and legacy databases:

A heterogeneous database consists of interconnected component databases
in which the objects of one component may differ greatly from those of
another. A legacy database is a group of heterogeneous databases that
combines different kinds of data systems.
9. World Wide Web:

The World Wide Web links worldwide online information services together
to facilitate interactive access. Users can traverse from one object to
another in search of information.

Q) Explain indetail about Data Mining Functionalities.

Data mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks. Descriptive mining tasks characterize the
general properties of the data, while predictive mining tasks perform
inference on the current data in order to make predictions.

Characterization & Discrimination:

Summarization of the data of the class under study (the target class) is
called data characterization.
Eg. summarizing the characteristics of a student who has obtained more
than 75% in every semester; the result could be a general profile of the
student.
Data discrimination means comparison of the target class with comparative
classes.
Eg. The general features of students with high GPAs may be compared with
the general features of students with low GPAs.

Association Analysis:
Frequent patterns are patterns that occur frequently in data. A frequent
itemset typically refers to a set of items that often appear together in a
transactional data set.
Eg. milk and bread, which are frequently purchased together.

Association analysis is the discovery of association rules showing
attribute-value conditions that occur frequently together.
A rule X => Y is interpreted as: database tuples that satisfy the conditions
in X are also likely to satisfy the conditions in Y.
Eg. age(X, "20..29") ^ income(X, "20K..29K") => buys(X, "CD Player")
[support = 2%, confidence = 60%]

An association defined among multiple attributes (predicates) is called a
multidimensional association rule, whereas an association rule that
contains a single predicate is referred to as a single-dimensional rule.
Eg. buys(X, "Computer") => buys(X, "Software") [support = 1%, confidence = 50%]
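Both measures can be computed directly from a transaction list. Below is a
minimal pure-Python sketch, with made-up transactions, for the rule
{milk} => {bread}:

# Toy transactional data set.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

x, y = {"milk"}, {"bread"}
n_xy = sum(1 for t in transactions if x <= t and y <= t)  # tuples satisfying X and Y
n_x = sum(1 for t in transactions if x <= t)              # tuples satisfying X

support = n_xy / len(transactions)  # fraction containing both = 2/4 = 50%
confidence = n_xy / n_x             # P(Y | X)              = 2/3 ~ 67%
print(support, confidence)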

Classification & Prediction:

Classification is the process of finding a set of models that describe
and distinguish data classes or concepts. Such models can be used to
predict the class labels of data objects whose labels are unknown.
Fig. A classification model can be represented in various forms: (a) IF-THEN
rules, (b) a decision tree, or (c) a neural network.
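As a concrete illustration of the decision-tree form, the sketch below fits and
applies a classifier. It assumes the third-party scikit-learn library, and the
attribute values and labels are invented:

from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, income in K] -> class label (0 = no, 1 = buys_computer).
X = [[25, 30], [45, 80], [35, 60], [50, 90], [23, 20]]
y = [0, 1, 1, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[30, 75]]))  # predicted class label for a new object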

Cluster Analysis:

Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing
the intra-class similarity and minimizing the inter-class similarity.

Fig. A 2-D plot of customer data with respect to customer locations in a


city, showing three data clusters.
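One widely used clustering method is k-means. The short sketch below,
assuming scikit-learn and invented customer coordinates, mirrors the
three-cluster figure above:

from sklearn.cluster import KMeans

# Customer locations in a city as (x, y) coordinates.
points = [[1, 2], [1, 3], [8, 8], [9, 8], [15, 1], [16, 2]]

# Group the points into three clusters, as in the figure.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
print(labels)  # cluster index assigned to each customer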
Outlier Analysis:

The data objects that do not comply with the general behavior or
model of the data can be considered as outliers. Outliers may be detected
using statistical tests that assume a distribution or probability model for the
data. Distance-based or deviation-based methods may also be used to
identify outliers.
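A simple statistical test of the kind mentioned flags values far from the mean.
A standard-library-only Python sketch with invented data:

from statistics import mean, stdev

values = [10.1, 9.8, 10.3, 10.0, 9.9, 25.7]  # one suspicious measurement
mu, sigma = mean(values), stdev(values)

# Flag values more than two standard deviations from the mean as outliers.
outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)  # [25.7]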

Q) Classify Data Mining Systems

Data mining is an interdisciplinary field, the confluence of a set of
disciplines including database systems, statistics, machine learning,
visualization, and information science. Data mining systems can be
categorized according to the following criteria.

Classification according to the kinds of databases mined: Data mining
systems can be classified according to the kinds of databases mined, for
example according to data models such as relational, transactional,
object-oriented, and data warehouse systems.

Classification according to the kinds of knowledge mined: Data mining
systems can be categorized according to the kinds of knowledge they mine,
that is, based on data mining functionalities such as characterization,
discrimination, association, classification, clustering, outlier analysis, and
evolution analysis. Data mining systems can also be distinguished based on
the level of abstraction of the knowledge mined, such as generalized
knowledge or knowledge at the primitive (raw data) level.

Classification according to the kind of techniques utilized: Data mining
systems can be classified according to the underlying data mining
techniques employed, described according to the degree of user interaction
involved or the methods of data analysis employed.

Q) Explain various technologies used in Data Mining.

Data mining involves an integration of techniques from multiple disciplines
such as database technology, statistics, machine learning, and
high-performance computing.

Statistics:
Statistics is the collection, analysis, interpretation or explanation, and
presentation of data. Data mining has an inherent connection with
statistics.
Statistical models are widely used to model data and data classes.
Eg. Data mining tasks like data characterization and classification uses
statistics.

Machine Learning
Machine learning investigates how computers can learn based on data.
Classic problems in machine learning that are highly related to data mining
are:
Supervised learning is basically a synonym for classification. The
supervision in the learning comes from the labeled examples in the training
data set.
Eg. Postal code recognition problem.

Unsupervised learning is essentially a synonym for clustering. The learning


process is unsupervised since the input examples are not class labeled.
Eg. Identifying digits without labels.

Semi-supervised learning is a class of machine learning techniques that


make use of both labeled and unlabeled examples when learning a model.
Eg. labeled examples are used to learn class models and unlabeled examples
are used to refine the boundaries between classes.

Active learning is a machine learning approach that lets users play an active
role in the learning process.
Eg. An active learning approach can ask a user (e.g., a domain
expert) to label an example, which may be from a set of unlabeled examples.

Fig. Data mining adopts techniques from many domains.

Database Systems and Data Warehouses


Many data mining tasks need to handle large data sets or even real-time,
fast streaming data. Therefore, data mining can make good use of scalable
database technologies to achieve high efficiency and scalability on large data
sets.
Information Retrieval
Information retrieval (IR) is the science of searching for documents or
information in documents. Documents can be text or multimedia, and may
reside on the Web.

Q) Discuss Major Issues In Data Mining.

1. Mining Methodology: Researchers have been vigorously developing


new data mining methodologies. This involves the investigation of new
kinds of knowledge.

Mining various and new kinds of knowledge: Due to the diversity of


applications, new mining tasks continue to emerge, making data mining a
dynamic and fast-growing field.

Mining knowledge in multidimensional space: When searching for


knowledge in large data sets, we can explore the data in multidimensional
space like cube. That is, we can search for interesting patterns among
combinations of dimensions (attributes) at varying levels of abstraction.
Such mining is known as (exploratory) multidimensional data mining.

Data mining—an interdisciplinary effort: The power of data mining can be


substantially enhanced by integrating new methods from multiple
disciplines.
Eg. To mine data with natural language text, it makes sense to fuse data
mining methods with methods of information retrieval and natural language
processing.

Boosting the power of discovery in a networked environment: Most data


objects reside in a linked or interconnected environment, whether it be the
Web, database relations, files, or documents.
Knowledge derived in one set of objects can be used to boost the discovery of
knowledge in a “related” or semantically linked set of objects.

Handling uncertainty, noise, or incompleteness of data: Data cleaning,


data preprocessing, outlier detection and removal, and uncertainty
reasoning are examples of techniques that need to be integrated with the
data mining process.

Pattern evaluation and pattern- or constraint-guided mining:


Techniques are needed to assess the interestingness of discovered patterns
based on subjective measures.

2. User Interaction:
The user plays an important role in the data mining process. Interesting areas of
research include:
Interactive mining: The data mining process should be highly interactive.
Thus, it is important to build flexible user interfaces and an exploratory
mining environment, facilitating the user's interaction with the system.

Incorporation of background knowledge: Background knowledge,


constraints, rules, and other information regarding the domain under study
should be incorporated into the knowledge discovery process. Such
knowledge can be used for pattern evaluation as well as to guide the search
toward interesting patterns.

Ad hoc data mining and data mining query languages: Query languages
(e.g., SQL) have played an important role in flexible searching because they
allow users to pose ad hoc queries. Similarly, high-level data mining query
languages or other high-level flexible user interfaces will give users the
freedom to define ad hoc data mining tasks.

Presentation and visualization of data mining results: Data mining
systems should present results flexibly, so that the discovered knowledge
can be easily understood and directly used by humans.

3. Efficiency and Scalability:

Efficiency and scalability of data mining algorithms: The running time of


a data mining algorithm must be predictable, short, and acceptable by
applications.

Parallel, distributed, and incremental mining algorithms: The


humongous size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that
motivate the development of parallel and distributed data-intensive mining
algorithms. Such algorithms first partition the data into “pieces.” Each piece
is processed, in parallel, by searching for patterns.

Cloud computing and cluster computing: Cloud and cluster computing,
which use computers in a distributed and collaborative way to tackle very
large-scale computational tasks, are also active research themes in parallel
data mining.

4. Diversity of Database Types:

The wide diversity of database types brings about challenges to data mining.
These include:

Handling complex types of data: Different applications generate a wide


spectrum of new data types, from structured data such as relational and
data warehouse data to semi-structured and unstructured data, from stable
data repositories to dynamic data streams. The construction of effective and
efficient data mining tools for diverse applications remains a challenging and
active area of research.
Mining dynamic, networked, and global data repositories:

The discovery of knowledge from different sources of structured,
semi-structured, or unstructured yet interconnected data with diverse data
semantics poses great challenges to data mining.

5. Data Mining and Society

Social impacts of data mining: With data mining penetrating our everyday
lives, it is important to study the impact of data mining on society. How can
we use data mining technology to benefit society? How can we guard against
its misuse?
The improper disclosure or use of data and the potential violation of
individual privacy and data protection rights are areas of concern that need
to be addressed.

Privacy-preserving data mining: Data mining will help scientific discovery,
business management, economic recovery, and security protection (e.g., the
real-time discovery of intruders and cyber attacks). However, it poses the
risk of disclosing an individual's personal information.


Q) Discuss data mining applications

Business Intelligence:
Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations. Examples include reporting, online
analytical processing, business performance management.

Web Search Engines:


A Web search engine is a specialized computer server that searches for
information on the Web. Web search engines are essentially very large data
mining applications.
Various data mining techniques are used in all aspects of search engines,
ranging from crawling (e.g., deciding which pages should be crawled and the
crawling frequencies),
Indexing (e.g., selecting pages to be indexed and deciding to which extent
the index should be constructed), and
searching (e.g., deciding how pages should be ranked, which advertisements
should be added, and how the search results can be personalized or made
“context aware”).
Other applications: Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, text mining, Web mining, etc.
INTRODUCTION TO DATA WAREHOUSE

Q) Define Data warehouse.

"A warehouse is a subject-oriented, integrated, time-variant and non-volatile


collection of data in support of management's decision making process".

Subject Oriented: Data that gives information about a particular subject


instead of about a company's ongoing operations.
Eg. Customer, items etc.
Integrated: Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular
time period.
Eg. 2000, 2001, 2002, etc.
Non-volatile: Data is stable in a data warehouse. More data is added but
data is never removed.

Benefits of data warehousing:

Data warehouses are designed to perform well with aggregate queries


running on large amounts of data.

The structure of data warehouses is easier for end users to navigate,
understand, and query against, unlike the relational databases that are
primarily designed to handle lots of transactions.

Data warehouses enable queries that cut across different segments of a


company's operation. E.g. production data could be compared against
inventory data even if they were originally stored in different databases with
different structures.

Data warehousing is an efficient way to manage and report on data that is


from a variety of sources, non uniform and scattered throughout a company.

Data warehousing is an efficient way to manage demand for lots of


information from lots of users.

Data warehousing provides the capability to analyze large amounts of
historical data for insights that can provide an organization with a
competitive advantage.
Q) Difference between Operational databases & data warehouses:

Feature                   OLTP                       OLAP
Characteristic            Operational processing     Informational processing
Orientation               Transaction                Analysis
User                      Clerk, DBA                 Knowledge worker
Function                  Day-to-day operations      Long-term informational requirements
DB design                 ER-based                   Star / snowflake
Data                      Current                    Historical
Summarization             Primitive                  Summarized
View                      Detailed                   Summarized
Unit of work              Simple transaction         Complex query
Access                    Read/write                 Mostly read
Focus                     Data in                    Information out
Operations                Index-based                Lots of scans
No. of records accessed   Tens                       Millions
No. of users              Thousands                  Hundreds
DB size                   100 MB to GB               100 GB to TB
Priority                  High performance           High flexibility
Metric                    Transaction throughput     Query throughput

Q) Explain architecture of Data warehouse with the help of a neat


diagram.

1. Bottom-Tier:

The bottom tier is a warehouse database server that is almost always a


relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile
information provided by external consultants).
These tools and utilities perform data extraction, cleaning, and
transformation, as well as load and refresh functions to update the data
warehouse.

The data are extracted using application program interfaces known as


gateways. A gateway is supported by the underlying DBMS and allows
client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connection) and


OLEDB (Object Linking and Embedding Database) by Microsoft and JDBC
(Java Database Connection).

This tier also contains a metadata repository, which stores information


about the data warehouse and its contents.

Fig. 3-Tier data warehousing architecture


2. Middle-Tier:
The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP(ROLAP) model (i.e., an extended relational DBMS that
maps operations on multidimensional data to standard relational
operations); or
(2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server
that directly implements multidimensional data and operations).

3. Top-Tier:
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).

Data Warehouse Models

There are three data warehouse models: the enterprise warehouse, the data
mart, and the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the


information about subjects spanning the entire organization.
It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-
functional in scope.
It typically contains detailed data as well as summarized data, and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or
beyond.

Data mart: A data mart contains a subset of corporate-wide data that is of


value to a specific group of users.
The scope is connected to specific, selected subjects.
Eg. a marketing data mart may connect its subjects to customer, item, and
sales. The data contained in data marts tend to be summarized.
Depending on the source of data, data marts can be categorized into the
following two classes:
(i).Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data
generated locally within a particular department or geographic area.
(ii).Dependent data marts are sourced directly from enterprise data
warehouses.

Virtual warehouse: A virtual warehouse is a set of views over operational


databases. For efficient query processing, only some of the possible
summary views may be materialized. A virtual warehouse is easy to build
but requires excess capacity on operational database servers.
Fig. Recommended approach for data warehouse development

Extraction, Transformation, and Loading(ETL):


Data warehouse systems use back-end tools and utilities to populate and
refresh their data. These tools and utilities include the following functions:

Data extraction, which typically gathers data from multiple, heterogeneous,


and external sources.
Data cleaning, which detects errors in the data and rectifies them when
possible.
Data transformation, which converts data from legacy or host format to
warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks
integrity, and builds indices and partitions.
Refresh, which propagates the updates from the data sources to the
warehouse.

Besides cleaning, loading, refreshing, and metadata definition tools, data


warehouse systems usually provide a good set of data warehouse
management tools.

Metadata Repository:
Metadata are data about data. A metadata repository should contain the
following:

A description of the data warehouse structure, which includes the


warehouse schema, view, dimensions, hierarchies, and derived data
definitions, as well as data mart locations and contents.

Operational metadata, which include data lineage (history of migrated data
and the sequence of transformations applied to it), currency of data (active,
archived, or purged), and monitoring information (warehouse usage
statistics, error reports, and audit trails).
The algorithms used for summarization and reports.

Mapping from the operational environment to the data warehouse.

Data related to system performance, which include indices and profiles


that improve data access and retrieval performance.

Business metadata, which include business terms and definitions, data


ownership information, and charging policies.

Q) What is a multi-dimensional model? Explain with an example.

Data warehouses and OLAP tools are based on multidimensional data


model.

Data cube allows data to be modeled and viewed in multiple


dimensions.

Each dimension is associated with a table called a dimension table.
Multidimensional data are organized around a central theme, represented
by a fact table.

Facts are numerical measures. The fact table contains the names of the
facts (measures) and keys to each of the related dimension tables.

The cuboid that holds the lowest level of summarization is called the base
cuboid. The 0-D cuboid, which holds the highest level of summarization, is
called the apex cuboid.

Fig. Multi-Dimensional Model


Q) Explain schemas for Multidimensional databases.

1. Star schema: The star schema is a modeling paradigm in which the


data warehouse contains (1) a large central table (fact table), and (2) a
set of smaller attendant tables (dimension tables), one for each
dimension. The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern around the central fact
table.

Fig. Star Schema of sales data warehouse

2. Snowflake schema: The snowflake schema is a variant of the star


schema model, where some dimension tables are normalized, thereby
further splitting the data into additional tables. The resulting schema
graph forms a shape similar to a snowflake.
Fig. Snowflake schema for sales data warehouse

3. Fact constellation: Sophisticated applications may require multiple


fact tables to share dimension tables. This kind of schema can be
viewed as a collection of stars, and hence is called a galaxy schema or
a fact constellation.

Fig. Fact Constellation schema of sales and shipping data warehouse


Eg.
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)

Defining a Snowflake Schema in DMQL


define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))

Defining a Fact Constellation in DMQL


define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
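DMQL is illustrative rather than directly executable, but the same star schema
can be expressed in ordinary SQL. Below is a minimal sketch using Python's
built-in sqlite3 module; the time dimension is named time_dim here only to
avoid keyword ambiguity, and column types are left loose for brevity:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY,
                       day, day_of_week, month, quarter, year);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY,
                       item_name, brand, type, supplier_type);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY,
                       branch_name, branch_type);
CREATE TABLE location (location_key INTEGER PRIMARY KEY,
                       street, city, province_or_state, country);

-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (time_key REFERENCES time_dim,
                    item_key REFERENCES item,
                    branch_key REFERENCES branch,
                    location_key REFERENCES location,
                    dollars_sold REAL,
                    units_sold INTEGER);
""")
print("star schema created")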

Q) Explain briefly about Concept hierarchy.

A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts. Many concept hierarchies
are implicit within the database schema.
Concept hierarchies may be provided manually by system users, domain
experts, or knowledge engineers, or may be automatically generated based
on statistical analysis of the data distribution.

A concept hierarchy that is a total or partial order among attributes in a


database schema is called a schema hierarchy.

Concept hierarchies may also be defined by discretizing or grouping values


for a given dimension, resulting in a set-grouping hierarchy.
–A total or partial order can be defined among groups of values.

Fig. Concept Hierarchy


Concept hierarchies can be used to generalize data by replacing low-level
values (such as “day” for the time dimension) by higher-level abstractions
(such as “year”), or to specialize data by replacing higher-level abstractions
with lower-level values.
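Programmatically, one level of such a hierarchy is just a lookup table. A toy
Python sketch of a location hierarchy (city to country; the values are
invented):

# One level of a set-grouping hierarchy: city -> country.
city_to_country = {
    "Vancouver": "Canada", "Toronto": "Canada",
    "Chicago": "USA", "New York": "USA",
}

cities = ["Vancouver", "Chicago", "Toronto"]
generalized = [city_to_country[c] for c in cities]  # roll low-level values up
print(generalized)  # ['Canada', 'USA', 'Canada']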

Q) Explain different OLAP Operations with an example.

In the multidimensional model, data are organized into multiple dimensions,


and each dimension contains multiple levels of abstraction defined by
concept hierarchies.

This organization provides users with the flexibility to view data from
different perspectives.

A number of OLAP data cube operations exist to materialize these different


views, allowing interactive querying and analysis of the data at hand.

Typical OLAP Operations are:

•Roll up (drill-up): The roll-up operation (also called the drill-up operation)
performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction.
•Drill down (roll down): Drill-down is the reverse of roll-up. It navigates
from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.

•Slice and dice:


The slice operation performs a selection on one dimension of the given
cube, resulting in a subcube.
The dice operation defines a subcube by performing a selection on two or
more dimensions.

•Pivot (rotate): Pivot (rotate) is a visualization operation that rotates the


data axes in view in order to provide an alternative presentation of the data.

Fig. OLAP Operations
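The four operations map naturally onto grouping and selection over a fact
table. A rough pandas sketch, with invented dimension values and figures:

import pandas as pd

facts = pd.DataFrame({
    "year": [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city": ["Vancouver", "Chicago", "Vancouver", "Chicago"],
    "dollars_sold": [605.0, 825.0, 680.0, 952.0],
})

# Roll-up: climb the time hierarchy from quarter to year.
rollup = facts.groupby("year")["dollars_sold"].sum()

# Drill-down: step back down to (year, quarter) detail.
drill = facts.groupby(["year", "quarter"])["dollars_sold"].sum()

# Slice: select on a single dimension.
slice_2023 = facts[facts["year"] == 2023]

# Dice: select on two or more dimensions.
dice = facts[(facts["year"] == 2023) & (facts["city"] == "Vancouver")]

# Pivot: rotate the axes of the view.
pivot = facts.pivot_table(index="city", columns="year",
                          values="dollars_sold", aggfunc="sum")
print(rollup, pivot, sep="\n\n")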


Q) Explain Data Warehouse Design Process and Implementation.

To design an effective data warehouse we need to understand and analyze
business needs and construct a business analysis framework. Four different
views regarding a data warehouse design must be considered:

The top-down view allows the selection of the relevant information


necessary for the data warehouse. This information matches current and
future business needs.
The data source view exposes the information being captured, stored, and
managed by operational systems.
This information may be documented at various levels
of detail and accuracy, from individual data source tables to integrated data
source tables.
The data warehouse view includes fact tables and dimension tables. It
represents the information that is stored inside the data warehouse.
The business query view is the data perspective in the data warehouse
from the end-user's viewpoint.

Design Process:

A data warehouse can be built using a top-down approach, a bottom-up


approach, or a combination of both.
The top-down approach starts with overall design and planning. It is
useful in cases where the technology is mature and well known, and where
the business problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This
is useful in the early stage of business modeling and technology
development.

From a software engineering point of view:


The waterfall method performs a structured and systematic analysis at
each step before proceeding to the next, which is like a waterfall, falling
from one step to the next.
The spiral method involves the rapid generation of increasingly functional
systems, with short intervals between successive releases. This is
considered a good choice for data warehouse development.

The warehouse design process consists of the following steps:


Choose a business process to model; if the business process is
organizational and involves multiple complex object collections, a data
warehouse model should be followed.
Eg. orders, invoices, shipments, inventory, account administration, sales.

Choose the business process grain, which is the fundamental, atomic level
of data to be represented in the fact table for this process
Eg. individual transactions, individual daily snapshots.

Choose the dimensions that will apply to each fact table record
Eg. time, item, customer, supplier

Choose the measures that will populate each fact table record.
Eg. Typical measures are numeric additive quantities like dollars sold and
units sold.

There are three kinds of data warehouse applications:

Information processing supports querying, basic statistical analysis, and


reporting using crosstabs, tables, charts, or graphs.

Analytical processing supports basic OLAP operations, including slice-and-


dice, drill-down, roll-up, and pivoting. It generally operates on historic data
in both summarized and detailed forms.

Data mining supports knowledge discovery by finding hidden patterns and


associations, constructing analytical models, performing classification and
prediction, and presenting the mining results using visualization tools.

Data Warehouse Implementation:

Data cube can be viewed as a lattice of cuboids.


The bottom-most cuboid is the base cuboid.
The top-most cuboid (apex) contains only one cell.

Fig. Data Cube with 3-Dimensions

For a cube with n dimensions, there are a total of 2^n cuboids (without
hierarchies), including the base cuboid; in the above data cube this is
2^3 = 8. For a cube with n dimensions in which dimension i has L_i
hierarchy levels, the total number of cuboids is

    T = \prod_{i=1}^{n} (L_i + 1)

and in the above data cube it is 24.
Cube definition and computation in DMQL

define cube sales [item, city, year]: sum (sales_in_dollars)

compute cube sales

SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city,
year;

Q) Explain about the following:


a. Data cube Materialization
b. Bitmap Index
c. Join Index

To facilitate efficient data accessing, most data warehouse systems support


index structures and materialized views:

There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “nonbase” cuboids.
This leads to computing expensive multidimensional aggregates on-the-fly,
which can be extremely slow.

2. Full materialization: Precompute all of the cuboids. The resulting lattice


of computed cuboids is referred to as the full cube. This choice typically
requires huge amounts of memory space in order to store all of the
precomputed cuboids.

3. Partial materialization: Selectively compute a proper subset of the whole


set of possible cuboids.

Bitmap Indexing:
The bitmap indexing method is popular in OLAP products because it allows
quick searching in data cubes. The bitmap index is an alternative
representation of the record ID (RID) list.

• Index on a particular column.
• Each distinct value in the column has a bit vector; bit operations are fast.
• The length of each bit vector equals the number of records in the base table.
• The i-th bit is set if the i-th row of the base table has that value for the
indexed column.
• Not suitable for high-cardinality domains.
Fig. Bitmap Index on Region and Type
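A bitmap index is straightforward to sketch in Python: one bit vector per
distinct column value, with the i-th bit set when row i holds that value. The
region data below are invented:

# Base table column "region" for five records (RIDs 0..4).
region = ["Asia", "Europe", "Asia", "America", "Europe"]

# One bit vector per distinct value of the indexed column.
bitmaps = {v: [int(r == v) for r in region] for v in set(region)}
print(bitmaps["Asia"])  # [1, 0, 1, 0, 0]

# Bit operations make selections fast: region = Asia OR region = Europe.
asia_or_europe = [a | b for a, b in zip(bitmaps["Asia"], bitmaps["Europe"])]
print(asia_or_europe)   # [1, 1, 1, 0, 1]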

Join Indexing:
The join indexing method gained popularity from its use in relational
database query processing.
Traditional indexing maps the value in a given column to a list of rows
having that value. In contrast, join indexing registers the joinable rows of
two relations from a relational database.

Fig. Join Index on location, sales and item
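Conceptually, a join index is a precomputed map from a dimension value to
the fact-table rows that join with it. A toy Python sketch, with invented RIDs
and location values:

# Each sales record (RID) joins with one location value.
sales_location = {"T57": "Main Street", "T238": "Main Street",
                  "T884": "Lakeshore Blvd"}

# Join index: location value -> RIDs of joinable sales records.
join_index = {}
for rid, loc in sales_location.items():
    join_index.setdefault(loc, []).append(rid)

print(join_index["Main Street"])  # ['T57', 'T238'] -- no join work at query time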

Q) Discuss Data Generalization by Attribute-Oriented Induction.

Attribute Oriented Induction:

• Collect the task-relevant data (the initial relation) using a relational
database query.
• Perform generalization by attribute removal or attribute generalization.
• Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
• Interact with users for knowledge presentation.

Eg. Describe general characteristics of graduate students in the University


database.

Step 1. Fetch relevant set of data using an SQL statement, e.g.,

Select name, gender, major, birth_place, birth_date, residence, phone#, gpa


from student
where student_status in {“Msc”, “MBA”, “PhD” }

Step 2. Perform attribute-oriented induction


Step 3. Present results in generalized relation, cross-tab, or rule forms.

Attribute removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A's
higher-level concepts are expressed in terms of other attributes.

Attribute generalization: if there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.

Attribute-Oriented Induction Algorithm:

• InitialRel: query processing of the task-relevant data, deriving the initial
relation.
• PreGen: based on the analysis of the number of distinct values in each
attribute, determine a generalization plan for each attribute: removal, or
how high to generalize.
• PrimeGen: based on the PreGen plan, perform generalization to the right
level to derive a "prime generalized relation", accumulating the counts.
• Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting,
(3) mapping into rules, cross-tabs, or visualization presentations.
(Visualization techniques: pie charts, bar charts, curves, cubes, and other
visual forms.)
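The whole procedure can be sketched end to end in a few lines of Python. The
student tuples, the city-to-country hierarchy, and the GPA bands below are
all invented for illustration:

from collections import Counter

# Initial relation: (name, birth_place, gpa). "name" has all-distinct values
# and no generalization operator, so it is dropped by attribute removal.
students = [("Alice", "Vancouver", 3.9), ("Bob", "Chicago", 3.5),
            ("Carol", "Toronto", 3.8), ("Dan", "New York", 3.6)]

city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

def generalize(name, birth_place, gpa):
    # Attribute generalization: city -> country, numeric GPA -> band.
    return (city_to_country[birth_place], "high" if gpa >= 3.7 else "medium")

# Merge identical generalized tuples, accumulating counts (prime relation).
prime = Counter(generalize(*t) for t in students)
print(prime)  # Counter({('Canada', 'high'): 2, ('USA', 'medium'): 2})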
