Issues in data mining include:
Performance issues
Interactive mining of knowledge at multiple levels of abstraction
Pattern evaluation
Parallel and distributed data mining algorithms

THREE-TIER DATA WAREHOUSE ARCHITECTURE:
These are the three tiers of the data warehouse architecture:
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is a relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
Middle Tier - In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management system. The ROLAP maps the operations on multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP), which directly implements multidimensional data and operations.
Top-Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools, and data mining tools.
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following models:
Virtual Warehouse
Data Mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but building one requires excess capacity on operational database servers.
Data Mart
A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups within an organization.
In other words, we can say that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.
Points to remember about data marts:
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
The life cycle of a data mart may become complex in the long run if its planning and design are not organization-wide.
Enterprise Warehouse
An enterprise warehouse collects all of the information and the subjects spanning an entire organization.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.
The load manager performs simple transformations of the data into a structure similar to the one in the data warehouse.
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
It is more effective to load the data into a relational database prior to applying transformations and checks.
Gateway technology proves to be unsuitable, since gateways tend not to be performant when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. Only after this has been completed are we in a position to do complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
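As a rough sketch of this load-time check, the Python snippet below strips out columns that are not needed in the warehouse before the fast load. The column names and the raw_epos_rows input are hypothetical examples, not part of any specific warehouse toolset.

# Minimal sketch of a simple load-time transformation: strip out columns
# that are not required within the warehouse before loading.
# The field names below are hypothetical examples.
REQUIRED_COLUMNS = {"transaction_id", "store_id", "item_id", "quantity", "amount", "sale_date"}

def strip_unneeded_columns(raw_rows):
    # Keep only the columns the warehouse needs from each raw EPOS row.
    return [{k: v for k, v in row.items() if k in REQUIRED_COLUMNS} for row in raw_rows]

raw_epos_rows = [
    {"transaction_id": 1, "store_id": 7, "item_id": "A100", "quantity": 2,
     "amount": 9.98, "sale_date": "2024-01-02", "till_operator": "X9", "receipt_text": "..."},
]
print(strip_unneeded_columns(raw_epos_rows))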
Warehouse Manager
A warehouse manager is responsible for the warehouse management
process. It consists of third-party system software, C programs, and
shell scripts.
The size and complexity of a warehouse manager varies between specific solutions.
Backup/Recovery tool
SQL scripts
The warehouse manager performs the following operations:
Creates indexes, business views, and partition views against the base data.
Transforms and merges the source data into the published data warehouse.
Archives the data that has reached the end of its captured life.
Analyzes query profiles to determine whether indexes and aggregations are appropriate.
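A minimal sketch of two of these warehouse-manager tasks, creating an index against base data and archiving rows that have reached the end of their captured life, is shown below. It uses an in-memory SQLite database; the table and column names and the cut-off date are assumptions made for the example.

import sqlite3

# Sketch of two warehouse-manager tasks against hypothetical base data:
# creating an index and archiving rows past their captured life.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, item_id TEXT, sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_archive (sale_id INTEGER, item_id TEXT, sale_date TEXT, amount REAL)")
conn.execute("INSERT INTO sales_fact VALUES (1, 'A100', '2019-05-01', 9.98), (2, 'B200', '2024-02-01', 4.50)")

# Create an index against the base data.
conn.execute("CREATE INDEX idx_sales_date ON sales_fact (sale_date)")

# Archive data that has reached the end of its captured life (older than a cut-off).
cutoff = "2020-01-01"
conn.execute("INSERT INTO sales_archive SELECT * FROM sales_fact WHERE sale_date < ?", (cutoff,))
conn.execute("DELETE FROM sales_fact WHERE sale_date < ?", (cutoff,))
print(conn.execute("SELECT COUNT(*) FROM sales_archive").fetchone()[0], "row(s) archived")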
Query Manager
The query manager is responsible for directing the queries to the suitable tables.
The query manager is also responsible for scheduling the execution of the queries posed by the user.
It can be built using stored procedures and query scheduling software.
Detailed Information
Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data.
COMPONENTS OF A DATA MINING SYSTEM:
a) Data Sources
Database, data warehouse, World Wide Web (WWW), text files and other documents are the
actual sources of data. You need large volumes of historical data for data mining to be successful.
Organizations usually store data in databases or data warehouses. Data warehouses may
contain one or more databases, text files, spreadsheets or other kinds of information repositories.
Sometimes, data may reside even in plain text files or spreadsheets. World Wide Web or the
Internet is another big source of data.
Different Processes
The data needs to be cleaned, integrated and selected before passing it to the database or data
warehouse server. As the data is from different sources and in different formats, it cannot be used
directly for the data mining process because the data might not be complete and reliable. So, first
data needs to be cleaned and integrated. Again, more data than required will be collected from
different data sources and only the data of interest needs to be selected and passed to the
server. These processes are not as simple as we think. A number of techniques may be
performed on the data as part of cleaning, integration and selection.
b) Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data based on the data mining request
of the user.
c) Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of a number
of modules for performing data mining tasks including association, classification, characterization,
clustering, prediction, time-series analysis etc.
d) Pattern Evaluation Module
The pattern evaluation module is mainly responsible for the measure of interestingness of the
pattern by using a threshold value. It interacts with the data mining engine to focus the search
towards interesting patterns.
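As a rough illustration of thresholding on an interestingness measure, a pattern-evaluation step might keep only association rules whose support and confidence exceed user-given thresholds. The rule format, measures, and cut-off values below are assumptions made for the sketch, not a prescribed standard.

# Sketch: keep only patterns (association rules) whose interestingness
# measures exceed user-supplied thresholds.
def evaluate_patterns(rules, min_support=0.1, min_confidence=0.6):
    # rules: list of dicts with 'rule', 'support', 'confidence' keys (hypothetical format)
    return [r for r in rules if r["support"] >= min_support and r["confidence"] >= min_confidence]

candidate_rules = [
    {"rule": "bread -> butter", "support": 0.25, "confidence": 0.80},
    {"rule": "bread -> candles", "support": 0.02, "confidence": 0.90},
]
print(evaluate_patterns(candidate_rules))  # only the first rule passes both thresholds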
e) Graphical User Interface
The graphical user interface module communicates between the user and the data mining
system. This module helps the user use the system easily and efficiently without knowing the real
complexity behind the process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the knowledge base to make the result
more accurate and reliable. The pattern evaluation module interacts with the knowledge base on
a regular basis to get inputs and also to update it.
Summary
Each and every component of a data mining system has its own role and importance in completing
data mining efficiently. These different modules need to interact correctly with each other in order
to complete the complex process of data mining successfully.
OLAP:
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to it. This section covers the types of OLAP servers, operations on OLAP, and the differences between OLAP and statistical databases and OLTP.
Types of OLAP Servers
We have four types of OLAP servers:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following:
Implementation of aggregation navigation logic.
Optimization for each DBMS back end.
Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information, while the aggregations are stored separately in a MOLAP store.
OLAP Operations
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level
of city to the level of country.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the
following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.
Initially the concept hierarchy was "day < month < quarter < year."
It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
For example, a slice can be performed for the dimension "time" using the criterion time = "Q1", keeping only the first-quarter data.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data.
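The four operations above can be mimicked on a small cube held flat in a pandas DataFrame. This is only an illustrative sketch with invented dimensions and sales figures; a real OLAP server works on a multidimensional store rather than a flat table.

import pandas as pd

# A tiny "cube" as a flat table: dimensions time, location, item; measure sales.
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "country": ["India", "India", "India", "India"],
    "item":    ["mobile", "modem", "mobile", "modem"],
    "sales":   [605, 825, 680, 952],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = cube.groupby(["quarter", "country"])["sales"].sum()

# Drill-down would go the other way (e.g. from quarter to month),
# which requires the more detailed monthly data to be available.

# Slice: select one dimension value, e.g. time = "Q1", giving a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Pivot (rotate): swap the axes of the presentation.
pivoted = cube.pivot_table(index="item", columns="city", values="sales", aggfunc="sum")

print(rollup, slice_q1, pivoted, sep="\n\n")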
DATA REDUCTION:
Data reduction can be achieved using several different types of technologies. The best-known
data reduction technique is data deduplication, which eliminates redundant data on storage
systems. The deduplication process typically occurs at the storage block level. The system
analyzes the storage to see if duplicate blocks exist, and gets rid of any redundant blocks. The
remaining block is shared by any file that requires a copy of the block. If an application attempts
to modify this block, the block is copied prior to modification so that other files that depend on the
block can continue to use the unmodified version, thereby avoiding file corruption.
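The idea behind block-level deduplication can be sketched in a few lines of Python. The fixed block size and SHA-256 fingerprints are assumptions made for the example; real storage arrays use their own block sizes and metadata structures.

import hashlib

BLOCK_SIZE = 4096  # assumed block size for the example

def deduplicate(data):
    # Split data into blocks; store each distinct block once and keep a list of references.
    store = {}       # fingerprint -> block contents (stored once)
    references = []  # per-block fingerprints, in order, so the data can be reassembled
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # redundant blocks are not stored again
        references.append(fp)
    return store, references

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # four blocks, three of which are identical
store, refs = deduplicate(data)
print(len(refs), "blocks referenced,", len(store), "blocks actually stored")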
Some storage arrays track which blocks are the most heavily shared. Those blocks that are
shared by the largest number of files may be moved to a memory- or flash-storage-based cache so they can be read as efficiently as possible.
Data compression reduces the size of a file by removing redundant information from files so that
less disk space is required. This is accomplished natively in storage systems using algorithms or
formulas designed to identify and remove redundant bits of data.
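For comparison, native data compression can be illustrated with Python's standard zlib module; the repetitive sample data here is contrived so the space saving is visible.

import zlib

# Repetitive data compresses well because redundant bit patterns are removed.
original = b"2024-01-02,store-7,item-A100;" * 1000
compressed = zlib.compress(original)
print(len(original), "bytes ->", len(compressed), "bytes")
assert zlib.decompress(compressed) == original  # the compression is lossless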
Archiving data also reduces data on storage systems, but the approach is quite different. Rather
than reducing data within files or databases, archiving removes older, infrequently accessed data
from expensive storage and moves it to low-cost, high-capacity storage. Archive storage can
be disk, tape or cloud based.
ADDITIONAL ADVANTAGES AND DISADVANTAGES OF DATA WAREHOUSING:
The Advantages of Data Warehousing
Data warehouses tend to have a very high query success rate, as they have complete control over the four main areas of data management systems.
Clean data
The Disadvantages of Data Warehousing
Security issues.
Difficult to accommodate changes in data types and ranges, data source schema, indexes, and queries.
Characteristics of a Data Warehouse
Subject Oriented
Integrated
Nonvolatile
Time Variant
Subject Oriented:
Data warehouses are designed to help you analyze data. For example, to learn more about your
company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you
can answer questions like "Who was our best customer for this item last year?" This ability to define a
data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated:
Integration is closely related to subject orientation. Data warehouses must put data from disparate
sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies
among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile:
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical
because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant:
In order to discover trends in business, analysts need large amounts of data. This is very much in
contrast to online transaction processing (OLTP) systems, where performance requirements demand that
historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by
the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data
warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before
being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a
few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
DATA DISCRIMINATION:
Data discrimination is the selective filtering of information by a service
provider. This has been a new issue in the recent debate over net
neutrality. Accordingly, one should consider net neutrality in terms of a
dichotomy between types of discrimination that make economic sense
and will not harm consumers and those that constitute unfair trade
practices and other types of anticompetitive practices.[1] Non-discrimination mandates that one class of customers may not be favored over another, so the network that is built is the same for everyone, and everyone can access it.[2]
DOMAIN/KEY NORMAL FORM:
A domain constraint specifies the permissible values for a given attribute, while a key constraint
specifies the attributes that uniquely identify a row in a given table.
The domain/key normal form is achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, and enforcing key and domain constraints and conditions causes all constraints to be met. Thus, it avoids all non-temporal anomalies.
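As a concrete illustration of the two kinds of constraint, the sketch below declares a key constraint (PRIMARY KEY) and domain constraints (CHECK clauses on permissible values) on a hypothetical employee table, using SQLite through Python's standard library. The table, columns, and values are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
# Key constraint: employee_id uniquely identifies a row.
# Domain constraints: salary must be non-negative, status must come from a fixed set.
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        salary      REAL    CHECK (salary >= 0),
        status      TEXT    CHECK (status IN ('active', 'on_leave', 'terminated'))
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 50000, 'active')")     # satisfies both constraints
try:
    conn.execute("INSERT INTO employee VALUES (2, -10, 'active')")   # violates the domain constraint
except sqlite3.IntegrityError as e:
    print("rejected:", e)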
The reason to use domain/key normal form is to avoid having general constraints in the database that
are not clear domain or key constraints. Most databases can easily test domain and key constraints on
attributes. General constraints however would normally require special database programming in the
form of stored procedures (often of the trigger variety) that are expensive to maintain and expensive
for the database to execute. Therefore, general constraints are split into domain and key constraints.
It's much easier to build a database in domain/key normal form than it is to convert lesser databases
which may contain numerous anomalies. However, successfully building a domain/key normal form
database remains a difficult task, even for experienced database programmers. Thus, while the
domain/key normal form eliminates the problems found in most databases, it tends to be the most
costly normal form to achieve. However, failing to achieve the domain/key normal form may carry long-term, hidden costs due to anomalies which appear over time in databases adhering only to lower normal forms.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember
While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
Clustering Methods
Clustering methods can be classified into the following categories
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of n objects, and the partitioning method constructs k partitions of the data. Each partition will represent a cluster, and k ≤ n. This means that the method classifies the data into k groups, which satisfy the following requirements: each group contains at least one object, and each object belongs to exactly one group.
Points to remember
For a given number of partitions (say k), the partitioning method will create an
initial partitioning.
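A common partitioning method is k-means, sketched below on one-dimensional points. The fixed number of iterations, the hand-picked starting centers, and the data are assumptions made for the illustration; this is not an optimized implementation.

# Minimal illustrative k-means on 1-D points: k = 2 partitions, k <= n.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centers = [1.0, 9.0]          # initial partitioning (hand-picked for the sketch)

for _ in range(10):           # iterative relocation: reassign objects, then recompute centers
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]

print(clusters)   # each object ends up in exactly one of the k groups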
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of
data objects. We can classify hierarchical methods on the basis of how
the hierarchical decomposition is formed. There are two approaches
here
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we
start with each object forming a separate group. It keeps on merging
the objects or groups that are close to one another. It keeps on doing so until all of the groups are merged into one or until the termination condition holds.
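A toy illustration of this bottom-up idea on one-dimensional points is shown below: it repeatedly merges the two closest groups until a chosen number of groups remains. Measuring the distance between groups by the distance between their means is an arbitrary linkage choice made for the sketch, and the data and stopping point are invented.

# Toy agglomerative (bottom-up) clustering of 1-D points.
points = [1.0, 1.2, 5.0, 5.3, 9.9]
groups = [[p] for p in points]              # start with each object in its own group

def group_distance(a, b):
    # distance between group means (one simple linkage choice for the sketch)
    return abs(sum(a) / len(a) - sum(b) / len(b))

while len(groups) > 3:                      # termination condition: three groups remain
    # find and merge the closest pair of groups
    i, j = min(((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
               key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]))
    groups[i].extend(groups.pop(j))

print(groups)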
Divisive Approach
This approach is also known as the top-down approach. In this, we
start with all of the objects in the same cluster. In the continuous
iteration, a cluster is split up into smaller clusters. This is done until each object is in a cluster of its own or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Density-based Method
This method is based on the notion of density. The basic idea is to
continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within
a given cluster, the radius of a given cluster has to contain at least a
minimum number of points.
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantage - The major advantage of this method is fast processing time, which depends only on the number of cells in each dimension of the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the
best fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the
data points.
Constraint-based Method
In this method, the clustering is performed by the incorporation of
user or application-oriented constraints. A constraint refers to the
user expectation or the properties of desired clustering results.
Constraints provide us with an interactive way of communication with
the clustering process. Constraints can be specified by the user or the
application requirement.
BAYESIAN CLASSIFICATION:
Bayesian classification is based on Bayes' Theorem. Bayesian
classifiers are statistical classifiers. Bayesian classifiers can predict
class membership probabilities such as the probability that a given
tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities:
Posterior probability, P(H|X)
Prior probability, P(H)
where X is a data tuple and H is some hypothesis. These variables may correspond to the actual attributes given in the data.
Conditional Probability: according to Bayes' Theorem, P(H|X) = P(X|H) P(H) / P(X).
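A small numeric sketch of the theorem follows; the training counts are invented so the arithmetic is easy to follow, with H = "buys_computer = yes" and X = "age = youth" as hypothetical attribute values.

# Numeric sketch of Bayes' Theorem: P(H|X) = P(X|H) * P(H) / P(X).
# The training counts below are invented for illustration.
total = 100                      # training tuples
h = 60                           # tuples with buys_computer = yes   (hypothesis H)
x = 40                           # tuples with age = youth           (evidence X)
x_and_h = 30                     # tuples with both

p_h = h / total                  # prior probability P(H)
p_x = x / total                  # evidence probability P(X)
p_x_given_h = x_and_h / h        # likelihood P(X|H)

p_h_given_x = p_x_given_h * p_h / p_x   # posterior probability P(H|X)
print(p_h_given_x)               # 0.75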
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form:
IF condition THEN conclusion
Points to remember
The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
If the condition holds true for a given tuple, then the antecedent is
satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting
IF-THEN rules from a decision tree.
Points to remember
One rule is created for each path from the root to the leaf node.
The leaf node holds the class prediction, forming the rule consequent.
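A small sketch of this extraction is shown below, on a hand-written tree represented as nested dictionaries (the attributes, values, and classes are hypothetical, not a learned model): each root-to-leaf path becomes one IF-THEN rule, with the leaf's class as the consequent.

# Hypothetical decision tree as nested dicts: internal nodes test an attribute,
# leaves hold the class prediction.
tree = {"age": {
    "youth":       {"student": {"yes": "buys_computer = yes", "no": "buys_computer = no"}},
    "middle_aged": "buys_computer = yes",
    "senior":      {"credit": {"fair": "buys_computer = no", "excellent": "buys_computer = yes"}},
}}

def extract_rules(node, conditions=()):
    # One IF-THEN rule per root-to-leaf path; the leaf forms the rule consequent.
    if isinstance(node, str):                         # leaf node
        yield "IF " + " AND ".join(conditions) + " THEN " + node
        return
    (attribute, branches), = node.items()
    for value, child in branches.items():
        yield from extract_rules(child, conditions + (f"{attribute} = {value}",))

for rule in extract_rules(tree):
    print(rule)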
Rule Pruning
A rule is pruned for the following reason:
The assessment of quality is made on the original set of training data. The rule may perform well on training data but less well on subsequent data; that is why rule pruning is required.
A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has greater quality than R, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
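In code, with pos and neg counted on an independent set of tuples (the counts below are invented for illustration), pruning keeps whichever version of the rule has the higher FOIL_Prune value.

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg), where pos and neg are the numbers of
    # tuples covered by rule R that are correctly and incorrectly classified.
    return (pos - neg) / (pos + neg)

# Invented counts: the original rule R versus R with one conjunct removed.
original = foil_prune(pos=40, neg=15)     # about 0.455
pruned   = foil_prune(pos=45, neg=18)     # about 0.429
print("keep pruned version" if pruned > original else "keep original rule")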