• Classification rules that identify a group of objects that have the same
label or differentiate among groups of objects that have different labels.
These methods are termed classification/regression methods.
(1) ITERATE, the conceptual clustering system that generates stable and
cohesive clusters through the ADO-star data-ordering technique and an
iterative redistribution strategy.
Model
The facts that the data warehouse helps analyze are classified along different
dimensions: the fact tables hold the main data, while the usually smaller
dimension tables describe each value of a dimension and can be joined to the
fact tables as needed.
Dimension tables have a simple primary key, while fact tables have a
compound primary key consisting of the aggregate of relevant dimension keys.
It is common for dimension tables to consolidate redundant data and be in
second normal form, while fact tables are usually in third normal form because
all data depend on either one dimension or all of them, not on combinations of
a few dimensions.
Example
Fact_Sales is the fact table and there
are three dimension tables Dim_Date,
Dim_Store and Dim_Product. Each
dimension table has a primary key on
its Id column, relating to one of the
columns of the Fact_Sales table's
three-column primary key (Date_Id,
Store_Id, Product_Id). The non-
primary key Units_Sold column of the
fact table in this example represents
a measure or metric that can be used in calculations and analysis. The non-
primary key columns of the dimension tables represent additional attributes of
the dimensions (such as the Year of the Dim_Date dimension).
The following query extracts how many TV sets have been sold, for each brand
and country, in 1997.
SELECT
    P.Brand,
    S.Country,
    SUM(F.Units_Sold)
FROM Fact_Sales F
    INNER JOIN Dim_Date D ON F.Date_Id = D.Id
    INNER JOIN Dim_Store S ON F.Store_Id = S.Id
    INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997
  AND P.Product_Category = 'tv'
GROUP BY
    P.Brand,
    S.Country
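The query above can be run end-to-end against a miniature star schema, for example with SQLite; the sample rows below are invented for illustration:

```python
import sqlite3

# Build a tiny in-memory star schema matching the example above.
# Table and column names follow the text; the row values are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Dim_Date   (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store  (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product(Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales (Date_Id INTEGER, Store_Id INTEGER, Product_Id INTEGER,
                         Units_Sold INTEGER,
                         PRIMARY KEY (Date_Id, Store_Id, Product_Id));
INSERT INTO Dim_Date    VALUES (1, 1997), (2, 1998);
INSERT INTO Dim_Store   VALUES (1, 'Italy'), (2, 'France');
INSERT INTO Dim_Product VALUES (1, 'BrandA', 'tv'), (2, 'BrandB', 'radio');
INSERT INTO Fact_Sales  VALUES (1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 1, 7);
""")
rows = cur.execute("""
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
JOIN Dim_Date D    ON F.Date_Id = D.Id
JOIN Dim_Store S   ON F.Store_Id = S.Id
JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
""").fetchall()
print(rows)  # only the 1997 TV sales are aggregated
```

Note how the 1998 fact row is filtered out by the join against Dim_Date, exactly as the WHERE clause in the text describes.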
Snowflake schema
The snowflake schema is a variation of the star schema, featuring
normalization of dimension tables.
Common uses
The star and snowflake schema are most commonly found in dimensional data
warehouses and data marts where speed of data retrieval is more important
than the efficiency of data manipulations. As such, the tables in these schemas
are not normalized much, and are frequently designed at a level of
normalization short of third normal form.
The decision whether to employ a star schema or a snowflake schema should
consider the relative strengths of the database platform in question and the
query tool to be employed. Star schemas should be favored with query tools that
largely expose users to the underlying table structures, and in environments
where most queries are simpler in nature. Snowflake schemas are often better
with more sophisticated query tools that isolate users from the raw table
structures and for environments having numerous queries with complex
criteria.
From a storage point of view, the dimension tables are typically small
compared to the fact tables. This often removes the storage space benefit of
snowflaking the dimension tables.
Benefits of "snowflaking"
• Some OLAP multidimensional database modeling tools that use
dimensional data marts as a data source are optimized for snowflake
schemas.
• If a dimension is very sparse (i.e. most of the possible values for the
dimension have no data) and/or a dimension has a very long list of
attributes which may be used in a query, the dimension table may
occupy a significant proportion of the database and snowflaking may be
appropriate.
• A multidimensional view is sometimes added to an existing transactional
database to aid reporting. In this case, the tables which describe the
dimensions will already exist and will typically be normalized. A
snowflake schema will therefore be easier to implement.
• A snowflake schema can sometimes reflect the way in which users think
about data. Users may prefer to generate queries using a star schema in
some cases, although this may or may not be reflected in the underlying
organization of the database.
• Some users may wish to submit queries to the database which, using
conventional multidimensional reporting tools, cannot be expressed
within a simple star schema. This is particularly common in data mining
of customer databases, where a common requirement is to locate
common factors between customers who bought products meeting
complex criteria. Some snowflaking would typically be required to permit
simple query tools to form such a query, especially if provision for these
forms of query weren't anticipated when the data warehouse was first
designed.
Examples
The following example query is the snowflake schema equivalent of the star
schema example code which returns the total number of TV units sold by brand
and by country for 1997. Notice that the snowflake schema query requires
many more joins than the star schema version in order to fulfill even a simple
query. The benefit of using the snowflake schema in this example is that the
storage requirements are lower since the snowflake schema eliminates many
duplicate values from the dimensions themselves.
SELECT
    B.Brand,
    G.Country,
    SUM(F.Units_Sold)
FROM Fact_Sales F
    INNER JOIN Dim_Date D ON F.Date_Id = D.Id
    INNER JOIN Dim_Store S ON F.Store_Id = S.Id
    INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
    INNER JOIN Dim_Product P ON F.Product_Id = P.Id
    INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
    INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
WHERE D.Year = 1997
  AND C.Product_Category = 'tv'
GROUP BY
    B.Brand,
    G.Country
2b Concept Hierarchies
2c OLAP Operations
Online analytical processing or OLAP is an approach to quickly answer
multi-dimensional analytical queries. OLAP is part of the broader category of
business intelligence, which also encompasses relational reporting and data
mining. The typical applications of OLAP are in business reporting for sales,
marketing, management reporting, business process management (BPM),
budgeting and forecasting, financial reporting and similar areas. The term OLAP
was created as a slight modification of the traditional database term OLTP
(Online Transaction Processing).
Databases configured for OLAP use a multidimensional data model, allowing for
complex analytical and ad-hoc queries with a rapid execution time. They borrow
aspects of navigational databases and hierarchical databases that are faster
than relational databases.
The output of an OLAP query is typically displayed in a matrix (or pivot) format.
The dimensions form the rows and columns of the matrix; the measures form
the values.
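This matrix layout can be sketched in a few lines; the result tuples below are invented sample data:

```python
from collections import defaultdict

# Hypothetical OLAP query result: (brand, country, units_sold) tuples.
result = [("BrandA", "Italy", 10), ("BrandA", "France", 5), ("BrandB", "Italy", 3)]

# Pivot: one dimension forms the rows, another the columns,
# and the aggregated measure fills the cells.
pivot = defaultdict(lambda: defaultdict(int))
for brand, country, units in result:
    pivot[brand][country] += units

print(pivot["BrandA"]["Italy"])  # 10
```

Missing (brand, country) combinations simply read as zero, which matches the sparse cells of a typical pivot display.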
Concept
At the core of any OLAP system is the concept of an OLAP cube (also called a
multidimensional cube or a hypercube). It consists of numeric facts called
measures which are categorized by dimensions. The cube metadata is typically
created from a star schema or snowflake schema of tables in a relational
database. Measures are derived from the records in the fact table and
dimensions are derived from the tables. Each measure can be thought of as
having a set of labels, or meta-data associated with it. A dimension is what
describes these labels; it provides information about the measure. A simple
example would be a cube that contains a store's sales as a measure, and
Date/Time as a dimension. Each Sale has a Date/Time label that describes more
about that sale. Any number of dimensions can be added to the structure such
as Store, Cashier, or Customer by adding a column to the fact table. This allows
an analyst to view the measures along any combination of the dimensions.
For example:

Sales Fact Table
+-------------+---------+
| sale_amount | time_id |
+-------------+---------+        Time Dimension
|     2008.08 |    1234 |---+    +---------+-------------------+
+-------------+---------+   |    | time_id | timestamp         |
                            |    +---------+-------------------+
                            +--->| 1234    | 20080902 12:35:43 |
                                 +---------+-------------------+
Multidimensional databases
Multidimensional structure is defined as “a variation of the relational model that
uses multidimensional structures to organize data and express the relationships
between data” (O'Brien & Marakas, 2009, pg 177). The structure is broken into
cubes and the cubes are able to store and access data within the confines of
each cube. “Each cell within a multidimensional structure contains aggregated
data related to elements along each of its dimensions” (pg. 178). Even when
data is manipulated it remains easy to access, and the structure is a compact
type of database. The data remains interrelated. Multidimensional structure is
quite popular for analytical databases that use online analytical processing
(OLAP) applications (O’Brien & Marakas, 2009). Analytical databases use this
structure because of its ability to deliver answers to complex business queries
quickly. Data can be viewed in different ways, which gives a broader picture of
a problem than other models do (Williams, Garza, Tucker & Marcus, 1994).
Aggregations
It has been claimed that for complex queries OLAP cubes can produce an
answer in around 0.1% of the time for the same query on OLTP relational data.
The most important mechanism in OLAP which allows it to achieve such
performance is the use of aggregations. Aggregations are built from the fact
table by changing the granularity on specific dimensions and aggregating up
data along these dimensions. The number of possible aggregations is
determined by every possible combination of dimension granularities.
The combination of all possible aggregations and the base data contains the
answers to every query which can be answered from the data.
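The size of this aggregation space is the product of the number of granularities per dimension, as a short sketch shows (the dimension names and levels below are invented; "all" stands for aggregating a dimension away entirely):

```python
from itertools import product

# Hypothetical granularity levels per dimension, coarsest first.
granularities = {
    "Date":    ["all", "year", "month", "day"],
    "Store":   ["all", "country", "store"],
    "Product": ["all", "category", "product"],
}

# Every combination of one granularity per dimension is one possible aggregation.
combos = list(product(*granularities.values()))
print(len(combos))  # 4 * 3 * 3 = 36 possible aggregations
```

The combination ("day", "store", "product") is the base data itself; ("all", "all", "all") is the grand total.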
Because usually there are many aggregations that can be calculated, often only
a predetermined number are fully calculated; the remainder are solved on
demand. The problem of deciding which aggregations (views) to calculate is
known as the view selection problem. View selection can be constrained by the
total size of the selected set of aggregations, the time to update them from
changes in the base data, or both. The objective of view selection is typically to
minimize the average time to answer OLAP queries, although some studies also
minimize the update time. View selection is NP-complete. Many approaches to
the problem have been explored, including greedy algorithms, randomized
search, genetic algorithms, and A* search.
A very effective way to support aggregation and other common OLAP
operations is the use of bitmap indexes.
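A minimal bitmap-index sketch: one bitmap per distinct value, with bit i set when row i holds that value (the column data below is invented; real implementations add compression such as run-length encoding):

```python
# A column of a fact or dimension table, as a list of values.
rows = ["tv", "radio", "tv", "tv", "radio"]

# One Python integer per distinct value serves as its bitmap.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Counting rows for a value is a popcount; combining predicates is
# bitwise AND/OR over the bitmaps, which is what makes aggregation fast.
tv_count = bin(bitmaps["tv"]).count("1")
print(tv_count)  # 3
```

An AND of two such bitmaps answers a conjunctive predicate over two columns without touching the base rows.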
Types
OLAP systems have been traditionally categorized using the following
taxonomy.
Multidimensional
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
MOLAP stores data in optimized multi-dimensional array storage, rather than
in a relational database. It therefore requires the pre-computation and
storage of information in the cube - the operation known as processing.
Relational
ROLAP works directly with relational databases. The base data and the
dimension tables are stored as relational tables, and new tables are created to
hold the aggregated information. ROLAP depends on a specialized schema design.
Hybrid
There is no clear agreement across the industry as to what constitutes "Hybrid
OLAP", except that a database will divide data between relational and
specialized storage. For example, for some vendors, a HOLAP database will use
relational tables to hold the larger quantities of detailed data, and use
specialized storage for at least some aspects of the smaller quantities of more-
aggregate or less-detailed data.
Comparison
Each type has certain benefits, although there is disagreement about the
specifics of the benefits between providers.
• Some MOLAP implementations are prone to database explosion.
Database explosion is a phenomenon causing vast amounts of storage
space to be used by MOLAP databases when certain common conditions
are met: high number of dimensions, pre-calculated results and sparse
multidimensional data. The typical mitigation technique for database
explosion is not to materialize all the possible aggregations, but only an
optimal subset of aggregations based on the desired performance vs.
storage trade-off.
• MOLAP generally delivers better performance due to specialized indexing
and storage optimizations. MOLAP also needs less storage space
compared to ROLAP because the specialized storage typically includes
compression techniques.
• ROLAP is generally more scalable. However, large volume pre-processing
is difficult to implement efficiently so it is frequently skipped. ROLAP
query performance can therefore suffer tremendously.
• Since ROLAP relies more on the database to perform calculations, it has
more limitations in the specialized functions it can use.
• HOLAP encompasses a range of solutions that attempt to mix the best of
ROLAP and MOLAP. It can generally pre-process quickly, scale well, and
offer good function support.
Other types
The following acronyms are also sometimes used, although they are not as
widespread as the ones above:
• WOLAP - Web-based OLAP
• DOLAP - Desktop OLAP
• RTOLAP - Real-Time OLAP
Products
The first product that performed OLAP queries was Express, which was released
in 1970 (and acquired by Oracle in 1995 from Information Resources). However,
the term did not appear until 1993, when it was coined by Edgar F. Codd, who
has been described as "the father of the relational database". Codd's paper
resulted from a short consulting assignment which Codd undertook for the
former Arbor Software (later Hyperion Solutions, acquired by Oracle in 2007),
as a sort of marketing coup. The company had released its own OLAP product,
Essbase, a year earlier. As a result, Codd's "twelve laws of online analytical
processing" were explicit in their reference to Essbase. There was some
ensuing controversy, and when Computerworld learned that Codd was paid by
Arbor, it retracted the article. The OLAP market experienced strong growth in
the late 1990s, with dozens of commercial products coming to market. In 1998,
Microsoft released its first OLAP server, Microsoft Analysis Services, which
drove wide adoption of OLAP technology and moved it into the mainstream.
Product comparison
Market structure
Below is a list of top OLAP vendors in 2006, with figures in millions of United
States Dollars.
Vendor                          Global revenue
Microsoft Corporation                    1,806
Hyperion Solutions Corporation           1,077
Cognos                                     735
Business Objects                           416
MicroStrategy                              416
SAP AG                                     330
Cartesis SA                                210
Applix                                     205
Infor                                      199
Oracle Corporation                         159
Others                                     152
Total                                    5,700
Microsoft was the only vendor that continuously exceeded the industry average
growth rate during 2000-2006. Since the above data was collected,
Hyperion has been acquired by Oracle, Cartesis by Business Objects, Business
Objects by SAP, Applix by Cognos, and Cognos by IBM.
Data cleansing or data scrubbing is the act of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database.
Used mainly in databases, the term refers to identifying incomplete, incorrect,
inaccurate, or irrelevant parts of the data and then replacing, modifying, or
deleting this dirty data.
After cleansing, a data set will be consistent with other similar data sets in the
system. The inconsistencies detected or removed may have been originally
caused by different data dictionary definitions of similar entities in different
stores, may have been caused by user entry errors, or may have been
corrupted in transmission or storage.
Data cleansing differs from data validation in that validation almost invariably
means data is rejected from the system at entry and is performed at entry
time, rather than on batches of data.
The actual process of data cleansing may involve removing typographical errors
or validating and correcting values against a known list of entities. The
validation may be strict (such as rejecting any address that does not have a
valid postal code) or fuzzy (such as correcting records that partially match
existing, known records).
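As a sketch, a fuzzy pass of this kind can be approximated with Python's difflib; the city list is invented, and real cleansing tools use more robust matching:

```python
import difflib

# Known list of valid city names (illustrative).
known_cities = ["Milan", "Rome", "Naples", "Turin"]

def clean_city(raw):
    """Strict pass: exact match; fuzzy pass: closest known value, else None."""
    if raw in known_cities:
        return raw
    matches = difflib.get_close_matches(raw, known_cities, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(clean_city("Mialn"))  # Milan
print(clean_city("xyz"))    # None
```

Records with no sufficiently close match are rejected rather than silently corrected, mirroring the strict-versus-fuzzy distinction above.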
Motivation
Administratively, incorrect or inconsistent data can lead to false conclusions
and misdirected investments on both public and private scales. For instance,
the government may want to analyze population census figures to decide which
regions require further spending and investment on infrastructure and services.
In this case, it will be important to have access to reliable data to avoid
erroneous fiscal decisions.
In the business world, incorrect data can be costly. Many companies use
customer information databases that record data like contact information,
addresses, and preferences. If for instance the addresses are inconsistent, the
company will suffer the cost of resending mail or even losing customers.
Data quality
High quality data needs to pass a set of quality criteria. Those include:
• Accuracy: An aggregated value over the criteria of integrity, consistency
and density
• Integrity: An aggregated value over the criteria of completeness and
validity
• Completeness: Achieved by correcting data containing anomalies
• Validity: Approximated by the amount of data satisfying integrity
constraints
• Consistency: Concerns contradictions and syntactical anomalies
• Uniformity: Directly related to irregularities
• Density: The ratio of missing values in the data to the number of total
values that ought to be known
• Uniqueness: Related to the number of duplicates in the data
Existing tools
Before computer automation, data about individuals or organizations was
maintained and secured as paper records, dispersed in separate business or
organizational units. Information systems concentrate data in computer files
that can potentially be accessed by large numbers of people and by groups
outside the organization.
Data integration
Example
Consider a web application where a user can query a variety of information
about cities (such as crime statistics, weather, hotels, demographics, etc).
Traditionally, the information must exist in a single database with a single
schema. But any single enterprise would find information of this breadth
somewhat difficult and expensive to collect. Even if the resources exist to
gather the data, it would likely duplicate data in existing crime databases,
weather websites, and census data.
A data-integration solution may address this problem by considering these
external resources as materialized views over a virtual mediated schema,
resulting in "virtual data integration". This means application developers
construct a virtual schema — the mediated schema — to best model the kinds of
answers their users want. Next, they design "wrappers" or adapters for each
data source, such as the crime database and weather website. These adapters
simply transform the local query results (those returned by the respective
websites or databases) into an easily processed form for the data integration
solution (see figure 2). When an application-user queries the mediated schema,
the data-integration solution transforms this query into appropriate queries
over the respective data sources. Finally, the virtual database combines the
results of these queries into the answer to the user's query.
This solution offers the convenience of adding new sources by simply
constructing an adapter for them. It contrasts with ETL systems or with a single
database solution, which require manual integration of the entire new dataset
into the system.
Definitions
Data integration systems are formally defined as a triple ⟨G, S, M⟩, where G is
the global (or mediated) schema, S is the heterogeneous set of source
schemas, and M is the mapping that maps queries between the source and the
global schemas. Both G and S are expressed in languages over alphabets
composed of symbols for each of their respective relations. The mapping M
consists of assertions between queries over G and queries over S. When users
pose queries over the data integration system, they pose queries over G and
the mapping then asserts connections between the elements in the global
schema and the source schemas.
A database over a schema is defined as a set of sets, one for each relation (in a
relational database). The database corresponding to the source schema S
would comprise the set of sets of tuples for each of the heterogeneous data
sources and is called the source database. Note that this single source
database may actually represent a collection of disconnected databases. The
database corresponding to the virtual mediated schema G is called the global
database. The global database must satisfy the mapping M with respect to the
source database. The legality of this mapping depends on the nature of the
correspondence between G and S. Two popular ways to model this
correspondence exist: Global-as-View (GAV) and Local-as-View (LAV).
Figure 3: Illustration of tuple space of the GAV and LAV mappings. In GAV, the
system is constrained to the set of tuples mapped by the mediators while the
set of tuples expressible over the sources may be much larger and richer. In
LAV, the system is constrained to the set of tuples in the sources while the set
of tuples expressible over the global schema can be much larger. Therefore LAV
systems must often deal with incomplete answers.
GAV systems model the global database as a set of views over S. In this case M
associates to each element of G a query over S. Query processing becomes a
straightforward operation due to the well-defined associations between G and
S. The burden of complexity falls on implementing mediator code instructing
the data integration system exactly how to retrieve elements from the source
databases. If any new sources join the system, considerable effort may be
necessary to update the mediator, thus the GAV approach appears preferable
when the sources seem unlikely to change.
In a GAV approach to the example data integration system above, the system
designer would first develop mediators for each of the city information sources
and then design the global schema around these mediators. For example,
consider if one of the sources served a weather website. The designer would
likely then add a corresponding element for weather to the global schema.
Then the bulk of effort concentrates on writing the proper mediator code that
will transform predicates on weather into a query over the weather website.
This effort can become complex if some other source also relates to weather,
because the designer may need to write code to properly combine the results
from the two sources.
On the other hand, in LAV, the source database is modeled as a set of views
over G. In this case M associates to each element of S a query over G. Here the
exact associations between G and S are no longer well-defined. As is illustrated
in the next section, the burden of determining how to retrieve elements from
the sources is placed on the query processor. The benefit of an LAV modeling is
that new sources can be added with far less work than in a GAV system, thus
the LAV approach should be favored in cases where the mediated schema is
more likely to change.
In an LAV approach to the example data integration system above, the system
designer designs the global schema first and then simply inputs the schemas of
the respective city information sources. Consider again if one of the sources
serves a weather website. The designer would add corresponding elements for
weather to the global schema only if none existed already. Then programmers
write an adapter or wrapper for the website and add a schema description of
the website's results to the source schemas. The complexity of adding the new
source moves from the designer to the query processor.
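The contrast can be sketched with a toy GAV-style mediator; all source names and data below are invented, and a LAV system would instead describe each source as a query over the global schema and leave the rewriting to the query processor:

```python
# Toy GAV sketch: the global schema element "city_info" is defined as a
# view over two hypothetical wrapped sources.
source_weather_site = {"Milan": "rain", "Rome": "sun"}        # wrapper output
source_city_db = [("Milan", 1_350_000), ("Rome", 2_800_000)]  # wrapper output

def city_info():
    """GAV mediator: builds the global relation (city, population, weather)
    directly from the sources, so query processing is simple expansion."""
    population = dict(source_city_db)
    return [(city, population.get(city), weather)
            for city, weather in source_weather_site.items()]

# A user's query over the global schema is answered by expanding this mediator.
print(city_info())
```

Adding a third source here would mean rewriting city_info(), which is exactly the GAV maintenance burden described above.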
Query processing
The theory of query processing in data integration systems is commonly
expressed using conjunctive queries [5]. One can loosely think of a conjunctive
query as a logical function applied to the relations of a database such as "f(A,B)
where A < B". If a tuple or set of tuples is substituted into the rule and satisfies
it (makes it true), then we consider that tuple as part of the set of answers in
the query. While formal languages like Datalog express these queries concisely
and without ambiguity, common SQL queries count as conjunctive queries as
well.
In terms of data integration, "query containment" represents an important
property of conjunctive queries. A query A contains another query B (denoted
B ⊆ A) if the results of applying B are a subset of the results of applying A for
any database. The two queries are said to be equivalent if the resulting sets are
equal for any database. This is important because in both GAV and LAV
systems, a user poses conjunctive queries over a virtual schema represented
by a set of views, or "materialized" conjunctive queries. Integration seeks to
rewrite the queries represented by the views to make their results equivalent or
maximally contained by our user's query. This corresponds to the problem of
answering queries using views (AQUV).
In GAV systems, a system designer writes mediator code to define the query-
rewriting. Each element in the user's query corresponds to a substitution rule
just as each element in the global schema corresponds to a query over the
source. Query processing simply expands the subgoals of the user's query
according to the rule specified in the mediator and thus the resulting query is
likely to be equivalent. While the designer does the majority of the work
beforehand, some GAV systems such as Tsimmis involve simplifying the
mediator description process.
In LAV systems, queries undergo a more radical process of rewriting because no
mediator exists to align the user's query with a simple expansion strategy. The
integration system must execute a search over the space of possible queries in
order to find the best rewrite. The resulting rewrite may not be an equivalent
query but maximally contained, and the resulting tuples may be incomplete. As
of 2009 the MiniCon algorithm[6] is the leading query rewriting algorithm for LAV
data integration systems.
In general, the complexity of query rewriting is NP-complete. If the space of
rewrites is relatively small this does not pose a problem — even for integration
systems with hundreds of sources.
Simplicity of understanding
Answering queries with views arouses interest from a theoretical
standpoint, but presents difficulties in understanding how to incorporate it
as an "enterprise solution". Some developers believe it should be merged with
EAI. Others believe it should be incorporated with ETL systems, citing
customers' confusion over the differences between the two services.
Simplicity of deployment
Even if recognized as a solution to a problem, EII as of 2009 takes
time to apply and presents complexities in deployment. People have
proposed a variety of schema-less solutions such as "Lean Middleware",
but ease-of-use and speed of employment appear inversely proportional
to the generality of such systems. Others cite the need for standard data
interfaces to speed and simplify the integration process in practice.
Handling higher-order information
Analysts experience difficulty — even with a functioning information
integration system — in determining whether the sources in the
database will satisfy a given application. Answering these kinds of
questions about a set of repositories requires semantic information like
metadata and/or ontologies. The few commercial tools that leverage this
information remain in their infancy.
Data transformation
In metadata, a data transformation converts data from a source data format
into a destination data format.
Data transformation can be divided into two steps:
1. data mapping maps data elements from the source to the destination
and captures any transformation that must occur
2. code generation that creates the actual transformation program
Data element to data element mapping is frequently complicated by complex
transformations that require one-to-many and many-to-one transformation
rules.
The code generation step takes the data element mapping specification and
creates an executable program that can be run on a computer system. Code
generation can also create transformation in easy-to-maintain computer
languages such as Java or XSLT.
When the mapping is indirect via a mediating data model, the process is also
called data mediation.
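The two steps can be sketched in miniature; the mapping specification and field names below are invented:

```python
# Step 1, data mapping: a declarative source-to-destination specification.
# "full_name" is a many-to-one rule; "age" is a one-to-one rule.
mapping = {
    "full_name": lambda src: src["first"] + " " + src["last"],
    "age":       lambda src: src["age"],
}

def make_transformer(mapping):
    """Step 2, 'code generation': turn the mapping spec into an
    executable transformation program."""
    def transform(record):
        return {dest: rule(record) for dest, rule in mapping.items()}
    return transform

transform = make_transformer(mapping)
print(transform({"first": "Ada", "last": "Lovelace", "age": 36}))
# {'full_name': 'Ada Lovelace', 'age': 36}
```

A real code generator would emit Java or XSLT from the same kind of specification rather than closing over Python lambdas.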
Transformational languages
Numerous languages are available for performing data transformation, varying
in their accessibility (cost) and general usefulness. Many transformational
languages require a grammar to be provided; in many cases the grammar is
structured using something closely resembling Backus–Naur Form (BNF).
Examples of such languages include:
• XSLT - the XML transformation language
• TXL - prototyping language-based descriptions using source
transformation
Though transformational languages are typically best suited for
transformation, something as simple as regular expressions can be used to
achieve useful transformation. TextPad supports the use of regular expressions
with arguments, allowing all instances of a particular pattern to be replaced
with another pattern that reuses parts of the original pattern. For example,
given calls such as:
foo ("some string", 42, gCommon);
bar (someObj, anotherObj);
a pattern can capture the argument lists and rewrite every such call in a
uniform way.
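A sketch of such a pattern-with-arguments rewrite, here using Python's re module (the call site and the chosen rewrite are illustrative):

```python
import re

# Swap the last two arguments of a foo(...) call and rename it to bar(...),
# reusing the captured parts of the original pattern in the replacement.
source = 'foo ("some string", 42, gCommon);'

transformed = re.sub(
    r'foo \((.+), (\w+), (\w+)\);',   # capture the three arguments
    r'bar (\1, \3, \2);',             # reuse them in a new order
    source,
)
print(transformed)  # bar ("some string", gCommon, 42);
```

This is exactly the "replace with parts of the original pattern" mechanism described above, expressed as backreferences \1..\3.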
Difficult problems
There are many challenges in data transformation. Probably the most difficult
problem to address in C++ is "unstructured preprocessor directives". These are
preprocessor directives which do not contain blocks of code with simple
grammatical descriptions - example:
void MyFunc ()
{
    if (x > 17)
    {
        printf("test");
#ifdef FOO
    } else {
#endif
        if (gWatch)
            mTest = 42;
    }
}
A really general solution to handling this is very hard because such
preprocessor directives can essentially edit the underlying language in arbitrary
ways. However, because such directives are not, in practice, used in completely
arbitrary ways, one can build practical tools for handling preprocessed
languages. The DMS Software Reengineering Toolkit is capable of handling
structured macros and preprocessor conditionals.
Definition
Following the original definition by Agrawal et al., the problem of association
rule mining is defined as: Let I = {i_1, i_2, ..., i_n} be a set of n binary
attributes called items. Let D = {t_1, t_2, ..., t_m} be a set of transactions
called the database. Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined as an implication of the
form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short,
itemsets) X and Y are called the antecedent (left-hand side or LHS) and
consequent (right-hand side or RHS) of the rule.
Example database with 4 items and 5 transactions

transaction ID | milk | bread | butter | beer
             1 |    1 |     1 |      0 |    0
             2 |    0 |     1 |      1 |    0
             3 |    0 |     0 |      0 |    1
             4 |    1 |     1 |      1 |    0
             5 |    0 |     1 |      0 |    0
To illustrate the concepts, we use a small example from the supermarket
domain. The set of items is I = {milk,bread,butter,beer} and a small database
containing the items (1 codes presence and 0 absence of an item in a
transaction) is shown in the table above. An example rule for the
supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and
bread are bought, customers also buy butter.
Note: this example is extremely small. In practical applications, a rule needs a
support of several hundred transactions before it can be considered statistically
significant, and datasets often contain thousands or millions of transactions.
To select interesting rules from the set of all possible rules, constraints on
various measures of significance and interest can be used. The best-known
constraints are minimum thresholds on support and confidence. The support
supp(X) of an item set X is defined as the proportion of transactions in the data
set which contain the item set. In the example database, the item set
{milk,bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all
transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2 /
0.4 = 0.5 in the database, which means that for 50% of the transactions
containing milk and bread the rule is correct. Confidence can be interpreted as
an estimate of the probability P(Y | X), the probability of finding the RHS of the
rule in transactions under the condition that these transactions also contain the
LHS. The same rule has a lift of supp(X ∪ Y) / (supp(X) × supp(Y)) =
0.2 / (0.4 × 0.4) = 1.25.
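The support, confidence, and lift measures can be computed directly on the example database above. The following is a toy illustrative sketch, not a library implementation:

```python
# The example supermarket database (1 in the table = item present).
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    """Confidence of lhs => rhs: supp(lhs ∪ rhs) / supp(lhs)."""
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    """Lift of lhs => rhs: supp(lhs ∪ rhs) / (supp(lhs) * supp(rhs))."""
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

X, Y = {"milk", "bread"}, {"butter"}
print(supp(X))      # 0.4
print(conf(X, Y))   # 0.5
print(lift(X, Y))   # ≈ 1.25
```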
Apriori algorithm
Apriori is the best-known algorithm for mining association rules. It uses a
breadth-first search strategy to count the support of itemsets and a candidate
generation function that exploits the downward closure property of support.
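A minimal sketch of the Apriori idea, assuming set-based transactions as in the supermarket example; this is an unoptimized illustration of the level-wise search and downward-closure pruning, not a production implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: breadth-first candidate generation pruned by the
    downward-closure property (every subset of a frequent itemset is frequent)."""
    n = len(transactions)
    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while frequent:
        result.update((s, support(s)) for s in frequent)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets,
        # then prune any candidate with an infrequent k-subset.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        k += 1
    return result

transactions = [{"milk", "bread"}, {"bread", "butter"}, {"beer"},
                {"milk", "bread", "butter"}, {"bread"}]
freq = apriori(transactions, min_support=0.4)
# freq maps each frequent itemset to its support; {beer} is pruned at level 1.
```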
Eclat algorithm
Eclat is a depth-first search algorithm using set intersection.
FP-growth algorithm
FP-growth (frequent pattern growth) uses an extended prefix-tree (FP-tree)
structure to store the database in a compressed form. FP-growth adopts a
divide-and-conquer approach to decompose both the mining tasks and the
databases. It uses a pattern fragment growth method to avoid the costly
process of candidate generation and testing used by Apriori.
One-attribute-rule
The one-attribute-rule, or OneR, is an algorithm for finding association rules.
According to Ross, very simple association rules, involving just one attribute in
the condition part, often work well in practice with real-world data. The idea of
the OneR algorithm is to find the single attribute to use to classify a novel
data point so that it makes the fewest prediction errors.
For example, to classify a car you haven't seen before, you might apply the
following rule: If Fast Then Sportscar, as opposed to a rule with multiple
attributes in the condition: If Fast And Softtop And Red Then Sportscar.
The algorithm is as follows:
For each attribute A:
For each value V of that attribute, create a rule:
• count how often each class appears
• find the most frequent class, c
• make a rule "if A=V then C=c"
Calculate the error rate of this rule
Pick the attribute whose rules produce the lowest error rate
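The steps above can be sketched in Python as follows; the toy car data and its attribute names are invented purely for illustration:

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, label):
    """OneR: for each attribute build one rule per value (predict the most
    frequent class for that value) and keep the attribute with fewest errors."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[label]] += 1
        # Rule per value: "if attr == v then predict most frequent class".
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(row[label] != rules[row[attr]] for row in rows)
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

cars = [
    {"fast": "yes", "softtop": "yes", "class": "sportscar"},
    {"fast": "yes", "softtop": "no",  "class": "sportscar"},
    {"fast": "no",  "softtop": "yes", "class": "family"},
    {"fast": "no",  "softtop": "no",  "class": "family"},
]
attr, rules = one_r(cars, ["fast", "softtop"], "class")
# "fast" alone classifies this toy data with zero errors, so it is chosen.
```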
Zero-attribute-rule
The zero-attribute-rule, or ZeroR, does not involve any attribute in the
condition part, and always returns the most frequent class in the training set.
This algorithm is frequently used to measure the classification success of other
algorithms.
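A minimal ZeroR sketch, with an invented toy label set; its accuracy serves as the baseline other classifiers should beat:

```python
from collections import Counter

def zero_r(labels):
    """ZeroR baseline: ignore all attributes and always predict the
    most frequent class in the training labels."""
    return Counter(labels).most_common(1)[0][0]

train = ["spam", "ham", "ham", "ham", "spam"]
majority = zero_r(train)                                  # "ham"
baseline_accuracy = train.count(majority) / len(train)    # 0.6
```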
Lore
A famous story about association rule mining is the "beer and diaper" story. A
purported survey of behavior of supermarket shoppers discovered that
customers (presumably young men) who buy diapers tend also to buy beer.
This anecdote became popular as an example of how unexpected association
rules might be found from everyday data. [See
http://www.dssresources.com/newsletters/66.php]
GUHA procedure ASSOC
GUHA is a general method for exploratory data analysis that has theoretical
foundations in observational calculi. The ASSOC procedure [19] is a GUHA
method which mines for generalized association rules using fast bitstring
operations. The association rules mined by this method are more general than
those output by Apriori: for example, "items" can be connected with both
conjunctions and disjunctions, and the relation between the antecedent and
consequent of the rule is not restricted to minimum support and confidence as
in Apriori; an arbitrary combination of supported interest measures can be
used.
The decision tree is a classification model, applied to existing data. If you apply
it to new data, for which the class is unknown, you also get a prediction of the
class. The assumption is that the new data come from a distribution similar to
that of the data you used to build your decision tree. In many cases this is a
correct assumption, which is why you can use the decision tree to build a
predictive model.
It is a matter of definition. If you are trying to classify existing data, e.g. group
patients based on their known medical data and treatment outcome, I would
call it a classification. If you use a classification model to predict the treatment
outcome for a new patient, it would be a prediction.
gabrielac adds: In the book "Data Mining: Concepts and Techniques", Han and
Kamber's view is that predicting class labels is classification, and predicting
values (e.g. using regression techniques) is prediction. Other people prefer to
use "estimation" for predicting continuous values.
Clustering methods
The goal of clustering is to reduce the amount of data by categorizing or
grouping similar data items together. Such grouping is pervasive in the way
humans process information, and one of the motivations for using clustering
algorithms is to provide automated tools to help in constructing categories or
taxonomies [Jardine and Sibson, 1971, Sneath and Sokal, 1973]. The methods
may also be used to minimize the effects of human factors in the process.
Clustering methods [Anderberg, 1973, Hartigan, 1975, Jain and Dubes, 1988,
Jardine and Sibson, 1971, Sneath and Sokal, 1973, Tryon and Bailey, 1973] can
be divided into two basic types: hierarchical and partitional clustering. Within
each of the types there exists a wealth of subtypes and different algorithms for
finding the clusters.
Hierarchical clustering proceeds successively by either merging smaller clusters
into larger ones, or by splitting larger clusters. The clustering methods differ in
the rule by which it is decided which two small clusters are merged or which
large cluster is split. The end result of the algorithm is a tree of clusters called a
dendrogram, which shows how the clusters are related. By cutting the
dendrogram at a desired level a clustering of the data items into disjoint groups
is obtained.
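As an illustration of the agglomerative case, here is a minimal single-linkage sketch in Python; stopping the merges once k clusters remain plays the role of cutting the dendrogram at a chosen level. This is a toy sketch, not an optimized implementation:

```python
def single_linkage(points, k):
    """Agglomerative clustering sketch: repeatedly merge the two clusters
    whose closest members are nearest (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
groups = single_linkage(points, k=2)
# The two tight pairs end up in separate clusters.
```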
Partitional clustering, on the other hand, attempts to directly decompose the
data set into a set of disjoint clusters. The criterion function that the clustering
algorithm tries to minimize may emphasize the local structure of the data, as
by assigning clusters to peaks in the probability density function, or the global
structure. Typically the global criteria involve minimizing some measure of
dissimilarity in the samples within each cluster, while maximizing the
dissimilarity of different clusters.
A problem with the clustering methods is that the interpretation of the clusters
may be difficult. Most clustering algorithms prefer certain cluster shapes, and
the algorithms will always assign the data to clusters of such shapes even if
there were no clusters in the data. Therefore, if the goal is not just to compress
the data set but also to make inferences about its cluster structure, it is
essential to analyze whether the data set exhibits a clustering tendency. The
results of the cluster analysis need to be validated, as well. Jain and Dubes
(1988) present methods for both purposes.
Another potential problem is that the choice of the number of clusters may be
critical: quite different kinds of clusters may emerge when K is changed. Good
initialization of the cluster centroids may also be crucial; some clusters may
even be left empty if their centroids lie initially far from the distribution of data.
Clustering can be used to reduce the amount of data and to induce a
categorization. In exploratory data analysis, however, the categories have only
limited value as such. The clusters should be illustrated somehow to aid in
understanding what they are like. For example in the case of the K-means
algorithm the centroids that represent the clusters are still high-dimensional,
and some additional illustration methods are needed for visualizing them.
August 2009
Bachelor of Science in Information Technology (BScIT) – Semester 4
BT0050 – Data Warehousing & Mining – 4 Credits
(Book ID: B0038)
Assignment Set – 2 (60 Marks)
Association Analysis
In data mining, association rule learning is a popular and well researched
method for discovering interesting relations between variables in large
databases. Piatetsky-Shapiro describes analyzing and presenting strong rules
discovered in databases using different measures of interestingness. Based on
the concept of strong rules, Agrawal et al. introduced association rules for
discovering regularities between products in large-scale transaction data
recorded by point-of-sale (POS) systems in supermarkets. For example, the rule
{onions, potatoes} ⇒ {beef} found in the sales data of a supermarket would
indicate that if a customer buys onions and potatoes together, he or she is
likely to also buy beef. Such information can be used as the basis for decisions
about marketing activities such as promotional pricing or product placements.
In addition to the above example from market basket analysis, association
rules are employed today in many application areas, including Web usage
mining, intrusion
detection and bioinformatics.
The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
i.e. the ratio of the observed confidence to that expected by chance. The rule
{milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25. The conviction of
a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)). The rule
{milk, bread} ⇒ {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and
can be interpreted as the ratio of the expected frequency that X occurs without
Y (that is to say, the frequency that the rule makes an incorrect prediction) if X
and Y were independent, divided by the observed frequency of incorrect
predictions. In this example, the conviction value of 1.2 shows that the rule
would be incorrect 20% more often (1.2 times as often) if the association
between X and Y were purely random chance.
The two-class (binary classification) model builders provided by Rattle are:
Decision Trees, Boosted Decision Trees, Random Forests, Support Vector
Machines, and Logistic
Regression. Whilst a model is being built you will see the cursor image change
to indicate the system is busy, and the status bar will report that a model is
being built.
We will consider each of the model builders deployed in Rattle and characterise
them through the sentences they generate and how they search for the best
sentences that capture or summarise what the data is indicating.
Cluster Analysis
Cluster analysis or clustering is the assignment of a set of observations into
subsets (called clusters) so that observations in the same cluster are similar in
some sense. Clustering is a method of unsupervised learning, and a common
technique for statistical data analysis used in many fields, including machine
learning, data mining, pattern recognition, image analysis and bioinformatics.
Types of clustering
Data clustering algorithms can be hierarchical. Hierarchical algorithms find
successive clusters using previously established clusters. These algorithms can
be either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative
algorithms begin with each element as a separate cluster and merge them into
successively larger clusters. Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters. Partitional algorithms
typically determine all clusters at once, but can also be used as divisive
algorithms in hierarchical clustering. Density-based clustering algorithms
are devised to discover arbitrary-shaped clusters. In this approach, a cluster is
regarded as a region in which the density of data objects exceeds a threshold.
DBSCAN and OPTICS are two typical algorithms of this kind.
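A compact sketch of the density-based idea behind DBSCAN: clusters grow from core points (points whose eps-neighbourhood holds at least min_pts points), and points reachable from no core point are labelled noise. This is a simplified illustration, not a reference implementation:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow a cluster from each unvisited core point;
    label unreachable points as noise (-1)."""
    def neighbours(i):
        # Indices within eps of point i (including i itself).
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)      # None = unvisited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # noise (may later become a border point)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=2.0, min_pts=3)
# Two dense groups become clusters; the isolated point is noise (-1).
```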
Outlier Analysis
Rare, unusual, or just plain infrequent events are of interest in data mining in
many contexts including fraud in income tax, insurance, and online banking, as
well as for marketing. We classify analyses that focus on the discovery of such
data items as outlier analysis. A classic definition captures the concept of an
outlier as: an observation that deviates so much from other observations as to
arouse suspicion that it was generated by a different mechanism. Outlier
detection algorithms often fall into one of the categories of distance-based
methods, density-based methods, projection-based methods, and distribution-
based methods. A general approach to identifying outliers is to assume a
known distribution for the data and to examine the deviation of individuals
from the distribution. Such approaches are common in statistics, but they do
not scale well. Distance-based methods are common in data mining, where the
measure of an entity's outlierness is based on its distance to nearby entities;
the number of nearby entities and the minimum distance are the two
parameters. Density-based approaches include the local outlier factor (LOF) of
Breunig, Kriegel, Ng, and Sander (SIGMOD 2000).
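A simple distance-based scoring sketch, using the distance to the k-th nearest neighbour as the outlierness measure (the choice of k and any cutoff threshold are the method's two parameters); a toy illustration only:

```python
def knn_outlier_scores(points, k):
    """Distance-based outlier score sketch: score each point by its distance
    to its k-th nearest neighbour; large scores suggest outliers."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    scores = []
    for i, p in enumerate(points):
        # Sorted distances to all other points; take the k-th smallest.
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(ds[k - 1])
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
# The isolated point (10, 10) gets by far the largest score.
```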
This definition of the data warehouse focuses on data storage. However, the
means to retrieve and analyze data, to extract, transform and load data, and to
manage the data dictionary are also considered essential components of a data
warehousing system. Many references to data warehousing use this broader
context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform, and load data into the repository,
and tools to manage and retrieve metadata.
An enterprise data warehouse provides a central database for decision support
throughout the enterprise. An operational data store has a broad, enterprise-
wide scope, but unlike a true enterprise data warehouse its data is refreshed in
near real time and used for routine business activity.
History
Figure 1: Simple schematic for a data warehouse. The ETL process extracts
information from the source databases, transforms it and then loads it into the
data warehouse.
Figure 2: Simple schematic for a data-integration solution. A system designer
constructs a mediated schema against which users can run queries. The virtual
database interfaces with the source databases via wrapper code if required.
Example
Consider a web application where a user can query a variety of information
about cities (such as crime statistics, weather, hotels, demographics, etc).
Traditionally, the information must exist in a single database with a single
schema. But any single enterprise would find information of this breadth
somewhat difficult and expensive to collect. Even if the resources exist to
gather the data, it would likely duplicate data in existing crime databases,
weather websites, and census data.
A data-integration solution may address this problem by considering these
external resources as materialized views over a virtual mediated schema,
resulting in "virtual data integration". This means application developers
construct a virtual schema — the mediated schema — to best model the kinds of
answers their users want. Next, they design "wrappers" or adapters for each
data source, such as the crime database and weather website. These adapters
simply transform the local query results (those returned by the respective
websites or databases) into an easily processed form for the data integration
solution (see figure 2). When an application-user queries the mediated schema,
the data-integration solution transforms this query into appropriate queries
over the respective data sources. Finally, the virtual database combines the
results of these queries into the answer to the user's query.
This solution offers the convenience of adding new sources by simply
constructing an adapter for them. It contrasts with ETL systems or with a single
database solution, which require manual integration of the entire new dataset
into the system.
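The wrapper approach described above can be sketched as follows; all source names, schemas, and values here are hypothetical, invented purely to illustrate how uniformly shaped wrapper results are combined into one answer over a mediated schema:

```python
# Each wrapper converts a source's native result format into rows of a
# shared mediated schema: {"city", "attribute", "value"}.
def crime_db_wrapper(city):
    # Pretend local database keyed by city name.
    local = {"springfield": 1200, "shelbyville": 800}
    return [{"city": city, "attribute": "crime_incidents",
             "value": local[city]}]

def weather_site_wrapper(city):
    # Pretend web source returning nested dicts.
    local = {"springfield": {"temp_c": 21}, "shelbyville": {"temp_c": 19}}
    return [{"city": city, "attribute": "temp_c",
             "value": local[city]["temp_c"]}]

def query_mediated_schema(city, wrappers):
    """The integration layer fans the query out to every wrapper and
    combines the uniformly shaped rows into one answer."""
    rows = []
    for wrapper in wrappers:
        rows.extend(wrapper(city))
    return rows

answer = query_mediated_schema("springfield",
                               [crime_db_wrapper, weather_site_wrapper])
```

Adding a new source then amounts to writing one more wrapper function, mirroring the convenience claimed above.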
Definitions
Data integration systems are formally defined as a triple ⟨G, S, M⟩, where G is
the global (or mediated) schema, S is the heterogeneous set of source
schemas, and M is the mapping that maps queries between the source and the
global schemas.
Both G and S are expressed in languages over alphabets composed of symbols
for each of their respective relations. The mapping M consists of assertions
between queries over G and queries over S. When users pose queries over the
data integration system, they pose queries over G and the mapping then
asserts connections between the elements in the global schema and the source
schemas.
A database over a schema is defined as a set of sets, one for each relation (in a
relational database). The database corresponding to the source schema S
would comprise the set of sets of tuples for each of the heterogeneous data
sources and is called the source database. Note that this single source
database may actually represent a collection of disconnected databases. The
database corresponding to the virtual mediated schema G is called the global
database. The global database must satisfy the mapping M with respect to the
source database. The legality of this mapping depends on the nature of the
correspondence between G and S. Two popular ways to model this
correspondence exist: Global as View or GAV and Local as View or LAV.
Figure 3: Illustration of tuple space of the GAV and LAV mappings. In GAV, the
system is constrained to the set of tuples mapped by the mediators while the
set of tuples expressible over the sources may be much larger and richer. In
LAV, the system is constrained to the set of tuples in the sources while the set
of tuples expressible over the global schema can be much larger. Therefore LAV
systems must often deal with incomplete answers.
GAV systems model the global database as a set of views over S. In this case M
associates to each element of G a query over S. Query processing becomes a
straightforward operation due to the well-defined associations between G and
S. The burden of complexity falls on implementing mediator code instructing
the data integration system exactly how to retrieve elements from the source
databases. If any new sources join the system, considerable effort may be
necessary to update the mediator, thus the GAV approach appears preferable
when the sources seem unlikely to change.
In a GAV approach to the example data integration system above, the system
designer would first develop mediators for each of the city information sources
and then design the global schema around these mediators. For example,
consider if one of the sources served a weather website. The designer would
likely then add a corresponding element for weather to the global schema.
Then the bulk of effort concentrates on writing the proper mediator code that
will transform predicates on weather into a query over the weather website.
This effort can become complex if some other source also relates to weather,
because the designer may need to write code to properly combine the results
from the two sources.On the other hand, in LAV, the source database is
modeled as a set of views over G. In this case M associates to each element of
S a query over G. Here the exact associations between G and S are no longer
well-defined. As is illustrated in the next section, the burden of determining how
to retrieve elements from the sources is placed on the query processor. The
benefit of an LAV modeling is that new sources can be added with far less work
than in a GAV system, thus the LAV approach should be favored in cases where
the mediated schema is more likely to change.
In an LAV approach to the example data integration system above, the system
designer designs the global schema first and then simply inputs the schemas of
the respective city information sources. Consider again if one of the sources
serves a weather website. The designer would add corresponding elements for
weather to the global schema only if none existed already. Then programmers
write an adapter or wrapper for the website and add a schema description of
the website's results to the source schemas. The complexity of adding the new
source moves from the designer to the query processor.
Query processing
The theory of query processing in data integration systems is commonly
expressed using conjunctive queries. One can loosely think of a conjunctive
query as a logical function applied to the relations of a database such as "f(A,B)
where A < B". If a tuple or set of tuples is substituted into the rule and satisfies
it (makes it true), then we consider that tuple as part of the set of answers in
the query. While formal languages like Datalog express these queries concisely
and without ambiguity, common SQL queries count as conjunctive queries as
well.
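The "f(A, B) where A < B" reading can be sketched directly: a conjunctive query acts as a filter that keeps exactly the tuples whose substitution makes the condition true (a toy illustration):

```python
def answers(relation, condition):
    """Return the tuples of `relation` that satisfy the query condition,
    i.e. those whose substitution into the rule makes it true."""
    return [t for t in relation if condition(*t)]

# A toy relation f and the conjunctive query "f(A, B) where A < B".
f = [(1, 3), (4, 2), (2, 2), (0, 5)]
result = answers(f, lambda a, b: a < b)   # [(1, 3), (0, 5)]
```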
In terms of data integration, "query containment" represents an important
property of conjunctive queries. A query A contains another query B (denoted
B ⊆ A) if the results of applying B are a subset of the results of applying A for
any database. The two queries are said to be equivalent if the resulting sets are
equal for any database. This is important because in both GAV and LAV
systems, a user poses conjunctive queries over a virtual schema represented
by a set of views, or "materialized" conjunctive queries. Integration seeks to
rewrite the queries represented by the views to make their results equivalent or
maximally contained by our user's query. This corresponds to the problem of
answering queries using views (AQUV).
In GAV systems, a system designer writes mediator code to define the query-
rewriting. Each element in the user's query corresponds to a substitution rule
just as each element in the global schema corresponds to a query over the
source. Query processing simply expands the subgoals of the user's query
according to the rule specified in the mediator and thus the resulting query is
likely to be equivalent. While the designer does the majority of the work
beforehand, some GAV systems such as Tsimmis involve simplifying the
mediator description process.
In LAV systems, queries undergo a more radical process of rewriting because no
mediator exists to align the user's query with a simple expansion strategy. The
integration system must execute a search over the space of possible queries in
order to find the best rewrite. The resulting rewrite may not be an equivalent
query but maximally contained, and the resulting tuples may be incomplete. As
of 2009 the MiniCon algorithm is the leading query rewriting algorithm for LAV
data integration systems. In general, the problem of query rewriting is NP-
complete. If the space of rewrites is relatively small, this does not pose a
problem — even for integration systems with hundreds of sources.
Ques 4 Discuss:
• Mining Multi-level Association rules from Transactional
Databases
Due to the development of information systems and technology, businesses
increasingly have the capability to accumulate huge amounts of retail data in
large databases. In recent marketing research, products' discounts have rarely
been considered an important decision variable. Although a few studies have
analyzed the effect of discounts on sales, they ignore their temporal
characteristics: in the real world, each product may appear with different
discount rates in different time periods. Moreover, these studies have
considered discounts at a single concept level, so the discovered knowledge is
less concrete and implementing the results of the analyses becomes difficult.
The problem addressed in this paper is the consideration of
time intervals that a specific discount appears on a specific product. The
proposed algorithm makes it possible to acquire more concrete and specific
knowledge corresponding to association between products and their discounts
as well as implementation of its results.
Data can also be reduced by applying many other methods, ranging from
wavelet transformation and principal components analysis to discretization
techniques, such as binning, histogram analysis, and clustering.
• Partitioning Methods
The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low. Cluster similarity is measured in regard to the
mean value of the objects in a cluster, which can be viewed as the cluster’s
centroid or center of gravity. “How does the k-means algorithm work?” The k-
means algorithm proceeds as follows. First, it randomly selects k of the objects,
each of which initially represents a cluster mean or center. For each of the
remaining objects, an object is assigned to the cluster to which it is the most
similar, based on the distance between the object and the cluster mean. It then
computes the new mean for each cluster. This process iterates until the
criterion function converges. Typically, the square-error criterion is used,
defined as

    E = Σᵢ₌₁ᵏ Σ_{p ∈ Cᵢ} |p − mᵢ|²

where E is the sum of the squared error for all objects in the data set, p is the
point in space representing a given object, and mᵢ is the mean of cluster Cᵢ
(both p and mᵢ are multidimensional). In other words, for each object in each
cluster, the distance from the object to its cluster center is squared, and the
distances are summed. This criterion tries to make the resulting k clusters as
compact and as separate as possible. The k-means procedure is summarized in
Figure 7.2.
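The procedure described above can be sketched in Python; for simplicity this toy version takes the first k points as initial centers, whereas the algorithm as described selects them at random:

```python
def k_means(points, k, iters=100):
    """Plain k-means sketch: alternate (1) assigning each point to its
    nearest center and (2) recomputing each center as the mean of its
    assigned points, until the centers stop moving."""
    centers = list(points[:k])          # simplistic deterministic init
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # New center = mean of assigned points; keep old center if empty.
        new_centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                       else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:      # criterion converged
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, clusters = k_means(points, k=2)
# The two well-separated blobs are recovered as the two clusters.
```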
The method works well when the clusters are compact clouds that are rather
well separated from one another. It is relatively scalable and efficient in
processing large data sets because the computational complexity of the
algorithm is O(nkt), where n is the total number of objects, k is the number of
clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. The
method
often terminates at a local optimum. The k-means method, however, can be
applied only when the mean of a cluster is defined. This may not be the case in
some applications, such as when data with categorical attributes are involved.
The necessity for users to specify k, the number of clusters, in advance can be
seen as a disadvantage. The k-means method is not suitable for discovering
clusters with nonconvex shapes or clusters of very different sizes. Moreover, it
is sensitive to noise and outlier data points, because a small number of such
data can substantially influence the mean value. There are quite a few variants
of the k-means method. These can differ in the selection of the initial k means,
the calculation of dissimilarity, and the strategies for calculating cluster means.
An interesting strategy that often yields good results is to first apply a
hierarchical agglomeration algorithm, which determines the number of clusters
and finds an initial clustering, and then use iterative relocation to improve the
clustering. Another variant of k-means is the k-modes method, which extends
the k-means paradigm to cluster categorical data by replacing the means of
clusters with modes, using new dissimilarity measures to deal with categorical
objects and a frequency-based method to update the modes of clusters. The
k-means and k-modes methods can be integrated to cluster data with mixed
numeric and categorical values.
The EM (Expectation-Maximization) algorithm (which will be further discussed in
Section 7.8.1) extends the k-means paradigm in a different way. Whereas the k-
means algorithm assigns each object to a cluster, in EM each object is assigned
to each cluster according to a weight representing its probability of
membership. In other words, there are no strict boundaries between clusters.
Therefore, new means are computed based on weighted measures. “How can
we make the k-means algorithm more scalable?” A recent approach to scaling
the k-means algorithm is based on the idea of identifying three kinds of regions
in data: regions that are compressible, regions that must be maintained in main
memory, and regions that are discardable. An object is discardable if its
membership in a cluster is ascertained. An object is compressible if it is not
discardable but belongs to a tight subcluster. A data structure known as a
clustering feature is used to summarize objects that have been discarded or
compressed. If an object is neither discardable nor compressible, then it should
be retained in main memory. To achieve scalability, the iterative clustering
algorithm only includes the clustering features of the compressible objects and
the
objects that must be retained in main memory, thereby turning a secondary-
memory-based algorithm into a main-memory-based algorithm. An alternative
approach to scaling the k-means algorithm explores the microclustering idea,
which first groups nearby objects into “microclusters” and then performs k-
means clustering on the microclusters. Microclustering is further discussed in
Section 7.