
Data warehouse & Data Mining 10IS74

VTU Question Paper Solution


Unit -1
1a.What is operational data store (ods)? Explain with neat diagram. (08 Marks) Jan 2014

ODS (Operational Data Store): An ODS is created by periodically copying data into it from the live OLTP systems, so that reporting tools can query it without loading the transaction systems. An ODS can act as an integration point between operational systems and the data warehouse, but it is not sufficient on its own for full enterprise analytical processing.
Difference between an ODS and a data warehouse:
–Operational systems handle the major tasks of a traditional relational DBMS: day-to-day operations such as purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
–The time horizon of the data differs: an operational database holds current-value data, whereas data warehouse data provides information from a historical perspective (e.g., the past 5-10 years).
–Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.

1b. What is ETL? Explain the steps in ETL (04 Marks) Jan 2015, (07 Marks)
Jan 2014, (06 Marks) June/ July 2015, (04 Marks) June/ July 2016
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL
refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps too
simplistic, because it omits the transportation phase and implies that each of the other phases of
the process is distinct. Nevertheless, the entire process is known as ETL.
The methodology and tasks of ETL have been well known for many years, and are not
necessarily unique to data warehouse environments: a wide variety of proprietary applications
and database systems are the IT backbone of any enterprise. Data has to be shared between
applications or systems, trying to integrate them, giving at least two applications the same picture
of the world. This data sharing was mostly addressed by mechanisms similar to what we now
call ETL.

ETL Basics in Data Warehousing

What happens during the ETL process? The following tasks are the main actions in the process.

Extraction of Data

During extraction, the desired data is identified and extracted from many different sources,
including database systems and applications. Very often, it is not possible to identify the specific
subset of interest, therefore more data than necessary has to be extracted, so the identification of
the relevant data will be done at a later point in time. Depending on the source system's
capabilities (for example, operating system resources), some transformations may take place
during this extraction process. The size of the extracted data varies from hundreds of kilobytes
up to gigabytes, depending on the source system and the business situation. The same is true for
the time delta between two (logically) identical extractions: the time span may vary between
days/hours and minutes to near real-time. Web server log files, for example, can easily grow to
hundreds of megabytes in a very short period of time.

Transportation of Data

After data is extracted, it has to be physically transported to the target system or to an


intermediate system for further processing. Depending on the chosen way of transportation, some
transformations can be done during this process, too. For example, a SQL statement which
directly accesses a remote target through a gateway can concatenate two columns as part of the
SELECT statement.
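
As a concrete illustration of the three phases, the following minimal Python sketch extracts rows from a source database, applies a small transformation (concatenating two columns, as in the SQL example just mentioned), and loads the result into a target table. The in-memory sqlite3 databases, table names and column names are hypothetical and chosen only so the sketch is self-contained.

import sqlite3

# Hypothetical source (OLTP) and target (warehouse) databases, created in memory
# so the sketch runs as-is; real ETL would connect to the live systems instead.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (customer_id INTEGER, first_name TEXT, last_name TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, "Ann", "Rao", 120.0), (2, "Ben", "Shah", 75.5)])

# Extraction: identify and pull the subset of interest from the source system.
rows = src.execute("SELECT customer_id, first_name, last_name, amount FROM sales").fetchall()

# Transformation: e.g. concatenate two columns into a single attribute.
transformed = [(cid, f"{first} {last}", amount) for cid, first, last, amount in rows]

# Loading: insert the transformed records into the warehouse table.
tgt.execute("CREATE TABLE dw_sales (customer_id INTEGER, customer_name TEXT, amount REAL)")
tgt.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", transformed)
tgt.commit()
print(tgt.execute("SELECT * FROM dw_sales").fetchall())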

The emphasis in many of the examples in this section is scalability. Many long-time users of
Oracle Database are experts in programming complex data transformation logic using
PL/SQL; the same transformations can often be expressed as implementations that take advantage
of Oracle's SQL functionality, especially for ETL and the parallel query infrastructure.

1c. What are the guidelines for implementing the data warehouse?
(05 Marks) Jan 2014, (08 Marks) Jan 2015
 Limited reporting in the source systems

 The desire to use a better and more powerful reporting tool than what the source systems
offer
 Only a few people have the security to access the source systems and you want to allow
others to generate reports
 A company owns many retail stores each of which track orders in its own database and
you want to consolidate the databases to get real-time inventory levels throughout the day
 You need to gather data from various source systems to get a true picture of a customer
so you have the latest info if the customer calls customer service. Customer data such as
customer info, support history, call logs, and order info. Or medical data to get a true
picture of a patient so the doctor has the latest info throughout the day: outpatient
department records, hospitalization records, diagnostic records, and pharmaceutical
purchase records

Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouse systems are
valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms
have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that
with competition mounting in every industry, data warehousing is the latest must-have marketing
weapon - a way to retain customers by learning more about their needs.
Data warehouses have been defined in many ways, making it difficult to formulate a rigorous
definition. Loosely speaking, a data warehouse refers to a database that is maintained separately
from an organization’s operational databases. Data warehouse systems allow for the integration
of a variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis.
According to William H. Inmon, a leading architect in the construction of data warehouse
systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process." This short, but
comprehensive definition presents the major features of a data warehouse. The four keywords,
subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from
other data repository systems, such as relational database systems, transaction processing
systems, and file systems

2a.Explain ODS (Operational Data Store) and its structure with a neat figure.
(07 Marks) June/ July 2014, (06 Marks) June/ July 2015

Within W. H. Inmon's concept of a Corporate Information Factory (CIF), Operational Data Stores
(ODS) are architectural entities specifically designed for the purposes of operational BI. They play
a dual role, supporting both decision support and operational transaction processing. Like data
warehouses, they integrate data from various systems.
In fact, as figure 2.1 shows, they can be used as intermediate layers between operational systems
and a data warehouse. However, they differ from data warehouses in three important aspects:
their contents are volatile, meaning that they change at much higher frequencies than the contents
of a data warehouse; they keep detailed, highly granular data; and they are current-valued in the
sense that they contain no (or little) historical data. In particular, the concentration on current data
keeps the size of an Operational Data Store manageable and ensures acceptable query run times on
fresh operational data.

2b. Explain the implementation steps for data warehouse.


(07 Marks) June/ July 2014, (08 Marks) June/ July 2016
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouse systems are
valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms
have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that
with competition mounting in every industry, data warehousing is the latest must-have marketing
weapon - a way to retain customers by learning more about their needs.

Data warehouses have been defined in many ways, making it difficult to formulate a rigorous
definition. Loosely speaking, a data warehouse refers to a database that is maintained separately
from an organization’s operational databases. Data warehouse systems allow for the integration
of a variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis.

According to William H. Inmon, a leading architect in the construction of data warehouse


systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process." This short, but
comprehensive definition presents the major features of a data warehouse. The four keywords,
subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from
other data repository systems, such as relational database systems, transaction processing
systems, and file systems

2c. Write the differences between OLTP and data warehouse.


(06 Marks) June/ July 2014, (08 Marks) June/ July 2015

3. What is ODS? How does it differ from data warehouse? Explain.


(08 Marks) Jan 2015, (04 Marks) June/ July 2016
The purpose of an ODS is to integrate corporate data from different heterogeneous data sources in
order to facilitate operational reporting in real-time or near real-time. Usually data in the ODS
will be structured similar to the source systems, although during integration the data can be
cleaned, denormalized, and business rules applied to ensure data integrity. This integration will
happen at the lowest granular level and occur quite frequently throughout the day. Normally an
ODS will not be optimized for historical and trend analysis as this is left to the data warehouse.
And an ODS is frequently used as a data source for the data warehouse.

 An ODS is targeted for the lowest granular queries whereas a data warehouse is usually
used for complex queries against summary-level or on aggregated data
 An ODS is meant for operational reporting and supports current or near real-time
reporting requirements whereas a data warehouse is meant for historical and trend
analysis reporting usually on a large volume of data
 An ODS contains only a short window of data, while a data warehouse contains the entire
history of data

 An ODS provides information for operational and tactical decisions on current or near
real-time data while a data warehouse delivers feedback for strategic decisions leading to
overall system improvements
 In an ODS the frequency of data load could be every few minutes or hourly whereas in a
data warehouse the frequency of data loads could be daily, weekly, monthly or quarterly

4. Discuss the benefits of implementing a data warehouse (04 Marks) June/ July 2016

Benefits from a successful implementation of a data warehouse include:

• Enhanced Business Intelligence

• Increased Query and System Performance

• Business Intelligence from Multiple Sources

• Timely Access to Data

• Enhanced Data Quality and Consistency

• Historical Intelligence

• High Return on Investment

Unit -2

1a. Distinguish between OLTP and OLAP.

(04 Marks) Jan 2014, (05 Marks) June/ July 2015

1b. Explain the operation of data cube with suitable examples. (08 Marks) Jan 2014

Users of decision support systems often see data in the form of data cubes. The cube is used to
represent data along some measure of interest. Although called a "cube", it can be 2-dimensional,
3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database
and the cells in the data cube represent the measure of interest. For example, they could contain a
count for the number of times that attribute combination occurs in the database, or the minimum,
maximum, sum or average value of some attribute. Queries are performed on the cube to retrieve
decision support information.

Example: We have a database that contains transaction information relating company sales of a
part to a customer at a store location. The data cube formed from this database is a 3-dimensional
representation, with each cell (p,c,s) of the cube representing a combination of values from part,
customer and store-location. A sample data cube for this combination is shown in Figure 1. The
contents of each cell is the count of the number of times that specific combination of values
occurs together in the database. Cells that appear blank in fact have a value of zero. The cube can
then be used to retrieve information within the database about, for example, which store should
be given a certain part to sell in order to make the greatest sales.
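
A data cube of counts such as the one described above can be sketched in a few lines of Python; the (part, customer, store) transactions below are made-up values used only to show how each cell holds an aggregate for one combination of dimension values.

from collections import defaultdict

# Hypothetical transactions: (part, customer, store-location).
transactions = [
    ("p1", "c1", "Vancouver"),
    ("p1", "c1", "Vancouver"),
    ("p2", "c3", "Toronto"),
    ("p1", "c2", "Toronto"),
]

# Each cell (p, c, s) of the cube stores the count of that combination;
# combinations that never occur are implicitly zero ("blank" cells).
cube = defaultdict(int)
for part, customer, store in transactions:
    cube[(part, customer, store)] += 1

print(cube[("p1", "c1", "Vancouver")])   # 2
print(cube[("p2", "c2", "Vancouver")])   # 0 (a blank cell)

# A simple roll-up over the customer dimension: counts per (part, store).
by_part_store = defaultdict(int)
for (part, customer, store), count in cube.items():
    by_part_store[(part, store)] += count
print(by_part_store[("p1", "Vancouver")])   # 2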

1c. Write short note on


(08 Marks) Jan 2014, (8 Marks) June/ July 2014,(6 Marks) June/ July 2016
i) ROLAP ii) MOLAP iii) DATACUBE iv) FASMI
i) OLAP servers may use relational OLAP (ROLAP), or multidimensional OLAP
(MOLAP), or hybrid OLAP (HOLAP). A ROLAP server uses an extended
relational DBMS that maps OLAP operations on multidimensional data to standard
relational operations.
ii) A MOLAP server maps multidimensional data views directly to array structures.
A HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP
for historical data while maintaining frequently accessed data in a separate MOLAP
store.
iii) A multidimensional data model is typically used for the design of corporate data
warehouses and departmental data marts. Such a model can adopt a star schema,
snowflake schema, or fact constellation schema. The core of the multidimensional
model is the data cube.
2. Explain the characteristics of OLAP systems and write the comparison of OLTP and
OLAP. (12 Marks) June/ July 2014

OLTP (On-line Transaction Processing) is involved in the operation of a particular system.


OLTP is characterized by a large number of short on-line transactions (INSERT, UPDATE,
DELETE). The main emphasis for OLTP systems is put on very fast query processing,

maintaining data integrity in multi-access environments and an effectiveness measured by


number of transactions per second. An OLTP database contains detailed and current data, and the
schema used to store transactional data is the entity-relationship model (usually 3NF). It involves
queries that access individual records, for example updating your email address in a company database.

OLAP (On-line Analytical Processing) deals with Historical Data or Archival Data. OLAP is
characterized by relatively low volume of transactions. Queries are often very complex and
involve aggregations. For OLAP systems a response time is an effectiveness measure. OLAP
applications are widely used by Data Mining techniques. An OLAP database contains aggregated,
historical data, stored in multi-dimensional schemas (usually a star schema). Sometimes a query
needs to access a large amount of data in management records, for example: what was the profit of
the company last year?

3a. Why multidimensional views of data and data cubes are used? With a neat diagram,
explain data cube implementations. (10 Marks) Jan 2015, (06 Marks) June/ July 2016

The multidimensional view of data is in some ways a natural view of any enterprise for managers.
The triangle diagram in the figure shows that as we go higher in the triangle hierarchy, the managers'
need for detailed information declines.

DATA CUBE IMPLEMENTATIONS

1. Precompute and store all: This means that millions of aggregates will need to be computed and
stored. Although this is the best solution as far as query response time is concerned, the solution
is impractical, since the resources required to compute the aggregates and to store them will be
prohibitively large for a large data cube. Indexing large amounts of data is also expensive.

2. Pre compute (and store) none: This means that the aggregates are computed on the fly using
the raw data whenever a query is posed. This approach does not require additional space for
storing the cube but the query response time is likely to be very poor for large data cubes.

3. Precompute and store some: This means that we pre-compute and store the most frequently
queried aggregates and compute the others as the need arises. Some aggregates may be derivable
from the pre-computed aggregates, while for others it will be necessary to access the database (e.g.
the data warehouse) to compute them. The more aggregates we are able to pre-compute, the better
the query performance.
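
The "precompute and store some" strategy can be illustrated with a small sketch: frequently requested aggregates are held in a precomputed store, and anything else is computed on the fly from the detailed data. The fact rows, attribute names and stored aggregates below are purely illustrative assumptions.

# Detailed fact rows: (product, region, amount) -- illustrative data only.
facts = [
    ("tv", "east", 100), ("tv", "west", 150),
    ("radio", "east", 40), ("radio", "west", 60),
]

# Aggregates that were pre-computed and stored (here: total sales by product).
precomputed = {
    ("product",): {"tv": 250, "radio": 100},
}

def aggregate(group_by):
    """Return total amount grouped by the given dimension(s)."""
    key = tuple(group_by)
    if key in precomputed:               # answered directly from the stored aggregates
        return precomputed[key]
    idx = {"product": 0, "region": 1}    # otherwise computed on the fly from raw data
    result = {}
    for row in facts:
        group = tuple(row[idx[d]] for d in group_by)
        group = group[0] if len(group) == 1 else group
        result[group] = result.get(group, 0) + row[2]
    return result

print(aggregate(["product"]))            # served from the pre-computed store
print(aggregate(["region"]))             # {'east': 140, 'west': 210}, computed on demand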

3b. What are data cube operations? Explain. (10 Marks) Jan 2015 (4 Marks) Jan 2016

 A data cube supports viewing/modelling of a variable (a set of variables) of interest.


Measures are used to report the values of the particular variable with respect to a given
set of dimensions.
 A fact table stores measures as well as keys representing relationships to various
dimensions.
 Dimensions are perspectives with respect to which an organization wants to keep record.
 A star schema defines a fact table and its associated dimensions.
data cube operations
Roll‐up (drill‐up)
Drill‐down (roll down)
Slice and dice
Pivot (rotate)
Roll-up

 Takes the current aggregation level of fact values and does a further aggregation on one
or more of the dimensions.
 Equivalent to doing a GROUP BY on this dimension using the attribute hierarchy.
 Decreases the number of dimensions - removes row headers.

SELECT [attribute list], SUM([measure attribute])
FROM [table list]
WHERE [condition list]
GROUP BY [grouping list];

Drill-down

 Opposite of roll-up.
 Summarizes data at a lower level of a dimension hierarchy, thereby viewing data in a
more specialized level within a dimension.
 Increases the number of dimensions - adds new headers

Slice

 Performs a selection on one dimension of the given cube, resulting in a sub-cube.


 Reduces the dimensionality of the cubes.

Sets one or more dimensions to specific values and keeps a subset of dimensions for selected
values

Dice

 Define a sub-cube by performing a selection of one or more dimensions.


 Refers to range select condition on one dimension, or to select condition on more than
one dimension.
 Reduces the number of member values of one or more dimensions.

Pivot (or rotate)

 Rotates the data axis to view the data from different perspectives.
 Groups data with different dimensions.
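
The operations above can be mimicked on a small table to show what roll-up, drill-down, slice, dice and pivot mean in practice. The sketch below assumes the pandas library and uses made-up sales data.

import pandas as pd

# Illustrative fact table with three dimensions and one measure.
sales = pd.DataFrame({
    "year":    [2013, 2013, 2014, 2014],
    "region":  ["east", "west", "east", "west"],
    "product": ["tv", "tv", "radio", "radio"],
    "amount":  [100, 150, 40, 60],
})

# Roll-up: aggregate away the product dimension (GROUP BY year, region).
rollup = sales.groupby(["year", "region"])["amount"].sum()

# Drill-down is the opposite direction: group by more dimensions again.
drilldown = sales.groupby(["year", "region", "product"])["amount"].sum()

# Slice: fix one dimension to a single value (year = 2013).
slice_2013 = sales[sales["year"] == 2013]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[(sales["year"] == 2013) & (sales["region"].isin(["east", "west"]))]

# Pivot (rotate): view region on rows and year on columns.
pivoted = sales.pivot_table(index="region", columns="year",
                            values="amount", aggfunc="sum")
print(rollup, pivoted, sep="\n\n")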

4 a. Discuss the FASMI characteristics of OLAP.


(05 Marks) June/ July 2015, (04 Marks) June/ July 2016
FASMI Characteristics
In the FASMI characterization of OLAP systems, the name is derived from the first letters of the
characteristics, which are:
Fast: As noted earlier, most OLAP queries should be answered very quickly, perhaps within
seconds. The performance of an OLAP system has to be like that of a search engine. If the
response takes more than say 20 seconds, the user is likely to move away to something else
assuming there is a problem with the query. Achieving such performance is difficult. The data
structures must be efficient. The hardware must be powerful enough for the amount of data and
the number of users. Full pre-computation of aggregates helps but is often not practical due to
the large number of aggregates. One approach is to pre-compute the most commonly queried
aggregates and compute the remaining on-the-fly.
Analytic: An OLAP system must provide rich analytic functionality and it is expected that most
OLAP queries can be answered without any programming. The system should be able to cope
with any relevant queries for the application and the user. Often the analysis will be using the
vendor’s own tools although OLAP software capabilities differ widely between products in the
market.
Shared: An OLAP system is a shared resource, although it is unlikely to be shared by hundreds of
users. An OLAP system is likely to be accessed only by a select group of managers and may be
used by merely dozens of users. Being a shared system, an OLAP system should provide
adequate security for confidentiality as well as integrity.
Multidimensional: This is the basic requirement. Whatever OLAP software is being used, it
must provide a multidimensional conceptual view of the data. It is because of the
multidimensional view of data that we often refer to the data as a cube. A dimension often has
hierarchies that show parent / child relationships between the members of a dimension. The
multidimensional structure should allow such hierarchies.
Information: OLAP systems usually obtain information from a data warehouse. The system
should be able to handle a large amount of input data. The capacity of an OLAP system to handle
information and its integration with the data warehouse may be critical.

4b. Explain Codd's OLAP rules. (10 Marks) June/ July 2015

Codd’s OLAP Characteristics


Codd et al.'s 1993 paper listed 12 characteristics (or rules) of OLAP systems. Another six followed
in 1995. Codd restructured the 18 rules into four groups. These rules provide another
point of view on what constitutes an OLAP system.
1. Multidimensional conceptual view: As noted above, this is a central characteristic of an OLAP
system. By requiring a multidimensional view, it is possible to carry out operations like slice and
dice.

2. Accessibility (OLAP as a mediator): The OLAP software should be sitting between data
sources (e.g. data warehouse) and an OLAP front-end.

3. Batch extraction vs. interpretive: An OLAP system should provide multidimensional data
staging plus pre calculation of aggregates in large multidimensional databases.

4. Multi-user support: Since the OLAP system is shared, the OLAP software should provide the
normal database operations, including retrieval, update, concurrency control, integrity and
security.

5. Storing OLAP results: OLAP results data should be kept separate from source data. Read-
write OLAP applications should not be implemented directly on live transaction data if OLAP
source systems are supplying information to the OLAP system directly.
6. Extraction of missing values: The OLAP system should distinguish missing values from zero
values. A large data cube may have a large number of zeros as well as some missing values. If a
distinction is not made between zero values and missing values, the aggregates are likely to be
computed incorrectly.

7. Treatment of missing values: An OLAP system should ignore all missing values regardless of
their source. Correct aggregate values will be computed once the missing values are ignored.

8. Uniform reporting performance: Increasing the number of dimensions or database size


should not significantly degrade the reporting performance of the OLAP system. This is a good
objective although it may be difficult to achieve in practice.

9. Generic dimensionality: An OLAP system should treat each dimension as equivalent in both
its structure and operational capabilities. Additional operational capabilities may be granted to
selected dimensions but such additional functions should be grantable to any dimension.

10. Unlimited dimensions and aggregation levels: An OLAP system should allow unlimited
dimensions and aggregation levels. In practice, the number of dimensions is rarely more than 10
and the number of hierarchies rarely more than six

Unit -3

1a. Discuss the Tasks of data mining with suitable examples.


(10marks) Jan 2014, (06 Marks) Jan 2015, (10 Marks) June/ July 2016
Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information -
information that can be used to increase revenue, cut costs, or both. Data mining software is one
of a number of analytical tools for analyzing data. It allows users to analyze data from many
different dimensions or angles, categorize it, and summarize the relationships identified.
Technically, data mining is the process of finding correlations or patterns among dozens of fields
in large relational databases.

While large-scale information technology has been evolving separate transaction and analytical
systems, data mining provides the link between the two. Data mining software analyzes
relationships and patterns in stored transaction data based on open-ended user queries. Several
types of analytical software are available: statistical, machine learning, and neural networks.
Generally, any of four types of relationships are sought:
 Classes: Stored data is used to locate data in predetermined groups. For example, a
restaurant chain could mine customer purchase data to determine when customers visit
and what they typically order. This information could be used to increase traffic by
having daily specials.
 Clusters: Data items are grouped according to logical relationships or consumer
preferences. For example, data can be mined to identify market segments or consumer
affinities.
 Associations: Data can be mined to identify associations. The beer-diaper example is an
example of associative mining.
 Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack being
purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
 Extract, transform, and load transaction data onto the data warehouse system.
 Store and manage the data in a multidimensional database system.

 Provide data access to business analysts and information technology professionals.


 Analyze the data by application software.
 Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
 Artificial neural networks: Non-linear predictive models that learn through training and
resemble biological neural networks in structure.
 Genetic algorithms: Optimization techniques that use processes such as genetic
combination, mutation, and natural selection in a design based on the concepts of natural
evolution.
 Decision trees: Tree-shaped structures that represent sets of decisions. These decisions
generate rules for the classification of a dataset. Specific decision tree methods include
Classification and Regression Trees (CART) and Chi Square Automatic Interaction
Detection (CHAID) . CART and CHAID are decision tree techniques used for
classification of a dataset. They provide a set of rules that you can apply to a new
(unclassified) dataset to predict which records will have a given outcome. CART
segments a dataset by creating 2-way splits while CHAID segments using chi square tests
to create multi-way splits. CART typically requires less data preparation than CHAID.
 Nearest neighbor method: A technique that classifies each record in a dataset based on a
combination of the classes of the k record(s) most similar to it in a historical dataset
(where k ≥ 1). Sometimes called the k-nearest neighbor technique.
 Rule induction: The extraction of useful if-then rules from data based on statistical
significance.
 Data visualization: The visual interpretation of complex relationships in
multidimensional data. Graphics tools are used to illustrate data relationships.

1b. Explain shortly any five data pre-processing approaches.


(10marks) Jan 2014, (10 Marks) June/ July 2014, (10 Marks) Jan 2015
Data Preprocessing
– Aggregation
– Sampling
– Dimensionality Reduction

– Feature Subset Selection


– Feature Creation
– Discretization and Binarization
– Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
–Data reduction
Reduce the number of attributes or objects
–Change of scale
Cities aggregated into regions, states, countries, etc.
More "stable" data: aggregated data tends to have less variability.
Sampling
Sampling is the main technique employed for data selection.
–It is often used for both the preliminary investigation of the data and the final data analysis.
–Statisticians sample because obtaining the entire set of data of interest is too expensive or time
consuming.
–Sampling is used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
–Objects are not removed from the population as they are selected for the sample.
–In sampling with replacement, the same object can be picked up more than once.
Stratified sampling
–Split the data into several partitions; then draw random samples from each partition
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining algorithms
–Allow data to be more easily visualized
–May help to eliminate irrelevant features or reduce noise

Techniques
–Principal Component Analysis
–Singular Value Decomposition
–Others: supervised and non-linear techniques
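
A brief sketch of one of these techniques, Principal Component Analysis, assuming scikit-learn is available and using randomly generated data purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# 100 illustrative objects described by 10 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project the data onto the 2 principal components that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept by each component
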
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
–duplicate much or all of the information contained in one or more other attributes
–Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
–contain no information that is useful for the data mining task at hand
–Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Subset Selection Techniques:
–Brute-force approach: try all possible feature subsets as input to the data mining algorithm
–Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
–Filter approaches: features are selected before the data mining algorithm is run
–Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
Feature Creation
Create new attributes that can capture the important information in a data set much more
efficiently than the original attributes
–Feature Extraction: domain-specific
–Mapping Data to New Space
–Feature Construction: combining features

2a. Explain four types of attributes with statistical operations and examples.
(06 Marks) June/ July 2014
There are different types of attributes
–Nominal
Examples: ID numbers, eye color, zip codes
–Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall,
medium, short}
–Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
–Ratio
Examples: temperature in Kelvin, length, time, counts

2b. Two binary vectors are given below: (04 Marks) June/ July 2014
X = (1,0,0,0, 0, 0, 0, 0, 0, 0)
Y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

Calculate: (i) SMC (ii) Jaccard similarity coefficient (iii) Hamming distance.
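
The printed solution does not include the worked answer, so the values below are computed directly from the standard definitions, shown as a short Python sketch:

X = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four kinds of matches between the binary vectors.
f11 = sum(1 for x, y in zip(X, Y) if x == 1 and y == 1)   # 0
f10 = sum(1 for x, y in zip(X, Y) if x == 1 and y == 0)   # 1
f01 = sum(1 for x, y in zip(X, Y) if x == 0 and y == 1)   # 2
f00 = sum(1 for x, y in zip(X, Y) if x == 0 and y == 0)   # 7

smc     = (f11 + f00) / (f11 + f10 + f01 + f00)   # (0 + 7) / 10 = 0.7
jaccard = f11 / (f11 + f10 + f01)                 # 0 / 3 = 0.0
hamming = f10 + f01                               # 3 differing positions

print(smc, jaccard, hamming)   # 0.7 0.0 3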

3. Write a short note on data mining applications. ( 04 Marks) Jan 2015


Although data mining is still in its infancy, companies in a wide range of industries - including
retail, finance, health care, manufacturing, transportation, and aerospace - are already using data
mining tools and techniques to take advantage of historical data. By using pattern recognition
technologies and statistical and mathematical techniques to sift through warehoused information,
data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions
and anomalies that might otherwise go unnoticed.
For businesses, data mining is used to discover patterns and relationships in the data in order to
help make better business decisions. Data mining can help spot sales trends, develop smarter
marketing campaigns, and accurately predict customer loyalty. Specific uses of data mining
include:
 Market segmentation - Identify the common characteristics of customers who buy the
same products from your company.
 Customer churn - Predict which customers are likely to leave your company and go to a
competitor.
 Fraud detection - Identify which transactions are most likely to be fraudulent.
 Direct marketing - Identify which prospects should be included in a mailing list to obtain
the highest response rate.
 Interactive marketing - Predict what each individual accessing a Web site is most likely
interested in seeing.
 Market basket analysis - Understand what products or services are commonly purchased
together; e.g., beer and diapers.
 Trend analysis - Reveal the difference between a typical customer this month and last.
Data mining technology can generate new business opportunities by:

Automated prediction of trends and behaviors: Data mining automates the process of finding
predictive information in a large database. Questions that traditionally required extensive hands-
on analysis can now be directly answered from the data. A typical example of a predictive
problem is targeted marketing. Data mining uses data on past promotional mailings to identify
the targets most likely to maximize return on investment in future mailings. Other predictive
problems include forecasting bankruptcy and other forms of default, and identifying segments of
a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns: Data mining tools sweep through
databases and identify previously hidden patterns. An example of pattern discovery is the
analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.
Using massively parallel computers, companies dig through volumes of data to discover patterns
about their customers and products. For example, grocery chains have found that when men go
to a supermarket to buy diapers, they sometimes walk out with a six-pack of beer as well. Using
that information, it's possible to lay out a store so that these items are closer.
AT&T, A.C. Nielson, and American Express are among the growing ranks of companies
implementing data mining techniques for sales and marketing. These systems are crunching
through terabytes of point-of-sale data to aid analysts in understanding consumer behavior and
promotional strategies. Why? To gain a competitive advantage and increase profitability!
Similarly, financial analysts are plowing through vast sets of financial records, data feeds, and
other information sources in order to make investment decisions. Health-care organizations are
examining medical records to understand trends of the past so they can reduce costs in the future.

4a. What is data preprocessing? Explain various data preprocessing tasks.


(14 Marks) June/ July 2015
Preprocessing

•Handle missing values

–Ignore the records with missing values

–Estimate missing values

•Remove outliers

–Find and remove those values that are significantly different from the others

•Resolve conflicts

–Merge information from different data sources; find duplicate records and identify the correct
information

Data cleaning

• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies

• Data integration

– Integration of multiple databases, data cubes, or files

• Data transformation

– Normalization and aggregation

• Data reduction

– Obtains reduced representation in volume but produces the same or similar


analytical results

• Data discretization

– Part of data reduction but with particular importance, especially for numerical
data

• Data is not always available

– E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data

• Missing data may be due to

– equipment malfunction

– inconsistent with other recorded data and thus deleted

– data not entered due to misunderstanding

– certain data may not be considered important at the time of entry

– not register history or changes of the data

• Missing data may need to be inferred.

• Use multi-resolution structure with different degrees of reduction

• Hierarchical clustering is often performed but tends to define partitions of data sets rather
than "clusters"

• Parametric methods are usually not amenable to hierarchical representation

• Hierarchical aggregation

– An index tree hierarchically divides a data set into partitions by value range of
some attributes

– Each partition can be considered as a bucket

– Thus an index tree with aggregates stored at each node is a hierarchical histogram

4b. Explain the following: (06 Marks) June/ July 2015

i) Euclidean distance: for two n-dimensional points x and y,
d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 ).

ii) Simple matching coefficient (SMC): for two binary vectors,
SMC = (f11 + f00) / (f00 + f01 + f10 + f11),
where fij is the number of attribute positions in which the first vector has value i and the second has value j.

iii) Jaccard coefficient: J = f11 / (f01 + f10 + f11). Unlike SMC, it ignores the 0-0 matches, which
makes it suitable for sparse (asymmetric) binary data.

5. For the following vectors X and Y, calculate the Cosine, Correlation, Euclidean and
Jaccard similarity. X = (1, 1, 0, 1, 0, 1); Y = (1, 1, 1, 0, 0, 1). (10 Marks) June/ July 2016
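
The worked answer is not in the printed solution; the computation below follows directly from the standard definitions and is shown as a Python sketch (numpy is assumed to be available):

import numpy as np

x = np.array([1, 1, 0, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])

cosine = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))   # 3 / (2 * 2) = 0.75
correlation = np.corrcoef(x, y)[0, 1]                          # 0.25
euclidean = np.linalg.norm(x - y)                              # sqrt(2) ≈ 1.414

f11 = np.sum((x == 1) & (y == 1))   # 3
f10 = np.sum((x == 1) & (y == 0))   # 1
f01 = np.sum((x == 0) & (y == 1))   # 1
jaccard = f11 / (f11 + f10 + f01)   # 3 / 5 = 0.6

print(cosine, correlation, euclidean, jaccard)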

Unit -4

1a. Develop the Apriori Algorithm for generating frequent itemsets.


(10Marks) Jan 2014 , (8Marks) Jan 2014 ,(10 Marks) June/ July 2014
Association Rules
Association rule mining is the process of finding patterns, associations and correlations among sets
of items in a database. The association rules generated have an antecedent and a consequent. An
association rule is a pattern of the form X & Y → Z [support, confidence], where X, Y, and Z are
items in the dataset. The left hand side of the rule, X & Y, is called the antecedent of the rule and the
right hand side, Z, is called the consequent of the rule. This means that given X and Y there is some
association with Z. Within the dataset, confidence and support are two measures to determine the
certainty or usefulness of each rule. Support is the probability that a set of items in the dataset
contains both the antecedent and consequent of the rule, or P(X ∪ Y ∪ Z). Confidence is the
probability that a set of items containing the antecedent also contains the consequent, or P(Z | X ∪ Y).
Apriori Algorithm
The Apriori algorithm is a basic algorithm for finding frequent itemsets from a set of data
by using candidate generation. Apriori uses an iterative approach known as a level-wise search
because the k-itemsets is used to determine the (k + 1)-itemsets. The search begins for the set of
frequent 1-itemsets denoted L1. L1 is then used to find the set of frequent 2-iemsets, L2. L2 is then
used to find L3 and so on. This continues until no more frequent k-itemsets can be found (Han).
To improve efficiency of a level-wise generation the Apriori algorithm uses the Apriori
property. The Apriori property states that all nonempty subsets of a frequent itemset are also a
frequent itemsets. So, if {A, B} is a frequent itemset then subsets {A} and {B} are also frequent
itemsets. The level-wise search uses this Apriori property when stepping from level to the next. If
an itemset I does not satisfy the minimal support then I will not be considered a frequent itemset. If
item A is added to the itemset I then the new itemset I ∪ A cannot occur more frequently than the
original itemset I. If an itemset fails to be considered a frequent itemset then all supersets of that
itemset will also fail that same test. The Apriori algorithm uses this property to decrease the
number of itemsets in the candidate list therefore optimizing search time. As the Apriori algorithm
steps from finding Lk-1 to finding Lk it uses a two-step process consisting of the Join Step and the
Prune Step (Han).


The first step is the Join Step and it is responsible for generating a set of candidate k-itemsets
denoted Ck from Lk-1. It does this by joining Lk-1 with itself. Apriori assumes that items within a
transaction or itemset are sorted in lexicographic order. The joining Lk-1 to Lk-1 is only performed
between itemsets that have the first (k-2) items in common with each other. Suppose itemsets I1 and
I2 are members of Lk-1. They will be joined with each other if (I1[1] = I2[1] and I1[2] = I2[2] and …
and I1[k-2] = I2[k-2] and I1[k-1] < I2[k-1]). Where I1[1] is the first item in itemset I1 and I1[k-1] is
the last item in I1 and so on for I2. It checks to make sure all k-2 items are equal and then lastly
makes sure the last item in the itemset is unequal in order to eliminate duplicate candidate k-
itemsets. The new candidate k-itemset generated from joining I1 with I2 would be I1[1] I1[2]… I1[k-
1]I2[k-1] (Han).
The second step is the Prune Step and converts Ck to Lk. The candidate list Ck contains all of the
frequent k-itemsets but it also contains k-itemsets that do not satisfy the minimum support count.
The scan of the database will determine the occurrence frequency of every candidate k-itemset to
determine if it satisfies the minimum support. This will be very costly as Ck becomes large. To
reduce the size of Ck the Apriori property is used. The Apriori property states that any (k-1)-itemset
that is not frequent cannot be a subset of a frequent k-itemset. Therefore if any (k-1)-subset of a
candidate k-itemset is not in Lk-1 then it can be removed from Ck because it too cannot be frequent.
This can be done rather quickly by utilizing a hash tree containing all frequent itemsets. The
resulting set of frequent k-itemsets is denoted Lk (Han).
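
The level-wise search with its join and prune steps can be condensed into the following Python sketch. For simplicity the join step generates k-item combinations of the currently frequent items and then prunes them; this yields the same candidates as the textbook Lk-1 join Lk-1 but is less optimized. The basket data is illustrative (it is the same TDB used in the worked example later in this unit).

from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: build candidate k-itemsets from the frequent items so far.
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))]
        # One scan of the database counts the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Illustrative usage with minimum support count 2.
baskets = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(baskets, 2))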

1c. What is association analysis? (4 marks) Jan 2014


Association rules:
 Unsupervised learning
 Used for pattern discovery
 Each rule has form: A -> B, or Left -> Right

For example: "70% of customers who purchase 2% milk will also purchase whole wheat bread."
Data mining using association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. most frequent combinations of items) Most frequently used
algorithm: Apriori algorithm.
2. Generate association rules for the above itemsets.
How to measure the strength of an association rule?
1. Using support/confidence
2. Using dependence framework
Support/confidence
Support shows the frequency of the patterns in the rule; it is the percentage of transactions that
contain both A and B, i.e.
Support = Probability(A and B)
Support = (# of transactions involving A and B) / (total number of transactions).
Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B
if they contain A, ie Confidence = Probability (B if A) = P(B/A)
Confidence = (# of transactions involving A and B) / (total number of transactions that have A).
Example:
Customer   Item purchased   Item purchased
1          pizza            beer
2          salad            soda
3          pizza            soda
4          salad            tea

If A is "purchased pizza" and B is "purchased soda" then
Support = P(A and B) = ¼
Confidence = P(B | A) = ½
Confidence does not measure if the association between A and B is random or not.
For example, if milk occurs in 30% of all baskets, information that milk occurs in 30% of all
baskets with bread is useless. But if milk is present in 50% of all baskets that contain coffee, that is
significant information. Support allows us to weed out most infrequent combinations – but
sometimes we should not ignore them, for example, if the transaction is valuable and generates a
large revenue, or if the products repel each other.
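
The support and confidence values from the pizza/soda example above can be reproduced with a few lines of Python:

# The four transactions from the example above.
transactions = [
    {"pizza", "beer"},
    {"salad", "soda"},
    {"pizza", "soda"},
    {"salad", "tea"},
]

A, B = {"pizza"}, {"soda"}

n_total = len(transactions)
n_A     = sum(1 for t in transactions if A <= t)
n_AB    = sum(1 for t in transactions if (A | B) <= t)

support    = n_AB / n_total   # P(A and B) = 1/4
confidence = n_AB / n_A       # P(B | A)   = 1/2

print(support, confidence)    # 0.25 0.5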

2. Consider the following transaction data set D, which shows 9 transactions and a list of items.
Using the Apriori algorithm, find the frequent itemsets with minimum support = 2. (10 Marks) June/ July 2014

3a.Explain FP - growth algorithm for discovering frequent item sets. What are its limitations?
(08 Marks) Jan 2015, (8 Marks) June/ July 2016
 Method
 For each item, construct its conditional pattern-base, and then its conditional FP-
tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty, or it contains only one path (single path will
generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
 If the conditional FP-tree contains a single path, simply enumerate all the patterns

3b. What is Apriori algorithm? How it is used to find frequent item sets? Explain.
(08 Marks) Jan 2015
 Join Step: Ck is generated by joining Lk-1 with itself
 Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-
itemset
Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin


Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;

 Apriori algorithm:
o Uses prior knowledge of frequent itemset properties.
o It is an iterative algorithm known as level-wise search.
o The search proceeds level-by-level as follows:
 First determine the set of frequent 1-itemsets, L1.
 Second determine the set of frequent 2-itemsets, L2, using L1; and so on.
o The complexity of computing Li is O(n) where n is the number of
transactions in the transaction database.
o Reduction of search space:
 In the worst case what is the number of itemsets in a level Li?
 Apriori uses the "Apriori Property":
Apriori Property:
 It is an anti-monotone property: if a set cannot pass a test, all
of its supersets will fail the same test as well.
 It is called anti-monotone because the property is monotonic
in the context of failing a test.
 All nonempty subsets of a frequent itemset must also be
frequent.
 An itemset I is not frequent if it does not satisfy the minimum
support threshold:
P(I) < min_sup

 If an item A is added to the itemset I, then the resulting itemset I ∪ A cannot occur more
frequently than I: if I ∪ A is not frequent, then P(I ∪ A) < min_sup

 How does the Apriori algorithm use the "Apriori property"?


o In the computation of the itemsets in Lk using Lk-1
o It is done in two steps:
 Join
 Prune

Worked example (transaction database TDB, minimum support count = 2):

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan – C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (join L1 with L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan – counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (join L2 with L2): {B,C,E}
3rd scan – {B,C,E}:2
L3 (frequent 3-itemsets): {B,C,E}:2

3c. List the measures used for evaluating association patterns. (04 Marks) Jan 2015
 Find the frequent itemsets: the sets of items that have minimum support
 A subset of a frequent itemset must also be a frequent itemset
 i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent
itemset
 Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
 Use the frequent itemsets to generate association rules.

4. a. Explain frequent itemset generation in the apriori algorithm.


(10 Marks) June/ July 2015
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck

4b. What is FP - Growth algorithm? In what way it is used to find frequency itemsets?
(03 Marks) June/ July 2015
 No candidate generation, no candidate test
 Use compact data structure
 Eliminate repeated database scan
 Basic operation is counting and FP-tree building
Completeness:
 never breaks a long pattern of any transaction
 preserves complete information for frequent pattern mining
 Compactness

 reduce irrelevant information—infrequent items are gone


 frequency descending ordering: more frequent items are more likely to be shared
 never be larger than the original database (if not count node-links and counts)
 Example: For Connect-4 DB, compression ratio could be over 100
4c. Construct the FP tree for following data set. Show the trees separately after reading
each transaction. (07 Marks) June/ July 2015
min_support = 0.5
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
Steps: Scan DB once, find frequent 1-itemset (single item pattern)
1. Order frequent items in frequency descending order
2. Scan DB again, construct FP-tree
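
A minimal sketch of FP-tree construction for these transactions, using only standard-library Python. Items with equal counts keep their first-seen order here, so the tie order (and hence the exact tree shape) may differ slightly from the worked table above; any consistent order yields a valid FP-tree.

from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_count):
    # Pass 1: find frequent single items and order them by descending frequency.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(freq, key=lambda i: -freq[i])     # stable: ties keep first-seen order
    rank = {item: r for r, item in enumerate(order)}

    # Pass 2: insert each transaction (frequent items only, in frequency order).
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1
            node = child
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Transactions from the table above; min_support = 0.5 of 5 transactions -> count 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
show(build_fp_tree(db, 3))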

5. Consider the following transaction database for an supermarket


Table 4.1. Customer Items (12 Marks) June/ July 2016
C1 Milk, egg, bread, chip
C2 Egg, popcorn, chip, beer
C3 Egg, bread chip
C4 Milk, egg, bread, popcorn, chip, beer
C5 Milk, bread, beer
C6 Egg, bread, beer
C7 Milk, bread, chip
C8 Milk, egg, bread, butter, chip
C9 Milk, egg, butter, chip
Generate all the frequent item sets. Also generate all the strong rules from the frequent itemsets
by assuming the minimum support of 30% (atleast three transactions) and minimum confidence
of 60%.

Unit -5

1a. Explain Hunt's algorithm and illustrate its working.


(8 Marks) Jan 2014, (04 Marks) June/ July 2016

Let Dt be the set of training records that reach a node t


General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
– If Dt contains records that belong to more than one class, use an attribute test to split the data into
smaller subsets. Recursively apply the procedure to each subset.
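
A compact illustrative sketch of this recursive procedure in Python. Records are dicts, the attribute for the split is chosen naively (a real implementation would use an impurity measure), and the loan-style records are made up.

from collections import Counter

def hunt(records, attributes, default_class=None):
    """Grow a tree: returns a class label (leaf) or (attribute, {value: subtree})."""
    if not records:                                   # Dt is empty -> leaf with default class yd
        return default_class
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:                        # all records belong to the same class yt
        return classes[0]
    majority = Counter(classes).most_common(1)[0][0]
    if not attributes:                                # no attribute test left -> majority-class leaf
        return majority
    attr = attributes[0]                              # naive choice of the splitting attribute
    branches = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        branches[value] = hunt(subset, attributes[1:], majority)
    return (attr, branches)

# Illustrative training records (made-up loan data).
data = [
    {"refund": "yes", "marital": "single",  "class": "no"},
    {"refund": "no",  "marital": "married", "class": "no"},
    {"refund": "no",  "marital": "single",  "class": "yes"},
]
print(hunt(data, ["refund", "marital"]))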

1b. What is rule based classifier? Explain how a rule based classifier works.
(8 Marks) Jan 2014
Rule-Based Classifier
 Classify records by using a collection of if…then…rules
 Rule: (Condition) → y

– where
• Condition is a conjunctions of attributes
• y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Name   Blood Type   Give Birth   Can Fly   Live in Water   Class
hawk   warm         no           yes       no              ?
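
The notion of a rule covering an instance can be shown with a short Python sketch that encodes R1-R5 above and classifies the hawk and grizzly bear records:

# Each rule: (condition as attribute -> required value, class label), matching R1-R5 above.
rules = [
    ({"give_birth": "no",  "can_fly": "yes"},       "Birds"),       # R1
    ({"give_birth": "no",  "live_in_water": "yes"}, "Fishes"),      # R2
    ({"give_birth": "yes", "blood_type": "warm"},   "Mammals"),     # R3
    ({"give_birth": "no",  "can_fly": "no"},        "Reptiles"),    # R4
    ({"live_in_water": "sometimes"},                "Amphibians"),  # R5
]

def covers(condition, record):
    """A rule covers a record if the record satisfies every attribute test in the condition."""
    return all(record.get(attr) == value for attr, value in condition.items())

def classify(record, rules):
    for condition, label in rules:          # fire the first rule whose condition is satisfied
        if covers(condition, record):
            return label
    return None                             # no rule covers the record

hawk = {"blood_type": "warm", "give_birth": "no", "can_fly": "yes", "live_in_water": "no"}
print(classify(hawk, rules))                # Birds (covered by R1)

grizzly = {"blood_type": "warm", "give_birth": "yes", "can_fly": "no", "live_in_water": "no"}
print(classify(grizzly, rules))             # Mammals (covered by R3)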

1c. Write the algorithm for k-nearest neighbour classification. (4 Marks) Jan 2014
Requires three things
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by
taking majority vote)
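
A short sketch of the procedure above, using Euclidean distance as the metric and a majority vote over the k nearest training records; the toy 2-D points are illustrative.

import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); returns the majority label of the k nearest neighbours."""
    # 1. Compute the distance from the query to every stored training record.
    distances = [(math.dist(x, query), label) for x, label in train]
    # 2. Identify the k nearest neighbours.
    neighbours = sorted(distances)[:k]
    # 3. Take a majority vote over their class labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative 2-D training records.
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_classify(train, (2, 1), k=3))   # "A"
print(knn_classify(train, (5, 5), k=3))   # "B"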

2a. Define classification. Draw a neat figure and explain general approach for solving
classification model. (06 Marks) June/ July 2014
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as classification rules, decision trees, or mathematical
formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the classified result from
the model
 Accuracy rate is the percentage of test set samples that are correctly
classified by the model
 Test set is independent of training set, otherwise over-fitting will occur

2b. Mention the three impurity measures for selecting best splits.
(04 Marks) June/ July 2014
The three impurity measures commonly used for selecting the best split are Entropy, the Gini
index, and the classification error.
Consider a training set that contains 60 +ve examples and 100 -ve examples, for each of the
following candidate rules:
Rule r1: covers 50 +ve examples and 5 -ve examples.
Rule r2: covers 2 +ve examples and no -ve examples.
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
Depends on number of ways to split
2-way split
Multi-way split
Multi-way split: Use as many partitions as distinct values.
Binary split: Divides values into two subsets. Need to find optimal partitioning.

2c. Determine which is the best and worst candidate rule according to,
i) Rule accuracy
ii) Likelihood ratio statistic.
iii) Laplace measure. 10 Marks) June/ July 2014

A statistical model is often a parametrized family of probability density functions or probability
mass functions. A simple-vs.-simple hypothesis test has completely specified models under both
the null and alternative hypotheses, which for convenience are written in terms of fixed values of
a notional parameter. Note that under either hypothesis, the distribution of the data is fully
specified; there are no unknown parameters to estimate. The likelihood ratio test is based on the
likelihood ratio.

3a. How decision trees are used for classification? Explain decision tree induction
algorithm for classification. (10 Marks) Jan 2015

3b. How to improve accuracy of classification? Explain. (05 Marks) Jan 2015
 Boosting increases classification accuracy
 Applicable to decision trees or Bayesian classifier
 Learn a series of classifiers, where each classifier in the series pays more attention to the
examples misclassified by its predecessor
 Boosting requires only linear time and constant space
 Assign every example an equal weight 1/N
 For t = 1, 2, …, T Do
 Obtain a hypothesis (classifier) h(t) under w(t)
 Calculate the error of h(t) and re-weight the examples based on the error
 Normalize w(t+1) to sum to 1
 Output a weighted sum of all the hypotheses, with each hypothesis weighted according to
its accuracy on the training set
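
The weighting loop described above can be sketched as follows, assuming scikit-learn decision stumps as the weak learners and labels in {-1, +1}. This is an AdaBoost-style formulation (one common way to realize the loop), offered as an illustration rather than the only boosting scheme.

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumed available; any weak learner would do

def boost(X, y, T=10):
    """Minimal AdaBoost-style sketch; y must contain labels -1 and +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # every example starts with weight 1/N
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = float(np.sum(w[pred != y]))          # weighted error of this hypothesis
        if err == 0 or err >= 0.5:                 # perfect or useless weak learner: stop
            if err == 0:
                learners.append(stump); alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1 - err) / err)      # weight the hypothesis by its accuracy
        w = w * np.exp(-alpha * y * pred)          # increase weights of misclassified examples
        w = w / w.sum()                            # normalize w(t+1) to sum to 1
        learners.append(stump); alphas.append(alpha)

    def classify(X_new):
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(scores)                     # weighted vote of all hypotheses
    return classify

# Illustrative usage on a tiny 1-D data set.
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([1, 1, 1, -1, -1, -1])
clf = boost(X, y, T=5)
print(clf(X))                                      # classifies the training points correctly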

3c. Explain the importance of evalution criteria for classification methods.


(05 Marks) Jan 2015

 Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly;
the accuracy of a predictor refers to how well a given predictor can guess the value of the
predicted attribute for new data.

 Speed − This refers to the computational cost in generating and using the classifier or
predictor.

 Robustness − It refers to the ability of the classifier or predictor to make correct predictions from
given noisy data.

 Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently,
given a large amount of data.

 Interpretability − It refers to the extent to which the classifier or predictor can be understood.

4 a. What is classification? Explain the two classification models with example.


(06 Marks) June/ July 2015


Classification is the task of assigning objects to one of several predefined categories. It is the task of mapping an input attribute set x into a class label y.

• Classification:
  – predicts categorical class labels
  – classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications:
  – credit approval
  – target marketing
  – medical diagnosis, treatment effectiveness analysis


4b. Discuss the characteristics of decision tree induction algorithms.


(10 Marks) June/ July 2015, (6 Marks) June/ July 2016

1. Decision tree induction is a non-parametric approach for building classification models.
2. Finding an optimal decision tree is an NP-complete problem, so practical algorithms use a greedy, top-down, recursive partitioning approach.
3. Constructing a decision tree is computationally inexpensive.
4. Smaller-sized trees are easy to interpret.
5. Decision trees provide an expressive representation for learning discrete-valued functions.
6. They are robust to the presence of noise.
7. Redundant attributes do not affect the accuracy of decision trees.
8. At the leaf nodes, the number of records may be too small to make a statistically significant decision about the class representation of the nodes; this is known as the data fragmentation problem. A solution is to disallow further splitting when the number of records falls below a threshold.
4c. Explain sequential covering algorithm in rule-based classifier.

(04 Marks) June/ July 2015

Name           Blood Type   Give Birth   Can Fly   Live in Water   Class
human          warm         yes          no        no              mammals
python         cold         no           no        no              reptiles
salmon         cold         no           no        yes             fishes
whale          warm         yes          no        yes             mammals
frog           cold         no           no        sometimes       amphibians
komodo         cold         no           no        no              reptiles
bat            warm         yes          yes       no              mammals
pigeon         warm         no           yes       no              birds
cat            warm         yes          no        no              mammals
leopard shark  cold         yes          no        yes             fishes
turtle         cold         no           no        sometimes       reptiles
penguin        warm         no           no        sometimes       birds
porcupine      warm         yes          no        no              mammals
eel            cold         no           no        yes             fishes
salamander     cold         no           no        sometimes       amphibians
gila monster   cold         no           no        no              reptiles
platypus       warm         no           no        no              mammals
owl            warm         no           yes       no              birds
dolphin        warm         yes          no        yes             mammals
eagle          warm         no           yes       no              birds


R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians
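The rule list above is the kind of output a sequential covering algorithm produces. A high-level sketch of the procedure (illustrative pseudocode in Python form; `learn_one_rule` and the rule's `quality` attribute are assumed helpers, not a real API):

```python
def sequential_covering(examples, classes, learn_one_rule, min_quality):
    """Grow an ordered rule list (decision list) one class at a time."""
    rule_list = []
    for target in classes:                 # extract rules for one class at a time
        remaining = list(examples)
        while True:
            # Greedily learn the single best rule for `target` on the remaining examples
            rule, covered = learn_one_rule(remaining, target)
            if rule is None or rule.quality < min_quality:
                break
            rule_list.append(rule)         # append the rule to the decision list
            # Remove the examples covered by the rule so later rules focus on the rest
            remaining = [e for e in remaining if e not in covered]
    rule_list.append("default rule")       # fires when no other rule is triggered
    return rule_list
```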

5. Consider a training data set that contains 100 positive examples and 400 negative
examples. (10 Marks) June/ July 2016

For each of the following candidate rules.

R1 : A → + (covers 4 positive & 1 negative examples)

R2 : B → + (covers 30 positive & 10 negative examples)

R3 : C → + (covers 100 positive & 90 negative examples).

Determine which is the best and worst candidate rule according to :

i) Rule accuracy

ii) FOIL’s information gain

iii) The likelihood ratio statistic.
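No worked solution is given for this question in the source. As a sketch, FOIL's information gain can be computed for each rule relative to the initial rule that covers all 100 positive and 400 negative examples, assuming the usual definition Gain = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))); rule accuracy and the likelihood ratio statistic follow exactly the same pattern as the sketch under question 2c above.

```python
import math

def foil_gain(p1, n1, p0=100, n0=400):
    """FOIL's information gain of a rule covering p1 positives and n1 negatives,
    relative to the initial rule covering p0 positives and n0 negatives."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

for name, (p, n) in {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}.items():
    print(name, round(foil_gain(p, n), 2))
# R1 ~ 8.0, R2 ~ 57.2, R3 ~ 139.6 -> R3 best, R1 worst by FOIL's information gain
```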


Unit -6

1a. What is Bayes Theorem? Show how it is used for classification (06 marks) Jan 2014
• Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
• Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
• Given training data D, the posterior probability of a hypothesis h follows from Bayes theorem:
  P(h | D) = P(D | h) P(h) / P(D)
• The MAP (maximum a posteriori) hypothesis is the hypothesis h that maximizes P(h | D)
• Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost

1b. Discuss the methods for estimating predictive accuracy of classification method.
(10 marks) Jan 2014,(10 Marks) June/ July 2016


1c. What are the two approaches for extending binary classifiers to handle multi-class problems? (4 marks) Jan 2014, (10 Marks) June/ July 2014

Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs. The two standard approaches for extending binary classifiers to the multi-class setting are (i) one-vs-rest, which trains one binary classifier per class (that class against all the others) and predicts the class whose classifier gives the highest score, and (ii) one-vs-one, which trains one binary classifier for every pair of classes and combines their predictions by voting.

Some classification algorithms/models have also been adapted to the related multi-label task, without requiring problem transformations. Examples of these include:

• boosting: AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for multi-label data.
• k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.
• decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations. MMC, MMDT, and SSC (a refinement of MMDT) can classify multi-labeled data based on multi-valued attributes without transforming the attributes into single values. They are also named multi-valued and multi-labeled decision tree classification methods.
• kernel methods for vector output
• neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.
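For the multi-class (single-label) case itself, the two standard approaches mentioned above can be sketched with scikit-learn (illustrative; the dataset and base classifier are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# One-vs-rest: one binary classifier per class (that class against all the others)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: one binary classifier per pair of classes, combined by voting
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```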

2a. For the confusion matrix given below for three classes, find the sensitivity and specificity metrics to estimate the predictive accuracy of classification methods.
(10 Marks) June/ July 2014

Table: Confusion matrix for three classes
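The confusion matrix itself is not reproduced in the source, so the numbers below are purely hypothetical; the sketch only illustrates how per-class sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)) are read off a three-class confusion matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[40,  5,  5],
               [ 3, 50,  7],
               [ 2,  4, 44]])

for i in range(cm.shape[0]):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp   # actual class i predicted as something else
    fp = cm[:, i].sum() - tp   # other classes predicted as class i
    tn = cm.sum() - tp - fn - fp
    print(f"class {i}: sensitivity = {tp / (tp + fn):.3f}, specificity = {tn / (tn + fp):.3f}")
```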

3a. What are Bayesian classifiers? Explain Bayes' theorem for classification.
(10 Marks) Jan 2015


• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2, …, An):
  – The goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
  – Can we estimate P(C | A1, A2, …, An) directly from data?
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
  – Choose the value of C that maximizes P(C | A1, A2, …, An), which is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How to estimate P(A1, A2, …, An | C)?
  – Assume (naive) independence among the attributes Ai when the class is given:
    P(A1, A2, …, An | C) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – P(Ai | Cj) can be estimated from the data for all Ai and Cj
  – A new point is classified as Cj if P(Cj) Π P(Ai | Cj) is maximal
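A small sketch of this naive Bayes computation on a toy categorical dataset (the records and attribute values are made up purely for illustration; no smoothing is applied):

```python
from collections import Counter, defaultdict

# Toy training records: (A1, A2) attribute values with a class label
records = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rain", "mild"), "yes"), (("rain", "cool"), "yes"),
           (("overcast", "hot"), "yes"), (("sunny", "cool"), "yes")]

prior = Counter(c for _, c in records)   # counts for P(C)
cond = defaultdict(Counter)              # counts for P(Ai | C)
for attrs, c in records:
    for i, v in enumerate(attrs):
        cond[(i, c)][v] += 1

def classify(attrs):
    """Choose the class C maximizing P(C) * prod_i P(Ai | C)."""
    best, best_score = None, -1.0
    for c, class_count in prior.items():
        score = class_count / len(records)          # P(C)
        for i, v in enumerate(attrs):
            score *= cond[(i, c)][v] / class_count  # P(Ai | C), conditional independence assumed
        if score > best_score:
            best, best_score = c, score
    return best

print(classify(("sunny", "cool")))   # -> "yes" on this toy data
```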

3b. How rule-based classifiers are used for classification? Explain. (10 Marks) Jan 2015
Characteristics of Rule-Based Classifiers
• Mutually exclusive rules
  – A classifier contains mutually exclusive rules if the rules are independent of each other
  – Every record is covered by at most one rule
• Exhaustive rules
  – A classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  – Every record is covered by at least one rule
• Rules are rank-ordered according to their priority
  – An ordered rule set is known as a decision list


• When a test record is presented to the classifier:
  – It is assigned to the class label of the highest-ranked rule it triggers
  – If none of the rules fire, it is assigned to the default class
4a. List five criteria for evaluating classification methods. Discuss them briefly.
(05 Marks) June/ July 2015, (05 Marks) June/ July 2016
The five criteria for evaluating classification methods (accuracy, speed, robustness, scalability and interpretability) are discussed under question 3c above. The accuracy criterion is typically estimated using the following methods:
• Holdout: reserve 2/3 of the data for training and 1/3 for testing
• Random sub-sampling: repeated holdout
• Cross validation: partition the data into k disjoint subsets
  – k-fold: train on k-1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Stratified sampling: over-sampling vs. under-sampling
• Bootstrap: sampling with replacement

4b. What is predictive accuracy of classification methods? Explain the different methods of estimating the accuracy of a method. (07 Marks) June/ July 2015
• Holdout
– Reserve 2/3 for training and 1/3 for testing
• Random subsampling
– Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
• Stratified sampling
– oversampling vs undersampling
• Bootstrap
– Sampling with replacement


Accuracy

Accuracy simply measures how often the classifier makes the correct prediction. It is the ratio between the number of correct predictions and the total number of predictions (the number of test data points):

accuracy = (number of correct predictions) / (total number of test data points)


Confusion matrix
Accuracy looks easy enough. However, it makes no distinction between classes; correct answers
for class 0 and class 1 are treated equally. Sometimes this is not enough. You might want to look
at how many examples failed for class 0 vs. class 1. This would be the case if the cost of
misclassification is different, or if you have a lot more test data of one class than the other. For
instance, making the call that a patient has cancer when he doesn’t (known as a false positive)
has very different consequences than making the call that a patient doesn’t have cancer when he
does (a false negative). A confusion matrix (or confusion table) shows a more detailed
breakdown of correct and incorrect classifications for each class. The rows of the matrix
correspond to ground truth labels, and the columns represent the prediction.

4c. Consider the following training set for predicting the loan default problem: Find the
conditional independence for given training set using Bayes theorem for classification.
(08 Marks) June/ July 2015

• For continuous attributes, the naive Bayes classifier assumes a normal (Gaussian) distribution, with one distribution for each (Ai, cj) pair:

  P(Ai | cj) = 1/√(2π σij²) · exp( -(Ai - μij)² / (2 σij²) )

• For (Income, Class = No): sample mean μ = 110 and sample variance σ² = 2975 (so σ ≈ 54.54), hence

  P(Income = 120 | No) = 1/(√(2π) · 54.54) · exp( -(120 - 110)² / (2 · 2975) ) ≈ 0.0072
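The 0.0072 value can be checked directly from the normal density (a quick sketch, assuming mean 110 and variance 2975 as stated above):

```python
import math

def gaussian(x, mean, variance):
    """Class-conditional density P(Ai = x | cj) under the normal assumption."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

print(round(gaussian(120, 110, 2975), 4))   # ~0.0072
```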


5. Explain how bootstrapping, bagging, boosting improve accuracy of classification methods
(05 Marks) June/ July 2016
Bootstrap
– Sampling with replacement
Bagging
• Sampling with replacement
• Build a classifier on each bootstrap sample
• Each record has probability 1 - (1 - 1/n)^n (about 0.632 for large n) of being selected in a given bootstrap sample
Boosting
• An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records
  – Initially, all N records are assigned equal weights
  – Unlike bagging, the weights may change at the end of each boosting round


Unit -7

1a. List and explain four distance measures to compute the distance between a pair of
points and find out the distance between two objects represented by attribute.
(08 marks) Jan 2014

Similarity and Dissimilarity Between Objects

Minkowski distance (the Euclidean distance is the special case q = 2):
  d(i, j) = ( |xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q )^(1/q)
Properties:
  – d(i, j) ≥ 0
  – d(i, i) = 0
  – d(i, j) = d(j, i)
  – d(i, j) ≤ d(i, k) + d(k, j)
One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.

Manhattan distance:
  d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|

Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (non-invariant if the binary variable is asymmetric):
  d(i, j) = (b + c) / (a + b + c)
1b. Explain the cluster analysis methods briefly.
(08 marks) Jan 2014, (08 Marks) June/ July 2015
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
  – Partitioning method: construct a partition of a database D of n objects into a set of k clusters
  – Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: the k-means and k-medoids algorithms
  – k-means (MacQueen, 1967): each cluster is represented by the center (mean) of the cluster
  – k-medoids or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based methods: based on connectivity and density functions
• Grid-based methods: based on a multiple-level granularity structure
• Model-based methods: a model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model

1c. What are the features of cluster analysis
(04 marks) Jan 2014, (05 Marks) June/ July 2016
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis: grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications:
  – As a stand-alone tool to get insight into the data distribution
  – As a preprocessing step for other algorithms


2a. Explain K means clustering method and algorithm. (10 Marks) June/ July 2014, (10 Marks) Jan 2015, (12 Marks) June/ July 2015, (5 Marks) June/ July 2016

Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when no new assignments are made.

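A compact sketch of these four steps (pure NumPy, illustrative; the sample points and k are assumptions):

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial seed points (here: k distinct data points as centroids)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Step 3: assign each object to the cluster with the nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):   # Step 4: stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

pts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0], [8.8, 1.2]])
print(kmeans(pts, k=3))
```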
2b. What is Hierarchical clustering method? Explain the algorithms for computing distances between clusters. (10 Marks) June/ July 2014, (10 Marks) June/ July 2016

Hierarchical clustering uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition. The commonly used definitions of the distance between two clusters are: single link (MIN, the smallest distance between any pair of points, one from each cluster), complete link (MAX, the largest such distance), group average (the average of all pairwise distances), and the distance between cluster centroids.
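A short sketch using SciPy's agglomerative clustering, which implements the single (MIN), complete (MAX) and group-average linkage definitions mentioned above (the data points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.3, 4.8], [9.0, 1.0]])

for method in ("single", "complete", "average"):     # MIN, MAX, group-average linkage
    Z = linkage(points, method=method)               # merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```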


3. How density based methods are used for clustering? Explain with example.
(10 Marks) Jan 2015
DBSCAN works as follows:
• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.

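The same procedure is available as a library routine; a minimal sketch with scikit-learn's DBSCAN (the Eps and MinPts values and the points are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # dense group 1
                   [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],   # dense group 2
                   [9.0, 1.0]])                          # isolated point (noise)

db = DBSCAN(eps=0.5, min_samples=3).fit(points)          # eps = Eps, min_samples = MinPts
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; -1 marks points not density-reachable from any core point
```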

Unit -8

1. Write short notes on the following:


(20 Marks) Jan 2014, (20 Marks) June/ July 2014
a. web content mining
Web mining is the application of data mining techniques to discover patterns from the World Wide Web. Web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.
Web content mining, also known as text mining, is generally the second step in Web data
mining. Content mining is the scanning and mining of text, pictures and graphs of a Web page to
determine the relevance of the content to the search query. This scanning is completed after the
clustering of web pages through structure mining and provides the results based upon the level of
relevance to the suggested query. With the massive amount of information that is available on
the World Wide Web, content mining provides the results lists to search engines in order of
highest relevance to the keywords in the query.
Text mining is directed toward the specific information requested in a customer's search query. It allows the entire Web to be scanned to retrieve the relevant cluster content, which in turn triggers the scanning of specific Web pages within those clusters. The resulting pages are relayed to the search engine, ranked from the highest level of relevance to the lowest. Although search engines can return thousands of links related to the search content, this type of web mining reduces the amount of irrelevant information.
Web text mining is very effective when used with a content database dealing with specific topics. For example, online universities use a library system to recall articles related to their general areas of study. Such a specific content database makes it possible to pull only the information within those subjects, providing the most specific results for search queries. Supplying only the most relevant information in this way gives a higher quality of results, and this increase in productivity is due directly to content mining of text and visuals.
The main uses of this type of data mining are to gather, categorize, organize and provide the best possible information available on the WWW to the user requesting it. This tool is essential for scanning the many HTML documents, images and text provided on Web pages. The resulting information is provided to the search engines in order of relevance, giving


more productive results of each search.


Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through thousands of results to find the information most relevant to the query. Text mining reduces those thousands of results, which eliminates frustration and improves the navigation of information on the Web.
Businesses use content mining to structure the information provided on their sites into a relevance-ordered site map. This allows a customer of the Web site to access specific information without having to search the entire site. With this type of mining, data remains available in order of relevance to the query, thus providing productive marketing. Used as a marketing tool, it brings additional traffic to the Web pages of a company's site based on the amount of keyword relevance the pages offer to general searches.
As the second step of Web data mining, text mining improves the productive uses of mining for businesses, Web designers and search engine operations. Organization, categorization and gathering of the information provided by the WWW become easier and produce more productive results through the use of this type of mining.
In short, Web content mining allows search engine results to maximize the flow of customer clicks to a Web site, or to particular Web pages of the site, so that they are accessed numerous times in relation to relevant search queries. The clustering and organization of Web content in a content database enables effective navigation of the pages by the customer and by search engines. Images, content, formats and Web structure are examined to produce higher-quality information for the user based upon the requests made. Businesses can make the most of this text mining to improve the marketing of their sites as well as the products they offer.

b. Text mining (5 Marks) Jan 2015, (05 Marks) June/ July 2016

Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to
the process of deriving high-quality information from text. High-quality information is typically
derived through the devising of patterns and trends through means such as statistical pattern
learning. Text mining usually involves the process of structuring the input text (usually parsing,


along with the addition of some derived linguistic features and the removal of others, and
subsequent insertion into a database), deriving patterns within the structured data, and finally
evaluation and interpretation of the output. 'High quality' in text mining usually refers to some
combination of relevance, novelty, and interestingness. Typical text mining tasks include text
categorization, text clustering, concept/entity extraction, production of granular taxonomies,
sentiment analysis, document summarization, and entity relation modeling (i.e., learning
relations between named entities).

Text analysis involves information retrieval, lexical analysis to study word frequency
distributions, pattern recognition, tagging/annotation, information extraction, data mining
techniques including link and association analysis, visualization, and predictive analytics. The
overarching goal is, essentially, to turn text into data for analysis, via application of natural
language processing (NLP) and analytical methods.

c. Spatial databases mining (05 Marks) June/ July 2016


The main difference between data mining in relational DBS and in spatial DBS is that
attributes of the neighbors of some object of interest may have an influence on the object and
therefore have to be considered as well. The explicit location and extension of spatial objects
define implicit relations of spatial neighborhood (such as topological, distance and direction
relations) which are used by spatial data mining algorithms. Therefore, new techniques are
required for effective and efficient data mining.
Database Primitives for Spatial Data Mining
We have developed a set of database primitives for mining in spatial databases which are
sufficient to express most of the algorithms for spatial data mining and which can be efficiently
supported by a DBMS. We believe that the use of these database primitives will enable the
integration of spatial data mining with existing DBMS’s and will speed-up the development of
new spatial data mining algorithms. The database primitives are based on the concepts of
neighborhood graphs and neighborhood paths.
Efficient DBMS Support
Effective filters make it possible to restrict the search to those neighborhood paths "leading away" from a starting object. Neighborhood indices materialize certain neighborhood graphs to support efficient processing of the database primitives by a DBMS. The database primitives have been


implemented on top of the DBMS Illustra and are being ported to Informix Universal Server.
Algorithms for Spatial Data Mining
New algorithms for spatial characterization and spatial trend analysis were developed.
For spatial characterization it is important that class membership of a database object is not only
determined by its non-spatial attributes but also by the attributes of objects in its neighborhood.
In spatial trend analysis, patterns of change of some non-spatial attributes in the neighborhood of
a database object are determined.
Applications
Spatial Trend Detection in GIS
Spatial trends describe a regular change of non-spatial attributes when moving away from
certain start objects. Global and local trends can be distinguished. To detect and explain such
spatial trends, e.g. with respect to the economic power, is an important issue in economic
geography.
Spatial Characterization of Interesting Regions
Another important task of economic geography is to characterize certain target regions
such as areas with a high percentage of retirees. Spatial characterization does not only consider
the attributes of the target regions but also neighboring regions and their properties.

d. mining temporal databases (5 Marks) Jan 2015

Temporal Data Mining (TDM) deals with the problem of mining patterns from temporal data, which can be either symbolic sequences or numerical time series. It has the capability to look for interesting correlations or rules in large sets of temporal data, which might be overlooked when the temporal component is ignored or treated as a simple numeric attribute. Currently TDM is a fast-expanding field, with many research results reported and many new temporal data mining analysis methods or prototypes developed recently. There are two factors that contribute to the popularity of temporal data mining. The first factor is an increase in the volume of temporal data stored, as many real-world applications deal with huge amounts of temporal data. The second factor is the mounting recognition of the value of temporal data.
In many application domains, temporal data are now being viewed as invaluable assets from
which hidden knowledge can be derived, so as to help understand the past and/or plan for the


future. TDM covers a wide spectrum of paradigms for knowledge modeling and discovery. Since temporal data mining is a relatively new field of research, there is no widely accepted taxonomy.

2. Write short notes on:

a. Web content mining (5 Marks) Jan 2015

The web content mining approach of using the traditional search engine has migrated into
intelligent agent-based mining and database-driven mining, where intelligent software agents for
specific tasks support the search for more relevant web contents by taking domain characteristics
and user profiles into consideration more intelligently. They also help users interpret the
discovered web contents.
Web content mining is related but different from data mining and text mining. It is related to data
mining because many data mining techniques can be applied in Web content mining. It is related
to text mining because much of the web contents are texts. However, it is also quite different
from data mining because Web data are mainly semi-structured and/or unstructured, while data
mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structured nature of the Web, while text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques and also its own unique approaches. In the past few years, there has been a rapid expansion of activities in the Web content mining area. This is not surprising given the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and the lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. We will examine the following important Web content mining problems and discuss existing techniques for solving them. Some other emerging problems will also be surveyed.

• Data/information extraction: Our focus will be on extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are covered.
• Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications. Some existing techniques and problems are examined.
• Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. We will introduce a few tasks and techniques to mine such sources.
• Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that exploit the information redundancy of the Web will be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.
• Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links and copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem. A number of interesting techniques have been proposed in the past few years.
b. Unstructured text
The phrase unstructured data usually refers to information that doesn't reside in a traditional row-column database. As you might expect, it's the opposite of structured data, i.e. the data stored in fields in a database.
Examples of Unstructured Data
Unstructured data files often include text and multimedia content. Examples include e-mail
messages, word processing documents, videos, photos, audio files, presentations, webpages and
many other kinds of business documents. Note that while these sorts of files may have an


internal structure, they are still considered "unstructured" because the data they contain doesn't
fit neatly in a database.
Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the
amount of unstructured data in enterprises is growing significantly — often many times faster
than structured databases are growing.
Mining Unstructured Data
Many organizations believe that their unstructured data stores include information that could
help them make better business decisions. Unfortunately, it's often very difficult to analyze
unstructured data. To help with the problem, organizations have turned to a number of different
software solutions designed to search unstructured data and extract important information. The
primary benefit of these tools is the ability to glean actionable information that can help a
business succeed in a competitive environment.
Because the volume of unstructured data is growing so rapidly, many enterprises also turn to
technological solutions to help them better manage and store their unstructured data. These can
include hardware or software solutions that enable them to make the most efficient use of their
available storage space.
Unstructured Data and Big Data
As mentioned above, unstructured data is the opposite of structured data. Structured data
generally resides in a relational database, and as a result, it is sometimes called relational data.
This type of data can be easily mapped into pre-designed fields. For example, a database
designer may set up fields for phone numbers, zip codes and credit card numbers that accept a
certain number of digits. Structured data has been or can be placed in fields like these. By
contrast, unstructured data is not relational and doesn't fit into these sorts of pre-defined data
models.
Semi-Structured Data
In addition to structured and unstructured data, there's also a third category: semi-structured data.
Semi-structured data is information that doesn't reside in a relational database but that does have
some organizational properties that make it easier to analyze. Examples of semi-structured data
might include XML documents and NoSQL databases.
The term big data is closely associated with unstructured data. Big data refers to extremely large
datasets that are difficult to analyze with traditional tools. Big data can include both structured


and unstructured data, but IDC estimates that 90 percent of big data is unstructured data. Many
of the tools designed to analyze big data can handle unstructured data.
Unstructured Data Management
Organizations use a variety of different software tools to help them organize and manage
unstructured data. These can include the following:
Big data tools
Software like Hadoop can process stores of both unstructured and structured data that are
extremely large, very complex and changing rapidly.
Business intelligence software
Also known as BI, business intelligence is a broad category of analytics, data mining, dashboards
and reporting tools that help companies make sense of their structured and unstructured data for
the purpose of making better business decisions.
Data integration tools
These tools combine data from disparate sources so that they can be viewed or analyzed from a
single application. They sometimes include the capability to unify structured and unstructured
data.
Document management systems
Also called enterprise content management systems, a DMS can track, store and share
unstructured data that is saved in the form of document files.
Information management solutions
This type of software tracks structured and unstructured enterprise data throughout its lifecycle.
Search and indexing tools
These tools retrieve information from unstructured data files such as documents, Web pages and
photos.
Unstructured Data Technology
A group called the Organization for the Advancement of Structured Information Standards
(OASIS) has published the Unstructured Information Management Architecture (UIMA)
standard. The UIMA "defines platform-independent data representations and interfaces for
software components or services called analytics, which analyze unstructured information and
assign semantics to regions of that unstructured information."


Many industry watchers say that Hadoop has become the de facto industry standard for
managing Big Data. This open source project is managed by the Apache Software Foundation.

c. Text clustering (5 Marks) Jan 2015


• Partition unlabeled examples into disjoint subsets of clusters, such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different
• Discover new categories in an unsupervised manner (no sample category labels
provided).
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled
examples.
• Recursive application of a standard clustering algorithm can produce a hierarchical
clustering.

Agglomerative vs. Divisive Clustering

• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.
Direct Clustering Method
• Direct clustering methods require a specification of the number of clusters, k, desired.
• A clustering evaluation function assigns a real-value quality measure to a clustering.
• The number of clusters can be determined automatically by explicitly generating
clusterings for multiple values of k and choosing the best result according to a clustering
evaluation function.

d. Temporal data mining tasks. (20 Marks) June/ July 2015

Data mining has been used in a wide range of applications. However, the possible objectives of data mining, which are often called the tasks of data mining, can be classified into some broad categories: prediction, classification, clustering, search and retrieval, and pattern discovery. This


categorization follows the usual categorization of data mining tasks, extended to temporal data mining.
Prediction: Prediction is the task of explicitly modelling variable dependencies to predict a subset.
Classification: Classification is the task of assigning class labels to the data according to a model learned from training data in which the classes are known. Classification is one of the most common tasks in supervised learning, but it has not received much attention in temporal data mining. In sequence classification, each sequence presented to the system is assumed to belong to one of several predefined classes and the goal is to automatically determine the corresponding category for a given input sequence.
Clustering: Clustering is the process of finding intrinsic groups, called clusters, in the data. Clustering of time series is concerned with grouping a collection of time series (or sequences) based on their similarity. Time series clustering has been shown to be effective in providing useful information in various domains. Clustering of sequences is relatively less explored but is becoming increasingly important in data mining applications such as web usage mining and bioinformatics.
Searching and retrieval: Searching and retrieval are concerned with efficiently locating subsequences or sub-series in large databases of sequences or time series. In data mining, query-based searches are more concerned with the problem of efficiently locating approximate matches rather than exact matches, known as content-based retrieval.

3. Explain the concept of finding similar web pages and finger printing in detail
(10 Marks) June/ July 2016

finding similar web pages


Google search algorithm" redirects here. For other search algorithms used by Google, see Google
Penguin, Google Panda, and Google Hummingbird. Mathematical PageRanks for a simple
network, expressed as percentages. (Google uses a logarithmic scale.) Page C has a higher
PageRank than Page E, even though there are fewer links to C; the one link to C comes from an
important page and hence is of high value. If web surfers who start on a random page have an
85% likelihood of choosing a random link from the page they are currently visiting, and a 15%


likelihood of jumping to a page chosen at random from the entire web, they will reach Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. In the presence of damping, Page A effectively links to all pages in the web, even though it has no outgoing links of its own.
PageRank is an algorithm used by Google Search to rank websites in their search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of
measuring the importance of website pages.
According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough
estimate of how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first
algorithm that was used by the company, and it is the best-known.
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a
person randomly clicking on links will arrive at any particular page. PageRank can be calculated
for collections of documents of any size. It is assumed in several research papers that the
distribution is evenly divided among all documents in the collection at the beginning of the
computational process. The PageRank computations require several passes, called "iterations",


through the collection to adjust approximate PageRank values to more closely reflect the
theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.
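A small sketch of the iterative PageRank computation described above (power iteration with a damping factor of 0.85 on a made-up four-page link graph; illustrative only):

```python
import numpy as np

# links[i] = pages that page i links to (a made-up miniature web)
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = len(links), 0.85                     # d = damping factor

rank = np.full(n, 1.0 / n)                  # start from an even distribution
for _ in range(50):                         # the "iterations" (passes) over the collection
    new_rank = np.full(n, (1 - d) / n)      # chance of jumping to a random page
    for page, outgoing in links.items():
        for target in outgoing:
            new_rank[target] += d * rank[page] / len(outgoing)
    rank = new_rank

print(rank, rank.sum())                     # the ranks form a probability distribution (sum ~ 1)
```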

finger printing

In short, fingerprinting is the capability of a site to identify or re-identify a visiting user, user
agent or device via configuration settings or other observable characteristics.
Fingerprinting can be used as a security measure (e.g. as a means of authenticating the user).
However, fingerprinting is also a potential threat to users' privacy on the Web. This document
does not attempt to provide a single unifying definition of "privacy" or "personal data", but we
highlight how browser fingerprinting might impact users' privacy. For example, browser
fingerprinting can be used to:
 identify a user
 correlate a user’s browsing activity within and across sessions
 track users without transparency or control
The privacy implications associated with each use case are discussed below. Following from the
practice of security threat model analysis, we note that there are distinct models of privacy
threats for fingerprinting. Defenses against these threats differ, depending on the particular
privacy implication and the threat model of the user.
Passive fingerprinting is browser fingerprinting based on characteristics observable in the
contents of Web requests, without the use of any code executing on the client side.
For active fingerprinting, we also consider techniques where a site runs JavaScript or other code
on the local client to observe additional characteristics about the browser. Techniques for active
fingerprinting might include accessing the window size, enumerating fonts or plug-ins,
evaluating performance characteristics, or rendering graphical patterns. Key to this distinction is
that active fingerprinting takes place in a way that is potentially detectable on the client.
