

PESIT Bangalore South Campus


Hosur road, 1km before Electronic City, Bengaluru -100
Department of Computer Science and Engineering

INTERNAL ASSESSMENT TEST – 3


Date : 15/05/18 Max Marks: 40
Subject & Code: Data Mining and Data Warehousing (15CS651) Section: 6th Sem A, B, C
Name of Faculty: Dr. Sandesh B J & Prof. Pooja Agarwal Time: 11:30 AM-1:00 PM
Note: Answer FIVE full questions.

1 What are rule-based classifiers? How is a rule-based classifier built? 2+3+3
Explain the sequential covering algorithm.
• Rule: (Condition) → y
o where
▪ Condition is a conjunction of attribute tests
▪ y is the class label
o LHS: rule antecedent or condition
o RHS: rule consequent
o Examples of classification rules:
▪ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
▪ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
• A rule r covers an instance x if the attributes of the instance satisfy the condition of the
rule
➢ R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
➢ R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
➢ R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
➢ R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
➢ R5: (Live in Water = sometimes) → Amphibians
• Coverage of a rule:
o Fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
o Fraction of the records covered by the rule (i.e., satisfying the antecedent) whose class label matches the consequent
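The two measures above can be checked on a toy example. This is not part of the printed answer; it is a minimal Python sketch, with made-up records and an illustrative rule modelled on R1 from the answer.

```python
# Illustrative (not from the question paper): coverage and accuracy of a rule.
records = [
    {"give_birth": "no", "can_fly": "yes", "label": "Birds"},
    {"give_birth": "no", "can_fly": "yes", "label": "Birds"},
    {"give_birth": "yes", "can_fly": "no", "label": "Mammals"},
    {"give_birth": "no", "can_fly": "no", "label": "Reptiles"},
]

def covers(condition, record):
    """A rule covers a record if every attribute test in the condition holds."""
    return all(record.get(attr) == value for attr, value in condition.items())

def coverage_and_accuracy(condition, consequent, records):
    covered = [r for r in records if covers(condition, r)]
    coverage = len(covered) / len(records)        # fraction satisfying antecedent
    accuracy = (sum(r["label"] == consequent for r in covered) / len(covered)
                if covered else 0.0)              # of those, fraction matching consequent
    return coverage, accuracy

# (Give Birth = no) AND (Can Fly = yes) -> Birds
cov, acc = coverage_and_accuracy({"give_birth": "no", "can_fly": "yes"},
                                 "Birds", records)
# cov == 0.5 (2 of 4 records covered), acc == 1.0 (both covered records are Birds)
```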

OR
2 What are Bayesian Classifiers? Explain Bayes theorem for classification. 4+4
Bayesian Classifier –
• An approach for modelling probabilistic relationships between the attribute set and the class
variable.
• Let X and Y be a pair of random variables. Their joint probability P(X=x, Y=y) refers to
the probability that variable X takes on the value x and variable Y takes on the value y.
• A conditional probability is the probability that a random variable takes on a
particular value given that the outcome of another random variable is known.
• i.e., P(Y=y | X=x) refers to the probability that the variable Y takes the value y, given
that variable X is observed to have the value x.
• P(X,Y) = P(Y|X) * P(X) = P(X|Y) * P(Y)
Bayes Theorem - Let X be the attribute set and Y be the class variable.
If Y has a non-deterministic relationship with the attributes, then X and Y can be treated as random
variables and their relationship captured probabilistically as P(Y|X).
P(Y|X) is also known as the posterior probability and P(Y) as the prior probability.

BE(CSE), VI Semester
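A small numeric sketch of Bayes' theorem for classification (not part of the printed answer; the priors and class-conditional probabilities below are made up for illustration):

```python
# Illustrative numbers only: two classes y1, y2 with assumed priors and
# likelihoods of an observed attribute value x.
def posterior(prior_y, likelihood_x_given_y, evidence_x):
    """Bayes' theorem: P(Y=y | X=x) = P(X=x | Y=y) * P(Y=y) / P(X=x)."""
    return likelihood_x_given_y * prior_y / evidence_x

p_y1, p_y2 = 0.3, 0.7                      # prior probabilities P(Y)
p_x_given_y1, p_x_given_y2 = 0.8, 0.1      # class-conditional P(X=x | Y)

# Law of total probability: P(X=x) = sum over classes of P(X=x|Y) * P(Y)
p_x = p_x_given_y1 * p_y1 + p_x_given_y2 * p_y2   # 0.24 + 0.07 = 0.31

post_y1 = posterior(p_y1, p_x_given_y1, p_x)      # posterior P(Y=y1 | X=x)
post_y2 = posterior(p_y2, p_x_given_y2, p_x)      # posterior P(Y=y2 | X=x)
# The posteriors sum to 1; the classifier predicts the class with the
# larger posterior (y1 here, since 0.24/0.31 > 0.07/0.31).
```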

3 Write and explain the K-Nearest Neighbor classification algorithm. 4


a) K-Nearest Neighbor classifier -
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute its distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown
record (e.g., by taking a majority vote)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
• Compute the distance between two points:
– Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
– Determine the class from nearest neighbor list


– take the majority vote of class labels among the k-nearest neighbors
– Weigh the vote according to distance
◆ e.g., weight factor w = 1/d²
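The steps above can be sketched in a few lines of Python. This is not the prescribed answer, only a minimal unweighted-vote sketch; the training points, query, and k are made up.

```python
# Illustrative k-NN: Euclidean distance + majority vote over the k nearest.
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_classify(query, training, k):
    """training: list of (point, label) pairs; returns the majority label."""
    # step 1-2: sort training records by distance to the query, keep k nearest
    neighbors = sorted(training, key=lambda pl: dist(query, pl[0]))[:k]
    # step 3: majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((0.0, 0.0), "-"), ((0.1, 0.2), "-"), ((1.0, 1.0), "+"),
            ((0.9, 1.1), "+"), ((1.2, 0.8), "+")]
label = knn_classify((1.0, 0.9), training, k=3)   # the 3 nearest are all "+"
```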

b) Consider the one-dimensional data set shown below 4


x 0.5 3.0 4.5 4.6 4.9 5.2 5.3 5.5 7.0 9.5
y - - + + + - - + - -
Classify the data point x=5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors.

X Y Distance Rank
0.5 - 4.5 7
3.0 - 2.0 6
4.5 + 0.5 5
4.6 + 0.4 4
4.9 + 0.1 1
5.2 - 0.2 2
5.3 - 0.3 3
5.5 + 0.5 5
7.0 - 2.0 6
9.5 - 4.5 7

1-NN: +
3-NN: -
5-NN: +
9-NN: -
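The worked table can be re-checked in code; a short Python sketch (not part of the printed answer) that recomputes the 1-, 3-, 5-, and 9-NN votes at x = 5.0:

```python
# Re-check of the 1-D k-NN example: distances are |xi - 5.0|, majority vote.
from collections import Counter

points = [(0.5, '-'), (3.0, '-'), (4.5, '+'), (4.6, '+'), (4.9, '+'),
          (5.2, '-'), (5.3, '-'), (5.5, '+'), (7.0, '-'), (9.5, '-')]

def vote(x, k):
    # sort by distance to x; for the ties here, either order gives the same vote
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

results = {k: vote(5.0, k) for k in (1, 3, 5, 9)}
# results == {1: '+', 3: '-', 5: '+', 9: '-'}, matching the table.
```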
OR

4 What is Hierarchical clustering? Use the proximity matrix in the table to perform single-link 2+6
hierarchical clustering. Show the results by drawing the clusters and a dendrogram.

P1 P2 P3 P4 P5
P1 0.00 0.90 0.59 0.45 0.65
P2 0.90 0.00 0.36 0.53 0.02
P3 0.59 0.36 0.00 0.56 0.15
P4 0.45 0.53 0.56 0.00 0.24
P5 0.65 0.02 0.15 0.24 0.00
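A hedged Python sketch (not the printed answer key) of single-link agglomeration on the matrix above, read here as pairwise distances since the diagonal is 0.0 (the smallest entry merges first):

```python
# Single-link hierarchical clustering over the 5x5 proximity matrix,
# treated as distances. merges records the dendrogram heights in order.
D = {('P1','P2'): 0.90, ('P1','P3'): 0.59, ('P1','P4'): 0.45, ('P1','P5'): 0.65,
     ('P2','P3'): 0.36, ('P2','P4'): 0.53, ('P2','P5'): 0.02,
     ('P3','P4'): 0.56, ('P3','P5'): 0.15, ('P4','P5'): 0.24}

def d(a, b):
    return D[(a, b)] if (a, b) in D else D[(b, a)]

def single_link(points):
    clusters = [frozenset([p]) for p in points]
    merges = []  # (merged cluster, merge height) pairs for the dendrogram
    while len(clusters) > 1:
        # single link: inter-cluster distance = min over all cross pairs
        i, j, h = min(((i, j, min(d(a, b) for a in clusters[i]
                                  for b in clusters[j]))
                       for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
                      key=lambda t: t[2])
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((merged, h))
    return merges

merges = single_link(['P1', 'P2', 'P3', 'P4', 'P5'])
heights = [h for _, h in merges]
# Merge order: {P2,P5} at 0.02, then P3 at 0.15, then P4 at 0.24, then P1 at 0.45.
```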

5 Consider a training set that contains 100 +ve examples and 400 -ve examples for each of the 4+4
following candidate rules. Determine which is the best and worst candidate according to
i) Rule accuracy ii) FOIL's information gain.
R1: A → + (covers 4 +ve and 1 -ve examples)
R2: B → + (covers 30 +ve and 10 -ve examples)
R3: C → + (covers 100 +ve and 90 -ve examples).
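The question paper gives no worked solution here; the following Python sketch computes both measures under the standard definitions (FOIL gain taken relative to the empty rule covering all p0 = 100 positives and n0 = 400 negatives):

```python
# Rule accuracy and FOIL's information gain for the three candidate rules.
from math import log2

p0, n0 = 100, 400                     # positives/negatives before the condition
rules = {'R1': (4, 1), 'R2': (30, 10), 'R3': (100, 90)}   # (p1, n1) covered

def accuracy(p, n):
    return p / (p + n)

def foil_gain(p1, n1):
    # FOIL gain = p1 * ( log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)) )
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

acc = {r: accuracy(p, n) for r, (p, n) in rules.items()}
gain = {r: foil_gain(p, n) for r, (p, n) in rules.items()}
# By accuracy:  best R1 (0.80), worst R3 (~0.526).
# By FOIL gain: best R3 (~139.6), worst R1 (8.0).
```

Note how the two criteria disagree: accuracy favours the very precise but narrow R1, while FOIL gain rewards R3 for covering many more positives.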

OR

6 What are the characteristics of Rule Based classifiers? Give difference between rule-based 4+4
ordering and class based ordering scheme.

7 What is Cluster analysis? Write the K- means clustering algorithm. List and explain different 2+3+3
types of clustering.

● A clustering is a set of clusters
● Important distinction between hierarchical and partitional sets of clusters
● Partitional Clustering
– A division of the data objects into non-overlapping subsets (clusters) such that each data
object is in exactly one subset
● Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
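Question 7 also asks for the k-means algorithm; a hedged Python sketch of Lloyd's iteration (not the printed answer, with made-up 1-D data and hand-picked initial centroids):

```python
# Minimal k-means (Lloyd's algorithm) on 1-D points.
def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for x in points:
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[i].append(x)
        # update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.4, 8.6], [0.0, 10.0])
# Converges to centroids near 1.0 and 9.0, one per dense group.
```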

OR
8 Explain density-based methods for clustering with an example of DBSCAN. 8

● Density-based
– A cluster is a dense region of points that is separated from other regions of high
density by regions of low density.
– Used when the clusters are irregular or intertwined, and when noise and outliers
are present.
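A hedged, minimal DBSCAN sketch in Python (not the printed answer; data, eps, and min_pts are illustrative, and the 1-D distance keeps it short): core points have at least min_pts neighbors within eps, clusters grow outward from core points, and everything unreachable is noise.

```python
# Minimal DBSCAN on 1-D points. Labels: cluster ids 0,1,... and -1 for noise.
def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1             # noise for now (may become a border point)
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        seeds = list(neighbors)
        while seeds:                   # grow the cluster from its core points
            j = seeds.pop()
            if labels[j] is None:
                labels[j] = cluster
                jn = region_query(points, j, eps)
                if len(jn) >= min_pts:     # j is also a core point: keep expanding
                    seeds.extend(jn)
            elif labels[j] == -1:
                labels[j] = cluster        # former noise becomes a border point
        cluster += 1
    return labels

# Two dense 1-D regions and one isolated outlier.
labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0], eps=0.3, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]: two clusters, the outlier is noise.
```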

9 Explain different methods for computing distances between clusters. 8
● ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
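The standard inter-cluster distance definitions (single link / MIN, complete link / MAX, group average, centroid) can be illustrated in a few lines. This is a hedged sketch with made-up 1-D clusters, not the printed answer:

```python
# Four common ways to measure the distance between clusters A and B (1-D).
def single_link(A, B):
    # MIN: distance between the closest pair of points, one from each cluster
    return min(abs(a - b) for a in A for b in B)

def complete_link(A, B):
    # MAX: distance between the farthest pair of points
    return max(abs(a - b) for a in A for b in B)

def group_average(A, B):
    # average of all pairwise cross-cluster distances
    return sum(abs(a - b) for a in A for b in B) / (len(A) * len(B))

def centroid_dist(A, B):
    # distance between the cluster centroids (means)
    return abs(sum(A) / len(A) - sum(B) / len(B))

A, B = [1.0, 2.0], [4.0, 6.0]
# single: |2-4| = 2.0; complete: |1-6| = 5.0; average: (3+5+2+4)/4 = 3.5;
# centroid: |1.5 - 5.0| = 3.5
```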

OR
10 List and explain different types of evaluation measures that are used to judge various aspects of 4+4
cluster validity. Give equations for cohesion and Separation.

● Evaluation measures that are applied to judge various aspects of cluster validity are
classified into the following three types.
– Unsupervised (Internal Index): Used to measure the goodness of a clustering
structure without respect to external information.
◆ Sum of Squared Error (SSE)
◆ Cluster cohesion
◆ Cluster separation
– Supervised (External Index): Used to measure the extent to which cluster labels
match externally supplied class labels.
◆ Entropy - that measures how well cluster labels match externally supplied
class labels.
– Relative Index: Used to compare two different clusterings.
◆ Often an external or internal index is used for this function, e.g., SSE or
entropy

● A proximity graph based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and
nodes outside the cluster.
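For the prototype-based view, cohesion is the within-cluster sum of squares (WSS, i.e., SSE) and separation is the between-cluster sum of squares (BSS), and they add up to the total sum of squares. A hedged numeric sketch (illustrative 1-D data, not the printed answer):

```python
# Cohesion (WSS/SSE) and separation (BSS) for a prototype-based clustering.
def mean(xs):
    return sum(xs) / len(xs)

def wss(clusters):
    # cohesion: sum over clusters of squared distances to the cluster mean
    return sum((x - mean(c)) ** 2 for c in clusters for x in c)

def bss(clusters):
    # separation: size-weighted squared distances of cluster means to the
    # overall mean
    m = mean([x for c in clusters for x in c])
    return sum(len(c) * (mean(c) - m) ** 2 for c in clusters)

clusters = [[1.0, 2.0], [4.0, 5.0]]
total = sum((x - 3.0) ** 2 for c in clusters for x in c)  # overall mean is 3.0
# wss == 1.0, bss == 9.0, and wss + bss == total == 10.0
```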

