
Machine Learning (Part 2)
IYKRA DATA FELLOWSHIP BATCH 3
Outline
• Introduction to Clustering

• Clustering Method
• K-Means
• Hierarchical

• Model Evaluation
• Cross Validation
• Model Performance and
Selection
Clustering is the process of dividing the entire
data into groups (also known as clusters)
based on the patterns in the data.

Such problems, without any fixed target variable, are known as unsupervised learning problems. In these problems, we only have the independent variables and no target/dependent variable.
K-Means Clustering
The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids.
Working of K-Means Algorithm
• First, we need to specify the number of clusters, K, to be generated by the algorithm (a good K value can be determined with the elbow method).
• Next, randomly select K data points as initial centroids and assign each data point to its nearest centroid, forming K clusters.
• Now compute each cluster's centroid as the mean of its assigned points.
• Keep iterating the assignment and centroid-update steps until the centroids are optimal, i.e. the assignment of data points to clusters no longer changes.
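The steps above, including the elbow method for choosing K, can be sketched with scikit-learn. The synthetic data and all parameter values below are illustrative, not from the slides:

```python
# Sketch of K-Means with scikit-learn on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 natural groups (2 features by default)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K must be specified up front; n_init restarts guard against bad initial centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_.shape)  # (3, 2): one centroid per cluster

# Elbow method: inertia_ (within-cluster sum of squares) drops sharply
# until the "right" K, then flattens out
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 7)]
```

Plotting `inertias` against K and looking for the bend ("elbow") gives a reasonable K.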
Advantages and Disadvantages of K-Means

ADVANTAGES
• It is very easy to understand and implement.
• If we have a large number of variables, K-means is faster than hierarchical clustering.
• On re-computation of centroids, an instance can change its cluster.
• Tighter clusters are formed with K-means than with hierarchical clustering.

DISADVANTAGES
• It is a bit difficult to predict the number of clusters, i.e. the value of K.
• The output is strongly impacted by initial inputs such as the number of clusters (value of K).
• The order of the data has a strong impact on the final output.
• It is very sensitive to rescaling: if we rescale our data by normalization or standardization, the output will completely change.
• It does not cluster well when the clusters have a complicated geometric shape.
Applications of K-Means
Market segmentation

Document Clustering

Image segmentation

Image compression

Customer segmentation

Analyzing trends in dynamic data


Hierarchical
Clustering
Hierarchical cluster analysis (HCA) is an unsupervised clustering algorithm that creates clusters with a predominant ordering from top to bottom.
Two Types of HCA
AGGLOMERATIVE HIERARCHICAL DIVISIVE HIERARCHICAL CLUSTERING
CLUSTERING (BOTTOM-UP) (TOP-DOWN)
Some of The Common Linkage
Methods
▪ Complete-linkage: the distance between two clusters
is defined as the longest distance between two points
in each cluster.
▪ Single-linkage: the distance between two clusters is
defined as the shortest distance between two points in
each cluster.
▪ Average-linkage: the distance between two clusters is
defined as the average distance between each point in
one cluster to every point in the other cluster.
▪ Centroid-linkage: finds the centroid of cluster 1 and
centroid of cluster 2, and then calculates the distance
between the two before merging.
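These linkage methods can be compared with SciPy's hierarchical clustering tools. The data points and the number of clusters to cut into are made up for illustration:

```python
# Agglomerative clustering with the four linkage methods above (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 0.0]])

for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)    # merge history (the dendrogram data)
    # Cut the tree into 3 flat clusters
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the top-to-bottom ordering the slide describes.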
Advantages and Disadvantages
of HCA
ADVANTAGES
• Not having to pre-define the number of clusters gives HCA quite an edge over K-Means.

DISADVANTAGES
• It doesn't work well when we have a huge amount of data.
Cross
Validation
Cross-validation is a
technique in which we train
our model using the subset
of the data-set and then
evaluate using the
complementary subset of the
data-set.
The three steps involved in cross-validation are as
follows:
1. Reserve some portion of the sample data-set.
2. Train the model using the rest of the data-set.
3. Test the model using the reserved portion of the data-set.
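The three steps above can be sketched with scikit-learn's `train_test_split`. The dataset, model, and split ratio are illustrative choices, not from the slides:

```python
# Hold-out validation: reserve a portion, train on the rest, test on the reserve.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 1. Reserve a portion (here 30%) of the data-set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2. Train the model on the remaining data
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Test the model on the reserved portion
score = model.score(X_test, y_test)
```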
Methods of Cross Validation
o Validation
In this method, we perform training on 50% of the given data-set and the remaining 50% is used for testing.

o LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole data-set but leave out a single data-point, which is used for testing, and iterate this for each data-point.

o K-Fold Cross Validation
In this method, we split the data-set into k subsets (known as folds), then perform training on k-1 of the subsets and leave one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
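K-Fold cross-validation as described above is a one-liner in scikit-learn; the dataset, model, and k = 5 below are illustrative:

```python
# 5-fold cross-validation: each iteration trains on 4 folds, tests on the 5th.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())  # average accuracy over the k folds
```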
Model
Performance
and Selection
Measuring model performance is an important task in machine learning: it tells us how well a model works.
Evaluation Metrics for Regression

We usually use Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to evaluate how accurate our model is and how varied its errors are.

If the value of R² gets closer to 1, the model is getting better.
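These three regression metrics can be computed with scikit-learn; the toy predictions below are made up for illustration:

```python
# MAE, RMSE and R^2 on toy regression predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)           # average |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # closer to 1 is better
```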
Confusion Matrix, Precision, Recall,
and Accuracy (Classification Problem)
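These classification metrics can be computed with scikit-learn; the true and predicted labels below are made up for illustration:

```python
# Confusion matrix, precision, recall and accuracy on a toy binary problem.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)        # rows: actual, cols: predicted
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
```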
AUC-ROC Curve (Classification
Problem)
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes.
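AUC is computed from predicted probabilities rather than hard labels; the scores below are made up for illustration:

```python
# ROC AUC from predicted probabilities of the positive class.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # model's probability for class 1
auc = roc_auc_score(y_true, y_scores)  # 1.0 = perfect separation, 0.5 = random
```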
Silhouette Method (Clustering Problem)
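The silhouette score measures how well-separated the clusters are; a sketch on synthetic data (all values illustrative):

```python
# Silhouette score for a K-Means clustering result.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
score = silhouette_score(X, labels)  # in [-1, 1]; higher = better separation
```

Comparing this score across candidate values of K is another way, besides the elbow method, to choose the number of clusters.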
