
Academic Year 2019-20

Experiment List
Subject: Machine Learning Code: CSDLO 6021

Class: TE COMP (A+B)

Course Outcomes: Upon completion of the course students will be able to

SN   Course Outcomes                                                                           Cognitive levels as per Bloom's Taxonomy
1    Understand types, issues, applications and steps to develop ML application                L1, L2
2    Understand ANN and DL                                                                     L1, L2
3    Understand optimization techniques and apply hyperparameter tuning for model selection    L1, L2, L3, L4
4    Apply regression and trees for learning and assess the outcome                            L1, L2, L3, L4
5    Apply classification and clustering algorithms for learning                               L1, L2, L3
6    Apply dimensionality reduction methods                                                    L1, L2, L3, L4

L1: Remember   L2: Understand   L3: Apply   L4: Analyze   L5: Evaluate   L6: Create

Sr. No. | Experiment Name | Course Outcome | Planned / Completed Date | Resources / Tools / Technology to be used | Learning Outcomes

Basic Experiments

1. Experiment a use case utilizing online datasets in order to apply Logistic Regression and measure the performance.
   Course Outcome: CO4 | Planned Date: A1: 13/1, B1: 13/1, A+B: 20/1 | Tools: Python
   Students will be able to:
   LO1.1 relate a problem which can be solved using Logistic Regression. [R]
   LO1.2 compare regularized Logistic Regression with unregularized Logistic Regression. [U]
   LO1.3 implement a Logistic Regression model for large scale classification. [A]

2. Experiment a use case utilizing online datasets in order to apply Decision Tree and measure the performance.
   Course Outcome: CO4 | Planned Date: A1: 20/1, B1: 20/1, A+B: 27/1 | Tools: Python
   Students will be able to:
   LO2.1 relate a problem which can be solved using Decision Tree. [R]
   LO2.2 compare Decision Tree Regressor and Classifier. [U]
   LO2.3 experiment a Decision Tree model for interpretability. [AN]

3. Experiment a use case utilizing online datasets in order to apply SVM and measure the performance.
   Course Outcome: CO5 | Planned Date: A1: 27/1, B1: 27/1, A+B: 3/2 | Tools: Python
   Students will be able to:
   LO3.1 compare several SVM kernels. [U]
   LO3.2 review optimal SVM parameters. [A]
   LO3.3 analyze how SVM is strong and powerful in building models. [AN]

4. Experiment a use case utilizing online datasets in order to apply kNN and measure the performance.
   Course Outcome: CO5 | Planned Date: A1: 3/2, B1: 3/2, A+B: 10/2 | Tools: Python
   Students will be able to:
   LO4.1 relate a problem which can be solved using kNN. [R]
   LO4.2 apply a kNN model for a classification task. [A]
   LO4.3 analyze pros and cons of kNN. [AN]

5. Experiment a use case utilizing online datasets in order to apply clustering and measure the performance.
   Course Outcome: CO5 | Planned Date: A1: 10/2, B1: 10/2, A+B: 17/2 | Tools: Python
   Students will be able to:
   LO5.1 compare clustering and classification tasks. [U]
   LO5.2 apply a clustering algorithm on unlabeled data. [A]
   LO5.3 analyze the performance of the clustering algorithm used. [AN]

Design Experiments

6. Experiment a use case utilizing online datasets in order to apply ensemble and measure the performance.
   Course Outcome: CO4 | Planned Date: A1: 17/2, B1: 17/2, A+B: 9/3 | Tools: Python
   Students will be able to:
   LO6.1 relate a problem which can be solved using ensembles. [R]
   LO6.2 compare several ensemble algorithms. [U]
   LO6.3 analyze how ensembles improve performance. [AN]

7. Experiment a use case utilizing online datasets in order to perform feature extraction in order to enhance performance.
   Course Outcome: CO6 | Planned Date: A1: 9/3, B1: 9/3, A+B: 16/3 | Tools: Python
   Students will be able to:
   LO7.1 relate to a problem that requires feature extraction. [A]
   LO7.2 apply feature extraction on an online dataset. [A]
   LO7.3 analyze before and after performance w.r.t. feature extraction. [AN]

8. Experiment a use case utilizing online datasets in order to perform feature selection in order to enhance performance.
   Course Outcome: CO6 | Planned Date: A1: 16/3, B1: 16/3, A+B: 23/3 | Tools: Python
   Students will be able to:
   LO8.1 relate to a problem that requires feature selection. [A]
   LO8.2 apply feature selection on an online dataset. [A]
   LO8.3 analyze before and after performance w.r.t. feature selection. [AN]

9. Experiment a use case utilizing online datasets in order to perform hyperparameter tuning in order to enhance performance.
   Course Outcome: CO3 | Planned Date: A1: 23/3, B1: 23/3, A+B: 30/3 | Tools: Python
   Students will be able to:
   LO9.1 relate to a problem that requires hyperparameter tuning. [A]
   LO9.2 apply optimization algorithms on an online dataset. [A]
   LO9.3 analyze before and after performance w.r.t. hyperparameter tuning. [AN]

Group Activity

10. Present a case study on the mini project developed as an extension to any one experiment listed above.
    Course Outcome: CO1-6 | Planned Date: A1: 30/3, B1: 30/3, A+B: 13/4 | Tools: Python
    Students will be able to:
    LO10.1 comprehend the relevance of ML in real life. [U]
    LO10.2 apply machine learning algorithms learnt to a real-life problem. [A]
    LO10.3 present the case study. [A]
Mini Project Calendar

Sr. No   Work to be done                            Planned / Completed Date          Cognitive levels of attainment as per Bloom's Taxonomy
1        Study tool for implementation              A1: 13/1   B1: 14/1   A+B: 22/1   L1, L2
2        Project Title and Course Identification    A1: 20/1   B1: 21/1   A+B: 29/1   L1, L2
3        Choose Data                                A1: 27/1   B1: 28/1   A+B: 5/2    L1, L2
4        Perform EDA                                A1: 3/2    B1: 4/2    A+B: 12/2   L1, L2, L3
5        Perform Feature Engineering                A1: 10/2   B1: 11/2   A+B: 11/3   L1, L2, L3
6        Choose Model                               A1: 17/2   B1: 18/2   A+B: 18/3   L1, L2
7        Train and Validate Model                   A1: 9/3    B1: 17/3   A+B: 18/3   L1, L2, L3, L4
8        Tune Hyperparameters                       A1: 16/3   B1: 24/3   A+B: 1/4    L1, L2, L3, L4
9        Test and Evaluate Model                    A1: 23/3   B1: 31/3   A+B: 8/4    L1, L2, L3, L4, L5
10       Prepare report                             A1: 30/3   B1: 7/4    A+B: 15/4   L1, L2

Total Hours: 30

Experiment 01
Learning Objective: Experiment a use case utilizing online datasets in order to apply Logistic Regression and measure
the performance.

Tools: Python on Anaconda Spyder / Jupyter Notebook / GoogleColab

Theory: The logistic regression hypothesis is:

hθ(x) = g(θᵀx)

and the sigmoid activation function g(z) is:

g(z) = 1 / (1 + e^(−z))

When we have an example where y = 1, we hope hθ(x) is close to 1. For hθ(x) to be close to 1, (θᵀx) must be much larger
than 0.

Similarly, when y = 0, we hope hθ(x) is close to 0. For hθ(x) to be close to 0, (θᵀx) must be much less than 0.

The cost for a single training example is:

Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

For the overall cost function, we sum this cost over all the training examples and scale by a 1/m term. If you then plug in
the hypothesis definition hθ(x), you get the expanded cost function:

J(θ) = −(1/m) Σ [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

Plotting the cost contribution of an example with y = 1 against z = θᵀx shows that if z is large, the cost is low, which is
good; but if z is 0 or negative, the cost contribution is high. This is why, when logistic regression sees a positive example,
it tries to make θᵀx a very large value. If y = 0, only the second term matters.
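
A minimal sketch of how such an experiment could be set up with scikit-learn is given below. The breast cancer dataset,
the train/test split and the regularization values are illustrative assumptions, not prescribed by this manual.

# Illustrative sketch: logistic regression on a built-in dataset (assumed choices).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and hold out a test split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features; regularized logistic regression is sensitive to scale.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# C is the inverse regularization strength; a very large C approximates an
# (almost) unregularized model, which supports the comparison asked for in LO1.2.
for C in (0.01, 1.0, 1e6):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"C={C}: accuracy={accuracy_score(y_test, y_pred):.3f}")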

Implementation: ………………………………………………………………………………..
Result and Discussion: …………………………….…………………..………………………

Learning Outcomes: Students should have the ability to

LO1: identify a problem which can be solved using Logistic Regression.

LO2: implement a Logistic Regression model for large scale classification.

LO3: compare Logistic Regression with Linear Regression.

LO4: compare regularized Logistic Regression with unregularized Logistic Regression.

Course Outcomes: Upon completion of the course students will be able to apply regression for learning and assess the
outcome.
Conclusion: ……………………………………………………………………………………

Viva Questions:

1. How is Logistic Regression different from Linear Regression?


2. Why is Logistic Regression a classification technique?
3. Describe a use case where Logistic Regression can be applied.
4. How do you measure the performance of a Logistic Regression model?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 02
Learning Objective: Experiment a use case utilizing online datasets in order to apply Decision Tree and measure the
performance.

Tools: Python on Anaconda Spyder / Jupyter Notebook / GoogleColab

Theory:
Decision Tree is one of the most powerful and popular algorithms. The decision-tree algorithm falls under the category of
supervised learning algorithms. It works for both continuous as well as categorical output variables.

We make following assumptions while using Decision tree:


 At the beginning, we consider the whole training set as the root.
 Attributes are assumed to be categorical for Information Gain and continuous for the Gini Index.
 On the basis of attribute values, records are distributed recursively.
 We use statistical methods for ordering attributes as root or internal node.

ALGORITHM:
1. Find the best attribute and place it on the root node of the tree.
2. Now, split the training set of the dataset into subsets. While making a subset, make sure that each subset of the
training dataset has the same value for that attribute.
3. Find leaf nodes in all branches by repeating 1 and 2 on each subset.

While implementing the decision tree we will go through the following two phases:
1. Building Phase
 Preprocess the dataset.
 Split the dataset into train and test set.
 Train the classifier.
2. Operational Phase
 Make predictions.
 Calculate the accuracy.
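
The two phases above can be sketched as follows; the Iris dataset, the Gini criterion and the depth limit are assumed
choices for illustration only.

# Illustrative sketch of the building and operational phases (dataset choice assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Building phase: load/preprocess, split, train the classifier.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Operational phase: make predictions and calculate the accuracy.
y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# The fitted tree can be printed as if-then rules, which is what makes it interpretable.
print(export_text(tree, feature_names=load_iris().feature_names))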

Implementation: …………………………………………………………………………………
Result and Discussion: …………………………………………………………………………

Learning Outcomes: Students should have the ability to

LO1: identify a problem which can be solved using Decision Tree.

LO2: implement a Decision Tree model for interpretability.

LO3: compare Information Gain and Gini Index.

Course Outcomes: Upon completion of the course students will be able to apply tree based algorithm for learning and
assess the outcome.
Conclusion: ………………………………………………………………………………………

Viva Questions:

1. Differentiate decision tree based classification and regression.


2. Define key concepts of decision trees.
3. How is feature selection performed in a Decision Tree?
4. How is the Gini Index computed?
For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained
Experiment 03
Learning Objective: Experiment a use case utilizing online datasets in order to apply SVM and measure the
performance.

Tools: Python on Anaconda Spyder / Jupyter Notebook / GoogleColab

Theory:

Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification,
regression and even outlier detection. The linear SVM classifier works by drawing a straight line between two classes.
All the data points that fall on one side of the line will be labeled as one class and all the points that fall on the other
side will be labeled as the second. Sounds simple enough, but there is an infinite number of lines to choose from. How
do we know which line will do the best job of classifying the data? This is where the LSVM algorithm comes into play.
The LSVM algorithm will select a line that not only separates the two classes but stays as far away from the closest
samples as possible. In fact, the “support vector” in “support vector machine” refers to two position vectors drawn from
the origin to the points which dictate the decision boundary.
ALGORITHM
Suppose we have a vector w which is always normal to the hyperplane (perpendicular to the line in 2 dimensions). We
can determine how far away a sample is from our decision boundary by projecting the position vector of the sample on
to the vector w. As a quick refresher, the dot product of two vectors is proportional to the projection of the first vector
on to the second.

If it is a positive sample, we insist that the decision function (the dot product of w and the position vector x of a given
sample plus some constant b) returns a value greater than or equal to 1:

w · x + b ≥ 1

Similarly, if it is a negative sample, we insist that the decision function returns a value less than or equal to −1:

w · x + b ≤ −1

In other words, we will not consider any samples located between the decision boundary and the support vectors. The
variable y is equal to positive one for all positive samples and negative one for all negative samples.

After multiplying by y, the constraints for the positive and negative samples become identical:

y (w · x + b) ≥ 1

Meaning, we can simplify the constraints down to a single equation:

y_i (w · x_i + b) − 1 ≥ 0 for every sample i.
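
As an illustration only (the dataset and kernel settings below are assumed, not prescribed), the following sketch
compares a linear and an RBF-kernel SVM in scikit-learn:

# Illustrative sketch comparing SVM kernels on a toy dataset (choices assumed).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A non-linearly separable toy problem.
X, y = make_moons(n_samples=500, noise=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# C controls the softness of the margin described above; compare linear vs. RBF kernel.
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    print(kernel, "test accuracy:", round(model.score(X_test, y_test), 3))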

Implementation: …………………………………………………………………………………

Result and Discussion: …………………………………………………………………………


Learning Outcomes: Students should have the ability to

LO1: identify a problem which can be solved using Support Vector Machine.

LO2: implement a SVM for non linear classification task.

LO3: compare SVM for classification, regression and outlier detection.

LO4: explain kernels of SVM.

Course Outcomes: Upon completion of the course students will be able to apply SVM for learning and assess the
outcome.
Conclusion: ………………………………………………………………………………………

Viva Questions:

1. How is SVM different from Logistic Regression?


2. What are SVM kernels, and when would you use a non-linear kernel?
3. Describe a use case where SVM can be applied.
4. How do you measure the performance of an SVM model?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 04
Learning Objective: Experiment a use case utilizing online datasets in order to apply kNN and measure the
performance.

Tools: Python on Anaconda Spyder / Jupyter Notebook / GoogleColab

Theory:

K nearest neighbors or KNN Algorithm is a simple algorithm which uses the entire dataset in its training phase.
Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for k-most
similar instances and the data with the most similar instance is finally returned as the prediction. kNN is often used in
search applications where you are looking for similar items, like find items similar to this one. Algorithm suggests
that if you’re similar to your neighbours, then you are one of them. The k-nearest neighbors algorithm uses a very
simple approach to perform classification. When tested with a new example, it looks through the training data and finds
the k training examples that are closest to the new example. It then assigns the most common class label (among
those k-training examples) to the test example.

k in the kNN algorithm represents the number of nearest neighbor points which vote for the new test data's class. If
k=1, then test examples are given the same label as the closest example in the training set. If k=3, the labels of the three
closest training examples are checked and the most common (i.e., occurring at least twice) label is assigned, and so on for
larger values of k.

ALGORITHM
Step 1: Calculate the Euclidean distance between the new point and the existing points.
Step 2: Choose the value of K and select the K neighbors closest to the new point.
Step 3: Count the votes of all the K neighbors and predict the class (or value).
Following are the Steps for execution:

Step 1: Handling the data


Step 2: Calculate the distance
Step 3: Find k nearest point
Step 4: Predict the class
Step 5: Check the accuracy
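
The steps above can be sketched directly in Python; the Iris dataset and k = 3 are assumed for illustration:

# Illustrative from-scratch sketch of kNN (Euclidean distance, majority vote).
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Step 1: handle the data (assumed dataset, simple train/test split).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 5: check the accuracy on the held-out test set.
y_pred = np.array([knn_predict(X_train, y_train, x, k=3) for x in X_test])
print("Accuracy:", (y_pred == y_test).mean())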

Implementation: …………………………………………………………………………………

Result and Discussion: …………………………………………………………………………


Learning Outcomes: Students should have the ability to

LO1: identify a problem which can be solved using kNN algorithm.

LO2: implement a kNN model for classification task.

LO3: identify the best value of k.

LO4: implement several distance metrics.

Course Outcomes: Upon completion of the course students will be able to apply regression for learning and assess the
outcome.
Conclusion: ………………………………………………………………………………………

Viva Questions:

1. Define several distance metrics for kNN classification.


2. How can kNN be used for imputing missing values?
3. Describe a use case where kNN can be applied.
4. How do you measure the performance of a kNN model?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 05
Learning Objective: Experiment a use case utilizing online datasets in order to apply clustering and measure the
performance.

Theory:

In Machine Learning, we often think about how to use data to make predictions on new data points. This is called
“supervised learning.” Sometimes, however, rather than ‘making predictions’, we instead want to categorize data into
buckets. This is termed “unsupervised learning.” Clustering is one of the most frequently utilized forms of unsupervised
learning. The EM (expectation maximization) technique is similar to the K-Means technique. The basic operation of K-
Means clustering algorithms is relatively simple: Given a fixed number of k clusters, assign observations to those
clusters so that the means across clusters (for all variables) are as different from each other as possible.

The EM algorithm extends this basic approach to clustering in two important ways: Instead of assigning examples to
clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes
probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering
algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.

Expectation Maximization algorithm: The basic approach and logic of this clustering method is as follows. Suppose you
measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two
clusters of observations with different means (and perhaps different standard deviations); within each sample, the
distribution of values for the continuous variable follows the normal distribution. The goal of EM clustering is to
estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data
(distribution).

Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of
different distributions in different clusters. The results of EM clustering are different from those computed by k-means
clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm
does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each
observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an
actual assignment of observations to clusters, based on the (largest) classification probability.
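
A small illustrative sketch of this contrast, using scikit-learn's KMeans and GaussianMixture (which fits a Gaussian
mixture by EM) on synthetic data; all choices below are assumptions for demonstration:

# Illustrative sketch: K-Means hard assignments vs. EM (Gaussian mixture) soft assignments.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic unlabeled data with three clusters of different spreads.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=7)

# K-Means assigns each observation to exactly one cluster.
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("K-Means hard labels:", km.labels_[:5])

# EM computes cluster-membership probabilities; predict reports the most probable cluster.
gmm = GaussianMixture(n_components=3, random_state=7).fit(X)
print("EM membership probabilities:\n", gmm.predict_proba(X[:5]).round(3))
print("EM most probable clusters:", gmm.predict(X[:5]))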

Implementation: …………………………………………………………………………………

Result and Discussion: …………………………………………………………………………


Learning Outcomes: Students should have the ability to

LO1: identify a problem which can be solved using clustering.

LO2: implement Expectation Maximisation algorithm for clustering task.

LO3: compare EM with k means.

LO4: describe need of supervised learning after clustering.

Course Outcomes: Upon completion of the course students will be able to apply clustering for learning and assess the
outcome.
Conclusion: ………………………………………………………………………………………

Viva Questions:

1. How is EM different from k-means?


2. Why is EM a clustering technique?
3. Describe a use case where clustering algorithms can be applied.
4. How do you measure the performance of a clustering algorithm?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 06
Learning Objective: Experiment a use case utilizing online datasets in order to apply ensemble and measure the
performance.

Theory:

1. Simple Ensemble Techniques:

1.1 Max Voting: The max voting method is generally used for classification problems. In this technique, multiple
models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The
predictions which we get from the majority of the models are used as the final prediction.
For example, suppose you asked 5 of your colleagues to rate your movie (out of 5), and three of them rated it as
4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating will be taken as 4. You can
consider this as taking the mode of all the predictions.

The result of max voting would be something like this:


Colleague 1 Colleague 2 Colleague 3 Colleague 4 Colleague 5 Final rating
5 4 5 4 4 4

1.2 Averaging: Similar to the max voting technique, multiple predictions are made for each data point in averaging. In
this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging
can be used for making predictions in regression problems or while calculating probabilities for classification problems.
For example, in the below case, the averaging method would take the average of all the values.
i.e. (5+4+5+4+4)/5 = 4.4

1. 3 Weighted Average: This is an extension of the averaging method. All models are assigned different weights
defining the importance of each model for prediction. For instance, if two of your colleagues are critics, while others
have no prior experience in this field, then the answers by these two friends are given more importance as compared to
the other people.
The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
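
The three simple techniques can be checked on the ratings example above with a few lines of Python (illustrative only):

# Quick numeric check of max voting, averaging and weighted averaging on the example ratings.
import numpy as np
from statistics import mode

ratings = [5, 4, 5, 4, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]   # weights from the example (the two critics weighted higher)

print("Max voting (mode):", mode(ratings))            # -> 4
print("Averaging:", np.mean(ratings))                 # -> 4.4
print("Weighted average:", np.dot(ratings, weights))  # -> 4.41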

2. Advanced Ensemble techniques:

2.1 Stacking: Stacking is an ensemble learning technique that uses predictions from multiple models (for example
decision tree, knn or svm) to build a new model. This model is used for making predictions on the test set.

2.2 Blending: Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set
to make predictions. In other words, unlike stacking, the predictions are made on the holdout set only. The holdout set
and the predictions are used to build a model which is run on the test set.

2.3 Bagging: The idea behind bagging is combining the results of multiple models (for instance, all decision trees) to
get a generalized result. Bootstrapping is a sampling technique in which we create subsets of observations from the
original dataset, with replacement. The size of the subsets is the same as the size of the original set.
Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution (complete
set).

2.4 Boosting: Boosting is a sequential process, where each subsequent model attempts to correct the errors of the
previous model. The succeeding models are dependent on the previous model. Let’s understand the way boosting works
in the below steps:
 A subset is created from the original dataset.
 Initially, all data points are given equal weights.
 A base model is created on this subset.
This model is used to make predictions on the whole dataset.
Errors are calculated using the actual values and predicted values. The observations which are incorrectly predicted, are
given higher weights. Another model is created and predictions are made on the dataset.
Similarly, multiple models are created, each correcting the errors of the previous model. The final model (strong
learner) is the weighted mean of all the models (weak learners). Thus, the boosting algorithm combines a number of
weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they
work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.

3. Algorithms based on Bagging and Boosting


Bagging and Boosting are two of the most commonly used techniques in machine learning. In this section, we will look
at them in detail. Following are the algorithms:
 Bagging algorithms
o Bagging meta-estimator
o Random forest
 Boosting algorithms
o AdaBoost
o GBM
o XGBM
o Light GBM
o CatBoost
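
As an illustrative sketch (the dataset and estimator counts are assumed), one bagging algorithm and one boosting
algorithm from the list above can be compared against a single decision tree using cross-validation:

# Illustrative sketch: single tree vs. a bagging ensemble vs. a boosting ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Single decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=200, random_state=0),
}

# 5-fold cross-validated accuracy gives a fairer comparison than a single split.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")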
Implementation: …………………………………………………………………………………

Result and Discussion: …………………………………………………………………………

Learning Outcomes: The student should have the ability to

LO1: Implement Ensemble model.

LO2: Measure the performance of ensemble model.

LO3: Compare Bagging, Boosting and stacking ensemble methods.

LO4: Improve the prediction using ensemble techniques.

Course Outcomes: Upon completion of the course students will be able to apply ensemble techniques for learning and
assess the outcome.
Conclusion:
Viva Questions:

1. What is an Ensemble model?


2. What is Bagging, Boosting and Stacking?

3. Can we ensemble multiple models of the same ML algorithm?

4. How can we identify the weights of different models?

5. What are the benefits of Ensemble model?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 7
Learning Objective: Experiment a use case utilizing online datasets in order to perform feature extraction in order to
enhance performance.

Theory:

Principal Components Analysis:


Generally speaking, PCA has two objectives:
1. “Data” reduction - moving from many original variables down to a few “composite” variables (linear combinations
of the original variables).
2. Interpretation - which variables play a larger role in the explanation of total variance.

Think of the new variables, called the principal components, as composite variables consisting of a mixture of the
original variables. In PCA, the goal is to find a set of k principal components (composite variables) that: Is much
smaller than the original set of p variables, and Accounts for nearly all of the total sample variance. If these two goals
can be accomplished, then the set of k principal components contains almost as much information as the original p
variables. This means that the k principal components can then replace the p original variables. The original data set is
thereby reduced, from n measurements on p variables to n measurements on k variables. PCA often reveals
relationships between variables that were not previously suspected. Because of such relationships, new interpretations of
the data and variables often stem from PCA. PCA usually serves more as a means to an end rather than an end in
itself, in that the composite variables (the principal components) are often used in larger investigations or in other
statistical techniques such as:
1. Multiple regression
2. Cluster analysis.

Principal components depend solely on the covariance matrix or the correlation matrix. Linear combination weights in
PCA aren’t typically either ones or zeros rather, all the linear combination weights for each principal component come
directly from the eigenvectors of the covariance matrix or the correlation matrix. Recall that for p variables, the p × p
covariance/correlation matrix has a set of p eigenvalues and p eigenvectors
Each principal component is formed by taking the elements of the corresponding eigenvector as the weights of the linear
combination.
If the k-th eigenvector ek = (e1k, e2k, …, epk), then the principal components Y1, … are formed by:
Y1 = e11X1 + e21X2 + . . . + ep1Xp
Y2 = e12X1 + e22X2 + . . . + ep2Xp
...
Yp = e1pX1 + e2pX2 + . . . + eppXp

The output of a PCA is usually summarized by:
 The importance of each component, measured by the proportion of total sample variance accounted for by the component.
 The importance of each original variable within each component.
 The weight of the component for each variable (for interpretation of the relative importance of the original variables).
The sum of the eigenvalues is equal to the total sample variance.

When all measurements are positively correlated, the first principal component is often some kind of average of the
measurements (e.g., size of birds, severity index of psychiatric symptoms)
Then the other principal components give important information about the remaining pattern (e.g., shape of birds,
pattern of psychiatric symptoms)

Applications:
 Dimension reduction: summarize the data with a smaller number of variables, losing as little information as
possible.
 Can be used for graphical representations of the data.
 Use PCA as input for regression analysis.

Limitations of PCA:
 We only consider orthogonal transformations (rotations) of the original variables. i.e., PCA allows only linear
combinations of the original variables.
 PCA is based only on the correlation (or covariance matrix for non-standardized data) of the data. Some
distributions (e.g. multivariate normal) are completely characterized by this, but others are not.
 Dimension reduction can only be achieved if the original variables were correlated. If the original variables
were uncorrelated, PCA does nothing, except for ordering them according to their variance.
 PCA is not scale invariant.
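
An illustrative PCA sketch with scikit-learn; the Wine dataset and the choice of two components are assumptions for
demonstration only:

# Illustrative PCA sketch on standardized data (dataset choice assumed).
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is not scale invariant, so standardize first

# Keep the first two principal components (composite variables).
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Proportion of total sample variance accounted for by each component.
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
# Linear combination weights (eigenvector elements) for each component.
print("Component loadings shape:", pca.components_.shape)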

Singular value decomposition (SVD)


In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It is a
generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix
with positive eigenvalues). It has many useful applications in signal processing and statistics. Singular value
decomposition is a method of decomposing a matrix into three other matrices:

X = U S Vᵀ

where:

X is an N × D matrix,
U is an N × N orthonormal matrix (so UᵀU = UUᵀ = I),
S is an N × D diagonal matrix containing the singular values, and
V is a D × D orthonormal matrix (so VᵀV = VVᵀ = I).

The columns of U are the left singular vectors and the columns of V are the right singular vectors.

Since there are at most D singular values (assuming N > D), the last N − D columns of U are irrelevant, since they are
multiplied by the zero rows of S. If N > D, the economy-sized (thin) SVD therefore keeps only the first D columns of U
and the top D × D block of S. If N < D, the roles are reversed and only the first N columns of V are needed.

For an arbitrary real matrix X, if X = U S Vᵀ, we have

XᵀX = V (SᵀS) Vᵀ   and   X Xᵀ = U (S Sᵀ) Uᵀ,

so the right singular vectors are eigenvectors of XᵀX and the left singular vectors are eigenvectors of X Xᵀ.
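
A small NumPy sketch of the decomposition; the matrix below is random and serves purely as an illustration:

# Illustrative sketch of the SVD with NumPy (random stand-in matrix).
import numpy as np

N, D = 6, 4
X = np.random.default_rng(0).normal(size=(N, D))

# full_matrices=False gives the economy-sized SVD discussed above:
# U is N x D, s holds the D singular values, Vt is D x D.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Reconstruct X from the three factors and verify X = U S V^T.
X_rebuilt = U @ np.diag(s) @ Vt
print("Max reconstruction error:", np.abs(X - X_rebuilt).max())

# Rank-2 approximation: keep only the two largest singular values.
X_rank2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
print("Rank-2 approximation error (Frobenius):", np.linalg.norm(X - X_rank2))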

Implementation: …………………………………………………………………………………

Result and Discussion: …………………………………………………………………………


Learning Outcomes: The student should have the ability to

LO1: Identify the need of dimensionality reduction techniques in ML.

LO2: Implement PCA and SVD techniques.

LO3: Compare different dimensionality reduction techniques.

LO4: Analyze the performance of different dimensionality reduction techniques.

Course Outcomes: Upon completion of the course students will be able to apply dimensionality reduction methods.

Conclusion: ………………………………………………………………………………………

Viva Questions:

1. Which applications require dimensionality reduction?


2. What are the applications of SVD?
3. What are the properties of PCA?
4. What are principal components?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 8
Learning Objective: Experiment a use case utilizing online datasets in order to perform feature selection in order to
enhance performance.

Theory:

According to Forbes, about 2.5 quintillion bytes of data are generated every day. This data can then be analysed using
Data Science and Machine Learning techniques in order to provide insights and make predictions. In most cases,
though, the originally gathered data needs to be preprocessed before starting any statistical analysis. There
are many different reasons why it might be necessary to carry out such preprocessing; some examples are:
 The gathered data is not in the right format (eg. SQL Database, JSON, CSV, etc…).
 Missing values and Outliers.
 Scaling and Normalization.
 Reduce Intrinsic Noise present in the dataset (part of the stored data might be corrupted).
 Some features in the dataset might not gather any information to our analysis.
Reducing the number of features to use during a statistical analysis can possibly lead to several benefits such as:
 Accuracy improvements.
 Overfitting risk reduction.
 Speed up in training.
 Improved Data Visualization.
 Increase in explainability of our model.

Figure 1: Relationship between Classifier Performance and Dimensionality

There are many different methods which can be applied for Feature Selection. Some of the most important ones are:
 Filter Method: filtering our dataset and taking only a subset of it containing all the relevant features (eg.
correlation matrix using Pearson Correlation).
 Wrapper Method: follows the same objective as the Filter Method but uses a Machine Learning model as its
evaluation criterion (eg. Forward/Backward/Bidirectional/Recursive Feature Elimination). We feed some
features to our Machine Learning model, evaluate their performance and then decide whether to add or remove
features to increase accuracy. As a result, this method can be more accurate than filtering but is more
computationally expensive.
 Embedded Method: like the Wrapper Method, the Embedded Method also makes use of a Machine Learning
model. The difference between the two methods is that the Embedded Method examines the different training
iterations of the ML model and then ranks the importance of each feature based on how much each of the
features contributed to the model training (eg. LASSO Regularization).

Figure 2: Filter, Wrapper and Embedded Methods Representation
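
An illustrative sketch of one method from each family using scikit-learn; the dataset, the estimators and the number of
features kept are assumed choices, not requirements:

# Illustrative sketch: filter, wrapper and embedded feature selection (choices assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently (ANOVA F-test) and keep the best 10.
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination driven by a model.
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: an L1-regularized model zeroes out coefficients of unhelpful features.
emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

for name, selector in [("Filter", filt), ("Wrapper", wrap), ("Embedded", emb)]:
    print(name, "selected features:", selector.get_support().sum())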

Implementation: …………………………………………………………………………………
Result and Discussion: …………………………………………………………………………

Learning Outcomes: The student should have the ability to

LO1: Identify the need of preprocessing and feature selection techniques in ML.
LO2: Implement different feature selection methods.
LO3: Compare different feature selection techniques.
LO4: Measure the performance of classifier after feature selection.

Course Outcomes: Upon completion of the course students will be able to apply dimensionality reduction methods.

Conclusion: ………………………………………………………………………………………

Viva Questions:

1. What is the importance of Feature selection in ML?


2. Explain Filter based feature selection methods with example?
3. Explain Wrapper based feature selection methods with example?
4. Explain Embedded feature selection methods with example?

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 9

Learning Objective: Experiment a use case utilizing online datasets in order to perform hyperparameter tuning in
order to enhance performance.

Theory:

Machine Learning models are composed of two different types of parameters:


Hyperparameters are all the parameters which can be arbitrarily set by the user before starting training (eg. number of
estimators in Random Forest).
Model parameters are instead learned during the model training (eg. weights in Neural Networks, Linear Regression).
The model parameters define how to use input data to get the desired output and are learned at training time. Instead,
Hyperparameters determine how our model is structured in the first place.
Machine Learning models tuning is a type of optimization problem. We have a set of hyperparameters and we aim to
find the right combination of their values which can help us to find either the minimum (eg. loss) or the maximum (eg.
accuracy) of a function (Figure 1).
This can be particularly important when comparing how different Machine Learning models perform on a dataset. In
fact, it would be unfair, for example, to compare an SVM model with the best hyperparameters against a Random Forest
model which has not been optimized.

There are several approaches to hyperparameter tuning:

 Manual: select hyperparameters based on intuition/experience/guessing, train the model with the
hyperparameters, and score on the validation data. Repeat process until you run out of patience or are satisfied
with the results.
 Grid Search: set up a grid of hyperparameter values and for each combination, train a model and score on the
validation data. In this approach, every single combination of hyperparameters values is tried which can be
very inefficient!
 Random search: set up a grid of hyperparameter values and select random combinations to train the model
and score. The number of search iterations is set based on time/resources.
 Automated Hyperparameter Tuning: use methods such as gradient descent, Bayesian Optimization, or
evolutionary algorithms to conduct a guided search for the best hyperparameters.
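
Grid search and random search can be sketched with scikit-learn as follows; the model, the grid and the iteration count
are assumed for illustration:

# Illustrative sketch of grid search vs. random search (model and grid are assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}

# Grid search: every combination in the grid is evaluated with cross-validation.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("Grid search best:", grid.best_params_, round(grid.best_score_, 3))

# Random search: only n_iter randomly chosen combinations are evaluated.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=4, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best:", rand.best_params_, round(rand.best_score_, 3))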

Bias-variance tradeoff: Every estimator has its advantages and drawbacks. Its generalization error can be decomposed
in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets.
The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data. Bias
and variance are inherent properties of estimators and we usually have to select learning algorithms and
hyperparameters so that both bias and variance are as low as possible. Another way to reduce the variance of a model is
to use more training data. However, you should only collect more training data if the true function is too complex to be
approximated by an estimator with a lower variance. In a simple one-dimensional problem it is easy to see whether the
estimator suffers from bias or variance. However, in high-dimensional spaces, models can become very difficult to
visualize.

Validation Curve: To validate a model we need a scoring function, for example accuracy for classifiers. The proper
way of choosing multiple hyperparameters of an estimator is of course grid search or similar methods that select the
hyperparameters with the maximum score on a validation set or multiple validation sets. If we optimize the
hyperparameters based on a validation score, the validation score is biased and no longer a good estimate of the
generalization. To get a proper estimate of the generalization we have to compute the score on a separate test set. However,
it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to
find out whether the estimator is overfitting or underfitting for some hyperparameter values.

Learning Curve: A learning curve shows the validation and training score of an estimator for varying numbers of
training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator
suffers more from a variance error or a bias error. Consider the following example where we plot the learning curve of a
naive Bayes classifier and SVM. For the naive Bayes, both the validation score and the training score converge to a
value that is quite low with increasing size of the training set. Thus, we will probably not benefit much from more
training data. In contrast, for small amounts of data, the training score of the SVM is much greater than the validation
score. Adding more training samples will most likely increase generalization.
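
The naive Bayes vs. SVM comparison described above can be reproduced in outline with scikit-learn's learning_curve;
the digits dataset and the parameter values are assumed choices for illustration:

# Illustrative learning-curve sketch for naive Bayes vs. SVM (dataset and sizes assumed).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

for name, model in [("Naive Bayes", GaussianNB()), ("SVM (RBF)", SVC(gamma=0.001))]:
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    print(name)
    print("  train sizes:      ", sizes)
    print("  mean train score: ", train_scores.mean(axis=1).round(3))
    print("  mean val score:   ", val_scores.mean(axis=1).round(3))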

Grid Search: Grid search is the process of performing hyperparameter tuning in order to determine the optimal values
for a given model. This is significant, as the performance of the entire model depends on the hyperparameter values
specified.

Implementation: …………………………………………………………………………………

Result and Discussion: …………………………………………………………………………


Learning Outcomes: The student should have the ability to

1. Identify parameters for model selection.

2. Compare different hyperparameter tuning techniques.

3. Explain different optimization techniques in ML.

4. Enhance the model performance using different parameter tuning techniques.

Course Outcomes: Upon completion of the course students will be able to understand optimization techniques and
apply Hyperparameter tuning for model selection

Conclusion: ………………………………………………………………………………………

Viva Questions:

1. What is hyperparameter tuning in ML?


2. What is parameter optimization?
3. Which are the different parameter optimization techniques?
4. What are holdout and cross validation?
For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained

Experiment 10
Learning Objectives: Present a case study on the mini project developed as extension to any one experiment listed
above.

Theory: As per topic chosen


Learning Outcomes: The student should have the ability to understand real life application of ML algorithms for
prediction/ classification tasks.

Course Outcomes: Upon completion of the course students will be able to present a case study on a real-life application of ML.

Conclusion: ………………………………………………………………………………………

Viva Questions: As per topic chosen

For Faculty Use

Correction Parameters   Formative Assessment [40%]   Timely completion of Practical [40%]   Attendance / Learning Attitude [20%]

Marks Obtained
