Experiment List
Subject: Machine Learning Code: CSDLO 6021
Basic Experiments
1. Experiment a use case utilizing online datasets in order to apply Logistic Regression and measure the performance.
CO: CO4. Schedule: A1: 13/1, A+B: 20/1, B1: 13/1. Language: Python.
Students will be able to:
LO1.1 relate a problem which can be solved using Logistic Regression. [R]
LO1.2 compare regularized Logistic Regression with unregularized. [U]
LO1.3 implement a Logistic Regression model for large scale classification. [A]
2. Experiment a use case utilizing online datasets in order to apply Decision Tree and measure the performance.
CO: CO4. Schedule: A1: 20/1, A+B: 27/1, B1: 20/1. Language: Python.
Students will be able to:
LO2.1 relate a problem which can be solved using Decision Tree. [R]
LO2.2 compare Decision Tree Regressor and Classifier. [U]
LO2.3 experiment a Decision Tree model for interpretability. [AN]
3. Experiment a use case utilizing online datasets in order to apply SVM and measure the performance.
CO: CO5. Schedule: A1: 27/1, A+B: 3/2, B1: 27/1. Language: Python.
Students will be able to:
LO3.1 compare several SVM kernels. [U]
LO3.2 review optimal SVM parameters. [A]
LO3.3 analyze how SVM is strong and powerful in building models. [AN]
4. Experiment a use case utilizing online datasets in order to apply kNN and measure the performance.
CO: CO5. Schedule: A1: 3/2, A+B: 10/2, B1: 3/2. Language: Python.
Students will be able to:
LO4.1 relate a problem which can be solved using kNN. [R]
LO4.2 apply a kNN model for classification task. [A]
LO4.3 analyze pros and cons of kNN. [AN]
5. Experiment a use case utilizing online datasets in order to apply clustering and measure the performance.
CO: CO5. Schedule: A1: 10/2, A+B: 17/2, B1: 10/2. Language: Python.
Students will be able to:
LO5.1 compare clustering and classification task. [U]
LO5.2 apply a clustering algorithm on unlabeled data. [A]
LO5.3 analyze performance of clustering algorithm used. [AN]
Design Experiments
6. Experiment a use case utilizing online datasets in order to apply ensemble and measure the performance.
CO: CO4. Schedule: A1: 17/2, A+B: 9/3, B1: 17/2. Language: Python.
Students will be able to:
LO6.1 relate a problem which can be solved using ensembles. [R]
LO6.2 compare several ensemble algorithms. [U]
LO6.3 analyze how ensembles improve performance. [AN]
7. Experiment a use case utilizing online datasets in order to perform feature extraction in order to enhance performance.
CO: CO6. Schedule: A1: 9/3, A+B: 16/3, B1: 9/3. Language: Python.
Students will be able to:
LO7.1 relate to a problem that requires feature extraction. [A]
LO7.2 apply feature extraction on online dataset. [A]
LO7.3 analyze before and after performance w.r.t. feature extraction. [AN]
8. Experiment a use case utilizing online datasets in order to perform feature selection in order to enhance performance.
CO: CO6. Schedule: A1: 16/3, A+B: 23/3, B1: 16/3. Language: Python.
Students will be able to:
LO8.1 relate to a problem that requires feature selection. [A]
LO8.2 apply feature selection on online dataset. [A]
LO8.3 analyze before and after performance w.r.t. feature selection. [AN]
9. Experiment a use case utilizing online datasets in order to perform hyperparameter tuning in order to enhance performance.
CO: CO3. Schedule: A1: 23/3, A+B: 30/3, B1: 23/3. Language: Python.
Students will be able to:
LO9.1 relate to a problem that requires hyperparameter tuning. [A]
LO9.2 apply optimization algorithms on online dataset. [A]
LO9.3 analyze before and after performance w.r.t. hyperparameter tuning. [AN]
Group Activity
10. Present a case study on the mini project developed as extension to any one experiment listed above.
CO: CO1-6. Schedule: A1: 30/3, A+B: 13/4, B1: 30/3. Language: Python.
Students will be able to:
LO11.1 comprehend relevance of ML in real life. [U]
LO11.2 apply machine learning algorithms learnt to real life problem. [A]
LO11.3 present the case study. [A]
Mini Project Calendar
Total Hours 30
Experiment 01
Learning Objective: Experiment a use case utilizing online datasets in order to apply Logistic Regression and measure
the performance.
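Theory:
Logistic Regression models the probability that y = 1 with the hypothesis hθ(x) = 1 / (1 + e^(-z)), where z = θT x.
If we have an example where y = 1, we hope that hθ(x) is close to 1. For hθ(x) to be close to 1, θT x must be much larger than 0.
Similarly, when y = 0, we hope that hθ(x) is close to 0. For hθ(x) to be close to 0, θT x must be much less than 0.
The cost contribution of a single example is
$$-y \log h_\theta(x) - (1 - y) \log\left(1 - h_\theta(x)\right)$$
For the overall cost function, we sum this term over all m training examples and multiply by 1/m. Plugging in the hypothesis definition hθ(x) gives the expanded cost function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \frac{1}{1 + e^{-\theta^T x^{(i)}}} + \left(1 - y^{(i)}\right) \log\left(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\right) \right]$$
When y = 1, only the first term matters, so the cost contribution of the example as a function of z is -log(1 / (1 + e^(-z))). If z is large, the cost is low, which is what we want. But if z is 0 or negative, the cost contribution is high. This is why, when logistic regression sees a positive example, it tries to make θT x very large. If y = 0, only the second term matters.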
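The behaviour of the cost for a single example can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`sigmoid`, `example_cost`, `total_cost`) are illustrative and not part of any prescribed implementation.

```python
import math

def sigmoid(z):
    """Logistic function h = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def example_cost(z, y):
    """Cost contribution of one example with label y (0 or 1) and z = theta^T x."""
    h = sigmoid(z)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def total_cost(zs, ys):
    """Overall cost J(theta): the average of the per-example costs."""
    m = len(zs)
    return sum(example_cost(z, y) for z, y in zip(zs, ys)) / m

# For a positive example (y = 1) the cost shrinks as z grows.
for z in (-3.0, 0.0, 3.0):
    print(f"z = {z:+.1f}  cost(y=1) = {example_cost(z, 1):.4f}")
```

In the lab itself a library implementation would normally be used; this sketch only illustrates why a large positive z is rewarded when y = 1.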
Implementation: ………………………………………………………………………………..
Result and Discussion: …………………………….…………………..………………………
Course Outcomes: Upon completion of the course students will be able to apply regression for learning and assess the
outcome.
Conclusion: ……………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 02
Learning Objective: Experiment a use case utilizing online datasets in order to apply Decision Tree and measure the
performance.
Theory:
Decision Tree is one of the most powerful and popular algorithms. The decision tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables.
ALGORITHM:
1. Find the best attribute and place it on the root node of the tree.
2. Split the training set into subsets such that all records in a subset have the same value for that attribute.
3. Find leaf nodes in all branches by repeating 1 and 2 on each subset.
While implementing the decision tree we will go through the following two phases:
1. Building Phase
Preprocess the dataset.
Split the dataset into train and test set.
Train the classifier.
2. Operational Phase
Make predictions.
Calculate the accuracy.
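Step 1 of the algorithm above ("find the best attribute") needs a selection criterion, which the text does not name. The sketch below assumes information gain based on entropy, one common choice; the toy records and the attribute names `outlook` and `windy` are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting the records on attribute attr."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """Attribute with the highest information gain (candidate for the root node)."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy dataset: 'outlook' separates the classes, 'windy' does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
print(best_attribute(rows, labels))
```

Repeating the same selection on each subset produced by the split corresponds to step 3 of the algorithm.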
Implementation: …………………………………………………………………………………
Result and Discussion: …………………………………………………………………………
Course Outcomes: Upon completion of the course students will be able to apply tree based algorithm for learning and
assess the outcome.
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 03
Learning Objective: Experiment a use case utilizing online datasets in order to apply SVM and measure the
performance.
Theory:
Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a straight line between two classes. All the data points that fall on one side of the line are labeled as one class, and all the points that fall on the other side are labeled as the second. This sounds simple enough, but there is an infinite number of lines to choose from. How do we know which line will do the best job of classifying the data? This is where the linear SVM (LSVM) algorithm comes into play. The LSVM algorithm selects a line that not only separates the two classes but also stays as far away from the closest samples as possible. In fact, the "support vectors" in "support vector machine" are the position vectors drawn from the origin to the points which dictate the decision boundary.
ALGORITHM
Suppose we have a vector w which is always normal to the hyperplane (perpendicular to the line in 2 dimensions). We can determine how far away a sample is from our decision boundary by projecting the position vector of the sample onto the vector w. As a quick refresher, the dot product of two vectors is proportional to the projection of the first vector onto the second.
For a positive sample, we insist that the following decision function (the dot product of w and the position vector of the sample, plus some constant b) returns a value greater than or equal to 1:
$$\vec{w} \cdot \vec{x}_{+} + b \geq 1$$
Similarly, for a negative sample, we insist that the decision function returns a value smaller than or equal to -1:
$$\vec{w} \cdot \vec{x}_{-} + b \leq -1$$
In other words, we do not allow any samples to lie between the decision boundary and the support vectors. The variable y_i is equal to +1 for all positive samples and -1 for all negative samples.
After multiplying by y_i, the constraints for the positive and negative samples take the same form:
$$y_i \left( \vec{w} \cdot \vec{x}_{+} + b \right) \geq 1$$
and
$$y_i \left( \vec{w} \cdot \vec{x}_{-} + b \right) \geq 1$$
Meaning, we can simplify the constraints down to a single inequality:
$$y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \geq 0$$
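The combined constraint can be verified for a candidate boundary with a few lines of Python. This is a minimal sketch: the weight vector `w`, the constant `b` and the sample points are hypothetical values chosen for illustration, and no optimisation (finding the best w and b) is performed.

```python
def decision_value(w, b, x):
    """Dot product of w and the sample's position vector, plus the constant b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, samples):
    """True if every sample (x, y) with y in {+1, -1} satisfies y * (w.x + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)

# Hypothetical boundary x1 = 2, i.e. w = (1, 0) and b = -2.
w, b = (1.0, 0.0), -2.0
samples = [
    ((3.0, 1.0), +1),
    ((4.0, 0.0), +1),
    ((1.0, 2.0), -1),
    ((0.0, 0.0), -1),
]
print(satisfies_margin(w, b, samples))

# A positive sample inside the margin violates the constraint.
print(satisfies_margin(w, b, samples + [((2.5, 0.0), +1)]))
```

The samples for which the constraint holds with equality, here (3, 1) and (1, 2), are the ones that lie on the margin.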
Implementation: …………………………………………………………………………………
LO1: identify a problem which can be solved using Support Vector Machine.
Course Outcomes: Upon completion of the course students will be able to apply SVM for learning and assess the
outcome.
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 04
Learning Objective: Experiment a use case utilizing online datasets in order to apply kNN and measure the
performance.
Theory:
K nearest neighbors (kNN) is a simple algorithm that keeps the entire training dataset instead of building an explicit model. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for the k most similar instances and returns a prediction based on them. kNN is often used in search applications where you are looking for similar items, such as "find items similar to this one". The intuition is that if you are similar to your neighbours, then you are one of them. To classify a new example, the algorithm looks through the training data, finds the k training examples that are closest to the new example, and assigns the most common class label among those k training examples to the test example.
The k in kNN is the number of nearest neighbour points that vote on the class of the new test example. If k = 1, the test example is given the same label as the closest example in the training set. If k = 3, the labels of the three closest examples are checked and the most common label (in a two-class problem, the one occurring at least twice) is assigned, and so on for larger values of k.
ALGORITHM
Step 1: Calculate the Euclidean distance between the new point and the existing points.
Step 2: Choose the value of K and select the K neighbors closest to the new point.
Step 3: Count the votes of the K neighbors and predict the majority class.
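The three steps above can be sketched in plain Python as follows. The function name `knn_predict` and the toy training points are illustrative assumptions; ties between classes are resolved arbitrarily by `Counter.most_common`.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Step 2: keep the k closest neighbours.
    neighbours = by_distance[:k]
    # Step 3: count the votes and return the most common label.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set of (point, label) pairs.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_predict(train, (1.2, 1.4), k=3))
print(knn_predict(train, (6.4, 6.2), k=3))
```

Because every prediction scans the whole training set, the cost of a query grows with the size of the data, which is one of the drawbacks worth discussing in the analysis.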
Following are the Steps for execution:
Implementation: …………………………………………………………………………………
Course Outcomes: Upon completion of the course students will be able to apply kNN for learning and assess the outcome.
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 05
Learning Objective: Experiment a use case utilizing online datasets in order to apply clustering and measure the
performance.
Theory:
In Machine Learning, we often think about how to use data to make predictions on new data points. This is called
“supervised learning.” Sometimes, however, rather than ‘making predictions’, we instead want to categorize data into
buckets. This is termed “unsupervised learning.” Clustering is one of the most frequently utilized forms of unsupervised
learning. The EM (expectation maximization) technique is similar to the K-Means technique. The basic operation of K-
Means clustering algorithms is relatively simple: Given a fixed number of k clusters, assign observations to those
clusters so that the means across clusters (for all variables) are as different from each other as possible.
The EM algorithm extends this basic approach to clustering in two important ways: Instead of assigning examples to
clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes
probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering
algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.
Expectation Maximization algorithm: The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each cluster, the distribution of values for the continuous variable follows the normal distribution. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution).
Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of
different distributions in different clusters. The results of EM clustering are different from those computed by k-means
clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm
does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each
observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an
actual assignment of observations to clusters, based on the (largest) classification probability.
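The single-variable, two-cluster case described above can be sketched directly. This is a minimal illustration, not a production implementation: the synthetic data (two normal clusters with means 0 and 5), the crude initialisation and the fixed number of iterations are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iterations=50):
    """Fit a mixture of two 1-D normal distributions with the EM algorithm."""
    # Crude initialisation: start the two means at the extremes of the data.
    means = [min(data), max(data)]
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability of each cluster membership for every observation.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate mean, standard deviation and weight of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = math.sqrt(var)
            weights[k] = nk / len(data)
    return means, stds, weights, resp

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(5.0, 1.0) for _ in range(200)]
means, stds, weights, resp = em_two_gaussians(data)
print("estimated means:", [round(m, 2) for m in means])

# A hard assignment, if needed, takes the cluster with the largest probability.
assignments = [0 if r[0] >= r[1] else 1 for r in resp]
```

Note that `resp` holds classification probabilities rather than assignments; the last line shows how an actual assignment can be derived from them.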
Implementation: …………………………………………………………………………………
Course Outcomes: Upon completion of the course students will be able to apply clustering for learning and assess the
outcome.
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 06
Learning Objective: Experiment a use case utilizing online datasets in order to apply ensemble and measure the
performance.
Theory:
1.1 Max Voting: The max voting method is generally used for classification problems. In this technique, multiple
models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The
predictions which we get from the majority of the models are used as the final prediction.
For example, suppose you asked 5 of your colleagues to rate your movie (out of 5), and three of them rated it 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final rating is taken as 4. You can think of this as taking the mode of all the predictions.
1.2 Averaging: Similar to the max voting technique, multiple predictions are made for each data point in averaging. In
this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging
can be used for making predictions in regression problems or while calculating probabilities for classification problems.
For example, in the below case, the averaging method would take the average of all the values.
i.e. (5+4+5+4+4)/5 = 4.4
1.3 Weighted Average: This is an extension of the averaging method. All models are assigned different weights defining the importance of each model for prediction. For instance, if two of your colleagues are critics, while the others have no prior experience in this field, then the answers of these two colleagues are given more importance than those of the other people.
The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
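The three simple techniques can be reproduced with the numbers from the examples above. This is a small sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, mode

ratings = [5, 4, 5, 4, 4]                 # predictions of the five "models"
weights = [0.23, 0.23, 0.18, 0.18, 0.18]  # importance of each model (sums to 1)

max_vote = mode(ratings)                  # max voting: the most frequent prediction
average = mean(ratings)                   # averaging: plain mean of the predictions
weighted = sum(r * w for r, w in zip(ratings, weights))  # weighted average

print(max_vote, average, round(weighted, 2))
```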
2.1 Stacking: Stacking is an ensemble learning technique that uses predictions from multiple models (for example
decision tree, knn or svm) to build a new model. This model is used for making predictions on the test set.
2.2 Blending: Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set
to make predictions. In other words, unlike stacking, the predictions are made on the holdout set only. The holdout set
and the predictions are used to build a model which is run on the test set.
2.3 Bagging: The idea behind bagging is combining the results of multiple models (for instance, all decision trees) to
get a generalized result. Bootstrapping is a sampling technique in which we create subsets of observations from the
original dataset, with replacement. The size of the subsets is the same as the size of the original set.
Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution (complete
set).
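Bootstrapping and aggregation can be sketched in a few lines. In this illustration each "model" is simply the mean of its bag, which is an assumption made to keep the example short; in practice each bag would train a model such as a decision tree.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(42)
data = list(range(10))

# Create three bags, each the same size as the original dataset.
bags = [bootstrap_sample(data, rng) for _ in range(3)]

# "Train" one trivial model per bag (its mean) and aggregate by averaging.
bag_predictions = [mean(bag) for bag in bags]
bagged_prediction = mean(bag_predictions)

for bag in bags:
    print(bag)
print("aggregated prediction:", bagged_prediction)
```

Because sampling is done with replacement, a bag usually contains some observations more than once and omits others.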
2.4 Boosting: Boosting is a sequential process, where each subsequent model attempts to correct the errors of the
previous model. The succeeding models are dependent on the previous model. Let’s understand the way boosting works
in the below steps:
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights.
7. Another model is created and predictions are made on the dataset.
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.
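The re-weighting step, in which incorrectly predicted observations receive higher weights, can be illustrated as follows. The text does not prescribe a specific update rule; this sketch assumes the AdaBoost-style rule, and it requires the weighted error to lie strictly between 0 and 1.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: raise the weights of misclassified points, then renormalise.

    weights: current observation weights (summing to 1)
    correct: booleans, True where the current model predicted correctly
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # weight of this model in the final vote
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.2] * 5                        # initially all points have equal weight
correct = [True, True, True, True, False]  # the model misclassifies the last point
new_weights, alpha = reweight(weights, correct)
print([round(w, 3) for w in new_weights], round(alpha, 3))
```

After one round the misclassified point carries half of the total weight, so the next model is pushed to get it right.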
Course Outcomes: Upon completion of the course students will be able to apply ensemble techniques for learning and
assess the outcome.
Conclusion:
Viva Questions:
Marks
Obtained
Experiment 07
Learning Objective: Experiment a use case utilizing online datasets in order to perform feature extraction in order to
enhance performance.
Theory:
Principal Component Analysis (PCA) builds new variables, called the principal components, which are composite variables consisting of a mixture of the original variables. In PCA, the goal is to find a set of k principal components (composite variables) that (1) is much smaller than the original set of p variables, and (2) accounts for nearly all of the total sample variance. If these two goals can be accomplished, then the set of k principal components contains almost as much information as the original p variables. This means that the k principal components can then replace the p original variables. The original data set is thereby reduced, from n measurements on p variables to n measurements on k variables. PCA often reveals relationships between variables that were not previously suspected. Because of such relationships, new interpretations of the data and variables often stem from PCA. PCA usually serves as a means to an end rather than an end in itself, in that the composite variables (the principal components) are often used in larger investigations or in other statistical techniques such as:
1. Multiple regression
2. Cluster analysis.
Principal components depend solely on the covariance matrix or the correlation matrix. The linear combination weights in PCA are typically neither ones nor zeros; rather, all the linear combination weights for each principal component come directly from the eigenvectors of the covariance matrix or the correlation matrix. Recall that for p variables, the p × p covariance/correlation matrix has a set of p eigenvalues and p eigenvectors.
Each principal component is formed by taking the elements of the corresponding eigenvector as the weights of the linear combination.
If the k-th eigenvector is ek = (e1k, e2k, …, epk), then the principal components Y1, …, Yp are formed by:
Y1 = e11X1 + e21X2 + . . . + ep1Xp
Y2 = e12X1 + e22X2 + . . . + ep2Xp
...
Yp = e1pX1 + e2pX2 + . . . + eppXp
The output of PCA is usually interpreted through the importance of each component, measured by the proportion of total sample variance accounted for by the component, and through the importance of each original variable within each component, given by the weight of the component for that variable. The sum of the eigenvalues is equal to the total sample variance.
When all measurements are positively correlated, the first principal component is often some kind of average of the measurements (e.g., size of birds, severity index of psychiatric symptoms).
The other principal components then give important information about the remaining pattern (e.g., shape of birds, pattern of psychiatric symptoms).
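For two variables the eigenvalues and eigenvectors of the covariance matrix have a closed form, so the construction can be sketched without any library. The data values below are illustrative, and the helper `eigen_2x2` assumes a non-zero covariance between the two variables.

```python
import math

def covariance_matrix(xs, ys):
    """Sample variances and covariance of two variables: (sxx, sxy, syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy

def eigen_2x2(sxx, sxy, syy):
    """Eigenvalues (largest first) and unit eigenvectors of [[sxx, sxy], [sxy, syy]]."""
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    pairs = []
    for lam in (half_trace + root, half_trace - root):
        vx, vy = sxy, lam - sxx          # solves (S - lam * I) v = 0 when sxy != 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

sxx, sxy, syy = covariance_matrix(xs, ys)
(lam1, e1), (lam2, e2) = eigen_2x2(sxx, sxy, syy)

# Proportion of the total sample variance explained by the first component.
explained = lam1 / (lam1 + lam2)

# Scores on the first principal component: Y1 = e11 * X1 + e21 * X2 (centred data).
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
y1 = [e1[0] * (x - mx) + e1[1] * (y - my) for x, y in zip(xs, ys)]
print(round(explained, 3))
```

The sketch also makes it easy to confirm that the eigenvalues add up to the total sample variance `sxx + syy`.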
Applications:
Dimension reduction: summarize the data with a smaller number of variables, losing as little information as
possible.
Can be used for graphical representations of the data.
Use PCA as input for regression analysis.
Limitations of PCA:
We only consider orthogonal transformations (rotations) of the original variables. i.e., PCA allows only linear
combinations of the original variables.
PCA is based only on the correlation (or covariance matrix for non-standardized data) of the data. Some
distributions (e.g. multivariate normal) are completely characterized by this, but others are not.
Dimension reduction can only be achieved if the original variables were correlated. If the original variables
were uncorrelated, PCA does nothing, except for ordering them according to their variance.
PCA is not scale invariant.
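The principal components can also be obtained from the singular value decomposition (SVD) of the data matrix:
$$X = U S V^T$$
where:
X is an N × D matrix,
U is an N × N orthonormal matrix (so $U^T U = I_N$),
S is an N × D diagonal matrix containing the singular values,
V is a D × D orthonormal matrix (so $V^T V = I_D$).
Since there are at most D singular values, the last N − D columns of U are irrelevant, since they will be multiplied by 0.
If N > D, we can therefore use the economy-size form, in which U is truncated to N × D and S to D × D.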
Implementation: …………………………………………………………………………………
Course Outcomes: Upon completion of the course students will be able to apply dimensionality reduction methods.
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 08
Learning Objective: Experiment a use case utilizing online datasets in order to perform feature selection in order to
enhance performance.
Theory:
According to Forbes, about 2.5 quintillion bytes of data are generated every day. This data can be analysed using Data Science and Machine Learning techniques in order to provide insights and make predictions. However, in most cases the originally gathered data must first be preprocessed before any statistical analysis can start. There are many reasons why preprocessing might be necessary, for example:
The gathered data is not in the right format (e.g. SQL database, JSON, CSV, etc.).
Missing values and outliers.
Scaling and normalization.
Intrinsic noise present in the dataset needs to be reduced (part of the stored data might be corrupted).
Some features in the dataset might not contribute any information to our analysis.
Reducing the number of features to use during a statistical analysis can possibly lead to several benefits such as:
Accuracy improvements.
Overfitting risk reduction.
Speed up in training.
Improved Data Visualization.
Increase in explainability of our model.
There are many different methods which can be applied for Feature Selection. Some of the most important ones are:
Filter Method: filter the dataset and keep only the subset containing the relevant features (e.g. using a correlation matrix based on Pearson correlation).
Wrapper Method: follows the same objective as the Filter Method but uses a Machine Learning model as its evaluation criterion (e.g. Forward/Backward/Bidirectional/Recursive Feature Elimination). We feed some features to our Machine Learning model, evaluate its performance, and then decide whether to add or remove features to increase accuracy. As a result, this method can be more accurate than filtering but is more computationally expensive.
Embedded Method: like the Wrapper Method, the Embedded Method makes use of a Machine Learning model. The difference between the two methods is that the Embedded Method examines the different training iterations of our ML model and then ranks the importance of each feature based on how much each of the features contributed to the ML model training (e.g. LASSO regularization).
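The Filter Method with Pearson correlation can be sketched as follows. The feature names, the data and the threshold of 0.5 are hypothetical values chosen for illustration; a suitable threshold depends on the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, target, threshold=0.5):
    """Keep the features whose absolute correlation with the target reaches the threshold."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Hypothetical dataset: one informative feature and one irrelevant feature.
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [7, 9, 7, 9, 7, 9],
}
target = [52, 55, 61, 64, 70, 73]

kept, scores = filter_select(features, target)
print(kept)
print({name: round(r, 2) for name, r in scores.items()})
```

No model is trained at any point, which is what distinguishes the Filter Method from the Wrapper and Embedded Methods.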
Implementation: …………………………………………………………………………………
Result and Discussion: …………………………………………………………………………
LO1: Identify the need of preprocessing and feature selection techniques in ML.
LO2: Implement different feature selection methods.
LO3: Compare different feature selection techniques.
LO4: Measure the performance of classifier after feature selection.
Course Outcomes: Upon completion of the course students will be able to apply dimensionality reduction methods.
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 09
Learning Objective: Experiment a use case utilizing online datasets in order to perform hyperparameter tuning in
order to enhance performance.
Theory:
Manual: select hyperparameters based on intuition/experience/guessing, train the model with the
hyperparameters, and score on the validation data. Repeat process until you run out of patience or are satisfied
with the results.
Grid Search: set up a grid of hyperparameter values and, for each combination, train a model and score it on the validation data. In this approach every single combination of hyperparameter values is tried, which can be very inefficient.
Random search: set up a grid of hyperparameter values and select random combinations to train the model
and score. The number of search iterations is set based on time/resources.
Automated Hyperparameter Tuning: use methods such as gradient descent, Bayesian Optimization, or
evolutionary algorithms to conduct a guided search for the best hyperparameters.
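Grid search and random search from the list above can be contrasted in a short sketch. The function `validation_score` is a hypothetical stand-in for "train a model and score it on the validation data", and the hyperparameter names `C` and `depth` and their values are illustrative only.

```python
import itertools
import random

def validation_score(params):
    """Hypothetical validation score; highest at C = 1.0 and depth = 4."""
    return -((params["C"] - 1.0) ** 2) - (params["depth"] - 4) ** 2

grid = {"C": [0.1, 1.0, 10.0], "depth": [2, 4, 6]}

def grid_search(grid, score):
    """Try every combination of hyperparameter values and return the best one."""
    names = list(grid)
    candidates = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
    return max(candidates, key=score)

def random_search(grid, score, n_iter, rng):
    """Try n_iter randomly chosen combinations and return the best one."""
    candidates = [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=score)

best_grid = grid_search(grid, validation_score)
best_random = random_search(grid, validation_score, n_iter=4, rng=random.Random(0))
print("grid search:", best_grid)
print("random search:", best_random)
```

Grid search evaluates all 9 combinations here, whereas random search evaluates only `n_iter` of them and may therefore miss the best combination.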
Bias-variance tradeoff: Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data. Bias and variance are inherent properties of estimators, and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible. Another way to reduce the variance of a model is to use more training data. However, you should only collect more training data if the true function is too complex to be approximated by an estimator with a lower variance. In a simple one-dimensional problem it is easy to see whether the estimator suffers from bias or variance. In high-dimensional spaces, however, models can become very difficult to visualize.
Validation Curve: To validate a model we need a scoring function, for example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator is grid search or a similar method that selects the hyperparameters with the maximum score on a validation set or on multiple validation sets. Once the hyperparameters have been optimized on a validation score, that score is biased and is no longer a good estimate of the generalization. To get a proper estimate of the generalization we have to compute the score on another test set. However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.
Learning Curve: A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. Consider, for example, the learning curves of a naive Bayes classifier and an SVM. For the naive Bayes classifier, both the validation score and the training score converge to a value that is quite low as the size of the training set increases. Thus, we will probably not benefit much from more training data. In contrast, for small amounts of data, the training score of the SVM is much greater than the validation score. Adding more training samples will most likely increase generalization.
Grid Search: Grid search is a hyperparameter tuning procedure that determines the best values for a given model by evaluating every combination in a predefined grid. This is significant because the performance of the model depends on the hyperparameter values specified.
Implementation: …………………………………………………………………………………
Course Outcomes: Upon completion of the course students will be able to understand optimization techniques and
apply Hyperparameter tuning for model selection
Conclusion: ………………………………………………………………………………………
Viva Questions:
Marks
Obtained
Experiment 10
Learning Objectives: Present a case study on the mini project developed as extension to any one experiment listed
above.
Course Outcomes: Upon completion of the course students will be able to present
Conclusion: ………………………………………………………………………………………
Marks
Obtained