The goal of cross-validation is to test the model's ability to predict new data
that was not used in estimating it, in order to flag problems like overfitting or
selection bias and to give insight into how the model will generalize to an
independent dataset (i.e., an unknown dataset, for instance from a real problem).
-----------------------------------------------------------------------
A company was once interviewing candidates for the post of data scientist. One
candidate answered every question perfectly, on topics such as the importance of
cross validation, machine learning, and so on.
The interviewer asked him how he had managed it. The candidate replied that he had
built a database of all the questions asked by this interviewer over the past five
years and built a system that could predict the exact questions he would ask with
85% precision.
The interviewer said that he could not hire the candidate on ethical grounds. The
candidate replied, “It doesn’t matter. I was only cross validating my prediction
model.”
-----------------------------------------------------------------------
The purpose of cross validation is to assess how your prediction model performs
with an unknown dataset. We shall look at it from a layman’s point of view.
Suppose you are learning how to drive a car. Anyone can drive a car on an empty
road. The real test is how you drive in demanding traffic. That is why instructors
train you on roads that have traffic, so that you get used to it.
Therefore, when it is time for you to actually drive your car, you are prepared to
do so without the instructor sitting by your side to guide you. You are ready to
handle situations you have not encountered before.
-----------------------------------------------------------------------
(A) Exhaustive Cross Validation – This method tests the model on every possible
way of dividing the original sample into training and validation sets.
(B) Non-Exhaustive Cross Validation – Here, you do not evaluate every possible
split of the original sample, only some of them.
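Exhaustive methods such as leave-p-out have a disadvantage: the cross validation
process can become lengthy. The number of splits depends on the number of
observations in the original sample and your chosen value of ‘p’, the number of
observations held out for validation in each split.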
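
To make the cost concrete, here is a minimal sketch that enumerates every
leave-p-out split and counts them. The function name leave_p_out_splits is
hypothetical and chosen for illustration; it only produces index lists and assumes
the observations are numbered 0 to n - 1.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n_samples, p):
    """Yield (train, validation) index lists for every way of holding out p samples."""
    all_indices = set(range(n_samples))
    for validation in combinations(range(n_samples), p):
        train = sorted(all_indices - set(validation))
        yield train, list(validation)


# The number of splits is "n choose p", which grows very quickly with n and p.
splits = list(leave_p_out_splits(5, 2))
print(len(splits))    # 10 splits for 5 observations with p = 2
print(comb(100, 2))   # 4950 splits for 100 observations with p = 2
print(comb(100, 5))   # 75287520 splits for 100 observations with p = 5
```

Each split requires fitting the model once, so even a modest sample with a small
‘p’ can mean millions of fits.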
In k-fold cross validation, you split the original sample into ‘k’ subsamples and
repeat the cross validation process ‘k’ times, using each of the ‘k’ subsamples as
the validation data exactly once. Average the ‘k’ results to produce a single
estimate. The advantage of this method is that you use all the observations for
both training and validation, and each observation for validation exactly once.
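
The procedure can be sketched as follows. This is an illustrative outline, not a
reference implementation: k_fold_indices and k_fold_score are hypothetical names,
the folds are contiguous blocks (in practice you would usually shuffle the data
first), and fit and score are placeholders for whatever model and metric you use.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_score(xs, ys, k, fit, score):
    """Train on k-1 folds, validate on the remaining fold, and average the k scores."""
    results = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        results.append(score(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return mean(results)


# Toy usage (assumed for illustration): the "model" is just the mean of the
# training targets, and the score is the mean absolute error on the held-out fold.
def fit_mean(train_x, train_y):
    return mean(train_y)


def mean_abs_error(model, val_x, val_y):
    return mean(abs(model - y) for y in val_y)


xs = list(range(10))
ys = [2.0 * x for x in xs]
print(k_fold_indices(10, 3))
print(k_fold_score(xs, ys, 3, fit_mean, mean_abs_error))
```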
3. Holdout Method
-----------------
The holdout cross validation method is the simplest of all. In this method, you
randomly assign data points to two sets. The two sets need not be the same size.
Treat the smaller set, say ‘d0’, as the testing set and the larger one, ‘d1’, as
the training set. You train your model on the d1 set and test it on the d0 set. The
disadvantage is that you do only a single run, so the result depends on which
points happen to land in each set and can be misleading.
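
A single holdout split might look like the sketch below. The function name
holdout_split, the 20% test fraction, and the fixed seed are assumptions made for
illustration; the names d0 and d1 follow the text above.

```python
import random


def holdout_split(data, test_fraction=0.2, seed=0):
    """Randomly split data once into a larger training set and a smaller testing set."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    d0 = shuffled[:n_test]      # smaller set: testing
    d1 = shuffled[n_test:]      # larger set: training
    return d1, d0


train, test = holdout_split(list(range(10)))
print(len(train), len(test))    # 8 2
```

Changing the seed changes which points end up in the testing set, which is why a
single run can give a misleading picture.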
When you repeat the random split several times, you average the results over all
the splits. The advantage of this method is that the proportion of the training and
validation split does not depend on the number of folds, as it does in k-fold
cross validation. However, there is a disadvantage as well.
Some observations may never be selected for validation, whereas others may be
selected more than once. In other words, the validation subsets can overlap. This
procedure is known as the repeated random sub-sampling method.
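
The overlap is easy to see in a small sketch. The function name
random_subsampling_splits, the 25% validation fraction, and the seed are
assumptions for illustration only.

```python
import random


def random_subsampling_splits(n_samples, n_splits, validation_fraction=0.25, seed=0):
    """Yield n_splits independent random (train, validation) index splits."""
    rng = random.Random(seed)
    n_val = max(1, int(n_samples * validation_fraction))
    for _ in range(n_splits):
        validation = sorted(rng.sample(range(n_samples), n_val))
        held_out = set(validation)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, validation


# Count how often each observation lands in a validation set.
counts = [0] * 20
for _, validation in random_subsampling_splits(20, n_splits=4):
    for i in validation:
        counts[i] += 1
print(counts)
```

Because each split is drawn independently, the counts are usually uneven: a zero
means that observation was never used for validation, and a value above one means
the validation subsets overlapped.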
-----------------------------------------------------------------------
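You verify how accurate your model is on multiple, different subsets of the data.
This gives you more confidence that it will generalize well to the data that you
collect in the future. Cross validation does not make the model more accurate by
itself, but it gives a more reliable estimate of accuracy to guide your choices.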
-----------------------------------------------------------------------
We have seen what cross validation in machine learning is and understood the
importance of the concept. It is a vital aspect of machine learning, but it has its
limitations.
(a) In an ideal world, the cross validation will yield meaningful and accurate
results. However, the world is not perfect. You never know what kind of data the
model might encounter in the future.
(b) Usually, in predictive modelling, the structure you study evolves over time.
Hence, you can experience systematic differences between the training and
validation sets. Suppose you have a model that predicts stock values.
You have trained the model using data of the previous five years. Would it be
realistic to expect accurate predictions over the next five-year period?
(c) Here is another example where the limitation of the cross validation process
comes to the fore. You develop a model for predicting an individual’s risk of
suffering from a particular ailment.
However, you train the model using data from a study involving a specific section
of the population. The moment you apply the model to the general population, the
results could vary a lot.
------------------------------------------------------------------------
(a) You can use cross validation to compare the performances of a set of predictive
modelling procedures.
(b) It is widely used in the field of medical research. Suppose we use the
expression levels of a certain number of proteins, say 15, to predict whether a
cancer patient will respond to a specific drug.
The aim is to determine which subset of the 15 features produces the best
predictive model. Using cross validation, you can estimate which subset gives the
best results on data the model has not seen.
(c) Recently, data analysts have used cross validation in the field of medical
statistics. These procedures are useful in meta-analysis.
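
As an illustration of use (a), the sketch below compares two simple modelling
procedures by their cross-validated mean squared error. Everything here is assumed
for the example: the toy data, the two candidate procedures (a constant mean
predictor and a least-squares line), the interleaved fold assignment, and the
function names.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Procedure 1: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Procedure 2: ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cv_mse(xs, ys, fit, k=5):
    """k-fold cross-validated mean squared error, using interleaved folds."""
    n = len(xs)
    errors = []
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in val))
    return mean(errors)


# Toy data: a straight line with a small alternating perturbation.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

scores = {"mean": cv_mse(xs, ys, fit_mean), "line": cv_mse(xs, ys, fit_line)}
best = min(scores, key=scores.get)
print(scores)
print(best)   # line
```

The same loop extends to choosing among feature subsets: treat each subset as a
separate procedure and keep the one with the lowest cross-validated error.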