Let us take the height and weight of mice, as shown in the following scatter plot. We would like to
predict a mouse's height given its weight.
First, let us divide the data into two sets: a training set and a testing set.
Suppose instead we use a model in which a squiggly line passes through every training data point
exactly. Here the bias is very low.
On the training set the squiggly line looks excellent, but it fails on the testing set; there, the
straight line does better.
This difference in fit between datasets is called variance. The squiggly line has low bias but high
variance, so it cannot predict accurately on future datasets.
Points to remember:
1. Bias arises from an inconsistent model: a wrong model that does not fit the data properly.
2. Variance arises when the model fails to predict new results accurately: the model may fit the
training data well but not work on future data.
3. We want low bias and low variance. This means: the correct model and correct predictions.
4. If a model shows high variance and low bias, it is called 'overfitting'. If a model shows low
variance and high bias, it is called 'underfitting'.
MINIMIZING VARIANCE
Regularization is a technique for minimizing the variance (overfitting). It is done using 1. Ridge
regression, 2. Lasso regression, and 3. ElasticNet regression.
RIDGE REGRESSION
Simple linear regression is also called 'least-squares regression'. In this model, we take the
deviations of the actual data points from the predicted line. The line fits best when the sum of
the squares of these deviations is minimized.
Note: The reason for squaring is this: when we square the negative deviations, they become
positive and add to the overall deviation. Otherwise, they could cancel out the positive
deviations.
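To make the sum of squared deviations concrete, here is a tiny hand-checked example (the actual and predicted values are invented for illustration):

```python
# actual heights and the heights predicted by some fitted line
actual = [2.0, 4.0, 6.0]
predicted = [2.5, 3.5, 6.5]

deviations = [a - p for a, p in zip(actual, predicted)]
# raw deviations partly cancel: -0.5 + 0.5 + (-0.5) = -0.5
print(sum(deviations))
# squared deviations all add up: 0.25 + 0.25 + 0.25 = 0.75
sse = sum(d ** 2 for d in deviations)
print(sse)
```

The raw deviations nearly cancel even though the line misses every point, which is why least squares minimizes the sum of squares rather than the plain sum.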
In Ridge regression, we add a penalty (lambda times the square of the slope) to this sum of squared
deviations. This trades a small increase in bias for a large reduction in variance.
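The penalty's effect can be seen directly: in scikit-learn the lambda parameter is called alpha, and a larger alpha shrinks the fitted slope. A small sketch on invented data with a true slope of 3:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# invented data: y = 3x plus noise
rng = np.random.default_rng(1)
X = np.linspace(1, 10, 20).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(0, 1, 20)

slope_ols = LinearRegression().fit(X, y).coef_[0]
# larger alpha => stronger penalty => smaller slope
slopes = {a: Ridge(alpha=a).fit(X, y).coef_[0] for a in (0.0, 10.0, 100.0)}
print(slope_ols, slopes)
```

With alpha = 0, Ridge reproduces ordinary least squares; as alpha grows, the slope is pulled toward zero.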
Problem: We have the height and weight of human beings. Use the train.csv and test.csv files to
retrieve the training and testing data. Fit Linear regression and Ridge regression on the training
data and compare them on the testing data.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge

# load the training and testing data
train = pd.read_csv("F:/datascience-notes/ml/6-ridge-regression/train.csv")
print(train)
test = pd.read_csv("F:/datascience-notes/ml/6-ridge-regression/test.csv")
print(test)

# fit both models on the training data
# (assuming the files have 'Weight' and 'Height' columns)
lr = LinearRegression().fit(train[["Weight"]], train["Height"])
ridge = Ridge(alpha=1.0).fit(train[["Weight"]], train["Height"])
Use boston_houses.csv and take only ‘AGE’ and ‘Price’ of the houses. Divide the data as training and
testing data. Fit the line using Ridge regression and find the price of a house if the age is 50 months.
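A possible solution sketch for this exercise, assuming boston_houses.csv has columns named exactly 'AGE' and 'Price' (the path and column names are assumptions; adjust them to your copy of the file):

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def price_at_age(df, age=50, alpha=1.0):
    """Split AGE/Price data into train/test parts, fit Ridge on the
    training part, and predict the price at the given age."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[["AGE"]], df["Price"], test_size=0.3, random_state=0)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    return model.predict(pd.DataFrame({"AGE": [age]}))[0]

# With the real file (path is a placeholder -- adjust to your machine):
# df = pd.read_csv("boston_houses.csv")
# print(price_at_age(df, age=50))
```

Wrapping the steps in a function keeps the split, fit, and prediction in one place, so the same code can be re-run with different alpha values.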