
BIAS AND VARIANCE

Let us take the height and weight of mice, as shown in the following scatter plot. We would like to
predict a mouse's height given its weight.

Let us first divide the data into two sets: the training set and the testing set.

a) In the training set:


We can use simple linear regression and draw a straight line. This straight line does not have the
flexibility to follow the arc (or curve) in the data. The inability of a model to accurately capture the
relationship between the data points is called bias.

If we use another model that fits a squiggly line through all the data points, the squiggly line follows
the training data almost perfectly, so its bias is very low.

b) In the testing set:

In the training set, the squiggly line may look good, but it can fail on the testing set. On the testing
data the squiggly line performed poorly, and the straight line did better.

The difference in fits between datasets is called variance. The squiggly line has low bias but high
variance: it cannot predict accurately on future datasets.

Points to remember:
1. Bias arises from an inconsistent model: the wrong model, one that cannot fit the data properly.
2. Variance arises when the model fails to predict new data accurately: the model may fit the
training data well but not work on future data.
3. We want low bias and low variance. This means the correct model and correct predictions.
4. A model with high variance and low bias is said to be 'overfitting'. A model with low variance and
high bias is said to be 'underfitting'.

MINIMIZING VARIANCE

Regularization is the technique used to minimize variance (i.e., overfitting). This is done using 1. Ridge
regression, 2. Lasso regression, and 3. ElasticNet regression.
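All three are available in scikit-learn; a minimal sketch (the data points and penalty strengths are made up, and `alpha` is scikit-learn's name for the penalty λ) shows that each one shrinks the fitted slope compared with plain least squares:

```python
# Sketch: the three regularized regressions, each shrinking the slope.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # made-up feature
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])            # made-up target

ols = LinearRegression().fit(x, y)      # no penalty
ridge = Ridge(alpha=1.0).fit(x, y)      # L2 penalty: lambda * slope^2
lasso = Lasso(alpha=1.0).fit(x, y)      # L1 penalty: lambda * |slope|
enet = ElasticNet(alpha=1.0).fit(x, y)  # a mix of L1 and L2 penalties

# every penalized slope is pulled toward zero relative to plain least squares
print(ols.coef_[0], ridge.coef_[0], lasso.coef_[0], enet.coef_[0])
```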

RIDGE REGRESSION

Simple linear regression is also called 'least squares regression'. In this model, we take the
deviations (residuals) from the fitted line to the actual data points. The line that minimizes the sum
of the squares of these deviations is the best fit.

Note: The reason for squaring is this: when we square the negative deviations, they become
positive and add to the overall deviation. Otherwise, they could cancel out the positive deviations.
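A tiny numerical check of this point (the numbers are made up): raw residuals can cancel each other out, while squared residuals cannot.

```python
# Sketch: why deviations are squared before summing.
import numpy as np

y_actual = np.array([1.0, 2.0, 3.0])
y_predicted = np.array([1.5, 2.0, 2.5])  # a line passing "through the middle"

residuals = y_actual - y_predicted       # [-0.5, 0.0, 0.5]
print(residuals.sum())         # 0.0 -- positive and negative deviations cancel
print((residuals ** 2).sum())  # 0.5 -- squaring keeps every deviation counted
```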

In Ridge regression, we add some more penalty to the equation so that it will achieve low bias and
low variance.

Simple linear regression minimizes: the sum of squared residuals

Ridge regression minimizes: the sum of squared residuals + λ × (slope)²

Here, λ is called the penalty, and the term 'λ × (slope)²' is called the L2 penalty. When λ is 0,
ridge regression is identical to linear regression. The addition of the penalty parameter λ is what we
call regularization.

λ can take any value from 0 to positive infinity.
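These two facts can be sketched in code (the data points and λ values are made up; scikit-learn calls λ `alpha`): the ridge cost adds λ × slope² to the sum of squared residuals, a large λ shrinks the fitted slope, and a near-zero λ reproduces plain linear regression.

```python
# Sketch: ridge cost = SSR + lambda * slope^2, and the effect of lambda.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # made-up heights
y = np.array([1.2, 1.9, 3.2, 3.8])          # made-up weights

def ridge_cost(slope, intercept, lam):
    """Sum of squared residuals plus the L2 penalty on the slope."""
    residuals = y - (slope * x.ravel() + intercept)
    return (residuals ** 2).sum() + lam * slope ** 2

lr = LinearRegression().fit(x, y)
rr = Ridge(alpha=10.0).fit(x, y)   # a large penalty shrinks the slope

# with lambda = 0 the ridge cost reduces to the plain sum of squared residuals
print(ridge_cost(lr.coef_[0], lr.intercept_, 0.0))
print(lr.coef_[0], rr.coef_[0])
```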

Problem: We have the height and weight of human beings. Use the train.csv and test.csv files to load
the training and testing data. Fit linear regression on the training data and ridge regression on the
testing data.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge

train_df = pd.read_csv("F:/datascience-notes/ml/6-ridge-regression/train.csv")
train_df

test_df = pd.read_csv("F:/datascience-notes/ml/6-ridge-regression/test.csv")
test_df

# take the height and weight columns as the x and y axes

h_train = train_df.iloc[:, 0:1].values  # heights as a 2-D column (features)
w_train = train_df.iloc[:, 1].values    # weights as a 1-D array (target)
h_test = test_df.iloc[:, 0:1].values
w_test = test_df.iloc[:, 1].values
# draw scatter plot for train and test data separately
plt.scatter(h_train, w_train, color='blue')
plt.scatter(h_test, w_test, color='red')
plt.show()

# using linear regression, fit a line for the training data


lr = LinearRegression()
lr.fit(h_train, w_train)

plt.scatter(h_train, w_train, color='blue')


plt.plot(h_train, lr.predict(h_train), color='orange')

# using ridge regression, fit a line for the test data


rr = Ridge(alpha=0.5)  # try alpha values 0.05, 0.5, 1, 2, 3 and compare the fits
rr.fit(h_test, w_test)

plt.scatter(h_test, w_test, color='red')


plt.plot(h_test, rr.predict(h_test), color='green')
plt.show()

# predict the weight of a person with a height of 6.5 ft


rr.predict([[6.5]])  # approx. 70.3 kg

Task on Ridge Regression

Use boston_houses.csv and take only the 'AGE' and 'Price' columns of the houses. Divide the data into
training and testing sets. Fit the line using ridge regression and find the price of a house whose age is
50 months.
