
Concepts of Machine Learning and Linear Regression


Learning Objectives
• Introduction to Machine Learning and its importance
• Machine Learning happens in Mathematical space
• Supervised Machine Learning
• Linear Regression and Pearson’s Coefficient
• Best fit line and Coefficient of Determination
• Pros and Cons of Linear Regression
• Hands-on exercise on Linear Regression

Introduction to Machine Learning

• The ability to perform these tasks comes from an underlying model, which is the result of the learning process.

• The model is generated from huge volumes of data, huge in both breadth and depth, reflecting the real world in which the processes are performed.

What do machine learning algorithms do?

• Search through the data for patterns in the form of trends, cycles, associations, etc.

• Express these patterns as mathematical structures such as probability distributions or polynomial equations.


Importance of Machine Learning

• Used for tasks such as character recognition or natural language processing.

• Problems that are too complex and dynamic, e.g. weather forecasting.

• Patterns hidden in humongous data, e.g. recommendation systems.

• Too many permutations and combinations possible, e.g. genetic code mapping.
Machine Learning happens in Mathematical space

• A data point representing the real world is a collection of attributes that define an entity. Each attribute can be treated as a dimension, so every entity becomes a point in a mathematical space.
Supervised Machine Learning

• A class of machine learning that works on externally supplied instances in the form of predictor attributes and associated target values.

• The model thus generated is used to make predictions about future instances, e.g. building a model to predict the resale value of a car based on its mileage, age, colour, etc.

• Supervised learning problems can be further grouped into regression and classification problems.
Data Science Machine Learning Steps:

• Identify the data required
• Pre-process the data
• Create training & test sets
• Select appropriate algorithms
• Train and build the model
• Evaluate with test data
Correlation
How is relationship measured?
Strength of Linear Association
Correlation Coefficient
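
The slides above name the correlation coefficient without reproducing its formula; for reference, Pearson's correlation coefficient for a sample of pairs (x_i, y_i) is:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

r lies between −1 and +1; values near ±1 indicate a strong linear association, values near 0 a weak one.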
Introduction To Regression
• Regression is a tool for finding the relationship between an explained variable and one or more explanatory variables in a study.
• The relationship can be linear or non-linear.
• The basic function of regression is to identify statistically significant explanatory variables and estimate the model parameters.
Why Do A Regression Analysis?

• How does MRP affect the purchase decision?
• Which promotion is more effective?
• What is the risk to customer retention associated with a price increase?
• Which customer is likely to default?
• What percentage of loans is likely to result in a loss?
• How do we identify the most profitable customers?
Where Is It Used?

• Every functional area of management uses regression:

• Finance: chance of bankruptcy, credit risk, fraud.
• Marketing: sales, market share, customer satisfaction, customer churn, customer retention, customer lifetime value.
• Operations: inventory, productivity, efficiency.
• HR: job satisfaction, attrition.
• Healthcare: new plans, health insurance.
Linear Regression

• The term "regression" generally refers to predicting a real number.

• The term "linear" in linear regression refers to the fact that the method models data with a linear combination of the explanatory variables.

• In the case of linear regression with a single explanatory variable, the linear combination can be expressed as:

response = intercept + coefficient × explanatory variable
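
In the notation used on the following slides (β0, β1 for the population parameters; b0, b1 for their sample estimates), the model and the estimated regression equation are:

y = \beta_0 + \beta_1 x + \varepsilon

\hat{y} = b_0 + b_1 x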


Linear Regression
Armand’s Pizza Parlors is a chain of Italian-food restaurants located in a five-state area. Armand’s most successful locations are near
college campuses. The managers believe that quarterly sales for these restaurants (denoted by y) are related positively to the size of the
student population (denoted by x); that is, restaurants near campuses with a large student population tend to generate more sales than
those located near campuses with a small student population. Using regression analysis, we can develop an equation showing how the
dependent variable y is related to the independent variable x.
Estimation Process in Linear Regression

The estimation of β0 and β1 is a statistical process much like the estimation of the population mean μ. β0 and β1 are the unknown parameters of interest, and b0 and b1 are the sample statistics used to estimate those parameters.
Least Square Method

The least squares method uses the sample data to find the values of b0 and b1 that minimize the sum of squared differences between the observed values of y (denoted yi) and the estimated values of y (denoted ŷi).
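
Written out, the criterion and its well-known closed-form solution are:

\min_{b_0, b_1} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}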
Applying the Least Square Method

Applying the method to the Armand's Pizza data gives:

b0 = 60
b1 = 5

ŷ = 60 + 5x
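
A minimal NumPy sketch that reproduces these estimates; the x values (student population in 1000s) and y values (quarterly sales in $1000s) below are the figures from the textbook version of this example:

import numpy as np

# Student population (1000s) and quarterly sales ($1000s) for ten restaurants
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

# Closed-form least squares estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # 60.0 5.0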
Coefficient of Determination
SST, a.k.a. the Total Sum of Squares

• SST measures how closely the observations cluster about the ȳ line.
• SSE measures how closely the observations cluster around the estimated regression line.

• r-square can take values between 0 and 1, whereas r can take values between −1 and 1.
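
The standard definitions connecting these quantities (SSR is the regression sum of squares):

SST = \sum (y_i - \bar{y})^2, \qquad SSE = \sum (y_i - \hat{y}_i)^2, \qquad SSR = SST - SSE

r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}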
Model Assumptions
• The error term ε is a random variable with mean (expected value) equal to 0.
• The variance of the error term ε, denoted σ², is the same for all values of x.
Implication: the variance of y about the regression line is the same for all values of x.
• The values of ε are independent.
Implication: the value of ε for a particular value of x does not depend on the value of ε for any other value of x.
• The error term ε is a normally distributed random variable.
Implication: as y is a linear function of ε, y is also a normally distributed random variable.
Test Of Significance
Estimates and Prediction

Confidence interval: an interval estimate of the mean value of y for a given value of x.

Prediction interval: an interval estimate of an individual value of y for a given value of x. The margin of error is larger for a prediction interval than for a confidence interval.
Residual Analysis
• E(ε) = 0
• The variance of ε, denoted σ², is the same for all values of x
• The values of ε are independent
• The error terms ε have a normal distribution

These assumptions can be checked with:

1. A plot of the residuals against the values of the independent variable x
2. A plot of the residuals against the predicted values of the dependent variable
3. A standardized residual plot
4. A normal probability plot
Residual plot against x
Residual plot against predicted y (ŷ)
Plot of standardized residuals vs ŷ
Normal probability plot
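
A minimal matplotlib sketch of the first two diagnostic plots, reusing the illustrative Armand's Pizza data from the least squares sketch above; a healthy model shows no pattern in either plot:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26]).reshape(-1, 1)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x.ravel(), residuals)          # residuals against x
ax1.axhline(0, color='grey')
ax1.set_xlabel('x'); ax1.set_ylabel('residual')
ax2.scatter(model.predict(x), residuals)   # residuals against predicted y
ax2.axhline(0, color='grey')
ax2.set_xlabel('y-hat'); ax2.set_ylabel('residual')
plt.show()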
Outliers and Influential Observation
Multiple Linear Regression
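
The slide introduces multiple linear regression without its equation; with p explanatory variables the simple model generalizes to:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon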
Multiple Coefficient of Determination
Assumptions
What about Non-Normality

Skew and Kurtosis

• Skew: much easier to deal with
• Kurtosis: less serious anyway

Transform the data to remove skew:
• positive skew: log transform
• negative skew: square the values
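
A small NumPy/pandas sketch of these two transforms (the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'pos_skew': [1, 2, 2, 3, 50],
                   'neg_skew': [1, 8, 9, 9, 10]})

# Positive skew: a log transform compresses the long right tail
df['pos_skew_log'] = np.log(df['pos_skew'])

# Negative skew: squaring stretches the right side, shrinking the long left tail
df['neg_skew_sq'] = df['neg_skew'] ** 2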
Assumption:
Heteroscedasticity
• The word heteroscedasticity comes from the Greek heteros, meaning "different", and scedasticity, referring to the conditional variance of the residuals.
• If there is no heteroscedasticity then there is homoscedasticity, i.e. uniform variance: the variation in the residuals is independent of X.
• Heteroscedasticity: when the error variance changes in a systematic pattern with changes in the value of X.
Remedies for Heteroscedasticity:

• A critical question: why is it so important to detect heteroscedasticity?
• It biases the standard error estimates, so inferences based on OLS estimation become incorrect (though the coefficient estimates remain unbiased).
• Possible solution: transformation. Looking at the scatterplot of the squared residuals against X, decide on the appropriate transformation of X.
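
A minimal sketch of that diagnostic scatterplot, reusing x and residuals from the residual-analysis sketch above; a systematic pattern (e.g. fanning out) suggests heteroscedasticity:

import matplotlib.pyplot as plt

# Squared residuals against X; look for variance that grows or shrinks with X
plt.scatter(x.ravel(), residuals ** 2)
plt.xlabel('x'); plt.ylabel('squared residual')
plt.show()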
Regression Diagnostics
What is Collinearity?
Effects of collinearity
Detection of multicollinearity: Simple Signs
How to detect Multicollinearity
• None of the coefficients has a significant t-statistic
• Pair-wise correlations between predictors are high
• A coefficient is negative when theory suggests a positive relationship
How to Measure Collinearity
Multicollinearity: Detection & Removal
• Independent variables are significantly correlated with each other
• Check with the Variance Inflation Factor (VIF)
• VIF > 5 => multicollinearity
• Why is it a problem?
– It increases the standard errors and affects the coefficients of the independent variables
• If more than one variable has VIF > 5, one of them must be removed
• Remove them one at a time, choosing the removal that maximizes R²
VIF-Steps
• VIF measures the proportion of the variance of one predictor explained by all the other predictors
• VIFj = 1 / (1 − R²j)
• VIF = 1 indicates no collinearity
• Compute the VIF for all predictors
• Drop the one with the largest VIF above the cut-off
• Re-compute and repeat until all VIFs are below the cut-off (common cut-offs: 2, 5, 10)
• VIF = 2.5 for a variable means the R² of this variable regressed on the other predictors is 0.6, which is quite high
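
A short sketch of these steps with statsmodels; the DataFrame X of predictor columns is an assumption, and variance_inflation_factor expects a design matrix that includes the constant column:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # X: DataFrame of predictors (assumed)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop('const'))  # drop the predictor with the largest VIF above the cut-off, then repeat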
Problem
Variable VIFs for three example datasets:

Dataset 1            Dataset 2             Dataset 3
Variable    VIF      Variable    VIF       Variable    VIF
FAM         37.6     DOPROD      469.7     At          37.4
PEER        30.2     STOCK       1.0       Pt          33.5
SCHOOL      83.2     Consum      469.4     Et          1.1
                                           At-1        26.6
                                           Pt-1        44.1
Average     50.3     Average     313.4     Average     28.5

Pros and Cons of Linear Regression

Advantages
• Simple to implement, and the output coefficients are easy to interpret.

Disadvantages
• Assumes a linear relationship between the dependent and independent variables.
• Outliers can have huge effects on the regression.
• Linear regression assumes independence between attributes.
Hands-on exercise on Linear Regression
The dataset used here is car-mpg.csv (https://www.kaggle.com/uciml/autompg-dataset)

Some important functions

1. Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

2. Drop the car_name column (a free-text identifier, not useful for modelling)
mpg_df = mpg_df.drop('car_name', axis=1)

3. Replace the numbers in the categorical origin column with the actual region names
mpg_df['origin'] = mpg_df['origin'].replace({1: 'America', 2: 'Europe', 3: 'Asia'})
Analyse the distribution of the dependent column
mpg_df.describe().transpose()

Missing values imputation

On inspecting the records, if we find '?' in the columns, replace it with NaN:
mpg_df = mpg_df.replace('?', np.nan)
mpg_df[mpg_df.isnull().any(axis=1)]  # inspect the rows that contain missing values

Replace the missing values with the mean/median/mode, depending on the type of variable. In this dataset the '?' values occur in the horsepower column, so, for example:
mpg_df['horsepower'] = mpg_df['horsepower'].astype('float64')
mpg_df['horsepower'] = mpg_df['horsepower'].fillna(mpg_df['horsepower'].median())
Bivariate Analysis
# mpg_df_attr: a DataFrame holding the numeric attribute columns of mpg_df
sns.pairplot(mpg_df_attr, diag_kind='kde')  # to plot density curves on the diagonal
sns.pairplot(mpg_df_attr)                   # to plot histograms on the diagonal

Building the model after all the pre-processing steps

Split X and y into training and test sets in a 70:30 ratio:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)

Invoke the LinearRegression function and find the best-fit model on the training data:
regression_model = LinearRegression()
regression_model.fit(x_train, y_train)

Check the intercept of the model:
intercept = regression_model.intercept_[0]  # [0] assumes y was passed as a single-column DataFrame
print('The intercept of the model is {}'.format(intercept))
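
The steps list at the start of the deck ends with "Evaluate with test data"; a minimal sketch of that step, assuming x and y were passed as DataFrames as above:

# Coefficient of determination (R²) on the unseen test data
print(regression_model.score(x_test, y_test))

# One fitted coefficient per explanatory variable
for col, coef in zip(x_train.columns, regression_model.coef_[0]):
    print(col, coef)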
Questions?
