
ECONOMETRIC ANALYSIS

ASSIGNMENT
CIA

Submitted To:
Mr. Anand S
(Econometric Analysis)

Submitted By:
Sayon Das
1421328
MBA-F2

Q1. We learned Ordinary Least Squares or OLS in the class. What are the other
estimation techniques available? How are they similar or different from OLS? When
are these estimators used?
Ans). Ordinary least squares (OLS) is a statistical method used to estimate the unknown
parameters of a linear regression model. It finds the best-fitting regression line by
choosing the coefficients that minimize the sum of squared deviations between the
observed values and the values predicted by the line. The other estimation techniques are:

1. Weighted Least Squares (WLS) - a special case of the least squares method
2. Generalized Least Squares (GLS)
3. Best Linear Unbiased Estimator (BLUE)
4. Best Linear Unbiased Prediction (BLUP)
5. Minimum Mean Square Error (MMSE)
6. Root Mean Square (RMS)

Weighted Least Squares (WLS): A special case of least squares in which the error
covariance matrix is diagonal (all off-diagonal entries are zero), so each observation is
weighted individually. WLS assumes the weights (the relative error variances) are known,
and it cannot handle autocorrelation.

Generalized Least Squares (GLS): This method allows for both heteroscedasticity and
autocorrelation in the errors, though when the covariance structure has to be estimated,
the small-sample properties of the estimator are not known.

Best Linear Unbiased Estimator (BLUE): The linear unbiased estimator with the smallest
variance. Its expected value equals the true value of the population parameter, which
makes it suitable for statistical inference about parameters based on sample statistics.
Under the Gauss-Markov assumptions, OLS is BLUE.

Best Linear Unbiased Prediction (BLUP): Used to predict the realized values of random
effects, i.e. when inference is needed about random draws from the population.

Minimum Mean Square Error (MMSE): An estimation method that minimizes the mean
squared error (MSE) of the fitted values of the dependent variable; MSE is a common
measure of estimator quality.

Root Mean Square (RMS): A statistical measure defined as the square root of the mean
of the squares of a sample.

These estimators let us choose a fitting technique that matches the structure of the data,
so that the fitted regression line properly represents the population from which the sample
is drawn.
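As a minimal sketch of how WLS and GLS differ from OLS in practice, the following R code fits all three estimators to the built-in mtcars data. The weight choice (1/hp) and the AR(1) error structure are illustrative assumptions, not known properties of this data:

ols <- lm(mpg ~ hp, data = mtcars)                   # ordinary least squares
wls <- lm(mpg ~ hp, data = mtcars, weights = 1/hp)   # WLS: weight = 1/assumed error variance
library(nlme)
gls_fit <- gls(mpg ~ hp, data = mtcars,
               correlation = corAR1())               # GLS with AR(1) error structure
summary(ols); summary(wls); summary(gls_fit)

When the assumed weights are correct, the WLS estimates have smaller standard errors than OLS; GLS extends the same idea to correlated errors.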

Q2.)

I. What is heteroscedasticity?
II. What problems would you face in your data analysis if you have
heteroscedasticity?
III. How do you detect it?
IV. What are the remedies to it?

Ans).
I)
Heteroscedasticity refers to a situation in which the variability of a variable is unequal
across the range of values of a second variable that predicts it; the residuals do not have
a constant variance.
A scatter plot of such variables forms a cone shape: the variability of the dependent
variable widens (or narrows) as the value of the independent variable increases.
Conditional heteroscedasticity describes variable volatility that cannot be identified for
future periods; the volatility is unpredictable, as with stock and bond prices.
Unconditional heteroscedasticity describes volatility that can be identified for future
periods, e.g. seasonal variability such as the sale of woollen clothes and sweaters.

Example: Income vs. expenditure on meals is a clear example of heteroscedasticity. A poor
person makes a roughly constant expense by purchasing inexpensive food, but as income
increases, the variability of food consumption increases: people with higher incomes
display greater variability of food expenditure.

In the linear regression model we assume that the error term is normally distributed with
mean 0 and constant variance σ², i.e. Var(ui) = σ²; this is called homoscedasticity. When
the error term does not have constant variance, i.e. Var(ui) = σi² (the variance changes
with i), we call it heteroscedasticity. Residuals from a homoscedastic model scatter evenly
around the regression line, whereas heteroscedastic residuals show a systematically
changing spread.
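The cone-shaped scatter is easy to reproduce by simulation. The sketch below uses illustrative values only; it generates data whose error standard deviation grows with x, so Var(ui) = xi²:

set.seed(1)
x <- runif(200, 1, 10)
y <- 2 + 3*x + rnorm(200, sd = x)   # error sd proportional to x: heteroscedastic
plot(x, y, main = "Error spread widens with x")
abline(lm(y ~ x), col = "red")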

II). When the assumptions of the linear regression model hold, OLS gives unbiased
estimates. Heteroscedasticity occurs when the variance of the errors varies across
observations; if the errors are heteroscedastic, the OLS estimator remains unbiased but
the usual standard errors are inconsistent. The problems faced in data analysis under
heteroscedasticity are:
1. The OLS estimator is no longer BLUE (Best Linear Unbiased Estimator): it has a
higher sampling variance than necessary.
2. The conventional standard errors are biased, so the t-tests, F-tests and confidence
intervals based on them are unreliable.

III). Detection of Heteroscedasticity

a. Graphical method: Plot the residuals on the Y axis against the fitted values (or an
explanatory variable) on the X axis; a systematic pattern in the spread of the residuals
indicates heteroscedasticity.

b. Formal methods

White test: The White test is a statistical test of whether the residual variance in a
regression model is constant, and it underlies the White heteroscedasticity-consistent
standard errors. The squared residuals are regressed on the regressors, their squares and
their cross-products, and the fit of this auxiliary regression, R² = ESS/TSS, is examined:
if heteroscedasticity is present, nR² will be significant.
H0: the model is homoscedastic
Ha: the model is heteroscedastic
When the p-value < 0.05, we reject H0 and accept Ha.

Goldfeld-Quandt Test
In statistics, the Goldfeld-Quandt test checks for heteroscedasticity in regression analysis
by ordering the observations, omitting c central observations, and fitting the model
separately to the two remaining groups; comparing the residual sums of squares of the two
groups offers a diagnostic for heteroskedastic errors in univariate or multivariate
regression models.
Test statistic: F = [RSS2 / ((n − c)/2 − k)] / [RSS1 / ((n − c)/2 − k)]
H0: σ1² = σ2²
Ha: σ1² > σ2²
When the p-value < 0.05, we reject H0 and accept Ha.

BPG Test:
The Breusch-Pagan test is used to test for heteroskedasticity in a linear regression model;
it tests whether the estimated variance of the residuals from a regression depends on the
values of the independent variables. First the model
y = Xβ + u
is fitted and the regression residuals are computed. Then an auxiliary regression of the
squared residuals on the independent variables is performed. Finally, the test statistic is
the product of the coefficient of determination of the auxiliary regression and the sample
size:
LM = nR²

Glejser test: The Glejser test for heteroscedasticity, developed by Herbert Glejser,
regresses the absolute values of the residuals on the explanatory variable that is thought
to be related to the heteroscedastic variance. The test was later found not to be
asymptotically valid under asymmetric disturbances.
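A hedged sketch of these formal tests in R, using an illustrative model on the built-in mtcars data (bptest() and gqtest() come from the lmtest package; the White test is run as a Breusch-Pagan test whose auxiliary regression includes squares and cross-products):

library(lmtest)
model <- lm(mpg ~ wt + hp, data = mtcars)
bptest(model)                                               # Breusch-Pagan test
bptest(model, ~ wt*hp + I(wt^2) + I(hp^2), data = mtcars)   # White-style test
gqtest(model, order.by = ~ wt, data = mtcars)               # Goldfeld-Quandt test

In each case, a p-value below 0.05 leads to rejecting H0 of homoscedasticity.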
IV). The remedies for heteroscedasticity
a. Weighted least squares: used when there is heteroscedasticity in the model and the
form of the error variance is known.
b. Generalized least squares: used when there is heteroscedasticity along with
autocorrelation.
c. Use of robust (heteroscedasticity-consistent) standard errors, as sketched below.
d. Data transformation (e.g. taking logarithms of the variables).
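As a sketch of remedy (c), heteroscedasticity-robust (White) standard errors can be obtained in R with the sandwich and lmtest packages; the model below is illustrative:

library(sandwich)
library(lmtest)
model <- lm(mpg ~ wt + hp, data = mtcars)
coeftest(model, vcov = vcovHC(model, type = "HC1"))  # t-tests with robust standard errors

The coefficient estimates are unchanged; only the standard errors (and hence the t-statistics and p-values) are corrected.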
Q3. Take any data of your choice within the built-in R dataset. Illustrate all that you
have explained in question 2 with this data. Explain the R-results of this exercise.
Submit R-codes script along with results.
Ans:
R-Script:
p <- lm(mpg ~ cyl + disp, data = mtcars)   # regress mpg on cylinders and displacement
library(lmtest)
bptest(p)    # Breusch-Pagan test for heteroscedasticity
plot(p)      # diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage)
Console:
> p<- lm(mpg ~ cyl + disp, data = mtcars)
> library(lmtest)
> bptest(p)
studentized Breusch-Pagan test
data: p
BP = 5.3769, df = 2, p-value = 0.06799
> plot(p)

Hit <Return> to see next plot:


Interpretation:
H0: the data are homoscedastic
Ha: the data are not homoscedastic (heteroscedasticity exists)
Since the p-value (0.06799) is greater than 0.05, we fail to reject the null hypothesis:
at the 5% level there is no significant evidence of heteroscedasticity in this model.

Q4. Test weak-form efficiency of the efficient markets hypothesis for any
Indian stock of your choice for the period 01-01-2014 to 30-10-2015,
using your knowledge of time series analysis. (25 marks)

R-Code:
mydata <- read.csv("C:/Users/Gaurav/Downloads/500087.csv")          # daily price data (BSE scrip 500087)
mydata$ts <- ts(mydata$Close.Price, freq = 12, start = c(2014, 1))  # closing prices as a time series
plot.ts(mydata$ts)
mydata$t <- ts(1:451, freq = 12, start = c(2014, 1))                # time index for the 451 observations
trend <- lm(mydata$ts ~ mydata$t)                                   # fit a linear trend
mydata$T_hat <- predict(trend)                                      # fitted trend values
mydata$T_res <- residuals(trend)                                    # detrended series
plot(mydata$ts ~ mydata$t, col = "blue", type = "l")
abline(trend, col = "red")
plot(mydata$T_res ~ mydata$t, col = "blue", type = "l")
mydatats <- decompose(mydata$ts)                                    # split into trend/seasonal/random
mydatats$seasonal
mydatats$trend
mydatats$random
mydata$trend <- mydatats$trend
mydata$seas <- mydatats$seasonal
mydata$rand <- mydatats$random
plot(mydatats, col = "red")

library(quantmod)
getSymbols("Cipla.NS", src = "yahoo")    # fetch Cipla from Yahoo (stored as CIPLA.NS)
getSymbols("^NSEI", src = "yahoo")       # fetch the Nifty index
plot(CIPLA.NS)
plot(NSEI)
barChart(NSEI)

library(forecast)
price_arima <- arima(mydata$Close.Price, order = c(0, 1, 1))   # ARIMA(0,1,1): random walk plus MA(1)
summary(price_arima)
price_hat <- forecast.Arima(price_arima, h = 5)                # 5-step-ahead forecast
price_hat
plot.forecast(price_hat)
R-Console:
> `mydata` <- read.csv("C:/Users/Gaurav/Downloads/500087.csv")
> `mydata`$ts<-ts(`mydata`$Close.Price, freq=12, start=c(2014,1))
> plot.ts(mydata$ts)
>
> mydata$t<-ts(1:451, freq=12, start=c(2014,1))
> trend<-lm(mydata$ts~mydata$t)
> mydata$T_hat<-predict(trend)
> mydata$T_res<-residuals(trend)
> plot(mydata$ts~mydata$t, col="blue", type = "l")
> abline(trend, col="red")
>
> plot(mydata$T_res~mydata$t, col="blue", type = "l")
> mydatats<-decompose(mydata$ts)
> mydatats$seasonal
> mydatats$trend
> mydatats$random
> mydata$trend<-mydatats$trend
> mydata$seas<-mydatats$seasonal
> mydata$rand<-mydatats$random
> plot(mydatats, col="red")
>
>
> library(quantmod)
> getSymbols("Cipla.NS", src="yahoo")
[1] "CIPLA.NS"
Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, :
downloaded length 123934 != reported length 200

> getSymbols("^NSEI", src="yahoo")


[1] "NSEI"
Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, :
downloaded length 146495 != reported length 200
> plot(CIPLA.NS)
Warning message:
In plot.xts(CIPLA.NS) : only the univariate series will be plotted
> plot(NSEI)
Warning message:
In plot.xts(NSEI) : only the univariate series will be plotted
> barChart(NSEI)
>
> library(forecast)
> price_arima<-arima(mydata$Close.Price, order=c(0,1,1))
> summary(price_arima)
Call:
arima(x = mydata$Close.Price, order = c(0, 1, 1))
Coefficients:
         ma1
      0.0118
s.e.  0.0483
sigma^2 estimated as 107.7: log likelihood = -1691.28, aic = 3386.56
Training set error measures:
                     ME    RMSE      MAE        MPE     MAPE      MASE         ACF1
Training set -0.6300071 10.3642 7.305634 -0.1330227 1.260025 0.9984617 -0.004093059
>
> price_hat<-forecast.Arima(price_arima, h=5)
> price_hat
    Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
452       401.6542 388.3572 414.9511 381.3183 421.9901
453       401.6542 382.7378 420.5706 372.7241 430.5843
454       401.6542 378.4410 424.8674 366.1527 437.1557
455       401.6542 374.8238 428.4846 360.6206 442.6878
456       401.6542 371.6393 431.6691 35
> plot.forecast(price_hat)
Interpretation: The fitted model is ARIMA(0,1,1) on the price level, i.e. a random walk
with an MA(1) correction. The ma1 coefficient (0.0118) is much smaller than its standard
error (0.0483), so it is statistically insignificant: past price changes do not help predict
future ones, and the five-step-ahead forecasts stay flat at 401.6542. This is consistent
with the weak form of the efficient markets hypothesis for this stock.

Graphs: [price series with fitted trend, detrended residuals, decomposition plots, NSEI
bar chart, and the ARIMA forecast plot]
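As an additional hedged check of weak-form efficiency (assuming mydata$Close.Price holds the closing prices used above): under weak-form efficiency, returns should be serially uncorrelated, which the Ljung-Box test examines directly:

returns <- diff(log(mydata$Close.Price))          # daily log returns
Box.test(returns, lag = 10, type = "Ljung-Box")   # H0: no autocorrelation up to lag 10

A large p-value would mean we cannot reject the random-walk hypothesis, supporting weak-form efficiency.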

Q5. Use the data insurance.xls attached with this mail to build a linear
regression model with the following points in mind:
1. Dependent Variable - Charges
2. Create the following variables: sex = 1 if male and 0 if female
(female as base category)
Convert 'region' to dummies with northeast as the base category
smoke=1 if yes and 0 if no
Create a new variable bmi30 such that bmi30 = 1 if bmi >=30, 0
otherwise
Create an interaction variable (bmi30 x smoke)
Create age2=age x age
3. Independent Variables all original and user created variables other
than charges
4. Explain the H0 and H1 involved in the F-tests and t-tests and the
corresponding results (p-values)
5. Bonus points if you can build some exciting graphs and charts using
any tool of your convenience. Be sure to share the insights from the
graphs.

Ans.)
> data <- read.csv("C:/Users/Sayon/Desktop/christ/Christ 5th trimester/Econometrics/insurance.csv")
> attach(data)
>
> sex1=NULL; smoker1=NULL; bmi30=NULL; region1=NULL; bmi30.smoker1=NULL; age2=NULL
> # build the dummy and derived variables observation by observation
> for(i in 1:1338)
+ {
+   sex1[i] = ifelse(sex[i]=="male", 1, 0)                  # male = 1, female = 0 (base)
+   smoker1[i] = ifelse(smoker[i]=="yes", 1, 0)             # smoker = 1
+   bmi30[i] = ifelse(bmi[i]>=30, 1, 0)                     # bmi30 = 1 if bmi >= 30
+   region1[i] = ifelse(region[i]=="northeast", "1",
+                ifelse(region[i]=="northwest", "2",
+                ifelse(region[i]=="southeast", "3", "4"))) # northeast = base category
+   bmi30.smoker1[i] = bmi30[i] * smoker1[i]                # interaction term
+   age2[i] = age[i]^2
+ }
> sex1
> smoker1
> bmi30
> region1
> bmi30.smoker1
> age2
>
> summary(aov(charges ~ null + children + sex1 + smoker1 + bmi30 + region1 + bmi30.smoker1 + age2))
> Dataset = data.frame(Charges=charges, Null=null, Age=age, Age2=age2, Sex=sex, Sex1=sex1,
+                      Bmi=bmi, Bmi30=bmi30, Children=children, Smoker=smoker, Smoker1=smoker1,
+                      Region=region, Region1=region1, Bmi30.Smoker=bmi30.smoker1)
> Dataset
> write.csv(Dataset, "C:/Users/Sayon/Desktop/christ/Christ 5th trimester/Econometrics/Dataset.csv", row.names=F)
> summary(lm(charges ~ null + bmi))
> summary(lm(charges ~ null))
> summary(lm(charges ~ bmi))
> l1 = lm(charges ~ null); l2 = lm(charges ~ bmi)
> e1 = l1$residuals; e2 = l2$residuals
>
> par(mfrow=c(2,2))
> plot(null, charges)
> plot(bmi, charges)
> plot(charges, e1)
> plot(charges, e2)
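As a hedged aside, the same variables can be built without a loop; the sketch below assumes the standard column names of the insurance dataset (age, sex, bmi, children, smoker, region, charges):

data$sex1    <- ifelse(data$sex == "male", 1, 0)
data$smoker1 <- ifelse(data$smoker == "yes", 1, 0)
data$bmi30   <- ifelse(data$bmi >= 30, 1, 0)
data$region  <- relevel(factor(data$region), ref = "northeast")  # northeast = base category
data$bmi30.smoker1 <- data$bmi30 * data$smoker1
data$age2    <- data$age^2
fit <- lm(charges ~ age + age2 + sex1 + bmi + bmi30 + children + smoker1 +
          region + bmi30.smoker1, data = data)
summary(fit)   # t-tests per coefficient; overall F-test at the bottom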

Hypothesis:
1. H0 : Number of children has no impact on charges
Ha : Number of children has significant impact on charges
2. H0 : Sex has no impact on charges
Ha : Sex has significant impact on charges
3. H0 : Smoker has no impact on charges
Ha : Smoker has significant impact on charges
4. H0 : BMI has no impact on charges
Ha : BMI has significant impact on charges
5. H0 : Region has no impact on charges
Ha : Region has significant impact on charges
6. H0 : Interacting variable bmi X smoker has no impact on charges
Ha : Interacting variable bmi X smoker has significant impact on charges
7. H0 : Age has no impact on charges
Ha : Age has significant impact on charges

Anova table:

                Df    Sum Sq   Mean Sq   F value  Pr(>F)
null             1 1.955e+11 1.955e+11 23598.599  <2e-16 ***
children         1 2.990e+06 2.990e+06     0.361  0.5481
sex1             1 3.367e+06 3.367e+06     0.406  0.5239
smoker1          1 1.126e+06 1.126e+06     0.136  0.7124
bmi30            1 2.558e+05 2.558e+05     0.031  0.8605
region1          3 1.987e+07 6.622e+06     0.799  0.4942
bmi30.smoker1    1 1.115e+07 1.115e+07     1.346  0.2461
age2             1 3.203e+07 3.203e+07     3.867  0.0495 *
Residuals     1327 1.099e+10 8.285e+06
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hypothesis testing:
1) The independent variable null has a significant p-value of less than 0.001
(<2e-16 ***). Hence we reject H0 and accept Ha: null has a significant impact on
the dependent variable charges.

2) The independent variable children has an insignificant p-value greater than 0.05
(0.5481). Hence we fail to reject H0: children has no significant impact on the
dependent variable charges.

3) The independent variable sex has an insignificant p-value greater than 0.05
(0.5239). Hence we fail to reject H0: sex has no significant impact on the
dependent variable charges.

4) The independent variable smoker has an insignificant p-value greater than 0.05
(0.7124). Hence we fail to reject H0: smoker has no significant impact on the
dependent variable charges.

5) The independent variable bmi30 has an insignificant p-value greater than 0.05
(0.8605). Hence we fail to reject H0: bmi30 has no significant impact on the
dependent variable charges.

6) The independent variable region has an insignificant p-value greater than 0.05
(0.4942). Hence we fail to reject H0: region has no significant impact on the
dependent variable charges.

7) The interaction variable bmi30 x smoker has an insignificant p-value greater than
0.05 (0.2461). Hence we fail to reject H0: the interaction has no significant impact
on the dependent variable charges.

8) The variable age2 has a significant p-value of less than 0.05 (0.0495 *). Hence we
reject H0 and accept Ha: age (squared) has a significant impact on the dependent
variable charges.

Note that these are sequential ANOVA F-tests. In the regression summary, each
coefficient's t-test checks H0: the coefficient is zero against Ha: it is non-zero, and the
overall F-test checks H0: all slope coefficients are jointly zero.

Plots:

[Four scatter panels from par(mfrow=c(2,2)): charges vs null, charges vs bmi,
residuals e1 vs charges, and residuals e2 vs charges]

From plot no. 1 we can infer that charges increases with null: the two variables are
positively correlated, so the dependent variable charges could be predicted with the
help of null.
From plot no. 2 we can infer that there is no clear correlation between charges and bmi.

