
My Econometrics

WHY INTERCEPT?
My Viewpoint on
Solutions to Obstacles in Data Analysis
BY
ADNAN ALI CH

THE ROLE OF THE COMPUTER

EViews, SAS, SPSS, Stata, Microfit, PcGive and BMD are among the software packages that can run the econometric techniques and tests discussed in this book.


DEDICATION

I dedicate this document to my teachers, as teachers are the only people who groom and educate us to be better than ever before.


CHAPTER NO.1

INTRODUCTION

1.1 Introduction
Y = β1 + β2X + ε

(This simple equation is the bread and butter of it all.)

Once it was said that despite every effort everyone has to die, but making a difference in something is larger than life; in research, all is well that ends well. The basic problem for research students is that they are not familiar with the laws, equations and basic calculations of econometrics that work behind the scenes in the software, the core reason being that they are not statisticians. There is a story behind every story, and we always ignore something that might affect our research or create a problem, something that might drive the apparent significance of the independent variables on the dependent variable or result in a low R-squared. Sometimes a high R-squared and large standard errors make you feel uneasy. We always look for a shortcut to manage our results in our favour and to make sure that our research gets approved. People like us, when we hit a wall, see three options: go around it, climb over it or tunnel under it, and if that doesn't work we blow the thing up, but we never stop to think what the problem actually is. There are three basic ways to manage data in our favour: directly changing figures (not allowed); filtering the data manually by trimming outliers in Excel (a manual and lengthy process); and excluding all jumping, non-stationary values. But what if the data are not longitudinal? What if it is a daily or monthly time series, where there are only 12 months in a year and we cannot exclude November or add another April? What if there is multicollinearity? What if there is autocorrelation in the variables? What is the reason for a low or very high R-squared? Which of the explanatory variables (regressors) is creating the problem? What if, after excluding or including observations, your position still stands where it was before? The question arises: how would I know? What will you do?

Nothing; I will exclude one variable and include another one.

What if your main variables, or the majority of your variables, are insignificant?

Then I will call my supervisor; he will rescue me.

What if the supervisor says he can't do anything, since this is your research and you have to face the music, so just turn around, go back home and correct your data?

In panel data there is a margin within which we can include or exclude a number of observations, but beyond that we cannot. Secondly, when a researcher deletes a few observations in panel data, some variables turn out to be significant (by their p-values) that were not significant previously, while predictors that were significant now give the impression of being insignificant. This is the case where we weaken the homogeneity between the dependent and independent variables by excluding a few observations or adding new ones. In simple words, we weaken the previous relationships among variables so that new ones appear significant.

(We kill one to save one, or we kill one to generate one.)

This document is written in a very simple way, without much technical language, to make sure everyone can understand it. Secondly, this is not about what is what and who is who; it is also not about what econometrics is and what research is, which you all might know better than me. The basic purpose is to explain a few problems in data, because there is a war running between the various explanatory and dependent variables behind our one click in the software. It is not easy to understand all the calculations of the various tests conducted in research analysis, but with a few options available in software such as EViews, SAS, Stata and SPSS we can identify and rectify the errors in the right way. If you want to make sure you get through, your estimator should be BLUE: the best linear unbiased estimator. The most difficult part is to identify the problems and make sure they are taken care of in a positive manner. Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) significance, mean value, coefficients, standard errors and R-squared value.

1.2 Trend lines

1. Linear (least squares)

2. Parabola

3. Cubic

4. Quadratic

5. Exponential

These mathematical trend lines will help you understand the actual-residual-fitted graph and the case of heteroskedasticity in EViews, etc.


Linear equation

Subject   Age (x)   Glucose level (y)      xy       x²       y²

1         43        99                   4257     1849     9801
2         21        65                   1365      441     4225
3         25        79                   1975      625     6241
4         42        75                   3150     1764     5625
5         57        87                   4959     3249     7569
6         59        81                   4779     3481     6561
Σ        247       486                  20485    11409    40022

From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in our case).

Step 1: Compute the column sums shown in the table above.

Step 2: Use the following formulas to find a (the intercept) and b (the slope).

Find a:

a = (Σy·Σx² − Σx·Σxy) / (n·Σx² − (Σx)²)
  = ((486 × 11,409) − (247 × 20,485)) / ((6 × 11,409) − 247²)
  = 484,979 / 7,445
  = 65.1416

Find b:

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
  = ((6 × 20,485) − (247 × 486)) / ((6 × 11,409) − 247²)
  = (122,910 − 120,042) / (68,454 − 61,009)
  = 2,868 / 7,445
  = 0.385225

Step 3: Insert the values into the equation.

y′ = a + bx
y′ = 65.14 + 0.385225x
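As a quick check of the arithmetic above, here is a minimal sketch in Python (numpy assumed to be available) that reproduces a and b from the age/glucose table; the variable names are purely illustrative.

    import numpy as np

    age = np.array([43, 21, 25, 42, 57, 59], dtype=float)        # x
    glucose = np.array([99, 65, 79, 75, 87, 81], dtype=float)    # y
    n = len(age)

    # b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
    b = (n * np.sum(age * glucose) - age.sum() * glucose.sum()) / (n * np.sum(age ** 2) - age.sum() ** 2)
    # a = mean(y) - b * mean(x), algebraically the same as the formula for a above
    a = glucose.mean() - b * age.mean()

    print(a, b)                          # about 65.14 and 0.3852
    print(np.polyfit(age, glucose, 1))   # numpy's own fit: [slope, intercept], same values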
Ordinary least squares (OLS) is a technique for estimating the unknown parameters in a linear regression model. It estimates the vector β from the observations y, which are formed when β passes through the design matrix X and noise ε is added: y = Xβ + ε. OLS gives the maximum likelihood estimate of β when the noise ε is white, that is, uncorrelated with equal variance (homoscedastic), and follows a Gaussian distribution.

Generalized least squares (GLS) extends this approach to give the maximum likelihood estimate when the noise is coloured (heteroscedastic or correlated). If the noise has covariance matrix E[εε′] = Σ, then the generalized least squares estimate is

β̂ = (X′Σ⁻¹X)⁻¹ X′Σ⁻¹ y
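The two formulas translate almost literally into code. The sketch below, assuming numpy and simulated data whose error spread is not constant, computes the OLS estimate (X′X)⁻¹X′y and the GLS estimate (X′Σ⁻¹X)⁻¹X′Σ⁻¹y side by side; all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one regressor
    beta = np.array([1.0, 2.0])
    sigma = np.linspace(0.5, 3.0, n)                         # error spread grows across the sample
    y = X @ beta + rng.normal(scale=sigma)

    # OLS: (X'X)^-1 X'y
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # GLS: (X' Sigma^-1 X)^-1 X' Sigma^-1 y, with a diagonal Sigma here
    Sigma_inv = np.diag(1.0 / sigma ** 2)
    b_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

    print(b_ols, b_gls)                  # both near (1, 2); GLS exploits the error structure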

THE RESEARCH PROCESS

Several authors have attempted to enumerate the steps involved in the research process, although no single list is conclusive. Nevertheless, the research process broadly consists of the following steps and predominantly follows a sequential order, as depicted in the figure:

1. Problem formulation

2. Development of an approach to the problem

3. Research Design

4. Selection of Data collection techniques

5. Sampling techniques

6. Fieldwork or Data Collection

7. Analysis and interpretation


8. Report preparation and presentation

The above-mentioned steps may be placed in three groups as follows:

First, there is the initiation or planning of a study, which comprises the initial five steps in our model:

(1) Problem formulation,

(2) Development of an approach to the problem,

(3) Research design,

(4) Selection of data collection techniques, and

(5) Sampling techniques.

Second, there is

(6) Fieldwork or data collection

Third, there is (7) analysis and interpretation of the data and

(8) Report preparation and presentation.

1.5 PROBLEM IDENTIFICATION

The starting point of any research is to formulate the problem and state the objectives before specifying any variables or measures. This involves defining the problem in clear terms. Problem definition involves stating the general problem and identifying the specific components of the research problem. The components of the research problem include (1) the decision maker and the objectives, (2) the environment of the problem, (3) the alternative courses of action, (4) a set of consequences that relate to the courses of action and to the occurrence of events not under the control of the decision maker, and (5) a state of doubt as to which course of action is best. Here, only the first two components of the research problem are discussed; the others are not within the scope of this chapter, though they are not beyond it.

Problem formulation is perceived as the most important of all the steps, because a clearly and accurately identified problem leads to the effective conduct of the other steps of the research process. Moreover, it is the most challenging task, as the result yields information that directly addresses the management issue, although in the end it is for management to understand the information fully and take action based on it. From this we understand that the correctness of the result depends on how well the research takes off at the starting point.

Problem formulation refers to translating the management problem into a research problem. It involves stating the general problem and identifying the specific components of the research problem. This step and the findings that emerge from it help define the management decision problem and the research problem.

A research problem cannot exist in isolation, as it is an outcome of a management decision problem. The management decision problem may be, for example, to know whether keeping Saturday a working day would increase productivity. The associated research problem for the above example may be the impact of keeping Saturday a working day on employee morale. The task of the researcher is to investigate employee morale. Hence, the researcher is, perhaps, a scientific means of solving the management problem the decision maker faces.

1.6 ROLE OF INFORMATION IN PROBLEM FORMULATION

Problem formulation starts with a sound information-seeking process by the researcher. The decision maker is the provider of information pertaining to the problem at the beginning of the research process (problem formulation) as well as the user of the information that emerges at the end of the research process. Given the importance of accurate problem formulation, the researcher should take enough care to ensure that the information-seeking process stays well within the ethical boundaries of true research. The researcher may use different types of information at the problem formulation stage. They are:

1. Subjective information is based on the decision maker's past experiences, expertise, assumptions, feelings or judgments, without any systematic gathering of facts. Such information is usually readily available.

2. Secondary information is information collected and interpreted at least once for some specific situation other than the current one. Availability of this type of information is normally high.

3. Primary information refers to first-hand information derived through a formalised research process for a specific, current problem situation.

In order to have a better understanding of problem formulation, the researcher may categorise the information collected into four types, based on the quality and complexity of the information. They are:

1. Facts are pieces of information of very high quality, with a high degree of accuracy and reliability. They are absolutely observable and verifiable. They are not complicated and are easy to understand and use.

2. Estimates are information whose degree of quality is based on the representativeness of the

fact sources and the statistical procedures used to create them. They are more complex than

facts due to the statistical procedures involved in deriving them and the likelihood of errors.
3. Predictions are lower quality information due to perceived risk and uncertainty of future

conditions. They have greater complexity and are difficult to understand and use for decision-

making as they are forecasted estimates or projections into the future.

4. Relationships are information whose quality depends on the precision of the researcher's statements of the interrelationships between sets of variables. They have the highest degree of complexity, as they involve any number of relationship paths with several variables being analysed simultaneously.

1.7 APPROACHES TO THE PROBLEM

The outputs of the approach development process should include the following components:

(i) Objective/theoretical framework

(ii) Analytical model

(iii) Research questions

(iv) Hypothesis.

Each of these components is discussed below:

(i) Objective/theoretical framework: Every research project should have a theoretical framework and objective evidence. The theoretical framework is a conceptual scheme containing:

a set of concepts and definitions

a set of statements that describe the situations to which the theory can be applied

a set of relational statements divided into axioms and theorems


The theoretical evidence is imperative in research, as it leads to the identification of the variables that should be investigated. It also leads to formulating the operational definition of the marketing problem. An operational definition is a set of procedures that describe the activities one should perform in order to establish empirically the existence, or the degree of existence, of a concept.

Operationalizing the concept gives more understanding of the meanings of the concepts specified and makes explicit the testing procedures that provide criteria for the empirical application of the concepts. An operational definition specifies a procedure; it might involve, for example, a weighing machine that measures the weight of a person or an object.

3.1 (ii) Analytical model: An analytical model could be referred to as a likeness of something. It consists of symbols referring to a set of variables and their interrelationships, represented in logical arrangements designed to represent, in whole or in part, some real system or process. It is a representation of reality that makes explicit the significant relationships among its aspects. It enables the formulation of empirically testable propositions regarding the nature of these relationships. An empirical model refers to research that uses data derived from actual observation or experimentation.


3.2 Conceptual framework / Theoretical framework

Write one paragraph describing the variables and their relationships. Then show a diagram of the independent and dependent variables and their relationship, like the one below:

INDEPENDENT VARIABLES                            DEPENDENT VARIABLES

OWNERSHIP STRUCTURE AND CORPORATE GOVERNANCE     FIRM'S FINANCIAL PERFORMANCE
Ownership Concentration (OS)                     Return on Assets (ROA)
Managerial Ownership (MO)                        Net Profitability (NP)
Associated Companies Ownership (ACO)             Return on Equity (ROE)
Board Size (SZ)
Board Composition (BS)
Board Independence (BI)

3.3 Identification of variables and their expected behavior

3.3.1 Dependent variables

3.3.2 Independent variables

3.4 Hypothesis of the study

3.5 Model used for estimation

For example

General Model

The following model is formulated and used to investigate the impact of board structure and corporate governance on a firm's financial performance:

P_it = β0 + Σ β_i X_it + ε_it

where

P_it  is the financial performance of firm i at time t,
β0    is the intercept (constant) of the linear equation,
β_i   is the coefficient of change in the X_it variables,
X_it  are the independent variables used for board structure and corporate governance for firm i at time t,
i     indexes the firms used in the study (here, 200 firms), and
t     indexes the time period used in the study (here, 6 years).
Specific Model

ROA_it = β0 + β1 OS_it + β2 MO_it + β3 ACO_it + β4 SZ_it + β5 BS_it + ε_it
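As a hedged illustration of how such a model might be estimated, the sketch below runs a pooled OLS of ROA on the governance variables with statsmodels, using simulated stand-in data whose column names (OS, MO, ACO, SZ, BS) come from the specific model. A real panel study would usually go further (fixed or random effects), so treat this only as a starting point.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 1200                              # stand-in for 200 firms x 6 years, pooled together
    df = pd.DataFrame(rng.normal(size=(n, 5)), columns=["OS", "MO", "ACO", "SZ", "BS"])
    df["ROA"] = 0.02 + 0.10 * df["OS"] - 0.05 * df["SZ"] + rng.normal(scale=0.5, size=n)

    X = sm.add_constant(df[["OS", "MO", "ACO", "SZ", "BS"]])   # the constant plays the role of beta_0
    pooled = sm.OLS(df["ROA"], X).fit()
    print(pooled.summary())               # coefficients, standard errors, t-statistics, R-squared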

(iii) Research questions: Research questions are refined statements of the specific components of the problem. A research question is a statement that identifies the phenomenon to be studied. Research questions should be raised in an unambiguous manner; this helps the researcher become resourceful in identifying the components of the problem. The formulation of the questions should be strongly guided by the problem definition, the theoretical framework and the analytical model. The knowledge gained by the researcher from his or her interaction with the decision maker should be borne in mind, as it sometimes forms the basis of the research questions.

The researcher should exercise extreme caution while formulating research questions, as they are the forerunners for developing hypotheses. Any flaw in the research questions may lead to flawed hypotheses. The following questions may be asked while developing research questions:

a) Do I know the area of investigation and its literature?

b) What are the research questions pertinent to the area of investigation?

c) What are the areas that are not explored by the previous researchers?

d) Would my study lead to greater understanding on the area of study?

e) Is enough literature available in this topic area?

f) Is my study a new one, thus contributing to society, or has it been done before?
(iv) Hypothesis: Hypotheses could be termed tentative answers to a research problem. The structure of a hypothesis involves conjectural statements relating two or more variables. Hypotheses are deduced from theories, directly from observation, intuitively, or from a combination of these. A hypothesis deduced by any of these means has four common characteristics: it should be clear, value-free, specific and amenable to empirical testing. Hypotheses can be viewed as statements that indicate the direction of a relationship or the recognition of differences between groups. However, the researcher may not be able to frame hypotheses in all situations, either because a particular investigation does not warrant a hypothesis or because sufficient information is not available to develop one.

A few elements which play a significant role in generating solutions to obstacles in data analysis

1.3 Data analysis

Data penetration (stationary and non-stationary) ranges from simple cases to cases with much fluctuation in values. The problem of insignificance can only be detected and managed when you have specific knowledge about the type of work file you are working on: for example, dated structured data, unstructured data, an unbalanced panel, a questionnaire response scale sheet (SPSS), or a balanced panel.

a. Unit root test

b. Normality (Jarque–Bera test)

c. Standard error and standard deviation

d. Autocorrelation

e. Variance inflation factor, tolerance level and multicollinearity

f. Homoscedasticity

g. Heteroscedasticity

h. High and low R-squared

i. Creating significance of independent variables on the dependent variable

j. Increasing a low R-squared

k. Fitness of the regression line (actual-residual-fitted graph)

l. Actual-fitted table

m. Granger causality

n. Cointegration test

These are the basic factors in research which generate the positive findings you expect; otherwise they lead you to a place where you and your supervisor will be doubtful about the data collection process. Sometimes the data are quite accurate, as secondary data are easy to compare with the original source; this pushes you towards defeat, or towards an act of manipulating the data manually. But where there is a will there is a way, and the right way is to look at the other assumptions that might be creating the problem: autocorrelation, heteroscedasticity, collinearity and so on.

The Structure of Economic Data


1.4 Types of Data

There are mainly three types of data you may be working with for empirical analysis: time series, cross-section, and pooled data (a combination of time series and cross-section).

A time series is a set of observations on the values that a variable takes at different times. Such data may be collected at regular time intervals: daily (e.g., stock prices, weather reports), weekly (e.g., money supply figures), monthly (e.g., the unemployment rate, the Consumer Price Index (CPI)), quarterly (e.g., GDP) or annually.

Cross-section. An important feature of cross-sectional data is that we can often assume they have been obtained by random sampling from the underlying population. For example, if we obtain information on wages, education, experience, and other characteristics by randomly drawing 500 people from the working population, then we have a random sample from the population of all working people. Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies the analysis of cross-sectional data. Cross-section data are data on one or more variables collected at the same point in time, such as the census of population conducted by the Census Bureau every ten years (the latest being in 2010), surveys of consumer expenditures, and of course the opinion polls by Gallup and umpteen other organizations. A concrete example is cross-sectional data on egg production and egg prices for the 50 states in the union for 2001 and 2014.

Pooled data. Pooled, or combined, data contain elements of both time series and cross-section data. Panel, longitudinal, or micropanel data are a special type of pooled data in which the same cross-sectional unit (say, a family or a firm) is surveyed over time. For example, the Austria Department of Commerce carries out a census of housing at periodic intervals.

7.2 Data types

7.2.1 Nominal Data

These are data which classify or categorise some attribute. They may be coded as numbers, but the numbers have no real meaning; they are just labels and have no default or natural order. Examples: town of residence, colour of car, male or female (this last one is an example of a dichotomous variable: it can take two mutually exclusive values).
7.2.2 Ordinal Data
These are data that can be put in an order, but don't have a numerical meaning beyond the order. So, for instance, the difference between 2 and 4 in the example of a Likert scale below might not be the same as the difference between 2 and 5. Examples: questionnaire responses coded 1 = strongly disagree, 2 = disagree, 3 = indifferent, 4 = agree, 5 = strongly agree; level of pain felt in a joint rated on a scale from 0 (comfortable) to 10 (extremely painful).
7.2.3 Interval Data
These are numerical data where the distances between numbers have meaning, but the zero has no real meaning. With interval data it is not meaningful to say that one measurement is twice another, and such a statement would not remain true if the units were changed. Example: temperature measured in Centigrade; a cup of coffee at 80°C isn't twice as hot as one at 40°C.
7.2.4 Ratio Data
These are numerical data where the distances between data points and the zero point have real meaning. With such data it is meaningful to say that one value is twice as much as another, and this would still be true if the units were changed. Examples: heights, weights, salaries, ages. If someone is twice as heavy as someone else in pounds, this will still be true in kilograms.
NOTE: The data utilized are ordinal (number of respondents: 207).

7.2.5 Parametric or Nonparametric data

Before choosing a statistical test to apply to your data, you should address the issue of whether your data are parametric or not. This is quite a subtle and convoluted decision, but the guidelines here should help start you thinking. Remember, the important rule is not to make unsupported assumptions about the data; don't just assume the data are parametric. You can use academic precedence to share the blame ("Bloggs et al. 2001 used a t-test, so I will"), or you might test the data for normality (we'll try this later), or you might decide that, given a small sample, it is sensible to opt for nonparametric methods to avoid making assumptions.
1. Ranks, scores, or categories are generally non-parametric data.
2. Measurements that come from a population that is normally distributed can usually be treated as parametric. If in doubt, treat your data as non-parametric, especially if you have a relatively small sample. Generally speaking, parametric data are assumed to be normally distributed; the normal distribution (approximated mathematically by the Gaussian distribution) is a data distribution with more data values near the mean, and gradually fewer far away, symmetrically.
Data Analysis with Classical Assumptions

1.4.1 Normality

Humiang (2012) explained that a normality test aims to test whether the dependent variable together with the three independent variables in the regression model has a normal distribution or not. A histogram or P–P plot of the residuals can help the researcher check the normality assumption of the error term. The requirement is that the P–P plotted residuals should follow the 45-degree line; the points should be closely clustered about the 45-degree line, without the standardized residuals becoming excessively large.
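A minimal sketch of this check in Python (statsmodels and scipy assumed): fit a regression on simulated data, run the Jarque–Bera test on the residuals, and draw a Q–Q plot, a close relative of the P–P plot, on which normal residuals should hug the 45-degree line. All variable names are illustrative.

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 2.0 + 0.5 * x + rng.normal(size=200)        # simulated data with normal errors

    model = sm.OLS(y, sm.add_constant(x)).fit()
    resid = model.resid

    jb_stat, jb_pvalue = stats.jarque_bera(resid)   # Jarque-Bera test of normality
    print(jb_stat, jb_pvalue)                       # a large p-value: do not reject normality

    fig = sm.qqplot(resid, line="45", fit=True)     # points should hug the 45-degree line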

1.4.2 Heteroscedasticity

Black (2007) states that heteroscedasticity is a phenomenon that occurs when the error variances are not constant. Webster (2013) explained that heteroscedasticity happens when the variation of the error terms around the regression line is not the same for all values of X.

1.4.3 Multicollinearity

Black (2007) stated that multicollinearity arises when two or more of the independent variables of a multiple regression model are highly correlated. Hair et al. (2007) acknowledged that multicollinearity can be identified by two measures, the tolerance value and the variance inflation factor (the VIF is the inverse of the tolerance value).

1.4.5 Autocorrelation

Humming (2012) explained that an autocorrelation test is used to test whether, in a linear regression model, the error terms of different observations are correlated or not. If there is such correlation, there is a problem called autocorrelation, which causes the constructed model to be inappropriate.
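To make the idea concrete, the sketch below (statsmodels assumed, simulated data with AR(1) errors) applies two standard checks: the Durbin–Watson statistic and the Breusch–Godfrey LM test. All names are illustrative.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import acorr_breusch_godfrey

    rng = np.random.default_rng(2)
    n = 120
    x = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):                 # AR(1) errors: each error carries over part of the last one
        e[t] = 0.7 * e[t - 1] + rng.normal()
    y = 1.0 + 0.5 * x + e

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print(durbin_watson(model.resid))     # well below 2 signals positive autocorrelation
    lm, lm_pvalue, f, f_pvalue = acorr_breusch_godfrey(model, nlags=1)
    print(lm_pvalue)                      # a small p-value: autocorrelation is present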

TERMINOLOGIES

Dependent variable Explanatory variable

Explained variable Independent variable

Predictand Predictor

Regressand Regressor

Response Stimulus

Endogenous Exogenous

Controlled variable Control variable

CHAPTER NO.A

INTRODUCTION TO REGRESSION

A.1 Introduction
Getting the Most from Least Squares
Regression is the queen and king (the main attraction) of econometric tools. Regression's job is to find numerical values for theoretical parameters. In the simplest case this means telling us the slope and intercept of a line drawn through two-dimensional data. But EViews tells us lots more than just the slope and intercept. In this chapter you'll see how easy it is to get parameter estimates plus a large variety of auxiliary statistics. Regression analysis studies the statistical dependence of one variable, the dependent (endogenous) variable, on one or more other variables, the explanatory (exogenous) variables.

Figure (A1.1)
Go back over Galton's law of regression. Galton was concerned with finding out why there was constancy in the distribution of heights in a population. But in the modern view our concern is not with this explanation but rather with finding out how the average height of sons changes, given the fathers' height. In other words, our concern is with predicting the average height of sons knowing the height of their fathers. To see how this can be done, consider Figure A1.1, which is a scatter diagram, or scattergram. This figure shows the distribution of heights of sons in a hypothetical population corresponding to the given or fixed values of the father's height. Notice that corresponding to any given height of a father there is a range, or distribution, of the heights of the sons. However, notice that despite the variability of the height of sons for a given value of the father's height, the average height of sons generally increases as the height of the father increases. To show this clearly, the circled crosses in the figure indicate the average height of sons corresponding to a given height of the father. Connecting these averages, we obtain the line shown in the figure. This line, as we shall see, is known as the regression line. It shows how the average height of sons increases with the father's height.
Figure (A1.2)

The above figure (A1.2) emphasizes that the mean value of ui, conditional upon the given Xi, is zero. Geometrically, this assumption can be pictured as in Figure A1.2, which shows a few values of the variable X and the Y populations associated with each of them. As shown, each Y population corresponding to a given X is distributed around its mean value (shown by the circled points on the PRF), with some Y values above the mean and some below it. The distances above and below the mean values are nothing but the ui, and what this assumption requires is that the average or mean value of these deviations corresponding to any given X should be zero. This assumption should not be difficult to comprehend in view of the earlier discussion. All that this assumption says is that the factors not explicitly included in the model, and therefore subsumed in ui, do not systematically affect the mean value of Y; so to speak, the positive ui values cancel out the negative ui values so that their average or mean effect on Y is zero.

THE COEFFICIENT OF DETERMINATION R2

A MEASURE OF “GOODNESS OF FIT”

Figure (A1.3)
In figure A1.3a the circle Y represents variation in the dependent variable Y and the circle X represents variation in the explanatory variable X. The overlap of the two circles (the shaded area) indicates the extent to which the variation in Y is explained by the variation in X (say, via an OLS regression). The greater the extent of the overlap, the greater the variation in Y that is explained by X. Thus far we were concerned with the problem of estimating regression coefficients, their standard errors, and some of their properties. We now consider the goodness of fit of the fitted regression line to a set of data; that is, we shall find out how "well" the sample regression line fits the data. From Figure A1.3 it is clear that if all the observations were to lie on the regression line, we would obtain a "perfect" fit, but this is rarely the case. Generally, there will be some positive ûi and some negative ûi. What we hope for is that these residuals around the regression line are as small as possible.

Figure A1.2
Difference between R-squared and adjusted R-squared.

R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) in a linear regression model. Adjusted R-squared adjusts the statistic for the number of independent variables in the model.
The reason this is important is that you can "game" R-squared by adding more and more independent variables, irrespective of how well they are correlated to your dependent variable; obviously, this isn't a desirable property of a goodness-of-fit statistic. Conversely, adjusted R-squared provides an adjustment to the R-squared statistic such that an independent variable with a real correlation to Y increases adjusted R-squared, while any variable without a strong correlation makes adjusted R-squared decrease. That is the desired property of a goodness-of-fit statistic. Adjusted R-squared adjusts plain R-squared to take account of the number of right-hand-side variables in the regression. R-squared measures what fraction of the variation in the left-hand-side variable is explained by the regression. When you add another right-hand-side variable to a regression, R-squared always rises; this is a numerical property of least squares. The adjusted version, sometimes written R̄², subtracts a small penalty for each additional variable added.

One major difference between R-squared and adjusted R-squared is that R-squared supposes that every independent variable in the model explains the variation in the dependent variable: it gives the percentage of explained variation as if all independent variables in the model affect the dependent variable. Adjusted R-squared, in contrast, gives the percentage of variation explained by only those independent variables that actually affect the dependent variable. R-squared cannot verify whether the coefficient estimates and predictions are biased. It also does not show whether a regression model is adequate: you can have a low R-squared for a good model, or a high R-squared for a model that doesn't fit the data.

An interesting algebraic fact is the following: if we add a new independent variable to a regression equation, adjusted R-squared increases if, and only if, the t statistic on the new variable is greater than one in absolute value. (An extension of this is that adjusted R-squared increases when a group of variables is added to a regression if, and only if, the F statistic for joint significance of the new variables is greater than unity.) Thus, we see immediately that using adjusted R-squared to decide whether a certain independent variable (or set of variables) belongs in a model gives us a different answer than standard t or F testing (because a t or F statistic of unity is not statistically significant at traditional significance levels).
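A small simulation makes the contrast visible. The sketch below (statsmodels assumed) fits a model with one genuine regressor and then adds a pure-noise regressor: R-squared rises mechanically, while adjusted R-squared typically falls because the junk variable's |t| is below one. All names are illustrative.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 50
    x1 = rng.normal(size=n)
    junk = rng.normal(size=n)             # pure noise, unrelated to y
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)

    m1 = sm.OLS(y, sm.add_constant(x1)).fit()
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, junk]))).fit()

    print(m1.rsquared, m2.rsquared)           # R-squared never falls when a regressor is added
    print(m1.rsquared_adj, m2.rsquared_adj)   # adjusted R-squared penalises the extra variable
    print(m2.tvalues[-1])                     # adjusted R-squared rises only if this |t| > 1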
CHAPTER NO.B

OVERVIEW TO NORMALITY
B.1 Introduction
A normal distribution is an arrangement of a data set in which most values cluster in the
middle of the range and the rest taper off symmetrically toward either extreme. Height is one
simple example of something that follows a normal distribution pattern: Most people are of
average height, the numbers of people that are taller and shorter than average are fairly equal
and a very small (and still roughly equivalent) number of people are either extremely tall or
extremely short.
Figure B1.1
Figure B1.2

Residuals from the utility expenditure regression

The Jarque–Bera test shows that the JB statistic is about 0.2576, and the probability of obtaining such a statistic under the normality assumption is about 88 per cent. Therefore, we do not reject the hypothesis that the error terms are normally distributed. But keep in mind that the sample size of 55 observations may not be large enough.
B.2 Higher Moments of Probability Distributions
Although the mean, variance, and covariance are the most frequently used summary measures of univariate and multivariate PDFs, we occasionally need to consider higher moments of the PDFs, such as the third and the fourth moments. The third and fourth moments of a univariate PDF f(x) around its mean value (µ) are defined as
 Third moment: E(X − µ)³
 Fourth moment: E(X − µ)⁴
In general, the rth moment about the mean is defined as: rth moment: E(X − µ)^r.
The third and fourth moments of a distribution are often used in studying the "shape" of a probability distribution, in particular its skewness, S (i.e., lack of symmetry), and kurtosis, K (i.e., tallness or flatness), as shown in Figure B2.1.

One measure of skewness is defined as:

S = E(X − µ)³ / σ³
  = (third moment about the mean) / (cube of the standard deviation)

A commonly used measure of kurtosis is given by:

K = E(X − µ)⁴ / [E(X − µ)²]²
  = (fourth moment about the mean) / (square of the second moment)
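These two measures can be combined into the Jarque–Bera statistic mentioned earlier, JB = (n/6)[S² + (K − 3)²/4]. The sketch below (numpy assumed, simulated normal data) computes S, K and JB directly from the moment definitions above; for a normal sample S should be near 0, K near 3 and JB small.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=1000)             # a simulated sample; in practice, use your residuals
    n = len(x)
    mu = x.mean()

    S = np.mean((x - mu) ** 3) / x.std() ** 3                    # third moment / sigma^3
    K = np.mean((x - mu) ** 4) / np.mean((x - mu) ** 2) ** 2     # fourth moment / (second moment)^2
    JB = n / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)               # Jarque-Bera statistic

    print(S, K, JB)                       # near 0, near 3 and small for a normal sample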
Figure B1.3

PDFs with values of K less than 3 are called platykurtic (fat or short-tailed), and those with values greater than 3 are called leptokurtic (slim or long-tailed). See Figure B2.2. A PDF with a kurtosis value of 3 is known as mesokurtic, of which the normal distribution is the prime example. (See the discussion of the normal distribution.) We will show shortly how the measures of skewness and kurtosis can be combined to determine whether a random variable follows a normal distribution. Recall that our hypothesis-testing procedure, as in the t and F tests, is based on the assumption (at least in small or finite samples) that the underlying distribution of the variable (or sample statistic) is normal. It is therefore very important to find out in concrete applications whether this assumption is fulfilled.

NOTE: Normal distribution will be explained in the test analysis.


SUMMARY AND CONCLUSIONS

This chapter introduced one such test, the normality test, to find out whether the error term follows the normal distribution. Since in small, or finite, samples the t, F, and chi-square tests require the normality assumption, it is important that this assumption be checked formally.
CHAPTER NO.C

HETEROSCEDASTICITY

ERROR VARIANCE IS NONCONSTANT

a) The nature of heteroscedasticity.
b) How it affects your estimation.
c) Various methods to detect it.
d) Effective counteractive measures.

C.1 Error-learning models

Just as, with the passage of time, people learn from their mistakes, so in research it is assumed that as time increases there is less chance of making mistakes: errors of behaviour become smaller over time. For example, as the number of hours of typing practice increases, the average number of typing errors as well as their variance decreases.

a) Heteroscedasticity can also arise due to the existence of outliers. An outlying observation, or outlier, is an observation that is much different (either very small or very large) in relation to the other observations in the sample. Outliers play a very disruptive role: these fluctuating figures or values vary so much that they can bias your estimation.
Figure C1.1
OUTLIERS: plot of the INVTT series (observation values ranging roughly from 0 to 300 across the sample).

As shown in figure C1.1 above, the observation values (some very small, some very large) are not constant; the data sample shows fluctuation. This sort of sample will definitely lead towards heteroscedasticity, and judging from this diagram the data will not appear to be normal.

Figure C1.2
Plot of the CH series (observation values ranging roughly from 0 to 25 across the sample).

Figure C1.2 above shows that the observations are well behaved: no disturbance or fluctuation appears that could affect the analysis, and the data appear close to normal, with very few outliers present. Including or excluding such observations, particularly where the sample size is small, can considerably change the results of a regression analysis in a favourable or unfavourable fashion.
b) Another source of heteroscedasticity arises from violating Assumption 9 of the CLRM, namely that the regression model is correctly specified. Very often heteroscedasticity may be due to the fact that some important variables are omitted from the model.
c) Heteroscedasticity can also be present when we include in the model a variable that does not adjust to the others because of its natural composition and behaviour. For example, among economic variables such as income, wealth and education, income and wealth may sit comfortably with each other, but combining income and wealth with education generates an entirely different scenario; heteroscedasticity here reflects skewness in the distribution of one or more of the independent variables incorporated in the model. Secondly, in research we choose a group of variables that are supposed to work together, but each has its own identity and parameters, and one should not mix or instigate another in a way that might bias your estimation. It is well known that the distribution of income and wealth in most states and societies is uneven.
d) Heteroscedasticity can also arise because of:
1. Incorrect data transformation (e.g., ratio or first-difference transformations).
2. Incorrect functional form (e.g., linear versus log-linear models).

The procedure of transforming the original variables in such a way that the transformed variables satisfy the assumptions of the classical model, and then applying OLS to them, is known as the method of generalized least squares (GLS). In short, GLS is OLS on transformed variables that satisfy the standard least-squares assumptions. The estimators thus obtained are known as GLS estimators.
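A minimal sketch of this idea, assuming statsmodels and a simulated sample whose error spread grows with x: weighted least squares with weights 1/σᵢ² is the same thing as OLS on the variables divided by σᵢ. In practice σᵢ is unknown and has to be modelled, so treat this purely as an illustration.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 200
    x = rng.uniform(1, 10, size=n)
    sigma = 0.5 * x                                       # error spread grows with x
    y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

    X = sm.add_constant(x)
    wls = sm.WLS(y, X, weights=1.0 / sigma ** 2).fit()    # GLS with a diagonal covariance matrix

    # equivalently, OLS on the transformed variables y/sigma and X/sigma
    ols_t = sm.OLS(y / sigma, X / sigma[:, None]).fit()
    print(wls.params, ols_t.params)                       # identical coefficient estimates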
Figure C1.3

C.2 DETECTION OF HETEROSCEDASTICITY AND VARIOUS MODELS THAT CURE IT

There is no hard and fast rule for detecting heteroscedasticity; the regression results give hints, such as a very low R² in the problem area and sometimes low values of the t-statistics. The two broad approaches, illustrated by the sketch after this list, are:
 Graphical methods
 Formal tests for heteroscedasticity
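Formal tests are easy to run in software. The sketch below (statsmodels assumed, simulated heteroscedastic data) applies the Breusch–Pagan and White tests to the OLS residuals; these are two of the standard tests alluded to above, and small p-values point to heteroscedasticity. All names are illustrative.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan, het_white

    rng = np.random.default_rng(6)
    n = 200
    x = rng.uniform(1, 10, size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)         # variance rises with x

    X = sm.add_constant(x)
    model = sm.OLS(y, X).fit()

    lm, lm_pvalue, f, f_pvalue = het_breuschpagan(model.resid, X)
    print(lm_pvalue)                                      # small p-value: reject homoscedasticity
    lm_w, lm_w_pvalue, f_w, f_w_pvalue = het_white(model.resid, X)
    print(lm_w_pvalue)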

C2.1 Graphical Method and Transformation of Variables to Generate Linearity

In the case of heteroscedasticity, a plot of y against x (or of the residuals against the fitted values) will often exhibit a systematic pattern; various data trends appear that help you examine the situation the research is dealing with.

The basic idea behind transforming to linearity is to take a data set that is nonlinear and transform it, using a log, reciprocal or square transformation, to make the data more linear.

Figure C.2.1
Suppose we have a scatterplot that is not really linear: the pattern of the points follows a curved shape rather than a straight line. If we want to work out an equation for this line in the form y = B1 + B2X + c, we first need to transform the data (see figures C2.2 and C2.3).

Figure C.2.2

Figure C.2.3
To draw a valid straight line through such data, we would need to do something to the data first, because an upward-curving trend like the green line (a) is not linear in the variable X; a deviation from the straight line appears.

Figure C 2.4

In figure C 2.4, the starting area (marked with lines) is fine, but further along the dots move away from the line towards the top, in an upward trend; they do not stick to the straight-line shape. So something of the form y = BX + c is not going to do us much good, because that is not the shape of the data we have.

Figure C 2.8
Figures C2.6, C2.7 and C2.8 explain what we need to do: transform the data so that it follows the straight line. The points at the bottom are fine, but further along the dots veer off course and head upwards instead of sticking to the straight line, so we need to pull them back towards the line. There are a couple of ways to do that in these graphs. We could pull the points towards the line by stretching out the x-axis: if we push everything across in that direction, all of these points get pulled over and conform a bit more to the line, as if we grabbed the data and stretched it out sideways so that it follows more of a line shape. Alternatively, we could move the points down by compressing the y-axis, squashing them onto the line. So there are two ways to make this data more linear: stretch out the x-axis, or compress the y-axis, so that the stray points come down onto the line. The problem points are above and to the left, so we can either make them go down or make them go to the right, and we have a couple of sneaky little methods for doing this to data.
To stretch things out along the x-axis, we square the X values. The reason this works is that the early, small values are barely affected (they are already roughly on the line and follow its shape), while the larger values are stretched out much further.
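A minimal numerical sketch of this transformation (numpy assumed, simulated curved data): regressing y on x² instead of x straightens the relationship, which is exactly the stretching of the x-axis described above. The names and coefficients are illustrative.

    import numpy as np

    rng = np.random.default_rng(7)
    x = np.linspace(1, 10, 50)
    y = 2.0 + 0.3 * x ** 2 + rng.normal(scale=1.0, size=50)   # a curved relationship

    print(np.corrcoef(x, y)[0, 1])        # strong, but the scatter is curved
    print(np.corrcoef(x ** 2, y)[0, 1])   # closer to 1: squaring x straightens the pattern

    b2, b1 = np.polyfit(x ** 2, y, 1)     # fit the straight line y = b1 + b2 * (x squared)
    print(b1, b2)                         # roughly 2 and 0.3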

CHAPTER NO.D
Multicollinearity
MULTICOLLINEARITY: WHAT OCCURS IF THE EXPLANATORY VARIABLES ARE CORRELATED

One assumption of the classical linear regression model (CLRM) is that there is no multicollinearity among the regressors included in the regression model. Quite often the model is assumed to be BLUE, with no bias in the regression, but the model is not free of these problems unless we filter it with various analyses to make sure that the degree of bias is minimized or removed for the better good of the research. In this chapter we will discuss the concept and the various types of multicollinearity in estimation that cause problems. Secondly, this chapter focuses on what is sometimes called imperfect multicollinearity, which refers to the circumstance in which the exogenous variables are highly, but not perfectly, correlated.
se(b_k) = √[(1 − R²) / (N − K − 1)] × √VIF_k × (s_Y / s_Xk)

The bigger R²_Xk,Gk is (i.e. the more highly correlated X_k is with the other IVs in the model), the bigger the standard error will be. Indeed, if X_k is perfectly correlated with the other IVs, the standard error will equal infinity. This is referred to as the problem of multicollinearity. The problem is that, as the Xs become more highly correlated, it becomes more and more difficult to determine which X is actually producing the effect on Y.
Also, recall that 1 − R²_Xk,Gk is referred to as the tolerance of X_k. A tolerance close to 1 means there is little multicollinearity, whereas a value close to 0 suggests that multicollinearity may be a threat. The reciprocal of the tolerance is known as the variance inflation factor (VIF). The VIF shows us how much the variance of the coefficient estimate is being inflated by multicollinearity. The square root of the VIF tells you how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other X variables in the equation. For example, if the VIF for a variable were 9, its standard error would be three times as large as it would be if its VIF were 1. In such a case, the coefficient would have to be three times as large to be statistically significant.
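A hedged sketch of computing tolerance and VIF in Python (statsmodels assumed), using simulated education and experience variables that are deliberately highly correlated; real work would of course use the actual regressors.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(8)
    n = 200
    education = rng.normal(size=n)
    experience = 0.9 * education + 0.2 * rng.normal(size=n)    # deliberately near-collinear
    X = sm.add_constant(pd.DataFrame({"education": education, "experience": experience}))

    for i in range(1, X.shape[1]):                             # skip the constant
        vif = variance_inflation_factor(X.values, i)
        print(X.columns[i], vif, 1.0 / vif)                    # VIF and its reciprocal, the tolerance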
D1) Pursuing answers to the following questions
a) What is the nature of multicollinearity?
b) Is multicollinearity really a problematic area in analysis?
c) What are its practical magnitude and consequences?
d) How can one easily detect it among the explanatory variables?
e) What counteractive tools and techniques can be used to lessen the problem of
multicollinearity?

Consider the following hypothetical data below.

Table D1
X1     X2      X3      Difference between X2 and X3 (X3 − X2)
12     60      63      3 units
14     77      78      1 unit
19     93      98      5 units
23     120     127     7 units
34     150     152     2 units

If X3 were an exact linear function of X2, there would be flawless collinearity between them, since the coefficient of correlation r23 would be unity. Here, however, the variable X3 was created from X2 by simply adding to it the following numbers, taken from a table of random numbers: 3, 1, 5, 7, 2. So there is no longer perfect collinearity between X2 and X3, but the two variables remain highly correlated: calculation shows that the coefficient of correlation between them is above 0.99. Secondly, the mean difference between the two variables is only about 3, as the mean value of X2 is 100 and the mean value of X3 is about 103; there is no real deviation, and the errors are also nearly the same, which means that in the regression it is almost impossible to find out which of the two variables is affecting the endogenous variable.
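Reproducing that construction in a few lines (numpy assumed) confirms the point: X3 stays within a few units of X2 and the correlation r23 comes out very close to one.

    import numpy as np

    X2 = np.array([60, 77, 93, 120, 150], dtype=float)
    X3 = X2 + np.array([3, 1, 5, 7, 2], dtype=float)   # the random numbers added in the text

    r23 = np.corrcoef(X2, X3)[0, 1]
    print(X3, X3.mean(), r23)                          # mean near 103 and r23 very close to 1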

THE ASSUMPTIONS OF THE CONVENTIONAL APPROACH

The Ballantine view of multicollinearity


Explanation of diagram one

Model (D 1)

INCOME_it = α + β1 (EDUCATION_it) + β2 (EXPERIENCE_it) + µ_it
   (Y)               (X1)                  (X2)

In this section (figure D1a) I'm going to look at a visual representation of what goes on when you run a regression: how the coefficients are calculated, where R-squared comes from, and how bias can be created. We are going to look at these Venn diagrams, and they are not ordinary Venn diagrams; you have to think carefully about what is going on in this pictorial representation. To get started, the yellow circle represents the variation in Y: the ups and downs in the dependent variable that you are trying to explain.

Figure D1a

Explanation of figure two

So let's (figure D1b) think about this dependent variable as being income: we want to explain people's yearly salary using some explanatory variables, and this yellow circle is the variation in people's incomes. You can think of it as Y, the dependent variable, but more particularly as the variation in Y, so I'll put a big Y there. Now, the pink circle represents an explanatory variable, one of our X's that is going to explain people's incomes; the pink is the variation of X, and let's think of it as education, our first explanatory variable for explaining Y. Now suppose that when we run a regression of education explaining income, the relationship between the two is as shown below.
Figure D1b

So let me put X1 here; X1 is education. Let's look at what the various parts of the diagram represent. Again, Y is the variation in income and X1 is the variation in people's education in the data set. The orange slice represents the correlation between the two variables, the correlation between education and income. The bigger that correlation is, the bigger the overlap and the more orange we would have; the smaller the overlap, the weaker the relationship between education and incomes. Let's place it here and imagine that there is a good amount of correlation between education and income. The larger the orange part, the higher the correlation, and the higher the R-squared is going to be. You can imagine that as two variables become more and more highly correlated, their variations coincide with each other and the R-squared rises; in other words, the more of the yellow that is covered up, the more of the variation is explained. The moon-shaped slice of X1, the variation in education, is the part of the variation in education that is not related to someone's income. Similarly, the yellow moon-shaped part of Y is the variation in income that is not related to education. So if the orange represents R-squared, the yellow is one minus R-squared: the amount of variation that cannot be explained with education.

Explanation of figure three

However, let's look at another variable that we might have: experience in a job. This blue circle is another explanatory variable, and let's assume it should be included when explaining income (the more experience, the more income). We'll drop this variable in and call it X2, experience. Now let's see what parts we have in the diagram when we run this regression with two explanatory variables. What happens to the R-squared when we add this second variable? The amount of variation that cannot be explained, the yellow, is now smaller, so our R-squared goes up. Now look at the region where X1, X2 and Y all overlap at the same time; let me mark it in green so you can picture exactly what I'm pointing at. Here is the interesting thing you need to understand when you are doing a regression with more than one explanatory variable: for some purposes the green part is basically thrown away. What happens? Remember that the orange represents R-squared. When we bring in the blue, the orange is still there, and the R-squared only goes up by the little slice that is unique to experience and income; that region increases the R-squared. But the overlap of all three areas was already, in a sense, explained by the variation in education, the X1 variable, so X2 is not really adding anything to that region. Instead, experience only adds the region I marked with the green triangle. It is important to note that things are getting a little bit complicated here.

Figure D 1C

UNUSED AREA

Explanation of figures four and five

So let me start labelling things. Now we can see what I'm trying to say a little better. The orange would be used to calculate the estimate of the slope between X1 and Y, education and income. The little slice on the other side would be used to calculate the slope, and the standard errors for that slope, between experience and income, between X2 and Y. But the little area that is common to all three is not used to calculate the slope between either of these variables and Y, and it is not used to calculate the standard errors. Why not? Because it is impossible to figure out: since this variation is equally related to education, experience and income, we cannot tell whether it is education (X1) explaining income or experience explaining income. Since we can't tell, we just have to throw it away; we do not throw it away for the purpose of calculating R-squared, however, and we'll look at an extreme case in just a second. Here is why this is important: think about what happens as education and experience become more and more correlated with each other. We are left with only a very small slice that is uniquely the relationship between education and income, a very small slice that is uniquely the relationship between experience and income, and a huge amount of information that, even though it contributes to the R-squared, cannot be used to figure out the slope between education or experience and income. So not a lot of information is used to calculate these slopes, which means we have high amounts of uncertainty and large standard errors. And you have to remember this chain of ideas: if the standard errors are large, then ceteris paribus the t-statistics will be small, and when the t-statistics are small the p-values will be big. Now, it depends on your purpose in running a regression like this. If your purpose is prediction only, if you just want to predict incomes and you don't care whether the important driver of income is education or experience, then you don't have to worry about this, because you'll still have a high R-squared. But if your main job is to test whether it is education or experience that matters, and actually get slopes for those things, then a lot of overlap between them is a problem. Normally there is only a little bit of overlap; this unused area is the correlation between X1 and X2, between education and experience, and as it gets larger it becomes the problem we call multicollinearity, which we'll talk about later in this chapter. One last quick thing: imagine throwing in a variable that may not be important, just a random variable, and visualize what could happen. That random variable could absorb a lot of the variation in Y, or it could be totally unrelated to anything, or it could suck out a lot of the explanatory power of the variables that should be in the regression.
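A small simulation (statsmodels assumed, illustrative names) shows the Ballantine story in numbers: when education and experience are nearly uncorrelated the slopes are estimated precisely, but when their correlation is pushed towards one the R-squared stays healthy while the standard errors of the slopes blow up.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    n = 200

    def fit_income_model(corr):
        # education and experience drawn with the chosen correlation; income depends on both
        education = rng.normal(size=n)
        experience = corr * education + np.sqrt(1.0 - corr ** 2) * rng.normal(size=n)
        income = 1.0 + education + experience + rng.normal(size=n)
        X = sm.add_constant(np.column_stack([education, experience]))
        return sm.OLS(income, X).fit()

    low = fit_income_model(0.10)
    high = fit_income_model(0.98)
    print(low.rsquared, high.rsquared)    # R-squared stays healthy in both cases
    print(low.bse[1:], high.bse[1:])      # slope standard errors blow up when corr = 0.98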
Figure D 1d

Figure D 1e
The Ballantine view of multicollinearity

D2) Causes of multicollinearity

a) Inadequate use of dummy variables (e.g. failure to exclude one category)


b) Including a variable that is computed from other variables in the equation (e.g. family
income = husband’s income + wife’s income and the regression includes all 3 income
measures)
c) In time series data, it may be that the regressors included in the model share a common trend, that is, they all increase or decrease over time. For example, in a regression of consumption expenditure (the dependent variable) on income, wealth, and population (the independent variables), the regressors income, wealth, and population may all be growing over time at more or less the same rate, leading to collinearity among these variables.
d) In effect, including the same or almost the same variable twice (height in feet and height in inches; or, more commonly, two different operationalizations of the same concept).
e) The above all imply some sort of error on the researcher’s part. But, it may just be
that variables really and truly are highly correlated.
f) The data collection method employed, for example, sampling over a limited range of
the values taken by the regressors in the population.
g) Model specification, for example, adding polynomial terms to a regression model,
especially when the range of the X variable is small.
h) An overdetermined model. This occurs when the model has more exogenous variables
than the number of observations in the data analysis. This could happen in medical
research, where information may be collected on a very large number of variables for a
relatively small number of patients.
i) Examine the bivariate correlations between the IVs, and look for “big” values, e.g.
.80 and above. However, the problems with this approach are that (i) one IV may be a
linear combination of several other IVs and yet not be highly correlated with any single
one of them, and (ii) it is hard to decide on a cut-off point; the smaller the sample, the
lower the cut-off point should probably be.
j) Examining the tolerances or VIFs is probably superior to examining the bivariate
correlations. Indeed, you may want to actually regress each X on all of the other X’s,
to help you pinpoint where the problem is. A commonly given rule of thumb is that
VIFs of 10 or higher (or equivalently, tolerances of .10 or less) may be reason for
concern. This is, however, just a rule of thumb; Allison says he gets concerned when
the VIF is over 2.5 and the tolerance is under .40. In SPSS, you get the tolerances
and VIFs by adding either the VIF or COLLIN parameter to the regression command;
in Stata you can use the vif command after running a regression, or you can use the
collin command.
k) Look at the correlations of the estimated coefficients (not the variables). High
correlations between pairs of coefficients indicate possible collinearity problems. In
SPSS, you can get this via the BCOV parameter; in Stata you get it by running the
vce, corr command after a regression.
l) Sometimes eigenvalues, condition indices and the condition number will be referred
to when examining multicollinearity. While all have their uses, I will focus on the
condition number. For each eigenvalue λi of the (scaled) correlation matrix of the
regressors, the condition index is

CI_i = √(λmax / λi)

and the condition number (κ) is the condition index with the largest value; it equals the
square root of the ratio of the largest eigenvalue (λmax) to the smallest eigenvalue
(λmin):

κ = √(λmax / λmin)

m) When there is no collinearity at all, the eigenvalues, condition indices and the
condition number will all equal one. As collinearity increases, eigenvalues will be both
greater and smaller than 1 (eigenvalues close to zero indicate a multicollinearity
problem), and the condition indices and the condition number will increase. An
informal rule of thumb is that if the condition number is 15 or more, multicollinearity is
a concern; if it is greater than 30, multicollinearity is a very serious concern. (A short
computational sketch of these quantities follows this list.)
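As a quick illustration of items (l) and (m), here is a minimal sketch assuming Python with numpy; the regressors are synthetic and the names are purely illustrative. It computes the eigenvalues of the regressors' correlation matrix, the condition indices and the condition number as defined above.

import numpy as np

# synthetic regressors in which x2 is nearly collinear with x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))  # eigenvalues of the correlation matrix
cond_indices = np.sqrt(eigvals.max() / eigvals)             # CI_i = sqrt(lambda_max / lambda_i)
kappa = cond_indices.max()                                  # condition number = largest condition index

print("eigenvalues:      ", np.round(eigvals, 4))
print("condition indices:", np.round(cond_indices, 2))
print("condition number: ", round(float(kappa), 2))         # above 30 signals a very serious concern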

D3) Detecting high multicollinearity

Multicollinearity is a matter of degree, and there are various hints for detecting it. There is no
single definitive test that tells you whether the degree of collinearity involved in your
estimation is or is not a problem; rather, there are several symptoms of this imperfect
collinearity to watch for.

D4) Measuring the degree of multicollinearity


In the literature, authors around the world have suggested several methods for measuring the
degree of multicollinearity. All of these statistical measures have limitations.
a) The use of the correlation matrix (pairwise).
b) The variance inflation factor.
c) The tolerance measure.
A) The use of the correlation matrix (pairwise)

TABLE D 2: Pairwise correlation matrix

      Y1       Y2       X1       X2      X3
Y1    1
Y2    0.671    1
X1    -0.579   -0.375   1
X2    -0.325   -0.222   0.979    1
X3    0.322    0.263    -0.292   0.968   1

As can be seen from table D2 above, some pairs of explanatory variables are highly
correlated, and this needs attention. The independent variables X1 and X2 are highly
correlated (r = 0.979). Furthermore, X2 and X3 are also very strongly (though not perfectly)
correlated (r = 0.968), a much stronger relationship than the analysis requires. Such unwanted
relationships among the regressors inflate the variances of the coefficient estimates and make
the usual inference unreliable, so they need to be tackled.
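For instance, a minimal sketch (assuming Python with pandas and numpy; the data and column names are invented for illustration) that prints such a pairwise correlation matrix:

import numpy as np
import pandas as pd

# invented data in which X2 is almost a copy of X1
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "X1": x1,
    "X2": 0.98 * x1 + 0.02 * rng.normal(size=100),
    "X3": rng.normal(size=100),
})

# off-diagonal entries close to |1| flag collinear pairs of regressors
print(df.corr().round(3))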

B) The variance inflation factor


To detect multicollinearity among the explanatory variables, a standard measure is the
variance inflation factor. The speed with which the variances and covariances of the
estimators increase can be seen from the variance-inflating factor (VIF), which for the j-th
regressor is defined as

VIF(bj) = 1 / (1 − Rj²)

where Rj² is the R² from the auxiliary regression of Xj on the remaining explanatory
variables. The VIF shows how the variance of an estimator is inflated by the presence of
multicollinearity.

C) The tolerance measure


It may be noted that the inverse of the VIF is called tolerance (TOL). That is,

TOLj = 1 / VIFj = 1 − Rj²

When Rj² = 1 (i.e., perfect collinearity), TOLj = 0, and when Rj² = 0 (i.e., no collinearity
whatsoever), TOLj = 1. Because of the intimate connection between VIF and TOL, one can
use them interchangeably. A tolerance close to 1 means there is little multicollinearity,
whereas a value close to 0 suggests that multicollinearity may be a threat. The reciprocal of
the tolerance is known as the Variance Inflation Factor (VIF). The VIF shows us how much
the variance of the coefficient estimate is being inflated by multicollinearity. The square root
of the VIF tells you how much larger the standard error is, compared with what it would be if
that variable were uncorrelated with the other X variables in the equation.
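A minimal computational sketch of these two measures, assuming Python with numpy and statsmodels and using made-up data and variable names:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# made-up regressors in which X2 is almost a copy of X1
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = 0.97 * x1 + 0.03 * rng.normal(size=100)
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))        # constant in column 0

for j, name in enumerate(["X1", "X2", "X3"], start=1):    # skip the constant
    vif = variance_inflation_factor(X, j)                 # 1 / (1 - Rj^2)
    print(f"{name}: VIF = {vif:8.2f}   TOL = {1.0 / vif:6.4f}")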

Table D 3
R2 values from the auxiliary regressions
By employing Klein's rule of thumb, the R2 values obtained from the auxiliary regressions
surpass the overall R2 value (that is, the one obtained from the regression of Y on all the X
variables) of 0.9954 in 3 out of 5 cases.

DEPENDENT    R2 VALUE    TOL = 1 - R2
X1           0.992       0.007
X2           0.990       0.0006
X3           0.970       0.029
X4           0.721       0.288
X5           0.997       0.003
Table D 4
Regression results

              PE       SE      VIF     TOLERANCE
Constant      117      99      708     0.0014
X1            4.4      3.1     564     0.0017
X2            -3.8     2.5     104     0.0096
X3            -2.1     1.5
R2            0.901
F-statistic   21.6
p-value       0.0001

a) From table D 4 it is easy to see that high multicollinearity exists among the variables:
the variance inflation factors are far above 10 and the tolerance levels are close to
zero, which indicates that these explanatory variables carry largely overlapping
information.
b) The very high R2 of about 90% shows that the three variables jointly explain a very
large share of the variance of the dependent variable, even though none of the
individual coefficients is large relative to its standard error (for example, 4.4/3.1 ≈ 1.4
for X1).
c) The F-statistic, which measures the combined impact of the three exogenous variables
on the endogenous variable, is also very high and strongly significant (p = 0.0001). A
significant F-statistic combined with individually insignificant coefficients is a classic
symptom of multicollinearity.
What is Serial Correlation / Autocorrelation?
Serial correlation (also called Autocorrelation) is where error terms in a time series
transfer from one period to another. In other words, the error for one time period a is
correlated with the error for a subsequent time period b. For example, an
underestimate for one quarter’s profits can result in an underestimate of profits for
subsequent quarters. This can result in a myriad of problems, including:
• Inefficient Ordinary Least Squares estimates, and any forecast based on those
estimates. An efficient estimator gives you the most information about a sample;
inefficient estimators can perform well, but require much larger sample sizes to do
so.
• Exaggerated goodness of fit (for a time series with positive serial correlation and
an independent variable that grows over time).
• Standard errors that are too small (for a time series with positive serial correlation and an
independent variable that grows over time).
• T-statistics that are too large.
• False positives for significant regression coefficients. In other words, a regression
coefficient appears to be statistically significant when it is not.
Types of Autocorrelation
The most common form of autocorrelation is first-order serial correlation, which can
either be positive or negative.
• Positive serial correlation is where a positive error in one period carries over into a
positive error for the following period.
• Negative serial correlation is where an error of one sign in one period tends to be
followed by an error of the opposite sign in the next period, so the errors alternate in sign.
Second-order serial correlation is where an error affects data two time periods later.
This can happen when your data has seasonality. Orders higher than second-order do
happen, but they are rare.
Testing for Autocorrelation
You can test for autocorrelation with:
• A plot of residuals. Plot the residuals et against t and look for clusters of successive
residuals on one side of the zero line; adding a Lowess smoother through the residuals
can make any pattern easier to see.
• A Durbin-Watson test.
• A Lagrange Multiplier (Breusch-Godfrey) test.
• A correlogram. A clear pattern in the autocorrelations is an indication of
autocorrelation; spikes that fall well outside the confidence bands should be looked at
with suspicion.
A short sketch of the Durbin-Watson and Lagrange Multiplier tests follows below.
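This is a minimal sketch assuming Python with numpy and statsmodels; the series is simulated with positively autocorrelated errors purely for illustration.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(4)
n = 200
x = np.cumsum(rng.normal(size=n))            # a trending regressor
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()     # AR(1), positively autocorrelated errors
y = 1.0 + 0.5 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson statistic:", round(durbin_watson(res.resid), 3))
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=2)
print("Breusch-Godfrey LM p-value:", round(lm_pval, 4))      # a small p-value points to autocorrelation

For the Durbin-Watson statistic, values well below 2 point to positive serial correlation and values well above 2 point to negative serial correlation.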

Define sampling

Sampling with Replacement and Sampling without Replacement

Sampling with replacement:

Consider a population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18
potatoes, and all the values are equally likely. Suppose that, in this population, there is
exactly one sack with each number. So the whole population has seven sacks. If I sample two
with replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one.
Then I replace it. Then I pick another. Every one of them still has 1/7 probability of being
chosen. And there are exactly 49 different possibilities here (assuming we distinguish
between the first and second.) They are: (12,12), (12,13), (12, 14), (12,15), (12,16), (12,17),
(12,18), (13,12), (13,13), (13,14), etc.

Sampling without replacement:

Consider the same population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17,
or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is
exactly one sack with each number. So the whole population has seven sacks. If I sample two
without replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that
one. Then I pick another. At this point, there are only six possibilities: 12, 13, 15, 16, 17, and
18. So there are only 42 different possibilities here (again assuming that we distinguish
between the first and the second.) They are: (12,13), (12,14), (12,15), (12,16), (12,17),
(12,18), (13,12), (13,14), (13,15), etc.
What's the Difference?

When we sample with replacement, the two sample values are independent. Practically, this
means that what we get on the first one doesn't affect what we get on the second.
Mathematically, this means that the covariance between the two is zero.

In sampling without replacement, the two sample values aren't independent. Practically, this
means that what we got for the first one affects what we can get for the second one.
Mathematically, this means that the covariance between the two isn't zero, which complicates
the computations. In particular, if we have a SRS (simple random sample) without
replacement from a population with variance σ², then the covariance between two different
sample values is −σ²/(N − 1), where N is the population size. (For a discussion of this in a
textbook for a course at the level of M378K, see the chapter on Survey Sampling in
Mathematical Statistics and Data Analysis by John A. Rice, published by Wadsworth &
Brooks/Cole Publishers; there is an outline of a slick, simple, interesting, but indirect proof in
the problems at the end of that chapter.)

Population size -- Leading to a discussion of "infinite" populations.

When we sample without replacement, and get a non-zero covariance, the covariance
depends on the population size. If the population is very large, this covariance is very close to
zero. In that case, sampling with replacement isn't much different from sampling without
replacement. In some discussions, people describe this difference as sampling from an
infinite population (sampling with replacement) versus sampling from a finite population
(without replacement).
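A small simulation sketch, assuming Python with numpy, makes the contrast concrete for the potato-sack population described above; the number of trials is arbitrary.

import numpy as np

population = np.arange(12, 19)        # the seven sack sizes 12, 13, ..., 18 (population variance = 4)
rng = np.random.default_rng(5)

def pair_covariance(replace, trials=100_000):
    # draw many ordered pairs and estimate the covariance between the first and second draw
    pairs = np.array([rng.choice(population, size=2, replace=replace) for _ in range(trials)])
    return np.cov(pairs[:, 0], pairs[:, 1])[0, 1]

print("with replacement:    cov is about", round(pair_covariance(True), 3))   # close to 0
print("without replacement: cov is about", round(pair_covariance(False), 3))  # close to -4/(7 - 1) = -0.667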

Sampling
Sampling is a statistical procedure that is concerned with the selection of the individual
observation; it helps us to make statistical inferences about the population.

The Main Characteristics of Sampling


In sampling, we assume that samples are drawn from the population in such a way that
sample statistics (for example, the sample mean) are good estimates of the corresponding
population values. A population can be defined as a whole that includes all items and
characteristics of the research taken into study. However, gathering all this information is
time consuming and costly. We therefore make inferences about the population with the help
of samples.

Random sampling:

In data collection, every individual observation has equal probability to be selected into a
sample. In random sampling, there should be no pattern when drawing a sample.

Significance: Significance is the probability that a relationship found in the sample data arose
by chance alone. Researchers often use the 0.05 (5%) significance level.

Probability and non-probability sampling:

Probability sampling is the sampling technique in which every individual unit of the
population has greater than zero probability of getting selected into a sample.

Non-probability sampling is the sampling technique in which some elements of the
population have no probability of getting selected into a sample.

Types of random sampling:

With the random sample, the types of random sampling are:

Simple random sampling: The researcher draws a sample from the population using a random
number generator, so that every unit has the same chance of selection. Simple random
samples are of two types: one where samples are drawn with replacement, and one where
samples are drawn without replacement.

Equal probability systematic sampling: In this type of sampling method, a researcher starts
from a random point and selects every nth subject in the sampling frame. In this method,
there is a danger of order bias.

Stratified simple random sampling: In stratified simple random sampling, a proportion from
each stratum of the population is selected using simple random sampling. For example, a
fixed proportion is taken from every class in a school (see the short sketch after this list of
sampling types).
Multistage stratified random sampling: In multistage stratified random sampling, a
proportion of strata is selected from a homogeneous group using simple random sampling.
For example, from the nth class and nth stream, a sample is drawn called the multistage
stratified random sampling.

Cluster sampling: Cluster sampling occurs when a random sample is drawn from certain
aggregational geographical groups.

Multistage cluster sampling: Multistage cluster sampling occurs when a researcher draws a
random sample from the smaller unit of an aggregational group.
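The sketch below, assuming Python with pandas (1.1 or later) and numpy, illustrates a simple random sample, an equal-probability systematic sample and a stratified simple random sample drawn from an invented sampling frame of pupils grouped into classes; all names and sizes are made up.

import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
frame = pd.DataFrame({
    "pupil_id": np.arange(600),
    "school_class": np.repeat(["A", "B", "C"], 200),   # the stratum each pupil belongs to
})

simple = frame.sample(n=60, replace=False, random_state=0)                    # simple random sample without replacement
systematic = frame.iloc[int(rng.integers(10))::10]                            # every 10th pupil from a random start
stratified = frame.groupby("school_class").sample(frac=0.1, random_state=0)   # the same proportion from every class

print(stratified["school_class"].value_counts())    # 20 pupils drawn from each of the three classes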

Types of non-random sampling: Non-random sampling is widely used in qualitative research,
because random sampling is often too costly there. The following are non-random sampling
methods:

Availability sampling: Availability sampling occurs when the researcher selects the sample
based on the availability of a sample. This method is also called haphazard sampling. E-mail
surveys are an example of availability sampling.

Quota sampling: This method is similar to the availability sampling method, but with the
constraint that the sample is drawn proportionally by strata.

Expert sampling: This method is also known as judgment sampling. In this method, a
researcher collects the samples by taking interviews from a panel of individuals known to be
experts in a field.
Analyzing non-response samples: The following methods are used to handle the non-
response sample:
Weighting: Weighting is a statistical technique used to handle non-response: the responses
that were obtained are re-weighted so that they also stand in for similar units that did not
respond. In SPSS, the WEIGHT BY command is used to assign the weights; in SAS, the
WEIGHT statement serves the same purpose.

Dealing with missing data: In statistical analysis, non-response data are called missing data.
During the analysis we either delete the observations with missing data or replace the missing
values with other values (imputation). In SPSS, the Missing Value Analysis procedure is used
to handle such data.

Sample size: To handle the non-response data, a researcher usually takes a large sample.

THE HALLMARKS OF SCIENTIFIC RESEARCH

The hallmarks or main distinguishing characteristics of scientific research may be listed as
follows:

1. Purposiveness

2. Rigor

3. Testability

4. Replicability

5. Precision and Confidence

6. Objectivity

7. Generalizability

8. Parsimony

Each of these characteristics can be explained in the context of a concrete example. Let us
consider the case of a manager who is interested in investigating how employees'
commitment to the organization can be increased. We shall examine how the eight hallmarks
of science apply to this investigation so that it may be considered "scientific."

Purposiveness The manager has started the research with a definite aim or purpose. The
focus is on increasing the commitment of employees to the organization, as this will be
beneficial in many ways. An increase in employee commitment will translate into less
turnover, less absenteeism, and probably increased performance levels, all of which would
definitely benefit the organization. The research thus has a purposive focus.

Rigor A good theoretical base and a sound methodological design would add rigor to a
purposive study. Rigor connotes carefulness, scrupulousness, and the degree of exactitude
in research investigations. In the case of our example, let us say the manager of an
organization asks 10 to 12 of its employees to indicate what would increase their level of
commitment to it. If, solely on the basis of their responses, the manager reaches several
conclusions on how employee commitment can be increased, the whole approach to the
investigation would be unscientific. It would lack rigor for the following reasons: (1) the
conclusions would be incorrectly drawn because they are based on the responses of just
a few employees whose opinions may not be representative of those of the entire
workforce, (2) the manner of framing and addressing the questions could have introduced
bias or incorrectness in the responses, and (3) there might be many other important
influences on organizational commitment that this small sample of respondents did not or
could not verbalize during the interviews, and the researcher would have failed to
include them. Therefore, conclusions drawn from an investigation that lacks a good
theoretical foundation, as evidenced by reason (3), and methodological sophistication, as
evident from (1) and (2) above, would be unscientific. Rigorous research involves a good
theoretical base and a carefully thought-out methodology. These factors enable the
researcher to collect the right kind of information from an appropriate sample with the
minimum degree of bias, and facilitate suitable analysis of the data gathered. The following
chapters of this book address these theoretical and methodological issues. Rigor in
research design also makes possible the achievement of the other six hallmarks of science
that we shall now discuss.

Testability If, after talking to a random selection of employees of the organization and
study of the previous research done in the area of organizational commitment, the manager
or researcher develops certain hypotheses on how employee commitment can be enhanced,
then these can be tested by applying certain statistical tests to the data collected for the
purpose. For instance, the researcher might hypothesize that those employees who perceive
greater opportunities for participation in decision making would have a higher level of
commitment. This is a hypothesis that can be tested when the data are collected. A
correlation analysis would indicate whether the hypothesis is substantiated or not. The use of
several other tests, such as the chi-square test and the t-test, is discussed in the Module titled
Refresher on Statistical Terms and Tests at the end of this book, and in Chapter 12.
Scientific research thus lends itself to testing logically developed hypotheses to see
whether or not the data support the educated conjectures or hypotheses that are developed
after a careful study of the problem situation. Testability thus becomes another hallmark
of scientific research.

Replicability Let us suppose that the manager/researcher, based on the results of the
study, concludes that participation in decision making is one of the most important factors
that influences the commitment of employees to the organization. We will
place more faith and credence in these findings and conclusions if similar findings
emerge on the basis of data collected by other organizations employing the same methods.
To put it differently, the results of the tests of hypotheses should be supported again and yet
again when the same type of research is repeated in other similar circumstances. To the
extent that this does happen (i.e., the results are replicated or repeated), we will gain
confidence in the scientific nature of our research. In other words, our hypotheses would not
have been supported merely by chance, but are reflective of the true state of affairs in the
population. Replicability is thus another hallmark of scientific research.

Precision and Confidence In management research, we seldom have the luxury of being
able to draw "definitive" conclusions on the basis of the results of data analysis.
This is because we are unable to study the universe of items, events, or population we are
interested in, and have to base our findings on a sample that we draw from the universe.
In all probability, the sample in question may not reflect the exact characteristics of the
phenomenon we try to study (these difficulties are discussed in greater detail in a later
chapter). Measurement errors and other problems are also bound to introduce an
element of bias or error in our findings. However, we would like to design the research in
a manner that ensures that our findings are as close to reality (i.e., the true state of
affairs in the universe) as possible, so that we can place reliance or confidence in the
results. Precision refers to the closeness of the findings to "reality" based on a sample. In
other words, precision reflects the degree of accuracy or exactitude of the results on the
basis of the sample, to what really exists in the universe. For example, if I estimated the
number of production days lost during the year due to absenteeism at between 30 and 40,
as against the actual of 35, the precision of my estimation compares more favorably than if
I had indicated that the loss of production days was somewhere between 20 and 50. You
may recall the term confidence interval in statistics, which is what is referred to here as
precision. Confidence refers to the probability that our estimations are correct. That is, it is
not merely enough to be precise, but it is also important that we can confidently claim
that 95% of the time our results would be true and there is only a 5% chance of our being
wrong. This is also known as confidence level. The narrower the limits within which we
can estimate the range of our predictions (i.e., the more precise our findings) and the
greater the confidence we have in our research results, the more useful and scientific the
findings become. In social science research, a 95% confidence level—which implies that
there is only a 5% probability that the findings may not be correct—is accepted as
conventional, and is usually referred to as a significance level of .05 (p = .05). Thus, precision
and confidence are important aspects of research, which are attained through appropriate
scientific sampling design. The greater the precision and confidence we aim at in our
research, the more scientific is the investigation and the more useful are the results. Both
precision and confidence are discussed in detail in Chapter 11 on Sampling.

Objectivity The conclusions drawn through the interpretation of the results of data
analysis should be objective; that is, they should be based on the facts of the findings
derived from actual data, and not on our own subjective or emotional values. For instance, if
we had a hypothesis that stated that greater participation in decision making will increase
organizational commitment, and this was not supported by the results, it makes no sense if
the researcher continues to argue that increased opportunities for employee participation
would still help! Such an argument would be based, not on the factual, data-based
research findings, but on the subjective opinion of the researcher. If this was the
researcher's conviction all along, then there was no need to do the research in the first
place! Much damage can be sustained by organizations that implement non-data-based or
misleading conclusions drawn from research. For example, if the hypothesis relating to
organizational commitment in our previous example was not supported, considerable time
and effort would be wasted in finding ways to create opportunities for employee
participation in decision making. We would only find later that employees still keep
quitting, remain absent, and do not develop any sense of commitment to the
organization. Likewise, if research shows that increased pay is not going to increase the
job satisfaction of employees, then implementing a revised increased pay system will
only drag down the company financially without attaining the desired objective. Such a
futile exercise, then, is based on nonscientific interpretation and implementation of the
research results. The more objective the interpretation of the data, the more scientific
the research investigation becomes. Though managers or researchers might start with some
initial subjective values and beliefs, their interpretation of the data should be stripped of
personal values and bias. If managers attempt to do their own research, they should be
particularly sensitive to this aspect. Objectivity is thus another hallmark of scientific
investigation.

Generalizability Generalizability refers to the scope of applicability of the research findings
in one organizational setting to other settings. Obviously, the wider the range of applicability
of the solutions generated by research, the more useful the research is to the users. For
instance, if a researcher's findings that participation in decision making enhances
organizational commitment are found to be true in a variety of manufacturing, industrial, and
service organizations, and not merely in the particular organization studied by the researcher,
then the generalizability of the findings to other organizational settings is enhanced. The
more generalizable the research, the greater its usefulness and value. However, not many
research findings can be generalized to all other settings, situations, or organizations.

Parsimony Simplicity in explaining the phenomena or problems that occur, and in generating
solutions for the problems, is always preferred to complex research frameworks that consider
an unmanageable number of factors. For instance, if two or three specific variables in the
work situation are identified, which when changed would raise the organizational
commitment of the employees by 45%, that would be more useful and valuable to the
manager than if it were recommended that he should change 10 different variables to increase
organizational commitment by 48%. Such an unmanageable number of variables might well
be totally beyond the manager's control to change. Therefore, the achievement of a
meaningful and parsimonious, rather than an elaborate and
cumbersome, model for problem solution becomes a critical issue in research. Economy in
research models is achieved when we can build into our research framework a lesser
number of variables that would explain the variance far more efficiently than a complex set
of variables that would only marginally add to the variance explained. Parsimony can be
introduced with a good understanding of the problem and the important factors that
influence it. Such a good conceptual theoretical model can be realized through unstructured
and structured interviews with the concerned people, and a thorough literature review of
the previous research work in the particular problem area.
Research Questionnaire

Section 1: Multifactor Leadership Questionnaire

Using the following scale, please rate your immediate supervisor/team leader by circling
your choice for each of the following statements.

Not at all = 0    Once in a while = 1    Sometimes = 2    Fairly often = 3    Always = 4

My team leader/supervisor...............

Transactional leadership (contingent reward) (CR)
1  Provides me with assistance in exchange for my efforts  0 1 2 3 4
2  Discusses in specific terms who is responsible for achieving performance targets  0 1 2 3 4
3  Makes clear what one can expect to receive when performance goals are achieved  0 1 2 3 4
4  Expresses satisfaction when I meet expectations  0 1 2 3 4

Management by exception-active (MBEA)
5  Focuses attention on irregularities, mistakes, and deviations from standards  0 1 2 3 4
6  Concentrates his/her full attention on dealing with mistakes, complaints and failures  0 1 2 3 4
7  Keeps track of all mistakes  0 1 2 3 4
8  Directs my attention toward failures to meet standards  0 1 2 3 4

Management by exception-passive (MBEP)
9  Fails to interfere until problems become serious  0 1 2 3 4
10  Waits for things to go wrong before taking actions  0 1 2 3 4
11  Shows that he/she is a firm believer in “if it ain’t broke, don’t fix it”  0 1 2 3 4
12  Demonstrates that problems must become chronic before taking action  0 1 2 3 4

Laissez-faire (LF)
13  Avoids getting involved when important issues arise  0 1 2 3 4
14  Is absent when needed  0 1 2 3 4
15  Avoids making decisions  0 1 2 3 4
16  Delays responding to urgent questions  0 1 2 3 4

Transformational leadership (idealised influence attributed)
17  Instils pride in me for being associated with him/her  0 1 2 3 4
18  Goes beyond self-interest for the good of the group  0 1 2 3 4
19  Displays a sense of power and confidence  0 1 2 3 4

Idealised influence behaviour (IIB)
20  Talks about his/her most important values and beliefs  0 1 2 3 4
21  Specifies the importance of having a strong sense of purpose  0 1 2 3 4
22  Emphasizes the importance of having a collective sense of mission  0 1 2 3 4
23  Considers the moral and ethical consequences of decisions  0 1 2 3 4

Inspirational motivation (IM)
24  Talks optimistically about the future  0 1 2 3 4
25  Talks enthusiastically about what needs to be accomplished  0 1 2 3 4
26  Articulates a compelling vision of the future  0 1 2 3 4
27  Expresses confidence that goals will be achieved  0 1 2 3 4

Intellectual stimulation (IS)
28  Re-examines critical assumptions to question whether they are appropriate  0 1 2 3 4
29  Suggests new ways of looking at how to complete assignments  0 1 2 3 4
30  Gets me to look at problems from many different angles  0 1 2 3 4

Individual consideration (IC)
31  Helps me to develop my strengths  0 1 2 3 4
32  Spends time coaching  0 1 2 3 4
33  Seeks differing perspectives when solving problems  0 1 2 3 4
34  Considers me as having different needs, abilities, and aspirations from others  0 1 2 3 4
35  Acts in the way that builds my respect  0 1 2 3 4
36  Treats me as an individual rather than just a member of a group  0 1 2 3 4

Section 4: Background information

1. My name (optional)...................................................
2. Name of organization...........................................
3. The major business function of my organization is:
Finance   Health   Engineering
Education   Services   Information technology
Others...........................
4. The number of people in my organization is:
20 and less   21-50   51-100
101-200   201-500   Above 500
5. Number of years worked in this organization is:
1-5   6-10
11-20   Over 20
6. What best describes my position:
Senior management   Middle management   Line management
7. In my organization, I mainly work as:
Team leader   Team
