1. The sample data from a research survey conducted in various cities on the amount of time 13- to
15-year-old children spent with mobiles (hours per week) are as follows:
46, 50, 46, 54, 42, 30, 42, 50, 46
(i)
For a set of data, we determine a quantity that summarises the whole set. This quantity is
termed a measure of central tendency. The most commonly used measures are the mean,
median and mode.
Mean:
For ungrouped data, the mean is calculated with the formula below:
$$\text{Mean}(x) = \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \qquad (1.1)$$
where x is the study variable.
Using formula (1.1), the average time with mobiles (hours per week) is:
$$\bar{x} = \frac{46 + 50 + 46 + 54 + 42 + 30 + 42 + 50 + 46}{9} = \frac{406}{9} \approx 45.11 \text{ hours}$$
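As a quick check of the arithmetic, the same mean can be computed in a few lines of Python. This is only a sketch: the variable names `hours` and `mean_hours` are illustrative, and the values are the nine observations used in the calculation above.

```python
from statistics import mean

# Weekly hours with mobiles, as used in the worked calculation.
hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]

# Arithmetic mean: sum of the observations divided by their count.
mean_hours = mean(hours)

print(f"sum = {sum(hours)}, n = {len(hours)}, mean = {mean_hours:.2f} hours")
```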
(ii)
In probability and statistics, the standard deviation of a probability distribution, random variable,
population or multiset of values is a measure of the spread of its values. It is usually denoted
by the letter σ (lower-case sigma) and is defined as the square root of the variance. To
understand standard deviation, keep in mind that variance is the average of the squared
differences between the data points and the mean. Variance is expressed in squared units. Standard
deviation, being the square root of that quantity, therefore measures the spread of the data about the
mean in the same units as the data. Said more formally, the standard deviation is the
root mean square (RMS) deviation of the values from their arithmetic mean. For example, in the
population {4, 8}, the mean is 6 and the deviations from the mean are {-2, 2}. Those deviations
squared are {4, 4}, the average of which (the variance) is 4. Therefore, the standard deviation is
2. In this case 100% of the values in the population lie exactly one standard deviation from the mean.
The standard deviation is the most common measure of statistical dispersion, measuring how
widely spread the values in a data set are. If the data points are close to the mean, then the
standard deviation is small. Conversely, if many data points are far from the mean, then the standard
deviation is large. If all the data values are equal, then the standard deviation is zero.
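The formula for the standard deviation of ungrouped data is
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \qquad (1.2)$$
The standard deviation for the data given in Q.1 is calculated as
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{\frac{376.888}{9}} \approx 6.47 \text{ hours}$$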
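The calculation can be reproduced step by step with the short Python sketch below, which assumes the same nine observations and divides by n as in formula (1.2). The names `hours`, `sum_sq` and `sd` are illustrative. The standard-library function `statistics.pstdev` uses the same population formula, so it serves as a cross-check.

```python
import math
from statistics import pstdev

# Weekly hours with mobiles from Q.1.
hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]

n = len(hours)
x_bar = sum(hours) / n

# Sum of squared deviations from the mean (numerator of formula 1.2).
sum_sq = sum((x - x_bar) ** 2 for x in hours)

# Population standard deviation: divide by n, then take the square root.
sd = math.sqrt(sum_sq / n)

print(f"sum of squares = {sum_sq:.3f}, sd = {sd:.2f}")
print(f"pstdev check   = {pstdev(hours):.2f}")
```

Note that `statistics.stdev` divides by n - 1 instead (the sample formula) and would therefore give a slightly larger value than the one reported here.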
(iii)
Mode: The mode is the most common (most frequent) value. A list can have more than one mode.
In the data set given in Q.1, the most frequent value is 46 (hours), which occurs three times.
Therefore, mode = 46 hours.
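Because a list can have more than one mode, a safe way to check the result is to count every value and keep all values that share the highest count. The sketch below does this with the standard library; the names `hours`, `counts` and `modes` are illustrative.

```python
from collections import Counter
from statistics import multimode

# Weekly hours with mobiles from Q.1.
hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]

# Frequency of each distinct value.
counts = Counter(hours)

# multimode returns every value that attains the highest frequency.
modes = multimode(hours)

print(counts.most_common())
print("mode(s):", modes)
```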
(iv)
The meaning of percentile can be captured by stating that the pth percentile of a
distribution is a number such that approximately p percent (p%) of the values in the distribution
are equal to or less than that number. So, if ‘28’ is the 80th percentile of a larger batch of
numbers, 80% of those numbers are less than or equal to 28. A percentile can be calculated
directly for values that actually exist in the distribution. To calculate percentiles, sort the data so
that x1 is the smallest value and xn is the largest, with n = total number of observations.
Statisticians have suggested various formulas for calculating percentiles; the formula below for
raw data is the simplest one.
Let X be the temperature and Y be the number of cones sold. Cones sold (Y) is the dependent variable,
because the sale of ice cream cones depends on temperature.
(ii)
If Y is the dependent variable and X is the independent variable, then the least-squares
estimated line is given by
Y = a + bX (2.1)
The normal equations used to estimate the coefficients a and b are
$$\sum y = na + b \sum x \qquad (2.2)$$
$$\sum xy = a \sum x + b \sum x^2 \qquad (2.3)$$
Therefore, from the above table calculations, the normal equations are
1060 = 7a + 560b (2.4)
97700 = 560a + 47600b (2.5)
Multiplying equation (2.4) by 80, we have
84800 = 560a + 44800b (2.6)
Subtracting equation (2.6) from (2.5), we have
12900 = 2800b
Therefore, b = 12900/2800 = 4.607
Substituting b = 4.607 in equation (2.4), we have
7a = 1060 - 560b = 1060 - 560(4.607) = -1519.92
Therefore, a = -217.131
Hence, the final least-squares estimated line is
Y = -217.131 + 4.607X (2.7)
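The elimination above can be cross-checked by solving the two normal equations in closed form. The sketch below assumes only the sums that can be read off equations (2.4) and (2.5): n = 7, Σx = 560, Σy = 1060, Σxy = 97700 and Σx² = 47600. The variable names are illustrative.

```python
# Sums taken from the normal equations (2.4) and (2.5).
n = 7
sum_x = 560
sum_y = 1060
sum_xy = 97700
sum_x2 = 47600

# Closed-form solution of the two normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = (sum_y - b * sum_x) / n                                   # intercept

print(f"b = {b:.3f}, a = {a:.3f}")
```

Without intermediate rounding the intercept comes out as about -217.143. The value -217.131 arises when b is first rounded to 4.607 before substitution.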
(iii)
Using the Excel data analysis tool to test the relationship for significance, we obtain the regression
output below:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.92218168
R Square             0.850419052
Adjusted R Square    0.820502862
Standard Error       45.72432925
Observations         7

ANOVA
              df    SS             MS             F           Significance F
Regression    1     59432.14286    59432.14286    28.42672    0.003110347
Residual      5     10453.57143    2090.714286
Total         6     69885.71429

                   Coefficients     Standard Error    t Stat          P-value    Lower 95%       Upper 95%
Intercept          -217.1428571     71.25622064       -3.047352992    0.02851    -400.3128035    -33.97291076
Temperature (X)    4.607142857      0.864108601       5.331671105     0.00311    2.385880985     6.828404729
Since the p-value for the slope (0.00311) is less than 0.05, the relationship between X and Y is statistically significant.
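The figures in the regression output are tied together by a few standard identities, which makes the table easy to sanity-check: MS = SS/df, F = MS(regression)/MS(residual), R² = SS(regression)/SS(total), and t = coefficient/standard error. The sketch below recomputes them from the printed values; the variable names are illustrative.

```python
# Values copied from the ANOVA and coefficient rows of the Excel output.
ss_reg, df_reg = 59432.14286, 1
ss_res, df_res = 10453.57143, 5
slope, slope_se = 4.607142857, 0.864108601

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res

f_stat = ms_reg / ms_res                 # F statistic
r_squared = ss_reg / (ss_reg + ss_res)   # share of variation explained
t_slope = slope / slope_se               # t statistic for the slope

print(f"F = {f_stat:.5f}, R^2 = {r_squared:.6f}, t = {t_slope:.6f}")
# With a single predictor, the squared t statistic equals F.
print(f"t^2 = {t_slope ** 2:.5f}")
```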
(iv)
To predict sales on a 95-degree day, put X = 95 in the equation Y = -217.131 + 4.607X:
Y = -217.131 + 4.607(95)
Y ≈ 220.5 (220.536 with the unrounded coefficients from the regression output)
3. According to a recent study conducted by an academic researcher on the international
placement of students from leading institutes in India, there is high variation in the salaries
offered across institutes. The following details have been gathered from the placement offices of
the colleges. The researcher wants to understand the trends in international
placement based on the data he has gathered.
Amount (in USD in
lakhs per annum)    Age    Marital Status    Type of institute    Gender
2                   35     Single            University           Male
5                   24     Married           PGDM                 Male
3.5                 29     Married           University           Female
5                   26     Single            University           Male
4                   26     Married           PGDM                 Female
8                   25     Single            PGDM                 Female
15                  34     Married           PGDM                 Male
3                   26     Single            PGDM                 Male
7                   23     Single            PGDM                 Male
a. Using descriptive statistics, explore salary and identify factors that appear to influence the
amount of the salary received.
Ans:
The frequency distributions of marital status, type of institute and gender are given
below:
Marital Status
              Frequency    Percent    Valid Percent    Cumulative Percent
Single        5            55.6       55.6             55.6
Married       4            44.4       44.4             100.0
Total         9            100.0      100.0

Type of institute
              Frequency    Percent    Valid Percent    Cumulative Percent
PGDM          6            66.7       66.7             66.7
University    3            33.3       33.3             100.0
Total         9            100.0      100.0

Gender
              Frequency    Percent    Valid Percent    Cumulative Percent
Female        3            33.3       33.3             33.3
Male          6            66.7       66.7             100.0
Total         9            100.0      100.0
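The frequency tables can be reproduced directly from the nine records in the question. The sketch below is one way to do it in Python; the helper name `freq_table` is hypothetical and the lists simply transcribe the categorical columns of the data table.

```python
from collections import Counter

# Categorical columns transcribed from the nine records in the question.
marital = ["Single", "Married", "Married", "Single", "Married",
           "Single", "Married", "Single", "Single"]
institute = ["University", "PGDM", "University", "University", "PGDM",
             "PGDM", "PGDM", "PGDM", "PGDM"]
gender = ["Male", "Male", "Female", "Male", "Female",
          "Female", "Male", "Male", "Male"]


def freq_table(values):
    """Return {category: (frequency, percent)} with percent to one decimal."""
    counts = Counter(values)
    total = len(values)
    return {cat: (k, round(100 * k / total, 1)) for cat, k in counts.items()}


for name, column in [("Marital Status", marital),
                     ("Type of institute", institute),
                     ("Gender", gender)]:
    print(name, freq_table(column))
```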
To explore salary and identify factors that appear to influence the amount of
the salary received, we use multiple regression. Running the multiple regression in SPSS,
we have
ANOVA (a)
Model 1       Sum of Squares    df    Mean Square    F       Sig.
Regression    50.148            4     12.537         .688    .637 (b)
Residual      72.852            4     18.213
Total         123.000           8
a. Dependent Variable: Amount (in USD in lakhs per annum)
b. Predictors: (Constant), Gender, Type of institute, Marital Status, Age
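The overall regression is not statistically significant (F = 0.688, p = 0.637). None of the predictors
(gender, type of institute, marital status, age) is individually significant either, because all of their
p-values in the coefficients table below are greater than 0.05. Statistically, therefore, this sample
gives no evidence that these factors influence the amount of the salary received.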
Coefficients (a)
                     Unstandardized Coefficients    Standardized Coefficients
Model 1              B         Std. Error           Beta                         t         P-value
(Constant)           -4.530    10.499                                            -.431     .688
Age                  .402      .420                 .438                         .959      .392
Marital Status       .875      3.237                .118                         .270      .800
Type of institute    -4.829    3.508                -.616                        -1.377    .241
Gender               .755      3.313                .096                         .228      .831
a. Dependent Variable: Amount (in USD in lakhs per annum)
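Since the question asks for descriptive statistics, a simple complement to the regression is to compare the mean salary within each category. The sketch below is not part of the original answer; it assumes the nine records from the question, and the helper name `group_means` is hypothetical. With only nine observations, any difference it shows is descriptive only.

```python
from collections import defaultdict

# Salary (USD in lakhs per annum) and categorical columns from the question.
amount = [2, 5, 3.5, 5, 4, 8, 15, 3, 7]
marital = ["Single", "Married", "Married", "Single", "Married",
           "Single", "Married", "Single", "Single"]
institute = ["University", "PGDM", "University", "University", "PGDM",
             "PGDM", "PGDM", "PGDM", "PGDM"]
gender = ["Male", "Male", "Female", "Male", "Female",
          "Female", "Male", "Male", "Male"]


def group_means(values, groups):
    """Mean of `values` within each level of `groups`."""
    buckets = defaultdict(list)
    for v, g in zip(values, groups):
        buckets[g].append(v)
    return {g: sum(vs) / len(vs) for g, vs in buckets.items()}


for name, column in [("Marital Status", marital),
                     ("Type of institute", institute),
                     ("Gender", gender)]:
    print(name, group_means(amount, column))
```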
b. Do a correlation analysis between ‘Amount’ and ‘Age’ and interpret the coefficient of
correlation.
Ans:
Correlation
In statistics, the word "correlation" has a very specific meaning. Statistical correlation
means that, given two variables X and Y measured for each case in a sample, variation in X
corresponds (or does not correspond) to variation in Y, and vice versa. That is, extreme values of
X are associated with extreme values of Y, and less extreme X values with less extreme Y
values. The correlation coefficient (Pearson r) measures the degree of this correspondence.
[Figure: possible explanations for a correlation: X influences Y; Y influences X; X and Y influence
each other; a third variable A influences both X and Y.]
Important: Correlation between two variables does not prove X causes Y or Y causes X.
Example: There is a statistical correlation between the temperature of sidewalks in New York
City and the number of infants born there on any given day.
Pearson r
There is a simple and straightforward way to measure correlation between two variables. It is
called the Pearson correlation coefficient (r), named after Karl Pearson, who invented it. Its
longer name, the Pearson product-moment correlation, is sometimes used.
r > 0 indicates a positive relationship of X and Y: as one gets larger, the other gets larger.
r < 0 indicates a negative relationship: as one gets larger, the other gets smaller.
r = 0 indicates no linear relationship.
Let's intuitively consider how this formula works. It starts by subtracting the means from X and
Y, and then multiplying the results. When we subtract the mean from a variable, some of the
resulting values will be positive and some negative. When we subtract the means from both X
and Y, that will happen with both variables.
Note also that if we calculate the Pearson correlation of X with itself, the result will be 1:
The correlation between Amount (in USD in lakhs per annum) (X) and Age (Y) is computed using the formula
$$r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}}$$
The calculations are shown in the table below.
Amount (X)    Age (Y)    (xi - X̄)    (yi - Ȳ)    (xi - X̄)(yi - Ȳ)    (xi - X̄)²    (yi - Ȳ)²
2             35         -3.833      7.444       -28.537             14.694       55.420
5             24         -0.833      -3.556      2.963               0.694        12.642
3.5           29         -2.333      1.444       -3.370              5.444        2.086
5             26         -0.833      -1.556      1.296               0.694        2.420
4             26         -1.833      -1.556      2.852               3.361        2.420
8             25         2.167       -2.556      -5.537              4.694        6.531
15            34         9.167       6.444       59.074              84.028       41.531
3             26         -2.833      -1.556      4.407               8.028        2.420
7             23         1.167       -4.556      -5.315              1.361        20.753
Total                                            27.833              123.000      146.222
Amount (X) is in USD in lakhs per annum; X̄ = 5.833 and Ȳ = 27.556.
$$r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}} = \frac{27.83}{\sqrt{123.0 \times 146.22}} = 0.2075$$
There is a positive correlation between Amount (in USD in lakhs per annum) and Age. Since r ≈ 0.21
is much closer to 0 than to 1, the positive correlation between the two variables is weak.
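The hand calculation can be verified with a short Python sketch of the Pearson formula. It assumes the Amount and Age columns from the question (with 3.5 as the third salary, as in the data table); the function name `pearson_r` is illustrative.

```python
import math

# Amount (USD in lakhs per annum) and Age from the nine records.
amount = [2, 5, 3.5, 5, 4, 8, 15, 3, 7]
age = [35, 24, 29, 26, 26, 25, 34, 26, 23]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(amount, age)
print(f"r = {r:.4f}")
# Correlating a variable with itself gives 1 (up to floating-point error).
print(f"r(amount, amount) = {pearson_r(amount, amount):.4f}")
```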