
SVKMS NMIMS University

Study Material Foundation Program

Prepared by: Dr. Shweta Dixit


Faculty shweta@nmims.edu

This study material is prepared by Dr. Shweta Dixit. It is strictly for private circulation not for sale.


Scope

Introduction to Statistics
Descriptive Statistics (Tabular & Graphical Summary)
Measures of Central Tendency
Measures of Dispersion (Variance)
Measures of Skewness
Correlation Analysis
Regression Analysis


2007, SVKM's NMIMS UNIVERSITY. All rights reserved.


UNIT 1 - INTRODUCTION TO STATISTICS

Objectives: After studying this unit, the student should be able to
Know about statistics and the branches of statistics
Know how useful statistical techniques are for decision making
Appreciate the help given by statistical techniques in business decisions
Know about population and sample
Differentiate between population and sample, and between parameter and statistic
Understand the concept of census versus sampling

Structure
2.1 Introduction
2.2 Types of statistics
2.3 Types of variables; types of data: discrete and continuous
2.4 Population and sample
2.5 Arranging data using Array and Frequency Distribution

2.1 Introduction. Statistics is the science of collecting, compiling, analysing and interpreting data. Statistical tools and techniques are used in all walks of life, and a variety of statistical techniques are used in business. Statistics can be classified as pure statistics and applied statistics. Pure statistics deals with developing the subject in a scientific way: theories are introduced and developed, and pure statisticians bring about new methods for collection, analysis and interpretation. In the applied field, the tools and techniques developed by the pure statisticians are applied to a variety of fields in life and are used to support decision making. The role played by statistics in business life is increasing day by day, and decisions in business are becoming more and more quantitative. Decision support systems (DSS) are a good example of the use of quantitative analysis to support the process of decision making.


Overview:
Statistics: Why is statistics required in so many college programs? What are the differences between the course taught in an engineering college, a physiology department, an arts college or a business college? The biggest difference is the examples used; the course content is basically the same. In business management we are more interested in things like profits, hours worked and wages. In a physiology department they are more interested in test scores. In engineering they are more interested in the number of units manufactured by a particular machine. However, all of them are interested in the typical value and in how much variation there is in the data.

Why do we require Statistics? Every day we come across data, for example in articles published in the newspaper, and our job is to interpret the published data or the gathered information. Before that, we have to ask ourselves whether the published information is sufficient to draw any conclusion. Is the sample size large enough to draw conclusions?

What is meant by Statistics? A collection of numerical information is called statistics. We also often present the information in graphical form, because it draws the attention of readers and is quite self-explanatory. Therefore statistics can be defined as:

Statistics: The science of collecting, organising, presenting and interpreting data to assist in making more effective decisions.

2.2 Types of statistics: Statistics is divided into two broad classifications:
Descriptive
Inferential

Descriptive Statistics: Methods of organising, summarising and presenting data in an informative way.

Ex: In XYZ garment company the average sales figures for different years are given below:

Year    Sale (Cr)
2000    150
2001    145
2002    200
2003    358
2004    545

Here only the existing characteristic (average sale) is described for particular years; therefore it is an example of descriptive statistics. But if, in the same case, we are interested in estimating the average sales figure of, say, 2008, then it is no longer descriptive, because we are trying to draw an inference from the data that is available to us. This comes under inferential statistics.

Inferential Statistics: It is also called statistical inference. Our main concern in inferential statistics is finding something about the population on the basis of a sample taken from the population.

Inferential Statistics: The methods used to determine something about the population on the basis of a sample.

2.3 Types of Variables: A variable is anything that takes more than one value. It can be either qualitative or quantitative.

Qualitative: When the characteristic to be studied is non-numeric. Examples: gender, religion, type of automobile owned, state of birth, colour of eyes etc. Qualitative data is often summarised in bar diagrams and charts.

Quantitative: When the variable under study can be reported numerically, it is quantitative data. Ex: age, life of an automobile battery, number of children in the family, number of males or females in the family.

Quantitative variables can be discrete or continuous. Discrete variables can assume only certain values, and there are usually gaps between the values. Examples of discrete variables are the number of bedrooms in a house (1, 2, 3 etc.) and the number of cars arriving at a shopping mall. Notice that the number of cars arriving can be 50 or 60 but it cannot be 56.5. Typically, discrete variables result from counting.

Observations of continuous variables can assume any value within a certain range. Examples are the air pressure in a tyre and the weight of a shipment of tyres. Air pressure can take decimal values, whereas a count takes whole numbers only. Typically, continuous variables result from measuring.

Types of Variables

Qualitative: Brand of PC, Marital Status, Hair Colour
Quantitative
    Discrete: Children in the Family, Strokes on a golf hole, TV sets owned
    Continuous: Amount of income tax paid, Weight of a student, Yearly rainfall in Mumbai

2.4 POPULATION AND SAMPLE

To use inferential statistics we need the concept of a population. A population is the collection of all elements having some particular characteristic. A population need not consist of people; it consists of all the items chosen for study. It may also be the total set of outcomes of an experiment. If the characteristic is being a student of Yashwantrao Chavan Maharashtra Open University, then the population consists of all students enrolled for a course in Yashwantrao Chavan Maharashtra Open University. The characteristic defines the scope of the population. If we consider the experiment of throwing a die, then all its possible outcomes {1, 2, 3, 4, 5, 6} form the population.

Populations can be finite or infinite. The number of cars displayed in a showroom is finite. The human population is so large that we treat it as infinite.

The characteristics of a population are called parameters. The age of students in Yashwantrao Chavan Maharashtra Open University is an attribute. If we know the age of all students in the university, we can represent it in tabular form and also find its average. To know the characteristics of the population, we need to record observations on each and every element (unit) of the population. Testing 100% of the units in the population is called a Census.

When the population is infinite, it may not be possible to take measurements on all its elements. Sometimes taking an observation involves the destruction of the unit: to find the average breaking strength of chalk we need to break the chalks, so 100% testing is not possible. Measuring all the elements of a population also takes more time, and the cost of measurement may be very high. So 100% enumeration is not always possible. In such cases we use only a part of the population, which is called a Sample.

A sample is a subset of the population that has the characteristics of the population. It is a small part of the population having the same attributes. A characteristic of a sample is called a statistic. The population is the whole, and a sample is a part or fraction of it.
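The distinction between a parameter and a statistic can be sketched in a few lines of Python, using the die-throwing population from the text. The sample size of 3 and the random seed are illustrative assumptions, not part of the study material; they only make the run repeatable.

```python
import random
from statistics import mean

# Population: all possible outcomes of throwing a die (from the text).
population = [1, 2, 3, 4, 5, 6]

# Parameter: a characteristic computed from every element (a census).
population_mean = mean(population)

# Sample: a part of the population. The sample size (3) and the seed (0)
# are assumptions chosen only for illustration.
rng = random.Random(0)
sample = rng.sample(population, 3)

# Statistic: the same characteristic computed from the sample only.
sample_mean = mean(sample)

print("Parameter (population mean):", population_mean)
print("Sample:", sample)
print("Statistic (sample mean):", sample_mean)
```

The parameter is fixed for a given population, while the statistic changes with the sample that happens to be drawn.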

Advantages of Sampling:
Easy: fewer units are studied, so approaching them is easier.
Economic: the cost of the survey to collect data is lower, as only selected units are studied.
Less time consuming.
Necessary when testing destroys the unit. Ex: when the breaking strength of an alloy is to be tested.


2.5 Arranging data using Array and Frequency Distribution

Array: It is the simplest way of arranging data. It arranges the data in ascending or descending order. It has many advantages over raw data:
We can quickly notice the lowest and the highest value in the data.
We can easily divide the data into sections.
We can see whether any value appears more than once.
We can observe the distance between succeeding values in the data.

Disadvantages: Despite these advantages, a data array is sometimes not very useful, because it is cumbersome: each and every unit has to be entered and listed. To avoid this, we need to compress the information while still being able to interpret it and take decisions based on the interpretation.

Frequency Distribution: It is also known as a frequency table. Ex: Let us take the example of the average inventory (in days) for 20 convenience stores. The data array for the 20 stores is:

Table-1 (arranged in ascending order)
2.0    3.4    3.4    3.8    3.8
4.0    4.1    4.1    4.1    4.2
4.3    4.7    4.7    4.8    4.9
4.9    5.5    5.5    5.5    5.5

Frequency Distribution: A frequency distribution is a table that organises the data into classes, that is, into groups of values describing one characteristic of the data. Here the average inventory is the characteristic.

Table-2
Class      Frequency (no. of observations in each class)
2.0-2.5    1
2.6-3.1    0
3.2-3.7    2
3.8-4.3    8
4.4-4.9    5
5.0-5.5    4

2.6 Characteristics of the Relative Frequency Distribution:

We can express the frequency of each class as a fraction of the total number of observations. The frequency of the average inventory of 4.4 to 4.9 days in Table-2 is 5, whereas in Table-3 it is 0.25. This 0.25 is a relative measure, obtained by dividing 5 by the total of all frequencies, i.e. 20.

Table-3
Class      Frequency (no. of observations in each class)    Relative frequency
2.0-2.5    1                                                0.05
2.6-3.1    0                                                0.00
3.2-3.7    2                                                0.10
3.8-4.3    8                                                0.40
4.4-4.9    5                                                0.25
5.0-5.5    4                                                0.20
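As a minimal sketch, the frequencies and relative frequencies for the inventory example can be recomputed from the data array. The function name frequency_table is a hypothetical helper introduced here for illustration; the data and class limits are those given in the text.

```python
# Average inventory (in days) for the 20 stores, copied from the data array.
inventory = [2.0, 3.4, 3.4, 3.8, 3.8, 4.0, 4.1, 4.1, 4.1, 4.2,
             4.3, 4.7, 4.7, 4.8, 4.9, 4.9, 5.5, 5.5, 5.5, 5.5]

# Class limits from the frequency table (both limits belong to the class).
classes = [(2.0, 2.5), (2.6, 3.1), (3.2, 3.7),
           (3.8, 4.3), (4.4, 4.9), (5.0, 5.5)]


def frequency_table(data, class_limits):
    """Return (lower, upper, frequency, relative frequency) for each class."""
    n = len(data)
    rows = []
    for lower, upper in class_limits:
        freq = sum(1 for x in data if lower <= x <= upper)
        rows.append((lower, upper, freq, freq / n))
    return rows


table = frequency_table(inventory, classes)
for lower, upper, freq, rel in table:
    print(f"{lower:.1f}-{upper:.1f}  {freq:2d}  {rel:.2f}")
```

Dividing each frequency by the total of 20 gives the relative frequency column; the relative frequencies always add up to 1.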

2. REPRESENTATION OF DATA
Objective: After studying this unit, the reader should be able to
Classify data
Know about frequency distributions
Construct a frequency distribution
Know relative and cumulative frequencies
Construct relative and cumulative frequency distributions
Know different graphs
Draw the graphs for the given data

Structure
2.1 Introduction
2.2 Classification of data
2.3 Construction of frequency table
2.4 Relative frequency distribution
2.5 Cumulative frequency distribution
2.6 Diagrammatic representation of data
2.7 Key words
2.8 Review exercise

2.1 INTRODUCTION
Statistics is a science of aggregates. It is the science of groups of numbers and deals with sets of data. The collected data is called raw data, which is just the collection of facts and figures from the population or sample. This data is arranged, displayed, summarized, compiled and analyzed. Compilation of data can be tabular or graphical. In tabular compilation we represent the data in the form of a table; it may be a simple classification by class interval. Graphs and charts are used to display the data diagrammatically.

2.2. CLASSIFICATION OF DATA


The collected data is classified using tally marks. Tally marks are small vertical lines used as symbols to represent a number. The raw data that is collected is classified using a frequency table. Classification gives a summary of the data. It facilitates comparison between the attributes of the variable under consideration and highlights the characteristics of the data. The basis for classification may be area, time, quality or quantity.

Area wise classification separates the data by geographical area. An example is the sales of different brands of colour televisions in different parts of the country. This type of data may give us information about the pattern of customer choices in different parts of the country. Accordingly, a colour television company can develop its advertising strategy for different parts of the country and different advertising media.

Time wise classification arranges the data chronologically. Time series data is data classified with respect to time. Time based classification may use a day, a month, a year, a decade etc. as its basis. The sensitive index of a stock market has a base of a day. The turnover of a company may have a base of a year.


Qualitative classification of data is according to some attribute or characteristic that cannot be measured. The classification may be dichotomous, i.e. into two categories, having and not having the attribute, or it may have several classes. The students in an MBA class may be classified by graduation degree as Science, Commerce, Arts, Engineering etc.

Quantitative classification is used when the characteristic under consideration is measurable. Examples are the salary of employees, the length of iron rods manufactured by a machine, and the quantity of a soft drink dispensed by a machine.

2.3 Construction of frequency table


The data is represented as a frequency table. First check whether the data is discrete or continuous. Continuous data is represented using class intervals.

Case I: Discrete data
When the data is discrete, we classify it as follows. Consider data on the number of employees working in small scale industries in a particular MIDC area.

21 22 26 24 28 21 23
25 26 25 25 26 28 21
24 26 23 25 21 25 26
23 24 28 26 28 28 25
26 25 21 24 29 28 23

To classify this data we use tally marks. For each entry in the data we put a vertical line as a tally mark against its value, to find how many 21s there are, how many 22s, and so on. Four marks are vertical lines, and the fifth is a slanting line that makes a bundle of five.

Number of employees    Tally marks    Frequency
21                     ||||/          5
22                     |              1
23                     ||||           4
24                     ||||           4
25                     ||||/ ||       7
26                     ||||/ ||       7
27                                    0
28                     ||||/ |        6
29                     |              1
Total                                 35
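The tallying procedure for discrete data can be sketched with Python's collections.Counter. The tally helper and the '||||/' notation for a bundle of five are assumptions made for this illustration; the data are the 35 observations given above.

```python
from collections import Counter

# Number of employees in each of the 35 small scale industries (from the text).
employees = [21, 22, 26, 24, 28, 21, 23, 25, 26, 25, 25, 26, 28, 21, 24, 26, 23, 25,
             21, 25, 26, 23, 24, 28, 26, 28, 28, 25, 26, 25, 21, 24, 29, 28, 23]

# Counter does the tallying: one count per distinct value.
counts = Counter(employees)


def tally(n):
    """Render n as tally marks: bundles of five ('||||/') plus the remainder."""
    bundles = ["||||/"] * (n // 5)
    remainder = ["|" * (n % 5)] if n % 5 else []
    return " ".join(bundles + remainder)


for value in range(min(employees), max(employees) + 1):
    freq = counts[value]  # a Counter returns 0 for a value that never occurs
    print(f"{value}  {tally(freq):<10}  {freq}")

print("Total", sum(counts.values()))
```

Looping over the full range from the smallest to the largest value keeps the row for 27, which has a frequency of 0.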

The frequencies are written by counting the number of tally marks against each value.

Case II: Frequency table for continuous data
For continuous data we make non-overlapping sub-classes called class intervals. The minimum and the maximum observation determine the range covered by the class intervals. The number of classes should be neither too large nor too small; generally it should be between 8 and 12, but there is no hard and fast rule. The classes indicate the parts of the whole range over which the observations are scattered.

A class is identified by its limits, which are called class limits or class boundaries. The class limits restrict the values that fall in the class. The lower end of the class is called the lower limit and the upper end is called the upper limit. Example: if the classes are 0-10, 10-20, 20-30, ..., 70-80, the lower limit of the class 20-30 is 20 and its upper limit is 30.

If the upper limit of a class is the same as the lower limit of the next class, the classes are called exclusive class intervals. In such a case an observation exactly equal to the upper limit is included in the next class, where that number is the lower limit.

We may also get classes where the upper limit of a class is not the same as the lower limit of the successive class. Then we need to make a continuity correction, because for a continuous variable there are always observations possible between any two numbers. Unequal upper and lower limits of successive classes indicate a discontinuity that would prevent the variable from taking values in that part of the range. Continuity correction is the process of making the class intervals continuous, without breaks.

The continuity correction is applied by the following method. The part of the range that is not included in any class is evenly distributed among the classes by extending the class limits. The correction factor is half of the difference between the lower limit of the second class and the upper limit of the first class.

Illustration: Suppose the class intervals are 0-9, 10-19, 20-29, 30-39 and so on, and the variable can take all possible values, integers as well as fractions. The values in 9-10, 19-20, 29-30 are not included in any class interval, so we redefine the class intervals so that there is no discontinuity and all classes have the same width. The excluded portions 9-10, 19-20 and so on each have width 1, so the correction factor is 0.5: we divide each unattended portion into two halves, add 0.5 to each upper limit and subtract 0.5 from each lower limit, giving -0.5 to 9.5, 9.5 to 19.5, 19.5 to 29.5 and so on. After applying the continuity correction we have class intervals of width 10 that take into consideration all the values the variable can take.

Activity: Make the following classes continuous by applying the continuity correction.

1. Class interval: 0-9.5, 10-19.5, 20-29.5, 30-39.5, 40-49.5
After applying the continuity correction we get the classes as
Class interval: ..........

2. Class interval: 1-2.99, 3-4.99, 5-6.99, 7-8.99
After applying the continuity correction we get the classes as
Class interval: ..........

3. Class interval: 101-150, 151-200, 201-250, 251-300, 301-350
After applying the continuity correction we get the classes as
Class interval: ..........
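The continuity correction described above can be sketched as a small function. The name continuity_correction is hypothetical, and the sketch assumes the gap between consecutive classes is the same throughout; it is run on the classes from the illustration, not on the activity.

```python
def continuity_correction(classes):
    """Extend class limits so that consecutive classes share a boundary.

    classes: list of (lower, upper) limits in ascending order, assumed to
    have the same gap between each upper limit and the next lower limit.
    """
    # Correction factor: half the difference between the lower limit of the
    # second class and the upper limit of the first class.
    correction = (classes[1][0] - classes[0][1]) / 2
    return [(lower - correction, upper + correction) for lower, upper in classes]


corrected = continuity_correction([(0, 9), (10, 19), (20, 29), (30, 39)])
print(corrected)
# [(-0.5, 9.5), (9.5, 19.5), (19.5, 29.5), (29.5, 39.5)]
```

After the correction, every upper limit equals the next lower limit, so no value of the variable is left outside the classes.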

The difference between the upper and the lower limit is called the class width. The class width indicates the span or size of the class. The class width should not be very small or very large. If the class is 20-30, the class width is 30 - 20 = 10. Generally the classes are formed such that the width is 5 or 10 or a multiple of 5.

Class intervals may also be open ended. For example: below 10, 10-20, 20-30, 30-40, 40-50, 50-60, 60 and above. The first and the last class intervals are open ended. Open ended classes are always at the extremes of the distribution; they cannot be in the middle.

The midpoint of the class is called the Class Mark. It is calculated as the sum of the upper and lower limits divided by two:
Class Mark = (Upper class limit + Lower class limit) / 2
With continuous data we are going to use the class mark very often.
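The class width and class mark formulas can be checked with a short sketch. The function names class_width and class_mark are hypothetical helpers; the classes are the example classes 0-10, 10-20 and 20-30 used in the text.

```python
def class_width(lower, upper):
    """Class width: difference between the upper and the lower class limit."""
    return upper - lower


def class_mark(lower, upper):
    """Class mark: midpoint of the class."""
    return (upper + lower) / 2


for lower, upper in [(0, 10), (10, 20), (20, 30)]:
    print(f"{lower}-{upper}: width {class_width(lower, upper)}, "
          f"class mark {class_mark(lower, upper)}")
```

For the class 20-30 this gives a width of 10 and a class mark of 25.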

2.4 Relative frequency distribution. The relative frequency gives the fraction of the total number of observations contained in a class. The relative frequency distribution shows the proportion of the observations lying in each of the non-overlapping class intervals. For data of n observations, the relative frequency is calculated as

Relative frequency of a class = (frequency of the class) / (total number of observations, n)

The relative frequency distribution is the tabular summary of the relative frequencies for all the classes.

Illustration: According to Beverage Digest, Coke Classic, Diet Coke, Dr. Pepper, Pepsi Cola and Sprite are the five top-selling soft drinks. The data below show the drinks selected in a sample of 50 soft drink purchases.

Frequency distribution of soft drink purchases
Soft drink      Frequency
Coke Classic    19
Diet Coke       8
Dr. Pepper      5
Pepsi Cola      13
Sprite          5

The relative frequency distribution is
Soft drink      Frequency    Relative frequency
Coke Classic    19           0.38
Diet Coke       8            0.16
Dr. Pepper      5            0.10
Pepsi Cola      13           0.26
Sprite          5            0.10
Total           50           1.00

Exercise: The data below give the time in days required to complete year-end audits for a sample of 20 clients of an accounting firm. Classify the data and find the relative frequency distribution.

Year-end audit times in days
12 14 19 18 23 22 21 33 14 28
15 15 18 18 17 16 20 13 27 22

We classify it in the following manner:
Audit time (days)    Frequency    Relative frequency
10-14                4            0.20
15-19                8            0.40
20-24                5            0.25
25-29                2            0.10
30-34                1            0.05
Total                20           1.00

Exercise: The staff of a doctor's office has studied the waiting times for patients who arrive at the office with a request for emergency service. The following data were collected over a one-month period. Waiting times are in minutes.

2 5 10 12 4 4 5 7 11 8
9 8 12 21 6 8 7 13 18 3

Use classes of 0-4, 5-9 etc.
Show the frequency distribution.
Show the relative frequency distribution.
What proportion of patients needing emergency service have a waiting time of 10-14 minutes?

2.5 Cumulative frequency distribution
A variation of the frequency distribution is the cumulative frequency distribution. Cumulative frequency shows the number of data items with values less than or equal to the upper class limit of each class, or the number of data items with values greater than or equal to the lower class limit of each class. That means cumulative frequencies are of two types: the less than or equal to type and the greater than or equal to type. Consider the following frequency distribution.

Class interval    Frequency    Cumulative frequency        Cumulative frequency
                               (less than or equal to)     (greater than or equal to)
0-10              5            5                           95 + 5 = 100
10-20             12           5 + 12 = 17                 83 + 12 = 95
20-30             17           17 + 17 = 34                66 + 17 = 83
30-40             26           34 + 26 = 60                40 + 26 = 66
40-50             20           60 + 20 = 80                20 + 20 = 40
50-60             17           80 + 17 = 97                3 + 17 = 20
60-70             3            97 + 3 = 100                3

In the less than or equal to type, the cumulative frequency counts the observations that are less than or equal to the upper limit of the class interval. How many observations are less than or equal to 10? They are 5. How many are less than or equal to 20? They are 5 + 12 = 17, and so on. How many are less than or equal to 70? All 100.

In the greater than or equal to type, the observations that are greater than or equal to the lower limit of the class interval are counted. Consider the last class interval, 60-70. How many observations are greater than or equal to 60? They are 3. How many are greater than or equal to 50? They are 3 + 17 = 20. In the same way, all 100 observations are greater than or equal to 0.

Exercise: For the following distribution, form the cumulative frequency distribution.

Audit time (days)    Frequency
10-14                4
15-19                8
20-24                5
25-29                2
30-34                1
Total                20
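The two types of cumulative frequency explained in this section can be sketched with itertools.accumulate, using the 0-10 to 60-70 distribution from the text (not the audit-time exercise). The variable names are illustrative.

```python
from itertools import accumulate

# Class intervals and frequencies of the distribution worked out in the text.
class_intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [5, 12, 17, 26, 20, 17, 3]

# Less than or equal to type: running total starting from the first class.
less_than = list(accumulate(frequencies))

# Greater than or equal to type: running total starting from the last class,
# then put back into the original class order.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), f, lt, gt in zip(class_intervals, frequencies, less_than, greater_than):
    print(f"{lower}-{upper}  {f:3d}  {lt:3d}  {gt:3d}")
```

The last less-than value and the first greater-than value both equal the total number of observations, which is a quick check on the arithmetic.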

2.6 Diagrammatic presentation of data
Data is presented using various diagrams for a better understanding of the patterns in it. Different graphs and diagrams are used depending on the purpose and the need. The most common are histograms, line diagrams, bar diagrams, frequency polygons, cumulative frequency curves (ogives) and pie diagrams.

Line diagrams: A line diagram presents two-variable data. One variable is plotted on the X axis and the second variable is plotted on the Y axis. The points are joined using lines. A line diagram shows the increase or decrease in the data. Line diagrams are used for time series data, where the year is plotted on the X axis and the value of the other variable is plotted on the Y axis, which shows the general tendency of the data as a whole. The following data give the wholesale price index for a certain period.

Year    Wholesale price index
1994    12.5
1995    8.1
1996    4.6
1997    4.4
1998    5.9
1999    3.3

Figure: Line diagram of the wholesale price index (Y axis) against years (X axis).

Activity: Draw a line diagram for the following data. The monthly sales of a pharmaceutical company over a one-year period are given below.

Month        Sales in lakhs of Rupees
Jan          384
Feb          356
March        389
April        401
May          410
June         412
July         415
August       423
September    425
October      420
November     418
December     405


Bar diagram: The data is presented using rectangular blocks (bars), horizontal or vertical. Bar diagrams are used for presenting the facts in the data. Articles in newspapers, magazines, journals etc. use bar diagrams to present the behaviour of a characteristic or attribute over space or time. Vertical bar diagrams show the characteristic or attribute on the X axis and the corresponding values on the Y axis. Horizontal bar diagrams use the axes in reverse order. We illustrate how to draw bar diagrams using the following example.

Example: The data below show the production of shirts in a manufacturing company.

Year    No. of shirts (00)
1990    52
1991    55
1992    56
1993    60
1994    57
1995    58
1996    56

Draw a vertical bar diagram.
Figure: Vertical bar diagram of the number of shirts produced (in hundreds) for the years 1990 to 1996.

Bar diagrams are very commonly used to compare observations. The height of each bar is directly proportional to the quantity it represents. Bar diagrams can be multiple bar diagrams as well as sub-divided bar diagrams. In multiple bar diagrams two or more sets of interrelated data are represented. The method remains the same as for a simple bar diagram. Sometimes the bars are shaded or given different colours, as they show different items.


Sub-Divided Bar Diagram: The parts of a total are represented as segments of one large bar. The size of each segment in a bar is directly proportional to the quantity it represents.

Illustration: The expenditure on different commodities is given for two families. The family incomes of the two families are 980 and 1260 respectively. The expenditures are as follows.

Commodity    Expenditure
             Family A    Family B
Food         300         400
Clothing     250         200
Education    50          360
Others       380         300

Figure: Expenditure of family A and B (sub-divided bar diagram; X axis: family, Y axis: expenditure as a percentage, 0% to 100%).
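A percentage sub-divided bar diagram needs each commodity's share of the family total. The sketch below computes those shares for the two families in the illustration; the function name percentage_shares is a hypothetical helper, and no plotting library is assumed.

```python
# Expenditure of the two families, taken from the illustration.
expenditure = {
    "Family A": {"Food": 300, "Clothing": 250, "Education": 50, "Others": 380},
    "Family B": {"Food": 400, "Clothing": 200, "Education": 360, "Others": 300},
}


def percentage_shares(items):
    """Convert each amount into its percentage of the total."""
    total = sum(items.values())
    return {name: 100 * amount / total for name, amount in items.items()}


shares = {family: percentage_shares(items) for family, items in expenditure.items()}
for family, parts in shares.items():
    print(family, {name: round(pct, 1) for name, pct in parts.items()})
```

Each family's segments add up to 100%, so bars of equal height can be compared even though the two incomes (980 and 1260) differ.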

Exercise: The following data give the oil-seeds crop production estimated for a season.

Oilseed       Production in lakh tonnes
              Area A    Area B
Groundnut     2.00      1.70
Soya bean     1.25      1.25
Sesame        0.75      0.20
Total         4.00      3.15

Multiple Bar Diagram: The expenditure on different commodities is given for two families. The family incomes of the two families are 980 and 1260 respectively. The expenditures are as follows.

Commodity    Expenditure
             Family A    Family B
Food         300         400
Clothing     250         200
Education    50          360
Others       380         300

Figure: Expenditure of family A and B (multiple bar diagram; X axis: family, Y axis: expenditure, 0 to 450).

Pie diagram: The circle of 360 degrees is divided into sectors according to the share of each component. The percentage of each component is converted into the corresponding number of degrees, and the sector is shaded or shown in a different colour.

Illustration: The class of Management in an institute is composed of students with graduation degrees as follows. Represent this data as a pie diagram.

Graduation degree    Number of students
Commerce             30
Science              15
Engineering          32
Pharmacy             10
Others               3

Figure: Pie diagram of the composition of students (sectors: Commerce, Science, Engineering, Pharmacy, Others).
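The conversion of each component's share into degrees can be sketched as follows for the student data in the illustration. The function name sector_angles is a hypothetical helper.

```python
# Number of students by graduation degree, from the illustration.
students = {"Commerce": 30, "Science": 15, "Engineering": 32, "Pharmacy": 10, "Others": 3}


def sector_angles(counts):
    """Angle of each sector: the component's share of the total times 360 degrees."""
    total = sum(counts.values())
    return {name: 360 * value / total for name, value in counts.items()}


angles = sector_angles(students)
for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

With 90 students in total, each student corresponds to 4 degrees, so Commerce (30 students) takes a sector of 120 degrees and the five sectors together fill the full 360 degrees.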

Exercise: Plot the pie chart for the data below.

Crop      Area in million hectares
Wheat     16.10
Rice      18.23
Jawar     3.50
Bajra     3.64
Maize     1.60
Total     43.07


Ogives: The cumulative frequency curves are called ogives. Smooth curves can be drawn for the less than or equal to type or the greater than or equal to type. The ogives show how many observations lie below or above certain values in the distribution, rather than the numbers within each interval. The general form of the ogives is as follows: on the X axis we plot the class limits and along the Y axis we plot the cumulative frequencies.

Figure: General form of the ogives (less than or equal to type and greater than or equal to type), with class limits on the X axis.

Illustration: Draw the cumulative frequency curve for the following data:

Height in cm    Number of children
150-154         10
154-158         12
158-162         20
162-166         10
166-170         8

Figure: Less than frequency curve (X axis: upper class limits; Y axis: less than or equal to cumulative frequency).

Figure: Greater than frequency curve (X axis: lower class limits; Y axis: greater than or equal to cumulative frequency).
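The points to be plotted for the two ogives of the height data can be sketched as below. The variable names are illustrative, and only the coordinates are computed; drawing the curves is left to any plotting tool.

```python
from itertools import accumulate

# Height classes (cm) and number of children, from the illustration.
classes = [(150, 154), (154, 158), (158, 162), (162, 166), (166, 170)]
children = [10, 12, 20, 10, 8]

# Less than ogive: plot each upper class limit against the running total.
upper_limits = [upper for _, upper in classes]
less_than_points = list(zip(upper_limits, accumulate(children)))

# Greater than ogive: plot each lower class limit against the number of
# observations at or above it (running total taken from the last class).
lower_limits = [lower for lower, _ in classes]
greater_than_points = list(zip(lower_limits, list(accumulate(reversed(children)))[::-1]))

print("Less than ogive points:   ", less_than_points)
print("Greater than ogive points:", greater_than_points)
```

The less than curve rises to the total of 60 children at the last upper limit, while the greater than curve starts at 60 at the first lower limit and falls.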

Exercise: The following data relate to factory size according to employment. Draw a less than curve and a more than curve for the data.

Employment size    No. of factories (in 1000)
0-50               31
50-100             29
100-200            70
200-500            63
500-1000           119
1000-2000          126
2000-5000          85

Below is the frequency distribution of weekly wages of 100 workers in a factory:

Weekly wages    No. of workers
120-124         3
125-129         5
130-134         12
135-139         23
140-144         31
145-149         10
150-154         8
155-159         5
160-164         3

Draw the ogive for the distribution, use it to determine the median wage of a worker, and verify the result by the formula.

2.8 Key words
Tally marks: Tally marks are the small vertical lines used as symbols to represent a number.
Class intervals: Non-overlapping sub-classes are called class intervals.
Class limits or class boundaries: A class is identified by its limits, which are called class limits or class boundaries.
Class Mark: The midpoint of the class is called the Class Mark.
Relative frequency: The relative frequency gives the fraction of the total number of observations contained in a class.
Cumulative frequency: Cumulative frequency shows the number of data items with values less than or equal to the upper class limit of each class, or the number of data items with values greater than or equal to the lower class limit of each class.

2.9 Review exercise
1. The following data give the income distribution of workers in two factories. Construct the relative frequency distributions and cumulative frequency distributions.

Income in 1000 Rs.    Number of workers
                      Factory 1    Factory 2
10-12                 10           25
12-14                 15           34
14-16                 65           40
16-18                 73           50
18-20                 70           30
20-22                 17           30
22-24                 10           10

2. The number of apartments in the complexes in a city are as below.

91 79 66 98 127 139 154 147 192
88 97 92 87 142 127 184 145 162
95 89 86 98 145 129 149 158 241

Construct a frequency distribution using intervals 66-87, 88-109, and so on up to the class beginning at 220. Also construct the relative frequency distribution as well as the cumulative frequency distribution.

3. For the distribution of marks below, construct the frequency distribution and the relative frequency distribution. Obtain the proportion of students scoring less than 40 marks. What is the proportion of students scoring more than 60 marks?

Marks less than    No. of students
20                 10
40                 35
60                 56
80                 90
100                100

4. Draw the ogive for the data in problem no. 3.

5. For the following data related to the age of policy holders, draw the histogram.

Age in years    No. of policy holders
20-25           8
25-30           12
30-35           24
35-40           16
40-45           15
45-50           5

7. The lives of a model of refrigerator, recorded in a recent survey, are given below. Draw the ogives.

Life (no. of years)    Number of refrigerators
0-2                     5
2-4                    16
4-6                    13
6-8                     7
8-10                    5
10-12                   4

8. The table below shows the annual sales ($ millions) of Speedcall mobile phones for a random sample of 150 outlets.

Annual sale of Speedcall mobile phones ($ million)    Number of outlets
5-9                                                    18
10-14                                                  35
15-19                                                  41
20-24                                                  21
25-29                                                  15
30-34                                                  13
35-39                                                   7

a) What proportion of outlets has annual sales of Speedcall mobile phones of at least $ 30 million?
b) What proportion of outlets has sales in the range 20-24?
c) What proportion of outlets has annual sales of Speedcall mobile phones of at most $ 29 million?

************


Chapter-3

3. MEASURES OF CENTRAL TENDENCY


Objectives
At the end of this unit, the student should be able to:
Know what central tendency is
Compute the measures of central tendency
Differentiate between the measures of central tendency

Structure
3.1 Introduction
3.2 Characteristics of averages
3.3 Measures of central tendency
3.4 Change of Scale and Origin
3.5 Arithmetic mean with change of scale and origin
3.6 Median
3.7 Fractiles
3.8 Mode
3.9 Weighted mean
3.10 Combined means
3.11 Properties of averages
3.12 Choice of an Average
3.13 Comparison of Mean Mode and Median
3.14 Review exercise

3.1 Introduction: A whole data set is difficult to remember, so it is summarised by a single number. To reduce complexity and make data sets comparable, we define measures of central tendency. A measure of central tendency is an average. This average must be representative of the whole data set and should lie near its central value. A good average also smooths out the abnormalities in the data.

3.2 Characteristics of averages
To represent the data as faithfully as possible, an average should have the following properties.
a) It must be rigidly defined, which makes the result fixed and independent of the user.
b) It should be based on all observations. An average that takes every observation in the data set into consideration represents all the values. It changes whenever any observation in the data set is altered, so the changed value is also represented in the new average.
c) Ease and rapidity in calculation. An average should not be complex to calculate. Simplicity and rapidity in calculation make an average popular in use.
d) Least affected by fluctuations of sampling. From a population we can draw many samples, and the population average may be known or unknown. The averages of different samples should not be very different from each other or from the population average. If the sample average differs greatly from the population average, the sample is not a true representative of the population.
e) It should be amenable to further mathematical treatment, such as computing the combined average of several series. If further mathematical calculations are not possible, the average is of limited use.

3.3 Measures of central tendency: The measures of central tendency are classified as
Mean
Median
Mode

ARITHMETIC MEAN. The most popular measure of central tendency is the arithmetic mean. The arithmetic mean is obtained by adding all the observations and dividing the sum by the number of observations.

CASE I Raw Data
When the collected data are used as they are, without any classification, the arithmetic mean is calculated as

$$\text{A.M.} = \frac{\text{sum of all observations}}{\text{total number of observations}}$$

Let X1, X2, ..., Xn be n observations. Then

$$\text{A.M.} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

Example: Marks of 10 students in a class are observed. What are the average marks of the students?
Marks: 90 100 110 95 98 97 101 102 93 94

The average is obtained by taking the ratio of the sum of all observations to the total number of observations. Hence the average marks are
(90 + 100 + 110 + 95 + 98 + 97 + 101 + 102 + 93 + 94) / 10 = 980 / 10 = 98
The average is 98.

Exercise: Calculate the arithmetic mean of the following data
25 29 27 23 31 32 24 28 26
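The raw-data calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and does not come from the study material.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


# Marks of the 10 students from the worked example.
marks = [90, 100, 110, 95, 98, 97, 101, 102, 93, 94]
print(sum(marks))              # 980
print(arithmetic_mean(marks))  # 98.0
```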

Case II Discrete classified data (Ungrouped)
When data are classified in a frequency distribution, it becomes easy to calculate the mean. When the observations and their occurrences, in terms of frequencies, are given, we calculate the arithmetic mean as follows:

Observations   X1   X2   X3   X4   X5   ...   Xn
Frequency      f1   f2   f3   f4   f5   ...   fn

Then

$$\text{A.M.} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n}$$

Illustration: Calculate the arithmetic mean of the following

Xi   5   10   15   20   25   30
fi   4    6    8   12   10    5

To calculate the arithmetic mean we calculate the sum of the products of fi and Xi:

Xi    fi    fi xi
 5     4     20
10     6     60
15     8    120
20    12    240
25    10    250
30     5    150

Sum of fi = 45, sum of fi xi = 840

$$\text{A.M.} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{840}{45} = 18.66667$$
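The frequency-weighted sum in this illustration can be reproduced as follows. This is a sketch under the assumption that the observations and their frequencies are held in two parallel lists; the name `frequency_mean` is illustrative only.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of a discrete frequency distribution: sum(f*x) / sum(f)."""
    if len(values) != len(freqs):
        raise ValueError("values and freqs must have the same length")
    total_f = sum(freqs)
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    total_fx = sum(f * x for x, f in zip(values, freqs))
    return total_fx / total_f


x = [5, 10, 15, 20, 25, 30]
f = [4, 6, 8, 12, 10, 5]
print(sum(fi * xi for xi, fi in zip(x, f)))  # 840
print(round(frequency_mean(x, f), 5))        # 18.66667
```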

Exercise: The following data are the daily price quotations for a certain stock over a period of 45 days. Classify the data and find the average price.

10 11 10 11 12 12 13 14 16 15 11 18 19 20 15
14 14 20 13 15 14 12 10 11 19 16 18 17 13 12
15 14 12 16 18 19 14 12 13 16 18 19 17 15 14

Case III: Continuous classified data (Grouped Data): When continuous data are given, we can calculate the arithmetic mean by assuming that the frequencies are concentrated at the mid point of each class interval. The mid point of the class is discussed in unit 6. To calculate the arithmetic mean, use the class mid point as xi and the frequencies as given.

Example: Calculate the average weight of a student in a class.

Weight in kg     30-35   35-40   40-45   45-50   50-55
No. of students    5      15      20      18       2

Solution: We calculate the mid point of every class by

$$x_i = \text{mid point} = \frac{\text{lower limit} + \text{upper limit}}{2}$$

xi   32.5   37.5   42.5   47.5   52.5
fi    5     15     20     18      2

$$\text{Arithmetic mean} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{2535}{60} = 42.25$$

This method does not give the exact value of the arithmetic mean. The reason is that we do not use the actual values: to keep the work simple we group them into classes, and only the mid point of each class stands in for all the values in that range. If the actual values are used, the arithmetic mean is more accurate. When the data set is very large we cannot work with the actual observations, so this average serves as an estimate of the population mean.
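The mid-point method for grouped data can be sketched in Python as below. The representation of each class as a `(lower limit, upper limit)` pair and the name `grouped_mean` are assumptions made for this illustration.

```python
def grouped_mean(class_limits, freqs):
    """Approximate mean of grouped data, treating each class as
    concentrated at its mid point (lower limit + upper limit) / 2."""
    mid_points = [(lower + upper) / 2 for lower, upper in class_limits]
    total_f = sum(freqs)
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * m for m, f in zip(mid_points, freqs)) / total_f


# Weight distribution from the worked example.
weight_classes = [(30, 35), (35, 40), (40, 45), (45, 50), (50, 55)]
students = [5, 15, 20, 18, 2]
print(grouped_mean(weight_classes, students))  # 42.25
```

Because only the mid points enter the calculation, the result is an estimate; running the same function on finer classes of the same data would generally give a slightly different value.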

Merits and Limitations of arithmetic mean.

Merits
1. It is rigidly defined
2. It is based on all observations
3. It can be used for further mathematical calculations
4. It is least affected by the fluctuations of sampling
5. It can be used for comparisons

Limitations:
1. It has to be calculated. It cannot be graphically located.
2. It is affected by extreme observations
3. It changes if any of the observations change
4. When a class interval is open ended, the arithmetic mean cannot be calculated.

3.6 Median:

To overcome the limitation that the mean cannot be calculated for open-ended class intervals, we now define another measure of central tendency called the median. The median is the middle point of the frequency distribution: the point below which lie 50 % of the observations and above which lie 50 % of the observations. The median divides the distribution into two equal halves. The median is a positional average, as its value depends upon position and not on magnitude.

Case I: When raw data is given.

Let X1, X2, ..., Xn be the n observations given. Arrange them in ascending order.

If n is odd, the $\left(\frac{n+1}{2}\right)$th observation is the median.

If n is even, the arithmetic mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations is the median.

Illustration: Locate the median in the following data.
19, 15, 12, 14, 18, 20, 22, 16, 17
Solution: Number of observations = 9
As n = 9 is odd, we locate the $\left(\frac{n+1}{2}\right)$th value in the observations arranged in ascending order.

12, 14, 15, 16, 17, 18, 19, 20, 22

$\frac{n+1}{2} = \frac{9+1}{2} = 5$. In the ascending ordered values the 5th value, 17, is the median.

12, 14, 15, 16 (50 % below)   17   18, 19, 20, 22 (50 % above)

Example: Calculate the median of the following
20 22 18 27 32 35 25 23
Solution: Number of observations = 8
As n = 8 is even, we locate the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations after arranging all observations in ascending order.

18 20 22 23 25 27 32 35

$\frac{n}{2} = 4$ and $\frac{n}{2} + 1 = 5$.
The 4th observation is 23 and the 5th observation is 25.
The mean of 23 and 25 is $\frac{23 + 25}{2} = 24$, so the median is 24.
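The odd and even rules for raw data can be combined in one short function. This is a minimal sketch; the name `raw_median` is illustrative, and Python's 0-based indexing means the text's kth observation is `ordered[k - 1]`.

```python
def raw_median(values):
    """Median of raw data: the middle value when n is odd,
    the mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    mid = n // 2
    if n % 2 == 1:
        # ((n + 1) / 2)th observation in 1-based counting
        return ordered[mid]
    # mean of the (n / 2)th and (n / 2 + 1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(raw_median([19, 15, 12, 14, 18, 20, 22, 16, 17]))  # 17
print(raw_median([20, 22, 18, 27, 32, 35, 25, 23]))      # 24.0
```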

Exercise: Locate the median of the following data.

a) 59, 53, 57, 56, 52, 51, 59, 60, 61, 54

b) 70, 79, 74, 69, 73, 72, 73, 76, 80

c) For the following company price-earnings ratios, find the median.

Arbour Software           39
Ascent                    35
Compaq                    17
EFI                       25
LSI                       30
Sierra Semiconductors     27
Teradyne                  33
Texas Instruments         26

Case II: When data is discrete and classified as frequency distribution.

Here we first discuss the cumulative frequency.

Cumulative frequencies are of two types: the less than or equal to type and the greater than or equal to type. Less than or equal to frequencies indicate the number of observations that are less than or equal to the upper limit of the respective classes, and greater than or equal to frequencies indicate the number of observations that are greater than or equal to the lower limit of the respective classes. Consider the following frequency distribution. We calculate the cumulative frequencies as in the table below:

Class Interval   Frequency   Cumulative frequency     Cumulative frequency
                             Less than equal to       Greater than equal to
 0-10              5         5                        95 + 5 = 100
10-20             12         5 + 12 = 17              83 + 12 = 95
20-30             17         17 + 17 = 34             66 + 17 = 83
30-40             26         34 + 26 = 60             40 + 26 = 66
40-50             20         60 + 20 = 80             20 + 20 = 40
50-60             17         80 + 17 = 97             3 + 17 = 20
60-70              3         97 + 3 = 100             3

In the less than or equal to type cumulative frequency, we count the observations that are less than or equal to the upper limit of each class interval. How many observations are less than or equal to 10? They are 5. How many observations are less than or equal to 20? They are 5 + 12 = 17, and so on. How many are less than or equal to 70? They are all 100.

In the greater than or equal to type cumulative frequency, we count the observations that are greater than or equal to the lower limit of each class interval. Consider the last class interval 60-70. How many observations are greater than or equal to 60? They are 3. How many are greater than or equal to 50? They are 3 + 17 = 20. In the same way, all 100 are greater than or equal to 0.

Exercise: Find the cumulative frequencies of both types for the following distribution:

Class interval   0-5   5-10   10-15   15-20   20-25   25-30   30-35
Frequencies       3     7      10      12      18       8       2

The same is true in case of discrete data.

Marks   No. of students   Cumulative frequencies   Cumulative frequencies
                          (less than type)         (greater than type)
20        10                10                       75
21        13                23                       65
22        15                38                       52
23        17                55                       37
24        11                66                       20
25         9                75                        9
Total     75
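Both kinds of cumulative frequency are running totals, one taken from the top of the table and one from the bottom. The sketch below uses `itertools.accumulate` from the Python standard library; the function name `cumulative_frequencies` is an illustrative choice.

```python
from itertools import accumulate


def cumulative_frequencies(freqs):
    """Return (less-than type, greater-than type) cumulative frequencies."""
    less_than = list(accumulate(freqs))
    # Running total taken from the last class upwards, then put back in order.
    greater_than = list(accumulate(reversed(freqs)))[::-1]
    return less_than, greater_than


# Continuous example: classes 0-10, 10-20, ..., 60-70.
class_freqs = [5, 12, 17, 26, 20, 17, 3]
print(cumulative_frequencies(class_freqs))
# ([5, 17, 34, 60, 80, 97, 100], [100, 95, 83, 66, 40, 20, 3])

# Discrete example: marks 20, 21, ..., 25.
mark_freqs = [10, 13, 15, 17, 11, 9]
print(cumulative_frequencies(mark_freqs))
# ([10, 23, 38, 55, 66, 75], [75, 65, 52, 37, 20, 9])
```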

To locate the median of the discrete classified distribution

We know that $\sum_{i=1}^{n} f_i = N$. If N is odd, the $\left(\frac{N+1}{2}\right)$th observation is the median. If N is even, the average of the $\left(\frac{N}{2}\right)$th and $\left(\frac{N}{2}+1\right)$th observations is the median.

Example: Consider the following distribution.

Marks   No. of students   Cumulative frequency (less than type)
20        10                10
21        13                23
22        15                38
23        17                55
24        11                66
25         9                75
Total     75

N = 75 is odd, hence we calculate $\frac{N+1}{2} = \frac{76}{2} = 38$.

If we write these marks as raw data, the first 10 numbers will be 20, the next 13 numbers will be 21, the next 15 numbers will be 22 and so on. So the 38th observation, 22, is the median.

Illustration: Calculate the median of the following

Marks   No. of students   Cumulative frequency (less than type)
 4         2                 2
 5         6                 8
 6        10                18
 7        20                38
 8        15                53
 9         8                61
10         7                68
11         2                70
Total     70

N = 70 is even, $\frac{N}{2} = 35$ and $\frac{N}{2} + 1 = 36$, so the average of the 35th and 36th observations is the median. The 19th to 38th observations are all 7, so the median is $\frac{7 + 7}{2} = 7$. The median is 7.
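The idea of reading the kth observation off the cumulative frequency column can be sketched as follows. The helper assumes the observed values are listed in ascending order; the names `discrete_median` and `kth_observation` are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate


def discrete_median(values, freqs):
    """Median of a discrete frequency distribution (values in ascending order)."""
    cumulative = list(accumulate(freqs))  # less-than type cumulative frequency
    n = cumulative[-1]

    def kth_observation(k):
        # First value whose cumulative frequency reaches k (k is 1-based).
        return values[bisect_left(cumulative, k)]

    if n % 2 == 1:
        return kth_observation((n + 1) // 2)
    return (kth_observation(n // 2) + kth_observation(n // 2 + 1)) / 2


# N = 75 (odd): the 38th observation is the median.
print(discrete_median([20, 21, 22, 23, 24, 25], [10, 13, 15, 17, 11, 9]))  # 22

# N = 70 (even): the mean of the 35th and 36th observations is the median.
print(discrete_median([4, 5, 6, 7, 8, 9, 10, 11], [2, 6, 10, 20, 15, 8, 7, 2]))  # 7.0
```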

Exercise: Locate the median for the following data.

1.
No.         90   92   94   96   98   100
Frequency   12   25   43   40   20    10

2.
Xi   20   22   27   32   35
fi    5   12   20   13    5

Case III: Continuous data:

When continuous data are given, we calculate the median using the following formula.

$$\text{Md} = \text{Median} = l + \frac{\frac{N}{2} - F}{f} \times h$$

The median class is identified using the less than cumulative frequency: it is the first class whose cumulative frequency is greater than or equal to $\frac{N}{2}$. The median lies in this class.
l: lower limit of the median class
F: cumulative frequency of the class preceding the median class
f: frequency of the median class
h: class width of the median class

Example: Calculate the median of the following distribution.

Class Interval   Frequency
20-25              4
25-30             20
30-35             28
35-40             18
40-45             12
45-50              8
Total             90

Solution: Get the cumulative frequency first.

Class Interval   Frequency   Cumulative Frequency
20-25              4           4
25-30             20          24
30-35             28          52
35-40             18          70
40-45             12          82
45-50              8          90
Total             90

Calculate $\frac{N}{2} = \frac{90}{2} = 45$ to get the median class. The first class interval with cumulative frequency greater than or equal to 45 is 30-35, so the median class is 30-35. Substitute in the formula with lower limit l = 30, $\frac{N}{2} = 45$, F = cumulative frequency of the class preceding the median class = 24, f = frequency of the median class = 28 and h = class width = 5.

$$\text{Md} = 30 + \frac{45 - 24}{28} \times 5 = 30 + \frac{21}{28} \times 5 = 30 + \frac{15}{4} = 33.75$$
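A sketch of the grouped-median formula in Python is given below. Each class is represented as a `(lower limit, upper limit)` pair, which is an assumption of this illustration, as is the name `grouped_median`; the variable `cum_before` plays the role of F.

```python
def grouped_median(class_limits, freqs):
    """Median of grouped data: l + (N/2 - F) / f * h, where the median class
    is the first class whose cumulative frequency reaches N/2."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cum_before = 0  # F: cumulative frequency of the preceding classes
    for (lower, upper), f in zip(class_limits, freqs):
        if f > 0 and cum_before + f >= half:
            return lower + (half - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("median class not found")


classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
frequencies = [4, 20, 28, 18, 12, 8]
print(grouped_median(classes, frequencies))  # 33.75
```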

Activity: Heights in cm are given below. Find the median height.

Height in cm     No. of persons
140-145            2
145-150           14
150-155           24
155-160           15
160-165            5
Total             60

Merits and limitations of Median

Merits
1. It is easy to understand and easy to calculate
2. It is rigidly defined
3. It can be calculated even if the class intervals are open ended.
4. It is not affected by extreme values / observations
5. It can be located graphically also

Limitations:
1. It cannot be used for further mathematical calculation.
2. It does not depend upon all observations
3. In case of raw data, if n is even the median is estimated by taking the average of the two middle observations.
4. As compared to the mean, it is much affected by fluctuations of sampling
5. The assumption that the frequencies are uniformly distributed over the class interval may not be practical
6. It is not possible to define terms such as weighted median.

3.7. Fractiles

There are some other positional measures: quartiles, deciles and percentiles. Quartiles are those three points which divide the distribution into four equal parts (quarters). Deciles are those nine points which divide the distribution into ten equal parts. The first quartile is called Q1, the second quartile Q2 and the third quartile Q3. Below Q1 lie 25 % of the observations and above Q1 lie 75 % of the observations. Q2 is the point which has 50 % of the observations below it and 50 % above it. Q3 has 75 % of the observations below it and 25 % above it.

[Figure: a line divided at Q1, Q2 and Q3, showing 25 % of observations below Q1 and 75 % above it, 50 % on either side of Q2, and 75 % below Q3 and 25 % above it.]

The second quartile Q2 coincides with the median. The general formula is:

$$Q_i = L + \frac{\frac{iN}{4} - F}{f} \times h, \qquad i = 1, 2, 3$$

The class in which the ith quartile lies is decided using the $\frac{iN}{4}$ value. The first class whose cumulative frequency is more than or equal to $\frac{iN}{4}$ is the ith quartile class.

Qi: ith quartile, i = 1, 2, 3
L: lower limit of the quartile class
N: total number of observations, $N = \sum_{i=1}^{n} f_i$
F: cumulative frequency of the class preceding the quartile class
f: frequency of the quartile class
h: width of the quartile class

Example: The salary distribution of teachers in a university is as follows. Find the quartiles Q1, Q2 and Q3.

Class Interval    Frequency   Cumulative Frequency
 8000-9000          20          20
 9000-10000         27          47
10000-11000         35          82
11000-12000         42         124
12000-13000         33         157
13000-14000         23         180
14000-15000         10         190
Total              190

Solution: To find Qi we first calculate the cumulative frequencies. They are: 20, 47, 82, 124, 157, 180, 190.

To calculate the first quartile Q1 we check the class in which the first quartile lies.

$$\frac{N}{4} = \frac{190}{4} = 47.5$$

The first class interval whose cumulative frequency is greater than or equal to 47.5 is 10000-11000. Now we use the formula for quartile 1:

$$Q_1 = l + \frac{\frac{N}{4} - F}{f} \times h = 10000 + \frac{47.5 - 47}{35} \times 1000 = 10000 + \frac{0.5}{35} \times 1000 = 10000 + 14.2857 = 10014.286$$

Similarly we calculate Q2, the median. $\frac{N}{2} = 95$, so the median class is 11000-12000.

$$Q_2 = \text{Md} = l + \frac{\frac{N}{2} - F}{f} \times h = 11000 + \frac{95 - 82}{42} \times 1000 = 11000 + \frac{13}{42} \times 1000 = 11309.524$$

Similarly we calculate Q3. The third quartile lies in the first class whose cumulative frequency is greater than or equal to

$$\frac{3N}{4} = \frac{3 \times 190}{4} = 142.5$$

The class is 12000-13000. Substituting in the formula:

$$Q_3 = l + \frac{\frac{3N}{4} - F}{f} \times h = 12000 + \frac{142.5 - 124}{33} \times 1000 = 12000 + \frac{18.5}{33} \times 1000 = 12560.606$$

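The three quartiles of the salary example can be recomputed with one function, as in the sketch below. The `(lower limit, upper limit)` representation of classes and the name `grouped_quartile` are assumptions of this illustration.

```python
def grouped_quartile(i, class_limits, freqs):
    """i-th quartile (i = 1, 2, 3) of grouped data: L + (i*N/4 - F) / f * h."""
    if i not in (1, 2, 3):
        raise ValueError("i must be 1, 2 or 3")
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    target = i * n / 4
    cum_before = 0  # F: cumulative frequency before the quartile class
    for (lower, upper), f in zip(class_limits, freqs):
        if f > 0 and cum_before + f >= target:
            return lower + (target - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("quartile class not found")


salary_classes = [(8000, 9000), (9000, 10000), (10000, 11000), (11000, 12000),
                  (12000, 13000), (13000, 14000), (14000, 15000)]
teachers = [20, 27, 35, 42, 33, 23, 10]

for i in (1, 2, 3):
    print(i, round(grouped_quartile(i, salary_classes, teachers), 3))
# 1 10014.286
# 2 11309.524
# 3 12560.606
```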
Exercise:

1. Frequency distribution of the weight (pounds) of 100 persons

Weight          Frequency
130.5-140.5       10
140.5-150.5       20
150.5-160.5       30
160.5-170.5       20
170.5-180.5       10
180.5-190.5       10

Find the maximum weight of the lightest 25 %. Find the minimum weight of the heaviest 25 %.

Deciles: Deciles are those nine points which divide the distribution into 10 equal segments. So we have 9 points, D1, D2, ..., D9. For the first decile, 10 % of the data lie below and 90 % of the data lie above. For the second decile, 20 % lie below and 80 % lie above. Similarly for D9, 90 % of the data lie below and 10 % lie above.

[Figure: a line divided at D1, D2, ..., D9 into ten segments, each containing 10 % of the observations.]

The general formula for deciles is

$$D_j = l + \frac{\frac{jN}{10} - F}{f} \times h, \qquad j = 1, 2, 3, \dots, 9$$

where Dj is the jth decile.

The decile class is the first class whose cumulative frequency is greater than or equal to jN/10.
l: lower limit of the decile class
N: total number of observations, $N = \sum_{i=1}^{n} f_i$
F: cumulative frequency of the class preceding the decile class
f: frequency of the decile class
h: width of the decile class

Deciles are used in situations such as the following:
a) Given the distribution of salaries of employees in a company, we can locate the minimum salary of the highest paid 10 % of employees, the maximum salary of the lowest paid 20 % of employees, or the range of salary paid to the middle 40 % of employees.
b) Given the distribution of marks of students in an examination, we can find the minimum marks required to pass if 40 % are failing, or the cut-off if a merit scholarship is to be awarded to the top 10 % of students, etc.

Example: The distribution of daily collection of milk in different milk collection centres in a district is as follows:

Milk collected   No. of Centres   Cumulative Frequency
450-500             3                3
500-550            15               18
550-600            25               43
600-650            17               60
650-700            10               70
700-750             5               75
Total              75

a) Find the minimum milk collected in the 10 % highest milk collection centres.
b) Find the maximum milk collected in the 20 % lowest milk collection centres.

Solution:
a) Minimum milk collected in the 10 % highest milk collection centres.

The distribution runs from 450 to 750. For the 10 % highest milk collection centres we calculate D9, the point above which 10 % of the observations lie and below which 90 % of the observations lie.

$$D_9 = l + \frac{\frac{9N}{10} - F}{f} \times h$$

D9 lies in the first class interval whose cumulative frequency is greater than or equal to

$$\frac{9N}{10} = \frac{9 \times 75}{10} = 67.5$$

That class interval is 650-700.

$$D_9 = 650 + \frac{67.5 - 60}{10} \times 50 = 650 + 7.5 \times 5 = 687.5$$

b) Maximum milk collected in the 20 % lowest milk collection centres.

$$D_2 = l + \frac{\frac{2N}{10} - F}{f} \times h$$

The second decile D2 lies in the first class interval whose cumulative frequency is greater than or equal to

$$\frac{2N}{10} = \frac{2 \times 75}{10} = 15$$

That class interval is 500-550.

$$D_2 = 500 + \frac{15 - 3}{15} \times 50 = 500 + \frac{12}{15} \times 50 = 540$$

We note here that the fifth decile coincides with the second quartile and the median.

Exercise: The following is the distribution of completion times for the first 200 women finishers in a cross-country ski race.

Time in minutes   Frequency
100-114              4
115-129              5
130-144              5
145-159             18
160-174             18
175-189             27
190-204             44
205-219             40
220-234             39

What is the minimum time taken by the slowest 20 % of women? What is the range of the time taken by the middle 40 % of women?

Percentiles: As defined above, quartiles and deciles are the partition points which divide the distribution into 4 equal parts and 10 equal parts. Percentiles are those 99 points which divide the distribution into 100 equal parts. Percentile 23 is denoted by P23 and indicates that 23 % of the observations lie below this value and 77 % of the observations lie above it.

The general formula for percentiles is

$$P_k = l + \frac{\frac{kN}{100} - F}{f} \times h, \qquad k = 1, 2, \dots, 99$$

where Pk is the kth percentile.

l: lower limit of the class in which the kth percentile lies (the first class whose cumulative frequency is greater than or equal to $\frac{kN}{100}$)
N: total number of observations, $N = \sum_{i=1}^{n} f_i$
F: cumulative frequency of the class preceding the class interval in which the kth percentile lies
f: frequency of the class in which the kth percentile lies
h: class width of the class in which the kth percentile lies

Example: The life distribution of the bulbs installed in a building is given below.

Life in hours   No. of bulbs
2000-2500          15
2500-3000          25
3000-3500          40
3500-4000          60
4000-4500          40
4500-5000          20
Total             200

a) Find the maximum life of the 15 % of bulbs with the shortest life span.
b) Find the minimum life of the 22 % of bulbs with the longest life span.

Solution:
a) To find the maximum life span of the 15 % of bulbs with the shortest life span, we need to calculate P15, the point below which 15 % of the observations lie.

$$P_{15} = l + \frac{\frac{15N}{100} - F}{f} \times h$$

where P15 is the 15th percentile, l is the lower limit of the first class interval whose cumulative frequency is greater than or equal to $\frac{15N}{100}$, F is the cumulative frequency of the class preceding that class, f is the frequency of the percentile class and h is the class width of the percentile class.

Life in hours   No. of bulbs   Cumulative frequency
2000-2500          15             15
2500-3000          25             40
3000-3500          40             80
3500-4000          60            140
4000-4500          40            180
4500-5000          20            200
Total             200

$$\frac{15N}{100} = \frac{15 \times 200}{100} = 30$$

The first class interval whose cumulative frequency is greater than or equal to 30 is 2500-3000.

$$P_{15} = 2500 + \frac{30 - 15}{25} \times 500 = 2500 + 15 \times 20 = 2800$$

b) Minimum life span of the 22 % of bulbs whose life span is the highest. Here we have to calculate P78. To locate the class interval in which P78 lies, we calculate

$$\frac{78N}{100} = \frac{78 \times 200}{100} = 156$$

The first class interval whose cumulative frequency is greater than or equal to 156 is the class in which P78 lies. That is 4000-4500.

$$P_{78} = 4000 + \frac{156 - 140}{40} \times 500 = 4000 + \frac{16 \times 50}{4} = 4200$$

We note here that some fractiles coincide, such as
Median = Decile 5 = Percentile 50 = Quartile 2
Quartile 1 = Percentile 25
Quartile 3 = Percentile 75
Decile 1 = Percentile 10, etc.
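Quartiles, deciles and percentiles all follow the same pattern, differing only in the number of equal parts (4, 10 or 100). The sketch below makes that explicit with a single function; the name `grouped_fractile`, its `parts` parameter and the `(lower limit, upper limit)` class representation are assumptions of this illustration.

```python
def grouped_fractile(k, parts, class_limits, freqs):
    """k-th of the (parts - 1) points dividing grouped data into `parts`
    equal parts: l + (k*N/parts - F) / f * h."""
    if not 0 < k < parts:
        raise ValueError("k must lie strictly between 0 and parts")
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    target = k * n / parts
    cum_before = 0  # F: cumulative frequency before the fractile class
    for (lower, upper), f in zip(class_limits, freqs):
        if f > 0 and cum_before + f >= target:
            return lower + (target - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("fractile class not found")


life_classes = [(2000, 2500), (2500, 3000), (3000, 3500),
                (3500, 4000), (4000, 4500), (4500, 5000)]
bulbs = [15, 25, 40, 60, 40, 20]

p15 = grouped_fractile(15, 100, life_classes, bulbs)
p78 = grouped_fractile(78, 100, life_classes, bulbs)
print(round(p15, 6), round(p78, 6))  # 2800.0 4200.0

# Median = Quartile 2 = Decile 5 = Percentile 50
q2 = grouped_fractile(2, 4, life_classes, bulbs)
d5 = grouped_fractile(5, 10, life_classes, bulbs)
p50 = grouped_fractile(50, 100, life_classes, bulbs)
print(q2 == d5 == p50)  # True
```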

Exercise: The following data relate to the sales of 100 companies. Find the maximum sales of the lowest earning 15 % of companies.

Sales       No. of companies
Below 60      12
60-62         18
62-64         25
64-66         30
66-68         10
68-70          3
70-72          2

3.8 Mode
The mode is the value in the distribution which occurs the maximum number of times, that is, the observation with the maximum frequency. The mode indicates the concentration of observations around some specific value. There may be a concentration of observations around one or more values; accordingly the distribution is called unimodal, bimodal or multimodal.

Case I: Raw data: Ungrouped data.
To find the mode of raw data, observe the data. The observation which occurs the greatest number of times is the mode. Another method is to group the data and locate the mode as the observation with the maximum frequency.

Example: Locate the mode of the following data
3 4 5 7 8 3 16 18 10 8 9 8 9 11 13 11 12 14 15 8 7 7 9 6

Solution: Writing it as a frequency distribution

Observation   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18
Frequency     2   1   1   1   3   4   3    1    2    1    1    1    1    1    0    1

The maximum frequency is 4, for observation 8. Hence, the mode of the distribution is 8. We note here that if all observations have the same frequency, there is no mode, and if more than two observations share the same maximum frequency then the distribution is multimodal.
Case II: Discrete data: If the frequencies of the observations are given, then the observation with the highest frequency is the mode of the distribution.

Example: Locate the mode of the distribution.

Observation   12   13   14   15   16   17
Frequency      8   15   28   20   13    6

The observation with the maximum frequency is 14, with frequency 28. The mode of the distribution is 14.

Case III: Continuous data

For grouped continuous data, the mode is calculated using the following formula.

$$\text{Mode} = \text{MO} = l + \frac{f - f_1}{2f - f_1 - f_2} \times h$$

where
MO: mode of the distribution
Modal class: the class with the maximum frequency
l: lower limit of the modal class
f: frequency of the modal class
f1: frequency of the class preceding the modal class
f2: frequency of the class succeeding the modal class
h: class width

Example: The frequency distribution of weekly wages of employees is given below. Find the mode of the distribution.

Weekly wages (Rs.)   No. of workers
200-250                 4
250-300                 7
350-400                19
400-450                25
450-500                15
500-550                 7
550-600                 3
Total                  80

Solution: First we decide the modal class. The modal class is the one with the maximum frequency, so 400-450 is the modal class.

$$\text{MO} = l + \frac{f - f_1}{2f - f_1 - f_2} \times h = 400 + \frac{25 - 19}{50 - 19 - 15} \times 50 = 400 + \frac{6}{16} \times 50 = 418.75$$

Illustration: Daily profits of 100 shops are given below

Profits    Number of shops
0-100        12
100-200      18
200-300      27
300-400      20
400-500      17
500-600       6
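The grouped-mode formula can be sketched in Python and checked against the weekly wages example. Two choices here are assumptions of this sketch rather than part of the study material: when several classes tie for the maximum frequency the first one is taken as the modal class, and a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0. The function name `grouped_mode` is also illustrative.

```python
def grouped_mode(class_limits, freqs):
    """Mode of grouped data: l + (f - f1) / (2f - f1 - f2) * h."""
    m = max(range(len(freqs)), key=lambda idx: freqs[idx])  # modal class index
    f = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0               # preceding class
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0  # succeeding class
    denominator = 2 * f - f1 - f2
    if denominator == 0:
        raise ValueError("formula undefined: modal class and neighbours are equal")
    lower, upper = class_limits[m]
    return lower + (f - f1) / denominator * (upper - lower)


# Class limits exactly as printed in the weekly wages example.
wage_classes = [(200, 250), (250, 300), (350, 400), (400, 450),
                (450, 500), (500, 550), (550, 600)]
workers = [4, 7, 19, 25, 15, 7, 3]
print(grouped_mode(wage_classes, workers))  # 418.75
```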

Merits and Limitations of Mode

Merits
1. It is easy to understand and easy to calculate
2. It is not affected by extreme observations
3. It can be calculated even if the class intervals are open ended
4. Unequal class intervals do not affect the mode

Limitations:
1. It is not based on all observations
2. It is not capable of mathematical treatment
3. It is affected by the fluctuations of sampling

Merits and Limitations of Harmonic Mean.

Merits
1. It is based on all observations
2. It is useful in special cases, such as averaging rates.

Limitations:
1. It is not as easy to calculate as the A.M.
2. It gives large weights to the smallest observations, hence it is not a good measure of central tendency.
3. It is very rarely used

3.9 Weighted Mean:

The averages we have discussed so far give equal importance to all the observations. But there may be situations where the observations have unequal importance. A weight is then assigned to each observation according to its importance. If all observations are not of equal importance, the simple means are not proper measures of central tendency. Here weighted means are used to give due importance to the various observations.

Weighted arithmetic mean
Let X1, X2, ..., Xn be the observations with respective weights W1, W2, ..., Wn. Then

$$\text{Weighted arithmetic mean} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
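A minimal sketch of the weighted arithmetic mean follows. The data are hypothetical and invented purely for illustration (marks in three papers weighted 3, 2 and 1); they do not come from the study material, and the name `weighted_mean` is likewise an assumption.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_w = sum(weights)
    if total_w == 0:
        raise ValueError("total weight must be non-zero")
    return sum(w * x for x, w in zip(values, weights)) / total_w


# Hypothetical data: marks in three papers with weights 3, 2 and 1.
paper_marks = [60, 75, 90]
paper_weights = [3, 2, 1]
print(weighted_mean(paper_marks, paper_weights))  # (180 + 150 + 90) / 6 = 70.0

# With equal weights the weighted mean reduces to the simple mean.
print(weighted_mean(paper_marks, [1, 1, 1]))      # 75.0
```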

3.10 Combined means:

When more than one data set is available and their averages are known, we can treat all the observations in all the data sets as one set and obtain the overall average of all the observations. This average is called the combined average or average of averages.

Combined arithmetic mean

Let the first data set be X1, X2, ..., Xn of n observations with mean $\bar{X}$. Let the second data set be Y1, Y2, ..., Ym of m observations with mean $\bar{Y}$. Then the mean of these two data sets taken together is

$$\text{Combined mean} = \frac{n\bar{X} + m\bar{Y}}{n + m}$$

In general, if $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$ are the means of k data sets of n1, n2, ..., nk observations respectively, then

$$\bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k}{n_1 + n_2 + \dots + n_k}$$
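The combined mean formula can be checked against the mean of the pooled observations. The two small data sets below are hypothetical, made up only to illustrate the identity, and the name `combined_mean` is an illustrative choice.

```python
def combined_mean(sizes, means):
    """Combined mean of several data sets from their sizes and means:
    sum(n_j * mean_j) / sum(n_j)."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    total_n = sum(sizes)
    if total_n == 0:
        raise ValueError("total number of observations must be positive")
    return sum(n * m for n, m in zip(sizes, means)) / total_n


# Hypothetical data sets, used only to check the formula.
first = [2, 4, 6]    # n = 3, mean = 4
second = [10, 20]    # m = 2, mean = 15

mean_first = sum(first) / len(first)
mean_second = sum(second) / len(second)

from_formula = combined_mean([len(first), len(second)], [mean_first, mean_second])
from_pooling = sum(first + second) / len(first + second)
print(from_formula, from_pooling)  # both are 42 / 5 = 8.4
```

Note that the combined mean is not the simple average of the two means (which would be 9.5 here) unless the data sets are of equal size.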

3.11 Properties of averages

After knowing the various measures of central tendency, we now state some of the properties of these measures.
1. If a constant number K is added to or subtracted from every observation, then the arithmetic mean also increases or decreases by K. The arithmetic mean is not independent of change of origin and scale.
2. When all observations are equal, the arithmetic mean, geometric mean and harmonic mean coincide, i.e. AM = GM = HM; otherwise, for positive observations, AM > GM > HM.
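Property 1 follows directly from the definition of the arithmetic mean. The short derivation below is a sketch in which $\bar{x}$ denotes the arithmetic mean of x1, ..., xn; the scale factor c is a symbol introduced here for illustration and does not appear in the text above.

$$\frac{1}{n}\sum_{i=1}^{n}(x_i + K) = \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{nK}{n} = \bar{x} + K$$

$$\frac{1}{n}\sum_{i=1}^{n} c\,x_i = c \cdot \frac{1}{n}\sum_{i=1}^{n} x_i = c\,\bar{x}$$

So shifting every observation by K shifts the mean by K (change of origin), and multiplying every observation by c multiplies the mean by c (change of scale).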

3.12 Choice Of An Average

When extreme values are present, the median or mode, and not the mean, is used.
When certain characteristics like ability, skill, efficiency of employees, ad campaigns, or measurements on endurance tests and intelligence tests are to be studied, the median is more appropriate.
Measurements of ready-made articles, like the size of a certain dress or of certain components or parts in common use, as well as wages, rates of pay, rents, prices and sizes of supplies, may require the use of the mode.
When a quick rough estimate is needed, the median or mode can be rapidly obtained by inspection.
When repeated samples are to be taken, the mean is the most suitable because of its stability in respect of sampling variation.
Most statistical operations need the mean.

3.13 Comparison Of Mean Mode And Median

1. The A.M. takes into account the value of each observation and as such truly represents an average value. The median does not take into account the actual values but only their positions. Thus the remaining values can be changed to any values as long as the position of the median value is not affected. For locating the mode a frequency table is necessary.
2. The A.M. is more stable than the median and mode, i.e. its value does not change considerably even if we calculate it for other samples from the same population. The median and mode vary largely for other samples from the same population.
3. The A.M. is not satisfactory for open-ended classes. When the data contain extreme values, the median or mode should be preferred to the A.M.
4. Only the A.M. is amenable to mathematical treatment, e.g. if we are given group means and their sizes we can find the combined mean. The median and mode are not amenable to mathematical treatment.

3.14 Review exercise

1. The frequency distribution below represents the weights in pounds of a sample of packages carried last month by a small airfreight company.

Class          Frequency      Class          Frequency
10.0 - 10.9        1          15.0 - 15.9       11
11.0 - 11.9        4          16.0 - 16.9        8
12.0 - 12.9        6          17.0 - 17.9        7
13.0 - 13.9        8          18.0 - 18.9        6
14.0 - 14.9       12          19.0 - 19.9        2

a) Compute the sample mean.
b) Compute the sample mean using the coding method with 0 assigned to the fourth class.
c) Repeat part (b) with 0 assigned to the sixth class.
d) Explain why your answers in parts (b) and (c) are the same.

2. Davis Furniture Company has a revolving credit agreement with the First National Bank. The loan showed the following ending monthly balances last year:

Jan.  $121,300    Apr.  $72,800    July   $58,700    Oct.  $52,800
Feb.  $112,300    May   $72,800    Aug.   $61,100    Nov.  $49,200
Mar.  $72,800     June  $57,300    Sept.  $50,400    Dec.  $46,100

The company is eligible for a reduced rate of interest if its average monthly balance is over $65,000. Does it qualify?

3. Child-Care Community Nursery is eligible for a county social services grant as long as the average age of its children stays below 9. If these data represent the ages of all the children currently attending Child-Care, do they qualify for the grant?
8 5 9 10 9 12 7 12 13 7 8

4. Child-Care Community Nursery can continue to be supported by the county social services office as long as the average annual income of the families whose children attend the nursery is below $12,500. The family incomes of the attending children are
$14,500 $15,600 $12,500 $6,500 $5,900 $10,200 $8,600 $8,800 $7,800 $13,900 $14,300
a) Does Child-Care qualify now for county support?
b) If the answer to part (a) is no, by how much must the average family income fall for it to qualify?
c) If the answer to part (a) is yes, by how much can average family income rise and Child-Care still stay eligible?

5. These data represent the ages of patients admitted to a small hospital on February 28, 1996:
85 88 89 87 75 80 83 83 66 56 65 52 43 56 53 44 40 67 75 48
a) Construct a frequency distribution with classes 40-49, 50-59, etc.
b) Compute the sample mean from the frequency distribution.
c) Compute the sample mean from the raw data.
d) Compare parts (b) and (c) and comment on your answer.

6. The frequency distribution below represents the time in seconds needed to serve a sample of customers by cashiers at BullsEye Discount Store in December 1996.

Time (in seconds)   Frequency
20-29                   6
30-39                  16
40-49                  21
50-59                  29
60-69                  25
70-79                  22
80-89                  11
90-99                   7
100-109                 4
110-119                 0
120-129                 2

a) Compute the sample mean.
b) Compute the sample mean using the coding method with 0 assigned to the 70-79 class.

7. The owner of Pets R Us is interested in building a new store. The owner will build if the average number of animals sold during the first 6 months of 1995 is at least 300 and the overall monthly average for the year is at least 285. The data for 1995 are as follows:

Jan.  234    Feb.  216    Mar.   195    Apr.  400    May   315    June  274
July  302    Aug.  291    Sept.  275    Oct.  300    Nov.  375    Dec.  450

What is the owner's decision and why?

8. A cosmetics manufacturer recently purchased a machine to fill 3-ounce cologne bottles. To test the accuracy of the machine's volume setting, 18 trial bottles were run. The resulting volumes (in ounces) for the trials were as follows:
3.02 3.01 2.89 2.97 2.92 2.95 2.84 2.90 2.90 2.94 2.97 2.96 2.95 2.99 2.94 2.99 2.93 2.97

The company does not normally recalibrate the filling machine for this cologne if the average volume is within 0.4 of 3.00 ounces. Should it recalibrate?


9. The production manager of Hinton Press is determining the average time needed to photograph one printing plate. Using a stopwatch and observing the plate makers, he collects the following times (in seconds):
20.4 22.0 20.0 24.7 22.2 25.7 23.8 24.9 21.3 22.7 25.1 24.4 21.2 24.3 22.9 33.6 28.2 23.2 24.3 21.0

An average per-plate time of less than 23.0 seconds indicates satisfactory productivity. Should the production manager be concerned?

10. National Tire Company holds reserve funds in short-term marketable securities. The ending daily balance (in millions) of the marketable securities account for 2 weeks is shown below:

Week 1   $1.973   $1.970   $1.972   $1.975   $1.976
Week 2    1.969    1.82     1.893    1.887    1.895

What was the average (mean) amount invested in marketable securities during
a) the first week?
b) the second week?
c) the 2-week period?
d) An average balance over the 2 weeks of more than $1.970 million would qualify National for higher interest rates. Does it qualify?
e) If the answer to part (c) is less than $1.970 million, by how much would the last day's invested amount have to rise to qualify the company for the higher interest rates?
f) If the answer to part (c) is more than $1.970 million, how much could the company treasurer withdraw from reserve funds on the last day and still qualify for the higher interest rates?

11. M. T. Smith travels the eastern United States as a sales representative for a textbook publisher. She is paid on a commission basis related to volume. Her quarterly earnings over the last 3 years are given below.

          1st Quarter   2nd Quarter   3rd Quarter   4th Quarter
Year 1     $10,000        $5,000       $25,000       $15,000
Year 2      20,000        10,000        20,000        10,000
Year 3      30,000        15,000        45,000        50,000

a) Calculate separately M. T.'s average earnings in each of the four quarters.
b) Calculate separately M. T.'s average quarterly earnings in each of the 3 years.
c) Show that the mean of the four numbers you found in part (a) is equal to the mean of the three numbers you found in part (b). Furthermore, show that both these numbers equal the mean of all 12 numbers in the data table. (This is M. T.'s average quarterly income over 3 years.)

12. Lilian Tyson has been the chairperson of the county library committee for 10 years. She contends that during her tenure she has managed the bookmobile repair budget better than her predecessor did. Here are data for bookmobile repair for 15 years.

Year   Town Budget     Year   Town Budget     Year   Town Budget
1992     $30,000       1987     $24,000       1982     $30,000
1991      28,000       1986      19,000       1981      20,000
1990      25,000       1985      21,000       1980      15,000
1989      27,000       1984      22,000       1979      10,000
1988      26,000       1983      24,000       1978       9,000

a) Calculate the average annual budget for the last 5 years (1988-1992).
b) Calculate the average annual budget for her first 5 years in office (1983-1987).
c) Calculate the average annual budget for the 5 years before she was elected (1978-1982).
d) Based on the answers you found for parts (a), (b) and (c), do you think that there has been a decreasing or increasing trend in the annual budget? Has she been saving the county money?

*********


Chapter-4 4. MEASURES OF VARIABILITY (DISPERSION)


Objectives
At the end of this unit, a student should be able to
Know what dispersion is
Know the measures of dispersion
Know the coefficients of the measures of dispersion
Compute the measures of dispersion
Know skewness
Know kurtosis
Use the measures of variability
Apply them in business and economics
Appreciate the measures of variability.

Structure
4.1 Introduction
4.2 Measures of variability
4.3 Standard deviation of combined series
4.4 Skewness
4.5 Kurtosis
4.6 Key words
4.7 Review exercise

4.1 Introduction

A measure of central tendency gives a single value that represents the data set. By itself it is not sufficient to describe a distribution of data. The averages of two distributions may be the same while their scatter or dispersion is totally different. For example, the average marks of two classes may be the same, but in one class most students are close to the average, whereas the other class has more students far from the mean on both sides. The two classes have the same average but their variation is totally different. Here we discuss the measures of dispersion or variability. Measures of central tendency do not reveal how far the individual values lie from their average. Variability is an important consideration in statistics: a measure of dispersion describes the scatteredness of the data and tells the degree of lack of uniformity and consistency in the data set. The more homogeneous the data, the smaller the variability.
4.2 Measures of variability

The measures of dispersion are:
a) Range
b) Quartile deviation
c) Average / mean deviation
d) Standard deviation

1. Range: The range is the difference between the maximum observation and the minimum observation.

Range = max. value - min. value

The range carries the original units of the data. For example, if the data relate to the weights of 5 persons, the range will be in kilograms or pounds. Because of this, two or more series measured in different units, say one in kilograms and another in metres, cannot be compared directly. To make such comparisons we define a relative measure. Relative measures are expressed as a ratio or percentage and are independent of the original units.

$$\text{Relative range (coefficient of range)} = \frac{\text{max. observation} - \text{min. observation}}{\text{max. observation} + \text{min. observation}}$$

Illustration: The range of the data 6, 9, 12, 15, 30 is

Range = 30 - 6 = 24

$$\text{Relative range} = \frac{30 - 6}{30 + 6} = \frac{24}{36} = \frac{2}{3} \approx 0.67$$

The range is simple to calculate and rigidly defined, but it depends only on the two extreme values. It does not consider the other individual values in the data.
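As a quick check of the illustration, the range and the relative range can be computed directly. The function names below are introduced only for this sketch.

```python
def data_range(values):
    """Range: maximum observation minus minimum observation."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """Relative range: (max - min) / (max + min), a unit-free ratio."""
    return (max(values) - min(values)) / (max(values) + min(values))

data = [6, 9, 12, 15, 30]            # data from the illustration
print(data_range(data))
print(round(coefficient_of_range(data), 2))
```

Changing any of the middle values (9, 12, 15) leaves both results unchanged, which shows the limitation noted above.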
Exercise:

Compare the ranges of annual payments received of two organizations as below Organization A: Organization B 683 986 523 765 594 526 451 721 426 798 789 458 910 693 965 812 1025 465 698 345 843 489 1504 791 489 897 1325 1250 652 316 1005 1235

1256 758 735 469


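2. Standard deviation: It is the most widely used measure of dispersion. It overcomes the limitations of the other measures. It is defined as

$$\text{Standard deviation} = \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

For a grouped distribution

$$\sigma = \sqrt{\frac{1}{\sum_{i=1}^{n} f_i}\sum_{i=1}^{n} f_i\left(x_i - \bar{x}\right)^2}$$

The square of the standard deviation is called the variance and is denoted by $\sigma^2$.

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

It can be simplified as

$$\sigma^2 = \frac{\sum x_i^2}{n} - \bar{x}^2$$

Coefficient of variation

$$CV = \frac{SD}{\bar{X}} \times 100$$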
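The formulas above can be sketched as follows, using the divisor n as in the definitions. The two data sets are hypothetical (they are not the players' scores used in the text), and the function names are introduced only for this sketch.

```python
from math import sqrt, isclose

def variance(values):
    """Variance with divisor n: mean of squared deviations from the mean."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n

def variance_shortcut(values):
    """Simplified form: (sum of squares) / n minus the squared mean."""
    n = len(values)
    m = sum(values) / n
    return sum(x * x for x in values) / n - m ** 2

def coefficient_of_variation(values):
    """CV = SD / mean x 100."""
    m = sum(values) / len(values)
    return sqrt(variance(values)) / m * 100

series_a = [40, 60, 50, 70, 30]      # hypothetical, widely spread
series_b = [48, 52, 50, 51, 49]      # hypothetical, tightly clustered

for name, series in (("A", series_a), ("B", series_b)):
    print(name, variance(series), variance_shortcut(series),
          round(coefficient_of_variation(series), 2))
```

Both series have the same mean, so the smaller coefficient of variation of the second series reflects only its smaller spread.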

The coefficient of variation is used for comparison. It is a measure of consistency or homogeneity: the smaller the coefficient of variation, the higher the consistency or homogeneity of the data; the larger the coefficient of variation, the lower the consistency.

Illustration: The runs scored by two players A and B in last 6 one-day matches are as follows. Who has performed consistently?
A: 82 85 39 10 25
B: 50 55 49 62 69

Here we compare the coefficient of variation for both.

$\bar{X}$ = ____    $\sigma_X$ = ____    C.V. = ____
$\bar{Y}$ = ____    $\sigma_Y$ = ____    C.V. = ____

As the coefficient of variation for ____ is less than for ____, ____ is more consistent than ____.

Exercise: The mileage covered by 6 tyres of each of 2 brands is given below. The owner of a fleet of trucks wants to buy the better tyre. What do you suggest to him? Why?
Tyre brand
A: 540 630 365 400 600 900
B: 600 630 720 610 560 815

4.3 Standard deviation of combined series

Let $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$ be the arithmetic means of $k$ data sets with sizes $n_1, n_2, \ldots, n_k$ respectively, and let their standard deviations be $\sigma_1, \sigma_2, \ldots, \sigma_k$. Then the combined mean and standard deviation are given by

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \cdots + n_k\bar{X}_k}{n_1 + n_2 + \cdots + n_k}$$

$$\sigma^2 = \frac{n_1\left(d_1^2 + \sigma_1^2\right) + n_2\left(d_2^2 + \sigma_2^2\right) + \cdots + n_k\left(d_k^2 + \sigma_k^2\right)}{n_1 + n_2 + \cdots + n_k}$$

where $d_i = \bar{X}_i - \bar{X}$. When all the data sets have the same mean, the $d_i$ are zero and we get

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + \cdots + n_k\sigma_k^2}{n_1 + n_2 + \cdots + n_k}$$

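Illustration: The average of 20 girls in a class is 42 with a standard deviation of 3. The average of 30 boys in the class is 40 with a standard deviation of 5. Find the mean of the class and its standard deviation.

Given $n_1 = 20$, $\bar{X}_1 = 42$, $\sigma_1 = 3$ and $n_2 = 30$, $\bar{X}_2 = 40$, $\sigma_2 = 5$.

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{20 \times 42 + 30 \times 40}{20 + 30} = \frac{840 + 1200}{50} = \frac{2040}{50} = 40.8$$

With $d_1 = 42 - 40.8 = 1.2$ and $d_2 = 40 - 40.8 = -0.8$,

$$\sigma^2 = \frac{n_1\left(d_1^2 + \sigma_1^2\right) + n_2\left(d_2^2 + \sigma_2^2\right)}{n_1 + n_2} = \frac{20(1.44 + 9) + 30(0.64 + 25)}{50} = \frac{978}{50} = 19.56$$

so that $\sigma = \sqrt{19.56} \approx 4.42$.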
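The combined-series formulas can be sketched as a small function and applied to the girls-and-boys illustration. The function name `combined_mean_sd` is introduced here only for illustration.

```python
from math import sqrt

def combined_mean_sd(sizes, means, sds):
    """Combined mean and standard deviation of several series.

    Uses sigma^2 = sum(n_i * (d_i^2 + sigma_i^2)) / sum(n_i),
    where d_i is the series mean minus the combined mean.
    """
    total = sum(sizes)
    mean = sum(n * m for n, m in zip(sizes, means)) / total
    var = sum(n * ((m - mean) ** 2 + s ** 2)
              for n, m, s in zip(sizes, means, sds)) / total
    return mean, sqrt(var)

# 20 girls (mean 42, SD 3) and 30 boys (mean 40, SD 5), as in the illustration.
class_mean, class_sd = combined_mean_sd([20, 30], [42, 40], [3, 5])
print(round(class_mean, 2), round(class_sd, 2))
```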

Exercise: For the information below find the mean and standard deviation of the combined series.

            n     Mean    Standard deviation
Series A    30     30          10
Series B    45     20           5

4.4 Skewness
Skewness refers to the lack of symmetry of a distribution. In a symmetric distribution the mean, mode and median coincide; the three values are one and the same.

[Figure: symmetric distribution, Mean = Mode = Median]

Lack of symmetry is skewness. Asymmetry makes the values of the mean, mode and median differ, and the distribution becomes tilted towards one of the ends. If mean > median > mode, the distribution is positively skewed. If mean < median < mode, the distribution is negatively skewed.

[Figure: positively skewed curve (mode, median, mean from left to right) and negatively skewed curve (mean, median, mode from left to right)]

In symmetrical distributions the quartiles are equidistant from the mean. In skewed distributions they are not. The coefficient of skewness is defined by Karl Pearson as

$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$$

The value usually lies between -1 and +1. If the mode is ill-defined, we use the relation Mode = 3 Median - 2 Mean to find the skewness:

$$S_k = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$$

The coefficient of skewness based on quartiles is defined by Bowley as follows. Bowley's coefficient of skewness:

$$S_k = \frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1}$$

It is also known as the quartile coefficient of skewness. When the distribution is symmetrical the coefficient of skewness is 0. It may be positive or negative depending on where the values are concentrated.
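The three coefficients can be written as short functions. The summary figures used below (mean 50, median 48, mode 44, S.D. 6, quartiles 40 and 60) are hypothetical values chosen so that the relation Mode = 3 Median - 2 Mean holds exactly; they are not taken from the text.

```python
def pearson_skewness_mode(mean, mode, sd):
    """Karl Pearson's coefficient: (mean - mode) / S.D."""
    return (mean - mode) / sd

def pearson_skewness_median(mean, median, sd):
    """Form used when the mode is ill-defined: 3 (mean - median) / S.D."""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """Bowley's quartile coefficient: (Q3 + Q1 - 2 Md) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Hypothetical summary figures for a positively skewed distribution.
print(pearson_skewness_mode(mean=50, mode=44, sd=6))
print(pearson_skewness_median(mean=50, median=48, sd=6))
print(bowley_skewness(q1=40, median=48, q3=60))
```

The two Pearson forms agree here only because the hypothetical mode satisfies the empirical relation exactly; with real data they will usually differ slightly.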

Illustration: Find Karl Pearson's coefficient of skewness.

CI           5 - 15   15 - 25   25 - 35   35 - 45   45 - 55   55 - 65
Frequency       3        9        12        18        10         8

4.5 Kurtosis

Kurtosis: Kurtosis is the property of a distribution related to its flatness or peakedness. It gives an idea of the convexity of the curve. If the curve has a moderate peak it is called a normal or mesokurtic curve. If the curve is flatter than the normal curve it is a platykurtic curve. If the curve has a steeper peak than the normal curve it is a leptokurtic curve, which indicates that more frequencies are concentrated around the mode. A high concentration of values near the mode shows as a sharper peak.

4.6 Review exercise

1. The following data give the income distribution of workers in two factories. Which distribution shows more variability?

Income in Rs. 1000   10-12   12-14   14-16   16-18   18-20   20-22   22-24
Factory 1              10      15      65      73      70      17      10
Factory 2              25      34      40      50      30      30      10

2. The following data are available for two groups of workers.

                      Group I   Group II
Number of workers       400       500
Average daily wages      50        41
Standard deviation        5

The standard deviation of the 900 workers combined together is Rs.37. Find the standard deviation of the second group.

3. Lives of the two models of refrigerator in a recent survey are

Life (no. of years)   0-2   2-4   4-6   6-8   8-10   10-12
Model A                5    16    13     7     5       4
Model B                2     7    12    19     9       1

Which model has a greater uniformity?

Chapter-5 5.Correlation Analysis


Objectives: At the end of this unit, a student should be able to
Understand correlation
Know various types of relationships
Draw the scatter diagram
Know the coefficient of determination
Understand the use of correlation in business
Appreciate its use in decision making

Structure
5.1 Introduction
5.2 Types of relations
5.3 Scatter diagram
5.4 Karl Pearson's coefficient of correlation
5.5 Coefficient of correlation zero
5.6 Spearman's rank correlation coefficient
5.7 Coefficient of determination
5.7 Key words
5.8 Review exercise

5.1 Introduction

We are familiar with functional relationships, such as that between the area of a circle and its radius, or between Fahrenheit and Celsius. These are fixed relationships expressed by equations. Often, however, one variable changes with another without following an exact formula. We know in general that height and weight are associated, but no exact formula describes the relationship. Many such situations can be thought of in which a relationship or association exists but is not fixed: as the set of observations changes, so does the representation of the relationship. Correlation and regression describe such relationships. Correlation is the degree or extent of the relationship, and regression is about estimating the equation of the relationship.

5.2 Types of Relationship
Two associated variables may share either a direct or an inverse relationship. If an increase in the value of one variable brings about an increase in the second variable, the relationship is direct (a decrease in one variable will likewise induce a reduction in the other). If an increase in the value of one variable decreases the value of the second, the relationship is inverse (a decrease in one variable will increase the value of the other). The relationship between the variables may be linear or non-linear. A linear relationship can be expressed as an equation of degree 1. A non-linear relationship is expressed, for example, as a polynomial of degree 2 or more. These two classifications give four combinations:
1. Direct linear
2. Inverse linear
3. Direct non-linear
4. Inverse non-linear

Here we are interested only in linear relationships.


5.3 Scatter Diagram: If we plot the values of two variables, say X and Y, along the X axis and Y axis respectively, we get a plot called a scatter diagram. Each pair (x, y) is shown as a point on the graph, and together the points form a pattern. By observing this pattern one can identify the type of relationship the variables have. The scatter diagram can also be used to obtain the equation of the relationship.

[Figure: scatter diagrams illustrating a direct relationship, an inverse relationship, a direct non-linear relationship, an inverse non-linear relationship, and no correlation]

Exercise

Plot the scatter diagram and state the type of relationship of the variables.

1. X: 12 15 18 21 24 27 30
   Y: 18 20 21 23 26 27 29

2. X: 4 10 12 15
   Y: 15 14 13 12 11 10 9

5.3 Karl Pearson's coefficient of correlation

After knowing the type of relationship between the two variables, we would like to know the degree or extent of the relationship. This is given by the coefficient of correlation, which we denote by r.

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x\,\sigma_y}$$

It is the ratio of the covariance of (x, y) to the product of the standard deviation of X and the standard deviation of Y. This can be written as

$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}}$$

On simplification we get

$$r = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X_i^2 - n\bar{X}^2\right)\left(\sum Y_i^2 - n\bar{Y}^2\right)}}$$

The coefficient of correlation is also known as the product moment coefficient of correlation. It is a pure number, independent of the units of measurement: if the data are in kg and cm, no unit is attached to r. The value of the coefficient of correlation lies between -1 and +1.

A negative value of the coefficient of correlation indicates that the relationship is inverse. A positive value indicates that the relationship is direct. r = ±1 indicates that the relationship is perfect: if r = +1 it is perfect positive and if r = -1 it is perfect negative. If two variables are uncorrelated or unassociated, the value of the coefficient of correlation will be zero. But if r = 0, we should not conclude that the variables are unrelated. The coefficient of correlation measures only linear relationships; there may be some non-linear relation between the two variables.
Illustration: The marks obtained by 11 students in Mathematics and Statistics are given below. Compute the coefficient of correlation for this data and comment.

Student no.     1    2    3    4    5    6    7    8    9   10   11
Mathematics:   45   55   56   58   60   65   68   70   75   80   85
Statistics:    56   50   48   60   62   64   65   70   74   82   90

The coefficient of correlation is given by

$$r = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X_i^2 - n\bar{X}^2\right)\left(\sum Y_i^2 - n\bar{Y}^2\right)}}$$

To find this we tabulate the terms

          X      Y      XY      X²      Y²
         45     56    2520    2025    3136
         55     50    2750    3025    2500
         56     48    2688    3136    2304
         58     60    3480    3364    3600
         60     62    3720    3600    3844
         65     64    4160    4225    4096
         68     65    4420    4624    4225
         70     70    4900    4900    4900
         75     74    5550    5625    5476
         80     82    6560    6400    6724
         85     90    7650    7225    8100
TOTAL   717    721   48398   48149   48905

$$\bar{X} = \frac{717}{11} \approx 65.18 \qquad \bar{Y} = \frac{721}{11} \approx 65.55$$

Substituting in r we get

$$r = \frac{48398 - \frac{717 \times 721}{11}}{\sqrt{\left(48149 - \frac{717^2}{11}\right)\left(48905 - \frac{721^2}{11}\right)}} \approx \frac{1401.91}{\sqrt{1413.64 \times 1646.73}} \approx 0.919$$

The value of the coefficient of correlation is positive, so we conclude that the relationship is direct.
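The tabulation can be reproduced mechanically. The sketch below recomputes the column totals and r from the eleven pairs of marks using the simplified formula; the variable names are introduced here for illustration only.

```python
from math import sqrt

maths_marks = [45, 55, 56, 58, 60, 65, 68, 70, 75, 80, 85]
statistics_marks = [56, 50, 48, 60, 62, 64, 65, 70, 74, 82, 90]

n = len(maths_marks)
sum_xy = sum(x * y for x, y in zip(maths_marks, statistics_marks))
sum_x2 = sum(x * x for x in maths_marks)
sum_y2 = sum(y * y for y in statistics_marks)
mean_x = sum(maths_marks) / n          # divide by the number of pairs, n = 11
mean_y = sum(statistics_marks) / n

# r = (sum XY - n * mean_x * mean_y) / sqrt((sum X^2 - n mean_x^2)(sum Y^2 - n mean_y^2))
r = (sum_xy - n * mean_x * mean_y) / sqrt(
    (sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2)
)
print(sum_xy, sum_x2, sum_y2)
print(round(mean_x, 2), round(mean_y, 2), round(r, 3))
```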

Exercise
The heights and weights of students in a class are given below. Compute the coefficient of correlation and comment.

Height in cm:   150  152  155  160  162  158  154  159
Weight in kg:    70   61   50   48   62   67   69   64

5.4 Coefficient of Correlation Zero

We have mentioned in the previous section that if the value of the coefficient of correlation is zero, we should not conclude that the variables are unrelated. We illustrate this with the following example.

Illustration: Compute the value of the coefficient of correlation and comment.

X:  -2  -1   0   1   2
Y:   4   1   0   1   4

We compute the coefficient of correlation using the formula

$$r = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X_i^2 - n\bar{X}^2\right)\left(\sum Y_i^2 - n\bar{Y}^2\right)}}$$

          X     Y    XY    X²    Y²
         -2     4    -8     4    16
         -1     1    -1     1     1
          0     0     0     0     0
          1     1     1     1     1
          2     4     8     4    16
TOTAL     0    10     0    10    34

$$r = \frac{0 - 5 \times 0 \times 2}{\sqrt{\left(10 - 5 \times 0^2\right)\left(34 - 5 \times 4\right)}} = 0$$

We note here that although the value of r is zero, there is a relation between the values of X and Y: the values of Y are the squares of X. Hence, if the variables are unrelated, the coefficient of correlation is zero, but the converse is not true. Even if the coefficient of correlation is zero, the variables may have some relationship, which may be non-linear.
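The same point can be seen by computing r for the five pairs directly. The helper `pearson_r` below is a sketch of the simplified formula, not a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product moment coefficient via the simplified sums formula."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    den = sqrt((sum(x * x for x in xs) - n * mean_x ** 2)
               * (sum(y * y for y in ys) - n * mean_y ** 2))
    return num / den

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]        # Y is exactly the square of X

print(ys, pearson_r(xs, ys))    # exact relationship, yet no linear correlation
```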

5.5 Spearman's Rank Correlation Coefficient

Items are sometimes assigned ranks according to two characteristics. Rank correlation measures the correlation between these ranks. Ranks are given when the measurements are qualitative; for example, tea tasters rank 5 brands of tea, or 3 experts judge the performance of 10 participants in a music competition. The coefficient of correlation obtained on the basis of ranks is called the rank correlation coefficient. It is due to Spearman and is given by

$$r = 1 - \frac{6\sum d_i^2}{n\left(n^2 - 1\right)}$$

where $d_i = X_i - Y_i$ is the difference between the two ranks of the i-th item.

Ranking is done as follows:
Assign rank 1 to the highest observation.
Assign rank 2 to the next highest observation.
If two or more observations have the same value, each is given the average of the ranks they would otherwise have received.

e.g. Suppose we have to rank the observations 20, 21, 21, 24, 25, 25, 26. There are 7 observations. The largest is 26, so it gets rank 1. Then we have 2 observations equal to 25. If they had not been equal, we would have given them ranks 2 and 3. The average of these ranks is 2.5, so each 25 gets rank 2.5. Next, 24 gets rank 4. Each 21 gets rank 5.5 in the same way, and finally 20 gets rank 7.

Observation   26    25    25    24    21    21    20
Rank           1   2.5   2.5     4   5.5   5.5     7

The limits of rank correlation are the same as those of Karl Pearson's coefficient, i.e. -1 to +1.
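The ranking rule, including the averaging of tied ranks, can be sketched as a small function. `average_ranks` is a name introduced here for illustration.

```python
def average_ranks(values):
    """Rank 1 goes to the highest value; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # best position occupied by v
        count = ordered.count(v)               # how many observations tie at v
        rank_of[v] = first + (count - 1) / 2   # mean of the tied positions
    return [rank_of[v] for v in values]

observations = [20, 21, 21, 24, 25, 25, 26]
print(average_ranks(observations))
```

Because the tied observations share an average rank, the ranks still add up to n(n + 1)/2, here 28.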

Illustration: Ten competitors in a beauty contest are ranked by three judges in the following order.

Competitor             1    2    3    4    5    6    7    8    9   10
Rank by first judge    1    6    5   10    3    2    4    9    7    8
Second judge           3    5    8    4    7   10    2    1    6    9
Third judge            6    4    9    8    1    2    3   10    5    7

Use the rank correlation coefficient to discuss which pair of judges has the nearest approach to common tastes in beauty.

Here we have 3 judges, and the correlation can be found between only 2 judges at a time. So we compare judge 1 and judge 2, judge 2 and judge 3, and judge 1 and judge 3.

Competitor   First X   Second Y   Third Z   d²xy   d²yz   d²xz
    1           1         3          6        4      9     25
    2           6         5          4        1      1      4
    3           5         8          9        9      1     16
    4          10         4          8       36     16      4
    5           3         7          1       16     36      4
    6           2        10          2       64     64      0
    7           4         2          3        4      1      1
    8           9         1         10       64     81      1
    9           7         6          5        1      1      4
   10           8         9          7        1      4      1
TOTAL                                        200    214     60

$$r_{xy} = 1 - \frac{6\sum d_{xy}^2}{n\left(n^2 - 1\right)} = 1 - \frac{6 \times 200}{10 \times 99} = -0.212$$

$$r_{yz} = 1 - \frac{6\sum d_{yz}^2}{n\left(n^2 - 1\right)} = 1 - \frac{6 \times 214}{10 \times 99} = -0.297$$

$$r_{xz} = 1 - \frac{6\sum d_{xz}^2}{n\left(n^2 - 1\right)} = 1 - \frac{6 \times 60}{10 \times 99} = 0.636$$

The coefficient of correlation between X and Y, i.e. judge 1 and judge 2, is negative, so their approach to beauty is not similar. In the same way, judge 2 and judge 3 do not share similar tastes in beauty. Judges 1 and 3 share a similar approach to beauty, as the coefficient of correlation of the ranks given by them is positive.
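The three comparisons can be reproduced with a short function implementing the untied formula. The function name `spearman` is introduced here only for this sketch.

```python
def spearman(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
judge_2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
judge_3 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

r_xy = spearman(judge_1, judge_2)
r_yz = spearman(judge_2, judge_3)
r_xz = spearman(judge_1, judge_3)
print(round(r_xy, 3), round(r_yz, 3), round(r_xz, 3))
```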
Rank Correlation in Case of Ties

When two or more items share the same rank, the formula used in the previous section cannot be used. We modify it as

$$r = 1 - \frac{6\left\{\sum d_i^2 + \frac{1}{12}\left[\left(m_1^3 - m_1\right) + \left(m_2^3 - m_2\right) + \cdots\right]\right\}}{n\left(n^2 - 1\right)}$$

where $m_i$ is the number of items or persons sharing the same rank. Ties occur when the judges give the same rank to more than one item or person.

Illustration: Calculate Spearman's rank correlation between the two series.

Series A:   57   59   62   63   64   65   55   58   57
Series B:  113  117  126  126  130  129  111  116  112

First we rank these series.

Series A   Rank   Series B   Rank   di = Xi - Yi   di²
   57      7.5      113        7        0.5        0.25
   59      5        117        5        0          0
   62      4        126        3.5      0.5        0.25
   63      3        126        3.5     -0.5        0.25
   64      2        130        1        1          1
   65      1        129        2       -1          1
   55      9        111        9        0          0
   58      6        116        6        0          0
   57      7.5      112        8       -0.5        0.25
TOTAL                                              3.0

In series A we have one tie: two observations share the rank 7.5, so $m_1 = 2$. In series B we have another tie, so $m_2 = 2$.

$$r = 1 - \frac{6\left\{3 + \frac{1}{12}\left[(8 - 2) + (8 - 2)\right]\right\}}{9(81 - 1)} = 1 - \frac{6[3 + 1]}{9(80)}$$

r = 0.967
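The tie-corrected calculation can be sketched end to end: rank both series with averaged ranks, add one (m³ - m)/12 term for every group of m tied values, and substitute. The sketch takes the first value of Series B as 113, as shown in the ranking table; the function names are introduced here for illustration.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 goes to the highest value; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
               for v in set(values)}
    return [rank_of[v] for v in values]

def spearman_with_ties(xs, ys):
    """Spearman's coefficient with the (m^3 - m)/12 correction for each tie group."""
    n = len(xs)
    sum_d2 = sum((a - b) ** 2
                 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    correction = sum((m ** 3 - m) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

series_a = [57, 59, 62, 63, 64, 65, 55, 58, 57]
series_b = [113, 117, 126, 126, 130, 129, 111, 116, 112]  # 113 assumed, per the table

r = spearman_with_ties(series_a, series_b)
print(round(r, 3))
```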
5.6 Coefficient of determination
The square of the coefficient of correlation is called the coefficient of determination, $r^2$. $r^2$ measures only the strength of a linear relationship between two variables and lies between 0 and 1. It is the proportion of the variation in Y, the dependent variable, that is explained by the regression line, that is, by Y's relationship with the independent variable.

5.7 Key words
Correlation analysis: A technique to determine the degree to which the variables are linearly correlated.
Coefficient of correlation: A measure of the degree or extent of the linear relationship between two variables.
Coefficient of determination: The square of the coefficient of correlation, $r^2$.
Scatter diagram: A graph of points on a rectangular grid showing the spread of the data, used to judge the type of relationship.

5.8 Review Exercise:

1. A computer, while calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, obtained the following:

N = 25, $\sum X = 125$, $\sum Y = 100$, $\sum X^2 = 650$, $\sum Y^2 = 460$, $\sum XY = 508$

Find the correlation coefficient of X and Y.

2. A furniture retailer in a locality is interested in studying whether some relationship exists between the number of building permits issued in that locality in the past years and the volume of sales in those years. He has accordingly collected the data for the sales (Y) and the number of building permits issued (X) in the past 10 years. The results are as follows:
$\sum X = 200$, $\sum Y = 2200$, $\sum XY = 45800$, $\sum X^2 = 4600$ and $\sum Y^2 = 490400$.
Find the coefficient of correlation.

3. Many small companies buy advertising without considering its effect. Hamburger Wars (substantial price rivalry with special value meals) have cut the profits of Ethiopian Burgers of Santa Cruz, California, a small regional chain. The marketing manager is trying to make the case that you have to spend money to make money. Spending on billboard advertisements, in the manager's opinion, has a direct result on sales. There are records of 7 months:

Monthly expenditure ($1000)        25   16   42   34   10   21   19
Monthly sales revenue ($100000)    34   14   48   32   26   29   20

Do you support the manager's claim? Why?

4. Zippy Cola is studying the effect of its latest advertising campaign. People chosen at random were called and asked how many cans of Zippy Cola they had bought in the past week and how many Zippy Cola advertisements they had either read or seen in the past week. Calculate the sample coefficient of determination and the sample coefficient of correlation.

X (number of ads)      3    7    4    2    0    4    1    2
Y (cans purchased)    11   18    9    4    7    6    3    8

5. The following table gives the aptitude test scores and the productivity indices of 10 workers selected at random. Calculate Karl Pearson's coefficient of correlation.

Aptitude test scores   60   62   65   70   72   48   53   73   65   82
Productivity index     68   60   62   80   85   40   62   62   60   81

6. Calculate Spearman's coefficient of correlation for the ranks given by 2 judges, and comment.

Contestant no.   1   2   3   4   5   6   7   8
Rank by A        8   5   3   1   6   4   2   7
Rank by B        6   5   4   2   3   1   8   7

Chapter 6 6. REGRESSION

Objectives: At the end of this unit, a student should be able to
Understand regression
Estimate the regression equation
Understand the relation between correlation and regression
Use regression analysis
Understand the application of regression analysis in business
Appreciate regression analysis.

Structure
6.1 Introduction
6.2 Equation of a straight line
6.3 Two Lines of Regression
6.4 Properties of Regression Lines
6.5 Key words

6.1 Introduction

The coefficient of correlation gives the magnitude of the association between two variables. The next step is to obtain an expression for the relationship between the variables. We derive the equation that defines the relationship, which is linear, since we considered linear correlation in the previous sections. This functional relation between the variables is called the regression equation. Regression means a tendency to return to the mean. For example, in the correlation of the heights of fathers and sons, a tendency of the human race to return, or regress, to the average height is observed.

6.2 Equation of a straight line


The equation of a straight line is

$$Y = a + bX$$

where a and b are constants. a is the Y-intercept, i.e. the point where the line Y = a + bX cuts the Y axis, and b is the slope of the line, which gives the rate of change of Y with respect to X. We can find the values of a and b using the following normal equations.

From Y = a + bX, taking the sum of both sides, we get (as a is constant, $\sum a = na$)

$$\sum Y = na + b\sum X$$

Multiplying the equation by X and taking the sum of both sides, we get

$$\sum XY = a\sum X + b\sum X^2$$

Solving these two equations, we get

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$

$$a = \bar{Y} - b\bar{X}$$

After obtaining the values of a and b we get an estimating equation

$$\hat{y} = a + bx$$

where $\hat{y}$ is the estimated value of Y when the value of X is given.
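The two normal equations form a 2x2 linear system in a and b, so they can also be solved directly. The following is a minimal Python sketch of that idea, not part of the original material; the function name `solve_normal_equations` and the sample data (points lying exactly on Y = 1 + 2X) are hypothetical and chosen only for illustration.

```python
def solve_normal_equations(xs, ys):
    """Solve  sum(Y)  = n*a      + b*sum(X)
              sum(XY) = a*sum(X) + b*sum(X^2)   for a and b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Determinant of the 2x2 system; it is zero when all X values are equal.
    det = n * sum_x2 - sum_x * sum_x
    if det == 0:
        raise ValueError("all X values are equal; the slope is undefined")

    # Cramer's rule applied to the two normal equations
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# Hypothetical data lying exactly on the line Y = 1 + 2X
a, b = solve_normal_equations([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Dividing the numerator and denominator of b by n gives the same expression as the formula in terms of the means, so both routes should agree up to rounding.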

Illustration: Obtain the regression equation for the following data.

X:  10  9  7  8  11
Y:   6  3  2  4   5

We find out values of a and b using the above data


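            X     Y     XY     X²
           10     6     60    100
            9     3     27     81
            7     2     14     49
            8     4     32     64
           11     5     55    121
TOTAL      45    20    188    415

$$\bar{X} = \frac{45}{5} = 9, \qquad \bar{Y} = \frac{20}{5} = 4$$

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{188 - 5 \times 9 \times 4}{415 - 5 \times 9^2} = \frac{8}{10} = 0.8$$

$$a = \bar{Y} - b\bar{X} = 4 - 9 \times 0.8 = -3.2$$

$\hat{y} = -3.2 + 0.8X$ is the estimating equation.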
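As a check on the arithmetic, the illustration can be recomputed in a few lines of Python. This is only a sketch using the five (X, Y) pairs given above; the variable names are chosen for illustration.

```python
xs = [10, 9, 7, 8, 11]
ys = [6, 3, 2, 4, 5]
n = len(xs)

sum_xy = sum(x * y for x, y in zip(xs, ys))   # column total of XY
sum_x2 = sum(x * x for x in xs)               # column total of X squared
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept from the formulas for b and a
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
a = mean_y - b * mean_x

print(sum_xy, sum_x2)            # 188 415
print(round(b, 4), round(a, 4))  # 0.8 -3.2
```

The intercept comes out negative, since 4 - 9 x 0.8 = -3.2.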

Exercise:

Obtain an estimating equation for the data given below:


X:  5  3  7  4  8  2  10  6   8  7  9  11
Y:  8  6  8  5  9  6   8  5  11  7  8  10

6.3 Two Lines of Regression

For bivariate data (Xi, Yi), the relationship may be that Y depends on X or that X depends on Y. If Y depends on X, the regression line is Y on X; Y is the dependent variable and X is the independent variable. If X depends on Y, the regression line is X on Y; X is the dependent variable and Y is the independent variable. The regression equation of Y on X, Y = a + bX, is used to estimate the value of Y when X is known. The regression equation of X on Y, X = c + dY, is used to estimate the value of X when Y is given. Here a, b, c and d are constants. In Y = a + bX, a can be interpreted as the average value of Y when X is zero; in X = c + dY, c is the average value of X when Y is zero.

The slopes of the regression lines Y on X and X on Y are denoted by byx and bxy respectively. Their values are

$$b_{yx} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}, \qquad b_{xy} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(Y)}$$

Simplifying, we get

$$b_{yx} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad b_{xy} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$$

byx and bxy are the coefficients of regression. After we obtain the values of byx and bxy, we obtain the regression equations of Y on X and X on Y by substituting in the following equations:

$$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$$

$$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$$

The value of b in the previous section is the same as byx.
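The covariance/variance form and the simplified sum form of the regression coefficients should give identical values. The sketch below, with hypothetical function names, computes both forms for the five-point data set of section 6.2 (X: 10, 9, 7, 8, 11 and Y: 6, 3, 2, 4, 5) so that the two can be compared. It assumes the divide-by-n definitions of covariance and variance; the factor 1/n cancels in the ratios in any case.

```python
def coefficients_from_moments(xs, ys):
    """Return (b_yx, b_xy) as cov(X, Y)/var(X) and cov(X, Y)/var(Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov_xy / var_x, cov_xy / var_y


def coefficients_from_sums(xs, ys):
    """Return (b_yx, b_xy) from the simplified formulas in terms of sums."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    b_yx = numerator / (sum(x * x for x in xs) - n * mean_x ** 2)
    b_xy = numerator / (sum(y * y for y in ys) - n * mean_y ** 2)
    return b_yx, b_xy


xs = [10, 9, 7, 8, 11]
ys = [6, 3, 2, 4, 5]
print(coefficients_from_moments(xs, ys))
print(coefficients_from_sums(xs, ys))
```

For this data both coefficients happen to equal 0.8, and byx matches the slope b found in section 6.2.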


Illustration:

The table below gives the stopping distance d (in feet) of an automobile travelling at speed V (in miles per hour) at the instant danger is sighted.

Speed V (miles per hour):   20  30   40   50   60   70
Stopping distance d (ft):   54  90  138  206  292  396

Estimate the stopping distance when the speed is 45 miles per hour. Estimate the speed when the distance travelled before the automobile stops is 100 feet. We have to obtain the estimating equations, so we calculate byx and bxy.
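          Speed X   Distance Y       XY       X²        Y²
             20         54         1080      400      2916
             30         90         2700      900      8100
             40        138         5520     1600     19044
             50        206        10300     2500     42436
             60        292        17520     3600     85264
             70        396        27720     4900    156816
TOTAL       270       1176        64840    13900    314576

$$\bar{X} = \frac{270}{6} = 45, \qquad \bar{Y} = \frac{1176}{6} = 196$$

$$b_{yx} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{64840 - 6 \times 45 \times 196}{13900 - 6 \times 45^2} = \frac{11920}{1750} = 6.811429$$

$$b_{xy} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} = \frac{64840 - 6 \times 45 \times 196}{314576 - 6 \times 196^2} = \frac{11920}{84080} = 0.141770$$

Substituting in the regression equations of Y on X and X on Y,

$$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$$

$$(X - \bar{X}) = b_{xy}(Y - \bar{Y})$$

we get (Y - 196) = 6.811429(X - 45), which simplifies to Y = 6.811429X - 110.514, and (X - 45) = 0.141770(Y - 196), which simplifies to X = 0.141770Y + 17.213.

When the speed is X = 45 miles per hour, the estimated stopping distance is Y = 6.811429 x 45 - 110.514 = 196 feet. When the stopping distance is Y = 100 feet, the estimated speed is X = 0.141770 x 100 + 17.213 = 31.39 miles per hour.

Observe that the values of byx and bxy have the same sign.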
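The two estimates asked for in this illustration can be obtained from the two regression lines as sketched below. The sketch uses the speed and stopping-distance pairs exactly as listed in the problem statement (d = 138 ft at 40 miles per hour); the helper names `distance_at` and `speed_at` are hypothetical.

```python
speeds = [20, 30, 40, 50, 60, 70]           # X, miles per hour
distances = [54, 90, 138, 206, 292, 396]    # Y, feet, as in the problem statement

n = len(speeds)
mean_x = sum(speeds) / n
mean_y = sum(distances) / n

# Corrected sums: sum(XY) - n*Xbar*Ybar, sum(X^2) - n*Xbar^2, sum(Y^2) - n*Ybar^2
s_xy = sum(x * y for x, y in zip(speeds, distances)) - n * mean_x * mean_y
s_xx = sum(x * x for x in speeds) - n * mean_x ** 2
s_yy = sum(y * y for y in distances) - n * mean_y ** 2

b_yx = s_xy / s_xx   # slope of the line Y on X
b_xy = s_xy / s_yy   # slope of the line X on Y


def distance_at(speed):
    """Estimate Y from X using (Y - Ybar) = b_yx * (X - Xbar)."""
    return mean_y + b_yx * (speed - mean_x)


def speed_at(distance):
    """Estimate X from Y using (X - Xbar) = b_xy * (Y - Ybar)."""
    return mean_x + b_xy * (distance - mean_y)


print(round(b_yx, 6), round(b_xy, 6))
print(round(distance_at(45), 2), round(speed_at(100), 2))
```

Note that each estimate uses its own line: Y on X to estimate distance from speed, and X on Y to estimate speed from distance. Because 45 is the mean speed, the estimated distance at 45 miles per hour equals the mean distance.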
Exercise:

For the data below, construct a scatter diagram. Find the least square regression lines Y on X and X on Y.
Grade on first quiz X:    6  5  8   8  7  6  10  4  9  7
Grade on second quiz Y:   8  7  7  10  5  8  10  6  8  6


6.4 Properties of Regression Lines

The regression equations Y on X and X on Y have the following properties:

a) The lines of regression meet at the point whose co-ordinates are $(\bar{X}, \bar{Y})$. The averages of X and Y lie on both lines of regression.

b) The regression coefficients byx and bxy and the correlation coefficient r all have the same sign. The direction of the relationship is the same whichever coefficient is used.

c) There is an angle between the two lines of regression. Let the angle be denoted by $\theta$. If the correlation is perfect, then $\theta = 0$ and the lines coincide exactly. As the correlation becomes weaker, $\theta$ increases.

d) The correlation coefficient r is the geometric mean of the regression coefficients. The sign given to r, + or -, is the sign that byx and bxy have.

$$r = \pm\sqrt{b_{yx} \cdot b_{xy}}$$

e) $$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \qquad \text{and} \qquad b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$
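These properties can be checked numerically on a small data set. The sketch below reuses the five-point data of section 6.2 and computes each quantity independently so that the properties can be compared. The angle in (c) is obtained from the slopes of the two lines drawn with X horizontal and Y vertical; this is ordinary plane geometry rather than a formula from this material.

```python
import math

xs = [10, 9, 7, 8, 11]
ys = [6, 3, 2, 4, 5]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Population (divide-by-n) moments
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

b_yx = cov_xy / var_x
b_xy = cov_xy / var_y
r = cov_xy / (sd_x * sd_y)   # Pearson's r, computed directly

# (a) Intersection of Y = a + b_yx*X and X = c + b_xy*Y (requires r^2 != 1)
a = mean_y - b_yx * mean_x
c = mean_x - b_xy * mean_y
x_meet = (c + b_xy * a) / (1 - b_yx * b_xy)
y_meet = a + b_yx * x_meet

# (b) All three coefficients carry the same sign
same_sign = (b_yx > 0) == (b_xy > 0) == (r > 0)

# (c) Acute angle between the lines: Y on X has slope b_yx, X on Y has slope 1/b_xy
theta = abs(math.atan2(1, b_xy) - math.atan2(b_yx, 1))
theta = min(theta, math.pi - theta)

# (d) r as the signed geometric mean of the regression coefficients
r_from_b = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# (e) Regression coefficients recovered from r and the standard deviations
b_yx_from_r = r * sd_y / sd_x
b_xy_from_r = r * sd_x / sd_y

print(x_meet, y_meet, same_sign)
print(round(r, 4), round(r_from_b, 4), round(math.degrees(theta), 2))
```

With perfectly correlated data the denominator in (a) becomes zero, because the two lines coincide and have no single intersection point.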

Illustration: The two lines of regression are 5x + 6y = 160 and 2x + 4y = 80. Find
1. the mean values of X and Y
2. the regression coefficients
3. the correlation coefficient
4. the variance of Y if the standard deviation of X is $\sqrt{60}$ (i.e. $\sigma_x^2 = 60$).

We have 5x + 6y = 160 and 2x + 4y = 80.

1. First we solve these equations simultaneously. To eliminate x:

5x + 6y = 160   (multiply by 2)   10x + 12y = 320
2x + 4y = 80    (multiply by 5)   10x + 20y = 400

Subtracting, -8y = -80, so y = 10. Substituting in either equation we get x = 20. Since the lines of regression meet at the means, $\bar{X} = 20$ and $\bar{Y} = 10$.

2. The regression equations are known, but we do not know which is Y on X and which is X on Y. We assume that 5x + 6y = 160 is Y on X and 2x + 4y = 80 is X on Y, and rearrange them into the forms Y = a + bX and X = c + dY to find the regression coefficients.

6y = -5x + 160, so $y = -\frac{5}{6}x + \frac{160}{6}$ and $b_{yx} = -\frac{5}{6}$

2x = -4y + 80, so x = -2y + 40 and $b_{xy} = -2$

Then

$$|r| = \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{\frac{5}{6} \times 2} > 1$$

which is impossible, as $-1 \le r \le 1$. Our assumption is wrong, so we reverse it. Now let 5x + 6y = 160 be X on Y and 2x + 4y = 80 be Y on X. Then

5x = -6y + 160, so $x = -\frac{6}{5}y + \frac{160}{5}$

4y = -2x + 80, so $y = -\frac{1}{2}x + 20$

$$b_{yx} = -\frac{1}{2} \qquad \text{and} \qquad b_{xy} = -\frac{6}{5}$$

3. Both regression coefficients are negative, so r is negative:

$$r = -\sqrt{b_{yx} \cdot b_{xy}} = -\sqrt{\frac{6}{5} \times \frac{1}{2}} = -0.774597$$

4. Substituting in the equation $b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$ and squaring both sides, we get

$$b_{yx}^2 = r^2 \cdot \frac{\sigma_y^2}{\sigma_x^2}$$

$$\frac{1}{4} = \frac{3}{5} \cdot \frac{\sigma_y^2}{60}$$

$$\sigma_y^2 = 25$$
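The trial-and-error step in this illustration (try one assignment, reject it if the product of the regression coefficients exceeds 1) can be written as a short routine. The sketch below is one possible way to do it with exact fractions; the representation of a line as a tuple (p, q, k) meaning p*x + q*y = k and the function names are assumptions made for this example, and the variance of X is taken as 60, the value used in the last step of the working.

```python
import math
from fractions import Fraction

# Each line is (p, q, k), meaning p*x + q*y = k.
line1 = (5, 6, 160)
line2 = (2, 4, 80)


def intersection(l1, l2):
    """Solve the two equations simultaneously; the solution is (mean of X, mean of Y)."""
    (p1, q1, k1), (p2, q2, k2) = l1, l2
    det = Fraction(p1 * q2 - p2 * q1)
    return Fraction(k1 * q2 - k2 * q1) / det, Fraction(p1 * k2 - p2 * k1) / det


def slope_y_on_x(line):
    """Coefficient of x after rearranging to y = -(p/q)x + k/q."""
    p, q, _ = line
    return Fraction(-p, q)


def slope_x_on_y(line):
    """Coefficient of y after rearranging to x = -(q/p)y + k/p."""
    p, q, _ = line
    return Fraction(-q, p)


def identify(l1, l2):
    """Return (b_yx, b_xy) for the assignment in which b_yx * b_xy = r^2 lies in [0, 1]."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx, b_xy = slope_y_on_x(y_on_x), slope_x_on_y(x_on_y)
        if 0 <= b_yx * b_xy <= 1:
            return b_yx, b_xy
    raise ValueError("neither assignment gives a valid r^2")


mean_x, mean_y = intersection(line1, line2)
b_yx, b_xy = identify(line1, line2)
r_squared = b_yx * b_xy
r = math.copysign(math.sqrt(r_squared), float(b_yx))   # r takes the sign of the coefficients

var_x = 60                               # variance of X used in the illustration
var_y = b_yx ** 2 * var_x / r_squared    # from b_yx^2 = r^2 * var_y / var_x

print(mean_x, mean_y, b_yx, b_xy, round(r, 6), var_y)
```

Using `Fraction` keeps the coefficients exact, so the comparison of the product with 1 is not affected by rounding.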
Exercise:

In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible: Variance of X = 9. Regression equations: 8x - 10y + 66 = 0; 40x - 18y = 214. What were (a) the mean values of x and y, (b) the standard deviation of y, and (c) the coefficient of correlation between x and y?
6.5 Key words

Regression: A general process of predicting one variable from another by statistical means, using previous data.
Regression line: A line fitted to a set of data points to estimate the relationship between the variables.
Dependent variable: The variable we are trying to predict.
Independent variable: The known variable in regression analysis.

**********

