
CFA Quantitative Analysis E-Book 4 of 8

Quantitative Analysis E-Book


Part 4 of 8

All Rights Reserved. Corporate Bridge TM

www.educorporatebridge.com


Sampling and Estimation


1. Introduction.


In investment analysis, it is often impossible to study every member of the population. Even if analysts could examine the entire population, it may not be economically efficient to do so. Sampling is the process of obtaining a sample.

A simple random sample is a sample obtained in such a way that each element of the population has an equal probability of being selected. The selection of any one element has no impact on the chance of selecting another element. A sample is random if the method for obtaining the sample meets the criterion of randomness (each element having an equal chance at each draw). The word 'simple' tells you that the process is not difficult, and the word 'random' tells you that you don't know in advance which observations will be selected in the sample. The actual composition of the sample itself does not determine whether or not it's a random sample.

Example

Suppose that a company has 30 directors, and you wish to choose 10 of them to serve on a committee. You could place the names of the 30 directors on separate pieces of paper, and draw them out one by one, until you have drawn a sample of size 10. Note that the conditions for simple random sampling have been satisfied, in that every one of the 30 directors has an equal (non-zero) chance of being selected in the sample.
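The directors example can be sketched in a few lines of Python. This is only an illustration: the director names and the fixed seed are assumptions made so the draw is repeatable, and `random.sample` stands in for drawing slips of paper from a hat.

```python
import random

# Hypothetical director names; the text only says there are 30 directors.
directors = [f"Director {i}" for i in range(1, 31)]

rng = random.Random(42)  # fixed seed (an assumption) so the draw is repeatable

# random.sample draws without replacement, so no name can appear twice,
# and every director has the same chance of ending up on the committee.
committee = rng.sample(directors, k=10)

print(committee)
```

Because the draw is made without replacement, the committee always has 10 distinct members.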


In this example, it makes no sense to sample with replacement, as this would mean that once you have drawn a name, that name goes back into the hat (i.e. it is replaced), and can be drawn again. If the same person's name is drawn more than once, then you won't end up with a sample of size 10 if you draw 10 names, so this experiment should be done without replacement.

A biased sample is one in which the method used to create the sample results in samples that are systematically different from the population. For instance, consider a research project on attitudes toward cricket. Collecting the data by publishing a questionnaire in a magazine and asking people to fill it out and send it in would produce a biased sample. People interested enough to spend their time and energy filling out and sending in the questionnaire are likely to have different attitudes toward cricket than those not taking the time to fill out the questionnaire.

It is important to realize that it is the method used to create the sample, not the actual makeup of the sample itself, that defines the bias. A random sample that is very different from the population is not biased: it is by definition not systematically different from the population. It is randomly different.


SAMPLING ERROR

The sample taken from a population is used to infer conclusions about the population. However, it's unlikely that the sample statistic would be identical to the population parameter. Suppose there is a class of 100 students, and a sample of size 10 from that class is chosen. If by chance most of the brightest students are selected in this sample, the sample gives a misleading picture of the population, because the sample mean x-bar will be much higher than the population mean in this case. Equally, a sample comprising mainly weaker students could be chosen, and then the opposite would apply. The ideal is to have a sample that comprises a few bright students, a few weaker students, and mainly average students, as this will give a good idea of the composition of the population. However, because which items go into the sample cannot be controlled, you are dependent to some degree on chance as to whether the results are favorable or not.

Sampling error (also called error of estimation) is the difference between the observed value of a statistic and the quantity it is intended to estimate. For example, sampling error of the mean equals sample mean minus population mean.


Sampling error can apply to statistics such as the mean, the variance, the standard deviation or any other values that can be obtained from the sample. The sampling error varies from sample to sample. A good estimator is one whose sampling error distribution is highly concentrated about the population parameter value.

Sampling error of the mean would be: Sample mean - population mean = $\bar{x} - \mu$.

Sampling error of the standard deviation would be: Sample standard deviation - population standard deviation = $s - \sigma$.
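As a rough illustration of these two definitions, the sketch below builds a made-up class of 100 grades, draws one simple random sample of 10, and reports both sampling errors. The population values, the seed and the variable names are assumptions for illustration only.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability

# Hypothetical population: grades of a class of 100 students.
population = [rng.gauss(62, 10) for _ in range(100)]
mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# One simple random sample of size 10, drawn without replacement.
sample = rng.sample(population, k=10)
x_bar = statistics.mean(sample)        # sample mean
s = statistics.stdev(sample)           # sample standard deviation (n - 1 divisor)

error_mean = x_bar - mu   # sampling error of the mean
error_sd = s - sigma      # sampling error of the standard deviation

print(f"sampling error of the mean: {error_mean:+.3f}")
print(f"sampling error of the standard deviation: {error_sd:+.3f}")
```

Changing the seed gives a different sample and therefore a different sampling error, which is the sense in which the error varies from sample to sample.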


Sampling distribution

A sample statistic itself is a random variable, which varies depending upon the composition of the sample. It therefore has a probability distribution. The sampling distribution of a statistic is the distribution of all the distinct possible values that the statistic can assume when computed from samples of the same size randomly drawn from the same population. The most commonly used sample statistics include the mean, variance and standard deviation.

If you compute the mean of a sample of 10 numbers, the value you obtain will not equal the population mean exactly; by chance it will be a little bit higher or a little bit lower. If you sampled sets of 10 numbers over and over again (computing the mean for each set), you would find that some sample means come much closer to the population mean than others. Some would be higher than the population mean and some would be lower.

Imagine sampling 10 numbers and computing the mean over and over again, say about 1,000 times, and then constructing a relative frequency distribution of those 1,000 means. This distribution of means is a very good approximation to the sampling distribution of the mean. The sampling distribution of the mean is a theoretical distribution that is approached as the number of samples in the relative frequency distribution increases. With 1,000 samples, the relative frequency distribution is quite close; with 10,000 it is even closer. As the number of samples approaches infinity, the relative frequency distribution approaches the sampling distribution.


The sampling distribution of the mean for a sample size of 10 is just an example; there is a different sampling distribution for other sample sizes. Also, keep in mind that the relative frequency distribution approaches a sampling distribution as the number of samples increases, not as the sample size increases since there is a different sampling distribution for each sample size. A sampling distribution can also be defined as the relative frequency distribution that would be obtained if all possible samples of a particular sample size were taken. For example, the sampling distribution of the mean for a sample size of 10 would be constructed by computing the mean for each of the possible ways in which 10 scores could be sampled from the population and creating a relative frequency distribution of these means. Although these two definitions may seem different, they are actually the same: Both procedures produce exactly the same sampling distribution. Statistics other than the mean have sampling distributions too. The sampling distribution of the median is the distribution that would result if the median instead of the mean were computed in each sample. Sampling distributions are very important since almost all inferential statistics are based on sampling distributions.
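The "all possible samples" definition can be made concrete with a population small enough to enumerate. The sketch below is only an illustration: the five-number population and the sample size of 2 are assumptions chosen so that the full list of samples stays short, and sampling is done without replacement.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical tiny population, small enough to enumerate every sample.
population = [2, 4, 6, 8, 10]
sample_size = 2

# Every possible sample of the chosen size (drawn without replacement).
samples = list(combinations(population, sample_size))
means = [Fraction(sum(sample), sample_size) for sample in samples]

# Relative frequency of each distinct sample mean: the sampling distribution.
counts = Counter(means)
sampling_distribution = {
    mean: Fraction(count, len(means)) for mean, count in sorted(counts.items())
}

for mean, rel_freq in sampling_distribution.items():
    print(f"mean = {float(mean):4.1f}  relative frequency = {float(rel_freq):.1f}")

population_mean = Fraction(sum(population), len(population))
mean_of_sample_means = sum(means) / len(means)
print(population_mean == mean_of_sample_means)  # True
```

In this small case the mean of the sampling distribution coincides exactly with the population mean, which is the unbiasedness property of the sample mean.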


Simple random vs. stratified random sampling

In stratified random sampling, the population is subdivided into subpopulations (strata) based on one or more classification criteria. Simple random samples are then drawn from each stratum (the sizes of the samples are proportional to the relative size of each stratum in the population). These samples are then pooled.

It is important to note that the size of the data in each stratum does not have to be the same or even similar, and frequently isn't. Stratified random sampling guarantees that population subdivisions of interest are represented in the sample. The estimates of parameters produced from stratified sampling have greater precision (i.e. smaller variance or dispersion) than estimates obtained from simple random sampling.


For example, investors may want to fully duplicate a bond index by owning all the bonds in the index in proportion to their market value weights. This is known as pure bond indexing. However, it's difficult and costly to implement because a bond index typically consists of thousands of issues. If simple random sampling is used, the sample selected may not accurately reflect the risk factors of the index.

Stratified random sampling can be used to replicate the bond index. Divide the population of index bonds into groups with similar risk factors (e.g. issuer, duration/maturity, coupon rate, credit rating, call exposure, etc.). Each group is called a stratum or cell. Select a sample from each cell proportional to the relative market weighting of the cell in the index. A stratified sample will ensure that at least one issue in each cell is included in the sample.
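The cell-by-cell procedure above might be sketched as follows. Everything concrete here is an assumption for illustration: the four cells, the bond identifiers, the target of 40 issues, and the use of issue counts as a stand-in for market-value weights.

```python
import random

rng = random.Random(7)  # fixed seed (an assumption) for a repeatable draw

# Hypothetical index: each bond belongs to exactly one risk cell (stratum).
cells = {
    "short/AAA": [f"S-AAA-{i}" for i in range(500)],
    "short/BBB": [f"S-BBB-{i}" for i in range(300)],
    "long/AAA": [f"L-AAA-{i}" for i in range(150)],
    "long/BBB": [f"L-BBB-{i}" for i in range(50)],
}
total_bonds = sum(len(bonds) for bonds in cells.values())
target_sample_size = 40

sample = {}
for cell, bonds in cells.items():
    # Allocate in proportion to the cell's share of the population,
    # but keep at least one issue so that every cell is represented.
    n_cell = max(1, round(target_sample_size * len(bonds) / total_bonds))
    # Simple random sample within the cell.
    sample[cell] = rng.sample(bonds, k=n_cell)

for cell, picks in sample.items():
    print(cell, len(picks))
```

With these made-up cell sizes the allocation is 20, 12, 6 and 2 issues, so the pooled sample mirrors the cell weights of the index.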


Time-series and cross-sectional data

Data come in many different shapes and sizes, and measure many different things at different times. Often financial analysts are interested in particular types of data such as time-series data or cross-sectional data. Time-series data is a set of observations collected at usually discrete and equally spaced time intervals. For example, the daily closing price of a certain stock recorded over the last six weeks is time-series data. Note that a time period that is too long or too short may lead to time-period bias. Refer to subject g for details.

Other examples of time-series would be staff numbers at a particular institution taken on a monthly basis in order to assess staff turnover rates, weekly sales figures of ice-cream sold during a holiday period at a seaside resort and the number of students registered for a particular course on a yearly basis. All of the above would be used to forecast likely data patterns in the future.


Cross-sectional data are observations that come from different individuals or groups at a single point in time. For example, the closing prices of a group of 20 different tech stocks on December 15, 1986 would be cross-sectional data. Note that the underlying population should consist of members with similar characteristics. For example, suppose you are interested in how much companies spend on research and development (R&D). Firms in some industries such as retail spend little on R&D, while firms in industries such as technology spend heavily on it. Therefore, it's inappropriate to summarize R&D data across all companies. Rather, analysts should summarize R&D data by industry, and then analyze the data in each industry group. Other examples of cross-sectional data would be an inventory of all ice creams in stock at a particular store, or a list of grades obtained by a class of students for a specific test.


2. The Central Limit Theorem.


The central limit theorem states that given a distribution with a mean $\mu$ and variance $\sigma^2$, the sampling distribution of the mean x-bar approaches a normal distribution with a mean $\mu$ and a variance $\sigma^2/N$ as N, the sample size, increases.

The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original distribution, x-bar approaches a normal distribution.

If the original variable X has a normal distribution, then x-bar will be normal regardless of the sample size. If the original variable X does not have a normal distribution, then x-bar will be approximately normal only if N $\ge$ 30 (a common rule of thumb). This is called a distribution-free result: no matter what distribution X has, x-bar will still be approximately normal for sufficiently large N.

Keep in mind that N is the sample size for each mean and not the number of samples. Remember in a sampling distribution the number of samples is assumed to be infinite. The sample size is the number of scores in each sample; it is the number of scores that goes into the computation of each mean.


Two things should be noted about the effect of increasing N:
1. The distributions become more and more normal.
2. The spread of the distributions decreases.

Based on the central limit theorem, when the sample size is large, you can:
1. Use the sample mean to infer the population mean.
2. Construct confidence intervals for the population mean based on the normal distribution.

Note that the central limit theorem does not prescribe that the underlying population must be normally distributed. Therefore, the central limit theorem can be applied on a population with any probability distribution.
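A small simulation can make the theorem tangible. The sketch below is an illustration under stated assumptions: the population is taken to be exponential with rate 1 (strongly skewed, with mean 1 and variance 1), each sample has N = 30 observations, and 5,000 repeated samples stand in for the theoretically infinite number of samples.

```python
import random
import statistics

rng = random.Random(123)  # fixed seed (an assumption) for repeatability

N = 30               # sample size for each mean
num_samples = 5_000  # number of repeated samples (finite stand-in for "infinite")

# Exponential population with rate 1: right-skewed, mean 1, variance 1.
mu, sigma_sq = 1.0, 1.0

sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(N)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
variance_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.4f} (theory: {mu})")
print(f"variance of sample means: {variance_of_means:.4f} (theory: {sigma_sq / N:.4f})")
```

The simulated mean and variance of x-bar should land close to $\mu$ and $\sigma^2/N$, even though the underlying population is far from normal.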


3. Standard Error of the Sample Mean.


The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. Standard errors are important because they reflect how much sampling fluctuation a statistic will show. The inferential statistics involved in the construction of confidence intervals and significance testing are based on standard errors. The standard error of a statistic depends on the sample size. In general, the larger the sample size, the smaller the standard error.

The standard error of a statistic is usually designated by the Greek letter sigma ($\sigma$) with a subscript indicating the statistic. The standard error of the mean is designated as $\sigma_m$. It is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean is: $\sigma_m = \sigma/\sqrt{N}$, where $\sigma$ is the standard deviation of the original distribution and N is the sample size (the number of scores each mean is based upon). This formula does not assume a normal distribution. However, many of the uses of the formula do assume a normal distribution. The formula shows that the larger the sample size, the smaller the standard error of the mean. More specifically, the size of the standard error of the mean is inversely proportional to the square root of the sample size.


Example 1

Suppose that the mean grade of students in a class is 62%, with a standard deviation of 10%. A sample of 30 students is taken from the class. Calculate the standard error of the sample mean, and interpret your results. You are given that $\mu = 62$ and $\sigma = 10$. Since n = 30, the standard error of the sample mean is: $\sigma_m = 10/\sqrt{30} = 1.8257$. This means that if you took all possible samples of size 30 from the class, the mean of all those sample means would be 62, and their standard deviation (the standard error) would be 1.8257.

Note that if you took a sample of size 50, the standard error would then be: $\sigma_m = 10/\sqrt{50} = 1.4142$. So, the standard error would drop as the sample size increased, which agrees with the information above.

When the sample standard deviation (s) is used as an estimate of $\sigma$ (when it is unknown), the estimated standard error of the mean is $s/\sqrt{N}$. In most practical applications analysts need to use this formula because the population standard deviation is almost never available.


Example 2

Suppose that the mean grade of students in a class is unknown, but a sample of 30 students is taken from the class, and the mean from the sample is found to be 60%, with a standard deviation of 9%. Calculate the standard error of the sample mean, and interpret your results. Now, $\mu$ and $\sigma$ are unknown, but m is given as 60 and s is given as 9. Since n = 30, you can estimate the standard error of the sample mean as: $9/\sqrt{30} = 1.6432$. This means that if you took all possible samples of size 30 from the class, you would estimate the standard error to be 1.6432. It is important to note that when you have $\sigma$, you must use it; but when you don't, you use its sample equivalent s.
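The arithmetic in Examples 1 and 2 can be checked with a short helper. The function name `standard_error` is a hypothetical one introduced here for illustration; the same formula is used whether the standard deviation passed in is the population value or its sample estimate s.

```python
import math

def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the sample mean: std_dev / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return std_dev / math.sqrt(n)

# Example 1: population standard deviation known (sigma = 10).
print(round(standard_error(10, 30), 4))  # 1.8257
print(round(standard_error(10, 50), 4))  # 1.4142

# Example 2: sigma unknown, so the sample standard deviation s = 9 is used.
print(round(standard_error(9, 30), 4))   # 1.6432
```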


4. Estimators.


Very often, there are a number of different estimators that can be used to estimate unknown population parameters. When faced with such a choice, it is desirable to know that the estimator chosen is the "best" under the circumstances, that is, it has more desirable properties than any of the other options available to us. There are three desirable properties of estimators:

1. Unbiasedness. An estimator's expected value (the mean of its sampling distribution) equals the parameter it is intended to estimate. For example, the sample mean is an unbiased estimator of the population mean, because the expected value of the sample mean is equal to the population mean.

2. Efficiency. An estimator is efficient if no other unbiased estimator of the same parameter has a sampling distribution with smaller variance. That is, in repeated samples, analysts expect the estimates from an efficient estimator to be more tightly grouped around the mean than estimates from other unbiased estimators. For example, the sample mean is an efficient estimator of the population mean, and the sample variance is an efficient estimator of the population variance.

3. Consistency. A consistent estimator is one for which the probability of accurate estimates (estimates close to the value of the population parameter) increases as sample size increases. In other words, a consistent estimator's sampling distribution becomes concentrated on the value of the parameter it is intended to estimate as the sample size approaches infinity. For example, as the sample size increases to infinity, the standard error of the sample mean declines to 0, and the sampling distribution concentrates around the population mean. Therefore, the sample mean is a consistent estimator of the population mean.


The single estimate of an unknown population parameter calculated as a sample mean is called a point estimate of the mean. The formula used to compute the point estimate is called an estimator. The specific value calculated from sample observations using an estimator is called an estimate. For example, the sample mean is a point estimate of the population mean. Suppose two samples are taken from a population, and the sample means are 16 and 21 respectively. Therefore, 16 and 21 are two estimates of the population mean. Note that an estimator will yield different estimates as repeated samples are taken from the same population.

A confidence interval is an interval for which one can assert with a given probability $1 - \alpha$, called the degree of confidence, that it will contain the parameter it is intended to estimate. This interval is often referred to as the $100(1 - \alpha)\%$ confidence interval for the parameter, where $\alpha$ is referred to as the level of significance. The end points of a confidence interval are called the lower and upper confidence limits.

For example, suppose that a 95% confidence interval for the population mean is 20 to 40. This means that:
- There is a 95% probability that the population mean lies in the range of 20 to 40;
- "95%" is the degree of confidence;
- "5%" is the level of significance;
- 20 and 40 are the lower and upper confidence limits, respectively.


5. Confidence Intervals for the Population Mean.


Confidence intervals are typically constructed by using the following structure:

Confidence Interval = Point Estimate ± Reliability Factor x Standard Error

- Point estimate is the value of a sample statistic of the population parameter.
- Reliability factor is a number based on the sampling distribution of the point estimate and the degree of confidence ($1 - \alpha$).
- Standard error refers to the standard error of the sample statistic that is used to produce the point estimate.

Whatever the distribution of the population, the sample mean is always the point estimate used to construct the confidence intervals for the population mean. The reliability factor and the standard error, however, may vary depending on three factors:
1. Distribution of population: normal or non-normal.
2. Population variance: known or unknown.
3. Sample size: large or small.


z-Statistic: a standard normal random variable

If a population is normally distributed with a known variance, the z-statistic is used as the reliability factor to construct confidence intervals for the population mean. In practice, the population standard deviation is rarely known. However, learning how to compute a confidence interval when the standard deviation is known is an excellent introduction to how to compute a confidence interval when the standard deviation has to be estimated.

Three values are used to construct a confidence interval for $\mu$:
1. The sample mean (m);
2. The value of z (which depends on the level of confidence); and
3. The standard error of the mean ($\sigma_m$).

The confidence interval has m for its center and extends a distance equal to the product of z and $\sigma_m$ in both directions. Therefore, the formula for a confidence interval is: $m - z\sigma_m \le \mu \le m + z\sigma_m$


For a $100(1 - \alpha)\%$ confidence interval for the population mean, the z-statistic to be used is $z_{\alpha/2}$. $z_{\alpha/2}$ denotes the point of the standard normal distribution such that $\alpha/2$ of the probability falls in the right-hand tail. Effectively, what is happening is that the $1 - \alpha$ of the area that makes up the confidence interval falls in the center of the graph, that is, symmetrically around the mean. This leaves $\alpha$ of the area in the two tails combined, or $\alpha/2$ of the area in each tail.

Commonly used reliability factors are as follows:
- 90% confidence intervals: $z_{0.05} = 1.645$. $\alpha$ is 10%, with 5% in each tail.
- 95% confidence intervals: $z_{0.025} = 1.96$. $\alpha$ is 5%, with 2.5% in each tail.
- 99% confidence intervals: $z_{0.005} = 2.576$ (often rounded to 2.58). $\alpha$ is 1%, with 0.5% in each tail.
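These reliability factors can be reproduced without a z table. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_reliability_factor` is a hypothetical one introduced for illustration.

```python
from statistics import NormalDist

def z_reliability_factor(confidence: float) -> float:
    """Return z_{alpha/2} for a two-sided interval with the given degree of confidence."""
    alpha = 1.0 - confidence
    # inv_cdf returns the point with the requested probability to its left,
    # so asking for 1 - alpha/2 leaves alpha/2 in the right-hand tail.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%}: z = {z_reliability_factor(confidence):.3f}")
```

The printed values should agree with a standard z table to three decimal places.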


Example

Assume that the standard deviation of SAT verbal scores in a school system is known to be 100. A researcher wishes to estimate the mean SAT score and compute a 95% confidence interval from a random sample of 10 scores. The 10 scores are: 320, 380, 400, 420, 500, 520, 600, 660, 720, and 780. Therefore, m = 530, N = 10, and $\sigma_m = 100/\sqrt{10} = 31.62$. The value of z for the 95% confidence interval is the number of standard deviations one must go from the mean (in both directions) to contain .95 of the scores.

It turns out that one must go 1.96 standard deviations from the mean in both directions to contain .95 of the scores. The value of 1.96 was found using a z table. Since each tail is to contain .025 of the scores, you find the value of z below which 1 - 0.025 = 0.975 of the scores fall. This value is 1.96.


All the components of the confidence interval are now known: m = 530, $\sigma_m$ = 31.62, z = 1.96.

Lower limit = 530 - (1.96)(31.62) = 468.02

Upper limit = 530 + (1.96)(31.62) = 591.98

Therefore, $468.02 \le \mu \le 591.98$. This means that the experimenter can be 95% certain that the mean SAT in the school system is between 468 and 592. This also means that if the experimenter repeatedly took samples from the population and calculated a number of different 95% confidence intervals using the sample information, on average 95% of those intervals would contain $\mu$. Notice that this is a rather large range of scores. Naturally, if a larger sample size had been used, the range of scores would have been smaller.


The computation of the 99% confidence interval is exactly the same except that 2.576 (often rounded to 2.58) rather than 1.96 is used for z. The 99% confidence interval is: $448.54 \le \mu \le 611.46$. As it must be, the 99% confidence interval is even wider than the 95% confidence interval.

Summary of Computations
1. Compute m = $\sum X / N$.
2. Compute $\sigma_m = \sigma/\sqrt{N}$.
3. Find z (1.96 for 95% interval; 2.58 for 99% interval).
4. Lower limit = $m - z\sigma_m$
5. Upper limit = $m + z\sigma_m$
6. Lower limit $\le \mu \le$ Upper limit

Assumptions:
1. Normal distribution
2. $\sigma$ is known
3. Scores are sampled randomly and are independent
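The steps in this summary can be sketched as a small function and checked against the SAT example. The function name `z_confidence_interval` is hypothetical, and the reliability factor is computed from the standard normal distribution rather than read from a table, so the results may differ from the worked figures in the last decimal place.

```python
import math
from statistics import NormalDist, mean

def z_confidence_interval(scores, sigma, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    m = mean(scores)                                     # step 1: sample mean
    sigma_m = sigma / math.sqrt(len(scores))             # step 2: standard error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # step 3: reliability factor
    return m - z * sigma_m, m + z * sigma_m              # steps 4 and 5: the limits

scores = [320, 380, 400, 420, 500, 520, 600, 660, 720, 780]

low_95, high_95 = z_confidence_interval(scores, sigma=100, confidence=0.95)
low_99, high_99 = z_confidence_interval(scores, sigma=100, confidence=0.99)

print(f"95%: {low_95:.2f} to {high_95:.2f}")
print(f"99%: {low_99:.2f} to {high_99:.2f}")
```

The 99% interval comes out wider than the 95% interval, as the text says it must.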


There are three other points worth mentioning here:
- The point estimate will always lie exactly at the midway mark of the confidence interval. This is because it is the "best" estimate for $\mu$, and so the confidence interval expands out from it in both directions.
- The higher the percentage of confidence, the wider the interval will be. This is because as the percentage is increased, a wider interval is needed to give us a greater chance of capturing the unknown population value within that interval.
- The width of the confidence interval is always twice the part after the positive or negative sign, that is, twice the reliability factor x standard error. The width is simply the upper limit minus the lower limit.

It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both $\mu$ and $\sigma$.


STUDENT'S T-DISTRIBUTION

When $\sigma$ is known, the formula $m - z\sigma_m \le \mu \le m + z\sigma_m$ is used for a confidence interval. When $\sigma$ is not known, $s_m = s/\sqrt{N}$ (N is the sample size) is used as an estimate of $\sigma_m$. Whenever the standard deviation is estimated, the t rather than the normal (z) distribution should be used. The values of t are larger than the values of z, so confidence intervals when $\sigma$ is estimated are wider than confidence intervals when $\sigma$ is known. The formula for a confidence interval for $\mu$ when $\sigma$ is estimated is: $m - t s_m \le \mu \le m + t s_m$

where m is the sample mean, $s_m$ is an estimate of $\sigma_m$, and t depends on the degrees of freedom and the level of confidence.


The t-distribution is a symmetrical probability distribution defined by a single parameter known as degrees of freedom (df). Each value for the number of degrees of freedom defines one distribution in this family of distributions. Like a standard normal distribution (i.e. a z-distribution), the t-distribution is symmetrical around its mean. The t-distribution has the following characteristics:
- It is an estimated standardized normal distribution. When n gets larger, t approximates z (s approaches $\sigma$).
- The mean is 0, and the distribution is bell shaped.
- There is not one t-distribution, but a family of t-distributions. All t-distributions have the same mean of 0. Standard deviations of these t-distributions differ according to the sample size, n.
- The shape depends on degrees of freedom (n - 1).
- The t-distribution is less peaked than a standard normal distribution, and has fatter tails (i.e. more probability in the tails).
- $t_{\alpha/2}$ tends to be greater than $z_{\alpha/2}$ for a given level of significance, $\alpha$.
- Its variance is v/(v - 2) (for v > 2), where v = n - 1. It is always bigger than 1. As v increases, the variance approaches 1.


The value of t can be determined from a t table. The degrees of freedom for t are equal to the degrees of freedom for the estimate of $\sigma_m$, which is equal to N - 1.


A portion of the t-table is presented below:

Level of significance ($\alpha$) for two-tailed test

df       0.20      0.10      0.05      0.02      0.01
1       3.078     6.314    12.706    31.821    63.657
2       1.886     2.920     4.303     6.965     9.925
...
29      1.311     1.699     2.045     2.462     2.756
30      1.310     1.697     2.042     2.457     2.750

Suppose the sample size (n) is 30, and the level of significance ($\alpha$) is 5%. df = n - 1 = 29. $t_{\alpha/2} = t_{0.025} = 2.045$ (find the 29 df row, and then move to the 0.05 column).


Example

Assume a researcher is interested in estimating the mean reading speed (number of words per minute) of high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken and the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data:
m = 366.6667
$s_m$ = 60.9736
df = 6 - 1 = 5
t = 2.571

Therefore, the lower limit is: $m - t s_m = 209.904$ and the upper limit is: $m + t s_m = 523.430$. Therefore, the 95% confidence interval is: $209.904 \le \mu \le 523.430$. Thus, the researcher can be 95% sure that the mean reading speed of high-school graduates is between 209.904 and 523.430.


Summary of Computations
1. Compute m = $\sum X / N$.
2. Compute s.
3. Compute $s_m = s/\sqrt{N}$.
4. Compute df = N - 1.
5. Find t for these df using a t table.
6. Lower limit = $m - t s_m$
7. Upper limit = $m + t s_m$
8. Lower limit $\le \mu \le$ Upper limit

Assumptions:
1. Normal distribution
2. Scores are sampled randomly and are independent
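The t-based steps in this summary can be sketched in the same spirit and checked against the reading-speed example. The function name `t_confidence_interval` is hypothetical. Python's standard library has no t table, so the t value is passed in as an argument and is assumed to have been looked up for the right degrees of freedom.

```python
import math
from statistics import mean, stdev

def t_confidence_interval(data, t_value):
    """Confidence interval for the mean when sigma is estimated by s.

    t_value must be looked up in a t table for df = len(data) - 1.
    """
    m = mean(data)                   # sample mean
    s = stdev(data)                  # sample standard deviation (n - 1 divisor)
    s_m = s / math.sqrt(len(data))   # estimated standard error of the mean
    return m - t_value * s_m, m + t_value * s_m

speeds = [200, 240, 300, 410, 450, 600]

# t = 2.571 is the table value quoted in the example for df = 5 at 95% confidence.
low, high = t_confidence_interval(speeds, t_value=2.571)
print(f"{low:.3f} to {high:.3f}")
```

The limits should match the worked example (209.904 and 523.430) up to rounding.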


Discuss the issues surrounding selection of the appropriate sample size

It's all starting to become a little confusing. Which distribution do you use?

When a large sample size (generally 30 or more observations) is used, a z table can always be used to construct the confidence interval. It does not matter if the population distribution is normal, or if the population variance is known or not. This is because the central limit theorem assures that when the sample is large, the distribution of the sample mean is approximately normal. However, the t-statistic is more conservative because the t-statistic tends to be greater than the z-statistic, and therefore using the t-statistic will result in a wider confidence interval.
However, if there is only a small sample size, a t table has to be used to construct the confidence interval when the population distribution is normal and the population variance is not known.


If the population distribution is not normal, there is no way to construct a confidence interval from a small sample (even if the population variance is known). Therefore, all else equal, you should try to select a sample larger than 30. The larger the sample size, the more precise the confidence interval.

In general, at least one of the following is needed:
1. A normal distribution for the population.
2. A sample size that is greater than or equal to 30.

If one or both of the above occur, then a z-table or t-table is used, depending on whether the population variance is known or unknown. If neither of the above occurs, then the question cannot be answered.


A summary of the situation is as follows:

1. If the population is normally distributed, and the population variance is known, use a z-score irrespective of sample size.
2. If the population is normally distributed, and the population variance is unknown, use a t-score irrespective of sample size.
3. If the population is not normally distributed, and the population variance is known, use a z-score only if n >= 30, otherwise it cannot be done.
4. If the population is not normally distributed, and the population variance is unknown, use a t-score only if n >= 30, otherwise it cannot be done.
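The four rules above can be expressed as a small decision function. This is only a sketch of the summary as stated; the function name `choose_statistic` and its parameters are illustrative, not part of any library.

```python
from typing import Optional


def choose_statistic(population_normal: bool, variance_known: bool, n: int) -> Optional[str]:
    """Return "z" or "t" per the summary rules, or None if no interval can be built."""
    large_sample = n >= 30

    # A non-normal population with a small sample cannot be handled.
    if not population_normal and not large_sample:
        return None

    # Otherwise the choice depends only on whether the population variance is known.
    return "z" if variance_known else "t"


print(choose_statistic(population_normal=True, variance_known=True, n=10))    # z
print(choose_statistic(population_normal=True, variance_known=False, n=10))   # t
print(choose_statistic(population_normal=False, variance_known=True, n=45))   # z
print(choose_statistic(population_normal=False, variance_known=False, n=12))  # None
```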


6. Common biases in sampling methods.


As has already been mentioned repeatedly, if there are problems with the choice of sample, then the conclusions that are drawn from a sample could be in error. There are a number of different types of bias that can creep into samples. It is important to be aware of them, and have the ability to comment on their possible appearance in the data where appropriate.

Data-snooping bias is the bias in the inference drawn as a result of prying into the empirical results of others to guide your own analysis. Finding seemingly significant but in fact spurious patterns in the data is a serious problem in financial analysis. Although it afflicts all non-experimental sciences, data-snooping is particularly problematic for financial analysis because of the large number of empirical studies performed on the same datasets. Given enough time, enough attempts, and enough imagination, almost any pattern can be teased out of any dataset. In some cases, these spurious patterns are statistically small, almost unnoticeable in isolation. But because small effects in financial calculations can often lead to very large differences in investment performance, data-snooping biases can be surprisingly substantial.


For example, after examining the empirical evidence from 1986 to 2002, Professor Minard concludes that a growth investment strategy produces superior investment performance. After reading about Professor Minard's study, Monica decides to conduct research on growth versus value investing based on the same or related historical data used by Professor Minard. Monica's research is subject to data-snooping bias because, among other things, the patterns in the data used by Professor Minard may be spurious. The best way to avoid data-snooping bias is to examine new data. However, data-snooping bias is difficult to avoid because investment analysis is typically based on historical or hypothesized data.

Data-snooping bias can easily lead to data-mining bias. Data-mining is the practice of finding forecasting models by extensive searching through databases for patterns or trading rules (i.e. repeatedly "drilling" in the same data until you find something). It has a very specific definition: continually mixing and matching the elements of a database until one "discovers" two or more data series that are highly correlated. Data-mining also refers more generically to any of a number of practices in which data can be tortured into confessing anything.


Two signs may indicate the existence of data-mining in research findings about profitable trading strategies:

1. Many of the variables actually used in the research are not reported. This may indicate that the researchers were searching through many unreported variables.
2. There is no plausible economic theory available to explain why those strategies work.

To avoid data-mining, analysts should use out-of-sample data to test a potentially profitable trading rule. That is, analysts should test the trading rule on a data set other than the one used to establish the rule.
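The value of out-of-sample testing can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a real study: the "returns" and the 200 candidate "signals" are pure random noise, so by construction no signal has any genuine predictive power. Searching through enough of them still tends to turn up one that looks well correlated in-sample.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
N_SIGNALS = 200          # hypothetical number of candidate trading signals
N_IN = 24                # in-sample observations (e.g. 24 months)
N_OUT = 24               # out-of-sample observations

# Everything is independent noise: there is nothing real to discover.
returns = [rng.gauss(0, 1) for _ in range(N_IN + N_OUT)]
signals = [[rng.gauss(0, 1) for _ in range(N_IN + N_OUT)] for _ in range(N_SIGNALS)]

# "Mine" the in-sample period for the signal most correlated with returns.
in_sample_corrs = [pearson(s[:N_IN], returns[:N_IN]) for s in signals]
best_idx = max(range(N_SIGNALS), key=lambda i: abs(in_sample_corrs[i]))
best_in = in_sample_corrs[best_idx]

# Test the same signal on data that was not used to select it.
best_out = pearson(signals[best_idx][N_IN:], returns[N_IN:])

print(f"Best in-sample correlation:  {best_in:+.3f}")
print(f"Same signal, out-of-sample: {best_out:+.3f}")
```

Because the winning signal was chosen for its in-sample fit, its in-sample correlation is inflated by the search itself; the out-of-sample figure is an honest estimate and will typically be much closer to zero.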
Sample selection bias occurs when data availability leads to certain assets being excluded from the analysis. The discrete choice has become a popular tool for assessing the value of non-market goods. Surveys used in these studies frequently suffer from large non-response which can lead to significant bias in parameter estimates and in the estimate of mean


Survivorship bias is the most common type of sample selection bias. It occurs when studies are conducted on databases that have eliminated all companies that have ceased to exist (often due to bankruptcy). The findings from such studies most likely will be upwardly biased, since the surviving companies will look better than those that no longer exist. For example, many mutual fund databases provide historical data about only those funds that are currently in existence. As a result, funds that have ceased to exist due to closure or merger do not appear in these databases. Generally, funds that have ceased to exist have lower returns relative to the surviving funds. Therefore, the analysis of a mutual fund database with survivorship bias will overestimate the average mutual fund return because the database only includes the better-performing funds. Another example is the return data on stocks listed on an exchange, which is subject to survivorship bias: it is difficult to collect information on delisted companies, and these companies often have poor performance.

Look-ahead bias exists when studies assume that fundamental information is available when it is not. For example, researchers often assume a person had annual earnings data in January; in reality the data might not be available until March. This usually biases results upwards.

Time period bias occurs when a test design is based on a time period that may make the results time-period specific. Even the worst performers have months or even years in which they look wonderful. After all, stopped clocks are right twice a day. To eliminate strategies that have just been lucky, research must encompass many years. However, if the time period is too long, the fundamental economic structure may have changed during the time frame, resulting in two data sets that reflect different relationships.
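The upward bias from survivorship described above can be sketched with simulated data. All numbers here are hypothetical: 1,000 fund returns are drawn from a normal distribution, and any fund returning less than -5% is assumed to close and drop out of the database.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical annual returns (%) for 1,000 funds: mean 6%, standard deviation 10%.
all_returns = [rng.gauss(6.0, 10.0) for _ in range(1000)]

# Hypothetical closure rule: funds returning less than -5% cease to exist
# and are removed from the database.
CLOSURE_THRESHOLD = -5.0
survivors = [r for r in all_returns if r >= CLOSURE_THRESHOLD]

true_mean = statistics.mean(all_returns)    # what an unbiased database would show
survivor_mean = statistics.mean(survivors)  # what a survivors-only database shows

print(f"Funds in full population: {len(all_returns)}")
print(f"Funds surviving:          {len(survivors)}")
print(f"Mean return, all funds:   {true_mean:.2f}%")
print(f"Mean return, survivors:   {survivor_mean:.2f}%")
```

Because only the worst performers are removed, the survivors' average must exceed the average of the full population, which is the overestimate the text describes for survivorship-biased mutual fund databases.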


For FREE Resources: https://www.educorporatebridge.com/freebies.php
Corporate Bridge Blog (Finance News, Articles, Interview Tips etc): https://www.educorporatebridge.com/blog
For Online Finance Courses: https://www.educoporatebridge.com
For any other enquiry / information: Email support@cbacademy.in

Disclaimer: Please refer to the updated curriculum of CFA Level 1 for further information.