
MB-40

Q-1 Statistics is the backbone of decision-making. Comment.

a) Statistics is the backbone of decision-making


Due to advanced communication networks, rapid changes in consumer behaviour, the varied expectations of different consumers and new market openings, modern managers have the difficult task of making quick and appropriate decisions. Therefore, there is a need for them to depend more upon quantitative techniques like mathematical models, statistics, operations research and econometrics.

Decision making is a key part of our day-to-day life. Even when we wish to purchase a television, we like to know the price, quality, durability, and maintainability of various brands and models before buying one. As you can see, in this scenario we are collecting data and making an optimum decision. In other words, we are using Statistics.

Again, suppose a company wishes to introduce a new product. It has to collect data on market potential, consumer likings, availability of raw materials and the feasibility of producing the product. Hence, data collection is the backbone of any decision-making process.

Many organisations find themselves data-rich but poor at drawing information from it. Therefore, it is important to develop the ability to extract meaningful information from raw data to make better decisions. Statistics plays an important role in this aspect.

Statistics is broadly divided into two main categories: descriptive statistics and inferential statistics.

Descriptive Statistics: Descriptive statistics is used to present a general description of data which is summarised quantitatively. This is mostly useful in clinical research, when communicating the results of experiments.

Inferential Statistics: Inferential statistics is used to make valid inferences from the data, which are helpful in effective decision making for managers or professionals. Statistical methods such as estimation, prediction and hypothesis testing belong to inferential statistics. Researchers make deductions or conclusions from the collected data samples regarding the characteristics of the large population from which the samples are taken. So we can say that Statistics is the backbone of decision-making.
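As a rough illustration of the two categories, here is a short sketch in plain Python using only the standard library (the sample figures are hypothetical):

    import math
    import statistics

    # Hypothetical sample: monthly sales (units) from 12 randomly chosen outlets
    sample = [42, 51, 38, 47, 55, 44, 49, 40, 53, 46, 48, 50]

    # Descriptive statistics: summarise the data we actually have.
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    print(f"Sample mean = {mean:.1f}, sample SD = {sd:.1f}")

    # Inferential statistics: estimate the population mean from the sample,
    # here via a rough 95% interval using the normal approximation (z = 1.96).
    margin = 1.96 * sd / math.sqrt(len(sample))
    print(f"Approximate 95% interval for the population mean: "
          f"{mean - margin:.1f} to {mean + margin:.1f}")

The first two lines of output merely describe the sample; the interval makes a (hedged) claim about the population the sample came from.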

b) Statistics is as good as the user. Comment.

Statistics is used for various purposes. It is used to simplify mass data and to make comparisons easier. It is also used to bring out trends and tendencies in the data, as well as the hidden relations between variables. All this helps to make decision making much easier. Let us look at each function of Statistics in detail.

1. Statistics simplifies mass data

The use of statistical concepts helps in simplification of complex data. Using statistical concepts, the managers can make decisions more easily. The statistical methods help in reducing the complexity of the data and consequently in the understanding of any huge mass of data.

2. Statistics makes comparison easier

Without statistical methods and concepts, the collection and comparison of data cannot be done easily. Statistics helps us to compare data collected from different sources. Grand totals, measures of central tendency, measures of dispersion, graphs and diagrams, and the coefficient of correlation all provide ample scope for comparison.

3. Statistics brings out trends and tendencies in the data

After data is collected, it is easy to analyse the trends and tendencies in the data by using the various concepts of Statistics.

4. Statistics brings out the hidden relations between variables

Statistical analysis helps in drawing inferences on data. Statistical analysis brings out the hidden relations between variables.

5. Decision making becomes easier

With the proper application of Statistics and statistical software packages to the collected data, managers can take effective decisions which can increase the profits in a business. Seeing all these functions, we can say that Statistics is as good as the user.

Q-2 Distinguish between the following, with examples: a) Inclusive and exclusive class intervals
Class intervals are of two types: exclusive and inclusive. The class interval that does not include the upper class limit is called an exclusive type of class interval. The class interval that includes the upper class limit is called an inclusive type of class interval. Example: an exclusive series is one which does not include the upper limit, for example 00-10, 10-20, 20-30, 30-40, 40-50. In the first class (00-10) we consider numbers from 00 up to, but not including, 10; the value 10 itself is counted in 10-20. An inclusive series is one in which both limits are included, for example 00-09, 10-19, 20-29, 30-39, 40-49. Here both 00 and 09 fall in the first class (00-09), and 10 falls in the next one.
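A minimal sketch of the two conventions in plain Python (integer data assumed; the helper names are just for illustration):

    def exclusive_class(value, width=10):
        # Exclusive intervals 00-10, 10-20, ...: the upper limit belongs to the
        # NEXT class, so 10 falls in 10-20.
        lower = (value // width) * width
        return f"{lower}-{lower + width}"

    def inclusive_class(value, width=10):
        # Inclusive intervals 00-09, 10-19, ...: both limits belong to the class,
        # so 9 stays in 0-9 and 10 starts 10-19.
        lower = (value // width) * width
        return f"{lower}-{lower + width - 1}"

    for v in [9, 10, 19, 20]:
        print(v, "->", exclusive_class(v), "(exclusive) |", inclusive_class(v), "(inclusive)")

Running this shows the boundary values 10 and 20 landing in the next class under both conventions, while the written labels differ (10-20 versus 10-19).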

b) Continuous and discrete data


Discrete data can only take on particular values, with no values in between. Data like the number of siblings a person has or the number of cars a person owns is discrete because you can own 0 cars, 1 car or 2 cars and so on, but you cannot own 1.5 cars.

Continuous data can take on any value in a range. Temperature and height are continuous because you can be 69.32894... inches tall; you can be any fraction of an inch tall.

A type of data is discrete if there are only a finite number of values possible, or if there is a space on the number line between each two possible values. Example: a 5-question quiz is given in a Math class. The number of correct answers on a student's quiz is an example of discrete data, since it would have to be one of the following: 0, 1, 2, 3, 4 or 5. There are not an infinite number of values, therefore this data is discrete. Also, if we were to draw a number line and place each possible value on it, we would see a space between each pair of values. Another example: in order to obtain a taxi licence in Las Vegas, a person must pass a written exam regarding different locations in the city. How many attempts it takes a person to pass this test is also an example of discrete data. A person could take it once, or twice, or three times, or four times, and so on. The possible values are 1, 2, 3, ... There are infinitely many possible values, but if we were to put them on a number line, we would see a space between each pair of values. Discrete data usually occurs where there are only a certain number of values, or when we are counting something (using whole numbers).

Continuous data makes up the rest of numerical data. This is a type of data that is usually associated with some sort of physical measurement. Example: the height of trees at a nursery is continuous data. Is it possible for a tree to be 76.2" tall? Sure. How about 76.29"? Yes. How about 76.2914563782"? You betcha! The possibilities depend upon the accuracy of our measuring device. One general way to tell if data is continuous is to ask yourself whether it is possible for the data to take on values that are fractions or decimals. If your answer is yes, this is usually continuous data. Example: the length of time it takes for a light bulb to burn out is continuous data. Could it take 800 hours? How about 800.7? 800.7354? The answer to all three is yes.

c) Qualitative and quantitative data

Qualitative data
Qualitative data is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with "categorical" data.

For example: favorite color = "yellow" height = "tall"

Although we may have categories, the categories may have a structure to them. When there is not a natural ordering of the categories, we call these nominal categories. Examples might be gender, race, religion, or sport.

When the categories may be ordered, these are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables; however, we may not know which value is the best or worst of these. Note that the distance between these categories is not something we can measure.

Quantitative data
Quantitative data is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable. For example, the social security number is a number, but not something that one can add or subtract.

For example: favorite color = "450 nm" height = "1.8 m"

Quantitative data are always associated with a scale measure.

Probably the most common scale type is the ratio scale. Observations of this type are on a scale that has a meaningful zero value but also has an equidistant measure (i.e., the difference between 10 and 20 is the same as the difference between 100 and 110). For example, a 10-year-old girl is twice as old as a 5-year-old girl. Since you can measure zero years, time is a ratio-scale variable. Money is another common ratio-scale quantitative measure. Observations that you count are usually ratio-scale (e.g., number of widgets).

A more general quantitative measure is the interval scale. Interval scales also have an equidistant measure. However, the doubling principle breaks down on this scale. A temperature of 50 degrees Celsius is not "half as hot" as a temperature of 100 degrees, but a difference of 10 degrees indicates the same difference in temperature anywhere along the scale. The Kelvin temperature scale, however, constitutes a ratio scale because on the Kelvin scale zero indicates absolute zero, the complete absence of heat. So one can say, for example, that 200 K is twice as hot as 100 K.
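A quick worked check of the doubling principle in plain Python (the two temperatures are chosen only for illustration):

    # On the Celsius (interval) scale, 100 degrees looks like "twice" 50 degrees,
    # but the ratio is meaningless because 0 degrees C is not an absence of heat.
    c1, c2 = 50.0, 100.0

    # Convert to the Kelvin (ratio) scale, where 0 K really is the absence of heat.
    k1, k2 = c1 + 273.15, c2 + 273.15

    print(f"Celsius ratio: {c2 / c1:.2f}")   # 2.00 -- not physically meaningful
    print(f"Kelvin ratio:  {k2 / k1:.2f}")   # 1.15 -- 100 C is only about 15% "hotter"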

d) Class limits and class intervals

Class Limits
Class limits are the smallest and largest observations (data, events, etc.) in each class. Therefore, each class has two limits: a lower and an upper. Example:

Class      Frequency
200-299    12
300-399    19
400-499    6
500-599    2
600-699    11
700-799    7
800-899    3
Total      60

Using the frequency table above, what are the lower and upper class limits for the first three classes? For the first class, 200-299, the lower class limit is 200 and the upper class limit is 299. For the second class, 300-399, the lower class limit is 300 and the upper class limit is 399. For the third class, 400-499, the lower class limit is 400 and the upper class limit is 499.

Class Intervals
The class interval is the difference between the upper and lower class boundaries of any class. Example (using the same frequency table as above):

Class      Frequency
200-299    12
300-399    19
400-499    6
500-599    2
600-699    11
700-799    7
800-899    3
Total      60

Using the table above, determine the class interval for the first class, 200-299:
Class interval = upper class boundary − lower class boundary
Upper class boundary = 299.5
Lower class boundary = 199.5
Therefore, the class interval = 299.5 − 199.5 = 100.
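The same arithmetic can be written as a short Python sketch, using the class limits from the table above (the half-unit adjustment assumes the data are recorded to the nearest whole number):

    classes = [(200, 299), (300, 399), (400, 499), (500, 599),
               (600, 699), (700, 799), (800, 899)]

    for lower_limit, upper_limit in classes:
        # Boundaries sit half a unit outside the limits for whole-number data.
        lower_boundary = lower_limit - 0.5
        upper_boundary = upper_limit + 0.5
        interval = upper_boundary - lower_boundary
        print(f"{lower_limit}-{upper_limit}: boundaries {lower_boundary} to "
              f"{upper_boundary}, class interval = {interval}")

Every class in this table comes out with the same class interval of 100.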

Q-4 List down various measures of central tendency and explain the difference between them.

Measures of Central Tendency
Introduction
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode. The mean, median and mode are all valid measures of central tendency but, under different conditions, some measures of central tendency become more appropriate to use than others. In the following sections we will look at the mean, median and mode, learn how to calculate them, and see under what conditions they are most appropriate to be used.

Mean (Arithmetic)
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although its use is most often with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, then the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of...":

x̄ = Σx / n

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean? This is because, in statistics, samples and populations have very different meanings, and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lowercase letter "mu", denoted as μ:

μ = Σx / N

The mean is essentially a model of your data set: a single value chosen to represent the whole set. You will notice, however, that the mean is not often one of the actual values that you have observed in your data set. One of its important properties is that it minimises error in the prediction of any one value in your data set; that is, it is the value that produces the lowest amount of error compared with all other values in the data set. Another important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency for which the sum of the deviations of each value from the mean is always zero.
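These formulas and the zero-sum-of-deviations property are easy to verify in a few lines of plain Python (the data set here is the staff-salary example used just below):

    data = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # salaries in $k

    n = len(data)
    mean = sum(data) / n                   # x-bar = (x1 + x2 + ... + xn) / n
    deviations = [x - mean for x in data]

    print(f"Mean = {mean}")                                             # 30.7
    print(f"Sum of deviations from the mean = {sum(deviations):.10f}")  # 0, up to rounding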

When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

Staff    Salary
1        15k
2        18k
3        16k
4        14k
5        15k
6        15k
7        12k
8        17k
9        90k
10       95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the $12k to $18k range. The mean is being skewed by the two large salaries. Therefore, in this situation we would like to have a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.

Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e., the frequency distribution of our data is skewed). If we consider the normal distribution - as this is the most frequently assessed in statistics - when the data is perfectly normal, the mean, median and mode are identical. Moreover, they all represent the most typical value in the data set. However, as the data becomes skewed, the mean loses its ability to provide the best central location for the data because the skewed data drags it away from the typical value. The median, however, best retains this position and is not as strongly influenced by the skewed values. This is explained in more detail in the skewed-distribution section later in this guide.

Median
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark - in this case, 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Now we have to take the 5th and 6th scores in our data set and average them to get a median of 55.5.

Mode
The mode is the most frequent score in our data set. It represents the highest bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being the most popular option.

Normally, the mode is used for categorical data, where we wish to know which is the most common category. For example, in a survey of people's preferred mode of transport, the mode is the most frequently chosen form of transport.

In such a data set we might see that the most common form of transport is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency.

We are then stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data, as we are more likely not to have any one value that is more frequent than the others. For example, consider measuring 30 people's weights (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight, e.g. 67.4 kg? The answer is: probably very unlikely. Many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight, that is, to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data. Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set.

Suppose, for example, that the mode of a data set is 2, while the data is mostly concentrated around the 20 to 30 value range. The mode is then clearly not representative of the data, and to use the mode to describe the central tendency of this data set would be misleading.

Skewed Distributions and the Mean and Median
We often test whether our data is normally distributed, because this is a common assumption underlying many statistical tests.

When you have a normally distributed sample, you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set in its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode.

However, when our data is skewed - for example, with a right-skewed data set - we find that the mean is dragged in the direction of the skew. In these situations, the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater the emphasis that should be placed on using the median as opposed to the mean. A classic example of a right-skewed distribution is income (salary), where higher earners give a false impression of the typical income if it is expressed as a mean rather than a median. If tests of normality show that the data is non-normal, it is customary to use the median instead of the mean. This is more a rule of thumb than a strict guideline, however. Sometimes researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment) and if it allows easier comparisons with previous research.

Summary of when to use the mean, median and mode
Please use the following summary table to see which measure of central tendency is best with respect to the different types of variable.

Type of variable                  Best measure of central tendency
Nominal                           Mode
Ordinal                           Median
Interval/Ratio (not skewed)       Mean
Interval/Ratio (skewed)           Median
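The whole comparison can be reproduced with Python's standard statistics module, reusing the salary data from the outlier example above:

    import statistics

    salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # in $k, from the table above

    print("mean:  ", statistics.mean(salaries))    # 30.7 -- dragged up by the two outliers
    print("median:", statistics.median(salaries))  # 15.5 -- a better "typical" salary here
    print("mode:  ", statistics.mode(salaries))    # 15   -- the most frequent value

The gap between the mean (30.7) and the median (15.5) is exactly the outlier effect described above.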

Q-5 Define the population and sampling unit for selecting a random sample in each of the following cases:
a) Hundred voters from a constituency
b) Twenty stocks of the National Stock Exchange
c) Fifty account holders of State Bank of India
d) Twenty employees of Tata Motors

In brief: a) population: all voters on the electoral roll of the constituency; sampling unit: an individual voter. b) Population: all stocks listed on the National Stock Exchange; sampling unit: an individual stock. c) Population: all account holders of State Bank of India; sampling unit: an individual account holder. d) Population: all employees of Tata Motors; sampling unit: an individual employee.

Successful statistical practice is based on focused problem definition. In sampling, this includes defining the population from which our sample is drawn. A population can be defined as including all people or items with the characteristic one wishes to understand. Because there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population.

Sometimes that which defines a population is obvious. For example, a manufacturer needs to decide whether a batch of material from production is of high enough quality to be released to the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the batch is the population.

Although the population of interest often consists of physical objects, sometimes we need to sample over time, space, or some combination of these dimensions. For instance, an investigation of supermarket staffing could examine checkout line length at various times, or a study on endangered penguins might aim to understand their usage of various hunting grounds over time. For the time dimension, the focus may be on periods or discrete occasions.

In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel (i.e. the probability distribution of its results over infinitely many trials), while his 'sample' was formed from observed results from that wheel. Similar considerations arise when taking repeated measurements of some physical characteristic such as the electrical conductivity of copper.

This situation often arises when we seek knowledge about the cause system of which the observed population is an outcome. In such cases, sampling theory may treat the observed population as a sample from a larger 'superpopulation'. For example, a researcher might study the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict the effects of the program if it were made available nationwide. Here the superpopulation is "everybody in the country, given access to this treatment" - a group which does not yet exist, since the program isn't yet available to all.

Note also that the population from which the sample is drawn may not be the same as the population about which we actually want information. Often there is large but not complete overlap between these two groups due to frame issues etc. (see below). Sometimes they may be entirely separate - for instance, we might study rats in order to get a better understanding of human health, or we might study records from people born in 2008 in order to make predictions about people born in 2009.

Time spent in making the sampled population and population of concern precise is often well spent, because it raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage.
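As a small illustration of these definitions, the sketch below applies them to case (a) of Q-5, drawing a simple random sample of 100 voters in plain Python (the electoral roll and its size are hypothetical):

    import random

    # Population: all voters on the electoral roll of the constituency.
    # Sampling unit: an individual voter.
    electoral_roll = [f"voter_{i:05d}" for i in range(1, 24001)]  # hypothetical roll

    random.seed(1)  # fixed seed so the example is reproducible
    sample = random.sample(electoral_roll, k=100)  # simple random sample, no repeats
    print(sample[:5])

Here random.sample gives every voter on the roll the same chance of selection, which is exactly what "random sample" means for this population.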

Q-6 What is a confidence interval, and why is it useful? What is a confidence level?

In statistics, a confidence interval (CI) is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest, if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level or confidence coefficient.

A confidence interval with a particular confidence level is intended to give the assurance that, if the statistical model is correct, then taken over all the data that might have been obtained, the procedure for constructing the interval would deliver a confidence interval that included the true value of the parameter the proportion of the time set by the confidence level.

More specifically, the meaning of the term "confidence level" is that, if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will approximately match the confidence level; this is guaranteed by the reasoning underlying the construction of confidence intervals.

A confidence interval does not predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained. (An interval intended to have such a property, called a credible interval, can be estimated using Bayesian methods; but such methods bring with them their own distinct strengths and weaknesses.)

The purpose of confidence intervals is to determine, from recurring samples of data, a range of values within which the specific population parameter is likely to lie, with a specified probability. Here is a "Statistics for Dummies"-style interpretation and example. Let's say that the population parameter of interest is the population average, and that we use an 80% confidence interval. This does not mean there is an 80% possibility that any one interval contains the population average; rather, if data were sampled from the population again and again and an interval constructed each time, then about 80% of these intervals would contain the population average. Another purpose is to indicate how much of the data the provider believes as factual with a high degree of confidence - that is, to show that one is more certain about one part of the data than, perhaps, some of the secondary data gathered. Confidence intervals can be thought of as the expected range of outcomes.

Null Hypothesis and Confidence Intervals


Confidence intervals are used to reject a null hypothesis. If I set the confidence level for my test at 80%, I have a 20% chance of being wrong about the null hypothesis; of course, I can't completely reject the possibility of being wrong. Toss a penny 100 times, and each toss is a 50/50 chance of coming up heads. The actual results may vary by five or so one way or the other, but still lie within the expected range. A confidence interval lets me predict how close to 50/50 the results are going to be, and how often.
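That penny-tossing intuition can be checked by simulation. A rough sketch in plain Python (80% level, normal-approximation interval for a proportion; the z-value 1.2816 and the trial counts are standard choices, not taken from the text):

    import math
    import random

    random.seed(42)
    z80 = 1.2816                    # z-value for an 80% two-sided confidence level
    true_p, n, trials, covered = 0.5, 100, 10000, 0

    for _ in range(trials):
        heads = sum(random.random() < true_p for _ in range(n))  # 100 penny tosses
        p_hat = heads / n
        margin = z80 * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            covered += 1

    print(f"Coverage: {covered / trials:.3f}  (should be close to 0.80)")

Roughly 80% of the simulated intervals cover the true value 0.5, which is exactly what the confidence level promises over repeated experiments.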

Confidence Intervals: The Skinny


Confidence intervals measure the probability of something likely to occur within a population, based on the values or data gathered from repeated testing of that specific population. For example, in a weather prediction, if a certain weather condition presents itself and is known to produce a thunderstorm, then a significant confidence interval would indicate that a storm will occur. Another example: surfing the wave. In Hawaii, surfing is a popular sport. The most important factors for a good day of surf are the size of a swell (wave height), the average sizes of swells throughout a given timeframe, and the consistency (ride length and wave direction) of a swell. How are confidence intervals used in surfing to determine wave height? Confidence intervals measure the probability of how high the waves will be, and in what direction they will travel, in a given timeframe. By testing and gathering the data and values that a range of waves provides, including atmospheric and climate conditions, the values of these wave intervals will average to a wave height, providing a probability that during the same timeframe on the next day, a similar average wave height will occur.


(Figure: a bar chart in which the top ends of the bars indicate observation means and red line segments represent the confidence intervals surrounding them.)

Introduction
Interval estimates can be contrasted with point estimates. A point estimate is a single value given as the estimate of a population parameter that is of interest, for example the mean of some quantity. An interval estimate specifies instead a range within which the parameter is estimated to lie.

Confidence intervals are commonly reported in tables or graphs along with point estimates of the same parameters, to show the reliability of the estimates.

For example, a confidence interval can be used to describe how reliable survey results are. In a poll of election voting intentions, the result might be that 40% of respondents intend to vote for a certain party. A 90% confidence interval for the proportion in the whole population having the same intention on the survey date might be 38% to 42%. From the same data one may calculate a 95% confidence interval, which might in this case be 36% to 44%. A major factor determining the length of a confidence interval is the size of the sample used in the estimation procedure, for example the number of people taking part in a survey.
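Under the usual normal approximation, intervals of this kind can be computed as p̂ ± z·sqrt(p̂(1 − p̂)/n). The sketch below uses an assumed sample size of 1,000 respondents (the text does not state one), so the resulting intervals only roughly match the figures quoted above:

    import math

    p_hat = 0.40   # 40% of respondents intend to vote for the party
    n = 1000       # ASSUMED sample size; not given in the example above

    for level, z in [("90%", 1.645), ("95%", 1.960)]:
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        print(f"{level} CI: {100 * (p_hat - margin):.1f}% to {100 * (p_hat + margin):.1f}%")

With n = 1,000 this gives about 37.5% to 42.5% at the 90% level and 37.0% to 43.0% at the 95% level; a smaller sample widens both intervals.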

Relationship with other statistical topics

Statistical hypothesis testing
Confidence intervals are closely related to statistical significance testing. For example, if for some estimated parameter θ one wants to test the null hypothesis that θ = 0 against the alternative that θ ≠ 0, then this test can be performed by determining whether the confidence interval for θ contains 0. More generally, given the availability of a hypothesis testing procedure that can test the null hypothesis θ = θ0 against the alternative that θ ≠ θ0 for any value of θ0, then a confidence interval with confidence level γ = 1 − α can be defined as containing any number θ0 for which the corresponding null hypothesis is not rejected at significance level α.[1] In consequence, if the estimates of two parameters (for example, the mean values of a variable in two independent groups of objects) have confidence intervals at a given γ value that do not overlap, then the difference between the two values is significant at the corresponding value of α. However, this test is too conservative: if two confidence intervals overlap, the difference between the two means may still be significantly different.[2][3]

Confidence region
Confidence regions generalize the confidence interval concept to deal with multiple quantities. Such regions can indicate not only the extent of likely sampling errors but can also reveal whether (for example) it is the case that if the estimate for one quantity is unreliable, then the other is also likely to be unreliable. See also confidence bands. In applied practice, confidence intervals are typically stated at the 95% confidence level.[4] However, when presented graphically, confidence intervals can be shown at several confidence levels, for example 50%, 95% and 99%.

Statistical theory

Definition
Let X be a random sample from a probability distribution with statistical parameters (θ, φ), where θ is the quantity to be estimated and φ represents quantities not of immediate interest. A confidence interval for the parameter θ, with confidence level or confidence coefficient γ, is an interval with random endpoints (u(X), v(X)), determined by the pair of statistics (i.e., observable random variables) u(X) and v(X), with the property:

Pr(θ,φ)( u(X) < θ < v(X) ) = γ  for every (θ, φ).

The quantities φ in which there is no immediate interest are called nuisance parameters, as statistical theory still needs to find some way to deal with them. The number γ, with typical values close to but not greater than 1, is sometimes given in the form 1 − α (or as a percentage 100%·(1 − α)), where α is a small nonnegative number close to 0. Here Pr(θ,φ) is used to indicate the probability when the random variable X has the distribution characterised by (θ, φ). An important part of this specification is that the random interval (u(X), v(X)) covers the unknown value θ with a high probability no matter what the true value of θ actually is.

Note that here Pr(θ,φ) need not refer to an explicitly given parameterised family of distributions, although it often does. Just as the random variable X notionally corresponds to other possible realizations of x from the same population or from the same version of reality, the parameters (θ, φ) indicate that we need to consider other versions of reality in which the distribution of X might have different characteristics. In a specific situation, when x is the outcome of the sample X, the interval (u(x), v(x)) is also referred to as a confidence interval for θ. Note that it is no longer possible to say that the (observed) interval (u(x), v(x)) has probability γ of containing the parameter θ. This observed interval is just one realization of all possible intervals for which the probability statement holds.

Intervals for random outcomes
Confidence intervals can be defined for random quantities as well as for fixed quantities, as in the above. See prediction interval. For this, consider an additional single-valued random variable Y, which may or may not be statistically dependent on X. Then the rule for constructing the interval (u(x), v(x)) provides a confidence interval for the as-yet-to-be observed value y of Y if

Pr(θ,φ)( u(X) < Y < v(X) ) = γ.

Here Pr(θ,φ) is used to indicate the probability over the joint distribution of the random variables (X, Y) when this is characterised by the parameters (θ, φ).

Approximate confidence intervals
For non-standard applications it is sometimes not possible to find rules for constructing confidence intervals that have exactly the required properties. But practically useful intervals can still be found: the coverage probability c(θ, φ) for a random interval is defined by

c(θ, φ) = Pr(θ,φ)( u(X) < θ < v(X) ),

and the rule for constructing the interval may be accepted as providing a confidence interval at level γ if

c(θ, φ) ≈ γ

to an acceptable level of approximation.

Comparison to Bayesian interval estimates
A Bayesian interval estimate is called a credible interval. Using much of the same notation as above, the definition of a credible interval for the unknown true value of θ is, for a given γ,[5]

Pr( u(x) < Θ < v(x) | X = x ) = γ,

where Θ is used to emphasise that the unknown value of θ is here being treated as a random variable.

Desirable properties of a rule for constructing confidence intervals include the following:

Validity. This means that the nominal coverage probability (confidence level) of the confidence interval should hold, either exactly or to a good approximation.

Optimality. This means that the rule for constructing the confidence interval should make as much use of the information in the data-set as possible. Recall that one could throw away half of a dataset and still be able to derive a valid confidence interval. One way of assessing optimality is by the length of the interval, so that a rule for constructing a confidence interval is judged better than another if it leads to intervals whose lengths are typically shorter.

Invariance. In many applications the quantity being estimated might not be tightly defined as such. For example, a survey might result in an estimate of the median income in a population, but it might equally be considered as providing an estimate of the logarithm of the median income, given that this is a common scale for presenting graphical results. It would be desirable that the method used for constructing a confidence interval for the median income would give equivalent results when applied to constructing a confidence interval for the logarithm of the median income: specifically, the values at the ends of the latter interval would be the logarithms of the values at the ends of the former interval.

The confidence interval is the plus-or-minus figure usually reported in newspaper or television opinion poll results. For example, if you use a confidence interval of ±4 and 47% of your sample picks an answer, you can be "sure" that if you had asked the question of the entire relevant population, between 43% (47 − 4) and 51% (47 + 4) would have picked that answer. The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level. When you put the confidence level and the confidence interval together, you can say that you are 95% sure that the true percentage of the population is between 43% and 51%. The wider the confidence interval you are willing to accept, the more certain you can be that the whole population's answers would be within that range. For example, if you asked a sample of 1,000 people in a city which brand of cola they preferred, and 60% said Brand A, you can be very certain that between 40% and 80% of all the people in the city actually do prefer that brand, but you cannot be so sure that between 59% and 61% of the people in the city prefer the brand.

Factors that Affect Confidence Intervals
There are three factors that determine the size of the confidence interval for a given confidence level. These are: sample size, percentage and population size.

Sample Size
The larger your sample, the more sure you can be that its answers truly reflect the population. This indicates that for a given confidence level, the larger your sample size, the smaller your confidence interval. However, the relationship is not linear (i.e., doubling the sample size does not halve the confidence interval).

Percentage
Your accuracy also depends on the percentage of your sample that picks a particular answer. If 99% of your sample said "Yes" and 1% said "No", the chances of error are remote, irrespective of sample size. However, if the percentages are 51% and 49%, the chances of error are much greater. It is easier to be sure of extreme answers than of middle-of-the-road ones. When determining the sample size needed for a given level of accuracy, you must use the worst-case percentage (50%). You should also use this percentage if you want to determine a general level of accuracy for a sample you already have. To determine the confidence interval for a specific answer your sample has given, you can use the percentage picking that answer and get a smaller interval.

Population Size
How many people are there in the group your sample represents? This may be the number of people in a city you are studying, the number of people who buy new cars, etc. Often you may not know the exact population size. This is not a problem. The mathematics of probability proves that the size of the population is irrelevant unless the size of the sample exceeds a few percent of the total population you are examining. This means that a sample of 500 people is equally useful in examining the opinions of a state of 15,000,000 as it would be for a city of 100,000. For this reason, the sample-size calculator ignores the population size when it is "large" or unknown. Population size is only likely to be a factor when you work with a relatively small and known group of people.

Note: The confidence interval calculations assume you have a genuine random sample of the relevant population. If your sample is not truly random, you cannot rely on the intervals. Non-random samples usually result from some flaw in the sampling procedure. An example of such a flaw is to only call people during the day and miss almost everyone who works during the day.
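The roles of sample size and percentage can be seen directly in the worst-case margin-of-error formula, z · sqrt(p(1 − p)/n). A minimal Python sketch at the 95% level (z = 1.96), assuming the usual normal approximation:

    import math

    def margin_of_error(n, p=0.5, z=1.96):
        # Worst-case (p = 50%) margin of error at the 95% confidence level.
        return z * math.sqrt(p * (1 - p) / n)

    for n in [100, 500, 1000, 2000]:
        print(f"n = {n:5d}: +/- {100 * margin_of_error(n):.1f} percentage points")

    # Doubling n does not halve the interval; it shrinks it by a factor of sqrt(2).
    # Note that the population size does not appear in the formula at all.

For n = 500 this gives roughly ±4.4 percentage points, whether the population is a city of 100,000 or a state of 15,000,000.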