
4 Bayesian Testing
CSI 972 and 973, James E. Gentle

A Paradigm for Statistical Testing of Hypotheses

In statistical hypothesis testing, the basic problem is to decide whether or not to reject a statement about the distribution of a random variable. The statement must be expressible in terms of membership in a well-defined class. The hypothesis can therefore be expressed by the statement that the distribution of the random variable X is in the class PH = {Pθ : θ ∈ ΘH}. An hypothesis of this form is called a statistical hypothesis.

This kind of statement is usually broken into two pieces, one part an assumption, "assume the distribution of X is in the class P = {Pθ : θ ∈ Θ}", and the other part the hypothesis, "θ ∈ ΘH", where ΘH ⊆ Θ. Given the assumptions, and the definition of ΘH, we often denote the hypothesis as H, and write it as

    H : θ ∈ ΘH.

While, in general, to reject the hypothesis H would mean to decide that θ ∉ ΘH, it is generally more convenient to formulate the testing problem as one of deciding between two statements:

    H0 : θ ∈ Θ0   and   H1 : θ ∈ Θ1,

where Θ0 ∩ Θ1 = ∅. An hypothesis H0 is testable versus an alternative H1 if and only if θ0 ∈ Θ0 and θ1 ∈ Θ1 implies Pθ0 ≠ Pθ1. For convenience, we will often use the phrase "H0 is true" or "H1 is true" with the obvious meaning about the distribution of the random variable.

We do not treat H0 and H1 symmetrically; H0 is the hypothesis to be tested and H1 is the alternative. This distinction is important in developing a methodology of testing. We sometimes also refer to H0 as the null hypothesis and to H1 as the alternative hypothesis. An hypothesis H : θ ∈ ΘH in which #ΘH = 1 is called a simple hypothesis; if #ΘH > 1 it is called a composite hypothesis. Of course we are often interested in the case where Θ = Θ0 ∪ Θ1.

The Test

For a statistical hypothesis that involves the distribution of the random variable X, a nonrandomized test procedure is a rule δ(X) that assigns two decisions to two disjoint subsets, C0 and C1, of the range of X. In general, we require that C0 ∪ C1 be the support of X.
We equate those two decisions with the real numbers d0 and d1, so δ(X) is a real-valued function,

    δ(x) = d0 for x ∈ C0,
           d1 for x ∈ C1.

For simplicity, we choose d0 = 0 and d1 = 1. Note that for i = 0, 1, Pr(δ(X) = i) = Pr(X ∈ Ci). We call C1 the critical region, and generally denote it by just C.

4.1

The Bayesian Approach to Testing

In the Bayesian framework, we are interested in the probability that H0 is true. The prior distribution provides an a priori probability, and the posterior distribution based on the data provides a posterior probability that H0 is true. Clearly, we would choose to reject H0 when the probability that it is true is small.

A First, Simple Example

Suppose we wish to test H0 : P = P0 versus H1 : P = P1, and suppose that known probabilities p0 and p1 = 1 − p0 can be assigned to H0 and H1 prior to the experiment. The overall probability of an error resulting from the use of the test δ(X) is

    p0 E0(δ(X)) + p1 E1(1 − δ(X)).

The Bayes test that minimizes this probability is given by

    δ(x) = 1 when p1(x) > k p0(x),
           0 when p1(x) < k p0(x),

for k = p0/p1, where p0(x) and p1(x) denote the densities under P0 and P1. The conditional probability of Hi given X = x, that is, the posterior probability of Hi, is

    pi pi(x) / (p0 p0(x) + p1 p1(x)),

and the Bayes test therefore decides in favor of the hypothesis with the larger posterior probability.

Testing as an Estimation Problem

In the general setup above, we can define the indicator function IΘ0(θ). The testing problem, as we have described it, is the problem of estimating IΘ0(θ). Let us use a statistic T(X) as an estimator of IΘ0(θ). The estimand is in {0, 1}, and so T(X) should be in {0, 1}, or at least in [0, 1].

Notice the relationship of T(X) to δ(X). For the estimation approach using T(X) to be equivalent to use of the test rule δ(X), it must be the case that

    T(X) = 1 ⟺ δ(X) = 0 (i.e., do not reject)

and

    T(X) = 0 ⟺ δ(X) = 1 (i.e., reject).

Following a decision-theoretic approach to the estimation problem, we define a loss function. In the classical framework of Neyman and Pearson, the loss function is 0-1. Under this loss, using T(X) = t as the rule for the test, we have

    L(θ, t) = 0 if t = IΘ0(θ),
              1 otherwise.

The Bayes estimator of IΘ0(θ) is the function that minimizes the posterior risk, Eθ|x(L(θ, t)). The posterior risk is just a posterior probability, so the Bayesian solution under this loss is

    T(x) = 1 if Pr(θ ∈ Θ0 | x) > Pr(θ ∉ Θ0 | x) (i.e., do not reject),
           0 otherwise,

where Pr(·) is evaluated with respect to the posterior distribution Pθ|x.
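As a concrete illustration, the simple-versus-simple Bayes test above can be sketched in a few lines. The two unit-variance normal models and the prior probability below are hypothetical choices, not from the text:

```python
import math

# Sketch of the Bayes test for H0: X ~ N(0,1) versus H1: X ~ N(1,1),
# with prior probabilities p0 and p1 = 1 - p0 (hypothetical example).

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_test(x, p0=0.5):
    """Return (delta, posterior Pr(H0 | x)); delta = 1 means reject H0.
    Rejects when p1 * p1(x) > p0 * p0(x), i.e., p1(x) > k p0(x), k = p0/p1."""
    p1 = 1.0 - p0
    f0 = normal_pdf(x, 0.0)  # density under H0
    f1 = normal_pdf(x, 1.0)  # density under H1
    post0 = p0 * f0 / (p0 * f0 + p1 * f1)  # posterior probability of H0
    return (1 if p1 * f1 > p0 * f0 else 0), post0
```

With p0 = p1 = 1/2 the test rejects exactly when x exceeds 1/2, the midpoint of the two means, which is also where the posterior probabilities of the two hypotheses are equal.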

The 0-1-c Loss Function

In a Bayesian approach to hypothesis testing using the test δ(X) ∈ {0, 1}, we often formulate a loss function of the form

    L(θ, d) = cd for θ ∈ Θ0,
              bd for θ ∈ Θ1,

where c1 > c0 and b0 > b1. A common loss function has c0 = b1 = 0, b0 = 1, and c1 = c > 0. This is called a 0-1-c loss function.

A Bayesian solution to hypothesis testing with a 0-1-c loss function is fairly easy to determine. The posterior risk for choosing δ(X) = 1, that is, for rejecting the hypothesis, is c Pr(θ ∈ Θ0 | X = x), and the posterior risk for choosing δ(X) = 0 is Pr(θ ∈ Θ1 | X = x); hence the optimal decision is to choose δ(X) = 1 if

    c Pr(θ ∈ Θ0 | X = x) < Pr(θ ∈ Θ1 | X = x),

which is the same as

    Pr(θ ∈ Θ0 | X = x) < 1/(1 + c).

In other words, the Bayesian approach says to reject the hypothesis if its posterior probability is small. The Bayesian approach has a simpler interpretation than the frequentist approach. It also makes more sense for other loss functions.

The Weighted 0-1 or a0-a1 Loss Function

Another approach, which accounts for all possibilities and penalizes errors differently according to whether the null hypothesis is true or false, is to define a weighted 0-1 loss function. Using the estimator T(X) = t ∈ {0, 1}, as above, we define

    L(θ, t) = 0  if t = IΘ0(θ),
              a0 if t = 0 and θ ∈ Θ0,
              a1 if t = 1 and θ ∉ Θ0.

This is sometimes called an a0-a1 loss. The 0-1-c loss and the a0-a1 loss could each be defined either in terms of the test rule δ or the estimator T; I chose to do one one way and the other the other way just for illustration. The Bayes estimator of IΘ0(θ) using this loss is

    T(x) = 1 if Pr(θ ∈ Θ0 | x) > a1/(a0 + a1),
           0 otherwise,

where again Pr(·) is evaluated with respect to the posterior distribution. To see that this is the case, we write the posterior loss

    ∫ L(θ, t) dPθ|x = a0 Pr(θ ∈ Θ0 | x) I{0}(t) + a1 Pr(θ ∉ Θ0 | x) I{1}(t),

and then minimize it. Under an a0-a1 loss, the null hypothesis H0 is rejected whenever the posterior probability of H0 is too small. The acceptance level, a1/(a0 + a1), is determined by the specific values chosen in the loss function. The Bayes test, which is the Bayes estimator of IΘ0(θ), depends only on a0/a1. The larger a0/a1 is, the smaller the posterior probability of H0 that allows for it to be accepted. This is consistent with the interpretation that the larger a0/a1 is, the more important a wrong decision under H0 is relative to H1.

Examples

Let us consider two familiar easy pieces using an a0-a1 loss.

First, let X|θ ~ binomial(n, θ) and assume a prior on θ of U(0, 1) (a special case of the conjugate beta prior). Suppose Θ0 = [0, 1/2]. The posterior probability that H0 is true is

    Pr(θ ≤ 1/2 | x) = (n + 1)!/(x!(n − x)!) ∫ from 0 to 1/2 of θ^x (1 − θ)^(n−x) dθ.

This is computed and then compared to the acceptance level. (Note that the integral is a sum of fractions.)

For another familiar example, consider X|θ ~ N(θ, σ²), with σ² known, and θ ~ N(μ, τ²). We recall that θ|x ~ N(μ(x), ω²), where

    μ(x) = (σ²μ + τ²x)/(σ² + τ²)   and   ω² = σ²τ²/(σ² + τ²).

To test H0, we compute the posterior probability of H0. Suppose the null hypothesis is H0 : θ < 0. Then

    Pr(H0 | x) = Pr((θ − μ(x))/ω < −μ(x)/ω) = Φ(−μ(x)/ω).

The decision depends on the a1/(a0 + a1) quantile of N(0, 1). Let za0,a1 be this quantile; that is, Φ(za0,a1) = a1/(a0 + a1). Then H0 is accepted if −μ(x)/ω > za0,a1. Rewriting this, we see that the null hypothesis is rejected if x is greater than

    −(σ²/τ²) μ − σ √(1 + σ²/τ²) za0,a1.

Notice a very interesting aspect of these tests. There is no predetermined acceptance level. The decision is based simply on the posterior probability that the null hypothesis is true. A difficulty of the a0-a1 loss function, of course, is the choice of a0 and a1. Ideally, we would like to choose these based on some kind of utility considerations, but sometimes this takes considerable thought.
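The two examples can be sketched numerically. The data values below are hypothetical; the binomial computation uses the fact that, with a uniform prior, the posterior is Beta(x+1, n−x+1), whose CDF at 1/2 reduces to exactly the "sum of fractions" noted in the first example:

```python
import math

def binomial_posterior_null(n, x):
    """Pr(theta <= 1/2 | x) for X|theta ~ binomial(n, theta), theta ~ U(0,1).
    The posterior is Beta(x+1, n-x+1); for integer parameters its CDF at 1/2
    equals a binomial tail probability, hence a sum of fractions."""
    return sum(math.comb(n + 1, k) for k in range(x + 1, n + 2)) / 2 ** (n + 1)

def normal_posterior_null(x, sigma2, mu, tau2):
    """Pr(theta < 0 | x) = Phi(-mu(x)/omega) for X|theta ~ N(theta, sigma2)
    and theta ~ N(mu, tau2)."""
    mu_x = (sigma2 * mu + tau2 * x) / (sigma2 + tau2)    # posterior mean
    omega = math.sqrt(sigma2 * tau2 / (sigma2 + tau2))   # posterior std. dev.
    return 0.5 * (1.0 + math.erf(-mu_x / omega / math.sqrt(2.0)))

# Each posterior probability is then compared with the acceptance level
# a1/(a0 + a1).
```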

4.2

The Bayes Factor

Given a prior distribution Pθ, let p0 be the prior probability that H0 is true, and p1 be the prior probability that H1 is true. The prior odds then are p0/p1. Similarly, let p̂0 be the posterior probability that H0 is true given x, and p̂1 be the posterior probability that H1 is true, yielding the posterior odds p̂0/p̂1. The posterior odds can be related to the prior odds:

    p̂0/p̂1 = (p0/p1) · pX|θ(x|θ0) / ∫ pX|θ(x|θ) dPθ.

The term

    BF(x) = pX|θ(x|θ0) / ∫ pX|θ(x|θ) dPθ

is called the Bayes factor. The Bayes factor clearly depends on the prior pθ(·). Rather than computing the posterior odds directly, we emphasize the Bayes factor, which for any stated prior odds yields the posterior odds. The Bayes factor is the posterior odds in favor of the hypothesis if p0 = 0.5. Note that, for a simple hypothesis versus a simple alternative, the Bayes factor simplifies to the likelihood ratio:

    pX|θ(x|θ0) / pX|θ(x|θ1).

One way of looking at this likelihood ratio is to use the MLEs under the two hypotheses:

    sup over Θ0 of pX|θ(x|θ) / sup over Θ1 of pX|θ(x|θ).

This approach, however, assigns Dirac masses at the MLEs, θ̂0 and θ̂1. The Bayes factor is more properly viewed as a Bayesian likelihood ratio,

    BF(x) = ∫ over Θ0 of p0(θ0) pX|θ(x|θ0) dθ0 / ∫ over Θ1 of p1(θ1) pX|θ(x|θ1) dθ1,

where p0 and p1 here denote the prior densities over Θ0 and Θ1, and, from a decision-theoretic point of view, it is entirely equivalent to the posterior probability of the null hypothesis. Under the a0-a1 loss function, H0 is accepted when

    BF(x) > (a1/a0)(p1/p0).

From this, we see that the Bayesian approach effectively gives an equal prior weight to the two hypotheses, p0 = p1 = 1/2, and then modifies the error penalties as ãi = ai pi, for i = 0, 1, or, alternatively, incorporates the weighted error penalties directly into the prior probabilities:

    p̃0 = a0 p0/(a0 p0 + a1 p1)   and   p̃1 = a1 p1/(a0 p0 + a1 p1).
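For a simple null versus a simple alternative, the relations among the Bayes factor, the prior odds, and the posterior odds can be sketched directly; the two unit-variance normal models below are hypothetical:

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor(x):
    """BF(x) = p(x | H0) / p(x | H1): here the likelihood ratio for
    H0: X ~ N(0,1) versus H1: X ~ N(1,1)."""
    return normal_pdf(x, 0.0) / normal_pdf(x, 1.0)

def posterior_prob_h0(x, p0=0.5):
    """Posterior odds = prior odds * BF(x); convert the odds to a probability."""
    odds = (p0 / (1.0 - p0)) * bayes_factor(x)
    return odds / (1.0 + odds)

# Under the a0-a1 loss, H0 is accepted when BF(x) > (a1/a0) * (p1/p0).
```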

Jeffreys (1961) suggested a subjective scale to judge the evidence of the data in favor of or against H0. He suggested: if 0 < log10(BF) < 0.5, the evidence against H0 is poor; if 0.5 ≤ log10(BF) < 1, the evidence against H0 is substantial; if 1 ≤ log10(BF) < 2, the evidence against H0 is strong; and if 2 ≤ log10(BF), the evidence against H0 is decisive. While this scale makes some sense, the separations are somewhat arbitrary, and the approach is not based on a decision-theory foundation. Given such a foundation, however, we still have the subjectivity inherent in the choice of a0 and a1, or in the choice of a significance level. Kass and Raftery (1995) discussed Jeffreys's scale and other issues relating to the Bayes factor.

Kass and Raftery (1995) also gave an interesting example illustrating the Bayesian approach to testing of the "hot hand" hypothesis in basketball. They formulate the null hypothesis (that players do not have a hot hand) as the statement that the distribution of good shots by a given player, Yi, out of ni shots taken in game i is binomial(ni, θ), for games i = 1, . . . , g; that is, for a given player, the probability of making a shot is constant in all games (within some reasonable period). A general alternative is H1 : Yi ~ binomial(ni, θi). We choose a flat U(0, 1) conjugate prior for the H0 model. For the H1 model, we choose a conjugate prior beta(α, β) with α = ξ/ω and β = (1 − ξ)/ω. Under this prior, θi has prior expectation E(θi | ξ, ω) = ξ, and ξ is taken to be distributed as U(0, 1) for fixed ω. The Bayes factor is very complicated, involving integrals that cannot be solved in closed form. Kass and Raftery use this example to motivate and to compare various methods of evaluating the integrals that occur in Bayesian analysis. One simple method is Monte Carlo.
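As an illustration of the Monte Carlo approach, the marginal likelihood under a constant-θ binomial model can be estimated by averaging the likelihood over draws from the prior. The data below are hypothetical, not the Kass and Raftery shooting data:

```python
import math
import random

def binom_pmf(n, y, theta):
    """Binomial(n, theta) probability of y successes."""
    return math.comb(n, y) * theta**y * (1.0 - theta)**(n - y)

def marginal_likelihood_h0(ys, ns, n_draws=200_000, seed=12345):
    """Monte Carlo estimate of m0: the integral over theta ~ U(0,1) of
    the product over games i of binom(n_i, y_i, theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        theta = rng.random()  # draw from the U(0,1) prior
        lik = 1.0
        for y, n in zip(ys, ns):
            lik *= binom_pmf(n, y, theta)
        total += lik
    return total / n_draws
```

A sanity check: for a single game, the exact marginal likelihood under the uniform prior is 1/(n + 1), and the Monte Carlo estimate should be close to it. A Bayes factor is then a ratio of two such marginal likelihoods, one per model.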
Often, however, the Bayes factor can be evaluated relatively easily for a given prior, and then it can be used to investigate the sensitivity of the results to the choice of the prior, by computing it for another prior.

From Jeffreys's Bayesian viewpoint, the purpose of hypothesis testing is to evaluate the evidence in favor of a particular scientific theory. Kass and Raftery make the following points on the use of the Bayes factor in the hypothesis testing problem:

- Bayes factors offer a straightforward way of evaluating evidence in favor of a null hypothesis.
- Bayes factors provide a way of incorporating external information into the evaluation of evidence about a hypothesis.
- Bayes factors are very general and do not require alternative models to be nested.
- Several techniques are available for computing Bayes factors, including asymptotic approximations that are easy to compute using the output from standard packages that maximize likelihoods.
- In nonstandard statistical models that do not satisfy common regularity conditions, it can be technically simpler to calculate Bayes factors than to derive non-Bayesian significance tests.

- The Schwarz criterion (or BIC) gives a rough approximation to the logarithm of the Bayes factor, which is easy to use and does not require evaluation of prior distributions. The BIC is

      BIC = −2 log(L(θm | x)) + k log n,

  where θm is the value of the parameters that specify a given model, k is the number of unknown or free elements in θm, and n is the sample size. Writing BIC0 and BIC1 for the criteria under the two hypotheses, the relationship to the Bayes factor is

      (−(BIC0 − BIC1)/2 − log(BF)) / log(BF) → 0   as n → ∞.

- When one is interested in estimation or prediction, Bayes factors may be converted to weights to be attached to various models so that a composite estimate or prediction may be obtained that takes account of structural or model uncertainty.
- Algorithms have been proposed that allow model uncertainty to be taken into account when the class of models initially considered is very large.
- Bayes factors are useful for guiding an evolutionary model-building process.
- It is important, and feasible, to assess the sensitivity of conclusions to the prior distributions used.

The Bayes Risk Set

A risk set can be useful in analyzing Bayesian procedures when the parameter space is finite. If Θ = {θ1, . . . , θk}, the risk set for a procedure T is a set in IR^k:

    {(z1, . . . , zk) : zi = R(θi, T)}.

In the case of 0-1 loss, the risk set is a subset of the unit hypercube; specifically, for Θ = {0, 1}, it is a subset of the unit square: [0, 1] × [0, 1].
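The BIC approximation to the log Bayes factor can be sketched for a pair of nested normal-mean models; the models and data here are hypothetical, chosen only to show the mechanics:

```python
import math

def normal_loglik(xs, mean):
    """Log-likelihood of N(mean, 1) for the sample xs."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2
               for x in xs)

def bic(loglik, k, n):
    """BIC = -2 log L(theta_hat | x) + k log n."""
    return -2.0 * loglik + k * math.log(n)

def approx_log_bf(xs):
    """log BF of M0 (mean fixed at 0) versus M1 (mean estimated by the MLE),
    approximated by -(BIC0 - BIC1)/2."""
    n = len(xs)
    xbar = sum(xs) / n  # MLE of the mean under M1
    bic0 = bic(normal_loglik(xs, 0.0), 0, n)
    bic1 = bic(normal_loglik(xs, xbar), 1, n)
    return -(bic0 - bic1) / 2.0
```

Data concentrated near 0 give a positive approximate log Bayes factor (favoring the restricted model M0); data far from 0 give a negative one.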

4.3

Bayesian Tests of a Simple Hypothesis

Although the test of a simple hypothesis versus a simple alternative, as in the first example in this section, is easy to understand and helps to direct our thinking about the testing problem, it is somewhat limited in application. In a more common application, we may have a dense parameter space Θ, and hypotheses that specify different subsets of Θ. A common situation is the one-sided test for H0 : θ ≤ θ0 versus H1 : θ > θ0. We can usually develop meaningful approaches to this problem, perhaps based on some boundary point of H0. A two-sided test, in which, for example, the alternative specifies

    Θl ∪ Θu, where Θl = {θ : θ < θ0} and Θu = {θ : θ > θ0},

presents more problems for the development of reasonable procedures.

In a Bayesian approach, when the parameter space is dense, but either hypothesis is simple, there is a particularly troubling situation. This is because of the Bayesian interpretation of the problem as one in which a probability is to be associated with a statement about a specific value of a continuous random variable. Consider the problem, in a Bayesian approach, of dealing with an hypothesis of the form H0 : θ = θ0, that is, Θ0 = {θ0}, versus the alternative H1 : θ ≠ θ0. A reasonable prior for θ with a continuous support would assign a probability of 0 to θ = θ0.

One way of getting around this problem may be to modify the hypothesis slightly so that the null is a small interval around θ0. This may make sense, but it is not clear how to proceed. Another approach is, as above, to assign a positive probability, say p0, to the event θ = θ0. Although it may not appear how to choose p0, just as it would not be clear how to choose an interval around θ0, we can

at least proceed to simplify the problem following this approach. We can write the joint density of X and θ as

    pX,θ(x, θ) = p0 pX|θ(x|θ0)       if θ = θ0,
                 (1 − p0) pX|θ(x|θ)  if θ ≠ θ0.

There are a couple of ways of simplifying. Let us proceed by denoting the prior density of θ over Θ \ {θ0} as λ. We can write the marginal of the data (the observable X) as

    pX(x) = p0 pX|θ(x|θ0) + (1 − p0) ∫ pX|θ(x|θ) dλ(θ).

We can then write the posterior density of θ as

    pθ|x(θ|x) = p1                                 if θ = θ0,
                (1 − p0) λ(θ) pX|θ(x|θ) / pX(x)    if θ ≠ θ0,

where

    p1 = p0 pX|θ(x|θ0) / pX(x).

This is the posterior probability of the event θ = θ0.
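For normal data the quantities above are available in closed form, so the posterior probability p1 of the point null can be sketched directly. The prior mass p0 and the choice of a N(θ0, τ²) density for λ are hypothetical:

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_prob_null(x, p0=0.5, sigma2=1.0, tau2=1.0, theta0=0.0):
    """p1 = p0 * p(x | theta0) / pX(x) for X|theta ~ N(theta, sigma2),
    with prior mass p0 at theta0 and theta ~ N(theta0, tau2) elsewhere."""
    m0 = normal_pdf(x, theta0, sigma2)         # p(x | theta = theta0)
    m1 = normal_pdf(x, theta0, sigma2 + tau2)  # integral of p(x|theta) dlambda
    px = p0 * m0 + (1.0 - p0) * m1             # marginal density of the data
    return p0 * m0 / px
```

Observations near θ0 push the posterior mass toward the point null; observations far from θ0 push it toward the alternative.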

4.4

Interpretations of Probability Statements in Statistical Inference

In developing methods for hypothesis testing and for setting confidence regions, we can assume a model Pθ for the state of nature and develop procedures by consideration of probabilities of the form Pr(T(X) ◦ S(θ) | θ), where T(X) is a statistic, S(θ) is some region determined by the true (unknown) value of θ, and ◦ is some relationship. The forms of T(X) and S(θ) vary depending on the statistical procedure. The procedure may be a test, in which case we may have T(X) = 1 or 0, according to whether the hypothesis is rejected or not, or it may be a procedure to define a confidence region, in which case T(X) is a set. For example, if θ is given to be in ΘH, and the procedure T(X) is an α-level test of H, then Pr(T(X) = 1 | θ ∈ ΘH) ≤ α. In a procedure to define a confidence set, we may be able to say Pr(T(X) ∋ θ) = 1 − α.

These kinds of probability statements are somewhat awkward, and a person without training in statistics may find them particularly difficult to interpret. Instead of a statement of the form Pr(T(X) ◦ S(θ) | θ), many people would prefer a statement of the form Pr(θ ∈ ΘH | X = x). In order to make such a statement, however, we first must think of the parameter as a random variable and then we must formulate a conditional distribution for θ, given X = x. The usual way is to use a model that has several components: a marginal (prior) probability distribution for the unobservable random variable θ; a conditional probability distribution for the observable random variable X, given θ; and other assumptions about the distributions. We denote the prior density of θ as pθ, and the conditional density of X as pX|θ. The procedure is to determine the conditional (posterior) distribution of θ, given X = x.

If M is the model or hypothesis and D is the data, the difference is between Pr(D|M) (a frequentist interpretation) and Pr(M|D) (a Bayesian interpretation).
People who support the latter interpretation will sometimes refer to the "prosecutor's fallacy", in which Pr(E|H) is confused with Pr(H|E), where E is some evidence and H is some hypothesis. Some Bayesians do not think hypothesis testing is an appropriate statistical procedure. Because of the widespread role of hypothesis testing in science and in regulatory activities, however, statistical testing procedures must be made available.

4.5

Least Favorable Prior Distributions

In testing composite hypotheses, we often ask what the worst case within the hypothesis is. In a sense, this is an attempt to reduce the composite hypothesis to a simple hypothesis. This is the idea behind a p-value. In a Bayesian testing problem, this corresponds to a bound on the posterior probability. Again, consider the problem of testing H0 : θ = θ0 versus the alternative H1 : θ ≠ θ0.

4.6

Lindley's Paradox

Consider a null hypothesis H0, the result of an experiment x, and a prior distribution that favors H0 weakly. Lindley's paradox occurs when the result x is significant by a frequentist test, indicating sufficient evidence to reject H0 at a given level, while the posterior probability of H0 given x is high, indicating strong evidence that H0 is in fact true. The two can happen at the same time when the prior distribution is the sum of a sharp peak at H0 with probability p and a broad distribution with the rest of the probability, 1 − p. It is a result of the prior having a sharp feature at H0 and no sharp features anywhere else. Consider the problem of testing **********************************************************
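A numeric sketch of the paradox, with hypothetical numbers: take x̄ to be the mean of n unit-variance observations, give H0 : θ = 0 prior mass 1/2, and spread the rest of the probability as N(0, 1). With n large and the z-statistic just past the 5% critical value, the frequentist test rejects while the posterior probability of H0 is close to 1:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def p_value(xbar, n):
    """Two-sided frequentist p-value for H0: theta = 0, known unit variance."""
    return 2.0 * (1.0 - phi(abs(xbar) * math.sqrt(n)))

def posterior_null(xbar, n, p0=0.5, tau2=1.0):
    """Posterior probability of the point null under the spike-and-slab prior."""
    m0 = normal_pdf(xbar, 0.0, 1.0 / n)         # marginal under H0
    m1 = normal_pdf(xbar, 0.0, tau2 + 1.0 / n)  # marginal under H1
    return p0 * m0 / (p0 * m0 + (1.0 - p0) * m1)

n = 100_000
xbar = 2.0 / math.sqrt(n)  # z-statistic exactly 2, significant at the 5% level
```

Here p_value(xbar, n) ≈ 0.046, so a 5%-level test rejects, while posterior_null(xbar, n) ≈ 0.98: the sharp prior mass at θ = 0 dominates as n grows.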

Notes
Berger (1985) and Robert (2001) provide extensive coverage of statistical inference from a Bayesian perspective. Both of these books compare the frequentist and Bayesian approaches and argue that the Bayesian paradigm is more solidly grounded. Many of the ideas in the Bayesian approach derive from Jeffreys's (1961) book on probability, which emphasized a subjective view.

***** stuff to add:
improper priors
pseudo-Bayes factors
training sample
arithmetic intrinsic Bayes factor
geometric intrinsic Bayes factor
median intrinsic Bayes factor
