
Alphabet Soup

Blurring the Distinctions Between ps and αs in Psychological Research

Raymond Hubbard
Drake University
Abstract. Confusion over the reporting and interpretation of results of commonly employed classical statistical tests is recorded in a sample of 1,645 papers from 12 psychology journals for the period 1990 through 2002. The confusion arises because researchers mistakenly believe that their interpretation is guided by a single unified theory of statistical inference. But this is not so: classical statistical testing is a nameless amalgamation of the rival and often contradictory approaches developed by Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other. In particular, there is extensive failure to acknowledge the incompatibility of Fisher's evidential p value with the Type I error rate, α, of Neyman–Pearson statistical orthodoxy. The distinction between evidence (ps) and errors (αs) is not trivial. Rather, it reveals the basic differences underlying Fisher's ideas on significance testing and inductive inference, and Neyman–Pearson views on hypothesis testing and inductive behavior. So complete is this misunderstanding over measures of evidence versus error that it is not viewed as even being a problem among the vast majority of researchers and other relevant parties. These include the APA Task Force on Statistical Inference, and those writing the guidelines concerning statistical testing mandated in APA Publication Manuals. The result is that, despite supplanting Fisher's significance-testing paradigm some fifty years or so ago, recognizable applications of Neyman–Pearson theory are few and far between in psychology's empirical literature. On the other hand, Fisher's influence is ubiquitous.

Key Words: Fisher, hybrid statistical model, inductive behavior, inductive inference, Neyman–Pearson

It is my personal belief that an objective look at the record will show that Fisher contributed a large number of statistical methods, but that Neyman contributed the basis of statistical thinking. (Lucien LeCam, quoted in Reid, 1982, p. 268)

Extensive confusion prevails among psychologists concerning the reporting and interpretation of results of classical statistical tests. The reason for this
confusion is that textbooks on statistical methods in psychology and the social sciences usually present the subject matter as a single, comprehensive, uncontroversial theory of statistical inference. These texts rarely allude to the fact that classical statistical inference as it is generally portrayed is in fact an anonymous hybrid consisting of the union of the ideas developed by Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other (Gigerenzer, 1993; Gigerenzer & Murray, 1987; Gigerenzer et al., 1989; Huberty, 1993; Huberty & Pike, 1999). It is a union that neither side would have agreed to, given the pronounced philosophical and methodological differences between them. In fact, a bitter debate raged over the years between the Fisherian and Neyman–Pearson camps. The seminal work of Gigerenzer and his colleagues (Gigerenzer, 1993; Gigerenzer & Murray, 1987; Gigerenzer et al., 1989) and Huberty (1993; Huberty & Pike, 1999) notwithstanding, most researchers in psychology and elsewhere remain uninformed about the historical development of methods of statistical inference, and of the mixing of Fisherian and Neyman–Pearson concepts. In particular, there is widespread failure to acknowledge the incompatibility of Fisher's evidential p value (actually, Karl Pearson [1900] introduced the modern p value, but Fisher popularized it) with Neyman–Pearson's Type I error rate, α (Goodman, 1993). The difference between evidence (ps) and errors (αs) is not some semantic splitting of hairs. Rather, it points to the basic distinctions between Fisher's notions of significance testing and inductive inference, versus Neyman–Pearson's ideas on hypothesis testing and inductive behavior. But since statistics textbooks often surreptitiously blend concepts from both sides, misunderstandings concerning the reporting and interpretation of statistical tests are virtually guaranteed. Adding insult to injury, the confusion over measures of evidence versus errors is so completely ingrained that it is not even seen as being a problem among the rank and file of researchers. As proof of this, even critics of statistical testing often fail to distinguish between ps and αs, and the repercussions this has on the meaning of empirical results. Thus, many of these critics (e.g. Carver, 1978; Kirk, 1996; Krueger, 2001; Rozeboom, 1960; Wilkinson & the APA Task Force on Statistical Inference, 1999) unwittingly adopt a Fisherian stance inasmuch as they talk almost exclusively in terms of p, as opposed to α, values. Still other critics or discussants of statistical testing (e.g. American Psychological Association, 1994, 2001; Cohen, 1990, 1994; Dar, Serlin, & Omer, 1994; Falk & Greenbaum, 1995; Loftus, 1996; Mulaik, Raju, & Harshman, 1997; Nickerson, 2000; Rosnow & Rosenthal, 1989; Schmidt, 1996) are inclined, erroneously, to use ps and αs interchangeably. Additional examples of recent studies critiquing the merits of statistical testing that nonetheless continue to offer incorrect advice regarding the interpretation of ps and αs are readily adduced from the literature (e.g. Chow, 1996, 1998; Clark, 1999; Daniel, 1998; Dixon & O'Reilly, 1999; Finch, Cumming, & Thomason,
2001; Grayson, Pattison, & Robins, 1997; Hyde, 2001; Macdonald, 1997; Nix & Barnette, 1998). The varying levels of confusion exhibited in so many articles dealing with the meaning and interpretation of classical statistical tests point to the need to become familiar with their historical development. Krantz (1999) would surely agree with this assessment. In light of the above concerns, the present paper addresses how the confusion between ps and αs came about. I do this by first reporting on the major differences in the structure of the Fisherian and Neyman–Pearson schools of thought. In doing so, I typically let the protagonists speak for themselves. This is an absolute necessity given that their own names are conspicuously absent from the textbooks used to help teach psychologists about statistical methods. Because textbook authors almost uniformly do not cite and discuss Fisher's and Neyman–Pearson's respective contributions to the statistics literature, it is hardly surprising to learn that present researchers are not familiar with them. Second, I show how the competing ideas from the two camps have been inadvertently merged. The upshot is that although Neyman–Pearson theory claimed the mantle of statistical orthodoxy some fifty or so years ago (Hogben, 1957; LeCam & Lehmann, 1974; Nester, 1996; Royall, 1997; Spielman, 1974), it is Fisher's influence which dominates statistical testing procedures in psychology today. Third, empirical evidence is gathered from a random sample of articles in 12 psychology journals for the period 1990–2002 detailing the widespread confusion among researchers caused by the mixing of Fisherian and Neyman–Pearson perspectives. This evidence is manifested in how researchers, through their misunderstandings of the differences between ps and αs, almost universally misreport and misinterpret the outcomes of statistical tests. They can scarcely help it, for such misreporting and misinterpretation is virtually sanctioned in the advice found in APA Publication Manuals (1994, 2001). The end result is that applications of classical statistical testing in psychology are largely meaningless. And this signals the need for changes in the way in which it is taught in the classroom. More specifically, hopes for eliminating (or at least drastically reducing) the mass confusion over the meanings of ps and αs must rest on acquainting students with the fundamentals of the historical development of Fisherian and Neyman–Pearson statistical testing. The present paper attempts to do this.

Comparing and Contrasting the Fisherian and Neyman–Pearson Paradigms

Fisher's Paradigm of Significance Testing

Fisher's ideas on significance testing, popularized in the many editions of his widely influential books Statistical Methods for Research Workers (1925)
and The Design of Experiments (1935a), were enthusiastically received by practitioners. At the heart of his conception of inductive inference is what he termed the null hypothesis, H0. Although briefly dabbling with Bayesian approaches (Zabell, 1992), Fisher quickly renounced the methods of inverse probability, or the probability of a hypothesis (H) given the data (x), Pr(H | x), instead championing the direct probability, Pr(x | H). In particular, Fisher used disparities in the data to reject the null hypothesis, that is, the probability of the data conditional on a true null hypothesis, or Pr(x | H0). Consequently, a significance test is a means of determining the probability of a result, in addition to more extreme ones, on a null hypothesis of no effect or relationship. In Fisher's model the researcher proposes a null hypothesis that a sample comes from a hypothetical infinite population with a known sampling distribution. As Gigerenzer and Murray (1987) comment, the null hypothesis 'is rejected if our sample statistic deviates from the mean of the sampling distribution by more than a criterion, which corresponds to alpha, the level of significance' (p. 10).1 In other words, the p value from a significance test is regarded as a measure of the implausibility of the actual observations (as well as more extreme and unobserved ones) obtained in an experiment or other study, assuming a true null hypothesis. The rationale for the significance test is that if the data are seen as being rare or highly discrepant under H0, this constitutes inductive evidence against H0. Fisher (1966) noted that 'It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard' (p. 13). Thus, Fisher's significance testing revolves around the rejection of the null hypothesis at the p ≤ .05 level. If, in an experiment, the researcher obtains a p value of, say, .05 or .01 on a true null hypothesis, it would be interpreted to mean that the probability of obtaining such an extreme (or more extreme) value is only 5% or 1%. (Hence, Fisher is a frequentist, but not in the same sense as Neyman–Pearson.) For Fisher (1966), then, 'Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis' (p. 16). In the Fisherian paradigm, an event is deemed established when we can conduct experiments that rarely fail to yield statistically significant (p ≤ .05) results. As mentioned earlier, Fisher considered p values from single experiments as supplying inductive evidence against the null hypothesis, with smaller p values indicating greater evidence (Johnstone, 1986, 1987b; Spielman, 1974). According to Fisher's famous disjunction, a p value ≤ .05 on the null hypothesis shows that either a rare event has occurred or else the null hypothesis is false (Seidenfeld, 1979). Fisher was sure that statistics could play a major role in fostering inductive inference, that is, drawing inferences from the particular to the general, from samples to populations.
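To make the evidential reading of the p value concrete, here is a minimal sketch in Python (my illustration, not from the article), using NumPy and SciPy with an invented data set; the sample size, effect and seed are arbitrary assumptions. The p value is the probability, computed under H0, of a test statistic at least as extreme as the one actually observed.

```python
# Minimal sketch of a Fisherian significance test (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.4, scale=1.0, size=25)   # hypothetical observations

# H0: the population mean is 0.  The two-sided p value is
# P(|T| >= |t_obs| given H0): the probability, assuming H0 is true, of data
# at least as discrepant as those actually obtained.
t_obs, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_obs:.2f}, p = {p_value:.4f}")

# Fisher's working convention: results with p <= .05 are treated as
# significant, and smaller p values are read as stronger inductive
# evidence against H0.
```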

According to him, 'Inductive inference is the only process known to us by which essentially new knowledge comes into the world' (Fisher, 1966, p. 7). But Fisher (1958) was wary that mathematicians (certainly Neyman) did not necessarily subscribe to his inductivist viewpoint:
In that field of deductive logic, at least when carried out with mathematical symbols, [mathematicians] are of course experts. But it would be a mistake to think that mathematicians as such are particularly good at the inductive logical processes which are needed in improving our knowledge of the natural world, in reasoning from observational facts to the inferences which those facts warrant. (p. 261)

Fisher never wavered in his belief that inductive reasoning was the chief mechanism of knowledge development, and for him the p values from significance tests were evidential.

Neyman–Pearson's Paradigm of Hypothesis Testing

The Neyman–Pearson (1928a, 1928b, 1933) statistical paradigm is widely accepted as the norm in classical statistical circles (Carlson, 1976; Hogben, 1957; LeCam & Lehmann, 1974; Nester, 1996; Oakes, 1986; Royall, 1997; Spielman, 1974). Their work on hypothesis testing, terminology they preferred to distinguish it from Fisher's significance testing, was quite distinct from the latter's framework of inductive inference. The Neyman–Pearson approach postulates two competing hypotheses: the null hypothesis (H0) and the alternative hypothesis (HA). In justifying the need for an alternative hypothesis, Neyman (1977) wrote:
. . . in addition to H [the null hypothesis] there must exist some other hypotheses, one of which may conceivably be true. Here, then, we come to the concept of the set of all admissible hypotheses which is frequently denoted by the letter Ω. Naturally, Ω must contain H. Let H̄ denote the complement, say H̄ = Ω − H. It will be noticed that when speaking of a test of the hypothesis H, we really speak of its test against the alternative H̄. This is quite important. The fact is that, unless the alternative H̄ is specified, the problem of an optimal test of H is indeterminate. (p. 104)

Neyman–Pearson considered Fisher's usage of the occurrence of rare or implausible results to reject H0 to be an inadequate vehicle for hypothesis testing. Something more was needed. They wanted to see whether this same improbable outcome under H0 is more likely to occur under a competing hypothesis. As Pearson later explained: 'The rational human mind did not discard a hypothesis until it could conceive at least one plausible alternative hypothesis' (E.S. Pearson, 1990, p. 82). Even William S. Gosset ('Student', of t-test fame), a man whom Fisher admired, saw the need for an alternative hypothesis. In response to a letter from Pearson, Gosset wrote that the only valid reason for rejecting any statistical hypothesis, no matter how unlikely, is that some alternative hypothesis explains the observed events with a
greater degree of probability (quoted in Reid, 1982, p. 62). The inclusion of an alternative hypothesis by Neyman–Pearson critically distinguishes their approach from Fisher's, and this was an issue of great contention between the two camps over the years. In Neyman–Pearson theory, the investigator selects a (typically point) null hypothesis and tests it against the alternative hypothesis. Their work introduced the probabilities of committing two kinds of error, namely false rejection (Type I error) and false acceptance (Type II error) of the null hypothesis. The former probability is called α, while the latter probability is called β. Eschewing Fisher's ideas about hypothetical infinite populations, Neyman–Pearson results are predicated on the assumption of repeated random sampling from a defined population (Gigerenzer & Murray, 1987). Therefore, Neyman–Pearson theory is best equipped for handling situations where repeated random sampling has meaning, such as in the case of quality-control experiments. In these narrow circumstances, the Neyman–Pearson frequentist interpretation of probability makes sense: α is the long-run relative frequency of Type I errors conditional on the null being true, and β is the counterpart for Type II errors. The Neyman–Pearson theory of hypothesis testing introduced the entirely new concept of the power of a statistical test. The power of a test, or (1 − β), is the probability of rejecting a false null hypothesis. Since the power of a test to detect a particular effect size in the population can be calculated before conducting the research, it is useful in the design of experiments. In Fisher's significance-testing scheme, however, there is no alternative hypothesis (HA), making the ideas about Type II errors and the power of the test irrelevant. Fisher (1935b) pointed this out when rebuking Neyman and Pearson without naming them: 'In fact . . . errors of the second kind are committed only by those who misunderstand the nature and application of tests of significance' (p. 474). And he subsequently added:
The notion of an error of the so-called second kind, due to accepting the null hypothesis when it is false . . . has no meaning with respect to simple tests of significance, in which the only available expectations are those which flow from the null hypothesis being true. (Fisher, 1966, p. 17)

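Fisher's objection aside, the Neyman–Pearson quantities defined above (a prespecified α, the Type II error probability β for a given alternative, and power = 1 − β) are easy to compute in advance of a study. The following is a minimal sketch, not taken from the article, for a two-sided z test with known variance and invented planning values (n = 25, a true effect of 0.5 standard deviations):

```python
# Illustrative power calculation for a two-sided z test of H0: mu = 0,
# with sigma known and equal to 1 (all numbers are hypothetical).
import numpy as np
from scipy import stats

alpha = 0.05                       # prespecified Type I error rate
n = 25                             # planned sample size
effect = 0.5                       # assumed true mean under H_A, in SD units

z_crit = stats.norm.ppf(1 - alpha / 2)      # 1.96: critical region is |z| > 1.96
shift = effect * np.sqrt(n)                 # where the z statistic centers under H_A

# beta = P(fail to reject H0 | H_A true); power = 1 - beta
beta = stats.norm.cdf(z_crit - shift) - stats.norm.cdf(-z_crit - shift)
print(f"alpha = {alpha}, beta = {beta:.3f}, power = {1 - beta:.3f}")   # power ~ .70
```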
Fisher denied the need for an alternative hypothesis, and strenuously opposed its incorporation by Neyman–Pearson (Gigerenzer & Murray, 1987; Hacking, 1965). Fisher (1966), however, touches upon the concept of the power of a test when discussing the sensitiveness of an experiment:
By increasing the size of the experiment we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis. Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say
that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved. (pp. 21–22)

And Neyman (1967) was, of course, familiar with this: 'The consideration of power is occasionally implicit in Fisher's writings, but I would have liked to see it treated explicitly' (p. 1459). Whereas Fisher's view of inductive inference centered on the rejection of the null hypothesis, Neyman and Pearson had no time at all for the very idea of inductive reasoning. Their concept of inductive behavior sought to provide rules for making decisions between two hypotheses, regardless of the researcher's belief in either one. Neyman (1950) made this quite explicit:
Thus, to accept a hypothesis H means only to decide to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true . . . [while rejecting H] . . . means only that the rule prescribes action B and does not imply that we believe that H is false. (pp. 259–260)

Neyman–Pearson theory, therefore, substitutes the idea of inductive behavior for that of inductive inference. According to Neyman (1971):
The description of the theory of statistics involving a reference to behavior, for example, 'behavioristic statistics', has been introduced to contrast with what has been termed 'inductive reasoning'. Rather than speak of inductive reasoning I prefer to speak of inductive behavior. (p. 1)

And 'The term inductive behavior means simply the habit of humans and other animals (Pavlov's dogs, etc.) to adjust their actions to noticed frequencies of events, so as to avoid undesirable consequences' (Neyman, 1961, p. 148; see also Neyman, 1962). Further defending his preference for inductive behavior over inductive inference, Neyman (1957) acknowledged his suspicions about the latter because of its dogmatism, lack of clarity, and because of the absence of consideration of consequences of the various actions contemplated (p. 16). In presenting his decision rules for taking action A rather than B, Neyman (1950) emphasized that the theory of probability and statistics both play an important role, and 'there is a considerable amount of reasoning involved. As usual, however, the reasoning is all deductive' (p. 1). The deductive character of the Neyman–Pearson model proceeds from the general to the particular. They came up with a rule of behavior for selecting between two alternative courses of action, accepting or rejecting the null hypothesis, such that 'in the long run of experience, we shall not be too often wrong' (Neyman & Pearson, 1933, p. 291). Whether to accept or reject the hypothesis in their framework depends on the cost trade-offs involved with committing a Type I or Type II error. These costs are independent of statistical theory. They must be estimated by the researcher
in the context of each particular problem. Neyman and Pearson (1933) advised:
. . . in some cases it will be more important to avoid the first [type of error], in others the second [type of error]. . . . From the point of view of mathematical theory all we can do is to show how the risk of errors may be controlled or minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator. (p. 296)

After heeding such advice, the researcher would design an experiment to control the probabilities of the α and β error rates, with the best test being the one that minimizes β subject to a bound on α (Lehmann, 1993). In determining what this bound on α should be, Neyman (1950) later stated that the control of Type I errors was more important than that of Type II errors:
The problem of testing statistical hypotheses is the problem of selecting critical regions. When attempting to solve this problem, one must remember that the purpose of testing hypotheses is to avoid errors insofar as possible. Because an error of the first kind is more important to avoid than an error of the second kind, our first requirement is that the test should reject the hypothesis tested when it is true very infrequently. . . . To put it differently, when selecting tests, we begin by making an effort to control the frequency of the errors of the first kind (the more important errors to avoid), and then think of errors of the second kind. The ordinary procedure is to fix arbitrarily a small number α . . . and to require that the probability of committing an error of the first kind does not exceed α. (p. 265)

Consequently, α is specified or fixed prior to the collection of the data. Because of this, Neyman–Pearson methodology is sometimes labeled the 'fixed α' (Huberty, 1993), 'fixed level' (Lehmann, 1993) or 'fixed size' (Seidenfeld, 1979) approach. This contrasts with Fisher's p value, which is a random variable whose distribution is uniform over the interval [0, 1] under the null hypothesis. The α and β error rates define a critical or rejection region for the test statistic, say z or t > 1.96. If the test statistic falls in the critical region, H0 is rejected in favor of HA, otherwise H0 is retained (Goodman, 1993; Huberty, 1993). Furthermore, descriptions of Neyman–Pearson theory refer to the rejection of H0 when H0 is true (the Type I error probability) as the significance level of a test. As we shall see below, calling the Type I error probability the significance level of a statistical test was something quite unacceptable to Fisher. It has also helped to create enormous confusion among researchers concerning the meaning and interpretation of statistical significance. Recall that Fisher regarded his significance tests as constituting inductive evidence against the null hypothesis in single experiments (Johnstone, 1987a; Kyburg, 1974; Seidenfeld, 1979).
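The contrast between a fixed rejection region and a data-dependent p value is easy to see in simulation. The sketch below is my own illustration (settings are arbitrary), not part of the article: with a true null hypothesis, the Neyman–Pearson region |z| > 1.96 rejects in roughly 5% of repeated samples, while the Fisherian p value varies uniformly over [0, 1] from sample to sample.

```python
# Repeated sampling under a true H0 (mean 0, sigma 1 known): the fixed
# critical region rejects ~5% of the time; the p value itself is uniform.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reps, n = 10_000, 30
means = rng.normal(loc=0.0, scale=1.0, size=(reps, n)).mean(axis=1)
z = means / (1.0 / np.sqrt(n))                 # z statistic for each sample
p = 2 * stats.norm.sf(np.abs(z))               # two-sided p values

print("rejection rate, |z| > 1.96:", np.mean(np.abs(z) > 1.96))        # ~ 0.05
print("deciles of p under H0:",
      np.round(np.quantile(p, np.linspace(0.1, 0.9, 9)), 2))           # ~ .1, .2, ..., .9
```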

Neyman–Pearson hypothesis tests, on the other hand, do not permit an inference to be made about the outcome of any individual hypothesis that the researcher is examining. Neyman and Pearson (1933) were unequivocal about this: 'We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis' (pp. 290–291). But since scientists are in the business of gleaning evidence from individual studies, this limitation of Neyman–Pearson theory is acute. Nor, for that matter, does the Neyman–Pearson model allow an inference to be made in the case of ongoing, repetitive studies. Thus, Grayson, Pattison and Robins (1997) were incorrect when they stated that 'one implication of a strictly frequentist [Neyman–Pearson] approach is that we can only make inferences on the basis of a long run of repeated trials' (p. 68). Neyman–Pearson theory is strictly behavioral; it is non-evidential in both the short and long runs. As Neyman (1942) wrote: 'it will be seen that the theory of testing hypotheses has no claim of any contribution to . . . inductive reasoning' (p. 301). Fisher (1959) recognized this, commenting that the Neyman–Pearson procedure 'is devised for a whole class of cases. No particular thought is given to each case as it arises, nor is the tester's capacity for learning exercised' (p. 100). Instead, the investigator is only allowed to make a decision about the likely outcome of a hypothesis as if it had been subjected, as Fisher (1956) observed, to 'an endless series of repeated trials which will never take place' (p. 99). In the vast majority of applied work, repeated random sampling does not occur; empirical findings are usually limited to a single sample. Fisher conceded that Neyman and Pearson's contribution, which he referred to as an 'acceptance procedures' approach, had merit in the context of quality-control decisions. For example, he acknowledged: 'I am casting no contempt on acceptance procedures, and I am thankful, whenever I travel by air, that the high level of precision and reliability required can really be achieved by such means' (Fisher, 1955, p. 69). This concession aside, Fisher (1959) was resolute in his objections to Neyman–Pearson ideas about hypothesis testing as an appropriate method for guiding scientific research:
The Theory of Testing Hypotheses was a later attempt, by authors who had taken no part in the development of [significance] tests, or in their scientific application, to reinterpret them in terms of an imagined process of acceptance sampling, such as was beginning to be used in commerce; although such processes have a logical basis very different from those of a scientist engaged in gaining from his observations an improved understanding of reality. (pp. 4–5)

He insisted that
. . . the logical differences between [acceptance procedures] and the work of scientific discovery by physical or biological experimentation seem to me so wide that the analogy between them is not helpful, and the
identification of the two sorts of operation is decidedly misleading. (Fisher, 1955, pp. 69–70)

In further distancing himself from Neyman–Pearson methodology, Fisher (1955) drew attention to the fact that:
From a test of significance, however, we learn more than that the body of data at our disposal would have passed an acceptance test at some particular level; we may learn, if we wish to, and it is to this that we usually pay attention, at what level it would have been doubtful; doing this we have a genuine measure of the confidence with which any particular opinion may be held, in view of our particular data. From a strictly realistic viewpoint we have no expectation of an unending sequence of similar bodies of data, to each of which a mechanical yes or no response is to be given. What we look forward to in science is further data, probably of a somewhat different kind, which may confirm or elaborate the conclusions we have drawn; but perhaps of the same kind, which may then be added to what we have already, to form an enlarged basis for induction. (p. 74)

The above discussion shows that Fisher and Neyman–Pearson disagreed vehemently over both the nature of statistical methods and their approaches to the conduct of science per se. Indeed, ongoing exchanges of a frequently acrimonious nature passed between Fisher and Neyman–Pearson as both sides promulgated their respective conceptions of statistical analysis and the scientific method.

Minding One's ps and αs

Users of statistical techniques in the social and medical sciences are almost totally unaware of the distinctions, described above, between Fisher's ideas on significance testing and Neyman–Pearson thoughts on hypothesis testing (Gigerenzer, 1993; Goodman, 1993, 1999; Huberty, 1993; Royall, 1997). This is through no fault of their own; after all, they have been taught from numerous well-regarded textbooks on statistical methods. Unfortunately, many of these same textbooks combine, without acknowledgement, incongruous ideas from both the Fisherian and Neyman–Pearson camps. This is something that both sides found appalling. Ironically, as will be seen, the end result of this unintentional mixing of Fisherian with Neyman–Pearson ideas is that although the latter's work came to be accepted as statistical orthodoxy about fifty years ago (Hogben, 1957; Spielman 1974), it is Fisher's methods that flourish today. As Royall (1997) observed:
The distinction between Neyman–Pearson tests and [Fisher's] significance tests is not made consistently clear in modern statistical writing and teaching. Mathematical statistical textbooks tend to present Neyman–Pearson theory, while statistical methods textbooks tend to lean more towards significance tests. The terminology is not standard, and the same
terms and symbols are often used in both contexts, blurring the differences between them. (p. 64)

Johnstone (1986) and Keuzenkamp and Magnus (1995) maintain that statistical testing usually follows Neyman–Pearson formally, but Fisher philosophically. For instance, Fisher's notion of disproving the null hypothesis is taught along with the Neyman–Pearson concepts of alternative hypotheses, Type II errors and the power of a statistical test. In addition, textbook descriptions of Neyman–Pearson theory often refer to the Type I error probability as the significance level (Goodman, 1999; Kempthorne, 1976; Royall, 1997). But the quintessential example of the bewilderment caused by the forging of Fisher's ideas on inductive inference with the Neyman–Pearson principle of inductive behavior is the widely unappreciated fact that the former's p value is incompatible with the Neyman–Pearson hypothesis test in which it has become embedded (Goodman, 1993). Despite this fundamental incompatibility, the end result of this mixing is that the p value is now indelibly linked in researchers' minds with the Type I error rate, α. And this is precisely what Fisher (1955) had earlier complained about when he accused Neyman–Pearson of attempting to assimilate a test of significance to an acceptance procedure (p. 74). Because of this assimilation, much empirical work in psychology and the social and biological sciences proceeds in the following manner: the investigator states the null (H0) and alternative (HA) hypotheses, the Type I error rate/significance level, α, and presumably (but rarely) calculates the statistical power of the test (e.g. t). These steps are in accordance with Neyman–Pearson convention. After this, the test statistic is computed for the sample data, and in an effort to have the best of both worlds, an associated p value (significance probability) is calculated. The p value is then erroneously interpreted as a frequency-based 'observed' Type I error rate (Goodman, 1993), and at the same time as an incorrect (i.e. p < α) measure of evidence against H0.

The p Value as a Type I Error Rate

Even staunch critics of, and other commentators on, statistical testing in psychology occasionally commit this error. Thus Dar et al. (1994) noted: 'The sample p value, in the context of null hypothesis testing, is involved . . . in determining whether the predetermined criterion of Type I error, the alpha level, has been surpassed' (p. 76). Likewise, Meehl (1967) misreported that the investigator gleefully records the tiny probability number p < .001, and there is a tendency to feel that 'the extreme smallness of this probability of a Type I error is somehow transferable' (p. 107, my emphasis). Nickerson (2000) also makes the mistake of drawing a parallel between p values and Type I error rates: 'The value of p that is obtained as the result of NHST is the probability of a Type I error on the assumption that the null
hypothesis is true' (p. 243). He later goes on to compound this mistake by adding: 'Both p and α represent bounds on the probability of Type I error . . . p is the probability of a Type I error resulting from a particular test if the null hypothesis is true' (p. 259, my emphasis). Here, Nickerson misinterprets a p value as an observed Type I error rate, something which is impossible since the latter applies only to long-run frequencies, not to individual instances. Neyman (1971) expressed this as follows:
It would be nice if something could be done to guard against errors in each particular case. However, as long as the postulate is maintained that the observations are subject to variation affected by chance (in the sense of frequentist theory of probability), all that appears possible to do is to control the frequencies of errors in a sequence of situations. (p. 13)

The p value is not a Type I error rate, long-run or otherwise; it is a measure of inductive evidence against H0. Type I errors play no role in Fisher's paradigm. This misinterpretation of his evidential p value as a Neyman–Pearson Type I error rate severely upset Fisher, who was adamant that the significance level of a statistical test had no ongoing sampling interpretation. With regard to the .05 level, Fisher (1929) early on warned that this does not mean that the researcher 'allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained' (p. 191). The significance level, for Fisher, was a measure of evidence for the objective disbelief in the null hypothesis; it had no long-run frequentist characteristics. Again, Fisher (1950) protested that his tests of significance
. . . have been most unwarrantably ignored in at least one pretentious work on Testing Statistical Hypotheses . . . Pearson and Neyman have laid it down axiomatically that the level of significance of a test must be equated to the frequency of a wrong decision in repeated samples from the same population. This idea was foreign to the development of tests of significance given by the author in 1925. (p. 35.173a, my emphasis)

Seidenfeld (1979) exposed the difference between the two schools of thought on this crucial matter:
. . . such a frequency property has little or no connection with the interpretation of the [Fisherian significance] test. To repeat, the correct interpretation is through the disjunction, either a rare event has occurred or the null hypothesis is false. (p. 79)

In highlighting the discrepancies between ps and αs, Gigerenzer (1993) offered the following:
For Fisher, the exact level of significance is a property of the data (i.e., a relation between a body of data and a theory); for Neyman and Pearson,
alpha is a property of the test, not of the data. Level of significance [p value] and alpha are not the same thing. (p. 317, my emphasis)

Despite the above cautions about p values not being Type I error rates, it is sobering to note that even well-known statisticians such as Barnard (1985), Gibbons and Pratt (1975) and Hinkley (1987) nevertheless make the mistake of equating them. Yet, as Berger and Delampady (1987) warn, the interpretation of the p value as an error rate is strictly forbidden:
P-values are not a repetitive error rate . . . A Neyman–Pearson error probability, α, has the actual frequentist interpretation that a long series of α level tests will reject no more than 100α% of true H0, but the data-dependent P-values have no such interpretation. (p. 329)

At the same time it must be underlined that Neyman–Pearson would not endorse an inferential or epistemic interpretation of statistical testing, as manifested in a p value. Their theory is behavioral, not evidential, and they would likewise complain that the p value is not a Type I error rate. It should therefore be pointed out that in his effort to partially resolve discrepancies between the Fisherian and Neyman–Pearson programs, Lehmann (1993) similarly fails to distinguish between measures of evidence versus error. He refers to the Type I error rate as the significance level of the test, when for Fisher this was determined by p values and not αs. And we have seen that misconstruing the evidential p value as a Neyman–Pearson Type I error rate was anathema to both Fisher and Neyman–Pearson.

The p Value as a Quasi-measure of Evidence Against H0 (p < α)

While the p value is being erroneously reported as a Neyman–Pearson Type I error rate, it will be interpreted simultaneously in an incorrect quasi-Fisherian manner as evidence against H0. If p < α, a statistically significant finding is announced, and the null hypothesis is disproved. For example, Leavens and Hopkins' (1998) declaration that 'alpha was set at p < .05 for all tests' (p. 816) reflects a common tendency among researchers to confuse ps and αs in a statistical significance testing framework. Clark-Carter (1997) goes further in this regard by invoking the great man himself: 'According to Fisher, if p were greater than α then the null hypothesis could not be accepted' (p. 71). Fisher, of course, would have taken umbrage at such a statement, just as he would have with Clark's (1999) assertion that in Fisher's original work in agriculture 'alpha was set a priori' (p. 283), with Huberty's (1993, p. 328) and Huberty and Pike's (1999, p. 11) suggestions that Fisher encouraged the use of α = .05, with Cortina and Dunlap's (1997) recommendation to compare observed probabilities with predetermined cut-off values, and with Chow's (1996) error of investing his (Fisher's) tests with both ps and αs. Fisher had no use for the concept α. Again, in an otherwise thoughtful article titled 'The Appropriate Use of Null Hypothesis Testing', Frick (1996) nonetheless makes the mistake of using ps and αs
interchangeably: 'Finally, the obtained value of p is compared to a criterion alpha, which is conventionally set at .05 . . . When p is less than .05, the experimenter has sufficient empirical evidence to support a claim' (p. 385). In addition, Nickerson (2000), who earlier misinterpreted a p value as a Type I error rate, follows Frick (1996) in ascribing an evidential meaning to the p value when it is directly compared with this error rate:
A specified significance level conventionally designated α (alpha) serves as a decision criterion, and the null hypothesis is rejected only if the value of p yielded by the test is not greater than the value of α. If α is set at .05, say, and a significance test yields a value of p equal to or less than .05, the null hypothesis is rejected and the result is said to be statistically significant at that level. (Nickerson, 2000, pp. 242–243)

Nix and Barnette (1998) do likewise in a paper subtitled 'A Review of Null Hypothesis Significance Testing': 'As such, p values lower than the alpha value are viewed as a rejection of the null hypothesis, and p values equal to or greater than the alpha value are viewed as a failure to reject' (p. 6). Yet we have seen that interpreting p values as evidence against the null hypothesis in a single experiment is impossible in the Neyman–Pearson framework. Their approach centers on decision rules with a priori stated error rates, α and β, which are limiting frequencies based on long-run repeated sampling. If a result falls into the critical region, H0 is rejected and HA is accepted, otherwise H0 is accepted and HA is rejected (Goodman, 1993; Huberty, 1993). Interestingly, this last claim contradicts Fisher's (1966) remark that the null hypothesis 'is never proved or established, but is possibly disproved, in the course of experimentation' (p. 16). In the Neyman–Pearson framework one can indeed accept the null hypothesis. Neyman's (1942) advice makes this plain:
. . . we may say that any test of a statistical hypothesis consists in a selection of a certain region, w0, in the n dimensional experimental space W, and in basing our decision as to the hypothesis H0 on whether the experimental point E, determined by the actual observations, falls within w0 or not. If it does, the hypothesis H0 will be rejected, if it does not, it will be accepted. (p. 303, my emphasis)

Note, then, the distinctly Fisherian bent adopted by Wilkinson and the APA Task Force on Statistical Inference (1999) when they recommend: 'Never use the unfortunate expression accept the null hypothesis' (p. 599). To reiterate, this advice is at odds with, ostensibly, Neyman–Pearson statistical convention.

Further Confusion over ps and αs

In the Neyman–Pearson decision model the researcher is only allowed to say whether or not the result fell in the critical region, not where it fell, as might
be indicated by a p value. Thus, if the Type I error rate, α, is fixed at its usual .05 value before (as it must be) the study is carried out, and the researcher subsequently obtains a p value of, say, .0014, this exact value cannot be reported in a Neyman–Pearson hypothesis test (Oakes, 1986). This is because, Goodman (1993, 1999) explains, α is the probability of a set of possible results that may fall anywhere in the tail area of the distribution under the null hypothesis, and we cannot know in advance which of these particular results will arise. This differs from the tail area for the p value, which is known only after the result is observed, and which, by definition, will always lie exactly on the border of that tail area. Consequently, a predetermined Type I error rate cannot be conveniently renegotiated as a measure of evidence after the result is observed (Royall, 1997). Despite the above, Wilkinson and the APA Task Force on Statistical Inference (1999) again display a Fisherian, rather than a Neyman–Pearsonian, orientation when they state: 'It is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual p value' (p. 599). But it is not at all difficult to imagine such a situation in a Neyman–Pearson context. On the contrary, the dichotomous accept–reject decision is all that is possible in their statistical calculus. Furthermore, p values, exact or otherwise, play no role in the Neyman–Pearson model. By the same reasoning, it is not permissible to report what Goodman (1993, p. 489) calls 'roving alphas', whereby p values are assigned a limited number of categories of Type I error rates, such as p < .05, p < .01, p < .001, and so on. As mentioned earlier, and elaborated below, a Type I error rate, α, must be fixed before the data are collected, and any ex post facto reinterpretation of values like p < .05, p < .01, and so on, as variable Type I error rates applicable to different parts of any given study is strictly inadmissible.

Why Must α Be Fixed Before the Study is Carried Out?

At this juncture it is helpful to be reminded why it is necessary to fix α prior to conducting the study, rather than allow it to be flexible (like a p value), and how the failure to do so has tainted researcher behavior. Alpha must be selected before performing the study so as to constrain the Type I error rate to some agreed-upon level. In specifying α in advance, Neyman (1977) admitted that in their efforts to build a frequentist theory of testing hypotheses, he and Pearson were inspired by the French mathematician Borel's advice that:
. . . (a) the criterion to test a hypothesis (a statistical hypothesis) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations en quelque sorte remarquable. (p. 103)
Royall (1997, pp. 119–121) addresses the rationale for prespecifying α, and I paraphrase him here. Suppose a researcher decides not to supply α ahead of time, but waits instead until the data are in. It is decided that if the result is statistically significant at the .01 level, it will be reported as such, and if it falls between the .01 and .05 levels, it will be recorded as being statistically significant at the .05 level. In the Neyman–Pearson model, this test procedure is improper and yields misleading results, namely, that while H0 will occasionally be rejected at the .01 level, it will always be rejected any time the result is statistically significant at the .05 level. Thus, the research report does not have a Type I error probability of .01, but only .05. If the prespecified α is fixed at the .05 level, no extra observations can legitimize a claim of statistical significance at the .01 level. This, Royall (1997) points out, makes perfect sense within Neyman–Pearson theory, although no sense at all from a scientific perspective, where additional observations would be seen as preferable. And this deficiency in the Neyman–Pearson paradigm has encouraged researchers, albeit unwittingly, to incorrectly supplement their testing procedures with 'roving alpha' p values. The Neyman–Pearson hypothesis testing program strictly disallows any peeking at the data and (repeated) sequential testing (Cornfield, 1966; Royall, 1997). Cornfield (1966, p. 19) tells of a situation he calls common among statistics consultants: suppose, after gathering n observations, a researcher does not quite reject H0 at the prespecified α = .05 level. The researcher still believes the null hypothesis is false, and that if s/he had obtained a statistically significant result, the findings would be submitted for publication. The investigator then asks the statistician how many more data points would be necessary to reject the null. And, in Neyman–Pearson theory, the answer is no amount of (even extreme) additional data points would allow rejection at the .05 level. As incredible as this answer seems, it is correct. If the null hypothesis is true, there is a .05 chance of its being rejected after the first study. As Cornfield (1966) remarks, however, to this chance we must add the probability of rejecting H0 in the second study, conditional on our failure to do so after the first one. This, in turn, raises the total probability of the erroneous rejection of H0 to over .05. Cornfield shows that as the n size in the second study is continuously increased, the significance level approaches .0975 (= .05 + .95 × .05). This demonstrates that no amount of collateral data points can be gathered to reject H0 at the .05 level. Royall (1997) puts it this way: 'Choosing to operate at the 5% level means allowing only a 5% chance of erroneously rejecting the hypothesis, and the experimenter has already taken that chance. He spent his 5% when he tested after the first n observations' (p. 111). But once again, despite being in complete accord with Neyman–Pearson theory, this explanation runs counter to common sense.
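Cornfield's arithmetic is easy to check by simulation. The sketch below is my own illustration (sample sizes and seed are arbitrary), not part of the article: a first test on n = 20 observations at the .05 level and, whenever that look is not significant, a second test at .05 on the sample enlarged by a large additional batch.

```python
# Sequential testing under a true H0: the overall chance of a 'significant'
# result exceeds .05, heading toward Cornfield's .05 + .95 * .05 = .0975
# as the second batch grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n1, n2, alpha, reps = 20, 2_000, 0.05, 10_000
rejections = 0
for _ in range(reps):
    first = rng.normal(size=n1)                       # H0 (mean = 0) is true
    if stats.ttest_1samp(first, 0.0).pvalue <= alpha:
        rejections += 1                               # rejected at the first look
        continue
    combined = np.concatenate([first, rng.normal(size=n2)])
    if stats.ttest_1samp(combined, 0.0).pvalue <= alpha:
        rejections += 1                               # rejected after adding data
print("overall rejection rate:", rejections / reps)   # well above .05
```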

And once again, researchers mistakenly augment the Neyman–Pearson hypothesis-testing model with Fisherian p values, thereby introducing a thicket of roving or pseudo-alphas in their reports. Further confounding matters, these variable Type I error p values are also interpreted in an evidential fashion. This happens, for example, when p < .05 is deemed significant, p < .01 is highly significant, p < .001 is extremely significant, and so on. Goodman (1993, 1999) and Royall (1997) caution that because of its apparent similarity with the Neyman–Pearson Type I error rate, α, Fisher's p value has been subsumed within the former's paradigm. As a result, the p value has been contemporaneously interpreted as both a measure of evidence and an observed error rate. This has created extensive confusion over the meaning of p values and α levels. Unfortunately, as Goodman (1992) warned,
. . . because p-values and the critical regions of hypothesis tests are both tail area probabilities, they are easy to confuse. This confusion blurs the division between concepts of evidence and error for the statistician, and obscures it completely for nearly everyone else. (p. 897)

Fisher, Neyman–Pearson and the .05 Level

Finally, it is ironic that the confusion surrounding the differences between ps and αs was inadvertently fueled by Neyman and Pearson themselves. This occurred when they used, for convenience, Fisher's 5% and 1% significance levels to help define their Type I error rates (E.S. Pearson, 1962). Fisher's popularizing of such nominal levels of statistical significance is a curious, and influential, historical oddity. While working on Statistical Methods for Research Workers, Fisher was denied permission by Karl Pearson to reproduce W.P. Elderton's table of χ2 from the first volume of Biometrika, and therefore created his own version. In doing so, Egon Pearson (1990) wrote, '[Fisher] gave the values of [Karl Pearson's] χ2 [and Student's t] for selected values of P . . . instead of P for arbitrary χ2, and thus introduced the concept of nominal levels of significance' (p. 52, my emphasis). As noted, Fisher's use of 5% and 1% levels was also adopted by Neyman–Pearson. And Fisher (1959) criticized them for doing so, explaining that 'no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas' (p. 42). No wonder that many researchers confuse Fisher's evidential p values with Neyman–Pearson behavioral error rates when both concepts are ossified at the 5% and 1% levels.
Confusion Over ps and αs: APA Publication Manuals

Adding fuel to the fire, there is also ongoing confusion and equivocation over the interpretation of p values and α levels in various APA Publication Manuals. I noted previously that a prespecified Type I error rate (α) cannot be simultaneously viewed as a measure of evidence (p) once the result is observed. Yet this mistake is actively endorsed as official policy in an earlier APA Publication Manual (1994):
For example, given a true null hypothesis, the probability of obtaining the particular value of the statistic you computed might be [p =] .008. Many statistical packages now provide these exact [p] values. You can report this distinct piece of information in addition to specifying whether you rejected or failed to reject the null hypothesis using the specified alpha level. With an alpha level of .05, the effect of age was statistically significant, F(1, 123) = 7.27, p = .008. (p. 17)

And psychologists are provided with mixed messages by statements such as: 'Before you begin to report specific results, you should routinely state the particular alpha level you selected for the statistical tests you conducted: [for example,] An alpha level of .05 was used for all statistical tests' (APA, 1994, p. 17). This is sound advice that is congruent with Neyman–Pearson principles. However, it is immediately followed with some poor advice: 'If you do not make a general statement about the alpha level, specify the alpha level when reporting each result' (p. 17). This recommendation promotes unsound statistical practice by virtually sanctioning the habitual use of untenable roving alphas. Such a situation, whether applied to actual αs or (more likely) p values, is completely unacceptable: α must be determined prior to, not after, data collection and statistical analysis. The latest APA Publication Manual (2001), following on the heels of the report of Wilkinson and the APA Task Force on Statistical Inference (1999), continues in a similar vein. Like its 1994 predecessor, for instance, the 2001 Manual distinguishes the two types of probabilities associated with statistical testing, p and α (although both editions incorrectly call the p value a 'likelihood' when it is a probability). But the new version likewise proceeds to dispense erroneous counsel:
The APA is neutral on which interpretation [p or α] is to be preferred in psychological research. . . . Because most statistical packages now report the p value . . . and because this probability can be interpreted according to either mode of thinking, in general it is the exact probability (p value) that should be reported. (APA, 2001, p. 24, my emphasis)

But this is not the case. The p value is not a Type I error rate, and an α level is not evidential. The Manual goes on to declare:
There will be cases – for example, large tables of correlations or complex tables of path coefficients – where the reporting of exact probabilities could
be awkward. In these cases, you may prefer to identify or highlight a subset of values in the table that reach some prespecified level of statistical significance. To do so, follow those values with a single asterisk (*) or double asterisk (**) to indicate p < .05 or p < .01, respectively. When using prespecified significance levels, you should routinely state the particular alpha level you selected for the statistical tests you conducted: An alpha level of .05 was used for all statistical tests. (p. 25)

Such recommendations are misleading to authors. Just after being told to present exact, data-dependent, p values wherever feasible, we are also instructed to use both p values and α levels to indicate prespecified (yet variable) levels of statistical significance. This virtually obligates the researcher to consider a p value as an observed Type I error rate. This obligation is all but cemented in another passage, where two examples are given:
Two common approaches for reporting statistical results using the exact probability formulation are as follows: With an alpha level of .05, the effect of age was statistically significant, F(1, 123) = 7.27, p < .01. The effect of age was not statistically significant, F(1, 123) = 2.45, p = .12. The second example should be used only if you have included a statement of significance level earlier in your article. (p. 25)

Note that the rst example uses p and interchangeably, as if they are the same thing. Moreover, p < .01 is not an exact probability. Example two, which does use an exact probability, implies that such probabilities cannot be used without rst announcing an overall signicance levela prespecied to which all others bow. This is not the case.2 Because some statisticians, some quantitative psychologists and APA Publication Manuals are unclear about the differences between ps and s, it is to be expected that confusion levels among psychologists in general will be high. I assess the magnitude of this confusion in a sample of APA journals. Confusion Over ps and s: Empirical Evidence An empirical investigation was made of the way in which the results of statistical tests are reported in the psychology literature. In particular, a randomly selected issue of each of 12 psychology journalsthe American Psychologist (AP), Developmental Psychology (DP), Journal of Abnormal Psychology (JAbP), Journal of Applied Psychology (JAP), Journal of Comparative Psychology (JComP), Journal of Consulting and Clinical Psychology (JCCP), Journal of Counseling Psychology (JCouP), Journal of

314

THEORY

&

PSYCHOLOGY

14(3)

TABLE 1. The incidence of statistical testing in 12 psychology journals: 19902002 Number of empirical papers
10 213 240 186 236 146 157 187 86 188 41 60 1,750

Journal
AP DP JAbP JAP JCCP JComP JCouP JEP JExPG JPSP PB PR Totals

Number of papers
149 222 242 193 273 148 176 193 100 191 128 110 2,125

Number of empirical papers using statistical tests


4 211 238 176 230 138 140 178 85 180 32 33 1,645

Percentage of empirical papers using statistical tests


40.0 99.1 99.2 94.6 97.5 94.5 89.2 95.2 98.8 95.7 78.0 55.0 94.0

Educational Psychology (JEP), Journal of Experimental Psychology General (JExPG), Journal of Personality and Social Psychology (JPSP), Psychological Bulletin (PB) and Psychological Review (PR)was examined for every year from 1990 through 2002 to ascertain the number of empirical articles and notes they contained. Table 1 shows that this examination produced a sample of 1,750 empirical papers, of which 1,645, or 94%, used statistical tests. There is some variability in the percentage of empirical papers using statistical testing, with a low of 40% for AP, and a high of 99.2% for JAbP. Moreover, six of the journals (DP, JAbP, JCCP, JEP, JExPG and JPSP) had in excess of 95% of their published empirical works featuring such tests. Even though the Fisherian evidential p value from a signicance test is at odds with the NeymanPearson hypothesis-testing approach, Table 2 reveals that p values are ubiquitous in psychologys empirical literature. In marked contrast, levels are few and far between. Of the 1,645 papers using statistical tests, 1,090, or 66.3%, employed roving alphas, that is, a discrete, graduated number of p values interpreted variously as Type I error rates and/or measures of evidence against H0, usually p < .05, p < .01, p < .001, and so on. In other words, these p values are sometimes viewed as an observed Type I error rate, meaning that they are not pre-assigned, or xed, error levels that would be dictated by NeymanPearson theory. Instead, these error rates are incorrectly determined solely by the data.

TABLE 2. The reporting of results of statistical tests

Category                                                     Number    Percentage
Roving alphas (R)                                             1,090         66.3
Exact p values (Ep)                                              54          3.3
Combination of Eps with fixed p values and roving alphas        384         23.3
   Ep + .05                                                      18
   Ep + .01                                                       5
   Ep + .001                                                      7
   Ep + R                                                       354
Fixed level values: ps                                            88          5.3
   .05                                                            47
   .01                                                            17
   .001                                                           17
   Other                                                           7
Fixed level values: αs                                            16          1.0
"Significant"                                                      7          0.4
Unspecified                                                        6          0.4
Total                                                          1,645        100.0

Corroborating my figures on the incidence of roving alphas (66.3% overall, and 74.4% for the JAP), Finch et al. (2001) found that these were used in 77% of empirical articles in the JAP for the period 1955 to 1999, and in almost 100% of such articles in the British Journal of Psychology for 1999. They concluded: "Typical practice is to report relative p values using two or more implicit alpha levels, most commonly .05 and .01 and sometimes .001" (p. 198, my emphasis). Interestingly, however, they are not critical, as they should be, of this custom of interpreting p values as Type I error rates. Gigerenzer (1993), on the other hand, most certainly is. When discussing how p values are usually reported in the literature (p < .05, p < .01, p < .001), he observes: "Neyman and Pearson would have rejected this practice: These p values are not the probability of Type I errors . . . [values] such as p < .05, which look like probabilities of Type I errors but aren't" (p. 329). Further clouding the issue, these same p values will be interpreted simultaneously in a quasi-evidential manner as a basis for rejecting H0 if p < α. This includes, in many cases, mistakenly using the p value as a surrogate measure for effect sizes (e.g. p < .05 is significant, p < .01 is very significant, p < .001 is extremely significant, etc.). In short, these roving alphas can assume a number of incorrect and contradictory interpretations.

After the roving alphas, an additional 54 (3.3%) papers reported exact p values, and a further 384 (23.3%) presented various combinations of exact ps with either roving alphas or fixed p values. Conservatively, therefore, some 1,474, or 89.6% (i.e. roving alphas plus the combination of exact ps and roving alphas), of empirical articles in a sample of psychology journals report the results of statistical tests in a fashion that is at variance with Neyman-Pearson convention. At the same time they violate Fisherian theory, as when p values are misinterpreted as both Type I error rates and as quasi-Fisherian (i.e. p < α) measures of evidence against the null.

Another 6 (0.4%) studies were insufficiently clear about the disposition of a result in their accounts (other than comments such as "This result was statistically significant"). I therefore assigned them to the Unspecified category. Thus, 111 (6.7%) studies remain eligible for the reporting of fixed level values in the manner intended by Neyman-Pearson. However, 88 of these 111 studies reported fixed p rather than fixed α levels. After subtracting this group, a mere 23 (1.4%) studies remain eligible. Of these 23 papers, 7 simply refer to their published results as being "significant" at the .05, .01 levels, and so on, but provide no information about p values and/or α levels. In the final analysis, only 16 of 1,645 empirical papers using statistical tests, or 1%, explicitly used α levels. Table 3 makes it abundantly clear that articles in all 12 psychology journals reflect confusion in the reporting of results of statistical tests.

TABLE 3. The reporting of results of statistical tests by journal
(R = roving alphas; Ep = exact p values; Comb = combinations of Eps with fixed p values and roving alphas;
p and α = fixed level values; Sig = results reported only as "significant"; Unsp = unspecified.
Percentages are of each journal's papers using statistical tests.)

Journal    R no.   R %     Ep no.  Ep %    Comb no. Comb %   p no.   p %     α no.   α %     Sig no. Sig %   Unsp no. Unsp %
AP            2    50.0      0      -         1     25.0       1    25.0       0      -        0      -         0       -
DP          160    75.8     11     5.2       34     16.1       6     2.8       0      -        0      -         0       -
JAbP        140    58.8      8     3.4       82     34.5       3     1.3       1     0.4       3     1.3        1      0.4
JAP         131    74.4      0      -        19     10.8      23    13.1       2     1.1       1     0.6        0       -
JCCP        136    59.4     14     6.1       68     29.7      11     4.8       0      -        0      -         1      0.6
JComP        84    60.9      4     2.9       38     27.5       8     5.8       2     1.4       1     0.7        1      0.7
JCouP        88    62.9      1     0.7       40     28.6       9     6.4       2     1.4       0      -         0       -
JEP         126    70.8      5     2.8       34     19.1       8     4.5       4     2.2       0      -         1      0.6
JExPG        55    64.7      3     3.5       14     16.5       7     8.2       3     3.5       1     1.2        2      2.4
JPSP        133    73.9      1     0.6       43     23.9       3     1.7       0      -        0      -         0       -
PB           16    50.0      4    12.5        6     18.8       4    12.5       2     6.3       0      -         0       -
PR           19    57.6      3     9.1        5     15.2       5    15.2       0      -        1     3.0        0       -
Totals    1,090    66.3     54     3.3      384     23.3      88     5.3      16     1.0       7     0.4        6      0.4

Fisher's p values (roving alphas, exact, and in various combinations) dominate the way in which the results of statistical tests are portrayed in psychology journals. Neyman-Pearson fixed αs are practically invisible. Indeed, 5 of the 12 journals in this sample (AP, DP, JCCP, JPSP, PR) report no use of fixed αs whatsoever. The fact that Neyman-Pearson theory is recognized as statistical orthodoxy is overwhelmingly belied by the results in this paper. In psychology journals, Fisher's legacy thrives; Neyman-Pearson viewpoints, on the other hand, receive only lip service.

Discussion

Confusion over the interpretation of classical statistical tests is so universal as to render their application almost meaningless. This ubiquitous confusion among researchers over the meaning of p values and α levels is easier to understand when it is pointed out that both expressions are used to refer to the significance level of a test. But their interpretations are poles apart. The level of significance expressed by a p value in a Fisherian significance test indicates the probability of encountering data this extreme (or more so) under a null hypothesis. This p value does not pertain to some prespecified error rate, but instead is determined by the data, and has epistemic implications through its use as a measure of inductive evidence against H0 in individual studies.

Contrast this with the significance level called α in a Neyman-Pearson hypothesis test. Here, the emphasis is on minimizing Type II, or β, errors (i.e. false acceptance of a null hypothesis) subject to a bound on Type I, or α, errors (i.e. false rejections of a null hypothesis). Moreover, this error minimization applies only to long-run repeated sampling from the same population, not to single experiments, and is a directive for behaviors, not a way of gathering evidence. Viewed from this perspective, the two notions of statistical significance could hardly be more distant in meaning. And both the Fisherian and Neyman-Pearson camps, as has been shown repeatedly throughout this paper, were keenly aware of these differences and attempted, in vain as it turns out, to communicate them to research workers. These differences are summarized in Table 4.

TABLE 4. Contrasting ps and αs

p value                                                  α level
Fisherian significance level                             Neyman-Pearson significance level
Significance test                                        Hypothesis test
Evidence against H0                                      Type I error: erroneous rejection of H0
Inductive philosophy: from particular to general         Deductive philosophy: from general to particular
Inductive inference: guidelines for interpreting         Inductive behavior: guidelines for making
  strength of evidence in data                             decisions based on data
Data-based random variable                               Pre-assigned fixed value
Property of data                                         Property of test
Short-run: applies to any single experiment/study        Long-run: applies only to ongoing, identical
                                                           repetitions of the original experiment/study,
                                                           not to any given study
Hypothetical infinite population                         Clearly defined population
Yet these distinguishing characteristics of ps and αs are rarely spelled out in the literature. On the contrary, they tend to be used equivalently, even being compared with one another (e.g. p < α), especially in statistics textbooks aimed at applied researchers. Usually, in such texts, an anonymous account of standard Neyman-Pearson doctrine is put forward initially, and is often followed by an equally anonymous discussion of the p value approach. This transition from α levels to p values (and their intermixing) is typically seamless, as if it constituted a natural progression through different parts of the same coherent statistical whole.

Of course, a cynic might argue: so what if researchers cannot correctly distinguish between the meanings of ps and αs? S/he may ask just exactly how the confusion over ps and αs has impeded knowledge growth in the discipline. But surely continued, misinformed statistical testing cannot be justified on the grounds that its pernicious consequences are difficult to isolate.

Another line of argument that might be adopted by those seeking to maintain the status quo with regard to statistical testing could be that, in their interchangeable uses of p values and αs, researchers are simply picking and choosing from the best parts of the (unstated) Fisherian and Neyman-Pearson paradigms. Why should a researcher, one might say, feel obligated to use only one or the other of the statistical testing methods? It might be thought that employing an α level here and a p value there does no damage. But these kinds of rationalizations of improper statistical practice do not wash. The Fisherian and Neyman-Pearson conceptions of statistical testing are incommensurate. The researcher is not at liberty to pick and choose, albeit unintentionally, from two incompatible statistical approaches. Yet the crux of the problem is that this is precisely what the overwhelming majority of psychology (and other) researchers do in their empirical analyses. For example, it would be surprising indeed to learn that the author(s) of even a single article from among the 1,645 examined in our sample consciously planned to analyze the data from both the Fisherian and Neyman-Pearsonian perspectives. The proof of the pudding is seen in the fact that the unawareness among researchers of the distinctions between these perspectives is so ingrained that using ps and αs interchangeably is not perceived as even being a problem in the first place.
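The incompatibility summarized in Table 4 can also be stated operationally. The following minimal sketch, which assumes Python with NumPy and SciPy and uses simulated data rather than anything from the journals sampled here, runs the same two-group comparison under both logics: a Fisherian reading that reports the exact p as graded evidence, and a Neyman-Pearson reading in which a pre-assigned α = .05 yields only a reject/do-not-reject decision justified by long-run error rates.

```python
# Minimal sketch contrasting the two testing logics on the same (simulated) data.
# Assumes numpy and scipy; alpha, sample sizes and effect are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(loc=0.0, scale=1.0, size=40)
group2 = rng.normal(loc=0.5, scale=1.0, size=40)

t_stat, p_value = stats.ttest_ind(group1, group2)

# Fisherian reading: the exact p is a data-based measure of evidence against H0,
# to be reported as is and weighed in the context of this one study.
print(f"Fisher: t = {t_stat:.2f}, exact p = {p_value:.4f}")

# Neyman-Pearson reading: alpha is fixed before seeing the data; the output is a
# decision whose justification is the long-run error rate of the procedure.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"Neyman-Pearson: alpha = {alpha}, decision = {decision}")
# Under repeated sampling with H0 true, this rule rejects in about 5% of studies;
# it says nothing graded about the evidence in this particular sample.
```

Nothing in the sketch licenses treating the one number as the other: the exact p is a property of this particular data set, while α describes the behavior of the decision rule over hypothetical repetitions.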

The foregoing account illustrates why Nickerson's (2000) statement that "reporting p values is not necessarily inconsistent with using an α criterion" (p. 277) is incorrect. Contrary to Nickerson's claims (pp. 243 and 242-243, respectively), a p value is not a bound on α, nor is α a decision criterion for p. Fisher and Neyman-Pearson themselves would have denied such assertions, for they continue to cause trouble. While not writing about the confusion regarding ps and αs per se, Tryon's (1998) comments concerning general misunderstandings of statistical tests by psychologists are relevant here:
. . . the fact that statistical experts and investigators publishing in the best journals cannot consistently interpret the results of these analyses is extremely disturbing. Seventy-two years of education have resulted in miniscule, if any, progress toward correcting this situation. It is difficult to estimate the handicap that widespread, incorrect, and intractable use of a primary data analytic method has on a scientific discipline, but the deleterious effects are undoubtedly substantial. (p. 796)

I share Tryon's sentiments.

Conclusions

Among psychology researchers, the p value is interpreted simultaneously in Neyman-Pearson terms as a deductive assessment of error in long-run repeated sampling situations, and in a Fisherian sense as a measure of inductive evidence in a single study. Clearly, this is asking the impossible. More pointedly, a p value from a Fisherian significance test has no place in the Neyman-Pearson hypothesis-testing framework. And the Neyman-Pearson Type I error rate, α, has no place in Fisher's significance-testing approach. Contrary to popular belief, ps and αs are not the same thing; they measure different concepts. Such distinctions notwithstanding, the mixing of measures of evidence (ps) with measures of error (αs) is standard practice in the classroom and in scholarly journals. Because of this, statistical testing is largely devoid of meaning: a p and/or α value can mean almost anything to the applied researcher, and hence nothing. Furthermore, this mixing of ideas from both schools of thought has not been evenhanded. Paradoxically, while the Neyman-Pearson hypothesis-testing framework began supplanting Fisher's views on significance testing half a century ago, this is not evident in journal articles. Here, it is Fisher's legacy that is omnipresent.

Despite trenchant criticism of classical statistical testing (e.g. Schmidt, 1996; Thompson, 1994a, 1994b, 1997),3 following the recommendations of Wilkinson and the APA Task Force on Statistical Inference (1999), and the continued coverage given to it in the latest APA Publication Manual (2001), it appears that such testing is here to stay.

My advice to those who wish to continue using these tests is to be more deliberate about their application. At the very least, researchers should purposely determine whether their concerns are with controlling errors or collecting evidence. If the former, investigators should adopt a Neyman-Pearson approach to guide behavior, but one which makes a serious attempt to estimate the likely costs associated with Type I and II errors. Researchers choosing this option should discontinue using roving alphas and the like, and stick with fixed, pre-assigned αs. If the focus of a study is more evidential in nature, the use of Fisher's p value would be more appropriate, preferably in its exact format so as to once more avoid the roving alphas dilemma.

For myself, I see little of value in classical statistical testing, whether of the ps or αs variety. A better route for assessing the reliability, validity, and generalizability of empirical findings is to establish systematic replication with extension research programs focusing on sample statistics, effect sizes and their confidence intervals (Cumming & Finch, 2001; Hubbard, 1995; Hubbard, Parsa, & Luthy, 1997; Hubbard & Ryan, 2000; Thompson, 1994b, 1997, 1999, 2002). But that is another story.
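For readers who want a starting point for the effect-size-and-interval route, the sketch below computes Cohen's d and an approximate 95% confidence interval for a two-group comparison. It is only an illustration under stated assumptions: Python with NumPy and SciPy, simulated scores, and a large-sample normal approximation to the standard error of d rather than the noncentral-t intervals described by Cumming and Finch (2001).

```python
# Minimal sketch: effect size (Cohen's d) with an approximate 95% CI.
# Simulated data; the normal-approximation SE of d is an assumption, not the
# noncentral-t interval of Cumming and Finch (2001).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(loc=0.0, scale=1.0, size=50)
group2 = rng.normal(loc=0.4, scale=1.0, size=50)

n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1))
                    / (n1 + n2 - 2))
d = (group2.mean() - group1.mean()) / pooled_sd

# Large-sample standard error of d (Hedges & Olkin style approximation).
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
z = stats.norm.ppf(0.975)
ci_low, ci_high = d - z * se_d, d + z * se_d

print(f"d = {d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```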
Notes

1. The wording by Gigerenzer and Murray here is unfortunate, likely to cause mischief, and necessitates a digression. Their phraseology appears to convey to readers the impression that p values and α levels are the same thing. This uncharacteristic lapse in communication by Gigerenzer and Murray can, no doubt, be attributed to what Nickerson (2000, pp. 261-262) calls "linguistic ambiguity", whereby even experts on matters relating to statistical testing can occasionally use misleading language. In fact, we shall see later that Gigerenzer and his colleagues are acutely aware of the distinctions between ps and αs.

2. An anonymous reviewer offered three (speculative) reasons why the APA Publication Manuals bestow erroneous and inconsistent advice. First, across decades of editions, some views were included at one point and not taken out when alternative views were embedded. Second, most of the APA staff do not have training in statistical methods. Third, this same staff do not seem to be overly concerned about statistical issues, as evidenced by (a) the small amount of coverage devoted to them, and (b) the lack of attention in the latest (2001) edition to revisions involving statistical content. See, in this context, Fidler (2002).

3. Indeed, such criticism can also be found in disciplines as diverse as accounting (Lindsay, 1995), economics (McCloskey & Ziliak, 1996), marketing (Hubbard & Lindsay, 2002; Sawyer & Peter, 1983) and wildlife sciences (Anderson, Burnham, & Thompson, 2000). Moreover, this criticism has escalated decade by decade, judging by the number of articles published on the topic (cf. Hubbard & Ryan, 2000).

References

American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Anderson, D.R., Burnham, K.P., & Thompson, W. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912-923.
Barnard, G.A. (1985). A coherent view of statistical inference. Technical Report Series, Department of Statistics & Actuarial Science, University of Waterloo, Ontario, Canada.
Berger, J.O., & Delampady, M. (1987). Testing precise hypotheses (with comments). Statistical Science, 2, 317-352.
Carlson, R. (1976). The logic of tests of significance. Philosophy of Science, 43, 116-128.
Carver, R.P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Chow, S.L. (1996). Statistical significance: Rationale, validity and utility. Thousand Oaks, CA: Sage.
Chow, S.L. (1998). Précis of statistical significance: Rationale, validity and utility (with comments). Behavioral and Brain Sciences, 21, 169-239.
Clark, C.M. (1999). Further considerations of null hypothesis testing. Journal of Clinical and Experimental Neuropsychology, 21, 283-284.
Clark-Carter, D. (1997). The account taken of statistical power in research published in the British Journal of Psychology. British Journal of Psychology, 88, 71-83.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cornfield, J. (1966). Sequential trials, sequential analysis, and the likelihood principle. The American Statistician, 20, 18-23.
Cortina, J.M., & Dunlap, W.P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161-172.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.
Daniel, L.G. (1998). Statistical significance testing: A historical overview of misuse and misinterpretation with implications for the editorial policies of educational journals. Research in the Schools, 5, 23-32.
Dar, R., Serlin, R.C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research. Journal of Consulting and Clinical Psychology, 62, 75-82.
Dixon, P., & O'Reilly, T. (1999). Scientific versus statistical inference. Canadian Journal of Experimental Psychology, 53, 133-149.
Falk, R., & Greenbaum, C.W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5, 75-98.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62, 749-770.

Finch, S., Cumming, G., & Thomason, N. (2001). Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform. Educational and Psychological Measurement, 61, 181-210.
Fisher, R.A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Fisher, R.A. (1929). The statistical method in psychical research. Proceedings of the Society for Psychical Research, London, 39, 189-192.
Fisher, R.A. (1935a). The design of experiments. Edinburgh: Oliver & Boyd.
Fisher, R.A. (1935b). Statistical tests. Nature, 136, 474.
Fisher, R.A. (1950). Contributions to mathematical statistics. London: Chapman & Hall.
Fisher, R.A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, B, 17, 69-78.
Fisher, R.A. (1956). Statistical methods and scientific inference. Edinburgh: Oliver & Boyd.
Fisher, R.A. (1958). The nature of probability. The Centennial Review of Arts and Sciences, 2, 261-274.
Fisher, R.A. (1959). Statistical methods and scientific inference (2nd rev. ed.). Edinburgh: Oliver & Boyd.
Fisher, R.A. (1966). The design of experiments (8th ed.). Edinburgh: Oliver & Boyd.
Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.
Gibbons, J.D., & Pratt, J.W. (1975). P-values: Interpretation and methodology. The American Statistician, 29, 20-25.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C.A. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Erlbaum.
Gigerenzer, G., & Murray, D.J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance. New York: Cambridge University Press.
Goodman, S.N. (1992). A comment on replication, P-values and evidence. Statistics in Medicine, 11, 875-879.
Goodman, S.N. (1993). p values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485-496.
Goodman, S.N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130, 995-1004.
Grayson, D., Pattison, P., & Robins, G. (1997). Evidence, inference, and the rejection of the significance test. Australian Journal of Psychology, 49, 64-70.
Hacking, I. (1965). Logic of statistical inference. New York: Cambridge University Press.
Hinkley, D.V. (1987). Comment. Journal of the American Statistical Association, 82, 128-129.
Hogben, L. (1957). Statistical theory. New York: Norton.

Hubbard, R. (1995). The earth is highly significantly round (p < .0001). American Psychologist, 50, 1098.
Hubbard, R., & Lindsay, R.M. (2002). How the emphasis on original empirical marketing research impedes knowledge development. Marketing Theory, 2, 381-402.
Hubbard, R., Parsa, R.A., & Luthy, M.R. (1997). The spread of statistical significance testing in psychology: The case of the Journal of Applied Psychology, 1917-1994. Theory & Psychology, 7, 545-554.
Hubbard, R., & Ryan, P.A. (2000). The historical growth of statistical significance testing in psychology, and its future prospects. Educational and Psychological Measurement, 60, 661-681.
Huberty, C.J (1993). Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks. Journal of Experimental Education, 61, 317-333.
Huberty, C.J, & Pike, C.J. (1999). On some history regarding statistical testing. In B. Thompson (Ed.), Advances in social science methodology (Vol. 5, pp. 1-22). Stamford, CT: JAI Press.
Hyde, J.S. (2001). Reporting effect sizes: The roles of editors, textbook authors, and publication manuals. Educational and Psychological Measurement, 61, 225-228.
Johnstone, D.J. (1986). Tests of significance in theory and practice (with comments). The Statistician, 35, 491-504.
Johnstone, D.J. (1987a). On the interpretation of hypothesis tests following Neyman and Pearson. In R. Viertl (Ed.), Probability and Bayesian statistics (pp. 311-339). New York: Plenum.
Johnstone, D.J. (1987b). Tests of significance following R.A. Fisher. The British Journal for the Philosophy of Science, 38, 481-499.
Kempthorne, O. (1976). Of what use are tests of significance and tests of hypothesis. Communications in Statistics - Theory and Methods, A5(8), 763-777.
Keuzenkamp, H.A., & Magnus, J.R. (1995). On tests and significance in econometrics. Journal of Econometrics, 67, 5-24.
Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Krantz, D.H. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94, 1372-1381.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.
Kyburg, H.E. (1974). The logical foundations of statistical inference. Dordrecht: Reidel.
Leavens, D.A., & Hopkins, W.D. (1998). Intentional communication by chimpanzees: A cross-sectional study of the use of referential gestures. Developmental Psychology, 34, 813-822.
LeCam, L., & Lehmann, E.L. (1974). J. Neyman: On the occasion of his 80th birthday. The Annals of Statistics, 2, vii-xiii.
Lehmann, E.L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242-1249.
Lindsay, R.M. (1995). Reconsidering the status of tests of significance: An alternative criterion of adequacy. Accounting, Organizations and Society, 20, 35-53.

Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data. Psychological Science, 7, 161-171.
Macdonald, R.R. (1997). On statistical testing in psychology. British Journal of Psychology, 88, 333-347.
McCloskey, D.N., & Ziliak, S.T. (1996). The standard error of regressions. Journal of Economic Literature, 34, 97-114.
Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Mulaik, S.A., Raju, N.S., & Harshman, R.A. (1997). There is a time and a place for significance testing. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 65-116). Hillsdale, NJ: Erlbaum.
Nester, M.R. (1996). An applied statistician's creed. Applied Statistics, 45, 401-410.
Neyman, J. (1942). Basic ideas and some recent results of the theory of testing statistical hypotheses. Journal of the Royal Statistical Society, 105, 292-327.
Neyman, J. (1950). First course in probability and statistics. New York: Holt.
Neyman, J. (1957). Inductive behavior as a basic concept of philosophy of science. International Statistical Review, 25, 7-22.
Neyman, J. (1961). Silver jubilee of my dispute with Fisher. Journal of the Operations Research Society of Japan, 3, 145-154.
Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making. Review of the International Statistical Institute, 30, 11-27.
Neyman, J. (1967). R.A. Fisher (1890-1962), an appreciation. Science, 156, 1456-1460.
Neyman, J. (1971). Foundations of behavioristic statistics (with comments). In V.P. Godambe & D.A. Sprott (Eds.), Foundations of statistical inference (pp. 1-19). Toronto: Holt, Rinehart & Winston of Canada.
Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36, 97-131.
Neyman, J., & Pearson, E.S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference. Part I. Biometrika, 20A, 175-240.
Neyman, J., & Pearson, E.S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference. Part II. Biometrika, 20A, 263-294.
Neyman, J., & Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Ser. A, 231, 289-337.
Nickerson, R.S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301.
Nix, T.W., & Barnette, J.J. (1998). The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing. Research in the Schools, 5, 3-14.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Pearson, E.S. (1962). Some thoughts on statistical inference. Annals of Mathematical Statistics, 33, 394-403.
Pearson, E.S. (1990). 'Student': A statistical biography of William Sealy Gosset. Oxford: Clarendon Press.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 50, 157-175.

Reid, C. (1982). Neyman from life. Berlin: Springer-Verlag.
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
Royall, R.M. (1997). Statistical evidence: A likelihood paradigm. New York: Chapman & Hall.
Rozeboom, W.W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Sawyer, A.G., & Peter, J.P. (1983). The significance of statistical significance tests in marketing research. Journal of Marketing Research, 20, 122-133.
Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-129.
Seidenfeld, T. (1979). Philosophical problems of statistical inference: Learning from R.A. Fisher. Dordrecht: Reidel.
Spielman, S. (1974). The logic of tests of significance. Philosophy of Science, 41, 211-226.
Thompson, B. (1994a). Guidelines for authors. Educational and Psychological Measurement, 54, 837-847.
Thompson, B. (1994b). The pivotal role of replication in psychological research: Empirically evaluating the replicability of sample results. Journal of Personality, 62, 157-176.
Thompson, B. (1997). Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26, 29-32.
Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology, 9, 165-181.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25-32.
Tryon, W.W. (1998). The inscrutable null hypothesis. American Psychologist, 53, 796.
Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Zabell, S.L. (1992). R.A. Fisher and the fiducial argument. Statistical Science, 7, 369-387.

Acknowledgements. The author thanks Susie Bayarri, Jim Berger and Steve Goodman for conversations and correspondence related to the topics discussed in this paper. I am indebted to Gerd Gigerenzer and an anonymous reviewer for suggestions that have improved the manuscript. Any errors are my responsibility.

Raymond Hubbard is Thomas F. Sheehan Distinguished Professor of Marketing in the College of Business and Public Administration, Drake University. His research interests include applied methodology and the sociology of knowledge development in the management and social sciences. He has published several articles on these topics. Address: College of Business and Public Administration, Drake University, Des Moines, IA 50311, USA. [email: Raymond.Hubbard@drake.edu]
