
How to combine scores across multiple questions to form a total scale score (modified and shortened, from Chapter 19, Warner, 2007)

19.6 Methods for the Computation of Summated Scales

19.6.1 Implicit Assumption: All Items Measure the Same Construct and Are Scored in the Same Direction

When we add together scores on a list of measures or questions, we implicitly assume that all these scores measure the same underlying construct and that all the questions or items are scored in the same direction. Consider the first assumption, the assumption that all items measure the same construct. What information would be obtained by a set of numbers that measured completely unrelated things? A sum of X1 = height, X2 = agreeableness, and X3 = number of pairs of shoes owned by a person would be a meaningless number because the scores that are combined are not measures of the same underlying latent variable. In general, it does not make sense to summarize information across a set of X measured variables by summing them unless they are highly correlated with each other, and both the pattern of correlations and the nature of the items are consistent with the interpretation that all the individual X items are slightly different ways of measuring the same underlying latent variable (e.g., depression). The items included in psychological tests such as the CESD scale are typically written so that they assess slightly different aspects of a complex variable such as depression (e.g., low self-esteem, fatigue, sadness). To evaluate empirically whether these items can reasonably be interpreted as measures of the same underlying latent variable or construct, we look for reasonably large correlations among the scores on the items. If the scores on a set of measurements or test items are highly correlated with each other, this evidence is consistent with the belief that the items may all be measures of the same underlying construct. However,

high correlations among items can arise for other reasons and are not necessarily proof that the items measure the same underlying construct; for example, they may occur due to sampling error or may arise because the items have some kind of measurement artifact in common, such as a strong social desirability bias. The most widely reported method of evaluating reliability for summated scales, Cronbach's alpha, is based on the mean of the inter-item correlations.

19.6.2 Reverse-Worded Questions

Consider the second assumption: the assumption that all items are scored in the same direction. In the CESD scale in the appendix to this chapter, most of the items are worded in such a way that a higher score indicates a greater degree of depression. For example, for Question 3, "I felt that I could not shake off the blues even with help from my family or friends," the response that corresponds to 4 points (I felt this way most of the time, 5-7 days per week) indicates a higher level of depression than the response that corresponds to 1 point (I felt this way rarely or none of the time). However, a few of the items (Numbers 4, 8, 12, and 16) are reverse worded. Question 4 asks how frequently the respondent felt that "I was just as good as other people." The response to this question that would indicate the highest level of depression corresponds to the lowest frequency of occurrence (1 = Rarely or none of the time). When reverse-worded items are included in a multiple-item measure, the scoring on these items must be recoded before we sum scores across items, such that a high score on every item corresponds to the same thing, that is, a higher level of depression. When I name my SPSS variables, I generally give names that help me to remember what scale each question belongs to, which item number, and whether or not it is reverse scored. So, for example, when my survey included the 20-item (question) CESD scale, I named the items dep1, dep2, dep3, etc. However, when a question is reverse worded and needs to be

recoded before it is used in a reliability analysis (such as Cronbach's alpha) or summed with other items, I initially give the variable a name such as revdep4. When self-report methods are used, it is often desirable to include some reverse-worded questions. Self-report responses are prone to many types of bias, including yea-saying or nay-saying bias (some respondents tend to agree or disagree with all items) and social desirability bias (many people tend to report behaviors and attitudes that they believe are socially desirable); see Converse and Presser (1999) for further discussion. To avoid the yea-saying bias, some scales include reverse-worded items. For example, the CESD scale includes statements about feelings and behaviors, and respondents are asked to rate how frequently they experience each of these, using a scale from 1 (rarely or none of the time, less than 1 day a week) to 4 (most or all of the time, 5-7 days a week). It is generally preferable to report final scores for a scale scored in a direction such that a higher score corresponds to more of the attitude or ability that the test is supposed to measure. For example, it is easier to talk about scores on a depression scale, and to interpret correlations of the depression scale with other variables, if a higher score corresponds to more severe depression. (If a depression scale were scored such that a high score corresponded to a low level of depression, then scores on the depression scale would correlate negatively with other measures of negative mood such as anxiety; this would be confusing for the data analyst and the reader.) Most of the items on the CESD scale are worded such that a high frequency of reported occurrence corresponds to a higher level of depression. For example, a high reported frequency of occurrence for the item "I had crying spells" corresponds to a higher level of depression. However, a few of the CESD scale items were reverse worded, for example, "I enjoyed life." For these reverse-worded items, a score of 1 or 2 indicating a low frequency of occurrence corresponds to a higher level of depression. Before combining

scores across items that are worded in different directions (such that for some items, a high score corresponds to more depression, and for other items, a low score corresponds to more depression), it is necessary to recode the direction of scoring on reverse-worded items so that a higher score always corresponds to a higher level of depression. Items 4, 8, 12, and 16 in the appendix were reverse worded. Scores on these reverse-worded items must be recoded when we form a sum of the scores across all 20 items to serve as an overall measure of depression. In the following example, revdep4 is the name of the SPSS variable that corresponds to the reverse-worded depression item "I felt that I was just as good as other people" (item number 4 on the CESD scale). One simple method to reverse the scoring on this item (so that a higher score corresponds to more depression) is as follows: Create a new variable (dep4) that corresponds to 5 - revdep4. If you take a value that is one unit higher than the highest possible score on a measure (in this case, because the possible scores are 1, 2, 3, and 4, we use the value 5) and then subtract each person's score from that reference value, this reverses the direction of scoring. This can be done in SPSS by making the following menu selections: <Transform> <Compute>. In the dialog box for the Compute procedure (see Figure 19.5), the name of the new variable or Target Variable (dep4) is placed in the left-hand side box. The equation to compute a score for this new variable as a function of the score on an existing variable is placed in the right-hand side box titled Numeric Expression (in this case, the numeric expression is 5 - revdep4).
Insert Figure 19.5
Figure 19.5 Computing a Reverse-Scored Variable for Dep4
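Outside SPSS, the same recoding takes one line. Below is a minimal pandas sketch of the 5 - revdep4 step; the data frame and its values are made up for illustration, and the column names revdep4 and dep4 simply mirror the naming convention described above.

```python
import pandas as pd

# Hypothetical responses on the 1-4 CESD response scale.
df = pd.DataFrame({"revdep4": [1, 2, 3, 4, 2]})

# Reverse-score: (highest possible score + 1) - observed score,
# so for a 1-4 scale the reference value is 5 and 1<->4, 2<->3.
df["dep4"] = 5 - df["revdep4"]

print(df)
#    revdep4  dep4
# 0        1     4
# 1        2     3
# 2        3     2
# 3        4     1
# 4        2     3
```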

It is also helpful to create a variable with a different name for the reverse-coded score for each item (e.g., dep4 is the reverse-coded score on revdep4). If you change the direction of scoring by changing the original values and retain the original variable name (as in this example, dep4 = 5 - dep4), it is easy to lose track of which items have and have not already been reverse scored. Based on a preliminary examination of the data, the researcher evaluates whether the two assumptions required for simple summated scales are satisfied (i.e., scores on all the items are positively intercorrelated, and it makes sense to interpret all the items as measures of the same underlying construct or variable).

19.6.3 Sum of Raw Scores

After recoding any reverse-worded items, you can create a total score for each scale by summing scores across items as shown in Figure 19.7. In this first example, a score for selected items from the CESD scale was computed by summing the scores on Items 1 through 5 (with Item 4 reverse scored). The <Transform> and <Compute> menu selections open the SPSS Compute dialog window that appears in Figure 19.7. The name of the new variable (in this example, briefcesd) is placed in the left-hand side window under the Target Variable. The equation that specifies which scores are summed is placed in the Numeric Expression window. To form a score that is the sum of items named dep1 to dep5 (but using dep4 instead of revdep4), you can use the following numeric expression:

briefcesd = dep1 + dep2 + dep3 + dep4 + dep5.   (19.6)

Insert Figure 19.7
Figure 19.7 Computation of a Brief 5-Item Version of the Depression Scale

If an individual has a missing score on one or more individual items, use of this computation (Equation 19.6) will result in a system-missing code for the new scale total score. In this dataset, one participant had a system-missing code

on revdep4 and dep4; therefore, the number of scores is reduced from N = 98 in the entire SPSS data file to N = 97 for analyses that involve the variable briefcesd. If you want to obtain a score for people who have missing values on some items, you can use the MEAN function in the SPSS Compute dialog window (see Figure 19.8); this returns the mean score, based on all non-missing items. For example, if a person is missing a score on dep2, the numeric expression mean(dep1, dep2, dep3, dep4, dep5) will return the mean for all available scores on Items dep1, dep3, dep4, and dep5. If you want to put the total score back into the units that you would have obtained by summing items, multiply this mean by the number of items in the scale (in this case, the number of items was 5).

19.6.4 Sum of z Scores

Summing raw scores may be reasonable when the items are all scored using the same response alternatives or all measured in the same units. However, there are occasions when researchers want to combine information across variables that are measured in quite different units. Suppose a sociologist wants to create an overall index of socioeconomic status (SES) by combining information about the following measures: annual income in dollars, years of education, and occupational prestige rated on a scale from 0 to 100. If raw scores (in dollars, years, and points) were summed, the value of the total score would be dominated by the value of annual income. If we want to give these three factors (income, education, and occupational prestige) equal weight when we combine them, we can convert each variable to a z score or standard score and then form a unit-weighted composite of these z scores:

ztotal = zX1 + zX2 + ... + zXp.   (19.7)

To create a composite of z scores on income, education, and occupational prestige so as to summarize information about SES, you could compute SES = zincome + zeducation + zoccupationalprestige. You could also use the MEAN function to obtain a mean of z scores for the items in a scale.
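The three scoring strategies just described (a raw-score sum, a mean of available items rescaled to sum units, and a unit-weighted composite of z scores) can be sketched in a few lines of pandas. The item names and responses below are illustrative assumptions, not the chapter's data; dep4 is assumed to be already reverse scored.

```python
import pandas as pd

items = ["dep1", "dep2", "dep3", "dep4", "dep5"]
df = pd.DataFrame({
    "dep1": [1, 2, 4, 1],
    "dep2": [1, 3, 4, 2],
    "dep3": [2, 2, 3, 1],
    "dep4": [1, 2, 4, None],   # one respondent skipped this item
    "dep5": [1, 1, 2, 2],
})

# 1. Raw-score sum: like dep1 + ... + dep5 in SPSS, any missing item
#    leaves the total missing (min_count enforces this).
df["briefcesd_sum"] = df[items].sum(axis=1, min_count=len(items))

# 2. Mean of the available items, multiplied back up to total-score
#    units: like MEAN(dep1, ..., dep5) * 5 in SPSS.
df["briefcesd_prorated"] = df[items].mean(axis=1) * len(items)

# 3. Unit-weighted composite of z scores, useful when items are on very
#    different metrics (e.g., income, education, prestige). Missing z
#    scores are simply skipped in the sum here.
z = (df[items] - df[items].mean()) / df[items].std(ddof=1)
df["briefcesd_z"] = z.sum(axis=1)

print(df.round(2))
```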

19.7 Assessment of Internal Homogeneity for Multiple-Item Measures

The internal consistency reliability of a multiple-item scale tells us the degree to which the items on the scale measure the same thing. If the items on a test all measure the same underlying construct or variable, and if all items are scored in the same direction, then the correlations among all the items should be positive.

19.7.2 Cronbach's Alpha Reliability Coefficient: Conceptual Basis

We can summarize information about positive intercorrelations between the items on a multiple-item test by calculating a Cronbach's alpha reliability. Cronbach's alpha has become the most popular form of reliability assessment for multiple-item scales. As seen in an earlier section, as we sum a larger number of items for each participant, the expected value of the ei error terms approaches 0, while the value of p × T increases. In theory, as the number of items (p) included in a scale increases, assuming other characteristics of the data remain the same, the reliability of the measure (the size of the p × T component compared with the size of the e component) also increases. Cronbach's alpha provides a reliability coefficient that tells us, in theory, how reliable our estimate of the stable entity that we are trying to measure is when we combine scores from p test items (or behaviors or ratings by judges). Cronbach's alpha uses the mean of all the inter-item correlations (for all pairs of items or measures) to assess the stability or consistency of measurement. Cronbach's alpha can be understood as a generalization of the Spearman-Brown prophecy formula; we calculate the mean inter-item correlation (r) to assess the degree of agreement among individual test items, and then we predict the reliability coefficient for a p-item test from the correlations among all these single-item measures. Another possible interpretation of Cronbach's alpha is that it is, essentially, the average of all possible split-half reliabilities. Here is one formula for Cronbach's alpha, from Carmines and Zeller (1979, p. 44):

α = pr / [1 + r(p - 1)],   (19.11)

where p is the number of items on the test and r is the mean of the inter-item correlations. The size of Cronbach's alpha depends on the following two factors: As p (the number of items included in the composite scale) increases, and assuming that r stays the same, the value of Cronbach's alpha increases. As r (the mean of the correlations among items or measures) increases, assuming that the number of items p remains the same, Cronbach's alpha increases. It follows that we can increase the reliability of a scale by adding more items (but only if doing so does not decrease r, the mean inter-item correlation) or by modifying items to increase r (either by dropping items with low item-total correlations or by writing new items that correlate highly with existing items). There is a trade-off: If the inter-item correlation is high, we may be able to construct a reasonably reliable scale with few items, and of course, a brief scale is less costly to use and less cumbersome to administer than a long scale. Note that all items must be scored in the same direction prior to summing. Items that are scored in the opposite direction relative to other items on the scale would have negative correlations with other items, and this would reduce the magnitude of the mean inter-item correlation. Researchers usually hope to be able to construct a reasonably reliable scale that does not have an excessively large number of items. Many published measures of attitudes or personality traits include between 4 and 20 items for each trait. Ability or achievement tests (such as IQ) may require much larger numbers of measurements to produce reliable results. Note that when the items are all dichotomous (such as true/false), Cronbach's alpha may still be used to assess the homogeneity of response across items. In this situation, it is sometimes called a Kuder-Richardson 20 (KR-20) reliability coefficient. However, Cronbach's alpha is not appropriate for use with items that have categorical responses with more than two categories.
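Formula 19.11 is easy to check directly. The sketch below implements this standardized (correlation-based) form of Cronbach's alpha and plugs in the mean inter-item correlation implied by the five-item correlation matrix reported later in Figure 19.15; it also shows how the same mean correlation would project to a 20-item scale. The small gap between the computed value and the .614 reported by SPSS is just rounding of the printed correlations.

```python
import numpy as np

def standardized_alpha(r_bar: float, p: int) -> float:
    """Cronbach's alpha from the mean inter-item correlation r_bar (Formula 19.11)."""
    return (p * r_bar) / (1 + r_bar * (p - 1))

# The ten off-diagonal correlations among dep1-dep5, as printed in Figure 19.15.
r = np.array([.380, .555, .302, .062, .394, .193, .074, .446, .115, -.129])
r_bar = r.mean()

print(round(r_bar, 3))                          # ~0.239
print(round(standardized_alpha(r_bar, 5), 3))   # ~0.611, close to the reported .614
print(round(standardized_alpha(r_bar, 20), 3))  # ~0.863: same mean r, more items, higher alpha
```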

19.7.3 Cronbach's Alpha for Five Selected CESD Scale Items

Ninety-seven students filled out the 20-item CESD scale (items shown in the appendix to this chapter) as part of a survey. The names given to these 20 items in the SPSS data file that appears in Table 19.2 were dep1 to dep20. Questions 4, 8, 12, and 16 were reverse worded, and therefore, it was necessary to recode the scores on these items. The recoded values were placed in variables with the names dep4, dep8, dep12, and dep16. The SPSS reliability procedure was used to assess the internal consistency reliability of their responses. The value of Cronbach's alpha is an index of the internal consistency reliability of the depression score formed by summing the first 5 items. In this first example, only the first 5 items (dep1, dep2, dep3, dep4, and dep5) were included. To run SPSS reliability, the following menu selections were made, starting from the top-level menu for the SPSS data worksheet (see Figure 19.11): <Analyze> <Scale> <Reliability>. The reliability procedure dialog box appears in Figure 19.12. The names of the 5 items on the CESD scale were moved into the variable list for this procedure. The Statistics button was clicked to request additional output; the Reliability Analysis: Statistics window appears in Figure 19.13. In this example, Scale if item deleted in the Descriptives for box and Correlations in the Inter-Item box were checked. The syntax for this procedure appears in Figure 19.14, and the output appears in Figure 19.15.
Insert Figure 19.11
Figure 19.11 SPSS Menu Selections for the Reliability Procedure
Insert Figure 19.12

Figure 19.12 SPSS Reliability Analysis for 5 CESD Scale Items: Dep1, Dep2, Dep3, Dep4, and Dep5
Insert Figure 19.13
Figure 19.13 Statistics Selected for SPSS Reliability Analysis
Insert Figure 19.14
Figure 19.14 SPSS Syntax for Reliability Analysis
Insert Figure 19.15
Figure 19.15 SPSS Output From the First Reliability Procedure
NOTE: Scale: BriefCESD.
The Reliability Statistics panel in Figure 19.15 reports two versions of the Cronbach's alpha statistic for the entire scale including all 5 items. For the sum dep1 + dep2 + dep3 + dep4 + dep5, Cronbach's alpha estimates the proportion of the variance in this total that is due to p × T, the part of the score that is stable or consistent for each participant across all 5 items. A score can be formed by summing raw scores (the sum of dep1, dep2, dep3, dep4, and dep5) or by summing z scores, that is, standardized scores (zdep1 + zdep2 + ... + zdep5). The first value, α = .59, is the reliability for the scale formed by summing raw scores; the second value, α = .61, is the reliability for the scale formed by summing z scores across items. In this example, these two versions of Cronbach's alpha (raw score and standardized score) are nearly identical. They generally differ from each other more when the items that are included in the sum are measured using different scales with different variances (as in the earlier example of an SES scale based on a sum of income, occupational prestige, and years of education). Recall that Cronbach's alpha, like other reliability coefficients, can be interpreted as a proportion of variance. Approximately 60% of the variance in the total score for depression, which is obtained by summing the z scores on Items 1 through 5 from the CESD scale, is

shared across these 5 items. A Cronbach's alpha reliability coefficient of .61 would be considered unacceptably poor reliability in most research situations. Subsequent sections describe two different things researchers can do that may improve the Cronbach's alpha reliability: deleting poor items or increasing the number of items. A correlation matrix appears under the heading Inter-Item Correlation Matrix. This reports the correlations between all possible pairs of items. If all items measure the same underlying construct, and if all items are scored in the same direction, then all the correlations in this matrix should be positive and reasonably large. Note that the same item that had a small loading on the depression factor in the preceding FA ("trouble concentrating") also tended to have low or even negative correlations with the other 4 items. The Item-Total Statistics table shows how the statistics associated with the scale formed by summing all five items would change if each individual item were deleted from the scale. The Corrected Item-Total Correlation for each item is its correlation with the sum of the other 4 items in the scale; for example, for dep1, the correlation of dep1 with the corrected total (dep2 + dep3 + dep4 + dep5) is shown. This total is called "corrected" because the score for dep1 is not included when we assess how dep1 is related to the total. If an individual item is a good measure, then it should be strongly related to the sum of all other items in the scale; conversely, a low item-total correlation is evidence that an individual item does not seem to measure the same construct as other items in the scale. The item that has the lowest item-total correlation with the other items is, once again, the question about trouble concentrating. This low item-total correlation is yet another piece of evidence that this item does not seem to measure the same thing as the other 4 items in this scale. The last column in the Item-Total Statistics table reports Cronbach's Alpha if Item Deleted; that is, what is the Cronbach's alpha for the scale if each individual item is deleted?
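The quantities in the Item-Total Statistics table can be reproduced from raw item scores. The sketch below uses the standard definitions (a raw-score alpha computed from item and total variances, the corrected item-total correlation, and alpha with each item removed); it is not the SPSS code itself, and the six rows of example responses are invented, with column names that simply follow the dep1-dep5 convention used in this chapter.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Raw-score alpha: (p / (p - 1)) * (1 - sum of item variances / variance of total)."""
    p = items.shape[1]
    item_vars = items.var(ddof=1)              # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (p / (p - 1)) * (1 - item_vars.sum() / total_var)

def item_total_statistics(items: pd.DataFrame) -> pd.DataFrame:
    rows = {}
    for col in items.columns:
        others = items.drop(columns=col)
        rows[col] = {
            # Correlation of the item with the sum of the *other* items (the "corrected" total).
            "corrected_item_total_r": items[col].corr(others.sum(axis=1)),
            "alpha_if_item_deleted": cronbach_alpha(others),
        }
    return pd.DataFrame(rows).T

# Hypothetical complete-case responses on a 1-4 scale.
df = pd.DataFrame({
    "dep1": [1, 2, 4, 1, 3, 2],
    "dep2": [1, 3, 4, 2, 3, 1],
    "dep3": [2, 2, 3, 1, 4, 2],
    "dep4": [1, 2, 4, 1, 3, 1],
    "dep5": [1, 1, 2, 2, 1, 2],
})
print(round(cronbach_alpha(df), 3))
print(item_total_statistics(df).round(3))
```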

For the item that corresponded to the question "trouble concentrating," deletion of this item from the scale would increase the Cronbach's alpha to .70. Sometimes the deletion of an item that has low correlations with other items on the scale results in an increase in reliability. In this example, we can obtain slightly better reliability for the scale if we drop the item "trouble concentrating," which tends to have small correlations with other items on this depression scale; the sum of the remaining 4 items has a Cronbach's alpha of .70, which represents slightly better reliability.

19.7.4 Improving Cronbach's Alpha by Dropping a Poor Item

The SPSS reliability procedure was performed on the reduced set of 4 items: dep1, dep2, dep3, and dep4. The output from this second reliability analysis (in Figure 19.16) shows that the reduced 4-item scale had Cronbach's alpha reliabilities of .703 (for the sum of raw scores) and .712 (for the sum of z scores). A review of the column headed Cronbach's Alpha if Item Deleted in the new Item-Total Statistics table indicates that the reliability of the scale would become lower if any additional items were deleted from the scale. Thus, we have obtained slightly better reliability from the 4-item version of the scale (Figure 19.16) than for the 5-item version of the scale (Figure 19.15). The 4-item scale had better reliability because the mean inter-item correlation was higher after the item "trouble concentrating" was deleted.
Insert Figure 19.16
Figure 19.16 Output for the Second Reliability Analysis: Scale Reduced to Four Items
NOTE: Item "trouble concentrating" has been dropped.

19.7.5 Improving Cronbach's Alpha by Increasing the Number of Items

Other factors being equal, Cronbach's alpha reliability tends to increase as p, the number of items in the scale, increases. For example, we obtain a higher Cronbach's alpha when we use all 20 items in the full-length CESD scale than when we examine just the first 5 items.

The output from the SPSS reliability procedure for the full 20-item CESD scale (with Items 4, 8, 12, and 16 reverse scored) appears in Figure 19.17. For the full scale formed by summing scores across all 20 items, the Cronbach's alpha was .88.
Insert Figure 19.17
Figure 19.17 SPSS Output: Cronbach's Alpha Reliability for the 20-Item CESD Scale

19.7.6 A Few Other Methods of Reliability Assessment for Multiple-Item Measures

19.7.6.1 Split-Half Reliability

A split-half reliability for a scale with p items is obtained by dividing the items into two sets (each with p/2 items). This can be done randomly or systematically; for example, the first set might consist of odd-numbered items and the second set might consist of even-numbered items. Separate scores are obtained for the sum of the Set 1 items (X1) and the sum of the Set 2 items (X2), and a Pearson r (r12) is calculated between X1 and X2. However, this r12 correlation between X1 and X2 is the reliability for a test with only p/2 items; if we want to know the reliability for the full test that consists of twice as many items (all p items, in this example), we can predict the reliability of the longer test using the Spearman-Brown prophecy formula (Carmines & Zeller, 1979):

rXX = 2r12 / (1 + r12),   (19.12)

where r12 is the correlation between the scores based on the split-half versions of the test (each with p/2 items), and rXX is the reliability for a score based on all p items. Depending on the way in which items are divided into sets, the value of the split-half reliability can vary. Cronbach's alpha can be interpreted as the mean of all possible different split-half reliabilities.
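A short sketch of the split-half computation and the Spearman-Brown step-up in Formula 19.12 follows. The odd/even split and the example responses are illustrative assumptions; a different split would generally yield a somewhat different value, which is one reason alpha (the average over all possible splits) is usually preferred.

```python
import pandas as pd

def split_half_reliability(items: pd.DataFrame) -> float:
    """Odd/even split-half reliability, stepped up with the Spearman-Brown formula (19.12)."""
    half1 = items.iloc[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    half2 = items.iloc[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r12 = half1.corr(half2)                   # reliability of a half-length test
    return (2 * r12) / (1 + r12)              # predicted reliability of the full-length test

# Hypothetical responses on a 1-4 scale.
df = pd.DataFrame({
    "dep1": [1, 2, 4, 1, 3, 2],
    "dep2": [1, 3, 4, 2, 3, 1],
    "dep3": [2, 2, 3, 1, 4, 2],
    "dep4": [1, 2, 4, 1, 3, 1],
})
print(round(split_half_reliability(df), 3))
```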

19.7.6.2 Parallel Forms Reliability

Sometimes it is desirable to have two versions of a test that include different questions but that yield comparable information; these are called parallel forms. Parallel forms of a test, such as the Eysenck Personality Inventory, are often designated Form A and Form B. Parallel forms are particularly useful in repeated measures studies where we would like to test some ability or attitude on two occasions, but we want to avoid repeating exactly the same questions. Parallel forms reliability is similar to split-half reliability, except that when parallel forms are developed, more attention is paid to matching items so that the two forms contain similar types of questions. For example, consider Eysenck's Extraversion scale. Both Form A and Form B include similar numbers of items that assess each aspect of extraversion: for instance, enjoyment of social gatherings, comfort in talking with strangers, sensation seeking, and so forth. A Pearson r between scores on Form A and Form B is a typical way of assessing reliability; in addition, however, a researcher wants scores on Form A and Form B to yield the same means, variances, and so forth, so these should also be assessed.

19.9 Validity Assessment

Validity of a measurement essentially refers to whether the measurement really measures what it purports to measure. In psychological and educational measurement, the degree to which scores on a measure correspond to the underlying construct that the measure is supposed to assess is called construct validity. (Some textbooks used to list construct validity as one of several types of measurement validity; in recent years, many authors use the term construct validity to subsume all the forms of validity assessment described below.) For some types of measurement (such as direct measurements of simple physical characteristics), validity is reasonably self-evident. If a researcher uses a tape measure to obtain information about people's heights (whether the measurements are reported in centimeters, inches, feet, or other units), the researcher does not need to go to great lengths to

persuade readers that this type of measurement is valid. However, there are many situations where the characteristic of interest is not directly observable, and researchers can only obtain indirect information about it. For example, we cannot directly observe intelligence (or depression); but we may infer that a person is intelligent (or depressed) if he or she gives certain types of responses to large numbers of questions that researchers agree are diagnostic of intelligence (or depression). A similar problem arises in medicine, for example, in the assessment of blood pressure. Arterial blood pressure could be measured directly by shunting the blood flow out of the person's artery through a pressure measurement system, but this procedure is invasive (and generally, less invasive measures are preferred). The commonly used method of blood pressure assessment uses an arm cuff; the cuff is inflated until the pressure in the cuff is high enough to occlude the blood flow, and a human listener (or a microphone attached to a computerized system) listens for sounds in the brachial artery while the cuff is deflated. At the point when the sounds of blood flow are detectable (the Korotkoff sounds), the pressure on the arm cuff is read, and this number is used as the index of systolic blood pressure; that is, the blood pressure at the point in the cardiac cycle when the heart is pumping blood into the artery. The point of this example is that this common blood pressure measurement method is quite indirect; research had to be done to establish that measurements taken in this manner were highly correlated with measurements obtained more directly by shunting blood from a major artery into a pressure detection system. Similarly, it is possible to take satellite photographs and use the colors in these images to make inferences about the type of vegetation on the ground, but it is necessary to do validity studies to demonstrate that the type of vegetation that is identified using satellite images corresponds to the type of vegetation that is seen when direct observations are made at ground level. As these examples illustrate, it is quite common in many fields (such as psychology,

medicine, and natural resources) for researchers to use rather indirect assessment methods, either because the variable in question cannot be directly observed or because direct observation would be too invasive or too costly. In cases such as these, whether the measurements are made through self-report questionnaires, by human observers, or by automated systems, validity cannot be assumed; we need to obtain evidence to show that measurements are valid. For self-report questionnaire measurements, two types of evidence are used to assess validity. One type of evidence concerns the content of the questionnaire (content or face validity); the other type of evidence involves correlations of scores on the questionnaire with other variables (criterion-oriented validity).

19.9.1 Content and Face Validity

Both content and face validity are concerned with the content of the test or survey items. Content validity involves the question of whether test items represent all theoretical dimensions or content areas. For example, if depression is theoretically defined to include low self-esteem, feelings of hopelessness, thoughts of suicide, lack of pleasure, and physical symptoms of fatigue, then a content-valid test of depression should include items that assess all these symptoms. Content validity may be assessed by mapping out the test contents in a systematic way and matching them to elements of a theory or by having expert judges decide whether the content coverage is complete. A related issue is whether the instrument has face validity; that is, does it appear to measure what it says it measures? Face validity is sometimes desirable, when it is helpful for test takers to be able to see the relevance of the measurements to their concerns, as in some evaluation research studies where participants need to feel that their concerns are being taken into account.

17 If a test is an assessment of knowledge (e.g., knowledge about dietary guidelines for blood glucose management for diabetic patients), then content validity is crucial. Test questions should be systematically chosen so that they provide reasonably complete coverage of the information (e.g., What are the desirable goals for the proportions and amounts of carbohydrate, protein, and fat in each meal? When blood sugar is tested before and after meals, what ranges of values would be considered normal?). When a psychological test is intended for use as a clinical diagnosis (of depression, for instance), clinical source books such as the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) might be used to guide item selection, to ensure that all relevant facets of depression are covered. More generally, a well-developed theory (about ability, personality, mood, or whatever else is being measured) can help a researcher map out the domain of behaviors, beliefs, or feelings that questions should cover to have a content-valid and comprehensive measure. However, sometimes, it is important that test takers should not be able to guess the purpose of the assessment, particularly in situations where participants might be motivated to fake good, fake bad, lie, or give deceptive responses. There are two types of psychological tests that (intentionally) do not have high face validity: projective tests and empirically keyed objective tests. One well-known example of a projective test is the Rorschach test, in which people are asked to say what they see when they look at ink blots; a diagnosis of psychopathology is made if responses are bizarre. Another is the Thematic Apperception Test, in which people are asked to tell stories in response to ambiguous pictures; these stories are scored for themes such as need for achievement and need for affiliation. In projective tests, it is usually not obvious to participants what motives are being assessed, and because of this, test takers should not be able to engage in impression

18 management or faking. Thus, projective tests intentionally have low face validity. Some widely used psychological tests were constructed using empirical keying methods; that is, test items were chosen because the responses to those questions were empirically related to a psychiatric diagnosis (such as depression), even though the question did not appear to have anything to do with depression. For example, persons diagnosed with depression tend to respond False to the MMPI (Minnesota Multiphasic Personality Inventory) item I sometimes tease animals; this item was included in the MMPI depression scale because the response was (weakly) empirically related to a diagnosis of depression, although the item does not appear face valid as a question about depression (Wiggins, 1973). Face validity can be problematic; people do not always agree about what underlying characteristic(s) a test question measures. Gergen, Hepburn, and Fisher (1986) demonstrated that when items taken from one psychological test (the Rotter Internal/External Locus of Control scale) were presented to people out of context and people were asked to say what trait they thought the questions assessed, they generated a wide variety of responses. 19.9.2 Criterion-Oriented Types of Validity Content validity and face validity are assessed by looking inside a test to see what material it contains and what the questions appear to measure. Criterion-oriented validity is assessed by examining correlations of scores on the test with scores on other variables that should be related to it if the test really measures what it purports to measure. If the CESD scale really is a valid measure of depression, for example, scores on this scale should be correlated with scores on other existing measures of depression that are thought to be valid, and they should predict behaviors that are known or theorized to be associated with depression. 19.9.2.1 Convergent Validity

Convergent validity is assessed by checking to see if scores on a new test of some characteristic X correlate highly with scores on existing tests that are believed to be valid measures of that same characteristic. For example, do scores on a new brief IQ test correlate highly with scores on well-established IQ tests such as the WAIS or the Stanford-Binet? Are scores on the CESD scale closely related to scores on other depression measures such as the BDI? If a new measure of a construct has reasonably high correlations with existing measures that are generally viewed as valid, this is evidence of convergent validity.

19.9.2.2 Discriminant Validity

Equally important, scores on X should not correlate with things the test is not supposed to measure (discriminant validity). For instance, researchers sometimes try to demonstrate that scores on a new test are not contaminated by social desirability bias by showing that these scores are not significantly correlated with scores on the Crowne-Marlowe Social Desirability scale or other measures of social desirability bias.

19.9.2.3 Concurrent Validity

As the name suggests, concurrent validity is evaluated by obtaining correlations of scores on the test with current behaviors or current group memberships. For example, if persons who are currently clinically diagnosed with depression have higher mean scores on the CESD scale than persons who are not currently diagnosed with depression, this would be one type of evidence for concurrent validity.

19.9.2.4 Predictive Validity

Another way of assessing validity is to ask whether scores on the test predict future behaviors or group membership. For example, are scores on the CESD scale higher for persons who later commit suicide than for people who do not commit suicide?

19.9.3 Construct Validity: Summary

Many types of evidence (including content, convergent, discriminant, concurrent, and predictive validity) may be required to establish that a measure has strong construct validity; that is, that it really measures what the test developer says it measures, and it predicts the behaviors and group memberships that it should be able to predict. Westen and Rosenthal (2003) suggested that researchers should compare a matrix of obtained validity coefficients or correlations with a target matrix of predicted correlations and compute a summary statistic to describe how well the observed pattern of correlations matches the predicted pattern (a simplified sketch of this idea appears at the end of this section). This provides a way of quantifying information about construct validity based on many different kinds of evidence. Although the preceding examples have used psychological tests, validity questions certainly arise in other domains of measurement. For example, referring to the example discussed earlier, when the colors in satellite images are used to make inferences about the types and amounts of vegetation on the ground, are those inferences correct? Indirect assessments are sometimes used because they are less invasive (e.g., as discussed earlier, it is less invasive to use an inflatable arm cuff to measure blood pressure) and sometimes because they are less expensive (broad geographical regions can be surveyed more quickly by taking satellite photographs than by having observers on the ground). Whenever indirect methods of assessment are used, validity assessment is required. Multiple-item assessments of some variables (such as depression) may be useful or even necessary to achieve validity as well as reliability. How can we best combine information from multiple measures? This brings us back to a theme that has arisen repeatedly throughout the book; that is, we can often summarize the information in a set of p variables or items by creating a weighted linear composite or, sometimes, just a unit-weighted sum of scores for the set of p variables.
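One simplified way to make the Westen and Rosenthal (2003) suggestion concrete is to correlate the vector of obtained validity coefficients with the vector of theoretically predicted ones. Their approach uses more refined summary statistics than this, so treat the sketch below, whose coefficients are entirely made up, only as an illustration of the general logic of matching an observed pattern to a target pattern.

```python
import numpy as np

# Criterion measures, the correlations predicted from theory, and the
# validity coefficients actually obtained (all values invented).
criteria  = ["other_dep_scale", "anxiety", "social_desirability", "verbal_ability"]
predicted = np.array([0.60, 0.40, 0.00, 0.00])
obtained  = np.array([0.55, 0.46, 0.08, -0.04])

# One rough summary of profile agreement: the correlation between the
# predicted and obtained patterns of validity coefficients.
match = np.corrcoef(predicted, obtained)[0, 1]
print(f"pattern match r = {match:.2f}")
```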

21 19.10 Typical Scale Development Study If an existing multiple-item measure is available for the variable of interest, such as depression, it is usually preferable to employ an existing measure for which we have good evidence about reliability and validity. However, occasionally, a researcher would like to develop a measure for some construct that has not been measured before or develop a different way of measuring a construct for which the existing tests are flawed. An outline of a typical research process for scale development appears in Figure 19.19. In this section, the steps included in this diagram are discussed briefly. Although the examples provided involve self-report questionnaire data, comparable issues are involved in combining physiological measures or observational data. Insert Figure 19.19 Figure 19.19 Possible Steps in the Development of a Multiple-Item Scale 19.10.1 Generating and Modifying the Pool of Items or Measures When a researcher sets out to develop a measure for a new construct (for which there are no existing measures) or a different measure in a research domain where other measures have been developed, the first step is the generation of a pool of candidate items. There are many ways in which this can be done. For example, to develop a set of self-report items to measure Machiavellianism (a cynical, manipulative attitude toward people), Christie and Geis (1970) drew on the writings of Machiavelli for some items (and also on statements by P. T. Barnum, another notable cynic). To develop measures of love, Rubin (1970) drew on writings about love that ranged from the works of classic poets to the lyrics of popular songs. In some cases, items are borrowed from existing measures; for example, a number of research scales have used items that are part of the MMPI. However, there are copyright restrictions on the use of items that are part of published tests.

22 Brainstorming by experts, and interviews, focus groups, or open-ended questions with members of the population who are the focus of assessment can also provide useful ideas about items. For example, to develop a measure of college student life space, including numbers and types of material possessions, Brackett (2004) interviewed student informants, visited dormitory rooms, and examined merchandise catalogs popular in that age group. A theory can be extremely helpful as guidance in initial item development. The early interview and self-report measures of the global Type A behavior pattern drew on a developing theory that suggested that persons prone to cardiovascular disease tend to be competitive, time urgent, job-involved, and hostile. The behaviors that were identified for coding in the interview thus included interrupting the interviewer and loud or explosive speech. The self-report items on the Jenkins Activity Survey, a self-report measure of global Type A behavior, included questions about eating fast, never having time to get a haircut, and being unwilling to lose in games even when playing checkers with a child (Jenkins, Zyzanski, & Rosenman, 1979). It is useful for the researcher to try to anticipate the factors that will emerge when these items are pretested and FA is performed. If a researcher wants to measure satisfaction with health care, and the researcher believes that there are three separate components to satisfaction (evaluation of practitioner competence, satisfaction with rapport or bedside manner, and issues of cost and convenience), then he or she should pause and evaluate whether the survey includes sufficient items to measure each of these three components. Keeping in mind that a minimum of 4 to 5 items are generally desired for each factor or scale and that not all candidate items may turn out to be good measures, it may be helpful to have something like 8 or 10 candidate items that correspond to each construct or factor that the researcher wants to measure.

23 19.10.2 Administer Survey to Participants The survey containing all the candidate items should be pilot tested on a relatively small sample of participants; it may be desirable to interview or debrief participants to find out whether items seemed clear and plausible and whether response alternatives covered all the options people might want to report. A pilot test can also help the researcher judge how long it will take for participants to complete the survey. After making any changes judged necessary based on the initial pilot tests, the survey should be administered to a sample that is large enough to be used for FA (see Chapter 18 for sample size recommendations). Ideally, these participants should vary substantially on the characteristics that the scales are supposed to measure (because a restricted range of scores on T, the component of the X measures that taps stable individual differences among participants, will lead to lower inter-item correlations and lower scale reliabilities). 19.10.3 Factor Analyze Items to Assess the Number and Nature of Latent Variables or Constructs Using the methods described in Chapter 18, FA can be performed on the scores. If the number of factors that are obtained and the nature of the factors (i.e., the groups of variables that have high loadings on each factor) are consistent with the researchers expectations, then the researcher may want to go ahead and form one scale that corresponds to each factor. If the FA does not turn out as expected, for example, if the number of factors is different from what was anticipated or if the pattern of variables that load on each factor is not as expected, the researcher needs to make a decision. If the researcher wants to make the FA more consistent with a priori theoretical constructs, it may be necessary to go back to Step 1 to revise, add, and drop items. If the researcher sees patterns in the data that were not anticipated from theoretical evaluations (but the patterns make sense), he or she may want to use the empirical

factor solution (instead of the original conceptual model) as a basis for grouping items into scales. Also, if a factor that was not anticipated emerges in the FA, but there are only a few items to represent that factor, the researcher may want to add or revise items to obtain a better set of questions for the new factor. In practice, a researcher may have to go through these first three steps several times; that is, the researcher may run FA, modify items, gather additional data, and run a new FA several times until the results of the FA are clear and the factors correspond to meaningful groups of items that can be summed to form scales. Note that some scales are developed based on the predictive utility of items rather than on the factor structure; for these, DA (rather than FA) might be the data reduction method of choice. For example, items included in the Jenkins Activity Survey (Jenkins et al., 1979) were selected because they were useful predictors of a person having a future heart attack.

19.10.4 Development of Summated Scales

After FA (or DA), the researcher may want to form scales by combining scores on multiple measures or items. There are numerous options at this point.
1. One or several scales may be created (depending on whether the survey or test measures just one construct or several separate constructs).
2. Composition of scales (i.e., selection of items) may be dictated by conceptual grouping of items or by empirical groups of items that emerge from FA. In most scale development research, researchers hope that the items that are grouped to form scales can be justified both conceptually and empirically.
3. Scales may involve combining raw scores or standardized scores (z scores) on multiple items. Usually, if the variables use drastically different measurement units (as in the example above where an SES index was formed by combining income, years of

education, and occupational prestige rating), z scores are used to ensure that each variable has equal importance.
4. Scales may be based on sums or means of scores across items.

19.10.5 Assess Scale Reliability At a minimum, the internal consistency of each scale is assessed, usually by obtaining a Cronbachs alpha. Test-retest reliability should also be assessed if the construct is something that is expected to remain reasonably stable across time (such as a personality trait), but high test-retest reliability is not a requirement for measures of things that are expected to be unstable across time (such as moods). 19.10.6 Assess Scale Validity If there are existing measures of the same theoretical construct, the researcher assesses convergent validity by checking to see whether scores on the new measure are reasonably highly correlated with scores on existing measures. If the researcher has defined the construct as something that should be independent of verbal ability or not influenced by social desirability, the researcher should assess discriminant validity by making sure that correlations with measures of verbal ability and social desirability are close to 0. To assess concurrent and predictive validity, scores on the scale can be used to predict current or future group membership and current or future behaviors, which it should be able to predict. For example, scores on Zick Rubins Love Scale (Rubin, 1970) were evaluated to see if they predicted self-rated likelihood that the relationship would lead to marriage and whether scores predicted which dating couples would split up and which ones would stay together within the year or two following the initial survey. 19.10.7 Iterative Process At any point in this process, if results are not satisfactory, the researcher may cycle

26 back to an earlier point in the process; for example, if the factors that emerge from FA are not clear or if internal consistency reliability of scales is low, the researcher may want to generate new items and collect more data. In addition, particularly for scales that will be used in clinical diagnosis or selection decisions, normative data are required; that is, the mean, variance, and distribution shape of scores must be evaluated based on a large number of people (at least several thousand). This provides test users with a basis for evaluation. For example, for the BDI (Beck et al., 1961), the following interpretations for scores have been suggested based on normative data for thousands of test takers: scores from 5 to 9, normal mood variations; 10 to 18, mild to moderate depression; 19 to 29, moderate to severe depression; and 30 to 63, severe depression. Scores of 4 or below on the BDI may be interpreted as possible denial of depression or faking good; it is very unusual for people to have scores that are this low on the BDI. 19.10.8 Create Final Scale When all the criteria for good quality measurement appear to be satisfied (i.e., the data analyst has obtained a reasonably brief list of items or measurements that appears to provide reliable and valid information about the construct of interest), a final version of the scale may be created. Often such scales are first published as tables or appendixes in journal articles. A complete report for a newly developed scale should include the instructions for the test respondents (e.g., what period of time should the test taker think about when reporting frequency of behaviors or feelings?); a complete list of items, statements, or questions; the specific response alternatives; indication whether any items need to be reverse coded; and scoring instructions. Usually, the scoring procedure consists of reversing the direction of scores for any reverse-worded items and then summing the raw scores across all items for each scale. If subsequent research provides additional evidence that the scale is reliable and

27 valid, and if the scale measures something that has a reasonably wide application, at some point, the test author may copyright the test and perhaps have it distributed on a fee per use basis by a test publishing company. Of course, as years go by, the contents of some test items may become dated. Therefore, periodic revisions may be required to keep test item wording current. 19.11 Summary To summarize, measurements need to be reliable. When measurements are unreliable, it leads to two problems. Low reliability may imply that the measure is not valid (if a measure does not detect anything consistently, it does not make much sense to ask what it is measuring). In addition, when researchers conduct statistical analyses, such as correlations, to assess how scores on an X variable are related to scores on other variables, the relationship of X to other variables becomes weaker as the reliability of X becomes smaller; the attenuation of correlation due to unreliability of measurement was discussed in Chapter 7. To put it more plainly, when a researcher has unreliable measures, relationships between variables usually appear to be weaker. It is also essential for measures to be valid: If a measure is not valid, then the study does not provide information about the theoretical constructs that are of real interest. It is also desirable for measures to be sensitive to individual differences, unbiased, relatively inexpensive, not very invasive, and not highly reactive. Research methods textbooks point out that each type of measurement method (such as direct observation of behavior, self-report, physiological or physical measurements, and archival data) has strengths and weaknesses. For example, self-report is generally low cost, but such reports may be biased by social desirability (i.e., people report attitudes and behaviors that they believe are socially desirable, instead of honestly reporting their actual attitudes and behaviors). When it is possible to do so, a study can be made much stronger by

28 including multiple types of measurements (this is called triangulation of measurement). For example, if a researcher wants to measure anxiety, it would be desirable to include direct observation of behavior (e.g., ums and ahs in speech and rapid blinking), self-report (answers to questions that ask about subjective anxiety), and physiological measures (such as heart rates and cortisol levels). If an experimental manipulation has similar effects on anxiety when it is assessed using behavioral, self-report, and physiological outcomes, the researcher can be more confident that the outcome of the study is not attributable to a methodological weakness associated with one form of measurement, such as self-report. The development of a new measure can require a substantial amount of time and effort. It is relatively easy to demonstrate reliability for a new measurement, but the evaluation of validity is far more difficult and the validity of a measure can be a matter of controversy. When possible, researchers may prefer to use existing measures for which data on reliability and validity are already available. For psychological testing, a useful online resource is the American Psychological Association FAQ on testing: www.apa.org/science/testing.html. Another useful resource is a directory of published research tests on the Educational Testing Service (ETS) Test Link site www.ets.org/testcoll/index.html, which has information on about 20,000 published psychological tests. Although most of the variables used as examples in this chapter were self-report measures, the issues discussed in this chapter (concerning reliability, validity, sensitivity, bias, cost effectiveness, invasiveness, and reactivity) are relevant for other types of data, including physical measurements, medical tests, and observations of behavior. Appendix: The CESD Scale INSTRUCTIONS: Using the scale below, please circle the number before each statement

which best describes how often you felt or behaved this way DURING THE PAST WEEK.
1 = Rarely or none of the time (less than 1 day)
2 = Some or a little of the time (1-2 days)
3 = Occasionally or a moderate amount of time (3-4 days)
4 = Most of the time (5-7 days)
The total CESD depression score is the sum of the scores on the following twenty questions, with Items 4, 8, 12, and 16 reverse scored.
1. I was bothered by things that usually don't bother me.
2. I did not feel like eating; my appetite was poor.
3. I felt that I could not shake off the blues even with help from my family or friends.
4. I felt that I was just as good as other people. (reverse worded)
5. I had trouble keeping my mind on what I was doing.
6. I felt depressed.
7. I felt that everything I did was an effort.
8. I felt hopeful about the future. (reverse worded)
9. I thought my life had been a failure.
10. I felt fearful.
11. My sleep was restless.
12. I was happy. (reverse worded)
13. I talked less than usual.
14. I felt lonely.
15. People were unfriendly.
16. I enjoyed life. (reverse worded)
17. I had crying spells.
18. I felt sad.
19. I felt that people disliked me.
20. I could not get going.
A total score on the CESD is obtained by reversing the direction of scoring on the four reverse-worded items (4, 8, 12, and 16), so that a higher score on all items corresponds to a higher level of depression, and then summing the scores across all 20 items.
Appendix Source: Radloff, L. S. (1977). The CES-D Scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385-401.
WWW Links: Resources on Psychological Measurement
American Psychological Association: www.apa.org/science/testing.html
Goldberg's International Personality Item Pool (royalty-free versions of scales that measure Big Five personality traits): http://ipip.ori.org/ipip/
Mental Measurements Yearbook Test Reviews Online: http://buros.unl.edu/buros/jsp/search.jsp
PsychWeb information on psychological tests: www.psychweb.com/tests/psych_tests

Figure 19.5 Computing a Recoded Variable (Dep4) From the Reverse-Scored Item Revdep4

Figure 19.7 Computation of Brief Five-Item Version of Depression Scale: Adding Scores Across Items Using Plus Signs

Figure 19.8 Combining Scores From Five Items Using the SPSS MEAN Function (Multiplied by Number of Items)

Figure 19.11 SPSS Menu Selections for Reliability Procedure

Figure 19.12 SPSS Reliability Analysis for Five CESD Items: Dep1, Dep2, Dep3, Dep4, Dep5
NOTE: Dep4 is the recoded version of revdep4, corrected so that the direction of scoring is the same as for other items on the scale.

Figure 19.13 Statistics Selected for SPSS Reliability Analysis

Figure 19.14 SPSS Syntax for Reliability Analysis

Figure 19.15 SPSS Output From the First Reliability Procedure for Scale: Briefcesd

Reliability Statistics
Cronbach's Alpha    Cronbach's Alpha Based on Standardized Items    N of Items
.585                .614                                            5

Inter-Item Correlation Matrix
        dep1    dep2    dep3    dep4    dep5
dep1    1.000   .380    .555    .302    .062
dep2    .380    1.000   .394    .193    .074
dep3    .555    .394    1.000   .446    .115
dep4    .302    .193    .446    1.000   -.129
dep5    .062    .074    .115    -.129   1.000

Item-Total Statistics
        Scale Mean if    Scale Variance if    Corrected Item-      Squared Multiple    Cronbach's Alpha
        Item Deleted     Item Deleted         Total Correlation    Correlation         if Item Deleted
dep1    5.6701           5.786                .511                 .341                .455
dep2    5.7010           5.941                .398                 .195                .504
dep3    5.6082           4.845                .615                 .434                .365
dep4    7.4742           5.710                .294                 .237                .562
dep5    4.8247           7.042                .032                 .055                .703

Figure 19.16 Output for the Second Reliability Analysis: Scale Reduced to Four Items
NOTE: dep5, "trouble concentrating," has been dropped.

Reliability Statistics
Cronbach's Alpha    Cronbach's Alpha Based on Standardized Items    N of Items
.703                .712                                            4

Inter-Item Correlation Matrix
        dep1    dep2    dep3    dep4
dep1    1.000   .380    .555    .302
dep2    .380    1.000   .394    .193
dep3    .555    .394    1.000   .446
dep4    .302    .193    .446    1.000

Item-Total Statistics
        Scale Mean if    Scale Variance if    Corrected Item-      Squared Multiple    Cronbach's Alpha
        Item Deleted     Item Deleted         Total Correlation    Correlation         if Item Deleted
dep1    3.1753           4.625                .541                 .341                .617
dep2    3.2062           4.811                .407                 .194                .686
dep3    3.1134           3.810                .633                 .419                .542
dep4    4.9794           4.166                .410                 .204                .702

Figure 19.17 SPSS Output: Cronbach's Alpha Reliability for the 20-Item CES-D Scale
Scale: CESDTotal

Case Processing Summary
                     N     %
Cases   Valid        94    95.9
        Excluded (a)  4     4.1
        Total        98    100.0
a. Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha    N of Items
.880                20

Figure 19.19 Possible Steps in the Development of a Multiple-Item Scale
