What Is a Scale? Rating Scales Limitations Levels of Measurement Attitude Scales Likert-type Scale Construction Initial Administration and Scoring Selecting the Final Items Validity and Reliability Limitations Semantic Differential Selection of Terms Length and Layout Scoring Limitations Performance Rating Scales Limitations Consumer Rating Scales Limitations Sensory Evaluation Limitations Summary
What Is a Scale?
The word scale comes from the Latin scala, meaning a ladder or flight of steps. A scale represents a series of ordered steps at fixed intervals used as a standard of measurement. Scales are used to rank people's judgments of objects, events, or other people from low to high or from poor to good. Commonly used scales in behavioral research include attitude scales designed to measure people's opinions on social issues, employee rating scales to measure job-related performance, scales for determining socioeconomic status used in sociological research, product rating scales used in consumer research, and sensory evaluation scales to judge the quality of food, air, and other phenomena. These scales provide numerical scores that can be used to compare individuals and groups.
Rating Scales
There are various methods for making ratings. With graphic rating scales, the respondent places a mark along a continuous line. The ends and perhaps the midpoint of the line are named, but not the intervening points. The person can make a mark at any point along the line. The score is computed by measuring the distance of the check mark from the left end of the scale.
Example
Place a checkmark somewhere along the scale to indicate the quality of this loudspeaker system.
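The scoring rule for a graphic rating scale (distance of the mark from the left end) can be sketched in a few lines of Python. This is only an illustration: the function name, the millimeter units, and the rescaling to a 0-100 range are assumptions made for the example, not part of the procedure described above.

```python
def graphic_scale_score(mark_mm, line_length_mm, scale_max=100.0):
    """Convert a measured mark position into a score.

    mark_mm        -- distance of the mark from the left end of the line
    line_length_mm -- total length of the printed line
    scale_max      -- score assigned to the right end (assumed, for convenience)
    """
    if line_length_mm <= 0:
        raise ValueError("line length must be positive")
    if not 0 <= mark_mm <= line_length_mm:
        raise ValueError("the mark must fall on the line")
    # The score is proportional to the distance from the left end.
    return scale_max * mark_mm / line_length_mm


# A mark 75 mm along a 100 mm line.
print(graphic_scale_score(75, 100))
```

Rescaling by the line length keeps scores comparable when the same scale is printed at different sizes.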
Limitations
Rating scales are easy to construct and easy to answer, but they may not be reliable. If the respondent were asked to answer the same question tomorrow, how similar would the ratings be? Also, a single rating may catch only one aspect of a more complex concept. Even something as simple as rating a sound system may involve several aspects such as frequency range, distortion with volume change, etc. The problem can be solved by using a multi-item measure, an instrument that includes more than one question. Such scales are frequently used in the measurement of attitudes.
Levels of Measurement
Before describing more complex scales, it is necessary to look more closely at what scale numbers actually represent. When interpreting the meaning of a score on a scale, it is necessary to have a clear idea of the level of measurement that the scale represents. The process of assigning numbers to events, ratings, or behavior occurs at a particular level of measurement. These levels can be nominal, ordinal, interval, or ratio.
Nominal measures are qualitative or categorical, providing no information about quantity. Presence or absence is perhaps the simplest form, often indicated by 1 versus 0, a dichotomous or binary classification (examples are yes/no, present/not present, student/non-student, citizen/non-citizen). Male/female is a nominal measure. There may be more than two categories, for example, homeowner, renter, and other. Numbers may be used, but only to represent categories (e.g., 1 = male, 2 = female).
An ordinal scale provides additional information about size or direction. Street addresses are ordinal measures. They indicate direction but provide no certain information about the distance between individual buildings. A ranking of contest winners into first, second, and third place is ordinal if there is no description of the size of the differences between them.
An interval scale possesses the qualities of an ordinal scale, plus the additional characteristic of equal intervals between scale points. Such scales contain units similar to a temperature scale, on which the difference between 85 and 87 °F is comparable in degrees to the difference between 47 and 49 °F. Time of day (e.g., 10 a.m., 2 p.m.) is measured in equal intervals.
Ratio scales not only have equal intervals, but have the additional property of an absolute zero point. This permits comparisons such as "twice as much" or "three times as many." Grade point average (GPA) is a ratio measure (a student who received all Fs would have a GPA of 0). A student with a 4.0 has twice as many grade points as a student with a 2.0. Time, distance, and physical qualities such as weight, age, and size are easily expressed in ratio measures.
Most subjective rating scales, like those described in this chapter, are ordinal rather than interval or ratio. In an opinion survey, we can say that someone who strongly disagrees is more opposed than someone who slightly disagrees, but we don't know how much difference there is between the two attitudes. The level of measurement in a study influences comparisons and generalizations that are justified, as well as the selection of statistical tests. Here is a summary of the characteristics of the four levels of measurement:
Nominal: categories only; no information about quantity.
Ordinal: direction or order; the distance between points is unknown.
Interval: direction, plus equal intervals between scale points.
Ratio: direction, equal intervals, and an absolute zero point.
Attitude Scales
An attitude scale is a special type of questionnaire designed to produce scores indicating the intensity and direction (for or against) of a person's feelings about an object or event. There are several types of scales that can be constructed, but the most common is the Likert-type. The scale is constructed so that all its questions concern a single issue. Attitude scales are often used in attitude change experiments. One group of people is asked to fill out the scale twice, once before some event, such as reading a persuasive argument, and again afterward. A control group fills out the scale twice without reading the argument. The control group is used to measure exposure or practice effects. The change in the scores of the experimental group relative to the control group, whether their attitudes have become more or less favorable, indicates the effects of the argument.
Likert-type Scale
A Likert-type scale, named for Rensis Likert (1932) who developed this type of attitude measurement, presents a list of statements on an issue to which the respondent indicates degree of agreement using categories such as Strongly Agree, Agree, Undecided, Disagree, and Strongly Disagree.
Construction
The first step is to collect statements on a topic from people holding a wide range of attitudes, from extremely favorable to extremely unfavorable. Duplications and irrelevant statements are discarded. For example, college students provided the following examples of positive and negative statements about marijuana:
  I don't approve of something that puts you out of normal state of mind.
  It has its place.
  It corrupts the individual.
  Marijuana does some people a lot of good.
  If marijuana is taken safely, its effects can be quite enjoyable.
  I think it is horrible and corrupting.
  It is usually the drug people start on before addiction.
  It is perfectly healthy and should be legalized.
  Its use by an individual could be the beginning of a sad situation.
A Likert scale includes only statements that are clearly favorable or clearly unfavorable. Statements that are neutral, ambiguous, or borderline are eliminated. This can be accomplished by asking a few people, who are called "judges" in the procedure, to rate each statement as to whether it expresses a favorable or unfavorable opinion about the topic. Where there is little agreement among these judges or difficulty in deciding whether the item is favorable or unfavorable, the statement is eliminated. For example, the statement "Marijuana use should be taxed heavily" was rejected because it was ambiguous. Some judges thought it was pro-marijuana because it implied legalization, while others felt it was anti-marijuana because it advocated a heavy tax. The statement "Having never tried marijuana, I can't say what effects it would have" would be eliminated because it is neither positive nor negative.
Initial Administration and Scoring
The statements are arranged in random order on a questionnaire. Each statement is followed by five degrees of agreement (strongly agree, agree slightly, undecided, disagree slightly, strongly disagree). Favorable statements are scored 5, 4, 3, 2, and 1, respectively. Unfavorable statements are scored in the reverse direction (1, 2, 3, 4, and 5, respectively).
People who are very favorable toward marijuana use would be expected to strongly agree with the favorable statements and strongly disagree with the unfavorable statements. They would earn a high score on the scale when the item scores are added together. Conversely, people with very unfavorable attitudes would be expected to strongly disagree with the favorable statements and strongly agree with the unfavorable statements, and would score low on the scale. Note the importance of reverse scoring the negative items. A person who strongly disagrees with the statement "Marijuana use corrupts the individual" is expressing a positive attitude toward marijuana use, and hence the item is scored as a 5 rather than 1.
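The scoring rule, including reverse scoring of unfavorable statements, can be written out as a short Python sketch. The names `AGREEMENT_POINTS`, `score_item`, and `total_score` are hypothetical and chosen only for this illustration; the point values follow the 5-to-1 scheme given above.

```python
# Points for each response category on a FAVORABLE statement.
AGREEMENT_POINTS = {
    "strongly agree": 5,
    "agree slightly": 4,
    "undecided": 3,
    "disagree slightly": 2,
    "strongly disagree": 1,
}


def score_item(response, favorable):
    """Score one response; unfavorable statements are reverse scored."""
    points = AGREEMENT_POINTS[response.lower()]
    # On a 1-5 scale, reversing maps 5->1, 4->2, 3->3, 2->4, 1->5.
    return points if favorable else 6 - points


def total_score(responses, favorable_flags):
    """Add the item scores for one respondent."""
    if len(responses) != len(favorable_flags):
        raise ValueError("one favorable/unfavorable flag is needed per item")
    return sum(score_item(r, f) for r, f in zip(responses, favorable_flags))


# Item 1: "Marijuana does some people a lot of good." (favorable)
# Item 2: "It corrupts the individual." (unfavorable)
flags = [True, False]
print(total_score(["strongly agree", "strongly disagree"], flags))  # 10
```

A respondent who strongly agrees with the favorable item and strongly disagrees with the unfavorable one earns the maximum on both, which is the behavior the reverse scoring is meant to produce.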
Selecting the Final Items
The point of constructing the scale is to measure a person's attitude toward something. Thus, a scale should consist of items that distinguish people with a positive attitude on a topic from people with a negative attitude. Here is a method for getting rid of items that do not distinguish between people with different attitudes.
1. Sort the questionnaires from lowest to highest on the basis of the total score (with negative items scored in the reverse direction).
2. Take the top and bottom quarters (which will be the people with the most and least favorable attitudes).
3. For each group, calculate the average (mean) score for each individual item.
4. Keep only those items that distinguish the two groups. In other words, if both the high (very favorable) and low (very unfavorable) scorers rated an item in the same way, that item is not discriminating and should be dropped.
Another way of cleaning up an attitude scale is to use items that cluster or hang together. If people who strongly agree with item #3 also strongly agree with item #5, then it is likely that #3 and #5 are measuring similar or closely related attitudes. Precise assessment requires correlation coefficients (described in Chapter 19), computed either among items or between an individual item and the total score. The final version of the scale is administered and scored as described in the preceding section.
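The four-step selection method above can be sketched as follows. This is a minimal illustration, not a standard routine: the function name is hypothetical, and the cutoff `min_gap` (how far apart the two group means must be for an item to count as discriminating) is an assumption, since the text gives no numerical threshold.

```python
from statistics import mean


def discriminating_items(scored, min_gap=1.0):
    """Return (indexes of items to keep, gap for every item).

    scored  -- one list of item scores per respondent, with unfavorable
               items already reverse scored
    min_gap -- assumed minimum difference between the high and low group
               means for an item to be kept
    """
    # Step 1: sort questionnaires from lowest to highest total score.
    ranked = sorted(scored, key=sum)
    # Step 2: take the bottom and top quarters.
    quarter = max(1, len(ranked) // 4)
    low, high = ranked[:quarter], ranked[-quarter:]
    # Step 3: mean score on each item within each group.
    n_items = len(scored[0])
    gaps = [
        mean(r[i] for r in high) - mean(r[i] for r in low)
        for i in range(n_items)
    ]
    # Step 4: keep only the items that separate the two groups.
    keep = [i for i, gap in enumerate(gaps) if gap >= min_gap]
    return keep, gaps


# Eight respondents, three items; everyone answers the third item alike.
data = [
    [1, 1, 3], [1, 2, 3], [2, 2, 3], [3, 3, 3],
    [3, 3, 3], [4, 4, 3], [5, 4, 3], [5, 5, 3],
]
keep, gaps = discriminating_items(data)
print(keep)  # [0, 1]
```

In this made-up data the third item has the same mean in both groups, so it is dropped, while the first two items are retained.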
of scales whose reliability has already been established is available in Robinson, Shaver, and Wrightsman, Measures of Personality and Psychosocial Attitudes (1991). Chapter 16 (Standardized Tests and Inventories) also lists a number of sources for locating attitude scales. Journal articles and reviews are good sources for references to scales currently in use on specialized topics. There are computerized databases available at many campus and agency libraries. For example, the Health and Psychosocial Instruments (HAPI) database contains information about questionnaires, rating scales, and other instruments used in published studies. For each instrument, there is a brief description of its form and uses, plus information about the authors, year of publication, length, reliability and validity, and published references.
Limitations
There are questions about the validity of attitude scales. Often they predict behavior poorly or not at all. The words on the printed page bear little resemblance to the actual situation. Another problem with attitude scales is the assumption that attitudes lie along a single dimension of favorability. People's opinions on a topic like marijuana are complex and multidimensional. A person may be in favor of reducing the penalties on marijuana possession but not on cultivation or sale, and may want strict penalties for anyone driving under the drug's influence. A single favorability score cannot reflect the specificity of these concerns. Questionnaires allow for a more in-depth and detailed assessment of such complexity.
Semantic Differential
The semantic differential is a procedure developed by psychologist Charles Osgood and his associates to measure the meaning of concepts (Osgood, May, & Miron, 1975). The respondent is asked to rate an object or a concept along a series of scales with opposed adjectives at either end.
The semantic differential is a good instrument for exploring the connotative meaning of things. Connotation refers to the personal meaning of something, as distinct from its physical characteristics. For example, a panther, in addition to being a large cat, connotes stealth and power. Crepes Suzette suggest elegance and expensive dining.
In the research that developed the semantic differential, three major categories of connotative meaning were found: value (e.g., good-bad, ugly-beautiful), activity (e.g., fast-slow, active-passive), and strength (e.g., weak-strong, large-small). Table 10-1 presents four adjective pairs high in value, activity, or strength.
Not surprisingly, the value dimension (good-bad, valuable-worthless) is of greatest importance in evaluative research. When you want to know whether or not people like something, you will probably want to include good-bad, ugly-beautiful, and friendly-unfriendly. Activity and strength are important dimensions in certain circumstances. A comparison of people's images of cities and small towns found major differences on the activity and strength dimensions. Cities were full of bustle, hurry, and activity, while in small towns the pace was more slow, relaxed, and leisurely. Cities were also rated as larger, stronger, and more powerful than small towns.
Other adjectives may be more relevant to a particular topic. An investigation of religious concepts used adjectives closely related to religious belief, such as sacred-profane, mysterious-obvious, and public-private. The nature of the project will determine the selection of adjectives.
The most common error made by inexperienced researchers using this technique is to overestimate the respondents' vocabulary level. Although most college students know the meaning of "profane" and "despotic," a substantial number of students may not, which reduces the validity of the results when these terms are included on a rating scale. Pretesting the adjective pairs is essential for eliminating difficult or ambiguous terms. Even if adjectives have been used by other researchers, it will still be necessary to test them on your particular respondents. Adjectives that have one meaning for one group of people may mean something else to another group.
Counterbalance the order of positive and negative adjectives. Begin some scales with the positive term (happy-sad) and others with the negative term (noisy-quiet). This will prevent the respondent from falling into a fixed pattern of always checking to the right or left. Make sure that people put their marks in the right place. Researchers often use solid lines for the responses and colons as spacers.
It is important that answers be marked on the lines and not on the dots. Tabulating the responses becomes more complicated when people have checked on the dots. When this occurs, you can assign the response a midpoint value such as 2.5. Another possibility is to assign the score to the right or left line in random or alternating order. That is, if a person has checked midway between the second and third line, the response will be scored as a 2 the first time and a 3 the next time this occurs.
Most researchers follow Osgood in using seven-point scales. This includes a midpoint, which is useful when the item is neither happy nor sad or neither light nor dark, but somewhere in the middle. However, if machine scoring limited to a five-point scale can be done cheaply and quickly, this option should be seriously considered. Five-point scales are more easily tabulated by hand, too. Many researchers find that differences among the three scale points to the right or left of the midpoint have little meaning. The direction of response (e.g., whether the cafeteria is seen as a happy place) is more important than whether it is seen as extremely happy, moderately happy, or somewhat happy. If you plan to combine all three categories to the right of the neutral point later, you might as well begin with a smaller number of scale points: five or even three.
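The alternating rule for marks that fall on a spacer rather than on a line can be sketched as below. The representation is an assumption made for the example: a mark on a line is recorded as a whole number, and a mark on the spacer between two lines is recorded as the lower line number plus 0.5. The function name is hypothetical.

```python
def score_marks(positions):
    """Resolve marks that fell on a spacer by alternating down, then up.

    positions -- recorded marks; a whole number means the mark was on a
                 line, and x.5 means it fell between line x and line x + 1
                 (an assumed recording convention)
    """
    scores = []
    round_up = False  # the first ambiguous mark goes to the lower line
    for p in positions:
        if p == int(p):
            scores.append(int(p))
        else:
            lower = int(p)
            scores.append(lower + 1 if round_up else lower)
            round_up = not round_up  # alternate for the next ambiguous mark
    return scores


print(score_marks([2.5, 4, 2.5, 5.5]))  # [2, 4, 3, 5]
```

The simpler alternative mentioned in the text, keeping the midpoint value (2.5) as the score, needs no special handling at all.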
Scoring
On a seven-point scale each level is given a numerical value from 0 to 6 or 1 to 7, going from left to right. The average is computed separately for each pair. Thus, with 0-to-6 scoring, 3 is the midpoint value of the happy-sad scale, whose endpoints are 0 and 6. Anything below 3 means that the item is generally happy, and anything above 3 means that the item is generally sad. In summarizing the results in a report, it is helpful to the reader to reorganize all the scales so that the favorable end is on the left and the unfavorable end on the right. Note that this differs from the order of the scales given to the respondents. Placing all the favorable adjectives on the left in the report allows the reader to see at a glance how the ratings came out. The results can be presented graphically as well as in averages. Figure 10-2 shows student ratings of a reading room in a university library. The room is seen as valuable and strong but relatively low in activity.
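The scoring and the reorganization for the report can be sketched in Python. The example assumes 0-to-6 scoring; the function names, the sample ratings, and the choice of "quiet" as the favorable end for a reading room are all made up for illustration.

```python
from statistics import mean


def pair_means(ratings):
    """Average the ratings separately for each adjective pair.

    ratings -- dict mapping (left_adjective, right_adjective) to a list
               of ratings scored 0 (far left) to 6 (far right)
    """
    return {pair: mean(values) for pair, values in ratings.items()}


def reorient(pair, value, favorable_on_left, top=6):
    """Flip a pair so the favorable adjective is on the left in the report."""
    if favorable_on_left:
        return pair, value
    left, right = pair
    # Reversing the scale maps 0 <-> 6, 1 <-> 5, and so on.
    return (right, left), top - value


# Made-up ratings, in the order the scales were shown to respondents.
ratings = {
    ("happy", "sad"): [1, 2, 0, 1],
    ("noisy", "quiet"): [4, 5, 5, 6],
}
means = pair_means(ratings)
print(reorient(("happy", "sad"), means[("happy", "sad")], True))
print(reorient(("noisy", "quiet"), means[("noisy", "quiet")], False))
```

After reorientation both pairs read the same way: a low mean indicates the favorable end, so the reader can compare them at a glance.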
Limitations
The semantic differential is usable only with intelligent and cooperative adults. People with little education often focus on the ends of the scale and do not use the middle points. We would not recommend using the semantic differential with children, with people whose command of the language is limited, with older people who would have difficulty seeing the various scale points, or with any group of respondents who are not accustomed to making fine distinctions.
Limitations
Performance rating scales have not been very useful in research because of the reluctance of raters to say unkind things about people, halo effects, and a lack of standards for judging employee effectiveness. If other criteria of effectiveness are available, such as production records or customer ratings, the supervisor's rating may provide useful supplementary information.
For children and others not accustomed to making verbal ratings, a series of facial expressions can be used to indicate liking.
Example
There is no reason why a rating scale should be dull and lifeless. A restaurant used movie titles to increase customer interest in filling out the rating scale:
1. Rate our food
  A. Some Kind of Wonderful
  B. Bound for Glory
  C. Touch and Go
  D. Crimes and Misdemeanors
  E. Mississippi Burning
  F. Unable to rate
2. Rate our service
  A. All the Right Moves
  B. Dream Team
  C. We're No Angels
  D. Missing
  E. Ruthless People
  F. Unable to rate
No matter how carefully the rating scale is constructed or how interesting the categories, there will always be some items that some people will be unable to rate. The easiest way to deal with this, as illustrated in the examples, is to include a separate category "unable to rate" or "no opinion." Another possibility is to instruct people to leave blank any item they are unable to rate. However, if space is available, it is better to add a specific category for those unable to express an opinion.
Limitations
Rating scales attached to the product or left on motel dressers are subject to response bias. Persons most likely to fill out and send in questionnaires will be those with strong opinions pro and con, and generally the latter. Response rates will vary with the consumer's interest in helping the manufacturer or service agent.
Sensory Evaluation
Sensory evaluation began in the laboratories of early experimental psychologists who were interested in the basic properties of odors, tastes, sound, and other sensations. The connection between the physical qualities of objects and their sensory attributes is called psychophysics. A key assumption in psychophysics is that people can make meaningful ratings of the degree of their sensory experiences (e.g., rating items as more or less bright, loud, sweet, and so on).
The food and beverage industries rely heavily on sensory evaluation. Before a new product is marketed, its consumer acceptance will be tested. Products are often first rated by expert judges who have exceptionally well-developed palates, noses, or visual sensitivity before being tried out on a panel of nonexperts. Researchers in Norway examined consumer response to black currant juice, which varied in strength, color, acidity, portion size, and time of testing (before or after lunch). Preference was found to be mainly influenced by color, acidity, and portion size (Martens, Risvik, & Schutz, 1983).
The qualities to be rated depend as much on the interests of the investigator as on the objective characteristics of the item. A firm might be interested in the visual appearance of a bar of soap, the texture of canned fruit, or the sound level of fluorescent lights. Deciding what characteristics are relevant should be done in consultation with the client or consumer organization, or it can be based on previous research.
Sensory evaluation. The student was asked to rate the flavor and appearance of tomatoes.
Various methods have been used to present material to the judges. One approach is to present the judges, at the beginning of the session, with standards. For an investigation of taste qualities, the judge will first taste four different compounds, one very sweet, one very sour, one very salty, and another very bitter, to use as standards in making subsequent judgments.
Example
Rate the item you have tasted along each of the following scales. Place a check anywhere along the line.
Note that the four taste qualities are rated separately. Sweet is not considered the opposite of sour. Grapefruit and pineapple can be both sweet and sour.
In the method of paired comparisons, two items are presented and the person is asked to compare them. This method is useful in deciding whether or not a change represents an improvement relative to a standard.
Example
Compared to B (the standard), item A is:
Since the subject compares only two items at a time, each comparison can be done quickly and easily. There is very little dependence on memory. Comparison procedures are useful with inexperienced raters who can express a preference for one item over another without being specific as to their reasons.
Example
Which of these two wines is sweeter, A or B?
Which of these two wines would you choose to accompany a steak dinner, A or B?
Such sessions are conducted as blind taste trials. The term blind indicates that the subject is not aware of the origin or identity of the item being rated. The subject is told its general category (wine) but not the specific variety, cost, or place of origin. Blind tasting minimizes the effects of labels and stereotypes. Subjects may be more likely to give high ratings to wines with expensive labels or fancy names. A further refinement of this procedure requires two experimenters. The first replaces all identifying information with code numbers before the sessions. The second, who has no information on the coding system, conducts the actual taste trials. This is called a double-blind procedure, as both the subject and the experimenter conducting the tests are in the dark about what is being tasted.
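The coding step of the double-blind procedure can be sketched as follows. This is an illustration only: the function name is hypothetical, and the use of random three-digit code numbers is an assumption; the text requires only that identifying information be replaced by codes the second experimenter cannot interpret.

```python
import random


def assign_codes(items, seed=None):
    """Replace item identities with random three-digit code numbers.

    Returns (key, blind_labels):
      key          -- code -> identity, kept by the first experimenter only
      blind_labels -- the codes alone, in sorted order, which is all the
                      second experimenter and the subjects ever see
    """
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no code repeats.
    codes = rng.sample(range(100, 1000), len(items))
    key = dict(zip(codes, items))
    blind_labels = sorted(codes)
    return key, blind_labels


key, labels = assign_codes(["Wine A", "Wine B", "Wine C"], seed=7)
print(labels)  # codes only; the key stays with the first experimenter
```

Sorting the labels before handing them over removes any trace of the original order of the items, so the order of the codes carries no hint about identity.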
Limitations
Like performance rating, sensory evaluation is subject to a halo effect. When people like a product, they tend to see most things about it as good; if they dislike it, they see everything about it as bad. Without careful explanation, the terms used in sensory evaluation may not be clear to those doing the rating; for example, people may have difficulty distinguishing among fragrant, fruity, and spicy. Expert judges, such as food critics and wine tasters, use different criteria than those used by ordinary consumers. Sensory evaluation requires people to make artificial distinctions. When they taste ketchup on a hot dog, most people do not divide the taste into separate degrees of sweetness, sourness, and saltiness.
Summary
Rating scales are used to rank people's judgments of objects, events, or other people from low to high or from good to poor. They provide numerical scores that can be used to compare individuals and groups. On a graphic rating scale, the respondent places a mark along a continuous line. On a step scale, the rater checks one of a graded series of steps without intermediate points. On a comparative rating scale, the person is asked to compare the object or person with others in the same category.
The numbers on a scale will reflect one of four levels of measurement: nominal (contains information only on qualities, or the presence or absence of something); ordinal (contains information on direction, such as increasing or decreasing size or order); interval (contains information on direction, and the intervals between each step are the same size); and ratio (contains information on direction, possesses equal intervals, and has an absolute zero).
An attitude scale is a special type of questionnaire designed to produce scores indicating the overall degree of favorability of a person's feelings about a topic. A Likert-type scale contains only statements that are clearly favorable or clearly unfavorable. No neutral or borderline statements are included. The respondents rate each statement along a five-point scale of agreement, from strongly agree to strongly disagree. Validity is increased by eliminating items that fail to discriminate between persons holding very positive and very negative views on the topic.
Reliability refers to consistency of measurement. There are three common methods for estimating the reliability of an attitude scale. In the test-retest method, the scale is given to the person on two occasions and the results are compared. The split-half method involves splitting an attitude scale into two halves which are then compared. The third method of measuring reliability involves constructing two equivalent forms of the scale. If the scale is reliable, the person's score on the two forms should be similar. The chief limitation of attitude scales is that they may not predict behavior.
The semantic differential is a procedure developed to measure the connotative meaning of concepts. Connotation refers to the personal meaning of something as distinct from its physical characteristics. Three major categories of connotative meaning are value, strength, and activity.
Performance rating scales are used to judge the competence and efficiency of employees. Experience with performance scales in most settings has been disappointing. Many supervisors are not willing to make honest judgments. The halo effect refers to the tendency to rate specific abilities on the basis of an overall impression.
Consumer ratings are used to find out people's opinions about products and services with which they are familiar. Sensory evaluation is used to test the psychophysical properties of products, particularly food and beverages.
Sometimes people are asked to rate items along graphic rating scales (e.g., sweet-not sweet, salty-not salty). In the method of paired comparisons, items are presented two at a time and the person is asked to compare them. In a blind taste trial, the respondent does not know the origin or specific identity of the item being rated. In a double-blind procedure, neither the subject nor the investigator knows the origin or specific identity of the item being rated. Without careful explanation, the terms used in sensory evaluation may not be clear to those doing the rating. Expert judges such as food critics use different criteria than those used by ordinary consumers.
References
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
Martens, M., Risvik, E., & Schutz, H. G. (1983). Factors influencing preference: A study on black currant juice. Proceedings of the Sixth International Congress of Food Science and Technology, 2, 193-194.
Osgood, C. E., May, W. H., & Miron, M. S. (1975). Cross-cultural universals of affective meaning. Urbana, IL: University of Illinois Press.