

Psychological Assessment
Does the Number of Response Options Matter?
Psychometric Perspectives Using Personality
Questionnaire Data
Leonard J. Simms, Kerry Zelazny, Trevor F. Williams, and Lee Bernstein
Online First Publication, March 14, 2019. http://dx.doi.org/10.1037/pas0000648

CITATION
Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (2019, March 14). Does the Number of
Response Options Matter? Psychometric Perspectives Using Personality Questionnaire Data.
Psychological Assessment. Advance online publication. http://dx.doi.org/10.1037/pas0000648
Psychological Assessment
© 2019 American Psychological Association
2019, Vol. 1, No. 999, 000
1040-3590/19/$12.00 http://dx.doi.org/10.1037/pas0000648

Does the Number of Response Options Matter? Psychometric Perspectives Using Personality Questionnaire Data

Leonard J. Simms, Kerry Zelazny, Trevor F. Williams, and Lee Bernstein
University at Buffalo, The State University of New York

This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
This document is copyrighted by the American Psychological Association or one of its allied publishers.

Psychological tests typically include a response scale whose purpose it is to organize and constrain the options available to respondents and facilitate scoring. One such response scale is the Likert scale, which initially was introduced to have a specific 5-point form. In practice, such scales have varied considerably in the nature and number of response options. However, relatively little consensus exists regarding several questions that have emerged regarding the use of Likert-type items. First, is there a "psychometrically optimal" number of response options? Second, is it better to include an even or odd number of response options? Finally, do visual analog items offer any advantages over Likert-type items? We studied these questions in a sample of 1,358 undergraduates who were randomly assigned to groups to complete a common personality measure using response scales ranging from 2 to 11 options, and a visual analog condition. Results revealed attenuated psychometric precision for response scales with 2 to 5 response options; interestingly, however, the criterion validity results did not follow this pattern. Also, no psychometric advantages were revealed for any response scales beyond 6 options, including visual analogs. These results have important implications for psychological scale development.

Public Significance Statement

We studied several aspects of questionnaire response scales and their impact on the reliability and validity of a personality measure. We found that smaller numbers of response options have a negative impact on the measurement precision of a scale, which has important implications for how psychological measures should be developed and used in practice.

Keywords: Likert scales, scale development, psychometrics, reliability, validity

Supplemental materials: http://dx.doi.org/10.1037/pas0000648.supp

Editor's Note. Yossef S. Ben-Porath, Editor, served as the sole action editor for this submission.

Leonard J. Simms, Kerry Zelazny, Trevor F. Williams, and Lee Bernstein, Department of Psychology, University at Buffalo, The State University of New York. This study was supported by a research grant to Leonard J. Simms from the National Institute of Mental Health (R01MH080086). Correspondence concerning this article should be addressed to Leonard J. Simms, Department of Psychology, University at Buffalo, The State University of New York, Park Hall 218, Buffalo, NY 14221. E-mail: ljsimms@buffalo.edu

Measurement tools in psychology and the social sciences more generally—including a wide variety of self-report, interview-based, and observational methods—typically include a response scale whose purpose it is to organize and constrain the options available to respondents and facilitate scoring. Today, response scales vary considerably in terms of the nature and number of response options. Arguably the most common response scale used is that introduced by Rensis Likert in 1932, as part of his doctoral dissertation, to measure a range of psychological attitudes. The original "Likert" scale (pronounced "lick-urt," /ˈlɪk.ərt/) included five symmetrical and balanced options reflecting degree of agreement: strongly agree, agree, undecided/neither, disagree, or strongly disagree. However, despite the ubiquitous place that response scales occupy in psychological measurement, the Likert scale has been extended and elaborated in numerous ways in the years that have passed since its original introduction, often with little empirical justification. Although a small (but inconsistent) literature exists regarding the nature and number of response options to include in psychological measures, measurement lore rather than data typically guides the choices scale developers and researchers make regarding response scales.

Given this backdrop, a number of questions frequently are posed regarding the use of Likert-type items on psychological questionnaires. First, is there a "psychometrically optimal" number of response options? Second, is it better to include an even or odd number of response options? And finally, do visual analog scales—in which respondents simply make a mark (or move a slider) along a line ranging from agree to disagree—offer any advantages over traditional Likert-type items (e.g., permitting the possibility of finer distinctions along the response scale)? In this article, we briefly review the literatures related to each of these questions and then offer a summary of fresh data that were collected to address each.

Is There a "Psychometrically Optimal" Number of Response Options?

Despite the central importance of the response scale to most questionnaire and rating scale measures of psychological constructs, little consensus has emerged in the literature regarding the number of points to include in a Likert-type rating scale. Likert's (1932) original scale included five options, as described above, but even a casual look at the measures used in practice and research reveals that measures range widely in the number of response options they offer respondents. Several lines of thought may influence one's view on the optimal number of points to include on a Likert scale. First, given the importance of score variance in classical test theory, response scales that result in increases in valid score variance should increase measurement precision and, thus, maximize validity coefficients. By this way of thinking, some scale development lore suggests that longer response scales are preferable since they will increase variability in total scores and thus maximize precision and validity. But the key is that the variance in scores must reflect reliable and valid distinctions among individuals on the psychological characteristic being evaluated. Some evidence would seem to support this perspective. Several studies suggest that estimates of score reliability increase with the number of response options (e.g., Finn, Ben-Porath, & Tellegen, 2015; Flamer, 1983; Hilbert, Küchenhoff, Sarubin, Nakagawa, & Bühner, 2016; Weng, 2004).

Second, we might expect measurement precision and validity to asymptote (or perhaps even decrease) at some number of response options if the added score variance offered by more differentiated response scales reflects psychometric noise rather than signal. That is, it is possible (perhaps even likely) that humans' ability to make fine-grained distinctions about relatively fuzzy and complex psychological constructs is not without limits (Cox, 1980; Symonds, 1924; Tourangeau, Rips, & Rasinski, 2000). For example, imagine a hypothetical respondent trying to provide a response to the item, "I would like the work of a librarian," on a 10-point Likert scale ranging from 1 = very strongly disagree to 10 = very strongly agree (see Table 1 for an example of all point labels across such a scale). How finely can a respondent discriminate along this scale? Is a response of 9 (strongly agree) reliably different than a response of 10 (very strongly agree)? Although an empirical question, anecdotal evidence from years of developing and using scales of various lengths in our laboratory suggests that there is a point of diminishing returns with respect to the number of response options.

Some literature is consistent with this perspective. Bendig (1953) reported equal reliability for three, five, six, or nine response options but a reliability decrease for 11 options. More recently, numerous studies have concluded that measurement precision (and sometimes validity) does indeed asymptote. Lee and Paek (2014) concluded, based on simulated data, that four to six response options is the ideal number, although they examined only up to six response options. Lozano, García-Cueto, and Muñiz (2008) concluded that four to seven response options is optimal, with no meaningful increases after seven options. Preston and Colman (2000) investigated all possible response scales between two and 11 options on a measure of service elements associated with a recently visited store or restaurant, finding that (a) two to four response options performed poorly in terms of reliability, validity, and discriminating power, and (b) performance on these metrics improved up to seven response options. However, although Preston and Colman's study had strong resolution for determining a potential point of asymptote, it was limited in several important ways. First, the scale they used was not a traditional Likert scale, which limits its generalizability. Second, this study included only 149 participants, all of whom completed all 10 of the different response options under consideration, which may have resulted in respondent fatigue and/or memory effects that could have affected the results. Finally, this study focused on measurement of attitudes regarding an experience in a restaurant rather than a deeper psychological construct, such as personality. It is an open question what a study with similar resolution would show if applied to traditional personality scales.
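The asymptote argument reviewed above can be illustrated with a small simulation in the spirit of such studies (e.g., Lee & Paek, 2014). The sketch below is ours, not any cited study's code: it coarsens continuous item responses driven by a single latent trait into a given number of ordered response options and computes coefficient alpha. The factor loading, item count, and equal-frequency cut rule are illustrative assumptions.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

def simulate_alpha(n_options: int, n_people: int = 5000, n_items: int = 8,
                   loading: float = 0.7, seed: int = 0) -> float:
    """Alpha for items produced by coarsening continuous responses,
    all driven by one latent trait, into n_options ordered categories."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_people, 1))        # latent trait
    noise = rng.standard_normal((n_people, n_items))  # item-specific error
    continuous = loading * theta + np.sqrt(1 - loading ** 2) * noise
    # Equal-frequency cut points; each simulated answer is the category
    # the continuous response falls into.
    cuts = np.quantile(continuous, np.linspace(0, 1, n_options + 1)[1:-1])
    return cronbach_alpha(np.digitize(continuous, cuts).astype(float))
```

With these illustrative settings, alpha is clearly attenuated with two response options and nearly flat from roughly six options upward, mirroring the asymptote pattern described in the literature reviewed here.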

Table 1
Summary of Likert Response Labels Used in This Study

2-point: Disagree; Agree
3-point: Disagree; Neither Agree nor Disagree; Agree
4-point: Strongly Disagree; Disagree; Agree; Strongly Agree
5-point: Strongly Disagree; Disagree; Neither Agree nor Disagree; Agree; Strongly Agree
6-point: Strongly Disagree; Disagree; Slightly Disagree; Slightly Agree; Agree; Strongly Agree
7-point: Strongly Disagree; Disagree; Slightly Disagree; Neither Agree nor Disagree; Slightly Agree; Agree; Strongly Agree
8-point: Very Strongly Disagree; Strongly Disagree; Disagree; Slightly Disagree; Slightly Agree; Agree; Strongly Agree; Very Strongly Agree
9-point: Very Strongly Disagree; Strongly Disagree; Disagree; Slightly Disagree; Neither Agree nor Disagree; Slightly Agree; Agree; Strongly Agree; Very Strongly Agree
10-point: Very Strongly Disagree; Strongly Disagree; Disagree; Mostly Disagree; Slightly Disagree; Slightly Agree; Mostly Agree; Agree; Strongly Agree; Very Strongly Agree
11-point: Very Strongly Disagree; Strongly Disagree; Disagree; Mostly Disagree; Slightly Disagree; Neither Agree nor Disagree; Slightly Agree; Mostly Agree; Agree; Strongly Agree; Very Strongly Agree
Visual analog: a continuous slider anchored by Disagree and Agree
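The matched even-odd pairing that organizes Table 1 (each odd-numbered scale is its even-numbered counterpart with a neutral midpoint inserted) can be expressed in a few lines. This sketch is illustrative: the helper function is ours, and the label lists are transcribed from the table.

```python
# Even-numbered response scales, transcribed from Table 1; each matched
# odd-numbered scale in the study design is the even scale plus a
# "Neither Agree nor Disagree" midpoint.
EVEN_SCALES = {
    2: ["Disagree", "Agree"],
    4: ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"],
    6: ["Strongly Disagree", "Disagree", "Slightly Disagree",
        "Slightly Agree", "Agree", "Strongly Agree"],
    8: ["Very Strongly Disagree", "Strongly Disagree", "Disagree",
        "Slightly Disagree", "Slightly Agree", "Agree",
        "Strongly Agree", "Very Strongly Agree"],
    10: ["Very Strongly Disagree", "Strongly Disagree", "Disagree",
         "Mostly Disagree", "Slightly Disagree", "Slightly Agree",
         "Mostly Agree", "Agree", "Strongly Agree", "Very Strongly Agree"],
}

def matched_odd(even_labels: list[str]) -> list[str]:
    """Build the matched odd-numbered scale by inserting the midpoint."""
    mid = len(even_labels) // 2
    return even_labels[:mid] + ["Neither Agree nor Disagree"] + even_labels[mid:]
```

For example, `matched_odd(EVEN_SCALES[4])` yields the 5-point scale in the table.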

In contrast to the preceding paragraphs, a number of studies also have suggested no clear trends in scale psychometrics as a function of the number of response options (e.g., Bendig, 1954; Capik & Gozum, 2015; Matell & Jacoby, 1972). Moreover, some recent studies have painted a contradictory picture of response option effects. Finn et al. (2015), following up on similar studies by Cox and colleagues (2012) and Cox, Courrégé, Feder, and Weed (2017), compared 2- and 4-point response scales as applied to the items of the restructured form of the Minnesota Multiphasic Personality Inventory (MMPI)-2 (MMPI-2-RF; Tellegen & Ben-Porath, 2008/2011). In this line of research, increasing the number of response options resulted in improved internal consistency but, interestingly, no advantage in convergent validity. However, it is an open question whether the validity results will replicate with response scales with more than four options.

Despite these diverging conclusions regarding the optimal number of response options to include for psychological test items, several tentative conclusions are possible. First, most studies suggest that two to three response options result in attenuated psychometrics. Second, although several studies show improved psychometrics as the number of response options increases, the bulk of recent studies point to the existence of an asymptote above which scale psychometrics level off or decrease. Third, there is inconsistency in the specific point of diminishing returns, with four to seven response options variously reported as the number needed for maximum measurement precision. Finally, recent work suggests that improvements in measurement precision may not result in concomitant improvements in convergent validity. In the present study, we aim to replicate and extend these previous findings using a well-known measure of a prominent personality model. Our design includes all response options ranging from two to 11, large sample sizes, and thus adequate resolution to derive stronger conclusions than the previous studies conducted in this domain.

Is It Better to Have an Even or Odd Number of Response Options?

Another question that faces psychological scale developers—and one about which there is relatively more opinion than data—is whether Likert-type items should include an even or odd number of response options (Kulas & Stachowski, 2013; Nadler, Weston, & Voyles, 2015). With an odd number of response options, the middle option (i.e., neither agree nor disagree) has ambiguous meaning, which should increase measurement error if respondents use that option in different ways that do not reflect their perceived standing on the characteristic being measured. Kulas and Stachowski (2013) recently studied the reasons respondents may have for selecting the middle option on an oddly numbered Likert scale, which may include (a) the response reflects moderate standing on the item/trait (which arguably would be the ideal), (b) the respondent has difficulty deciding his or her standing on the item, (c) the respondent is confused about the item's meaning, and/or (d) the respondent feels that his or her response is context-dependent (i.e., what the authors called the "it depends" reason for using the middle option). The latter three reasons are less than ideal, since they represent construct-irrelevant reasons for using a particular response option and thus should increase measurement error and attenuate validity.

Aside from these considerations, Garland (1991) suggested that midpoint responses may be related to socially desirable responding. Velez and Ashworth (2007) argued that midpoint endorsement decreases with clearer items. Hernández, Drasgow, and González-Romá (2004) identified different classes of individuals in terms of how they use middle response options, and suggested small personality differences across these groups. Thus, the reasons for and impact of allowing a middle-point option have been considered elsewhere in the literature, albeit infrequently and without clear conclusions about the question under consideration here: Should scale developers use even or odd numbers of response options for Likert-type items? Our second aim is to directly address this question.

Do Visual Analog Response Scales Offer Advantages Relative to Likert-Type Scales?

Visual analog scales (Aitken, 1969; Flynn, van Schaik, & van Wersch, 2004; Hayes & Patterson, 1921) ostensibly are continuous measurement devices that have been used in many literatures to represent both full constructs and/or single items on longer scales of psychological constructs. Historically, visual analogs appeared as a single horizontal line on a page, with two opposing anchors on either end of this line (e.g., agree vs. disagree, true vs. untrue, good vs. bad, etc.). Respondents are asked to simply mark their standing on the given construct with a short vertical line, and scoring is completed by physically measuring the location of the mark using a measure (e.g., a ruler) with reasonable precision. This can be a labor-intensive process, given that each mark must be hand-measured. However, with the advent of computer- and Internet-based survey programs, visual analogs now can be neatly represented (and more easily scored) using a slider bar that respondents move using a mouse, trackpad, or other input device.

Given the greater resolution and flexibility associated with visual analogs, some have suggested that they might maximize measurement precision, even for items on traditional psychological scales (e.g., Reips & Funke, 2008; Russell & Bobko, 1992). That said, relatively little work has directly compared visual analog scales to traditional Likert-type items as the basis for scale development. The work that has been conducted has led to competing conclusions, with some studies showing no differences in measurement properties of visual analog scales as compared with traditional Likert scales (e.g., Bergman, 2009) and others showing modest evidence of psychometric superiority for visual analogs (e.g., Hilbert et al., 2016; Russell & Bobko, 1992). Thus, although computers and online data collection make the use of such scales much easier than previously, it is an open question whether humans can make fine-grained distinctions along these scales in a way that actually improves precision.

Present Study

Given the importance of measurement precision across all areas of psychology (and the social sciences more generally), the limited research and lack of scientific consensus regarding the nature and number of response options for questionnaire items is surprising. With this article, we sought to add meaningfully to this literature by studying three issues related to response scales: (a) the optimal number of response options for a Likert scale, (b) even versus odd

numbers of response options, and (c) visual analog scales versus traditional Likert scales. Our study extends previous work by simultaneously assessing each of these three questions using a well-powered study with adequate resolution to yield more conclusive results. Given the previous (albeit inconclusive) work in this area, as well as considerations drawn from basic psychometric theory, we predicted that measurement precision (and subsequent markers of validity) would (a) asymptote after six to seven response options, (b) show no advantage for odd numbers of response options relative to matched evens, and (c) be no stronger for visual analog scales relative to traditional Likert scales.

Method

Participants and Procedures

Participants in this study were 1,358 undergraduate students who participated in exchange for course research credit at a large university in the Northeastern United States. All study procedures were reviewed and approved by the university's cognizant institutional review board. Participants were 51% female; mean age was 19.1 years (SD = 1.9). The sample was 60.8% White, 22.8% Asian American, 8.0% African American, and 5.1% Hispanic/Latino. Participants completed the study in groups of as many as eight. Questionnaires were completed using computers in private carrels in a laboratory setting. All participants completed the Big Five Inventory (BFI; John & Srivastava, 1999) twice, before and after completing the Personality Inventory for DSM–5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012). Response scales for the BFI were manipulated across groups, as described below.

The study was structured with both cross-sectional and repeated-measures components. Participants were randomly assigned to one of six groups. Within the first five groups, response scales were paired to facilitate comparisons between odd and even numbers of response options. Each group was designated to complete one odd-even comparison (e.g., four vs. five response options). Response options, which are presented in Table 1 for all groups, were developed such that matched odd- and even-numbered response scales were identical except that paired odd-numbered scales included a middle "neither agree nor disagree" option. This yielded five groups with varying numbers of response options: (a) two and three response options (n = 225), (b) four and five response options (n = 232), (c) six and seven response options (n = 221), (d) eight and nine response options (n = 219), and (e) 10 and 11 response options (n = 212). Order of response scale presentation was counterbalanced within groups. In addition, a sixth group completed the BFI twice using a visual analog scale for each item (n = 249), which was presented as a slider bar with a marker that could be moved by participants using a mouse. The visual analog scale was scored based on marker position on a 0 to 1000 scale.

Measures

Big Five Inventory (BFI; John & Srivastava, 1999). The BFI is a 44-item scale that is designed to use a 5-point Likert-type rating scale, ranging from 1 (strongly disagree) to 5 (strongly agree), and provides scores on the domains of the Big Five Model of personality (Neuroticism, Extraversion, Conscientiousness, Agreeableness, and Openness). Benet-Martinez and John (1998) reported good internal consistency for BFI scales and good convergence between BFI scales and other established measures of the Big Five Model. As noted above, we manipulated the response scale across groups for the BFI, and the resultant reliabilities and validities are presented below, representing the primary analyses for this article.

Personality Inventory for DSM–5 (PID-5; Krueger et al., 2012). The PID-5 includes 220 self-report items assessing the 25 maladaptive traits and five higher order domains of the Alternative Model of Personality Disorder (American Psychiatric Association, 2013). Responses are provided on a 4-point Likert scale from 0 (very false or often false) to 3 (very true or often true), and scale scores are means of associated items, with higher scores representing greater pathology. Recent studies have supported the construct validity of the PID-5 five-factor structure (e.g., Wright & Simms, 2014). In the current study, the average alpha coefficients were .94 (range = .92–.96) and .87 (range = .68–.96) across domains and traits, respectively. We included the PID-5 to compare the impact of differing response options on BFI criterion validity.

Analyses and Results

To facilitate our primary aims, we conducted analyses to compare descriptive statistics, internal consistency reliabilities, short-term retest correlations, and criterion validity correlations. First, descriptive statistics across scales and numbers of response options—calculated as Cohen's d effect size differences on item means, compared to two response options—are presented in Figure 1. What is clear from these results is that item means decreased a full standard deviation between two and four response options. Between four and seven response options, the item means tended to remain relatively consistent, but then additional, albeit small, decreases in item means were observed beyond seven response options. Significance testing results for these descriptive statistics are presented in Supplemental Table 1: Within- and between-subjects t tests generally corroborated the pattern described here. We also examined scale skewness and kurtosis, finding that skewness and kurtosis generally decreased from two to four options, but did not change dramatically after four response options.

Figure 1. Cohen's d effect size differences (based on item means) as a function of scale and number of response options. VA = visual analog scale.

We next examined internal consistency reliabilities across scales and numbers of response options. Results related to our first aim—determining the optimal number of response options to include on a Likert scale—appear in Figure 2. Two findings can be gleaned from this figure. First, alpha coefficients varied across scales, with BFI Neuroticism and BFI Extraversion demonstrating consistently stronger alphas than the remaining three BFI scales. Second, the mean alpha line neatly summarizes the results: Internal consistency reliabilities appeared to increase gradually until six response options and then leveled off. These results reveal no psychometric advantage for any scale beyond six response options.

Figure 2. Internal consistency reliabilities as a function of scale and number of response options. VA = visual analog scale.

In Figure 3, we rearranged the internal consistency values to facilitate an examination of our second and third study aims—examining odd versus even numbers of response options and visual analog scales. In the first paired comparison—2- versus 3-point scales—the mean alpha coefficients increased from two to three response options, M alphas = .73 and .78, respectively. Mean alphas were the same for the 4- and 5-point scales, both alphas = .80. Mean alphas decreased slightly from six to seven response options, M alphas = .83 and .82, respectively. Similarly, mean alphas decreased slightly from eight to nine response options, M alphas = .82 and .81, respectively. The same pattern was observed from 10 to 11 response options, M alphas = .83 and .82, respectively. Finally, the visual analog scale yielded a slightly lower alpha compared to most other alphas between six and 11 response options, M α = .81. Taken together, these results suggest a slight decrement in alpha for odd numbers of response options beyond six, and an additional slight decrement for visual analog scales. That said, the differences between matched even and odd numbers of response options were small enough that the practical impact likely is minimal.

Figure 3. Internal consistency reliabilities as a function of scale and number of response options, highlighting matched even-odd pairs. VA = visual analog scale.

We also tested for significant differences among the alpha coefficients described above, using methods described by Feldt, Woodruff, and Salih (1987) and implemented in "cocron," an open-source program on the R platform (Diedenhofen & Musch, 2016). These results—presented in Supplemental Table 2—revealed several patterns. First, there were relatively few significant differences among the alphas presented in Figures 2 and 3. Second, the bulk of significant alpha differences involved the 2-point scale, which was significantly lower than the 3-point scale for four of five traits (i.e., all but conscientiousness), χ²s(1) ranged from 10.1 to 16.6, p < .01. Similarly, the 2-point scale yielded alphas that were significantly lower than those for all more differentiated scales for extraversion, agreeableness, and openness, χ²s(1) ranged from 3.6 to 16.6, p < .05, whereas the 2-point scale was less consistently different than other response scales for conscientiousness and neuroticism. Finally, in tests of alpha differences between even and odd numbers of response options, 7 of 25 matched comparisons yielded alpha differences in the predicted direction, χ²s(1) ranged from 5.2 to 16.6, p < .05, whereas the remaining tests were not significant. Thus, these significance tests painted a more conservative picture of the differences than was evident through visual inspection of the figures; through this lens, the most robust finding is that 2-point scales are impoverished relative to more differentiated scales with respect to internal consistency reliability.

We next studied short-term retest correlations within each group. Given the short interval—approximately 30 min—between test and retest, these coefficients are akin to dependability correlations (see Watson, 2004). These results—which appear in Figure 4—showed that retest correlations (a) all were strong and greater than .84, and (b) asymptoted for the six versus seven, eight versus nine, and 10 versus 11 comparisons, M retest correlations = .895, .892, and .896, respectively. Full retest results are presented in Supplemental Table 3, including Z tests for significant differences among the correlations. Although the figures suggest no advantage for any scale beyond six response options, the significance testing revealed a more conservative pattern: Short-term retest coefficients were significantly lower for the two-to-three group relative to all other groups for Neuroticism, Extraversion, and Openness, Zs ranged from 2.01 to 2.68, p < .05. No other differences among these correlations were significant.

Finally, we studied convergent validity as a function of scale and number of response options. Results for the BFI Extraversion, Agreeableness, Conscientiousness, and Neuroticism scales are presented in

Figure 4. Short-term retest correlations as a function of scale and number of response options.

Figure 5. Mean convergent validity correlations, measured against the Personality Inventory for DSM–5, as a function of scale and number of response options. For each BFI scale, convergent validities were included in the means only if at least one correlation was greater than or equal to .40. All correlations are absolute values. VA = visual analog scale.

Figure 5 and Supplemental Tables 4 to 8. For each BFI scale, we calculated mean convergent validity correlations against the scales of the PID-5. Convergent validities were included in the means only if at least one absolute correlation was greater than or equal to .40. No correlates met this threshold for BFI Openness. Given the premise that unreliability should attenuate validity, we predicted increases in convergent validity as the number of response options increased. Results varied across scales, with no single pattern of validities emerging across scales as a function of response scale. The mean validity data—represented by a bold, dashed line in Figure 5—revealed a relatively flat line across response formats, with two exceptions. First, there was a slight dip in validity for those who completed the 4- and 5-point BFI scales, mean rs = .42 and .41, respectively, compared to the mean validity across all scales and response scales, mean r = .45. Second, there was a notable drop in validity for those completing the BFI using the visual analog format, mean r = .41. Notably, however, none of the mean validity correlations presented in Figure 5 were significantly different from one another, all Zs < 1.96, ps > .05.

Discussion

In this article, we have attempted to bring clarity to three questions involved in determining a response scale for psychological test items: (a) is there a psychometrically optimal number of response options to include for Likert-type items, (b) is there any psychometric advantage or disadvantage associated with using odd or even numbers of response options, and (c) how does the visual analog format compare to traditional Likert-type items with respect to their resultant psychometric characteristics. To address these questions, we examined descriptive statistics, internal consistency reliability, short-term retest correlations, and convergent validity across different numbers of response options and with five broad personality scales. In this discussion, we will summarize and consider the implications of our results related to each of these three questions.

How Many Response Options?

As noted previously, traditional Likert scales (Likert, 1932) included 5 symmetrical response options ranging from strongly disagree to strongly agree, but many variations on that original structure have emerged since the scale's original publication, often with little or no evidence offered to support the deviation. Our results revealed several themes regarding the ideal or optimal number of response options to include on such a scale. First, although we did not make any specific prediction regarding descriptive statistics, our results show that changing the number of response options has a non-negligible impact on basic scale norms. Large differences in scale means were revealed for smaller numbers of response options. And although these differences stabilized a bit for scales with four or more response options, item means continued to decline at a smaller rate. Thus, for example, if response scales are modified for a given research study or clinical application, simple proration of published norms—to account for changes in the number of response options—appears to be a problematic exercise.

Second, response scales with two and three response options (and to a lesser extent four and five response options) generally attenuated the psychometric precision associated with these five BFI scales, which suggests that scale developers wishing to adopt such response scales do so at some risk of reducing measurement precision. Practically speaking, to have scales with minimally acceptable measurement precision (e.g., α = .80; see Clark & Watson, 1995), using fewer response options will require scale developers to include more items. Thus, these results reveal a trade-off between response scale simplicity and measurement imprecision that must be adequately reconciled at the scale development stage. Some developers may desire simpler response scales for nonpsychometric reasons (e.g., readability, simplicity), but more items will be needed for such scales if minimally acceptable measurement precision is desired.
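The two classical test theory relations invoked above, a true correlation attenuated by unreliability and the number of items needed to reach a target reliability, can be made concrete with a short sketch. The values below are illustrative only and are not taken from the study.

```python
import math

def attenuated_validity(true_r, rel_x, rel_y):
    """Observed correlation expected under classical test theory:
    the true correlation shrunk by unreliability in both measures."""
    return true_r * math.sqrt(rel_x * rel_y)

def spearman_brown_factor(alpha_now, alpha_target):
    """Lengthening factor needed to raise reliability to a target
    value (Spearman-Brown prophecy formula)."""
    return (alpha_target * (1 - alpha_now)) / (alpha_now * (1 - alpha_target))

# A true correlation of .60 is attenuated as scale reliability drops:
r_obs = attenuated_validity(0.60, 0.80, 0.85)    # roughly .49

# An 8-item scale with alpha = .70 needs about 14 items to reach .80:
n_items = 8 * spearman_brown_factor(0.70, 0.80)  # roughly 13.7
```

This is the arithmetic behind the trade-off noted above: a coarser response format that lowers alpha forces a longer scale, and the required lengthening factor grows quickly as the target reliability approaches 1.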

Third, no improvements in psychometric precision were identified past 6 response options. Thus, there appears to be little psychometric basis to arguments that additional numbers of response options will translate to increased scale reliability while holding the number of items constant. Going beyond 6 options may confuse participants who perhaps have difficulty perceiving differences between similarly worded response options (e.g., strongly agree vs. very strongly agree). Alternatively, more differentiated response scales may pose important challenges to the ability of humans to make fine-grained distinctions regarding responses to relatively coarse psychological test items. That is, there likely are important cognitive information processing variables (e.g., memory, perception, discrimination, intelligence) at play that have not been adequately studied with respect to how they interact with participants' approaches to psychological test items. More work is needed in this area.

Interestingly, the criterion validity results defied the pattern of attenuated psychometrics for smaller numbers of response options, such that the lower psychometric precision associated with smaller numbers of response options did not translate into attenuated validity as would be expected from classical test theory (e.g., Crocker & Algina, 1986; Gulliksen, 1950; McDonald, 1999). This extends recent work suggesting that moving from two to four response options results in no improvement in convergent validity correlations (Cox et al., 2012, 2017; Finn et al., 2015). In the present study, we replicated this finding and extended it to more differentiated response formats up to 11 options (and the visual analog scale). Several possibilities could explain this pattern of results. First, the BFI scales we used for the validity analyses are psychometrically strong scales with a long literature showing evidence of their reliability and validity (e.g., see John & Srivastava, 1999). In particular, it is notable that even in our own results, the short-term retest correlations, although slightly lower for less differentiated scales, did not reveal dramatic differences in scale dependability. Arguably, measurement imprecision related to temporal differences (i.e., retest reliability or dependability; Watson, 2004) might be more detrimental to criterion validity than imprecision related to content heterogeneity (i.e., internal consistency reliability). If so, then our failure to see a consistent impact of response options on criterion validity is not as surprising.

Alternatively, unknown sampling differences across our groups could have affected these findings, despite large samples and random assignment. Also, our criterion variables were limited to the scales of another personality questionnaire with known limitations (e.g., Al-Dajani, Gralnick, & Bagby, 2015). Given that both our primary measure and convergent validity measure were self-report questionnaires with similar formats, the impact of shared method variance could not be disentangled in the current study (e.g., Courrégé & Weed, 2018). Taken together, it would be important to study the impact on validity of different numbers of response options using a broader range of test and nontest criteria, such as behavioral observations or real-world indicators of personality. Regardless, although the precision results point to an asymptotic point of diminishing returns following 6 response options, the criterion validity results limit the strength of this conclusion. This result deserves further study since it calls into question long-standing tenets of classical test theory regarding the relation between reliability and validity. All things being equal, measurement imprecision should attenuate validity correlations (e.g., Gulliksen, 1950). That did not occur in this study.

Another point that deserves discussion is that statistical significance testing showed a more conservative pattern than the visual pattern of results presented in the figures. The most conservative conclusion based on significance testing is that 2- and 3-point response formats are impoverished with respect to measurement precision (but not validity). Differentiation beyond four response options generally was not supported by significance testing. However, the visual patterns appear to show replicable additional advantages, albeit small and nonsignificant, between four and five and six and seven options. Thus, we maintain that scale developers should strongly consider using six or seven response options unless pilot testing or other considerations make such a differentiated format undesirable.

Notably, some readers may find this recommendation curious given that the validity results failed to support the need for such a differentiated response scale. We acknowledge this discrepancy. However, the decision regarding how many options to offer on a personality questionnaire like the BFI involves more than a consideration of validity. Deficits in measurement precision alone are directly related to increases in the standard error of measurement of a scale (and thus the confidence intervals around scores), which has implications for the precision of point estimates of scores used in clinical and applied settings. Thus, even in the absence of validity implications (about which the conclusions are still unclear), it is our view that response scales should be adequately differentiated to maximize measurement precision.

Is It Better to Have an Even or Odd Number of Options?

Given the ambiguity associated with the use of the middle option on odd-numbered response scales (Kulas & Stachowski, 2013), we predicted that odd-numbered Likert scales would show no advantage, psychometrically speaking, over matched even-numbered scales. This result was generally supported, as alphas and criterion validity correlations generally revealed no advantage for odd-numbered scales relative to matched even-numbered scales. Taken together with the conclusions presented above, these results suggest little psychometric justification for Likert scales with more than six or seven response options. Moreover, although the psychometric differences between six and seven response options were small to nonexistent, we would argue that six response options are preferable to seven options on the grounds of parsimony. That said, the most honest appraisal of our results is that it probably doesn't matter much. Those unconcerned with the ambiguity of the middle option may continue to use odd-numbered Likert scales given that there is no clear and unequivocal psychometric penalty for doing so.

Do Visual Analog Scales Offer Any Advantages Over Traditional Likert-Type Items?

Visual analog scales offer the promise of added measurement precision given that they do not rely on a limited number of anchor points and rather allow participants to simply mark their response anywhere they wish along a continuous line. That said, our results failed to show any psychometric advantage for visual analog items relative to traditional Likert-type items. In fact, most of our analyses appeared to reveal small nonsignificant decrements in psychometric performance, including criterion validity, for scales comprised of visual analog items. Similar to that which was discussed above, it appears that the promise of added psychometric precision is not realized in practice with scales based on visual analog items, perhaps because humans are unable to reliably make meaningful and valid fine-grained distinctions for coarse items reflecting complex psychological characteristics. Although the differences in results for visual analogs generally were not large or statistically significant, nothing in our results would lead us to recommend their use as the basis for psychological test items. Moreover, although not directly assessed in the present study, the present results also call into question the common practice of using visual analog scales as single-item scales to reflect underlying latent traits and characteristics.

Limitations and Future Directions

Strengths of our study include a large sample size, resolution permitting us to examine all common numbers of response options between two and 11 (including a visual analog option), and the use of a widely used and psychometrically strong measure of a prominent model of personality as the basis for the study. In addition, we examined the impact of response scale on both reliability and validity, something not often done in this literature. That said, our results must be interpreted in the context of several limitations.

First, our results, by design, apply only to traditional Likert-type agree-disagree scales. Other metrics—such as scales related to frequency or intensity or similarity—might yield different results and thus must be studied in a manner similar to what we presented in this article. That said, the basic finding—that there is a point of diminishing returns in terms of the number of response options—is likely to generalize to all possible response scales for psychological scales, although the exact point might be a function of the nature of the response scale under consideration. For example, Likert scales are inherently bipolar in nature; unipolar scales (e.g., not at all to extremely) may not yield as many meaningfully different anchor points. All such variations in response scales ought to be carefully studied during the scale development process.

Second, our analyses and results were confined to a single multiscale measure of personality. Although the BFI is a strong and widely studied measure of a prominent model of personality, it does not fully represent all possible constructs psychological researchers and clinicians may wish to measure. Taken together with the finding that our patterns of results appeared to vary somewhat even within the BFI scales, generalizing the present findings to other measures and different constructs must be done with caution until the present results are replicated with a broader range of measures and constructs, including different domains, such as psychopathology, mood/affect, attitudes, values, and constructs used in other areas of the social sciences (e.g., political or moral attitudes, market research evaluations, etc.).

Finally, our sample, although large, was comprised solely of college undergraduates who completed the study to fulfill a course research requirement. Replicating the present work in broader, more ecologically valid community and psychiatric samples would be an important aim for future work in this area. Although this is a common limitation in empirical studies involving student samples, there are specific reasons to highlight this limitation in the present article. College samples typically are comprised of relatively well-educated participants with average to above-average intellectual ability and reading levels. Given the cognitive information processing demands that test items likely place on those providing responses, college samples likely represent an upper bound on the amount of complexity that it is reasonable to include in an item response format. Thus, samples comprised of participants with lower cognitive ability might yield different results from those identified here.

Final Thoughts

Despite the above caveats, the present study offers important guidance to aid scale developers who wish to base their work on evidence-based practices. We conclude that a 6-point response scale is the most reasonable format based on the present results, especially for measures of personality constructs like those we assessed in this study. Going much beyond six response options seems to challenge humans' basic ability to make fine-grained distinctions about the complex psychological constructs we tend to study. We also made suggestions regarding the use of odd numbers of response options and visual analog items. However, an important take-home message from this article is that the choice of response format should be given the same amount of thought and examination as typically is put into the development of the test items themselves. If you're building a new scale and wish to deviate from the suggestions listed here, we strongly recommend pilot testing.

References

Aitken, R. C. (1969). Measurement of feelings using visual analogue scales. Proceedings of the Royal Society of Medicine, 62, 989–993.
Al-Dajani, N., Gralnick, T. M., & Bagby, R. M. (2015). A psychometric review of the Personality Inventory for DSM–5 (PID-5): Current status and future directions. Journal of Personality Assessment, 98, 1–20.
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Washington, DC: Author.
Bendig, A. W. (1953). The reliability of self-ratings as a function of the amount of verbal anchoring and of the number of categories on the scale. Journal of Applied Psychology, 37, 38–41. http://dx.doi.org/10.1037/h0057911
Bendig, A. W. (1954). Reliability and the number of rating-scale categories. Journal of Applied Psychology, 38, 38–40. http://dx.doi.org/10.1037/h0055647
Benet-Martínez, V., & John, O. P. (1998). Los Cinco Grandes across cultures and ethnic groups: Multitrait multimethod analyses of the Big Five in Spanish and English. Journal of Personality and Social Psychology, 75, 729–750. http://dx.doi.org/10.1037/0022-3514.75.3.729
Bergman, R. D. (2009). Testing the measurement invariance of the Likert and graphic rating scales under two conditions of scale numeric presentation (Doctoral dissertation). Retrieved from ProQuest Dissertations & Theses Global. (Order No. 3360158)
Capik, C., & Gozum, S. (2015). Psychometric features of an assessment instrument with Likert and dichotomous response formats. Public Health Nursing, 32, 81–86. http://dx.doi.org/10.1111/phn.12156
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. http://dx.doi.org/10.1037/1040-3590.7.3.309
Courrégé, S. C., & Weed, N. C. (2018). The role of common method variance in MMPI-2-RF response option augmentation. Unpublished manuscript.

Cox, A. C., Courrégé, S. C., Feder, A. H., & Weed, N. C. (2017). Effects of augmenting response options of the MMPI-2-RF: An extension of previous findings. Cogent Psychology, 4, 1323988. http://dx.doi.org/10.1080/23311908.2017.1323988
Cox, A., Pant, H., Gilson, A. N., Rodriguez, J. L., Young, K. R., Kwon, S., & Weed, N. C. (2012). Effects of augmenting response options on MMPI-2 RC scale psychometrics. Journal of Personality Assessment, 94, 613–619. http://dx.doi.org/10.1080/00223891.2012.700464
Cox, E. P. (1980). The optimal number of response alternatives for a scale: A review. Journal of Marketing Research, 17, 407–422. http://dx.doi.org/10.2307/3150495
Crocker, L., & Algina, J. (1986). Introduction to classical & modern test theory. Orlando, FL: Holt, Rinehart and Winston.
Diedenhofen, B., & Musch, J. (2016). cocron: A web interface and R package for the statistical comparison of Cronbach's alpha coefficients. International Journal of Internet Science, 11, 51–60.
Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied Psychological Measurement, 11, 93–103. http://dx.doi.org/10.1177/014662168701100107
Finn, J. A., Ben-Porath, Y. S., & Tellegen, A. (2015). Dichotomous versus polytomous response options in psychopathology assessment: Method or meaningful variance? Psychological Assessment, 27, 184–193. http://dx.doi.org/10.1037/pas0000044
Flamer, S. (1983). Assessment of the multitrait-multimethod matrix validity of Likert scales via confirmatory factor analysis. Multivariate Behavioral Research, 18, 275–306. http://dx.doi.org/10.1207/s15327906mbr1803_3
Flynn, D., van Schaik, P., & van Wersch, A. (2004). A comparison of multi-item Likert and visual analogue scales for the assessment of transactionally defined coping function. European Journal of Psychological Assessment, 20, 49–58. http://dx.doi.org/10.1027/1015-5759.20.1.49
Garland, R. (1991). The mid-point on a rating scale: Is it desirable? Marketing Bulletin, 2, 66–70.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley. http://dx.doi.org/10.1037/13240-000
Hayes, M. H. S., & Patterson, D. G. (1921). Experimental development of the graphic rating method. Psychological Bulletin, 18, 98–99.
Hernández, A., Drasgow, F., & González-Romá, V. (2004). Investigating the functioning of a middle category by means of a mixed-measurement model. Journal of Applied Psychology, 89, 687–699. http://dx.doi.org/10.1037/0021-9010.89.4.687
Hilbert, S., Küchenhoff, H., Sarubin, N., Nakagawa, T. T., & Bühner, M. (2016). The influence of the response format in a personality questionnaire: An analysis of a dichotomous, a Likert-type, and a visual analogue scale. TPM-Testing, Psychometrics, Methodology in Applied Psychology, 23, 3–24.
John, O. P., & Srivastava, S. (1999). The Big Five Trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (2nd ed., pp. 102–138). New York, NY: Guilford Press.
Krueger, R. F., Derringer, J., Markon, K. E., Watson, D., & Skodol, A. E. (2012). Initial construction of a maladaptive personality trait model and inventory for DSM–5. Psychological Medicine, 42, 1879–1890. http://dx.doi.org/10.1017/S0033291711002674
Kulas, J. T., & Stachowski, A. A. (2013). Respondent rationale for neither agreeing nor disagreeing: Person and item contributors to middle category endorsement intent on Likert personality indicators. Journal of Research in Personality, 47, 254–262. http://dx.doi.org/10.1016/j.jrp.2013.01.014
Lee, J., & Paek, I. (2014). In search of the optimal number of response categories in a rating scale. Journal of Psychoeducational Assessment, 32, 663–673. http://dx.doi.org/10.1177/0734282914522200
Likert, R. (1932). A technique for the measurement of attitudes. Archives de Psychologie, 140, 5–53.
Lozano, L. M., García-Cueto, E., & Muñiz, J. (2008). Effect of the number of response categories on the reliability and validity of rating scales. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 4, 73–79. http://dx.doi.org/10.1027/1614-2241.4.2.73
Matell, M. S., & Jacoby, J. (1972). Is there an optimal number of alternatives for Likert-scale items? Effects of testing time and scale properties. Journal of Applied Psychology, 56, 506–509. http://dx.doi.org/10.1037/h0033601
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
Nadler, J. T., Weston, R., & Voyles, E. C. (2015). Stuck in the middle: The use and interpretation of mid-points in items on questionnaires. Journal of General Psychology, 142, 71–89. http://dx.doi.org/10.1080/00221309.2014.994590
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1–15. http://dx.doi.org/10.1016/S0001-6918(99)00050-5
Reips, U. D., & Funke, F. (2008). Interval-level measurement with visual analogue scales in Internet-based research: VAS Generator. Behavior Research Methods, 40, 699–704. http://dx.doi.org/10.3758/BRM.40.3.699
Russell, C. J., & Bobko, P. (1992). Moderated regression analysis and Likert scales: Too coarse for comfort. Journal of Applied Psychology, 77, 336–342. http://dx.doi.org/10.1037/0021-9010.77.3.336
Symonds, P. M. (1924). On the loss of reliability in ratings due to coarseness of the scale. Journal of Experimental Psychology, 7, 456–461. http://dx.doi.org/10.1037/h0074469
Tellegen, A., & Ben-Porath, Y. S. (2008/2011). Minnesota Multiphasic Personality Inventory–2 Restructured Form: Technical manual. Minneapolis, MN: University of Minnesota Press.
Tourangeau, R., Rips, L. J., & Rasinski, K. (2000). The psychology of survey response. Cambridge, England: Cambridge University Press. http://dx.doi.org/10.1017/CBO9780511819322
Velez, P., & Ashworth, S. D. (2007). The impact of item readability on the endorsement of the midpoint response in surveys. Survey Research Methods, 1, 69–74.
Watson, D. (2004). Stability versus change, dependability versus error: Issues in the assessment of personality over time. Journal of Research in Personality, 38, 319–350. http://dx.doi.org/10.1016/j.jrp.2004.03.001
Weng, L. J. (2004). Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educational and Psychological Measurement, 64, 956–972. http://dx.doi.org/10.1177/0013164404268674
Wright, A. G. C., & Simms, L. J. (2014). On the structure of personality disorder traits: Conjoint analyses of the CAT-PD, PID-5, and NEO-PI-3 trait models. Personality Disorders, 5, 43–54. http://dx.doi.org/10.1037/per0000037

Received January 26, 2018
Revision received June 28, 2018
Accepted June 30, 2018
