
International Journal of Language Studies (IJLS), Vol. 5(2), 2011 (pp. 167-178)
Detecting gender bias in a language proficiency test
Hossein Karami, University of Tehran, Iran
The present study makes use of the Rasch model to investigate the
presence of DIF between male and female examinees taking the
University of Tehran English Proficiency Test (UTEPT). The results of
the study indicated that 19 items are functioning differentially for the
two groups. Only 3 items, however, displayed DIF with practical
significance. A close inspection of the items indicated that the presence
of DIF may be interpreted as impact rather than bias. Therefore, it is
concluded that the presence of differentially functioning items may not
render the test unfair. On the other hand, it is argued that the fairness
of the test may be under question due to other factors.
Keywords: Bias; UTEPT; Gender; Fairness; the Rasch Model; Delta Plot; DIF
1. Introduction
The primary concern in test development and test use, as Bachman (1990, p.
236) suggests, is demonstrating that the interpretations and uses we make of
test scores are valid. Moreover, a test needs to be fair for different test takers.
In other words, the test should not be biased with regard to test taker
characteristics (e.g., males vs. females, blacks vs. whites). Examining this
issue requires a statistical approach to test analysis that can first
determine whether the test items function differentially across test-taking
groups and then detect the sources of this variance (Geranpayeh & Kunnan,
2007). One of the approaches suggested for such purposes is called
Differential Item Functioning (DIF).
DIF occurs when examinees with the same level of ability but from two
different groups have different probabilities of endorsing an item (Clauser
& Mazor, 1998). DIF is synonymous with statistical bias, in which one or
more parameters of the statistical model are under- or overestimated
(Camilli, 2006; Wiberg, 2007). An item showing DIF may not be biased,
however, because it may reflect true differences in the ability levels of the
test-taking groups. For example, if we run a DIF analysis of the items answered by high
and low ability test takers, most of the items may show DIF. However, the
items cannot be judged as biased because the results of DIF analysis reflect
true differences between the groups. As Elder (1996) nicely puts it,
"Investigations of test bias are concerned less with the magnitude of score
differences between one group and another than with the nature of these
differences and whether or not they result from distortion in the
measurement process" (p. 235).
Test developers and users should provide evidence that their tests are free of
bias. The present study was an attempt to examine the presence of gender
bias in a language proficiency test.
2. Background
There have been various studies investigating the impact of a variety of
grouping factors on test performance in the context of language testing. A
number of researchers (e.g., Chen & Henning, 1985; Brown, 1999; Elder, 1996;
Kim, 2001; Ryan & Bachman, 1992) have focused on language background.
One of the earliest DIF studies in the context of language testing was Chen and
Henning (1985). They were concerned with the identification of linguistic or
cultural bias in a university placement test. The performance of Chinese and
Spanish test takers was compared using the Rasch model. Only four vocabulary
recognition items out of the 150 items analyzed in the research were flagged as
possessing DIF, all in favor of the Spanish examinees. On closer inspection, all four
English items were found to possess close cognate forms in Spanish (p. 159).
In a similar vein, Elder (1996) conducted a study to determine whether
language background may lead to DIF. She examined reading and listening
subsections of the Australian Language Certificates (ALC), a test given to
Australian school-age learners from diverse language backgrounds. Her
participants included those who were enrolled in language classes in years 8
and 9. The languages of her focus were Greek, Italian, and Chinese. Elder
(1996) compared the performance of background speakers (those who used
to speak the target language plus English at home) with non-background
speakers (those who were only exposed to English at home). She applied the
Mantel-Haenszel procedure to detect DIF. In the case of the Chinese reading
test, 10 items showed large DIF in favor of the background speakers and 7
others favored the non-background speakers. For the Chinese listening test,
16 out of 23 items showed large DIF.
Nine out of 30 items in the Italian reading test also displayed DIF. Five out of
22 items in the listening test were also flagged as showing large DIF. The
Greek listening and reading tests showed smaller amounts of DIF. Only 4
reading items and 3 listening items were detected as displaying large DIF.
Though Elder interpreted the results as not showing bias in this case, that
interpretation may be problematic when it is remembered that, for example,
in the case of the Chinese test, over 70 percent of the items showed DIF.
Brown (1999) investigated the relative importance of items, persons,
subtests, and languages to the performance on the TOEFL test using
Generalizability Theory. The participants were 15000 individuals from 15
different language backgrounds who had taken the 1991 version of the
TOEFL. Although the results indicated that language background accounted for
only a very small proportion of the test variance, Brown (1999) emphasized
that "as long as the variance component due to languages and their
interactions with other facets are not zero, we have a professional
responsibility to attend to differential item functioning results from both
ethical and statistical points of view" (p. 235).
A few studies have been undertaken to investigate the impact of academic
background on test performance. In one of the earliest attempts, Alderson
and Urquhart (1985) investigated the effect of students' academic discipline
on their performance on ESP reading tests. They concluded that academic
background can play an important role in test performance but the effects
were not consistent (p. 201). The authors concluded that more research is
needed.
In a more recent study, Pae (2004) undertook a DIF study of examinees with
different academic backgrounds. Participants were 14,000 examinees (7,000
Humanities and 7,000 Sciences) randomly selected from among 839,837 examinees
who took the 1998 Korean National Entrance Exam for Colleges and Universities.
He used Item Response Theory and Mantel-Haenszel procedures to analyze the
data. Pae (2004) reported that seven items were easier for the Humanities
group, whereas nine items favored the Sciences group.
Gender, as a grouping factor, has also received some attention from
researchers (e.g., Ryan & Bachman, 1992; Takala & Kaftandjieva, 2000). Ryan
and Bachman (1992), for example, compared the performance of male and female
test takers on the FCE and TOEFL tests. No significant difference was
reported between the performance of males and females.
Takala and Kaftandjieva (2000) undertook a study to investigate the presence
of DIF in the vocabulary subtest of the Finnish Foreign Language Certificate
Examination, an official, national high-stakes foreign-language examination
based on a bill passed by Parliament. The participants were a total of 475
examinees (182 males and 293 females). To detect DIF, Takala and
Kaftandjieva (2000) utilized the One Parameter Logistic Model (OPLM), a
modification of the Rasch model where item discrimination is not considered
to be 1, as is the case with the Rasch model, but is input as a known constant.
Over 25 percent of the items showed DIF in favor of either males or females.
To see whether test fairness is affected by the presence of DIF items, they ran
four separate ability estimates based on four subsets of items: the whole test,
items that were easier for males, items that were easier for females, and items
which showed equal difficulty levels for both groups. The results indicated
that the total test was not biased because, as the authors pointed out, there
were equal numbers of DIF items favoring males and females in the test.
However, there was a significant difference between the ability estimates
based on the subset favoring males and those based on the subset containing DIF items in favor of
females. As Takala and Kaftandjieva (2000) suggest, the results indicate the
importance of DIF analysis in item banking. Collecting a number of DIF items
in a test may result in bias against a group of test takers.
The present study was another attempt to detect the presence of gender-
induced DIF in a language proficiency test, the University of Tehran English
Proficiency Test (UTEPT). As the test is a high-stakes test (see below), it is
incumbent upon the test-developers and users to make sure that there is no
source of bias in the test. Any such variance will undermine the validity of the
test.
3. Method
3.1. Participants
The participants were 1562 examinees who had taken the University of
Tehran English Proficiency Test (UTEPT) in 2010. The participants were
divided into two groups based on their gender: 991 examinees (63.4 percent of
the total) were male, and 571 examinees (36.6 percent of the total) were
female.
3.2. Instrumentation
Applicants to the PhD programs of the University of Tehran are required
to document their qualification in terms of the score they obtain on a
proficiency test called the University of Tehran English Proficiency Test (UTEPT). As a
regulation, the candidates will not be allowed to sit for any PhD Entrance
Exam unless they present the criterion proficiency score. Thus, the score on
the proficiency test is a prerequisite for the acceptance into PhD programs.
The aim of the UTEPT is to select those individuals who have the required
level of English proficiency. The test is composed of three sections:
Grammar, Vocabulary, and Reading. The number of items for each section in
the 2010 version exploited in this study was as follows:
1) Structure and Written Expression (30 items)
2) Vocabulary (35 items)
3) Reading Comprehension (35 items)
All questions were in multiple-choice format. The reading section comprises
passages, each immediately followed by a number of comprehension questions;
the number of questions differs across passages. A total raw score, which is
simply the sum of the scores on the three subtests, is usually reported to
the candidates.
3.3. Procedure
A variety of DIF detection techniques have been offered during the last three
decades ranging from simple procedures based on difficulty indices (e.g.,
transformed item difficulty index (TID) or delta plot) to complex techniques
based on IRT. Despite the diversity of techniques, however, only a limited
number of them appear to be in current use. DIF detection techniques based
on difficulty indices are not common. Although they are conceptually simple
and their application does not require understanding complicated
mathematical formulas, they face certain problems, including the assumption
of equal discrimination across all items and the absence of matching for
ability (McNamara & Roever, 2006). If the first assumption is not met, the
results can be misleading (Angoff, 1993; Scheuneman & Bleistein, 1989). When
an item has a high discrimination level, it shows large differences between
the groups; conversely, differences between the groups will not be
significant for an item with low discrimination.
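To make the delta-plot idea concrete, the following Python sketch (not part of the original study) applies the standard ETS delta transformation to hypothetical proportions correct; in an actual delta plot, each item's two deltas are plotted against one another and items falling far from the principal axis are flagged.

```python
from scipy.stats import norm

# Hedged sketch of the delta transformation behind the delta-plot (TID) method.
# Proportions correct are converted to the ETS delta scale (mean 13, SD 4);
# harder items receive higher deltas. The proportions below are hypothetical.
def delta(p_correct: float) -> float:
    return 13 + 4 * norm.ppf(1 - p_correct)

p_males, p_females = 0.72, 0.64  # hypothetical proportions correct for one item
print(round(delta(p_males), 2), round(delta(p_females), 2))  # ~10.67 and ~11.57
```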
By way of contrast, methods based on item response theory are conceptually
elegant though mathematically very complicated. The main difference
between IRT DIF detection techniques and other methods, including logistic
regression and Mantel-Haenszel (MH), is that in non-IRT approaches
"examinees are typically matched on an observed variable (such as total test
score), and then counts of examinees in the focal and reference groups getting
the studied item correct or incorrect are compared" (Clauser & Mazor, 1998,
p. 35). That is, the conditioning or matching criterion is the observed
score. In IRT-based methods, however, matching is based on the examinees'
estimated ability level, or the latent trait.
From among the extant IRT models, the Rasch model has gained a unique status
among practitioners due to its firm theoretical underpinnings and its
relation to conjoint measurement theory. Specifically, the theory of conjoint
measurement exploits the ordinal relations between a set of variables to
detect quantitative structure in them (for excellent expositions of conjoint
measurement see Michell, 1990, 2003). Like other IRT models, the Rasch
model focuses on the probability that person n endorses item i. Unlike
other IRT models, however, the Rasch model takes into account only person
ability and item difficulty, fixing item discrimination at 1. The
probability of endorsing an item is modeled as a function of the
difference between person ability and item difficulty. The following formula
shows just this:

$$ P(X_{ni} = 1) = \frac{\exp(b_n - d_i)}{1 + \exp(b_n - d_i)} $$

where $b_n$ is person ability and $d_i$ is item difficulty. The formula simply
states that the probability of endorsing the item is a function of the
difference between person ability, $b_n$, and item difficulty, $d_i$. This is
possible because item difficulty and person ability are on the same scale in
the Rasch model. It is
also intuitively appealing to conceive of probability in such terms. The Rasch
model assumes that any person taking the test has an amount of the construct
gauged by the test and that any item also shows an amount of the construct.
These values work in opposite directions. Thus, it is the difference between
item difficulty and person ability that counts (Wilson, 2005).
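As an illustration only (not part of the original analysis), a minimal Python sketch of the dichotomous Rasch model just described; the ability and difficulty values are hypothetical logit values.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

# Hypothetical logit values: when ability equals difficulty, P = .50;
# a person one logit above the item's difficulty succeeds with P of about .73.
print(rasch_probability(0.0, 0.0))   # 0.5
print(rasch_probability(1.0, 0.0))   # ~0.73
print(rasch_probability(-1.0, 0.5))  # ~0.18
```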
The Rasch model provides us with sample-independent item difficulty indices.
DIF occurs when this invariance does not hold in a particular application of
the model (Engelhard, 2009). That is, the indices become dependent on the
sample that takes the test. The amount of DIF is calculated here by the
separate-calibration t-test approach first proposed by Wright and Stone
(1979; see Smith, 2004).
The formula is the following:

$$ t = \frac{d_{i1} - d_{i2}}{\sqrt{s_{i1}^2 + s_{i2}^2}} $$

where $d_{i1}$ is the difficulty of item $i$ in the calibration based on group
1, $d_{i2}$ is its difficulty in the calibration based on group 2, and
$s_{i1}$ and $s_{i2}$ are the standard errors of estimate for $d_{i1}$ and
$d_{i2}$, respectively. Winsteps (Linacre, 2010b) applies this formula in its
DIF analysis.
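A minimal sketch of this separate-calibration t statistic in Python; the calibration values below are hypothetical, not taken from the UTEPT data.

```python
import math

def dif_t(d1: float, d2: float, se1: float, se2: float) -> float:
    """Separate-calibration t statistic (Wright & Stone, 1979): the difference
    between the two difficulty estimates divided by the pooled standard error."""
    return (d1 - d2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical calibrations of one item for group 1 (males) and group 2 (females)
t = dif_t(d1=0.35, d2=-0.15, se1=0.09, se2=0.11)
print(round(t, 2))  # ~3.52, significant at the .05 level with large samples
```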
The following steps were followed in this study:
1. Checking whether the data fits the model
2. Checking the assumptions of the Rasch model
3. DIF analysis
4. Results and Discussion
4.1. Checking model-data fit
IRT models in general, and the Rasch model in particular, are falsifiable
models in the sense that the claims of the model hold if and only if certain
predictions of the model are empirically realized in the data (Hambleton,
Swaminathan, & Rogers, 1991). That is, the Rasch model can be applied to a
set of data only if the data fit the model's predictions. For example, one of the
predictions of the model is that highly proficient test-takers will endorse
more items than those with low proficiency levels. Of course, some variation
in the data is acceptable as the Rasch model is stochastic not deterministic.
However, if the variation is more than the expectations of the model, then the
data will not fit the model and the model cannot be applied. Model-data fit
was checked using fit statistics. All but one item fit well with the
expectations of the model. The outfit mean-square for this item was 1.31. It
is generally recommended that outfit mean-square statistics beyond 1.3 are
problematic (Baghaei, 2009; Bond & Fox, 2007; Linacre, 2009). After a close
inspection of the empirical ICC for this item, the researcher decided to
remove it from the analysis.
4.2. Checking the assumptions
The Rasch model, like other IRT models, has two foundational assumptions
for the data. The first assumption is called unidimensionality. This assumption
requires that any measurement scale should gauge one, and only one, ability
or attribute at a time. That is, a test of, say, strategic competence should only
measure strategic competence and successful performance on the test should
not be affected by the test takers' organizational competence. In other words,
only one ability is being measured by a set of items in a test (Hambleton,
Swaminathan, & Rogers, 1991).
Another assumption closely related to unidimensionality is local
independence. This assumption requires that the response to an item should
be independent of the responses to all other items. In other words, the
response to an item should not affect performance on other items. In fact,
local independence implies that the responses to a set of items are
independent when the ability is controlled for. That is, after the ability of
interest is controlled for, or it is held constant, there should remain no
relationship between the responses to the items. Hence, local independence is
sometimes called conditional independence.
Unidimensionality can be checked by a Principal Components Analysis of the
residuals (Linacre, 2009). Here it is required that after the first Rasch
modeled dimension is taken into account, no systematic pattern should
remain among the residuals. A PCA showed that the Rasch modeled dimension
accounted for about 22.5 eigenvalue units, while the first contrast in the
residuals explained 3 eigenvalue units. Eigenvalues index the amount of
variance explained in the data, and the total of the eigenvalues equals the
number of items. Following suggestions from Linacre (2010a), we cross-
plotted the results of the analysis from the total test with all the subtests and
also among the subtests themselves. All disattenuated correlations were 1
indicating that all the subtests and the total test were telling the same story.
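The unidimensionality check can be sketched as follows. The function assumes person abilities, item difficulties, and a persons-by-items response matrix from a prior Rasch calibration (simulated placeholders here), and it reproduces only the residual-contrast side of the Winsteps output.

```python
import numpy as np

def residual_contrast_eigenvalues(responses, ability, difficulty):
    """PCA of standardized Rasch residuals: after removing the Rasch-expected
    scores, the eigenvalues of the inter-item residual correlation matrix are
    inspected; a large first contrast would suggest a secondary dimension."""
    expected = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
    variance = expected * (1 - expected)
    std_residuals = (responses - expected) / np.sqrt(variance)
    corr = np.corrcoef(std_residuals, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Hypothetical usage with simulated Rasch-conforming data
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
difficulty = rng.normal(size=10)
p = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((200, 10)) < p).astype(float)
print(residual_contrast_eigenvalues(responses, ability, difficulty)[:3])
```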
Local independence was also ensured by analyzing the standardized residual
correlations reported in Winsteps. The largest correlation, .59, was between
two vocabulary items. A close inspection of these items showed that one item
had been mistakenly repeated in the test, so one of the two was deleted from
the analysis. All other residual correlations were below .3, posing no
violation of the local independence principle.
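A similar sketch for the local-independence screen, flagging item pairs whose standardized-residual correlation exceeds .3; the residual matrix is again assumed to come from the calibration, and the labels are hypothetical.

```python
import numpy as np
from itertools import combinations

def flag_dependent_pairs(std_residuals, item_labels, cutoff=0.3):
    """Return item pairs whose standardized-residual correlation exceeds the
    cutoff, a possible sign of local dependence (e.g., a repeated item)."""
    corr = np.corrcoef(std_residuals, rowvar=False)
    return [
        (item_labels[i], item_labels[j], round(float(corr[i, j]), 2))
        for i, j in combinations(range(len(item_labels)), 2)
        if abs(corr[i, j]) > cutoff
    ]
```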
4.3. DIF analysis
The results of the DIF analysis are presented in Table 1. Note that only
items with significant DIF at the p < .05 level are included in the table. The DIF
Contrast column shows the difference between the item difficulty levels in the
two calibrations of the items for the two groups, Males and Females. It is the
numerator of Wright and Stone's (1979) t formula discussed earlier. The
researcher had coded the Males group as 1 and the Females group as 2. Thus,
positive DIF contrasts favor the Females group and negative ones favor the
Males. Out of the 19 items that displayed significant DIF, 7 favored the
Males while the remaining 12 showed DIF in favor of the Females.
One problem with significance tests is the fact that they are extremely
sensitive to sample size. With large sample sizes, even minor differences will
be significant. On the other hand, huge differences between the groups will
not be significant if the sample size is small. In the present study, the sample
size was large. Therefore, taking the significance tests at face value may not
be justified.
In order to render the results more plausible, the Bonferroni correction was
applied. Simply put, the Bonferroni correction proceeds by distributing the
alpha level, in this case 0.05, among all the comparisons such that 0.05 is
the sum of the alpha levels for all those comparisons (see Thompson). In the
present study, we had 98 comparisons. Thus, dividing the alpha level, 0.05,
by the number of comparisons, 98, gives us the new alpha level: approximately
0.0005. (Note that two items were discarded from the study, one due to misfit
and the other because of local dependence.)
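The Bonferroni adjustment itself amounts to the following arithmetic (a sketch; the p-values shown are a hypothetical subset echoing Table 1):

```python
# Bonferroni-adjusted alpha for the 98 item-level comparisons retained in the study
alpha = 0.05
n_comparisons = 98
adjusted_alpha = alpha / n_comparisons
print(round(adjusted_alpha, 5))  # ~0.00051, i.e. roughly 0.0005

# Hypothetical subset of item p-values (rounded as in Table 1)
p_values = {"Grammar 1": 0.0001, "Grammar 2": 0.0153, "Grammar 10": 0.0000}
flagged = [item for item, p in p_values.items() if p < adjusted_alpha]
print(flagged)  # ['Grammar 1', 'Grammar 10']
```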
Applying this new significance level, only 3 of the 19 items show
significant DIF at the p < 0.0005 level. All three items favor the Females.
Interestingly, these were all grammar items (namely, grammar items #1, #8,
and #10). The items tested the examinees' knowledge of the verb forms used
after the verb "let", the use of appositives, and cohesive ties such as "for
example". As there are no signs of gender-related clues in these items, it
may be safely concluded that the source of DIF in this case is not bias. In
addition, there are only three items with significant DIF in the test and they
may not be enough to render the test unfair.
Table 1
DIF Results

Number  Item            DIF Contrast      t    Probability
1       Grammar 1*               .43   3.85          .0001
2       Grammar 2                .27   2.43          .0153
3       Grammar 6               -.32  -2.41          .0160
4       Grammar 8*               .50   3.88          .0001
5       Grammar 10*              .67   4.51          .0000
6       Grammar 13               .34   2.37          .0181
7       Grammar 15              -.35  -2.16          .0307
8       Grammar 17               .24   2.12          .0344
9       Grammar 18               .31   2.79          .0054
10      Grammar 23               .25   2.02          .0435
11      Grammar 24               .37   3.27          .0011
12      Vocabulary 10           -.29  -2.66          .0079
13      Vocabulary 13            .29   2.47          .0137
14      Vocabulary 15           -.38  -3.46          .0006
15      Vocabulary 24           -.29  -2.51          .0121
16      Vocabulary 26           -.37  -2.29          .0220
17      Vocabulary 27            .36   3.08          .0021
18      Reading 15               .31   2.53          .0114
19      Reading 23              -.30  -2.56          .0105

Note. Positive DIF contrasts favor the Females group; negative contrasts favor the Males. * Items remaining significant after the Bonferroni correction (p < .0005).
On the other hand, one item was deleted from the analysis because it had
mistakenly been included twice in the test. This is unfair to the examinees
who did not know the correct answer, because they lose two marks for a single
item. As the test is a high-stakes test, with grave consequences for the
test-takers, even this much unfairness is not acceptable.
5. Conclusion
The results of the present study indicate that there were 19 items with
significant DIF in the UTEPT. Only 3 items, however, had practical
significance. All three items were from the Grammar part. As there were no
signs of gender-related clues in the test, it may be concluded that the source
of DIF is not bias.
The fairness of the test, however, is not fully supported either, owing to
the existence of a repeated item in the test. This item has surely affected
the scores of those who did not know the correct response, and unfairly so,
because they lost two marks for one item. This is clearly unacceptable
for a high-stakes test like UTEPT.
The final point is that fairness is a broad concept which encompasses much
more than a mere DIF analysis of the items (see Davies, 2010; Kane, 2010;
Kunnan, 2010; Xi, 2010). Also, as Camilli (2006) points out, DIF analysis only
looks at the performance of different groups rather than individuals. Such
analyses cannot reveal the equally possible bias against particular
individuals. Therefore, in order to assess the fairness of UTEPT, further
research specifically focused on the performance of individuals rather than
groups is needed.
Acknowledgments
I am deeply indebted to Dr. Mohammad Ali Salmani Nodoushan for the
encouragement and support I received for the present study. Special thanks
are also due to the anonymous reviewers for their constructive comments on an
earlier draft of the paper.
The Author
Hossein Karami holds an MA in TEFL from the Faculty of Foreign Languages,
University of Tehran, Iran. His research interests include language testing in
general, and validity and fairness in particular.
References:
Alderson, J. C., & Urquhart, A. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 2, 192-204.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-24). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Baghaei, P. (2009). Understanding the Rasch model. Mashhad: Mashhad Islamic Azad University Press.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Lawrence Erlbaum.
Brown, J. D. (1999). The relative importance of persons, items, subtests and languages to TOEFL test variance. Language Testing, 16, 217-238.
Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 221-256). New York: American Council on Education & Praeger.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155-163.
Chihara, T., Sakurai, T., & Oller, J. W. (1989). Background and culture as factors in EFL reading comprehension. Language Testing, 6(2), 143-151.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
Davies, A. (2010). Test fairness: A response. Language Testing, 27(2), 171-176.
Elder, C. (1996). The effect of language background on foreign language test performance: The case of Chinese, Italian, and Modern Greek. Language Learning, 46, 233-282.
Engelhard, G. (2009). Using item response theory and model-data fit to conceptualize differential item and person functioning for students with disabilities. Educational and Psychological Measurement, 69(4), 585-602.
Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4, 190-222.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Kane, M. (2010). Validity and fairness. Language Testing, 27(2), 177-182.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89-114.
Kunnan, A. J. (2010). Test fairness and Toulmin's argument structure. Language Testing, 27(2), 183-189.
Linacre, J. M. (2010a). A user's guide to WINSTEPS. Retrieved May 2, 2010, from http://www.winsteps.com/
Linacre, J. M. (2010b). Winsteps (Version 3.70.0) [Computer software]. Beaverton, OR: Winsteps.com.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA: Blackwell.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education & Macmillan.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Michell, J. (2003). Measurement: A beginner's guide. Journal of Applied Measurement, 4(4), 298-308.
Pae, T. (2004). DIF for learners with different academic backgrounds. Language Testing, 21, 53-73.
Ryan, K., & Bachman, L. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12-29.
Scheuneman, J. D., & Bleistein, C. A. (1989). A consumer's guide to statistics for identifying differential item functioning. Applied Measurement in Education, 2, 255-275.
Smith, R. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5(4), 430-449.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323-340.
Wiberg, M. (2007). Measuring and detecting differential item functioning in criterion-referenced licensing test: A theoretic comparison of methods (Educational Measurement Technical Report No. 2).
Wilson, M. (2005). Constructing measures: An item response modeling approach. London: Lawrence Erlbaum Associates.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147-170.
