
Running Head: ASSESSING SECOND LANGUAGE PROFICIENCY: FINAL PROJECT


Proficiency Assessment Proposal


Angela Sharpe
Colorado State University


Introduction
Effective learner assessment is an important part of a workplace English program because the outcomes can have serious impacts on learners' employment options. While adult assessment has traditionally involved standardized tests such as the BEST Plus or BEST Literacy (Center for Applied Linguistics, 2014), many programs are moving to other, more qualitative means of assessment such as portfolios, periodic observations with focused checklists, and interviews with learners and supervisors. These methods can assess a learner's progress as well as reflect learning outcomes that are more representative of the real world (Lytle & Wolfe, 1989). The test proposed in this project combines standardized items and alternative means of assessment, including a mock interview, in order to prepare adults for the reality of the target language domain of applying and interviewing for jobs
and promotions.
One major factor contributing to the need to adequately assess language ability and proficiency in adult language learners stems from the current shifting demographics in the United States, which have resulted in growing linguistic diversity in the workforce. According to the 2007 census, 20% of individuals in the United States reported speaking a language other than English at home, and this number is expected to continue to increase (U.S. Census Bureau). This amounts to a large population of people adjusting to a new culture, a new language, and new employment. Many companies are taking a proactive approach to the language barriers their employees face. One such approach is offering Workplace English classes to non-native English speaking employees. Many companies view the investment in such programs as a way to retain productive and satisfied employees, thus increasing morale and potentially shrinking the tremendous cost of high employee turnover. This represents a soft benefit for a small investment. The second language proficiency test proposed in this project could fulfill a company's need to assess language ability in a Workplace English program for the purpose of properly placing employees in appropriate levels of English language training. This could maximize employees' time in class as well as give an employer a basis for determining and comparing the level of English ability and performance their employees have gained through Workplace English classes. Unlike many other standardized proficiency tests, this test is competency-based and is aimed specifically at Workplace English content and objectives, including tasks which replicate real-world communicative acts.
Overall organization of this proposal
This paper details the proposed pilot test by describing the test's purpose, type, how the scores are to be interpreted, the TLU domain from which the tasks can be used to make inferences, a definition of the constructs assessed by the test, the table of specifications and the test tasks, and the development and characteristics of the rubrics. Because this is a proposal for a test that has not yet been administered, the ideal participants, administration procedure, and scoring procedures will only be described for the proposed test. The qualitative and quantitative analyses for the proposed pilot test will be discussed in terms of item statistics, descriptive statistics, reliability statistics, and the standard error of measurement derived from the reliability statistics. Additionally, a section of this paper demonstrates ways in which item performance can be used to evaluate the proposed pilot test's usefulness in terms of reliability, construct-related evidence for validity, and practicality.


Description of the Proposed Test


Test Purpose
This English as a second language test is aimed at assessing the sociocultural and sociolinguistic needs of adult learners. The general language areas of the test focus on content appropriate for gaining entry into the workplace in the United States. The purpose of the test, therefore, is to measure English proficiency against learning outcomes which are grounded in the real world, specifically language used for obtaining and maintaining employment in the United States. The assessment integrates the skills of listening comprehension and speaking, reading and writing, and vocabulary.
This proficiency test will provide evidence of reading, writing, and vocabulary abilities, as well as listening comprehension and speaking abilities, in a workplace interview context. Chapelle, Enright, and Jamieson (2008) promote a task-centered approach to assessment, whereby the tasks in the assessment focus on the types of tasks deemed important in the real-world context (e.g., the workplace). In this case, interpretations of the scores must take into account the context of language use. Score interpretations, therefore, help to justify making inferences between the scores on test tasks and ability in the real-world tasks to which they correspond.
The test could have an impact on at least three stakeholders. First, the test takers themselves could be affected by their scores on the proficiency test through their placement into a level of English; this may amount to them spending more or less time than they had anticipated learning English. Second, administrators or program coordinators may be affected by the results of the test in that they may have to provide more or less English instruction at a particular level than they had anticipated. Third, the expense of financially supporting programs such as Workplace English programs affects a company's fiscal budget; however, the company may ultimately benefit from more linguistically competent employees. Additionally, there may also be a positive impact on society as learners increase their ability to communicate within many contexts of society.
The usefulness of a test depends on a variety of criteria, one of which is authenticity, or the degree to which an assessment task corresponds to the target language use (TLU) domain. Bachman and Palmer (2010) have formulated a framework which uses a set of characteristics to describe assessment tasks and relate them to real-world tasks. The characteristics of real-world tasks, as outlined in the framework, guide the conceptualization and design of assessment tasks which simulate the skills and abilities required in the TLU domain. A more in-depth description of the TLU domain is given in a subsequent section. Assessment tasks based upon the TLU domain characteristics allow us to determine the ways in which, and the extent to which, a learner's language ability is engaged. These characteristics help test designers devise tasks which can be generalized to the setting beyond the test itself. See Appendix A for a description of common TLU tasks.
Type of Test
Green (2013) defines a proficiency assessment as an instrument used to measure whether or not a person's language ability adequately satisfies a predetermined standard or need, and distinguishes it from an educational assessment (i.e., an assessment that centers on learning outcomes from a particular course of study). For this reason, proficiency assessments are often used for placement and/or gatekeeping decisions such as immigration, educational, and employment opportunities. Adult proficiency assessments, therefore, can have a profound impact on quality of life, the level of success achieved, and opportunities for advancement.
Interpretation of Scores
The interpretation of scores on the proposed test will be norm-referenced. The basic purpose of any norm-referenced test is to spread students out along a continuum of language abilities (Brown, 2003). This type of interpretation describes a learner's performance as a position relative to a known group. In this test, scores are interpreted according to performance standards and measurable outcomes. The scores are interpreted as indicators of a learner's proficiency level, such as beginner, intermediate, or advanced. In this way, the emphasis is placed on discriminating among learners in order to measure language abilities upon entrance into a language learning level. Miller, Linn, and Gronlund (2009) add that the goal of a placement assessment is to determine the relative position of each student in the instructional sequence as well as the mode of instruction that would most benefit them.
TLU Domain
A primary interest in language assessment is the ability to make generalized interpretations about a test taker's language ability through their performance on test tasks relative to the language required in similar tasks in the target language use domain. Bachman and Palmer (2010) define a target language use (TLU) domain as a specific setting outside of the test itself which requires learners to perform similar language use tasks. As the tasks on this assessment come from a specific target language use domain (the workplace), they are called target language use tasks and are the sources from which interpretations of the test taker's language abilities are generalized. The tasks on this test represent one TLU domain, that of language for obtaining employment. Each TLU task, however, can be used to make a generalization about language ability within the TLU domain. For example, interpretations from the test taker's performances on tasks 1, 2, and 3 can be used to make generalizations about their language abilities within the real-world domain of applying and interviewing for employment within the TLU domain of the workplace. See Appendix A for an extended description of the TLU domain according to the framework of characteristics by Bachman and Palmer (2010).
Construct Definition
An important component of test validity is defining the construct(s) which assessment tasks measure. As this is a proficiency test, it is constructed according to a theoretical framework of language ability. The constructs for the tasks on this test are defined in terms of the framework for language ability proposed by Bachman and Palmer (1996; 2010). Task 1 assesses language ability within the construct of grammatical knowledge; specifically, the constructs of task 1 are receptive recall and knowledge of vocabulary meaning. Task 2 also assesses language ability within the construct of grammatical knowledge. Specifically, the constructs of this task are syntactic knowledge, functional knowledge of interpreting relationships between sentences and paragraphs, and vocabulary knowledge of meaning in context; the strategy of skimming and scanning is assumed. Task 2 also measures textual knowledge (cohesion and rhetorical organization), comprehension of meaning in context, and recognition of information and vocabulary. The constructs of task 3 are productive and are measured in terms of language ability relating to pragmatic knowledge: functional knowledge (ideational functions and manipulative [interpersonal] functions) and sociolinguistic knowledge of register.
Design of the Test: Table of Specifications


The purpose of a table of specifications is to ensure that there is a representative sample of tasks in relation to what the test is intended to measure. The design of the table lists the
objectives across the top row and lists the content down the first column. The bottom row of the
table shows the percentage allocated for each objective. Similarly, the last column contains the
percentage allocation for each content section of the test. In this way, the table of specifications
provides the skeleton design of the test (see Appendix B for the actual table of specifications).
The table of specifications for this proposed pilot test includes three tasks representing three
learning objectives. Task 1 includes 15 objective matching items covering workplace vocabulary
meaning and a knowledge (recall) learning objective for each item. Task 2 includes 15
information transfer items focused on the learning objective of interpretation, meaning in
context, and recognition and transfer of information. Task 3 includes 6 speaking prompts
focused on the learning objectives of analysis, application, and synthesis of information. Task 1
and task 2 together account for 84% of the items on the assessment while task 3 accounts for
16% of the items.
Description of Test Tasks
The test includes three parts comprised of three different tasks (see Appendix C for the actual test). The tasks are sequenced from selection-item tasks which assess receptive knowledge to tasks which gradually require and assess more productive knowledge. The rationale for this sequencing is that the test taker can use the information from one task in order to complete the next. The first task includes selection item types that assess learners' receptive knowledge (recall) of workplace vocabulary and terms associated with job applications and the workplace. The second task assesses learners' receptive and productive knowledge by requiring them to read paragraphs containing information that they must transfer into a gap-fill exercise in the form of a job application. An analysis using Compleat Lexical Tutor (Cobb, 2002; Heatley et al., 2002) reveals that the language used in the paragraphs is largely restricted to K-1 and K-2 words. K-1 and K-2 words come from the General Service List (West, 1953, cited in Bauman & Culligan, 1995) and represent the 1,000 and 2,000 most frequently used words in English. Approximately 3% of the words are also represented on the AWL (Coxhead, 2000, as cited in Cobb, n.d.). The rationale for constraining the vocabulary to these levels was to make the paragraphs accessible to learners with beginning and intermediate vocabulary levels. Of particular interest to test developers is the type-token ratio, which is the number of different words in a text (types) divided by the total number of running words (tokens). If learners are being encouraged to increase the variety of words (breadth) used in their vocabulary, they should be reading texts with a higher type-token ratio. The following chart breaks down the token % for each list:

List        Token %    Cumulative token %
K-1         76.16       76.16
K-2         10.60       86.76
AWL          3.31       90.07
Off-list     9.93      100.00

The off-list tokens are the proper names included in the paragraphs.
The following chart illustrates the lexical density for the paragraphs:

Freq. Level    Families (%)    Types (%)     Tokens (%)     Cumul. token %
K-1 Words      76 (69.72)      82 (65.08)    230 (76.16)    76.16
K-2 Words      24 (22.02)      25 (19.84)     32 (10.60)    86.76
AWL Words       9 (8.26)        9 (7.14)      10 (3.31)     90.07
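Because the type-token ratio figures in the lexical analysis above, a minimal sketch in Python of how it can be computed may be helpful. This is an illustrative example only: the profiling reported here was produced with Compleat Lexical Tutor, and the sample sentence and function name below are hypothetical.

```python
# A minimal sketch of computing a type-token ratio (TTR); illustrative only.
import re

def type_token_ratio(text: str) -> float:
    """Distinct word forms (types) divided by total running words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

sample = "Mary worked as a head cook and as a line cook."
print(round(type_token_ratio(sample), 2))  # 8 types / 11 tokens = 0.73
```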


The third task assesses the learners' receptive and productive knowledge by integrating speaking and listening skills. In this task, learners should be able to apply and synthesize information from the previous tasks in their oral responses to six speaking prompts relevant to the TLU domain of interviewing for a job. In total, the test includes 36 items worth 36 points. Task 1 accounts for 15 matching items; task 2 includes 15 gap-fill information transfer items whose input comes from four paragraphs; and task 3 includes 6 prompts where the oral response is graded with a holistic rubric, receiving an overall single score on a 1 to 6 scale. As this assessment represents a proposed pilot test, the time allotment and task order may change according to observation and feedback collected during a trialing stage. The decision to allow 60 minutes is based on an example description of assessment parts and their time allotments in Bachman and Palmer (2010, p. 389), where the parts and the number of tasks in each part were very similar to the pilot test in this project.
The instructions for tasks 1 and 2 are briefly explained in writing before each task, along with the scoring method and recommended time allotment. Tasks 1 and 2 are non-reciprocal tasks in which there is a direct relationship between the information supplied in the input and a successful response. The items in task 1 have a limited amount of input and can be characterized as narrow in scope. Task 2, however, can be characterized as having a broader scope than task 1, as the language user has to process much more input in order to give a successful response. Task 3 is a reciprocal task: the test taker must engage in language with an interlocutor for a successful response to each prompt. This task also has an indirect relationship between the input and a successful response, since language users cannot draw upon the language in the input to give a successful response (see Appendix A for a TLU language task description).


The justification for the tasks (i.e., their sequence, the number of items, and the time allotted) is that the test will allow decisions to be made for proper placement into levels/classes. The sequence is set so that a learner is able to complete the easier, more receptive knowledge tasks first in the event that they are not able to complete the productive tasks. The stakeholders most directly affected by this test, then, are the test takers. According to Bachman and Palmer (2010), the potential consequences of this assessment for the test takers are: 1) a negative or stressful experience in preparing for and/or taking the test, 2) the negative or disappointing feedback they may receive about their performance on the test, and 3) the decisions, i.e. placement, that may be made about them on the basis of their performance. A justification for the content, learning objectives, and sequencing of the tasks, in terms of teachers and educational institutions, is that the test is designed to help alleviate the potential for a very mixed classroom by placing learners according to their language ability as inferred from their test performance. The underlying motivation for developing this test stems from my own experience teaching a mixed-level workplace English classroom, which resulted in what I perceived as a somewhat unfair instructional situation for all parties. The implication, then, is that an assessment of proficiency for the purpose of placement has the goal, as in all classroom testing and assessment, of improving learning and instruction (Miller, Linn, & Gronlund, 2009). See Appendix C for a copy of the test and scoring key.

Development of Rubrics
The rubric used in task 3, the speaking portion of the test, scores levels of language ability, in this case speaking ability. The listening and speaking portion of the test (task 3) is scored using a holistic rubric adapted from the National Reporting System (NRS) rubric used in many other adult education performance assessments. The language in the criteria differs from the language used to describe outcomes in the NRS in that it is specific to workplace English. Also, score reporting numbers were modified in this rubric to correspond to the scoring in the proposed test. The NRS was established by the Department of Education as a result of the Workforce Investment Act of 1998, which requires that each state develop and implement a comprehensive accountability system to interpret and demonstrate individual learner progress and performance (Mislevy & Knowles, 2002). Similar to the criteria which Bachman and Palmer (2010) use to define the usefulness of performance assessments, the NRS considers performance assessments to be assessments which require test takers to demonstrate their language skills and knowledge through tasks which closely resemble real-world situations or settings (the TLU domain). Each level in the oral proficiency rubric, except the beginning level, contains three criteria, of which the test taker must meet two in order to score into that language level. The level number from the oral proficiency rubric is then added to the scores of tasks 1 and 2 to indicate an overall level of proficiency. Each level has a descriptor of skills and language ability, and all components of language ability are considered together as a single unitary ability. In this way, the emphasis is placed upon what the test taker can do instead of what they cannot do. This type of rubric often has higher inter-rater reliability because it is easier to train raters to use and it can save time by minimizing the number of decisions a rater has to make compared with analytic scales, which consider each language component separately. See Appendix D for the rubrics.


Pilot Test Procedure


Participants
Ideally, this test would benefit working adults with varying levels of proficiency in that they would be properly placed into a level of English so that they could build on what they already know. This would include immigrants with limited literacy in their L1, immigrants with adequate literacy in their L1, recent arrivals to the United States with varying amounts of English background, and long-term English learners.
Administration
Ideally, the administration of the test would take place in an appropriate test-taking area such as a meeting room or classroom. The test could also be administered at another location so long as it is conducive to testing, e.g., adequate light, quiet, and tables and chairs available. The test can be administered on the first day of class so that by the next class session administrators and instructors have an indicator of which level to place students in. The test should take no longer than the length of a class, approximately one hour.

Scoring Procedures
The test includes a total of 36 items worth 36 total points. Tasks one and two are scored 0 = incorrect / 1 = correct for a total of 30 points. The speaking portion of the assessment is scored using a 6-point holistic rubric. This score is then added to the number of correct items from tasks one and two to obtain an overall score. The overall score can be interpreted as a language level according to an overall language level rubric which describes functional and workplace skills and outcome measures for each level. Test takers will see their score and be able to interpret their score as a language level according to a chart included as part of the scoring form. See Appendix E for an example of the score reporting form.
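As a minimal sketch, the scoring procedure just described could be implemented as follows. The function and variable names are illustrative, not part of the proposed scoring form; the level boundaries come from the chart in Appendix E.

```python
# Maps a test taker's task scores to an overall placement level.
# Level boundaries follow the score reporting chart in Appendix E.

LEVELS = [
    (0, 5, "Beginning"),
    (6, 11, "Low Beginning"),
    (12, 17, "High Beginning"),
    (18, 23, "Low Intermediate"),
    (24, 29, "High Intermediate"),
    (30, 36, "Advanced"),
]

def overall_level(task1_correct: int, task2_correct: int, rubric_score: int) -> str:
    # Tasks 1 and 2 are scored dichotomously (max 15 each); task 3 is a 1-6 rubric score.
    total = task1_correct + task2_correct + rubric_score
    for low, high, label in LEVELS:
        if low <= total <= high:
            return f"{total}/36: {label}"
    raise ValueError("score out of range")

print(overall_level(12, 10, 4))  # 26/36: High Intermediate
```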
Test Results
Item Statistics
There are many types of item analyses that can be done once a pilot test has been administered and scored. For norm-referenced tests, these analyses are important for determining which items are of appropriate difficulty and discriminate between high- and low-achieving students, as well as for identifying faulty items. Miller, Linn, and Gronlund (2009) suggest using
item analyses to answer the following questions:
1. Did the item function as intended?
2. Was the test item of appropriate difficulty?
3. Was the test item free of irrelevant clues and other defects?
4. Were the distracters effective (in multiple choice items)?
(p. 351)
The second question in the list is especially relevant when analyzing items in a norm-referenced assessment such as the pilot test developed for this project. One simple way to analyze items for difficulty is to rank the scores in order from highest to lowest and compare the responses of the highest scoring students to those of the lowest scoring students. For example, if this pilot test were administered to 40 students, we would rank the tests in order from highest to lowest and select the 10 tests with the highest scores and the 10 tests with the lowest scores for item difficulty comparison. In a chart we could tabulate, for each item, the number of students in the upper group and the lower group who selected each alternative. From this information we could calculate the difficulty of each item by finding the percentage of students who got the item correct. Then we could calculate the discriminating power of each item by finding the difference between the number of students in the upper group and the lower group who got the item correct. If more students in the upper group get an item correct than in the lower group, the item is discriminating positively because it is distinguishing between the high and low scorers. To find the percentage difficulty for an item we would find the number of students in both groups who got the item correct out of the total number of students. For example, if 9 students in the upper group and 5 students in the lower group got the item correct out of 20 students, the item difficulty would be 70%. If we analyzed the lower group's alternative selections on an item and found that each alternative was selected by at least one student in the group, we could conclude that the other alternatives (distracters) for the item are operating effectively. These types of analyses could be done for task 1 and also for task 2, although the analysis of task 2 would be more subjective since no distracters are offered.
Another way to calculate item difficulty is with the formula:

P = 100R / T

In this formula, P represents the item difficulty, R equals the number of students who got the item correct, and T equals the total number of students who attempted the item. If we apply this to an item in task 1, use the scores from the highest 5 tests and the lowest 5 tests, and suppose that 5 students in the upper group and 3 students in the lower group got the item correct, the item difficulty would be:

P = 100 * 8 / 10 = 80%


Similarly, we could use the item discriminating power formula to find the index of discrimination for an item:

D = (R_U - R_L) / (0.5T)
D = (5 - 3) / 5 = .40

The index of discriminating power for this item is lower than the ideal index of .50, but anything greater than or equal to .30 is generally considered to discriminate well on a norm-referenced assessment. Miller, Linn, and Gronlund (2009, p. 362) state that when using norm-referenced classroom tests, item analysis provides us with a general appraisal of the functional effectiveness of the test items, a means for detecting defects, and a method for identifying instructional weaknesses. These types of analyses have more limited applicability for performance-based assessments such as a writing task. They are very helpful for test development (pilot projects) because they lead to the formation of an item bank consisting of strong items which can discriminate between ability levels.
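As a minimal sketch, the difficulty and discrimination formulas above can be implemented directly; the worked example reuses the numbers from the text, and the function names are illustrative.

```python
# Upper/lower-group item analysis for dichotomously scored (0/1) items.

def item_difficulty(r_upper: int, r_lower: int, total: int) -> float:
    """P = 100R/T: percent of students in both groups answering correctly."""
    return 100 * (r_upper + r_lower) / total

def discrimination_index(r_upper: int, r_lower: int, total: int) -> float:
    """D = (R_U - R_L) / (0.5T)."""
    return (r_upper - r_lower) / (0.5 * total)

# Worked example from the text: 5 of 5 upper and 3 of 5 lower correct.
print(item_difficulty(5, 3, 10))       # 80.0
print(discrimination_index(5, 3, 10))  # 0.4
```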
One other item statistic that can be used to evaluate items in task 1 is the item mean. The mean, or central tendency, is the average student response to an item. It can be calculated by adding the number of students who got the item correct (or the number of points earned by all students on the item) and dividing that total by the total number of students.
We could also look at the frequency and distribution of responses for an item. The frequency is the number of students who chose an alternative, and the distribution refers to the percentage of students who chose each alternative. Task 1 has 15 items and 17 alternatives, so we could look at how many students chose a specific alternative for item 1 and compare that to the percentage of the group overall who chose that alternative. Incorrect alternatives which are frequently chosen may indicate common misunderstandings in a group of students, which could mean the item is faulty or ambiguous.
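A minimal sketch of tallying the frequency and distribution of responses for a single matching item follows; the response letters are hypothetical.

```python
# Frequency and distribution of the alternatives chosen for one item.
from collections import Counter

responses = ["K", "K", "A", "K", "L", "K", "A", "K", "K", "L"]  # hypothetical; key = K
for choice, count in sorted(Counter(responses).items()):
    print(f"{choice}: {count} students ({100 * count / len(responses):.0f}%)")
# A: 2 students (20%)
# K: 6 students (60%)
# L: 2 students (20%)
```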
Descriptive Statistics
Descriptive statistics are used to investigate relationships between variables in a sample of test takers and between two or more groups of test takers. Descriptive statistical data help describe the distributions of data in order to decide whether or not significant relationships or differences exist between test takers and groups of test takers. One way to describe an aspect of the data is to look at performances as percentages. For example, for tasks 1 and 2 a test taker would have to receive a raw score of 30 to receive 100% on the two tasks; a test taker with a raw score of 27 between the two sections would have a percentage score of 90%. These two types of scores are examples of interval data, which represent how much or how little of the skill measured has been demonstrated by the test taker (Flahive, 2014). The scores could also be ranked as a means of comparison by listing them from highest to lowest; however, ranking the scores (ordinally) does not always show the interval differences, so using interval data gives a more precise picture of the data. For example, the interval scale for task 1 runs:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
We can find the central tendencies of the data: the mean, median, and mode. The mean is found by dividing the total of the scores or data points by the number of cases in the set. The median is the point in the data above and below which half of the other cases fall. The median is a good central tendency statistic if a distribution has outliers, because it is less sensitive to them than the mean. The mode is the most common value in a distribution of data. The range of the data can be calculated by subtracting the lowest number in the set from the highest number and then adding one.
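As a minimal sketch, these central tendency and range statistics can be computed with Python's statistics module, using the same hypothetical scores that appear in the variance example below.

```python
# Central tendencies and (inclusive) range for a hypothetical score set.
import statistics

scores = [23, 17, 21, 28, 27, 17, 26]

print(statistics.mean(scores))        # 22.714..., reported as 22.7 below
print(statistics.median(scores))      # 23
print(statistics.mode(scores))        # 17
print(max(scores) - min(scores) + 1)  # 28 - 17 + 1 = 12
```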


Another important descriptive statistic is the standard deviation. The standard deviation provides information on the variability of a score in relation to the mean. In order to calculate the standard deviation (SD) we must calculate the mean and the variance, since the SD is the square root of the variance. In an interval distribution of raw scores we can calculate the variance as follows:
1. calculate the mean
2. subtract the mean from each raw score
3. square each difference from step 2
4. add all of the squared differences together
5. find the variance by dividing the sum from step 4 by the total number of scores minus 1 (N - 1)
(Flahive, 2014)
The formula looks like this:

s^2 = Σ(x - x̄)^2 / (N - 1)

Example: hypothetical scores for tasks 1 and 2:

Student    Score    Mean    Difference Score    Difference^2
1          23       22.7     0.3                 0.09
2          17       22.7    -5.7                32.49
3          21       22.7    -1.7                 2.89
4          28       22.7     5.3                28.09
5          27       22.7     4.3                18.49
6          17       22.7    -5.7                32.49
7          26       22.7     3.3                10.89
                                          Sum = 125.43

Variance = 125.43 / 6 = 20.90
SD = 4.572
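The hand calculation above can be checked with a short script; this is a minimal sketch using the same hypothetical scores.

```python
# Variance (N - 1 denominator) and standard deviation, following steps 1-5 above.
scores = [23, 17, 21, 28, 27, 17, 26]

mean = sum(scores) / len(scores)                   # step 1
squared_diffs = [(x - mean) ** 2 for x in scores]  # steps 2-3
variance = sum(squared_diffs) / (len(scores) - 1)  # steps 4-5
sd = variance ** 0.5

print(round(variance, 2), round(sd, 3))  # 20.9 4.572
```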
Reliability and Standard Error of Measurement
Reliability in a test refers to the consistency of scores across tasks, forms, raters, and time. Reliability is considered a statistical argument in that it is a quantitative measurement and gives no qualitative information; it is a characteristic of the results, not of the test itself. Coefficient alpha is a type of reliability measure that checks for internal consistency across tasks. In this procedure, the test is given once and Cronbach's alpha formula is applied to the scores:

α = Nc̄ / (v̄ + (N - 1)c̄)

In this formula, N is the total number of items, c̄ is the average inter-item covariance, and v̄ is the average item variance. This formula shows how closely related a set of items are as a group by giving a reliability estimate between -1 and 1; the closer to 1, the more reliable. The split-half method is also a measure of internal consistency. In this method the test is given once, two equivalent halves are scored (odd items and even items), and the Spearman-Brown formula is applied to correct the correlation between the two halves to fit the whole test. Both formulas can be applied to a set of scores using SPSS software.
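As a minimal sketch, the coefficient alpha formula above can be applied to a matrix of item scores; the 0/1 responses below are hypothetical, and NumPy stands in for the SPSS procedure mentioned in the text.

```python
# Cronbach's alpha from the average inter-item covariance, per the formula above.
import numpy as np

# Hypothetical 0/1 item scores: rows = test takers, columns = items.
items = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

cov = np.cov(items, rowvar=False)           # item covariance matrix
n = items.shape[1]                          # N = number of items
v_bar = cov.diagonal().mean()               # average item variance
c_bar = cov[~np.eye(n, dtype=bool)].mean()  # average inter-item covariance
alpha = (n * c_bar) / (v_bar + (n - 1) * c_bar)
print(round(alpha, 2))  # 0.79 for this toy data
```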
On more subjective tests, such as task 3 in the pilot test, inter-rater reliability becomes an important quantitative measure of reliability. Inter-rater reliability measures the correlation between raters' rankings and scores over time. It is used to assess each rater's consistency over time, to show the degree to which different raters give consistent estimates of the same phenomenon, and to show agreement between the scores assigned by two different raters. Training raters and using a clear, concise, and reasonable rating scale can increase inter-rater reliability. The Spearman-Brown formula can be applied to measure inter-rater reliability, and the Pearson formula is the most common formula applied to compute inter-rater reliability on speaking assessment scores. Both of these formulas are available in Microsoft Excel and SPSS.
The standard error of measurement (SEM) is directly related to the reliability of a test. It is a measurement of the amount of variability in a student's performance due to random measurement error. Because it is not possible to administer an infinite number of parallel forms, the SEM is an estimate of the variation that should be considered when interpreting scores. The SEM defines a window of performance within which the true score lies; the true score cannot be calculated exactly because every assessment has some error, and the SEM helps determine the boundaries of that error. Whereas the reliability of a test falls between -1 and 1, the SEM is described on the same scale as the test scores; a higher SEM indicates lower reliability. The SEM depends on the assessment itself and not the test taker. The formula for calculating the SEM is the SD times the square root of 1 minus the reliability:

SEM = SD * √(1 - r)
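A minimal sketch of the SEM formula, reusing the hypothetical SD computed earlier and an assumed reliability estimate:

```python
# SEM = SD * sqrt(1 - r); defines an error band around an observed score.
import math

sd = 4.572          # hypothetical SD from the descriptive statistics section
reliability = 0.79  # assumed reliability estimate (e.g., coefficient alpha)

sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # about 2.1: a score of 26 has a band of roughly 26 +/- 2.1
```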
Discussion
Critique of Item Performance
As discussed above, the ways to critique an item's performance are by determining the item's difficulty and its discriminating power. It is also a good idea to have another professional, preferably someone teaching the same class, look over the items for any ambiguities or defects. As I was not able to pilot my test it is impossible to critique the items; however, I think task 2 may produce interesting results as it is a gap-fill task scored on a binary scale. This task might be better if turned into a more objective multiple choice task type.
Evaluation of Test Usefulness
A test is useful if it has utility for a given purpose. I believe this test includes tasks which are familiar to the test takers, as their purpose for taking a Workplace English class is to improve the language skills needed to secure and maintain employment. I believe that having the students do a mock interview represents a very authentic and interactive performance task. I think the test would not cause negative washback for either the employees or the employers, although there would be a financial investment for the employer. The practicality of the assessment depends on the investment that an employer wants to make in their employees. If they only hired one instructor, then this pilot test would not be practical, as there would be no need to place students into levels.
Reliability
The reliability of this test would depend heavily on the inter-rater reliability of task 3. Some strategies to increase the reliability of the pilot test would be to train raters well and to correlate the scoring scale with a benchmark. Also, task 2 could be made into a multiple choice task type, which would make the items more objective.
Construct Related Evidence for Validity
I feel the test has very good face validity as the task characteristics are very similar to the TLU domain, but without a sample it is difficult to assess whether or not there is construct-related evidence for validity. One piece of evidence for construct validity would be if students with higher language ability scored high on the test and students with lower language ability scored low. Another scenario that would provide evidence for construct validity is if students who had filled out job applications or done job interviews (in the TLU domain, i.e., the real world) before taking the assessment placed higher than students who had not. Such a finding would provide evidence that the task characteristics clearly represent the construct of interest. The tasks require language which is representative of the language in the same types of tasks in the TLU domain, making the tasks very interactive and authentic.
Consequential Evidence for Validity
I feel this test would have a positive impact on test takers because it would place them in the correct level within a Workplace English program; likewise, it could also be used as an indicator of language ability for an employer. Therefore, I think there is evidence of validity in that the assessment would be useful to all individuals involved.
Reflection on Personal Significance of Test
This test proposal project was a challenge. However, I think everything included in this project was relevant in helping us, as future teachers, see the importance of developing test items and administering tests which are fair, reliable, and valid. The descriptive statistics used for item difficulty and discrimination are simple measures that can be used to glean a lot of information from test items. One thing I will take away from this project pertains to writing instructions for items and developing rubrics for outcome measures: fairness is paramount. I also realize how difficult it is to write questions that are interactive, authentic, and valid.

References
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford: Oxford University Press.
Brown, J. D. (2003). Norm-referenced item analysis (item facility and item discrimination). Shiken: JALT Testing & Evaluation SIG Newsletter, 7(2), 16-19.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Test score interpretation and use. In Building a validity argument for the Test of English as a Foreign Language (pp. 1-25).


Cobb, T. Web Vocabprofile [accessed 14 Dec. 2014 from http://www.lextutor.ca/vp/], an adaptation of Heatley, Nation & Coxhead's (2002) Range.
Cobb, T. (n.d.). Why & how to use frequency lists to learn words. Retrieved March 9, 2015,
from http://www.lextutor.ca/cgi-bin/vp/eng/output.pl
Flahive, D. (2014). Basic psychometric concepts: descriptive statistics; the building blocks of
research. [class handout].
Heatley, A., Nation, I.S.P. & Coxhead, A. (2002). RANGE and FREQUENCY programs.
Available at http://www.victoria.ac.nz/lals/staff/paul-nation.aspx.
Lytle, S. L., & Wolfe, M. (1989). Adult literacy education: Program evaluation and learner
assessment. Columbus, OH: ERIC Clearinghouse on Adult, Career, and Vocational
Education. (ERIC No. ED 315 665).
Miller, D. M., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching (8th ed.). Upper Saddle River, NJ: Merrill.
Mislevy, R. J. and Knowles, K.T. (Eds.) (2002). Performance assessments for adult education:
Exploring the measurement issues. Washington, DC: National Academy Press.
Stoynoff, S., & Chapelle, C. A. (2005). ESOL tests and testing: A resource for teachers and
program administrators. Alexandria, VA: TESOL.
West, M. (1953). General service list, In J. Bauman & B. Culligan (1995), About the GSL.
Retrieved March 9, 2015, from http://jbauman.com/aboutgsl.html

Appendix A
TLU Domain Description

Characteristics of the setting
    physical characteristics: Business or institution
    participants: English language learners
    time of task: Morning, afternoon, evening
Characteristics of the test rubric
    instructions (language, channel, specification of procedures and tasks)
    structure, time allotment
    scoring method (criteria for correctness, procedures for scoring the response, explicitness of criteria and procedures)
Characteristics of the input
    format
        channel: Visual and aural
        form: Both language and non-language
        language: English (target language)
        length: Short description
        type: Input for interpretation
        degree of speededness: Slower than native speaker reading
        vehicle: Live
    language of the input
        organizational characteristics
            grammatical: Simple but specific vocab, questions and imperatives, syntax, graphology
            textual: Cohesion and coherence
        pragmatic characteristics
            functional: Ideational, manipulative, heuristic
            sociolinguistic: Register, genre
        topical characteristics: Job duties, qualifications, workplace vocabulary
Characteristics of the expected response
    format
        channel: Written and aural
        form: Language (possibly some non-language)
        language: Target language
        length: Short written and spoken responses
        type: Selected and limited production
        degree of speededness: Below native speaker writing speed
    language of the expected response
        organizational characteristics
            grammatical: Graphology, simple vocab and simple verb tense/aspect, statements
            textual: Cohesion, coherence, short answers
        pragmatic characteristics
            functional: Ideational, manipulative, heuristic
            sociolinguistic: Register, genre
        topical characteristics: Qualifications, prior experiences, personal information
Relationship between input and response
    reactivity: Non-reciprocal and reciprocal
    scope of relationship: Narrow and broad scope
    directness of the relationship: Indirect and direct

TLU Task Characteristics

Characteristics of the setting
    physical characteristics, participants, time of task
Characteristics of the test rubric
    instructions
        language: English
        channel: Written and spoken
        specification of procedures and tasks: Briefly explained in speaking and writing
    structure: 15 matching items, 15 information transfer items, 6 speaking prompts
    time allotment: 60 minutes
    scoring method: 0 = wrong, 1 = correct
        criteria for correctness: Each correct answer = 1 point
        explicitness of criteria and procedures: Included in instructions
Characteristics of the input
    format
        channel: Visual and aural
        form: Both language and non-language
        language: English (target language)
        length: Single words and short phrases
        type: Items and prompts
        degree of speededness: Slower than native speaker reading
        vehicle: Live
    language of the input
        organizational characteristics
            grammatical: Simple but specific vocab, phrases and expressions, present and past tense verbs
            textual: Short expressions
        pragmatic characteristics
            functional: Ideational, manipulative, heuristic
            sociolinguistic: Register, genre
        topical characteristics: Job application terms, qualifications
Characteristics of the expected response
    format
        channel: Visual and written
        form: Language (possibly some non-language)
        language: Target language
        length: One letter corresponding to the answer; words or phrases; information transfer; short interview responses
        type: Selected response and limited production
        degree of speededness: Below native speaker writing speed
    language of the expected response
        organizational, pragmatic, and topical characteristics: n/a
Relationship between input and response
    reactivity: Non-reciprocal and reciprocal
    scope of relationship: Narrow and broad scope
    directness of the relationship: Direct and indirect

Appendix B
Table of Specifications

Text/Task                        Knowledge       Interpret and          Analyze, Apply,    # items   % items
                                 (recall)        transfer information   and Synthesize
Workplace vocabulary             15: 1.1-1.15    0                      0                  15        42
Filling out a job application    0               15: 2.1-2.15           0                  15        42
Interviewing for a job           0               0                      6: 3.1-3.6         6         16
# items                          15              15                     6                  36
% items                          42              42                     16                           100

Appendix C
Workplace English Test

Task 1: Workplace Vocabulary (15 points)


Directions: Next to each vocabulary item listed in Column A, write the letter of the best
definition for the item from Column B. Each definition in Column B may be used once, more
than once, or not at all. Each item is worth one point. This section should take you no longer
than 20 minutes. An example has been done for you.
Column A

__E__ 0. Surname
_____ 1. Position desired
_____ 2. Maiden name
_____ 3. Previous employer
_____ 4. Duties
_____ 5. Skills
_____ 6. Qualifications
_____ 7. Salary
_____ 8. Wage
_____ 9. References
_____ 10. Applicant
_____ 11. Job title
_____ 12. Relocate
_____ 13. N/A
_____ 14. Graveyard shift
_____ 15. Legally entitled to work

Column B

A. Place you last worked
B. Abilities, things you can do
C. Money earned per hour
D. Late night or overnight work shift
E. Last name or family name
F. Money earned per month or year
G. Person applying for a job
H. Effective, current, legal
I. Woman's last name before marriage
J. Allowed by law to work
K. Specific job wanted or applied for
L. Skills, experience, education needed for a job
M. Move to a different place for a job
N. Responsibilities, things employee must do
O. Name of the work position
P. Names of people who know the applicant
Q. Not applicable, does not apply in this situation

Task 2: Application for Employment (15 points)


Directions: Read the story below about Mary. Use the information from the story to fill out the
job application for Mary. Each blank in the application is worth 1 point. This section of the test
should take you no longer than 20 minutes.


Mary Ortez is looking for a job as a head cook. She can start immediately. Her last job was as a head cook in a kitchen at Freedom College in Arvada, Colorado. She worked in the college kitchen for 4 years. Her duties included ordering food and supplies each week, cleaning the kitchen, and supervising the employees during her shift. Her position as a head cook had a lot of responsibilities which required her to be flexible and dependable.
Before being a head cook, Mary worked as a baker for 2 years at Delicious Bakery in Denver, Colorado. The bakery specialized in wedding cakes. In that position, Mary had to be polite, organized, and creative to satisfy the customers. She also had to be very efficient in order to make and deliver the wedding cakes on time.
In her first job, shortly after she moved to Colorado from Bogota, Colombia, Mary was a dishwasher at Hungry's Family Restaurant, located in Colorado Springs, Colorado. Shortly after being hired, she was promoted to a line cook position. In this position, Mary was in charge of making the food orders quickly and accurately. At the family restaurant, Mary worked the graveyard shift. Mary worked for a total of 5 years at Hungry's.
Although Mary moved to the United States from Colombia 12 years ago, she mostly speaks Spanish at home with her family but can speak English well. Mary graduated from high school in Colombia and has attended 2 years of General English as a second language training. She has been a citizen of the United States for 8 years. Mary has a valid driver's license but she prefers to take the bus to work if possible.

Job Application

Position Desired: 1. ______________________
Date you can start: 2. ______________________

Name/Address:
3. Last: ______________     4. First: ______________
Street: 2550 Central Ave. West     Apt #: 25 D
City, State: Denver, CO 80511      Phone: 303-897-4562

5. Are you legally able to work in the country?   Yes ______  No ______
6. Do you have a valid Driver's License?          Yes ______  No ______

Employment History

Name and Location of Company           Job Title    Duties              Length of Employment
Freedom College, Arvada, Colorado      8.           9.                  7.
10.                                    Baker        Designed cakes,     11.
                                                    baked cakes,
                                                    delivered cakes
Hungry's Family Restaurant,            13.          14.                 12.
Colorado Springs, CO

Education and Training

University: N/A
High School: Colombia, South America
Other education or training: 15.

Task 3: Mock Interview (6 points)
Instructor reads directions to the participant before beginning the prompts.


Directions: In this task you will be interviewing for a management position at a restaurant. Listen to the interview prompts and respond to the best of your ability according to the information in the prompt. There will be a few warm-up questions and 6 interview questions.

Warm-up question suggestions:
What is your name?
Where were you born?
How long have you lived here?

Interview prompts:
1. What is your current job and what position do you work in at your job?
2. What are your current duties in your job or in your last job?
3. What do you most enjoy/least enjoy about your current position?
4. What are your strengths as an employee?
5. What qualities do you think are important for a manager to have?
6. Why are you the right person for this job?

*****Thank you for your answers, you did great!*****

Scoring Key:
Part 1: Each item is worth one point for a total of 15 points.
K, I, A, N, B, L, F, C, P, G, O, M, Q, D, J
Part 2: Each blank is worth one point for a total of 15 points. Test takers are expected to transfer the information directly from the paragraphs.
1. Head Cook


2. Immediately
3. Ortez
4. Mary
5. Yes
6. Yes
7. 4 years
8. Head Cook
9. Ordering food and supplies
Cleaning the kitchen
Supervising the employees
10. Delicious Bakery, Denver, CO
11. 2 years
12. 5 years
13. Line cook (will also accept dishwasher)
14. making food orders quickly
15. 2 years of General English as a second language

Appendix D
Rubrics

Scoring Rubric for Task 3: Listening and Speaking
OUTCOME MEASURES DEFINITIONS: LISTENING AND SPEAKING

Score 1:
1. Individual cannot speak or understand English, or only understands very few isolated words.

Score 2 (meets at least two criteria):
1. Individual can understand basic greetings and simple phrases.
2. Can understand simple questions related to personal information if spoken slowly; questions may need repeating.
3. Understands a limited number of words and responds with simple learned phrases, but speaks slowly and with difficulty and demonstrates little or no control over grammar and pronunciation.

Score 3 (meets at least two criteria):
1. Individual can understand common words, simple phrases, and sentences which contain familiar vocabulary if spoken slowly.
2. Individual can respond to simple questions about personal information or everyday situations and activities.
3. Can express immediate needs using simple learned phrases or short sentences, but shows limited control of grammar and pronunciation.

Score 4 (meets two criteria):
1. Individual can understand simple learned phrases and limited new phrases including familiar vocabulary if spoken slowly.
2. Can ask and respond to questions using phrases.
3. Can express basic survival needs and can participate in some routine social conversations, but with some difficulty.

Score 5 (meets two criteria):
1. Individual can understand learned phrases and short new phrases containing familiar vocabulary.
2. Can communicate most needs and participate in limited social conversations. Uses new phrases without hesitation, but there is inconsistent control of grammar and pronunciation.
3. Asks questions for clarification, but there is inconsistent control of grammar and pronunciation.

Score 6 (meets two criteria):
1. Individual can understand and communicate in a variety of contexts related to daily life and work. Can understand the main points of discussions in familiar and most unfamiliar contexts.
2. Shows ability to go beyond learned patterns to construct new and more complex sentences. Shows control of grammar and pronunciation.
3. Can clarify own or others' meaning through rewording.

Rubric for interpretation of the score on the whole test, including the score on task 3

OUTCOME MEASURES DEFINITIONS
FUNCTIONAL AND WORKPLACE SKILLS LEVEL DESCRIPTORS
ENGLISH AS A SECOND LANGUAGE

Beginning ESL Workplace Literacy (score 0-5): Individual functions minimally or not at all in English. Often communicates through gestures or with a few isolated words such as their name and other personal information. Individual may also have difficulty reading and writing in any language.

Low Beginning ESL Workplace Literacy (score 6-11): Individual functions with difficulty in situations that require immediate verbal skills. Can provide limited personal information on simple forms. Can handle very simple written or oral English communication. Shows very limited, if any, control of grammar and spelling.

High Beginning ESL Workplace Literacy (score 12-17): Individual can function in some situations that require immediate verbal skills, particularly in familiar social situations. Can provide basic personal information on simple forms and recognizes simple common forms of print found in the workplace and community. Can handle basic oral or written English communication and can read most sight words and many other common words. Individual can write some simple sentences with limited vocabulary, but writing shows limited control of grammar and spelling.

Low Intermediate ESL Workplace Literacy (score 18-23): Individual can interpret simple directions and fill out simple forms but needs support on some complex forms. Can handle tasks which include some written or oral communication. Individual can read simple material on familiar topics and comprehend simple and compound sentences in simple or linked paragraphs containing familiar vocabulary. Individual can write simple notes or messages on familiar situations and shows some control of grammar and spelling.

High Intermediate ESL Workplace Literacy (score 24-29): Individual can meet basic survival and social needs. Can follow simple oral and written instructions, has some ability to communicate using appropriate language, and has good interpersonal skills. Can write messages and notes related to basic needs and can complete most forms. Can handle most situations that require oral skills so long as tasks can be clarified orally. Individual can read text on familiar topics, can use context to determine meaning, can interpret actions required in specific directions, and can write simple paragraphs with a main idea and supporting details on familiar topics. Can self- and peer-edit for spelling and some grammar errors.

Advanced ESL Workplace Literacy (score 30-36): Individual can function independently to meet most needs. Can use English in routine social and work situations. Can communicate on the telephone. Understands radio and television. Can interpret charts and graphs and can complete most forms. Individual can read moderately complex text. Can infer meaning from context and uses multiple strategies to understand, including inference, prediction, and compare-and-contrast techniques. Individual can write a multi-paragraph text which links an introduction, body, and conclusion with clear and developed ideas. Occasionally makes spelling and grammar mistakes.

Appendix E
Score Reporting Form
Score Reporting Form:
Name:___________________________________


Total Score: ____/36


Your total score places you into a class in the Workplace English program. Based on your
score you can see which class you have been placed into on the right side of the table.

Score      Class/Level
0-5        Beginning
6-11       Low Beginning
12-17      High Beginning
18-23      Low Intermediate
24-29      High Intermediate
30-36      Advanced
