1.0 SYNOPSIS
CONTENT
1.3 INTRODUCTION
1.4.1 Test
1.4.2 Assessment
1.4.3 Evaluation
1.4.4 Measurement
writing) and components (e.g. grammar, vocabulary, pronunciation) and an
approach to test design that focused on testing isolated discrete points of
language, while the primary concern was with psychometric reliability (e.g.
Lado, 1961; Carroll, 1968). Language testing research was dominated largely
by the hypothesis that language proficiency consisted of a single unitary trait,
and by a quantitative, statistical research methodology (Oller, 1979).
The beginning of the new millennium is another exciting time for anyone
interested in language testing and assessment research. Current
developments in the fields of applied linguistics, language learning and
pedagogy, technological innovation, and educational measurement have
opened up some rich new research avenues.
3.0 Changing Trends in Language Assessment: The Malaysian Context
On 3rd May 1956, the Examination Unit (later known as Examination
Syndicate) in the Ministry of Education (MOE) was formed on the
recommendation of the Razak Report (1956). The main objective of the
Malaysia Examination Syndicate (MES) was to fulfil one of the Razak Report's
recommendations, which was to establish a common examination system for
all the schools in the country.
The Rahman Talib Report recommended the following actions:
1. Extend schooling age to 15 years old.
2. Automatic promotion to higher classes.
3. Multi-stream education (Aneka Jurusan).
Implementation of the Rahman Talib Report (1960)
The following changes in examination were made:
- The entry of elective subjects in LCE and SRP.
- The introduction of the Standard 5 Evaluation Examination.
- The introduction of Malaysia's Vocational Education Examination.
- The introduction of the Standard 3 Diagnostic Test (UDT).

Implementation of the Cabinet Report (1979)
The implementation of the Cabinet Report resulted in the evolution of the
education system to its present state, especially with KBSR and KBSM.
Adjustments were made in examinations to fulfil the new curriculum's needs
and to ensure they are in line with the National Education Philosophy.
TOPIC 2 ROLE AND PURPOSES OF ASSESSMENT IN TEACHING AND LEARNING
Tutorial question
Examine the contributing factors to the changing trends of
language assessment.
Create and present findings using graphic organisers.
o Program evaluation
o Providing research criteria
o Assessment of attitudes and socio-psychological differences
Proficiency Tests
Achievement Tests
often cumulative, covering material drawn from an entire course or semester.
Diagnostic Tests
Aptitude Tests
This type of test no longer enjoys the widespread use it once had. An
aptitude test is designed to measure general ability or capacity to learn a
foreign language a priori (before taking a course) and ultimate predicted
success in that undertaking. Language aptitude tests were seemingly
designed to apply to the classroom learning of any language. In the United
States, two commonly used standardised language aptitude tests were the
Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the
Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is
no research showing unequivocally that these kinds of tasks predict
communicative success in a language, apart from untutored language
acquisition, standardised aptitude tests are seldom used today, except to
identify foreign language disability (Stansfield & Reed, 2004).
Progress Tests
These tests measure the progress that students are making towards
defined course or programme goals. They are administered at various stages
throughout a language course to see what the students have learned, perhaps
after certain segments of instruction have been completed. Progress tests are
generally teacher produced and are narrower in focus than achievement tests
because they cover a smaller amount of material and assess fewer objectives.
Placement Tests
These tests, on the other hand, are designed to assess students' level
of language ability for placement in an appropriate course or class. This type
of test indicates the level at which a student will learn most effectively. The
main aim is to create groups which are homogeneous in level. In designing a
placement test, the test developer may choose to base the test content either
on a theory of general language proficiency or on learning objectives of the
curriculum. In the former, institutions may choose to use a well-established
proficiency test such as the TOEFL or IELTS exam and link it to curricular
benchmarks. In the latter, tests are based on aspects of the syllabus taught at
the institution concerned.
Discuss and present the various types of tests and assessment tasks
that students have experienced.
TOPIC 3 BASIC TESTING TERMINOLOGY
3.0 SYNOPSIS
Types of tests: norm-referenced and criterion-referenced; formative and
summative; objective and subjective.
CONTENT
accepted as the index of attainment to a test-taker. Thus, CRTs are designed
to provide feedback to test-takers, mostly in the form of grades, on specific
course or lesson objectives. The Curriculum Development Centre (2001) defines
CRT as an approach that provides information on students' mastery based on
the criteria determined by the teacher. These criteria are based on learning
outcomes or objectives as specified in the syllabus. The main advantage of
CRTs is that they allow testers to make inferences about how much
language proficiency (in the case of language proficiency tests) or knowledge
and skills (in the case of academic achievement tests) test-takers/students
originally have, and about their successive gains over time. As opposed to
NRTs, CRTs focus on students' mastery of a subject matter (represented in
the standards) along a continuum instead of ranking students on a bell curve.
Table 3 below shows the differences between Norm-Referenced Tests (NRT)
and Criterion-Referenced Tests (CRT).
A formative test or assessment, as the name implies, is a kind of
feedback teachers give students while the course is progressing. Formative
assessment can be seen as assessment for learning. It is part of the
instructional process. We can think of formative assessment as practice.
With continual feedback, teachers can assist students to improve their
performance. The teachers point out what the students have done wrong
and help them to get it right. This can take place when teachers examine the
results of achievement and progress tests. Based on the results of a formative
test or assessment, teachers can suggest changes to the focus of the
curriculum or the emphasis on some specific lesson elements. On the other hand,
students may also need to change and improve. Due to the demanding nature
of formative testing, many teachers prefer not to adopt it, although
returning assessed homework or achievement tests presents both
teachers and students with healthy and valuable learning opportunities.
Table 3.1: Common formative and summative assessments in schools

Formative Assessment  | Summative Assessment
Anecdotal records     | Final exams
Quizzes and essays    | National exams (UPSR, PMR, SPM, STPM)
Diagnostic tests      | Entrance exams
It may encourage guessing, which may have a considerable effect on
test scores.
2. Stem
Every multiple-choice item consists of a stem (the body of the item that
presents a stimulus). The stem is the question or task in an item. It may take
the form of a complete or open-ended sentence, phrased positively or
negatively. A stem must be short, compact and clear. However, it must not
easily give away the right answer.
3. Options or alternatives
They are known as a list of possible responses to a test item.
There are usually between three and five options/alternatives to
choose from.
4. Key
This is the correct response; in some items it is the single correct
answer, in others the best one. In a good item, the key is not obvious
compared with the distractors.
5. Distractors
A distractor is an incorrect option included to draw students away from
selecting the correct answer. An excellent distractor is plausible and closely
resembles the correct answer without being correct.
When building multiple-choice items for both classroom-based and
large-scale standardised tests, consider the four guidelines below:
Some have argued that the distinction between objective and subjective
assessments is neither useful nor accurate because, in reality, there is no such
thing as objective assessment. In fact, all assessments are created with
inherent biases built into decisions about relevant subject matter and content,
as well as cultural (class, ethnic, and gender) biases.
Reflection
1. Objective test items are items that have only one answer or correct
response. Describe the multiple-choice test item in depth.
Discussion
4.0 SYNOPSIS
Topic 4 defines the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential sub-categories
within reliability and validity.
4.1 LEARNING OUTCOMES
Principles of assessment: reliability, validity, practicality, authenticity
and interpretability.
4.3 INTRODUCTION
to have raters determine which category each observation falls into and
then calculate the percentage of agreement between the raters. So, if
the raters agree 8 out of 10 times, the test has an 80% inter-rater
reliability rate. Rater reliability is assessed by having two or more
independent judges score the test. The scores are then compared to
determine the consistency of the raters' estimates.
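A minimal sketch of this percent-agreement calculation (the function name and the sample ratings are assumptions for illustration, not from the module):

```python
def percent_agreement(rater_a, rater_b):
    """Proportion of observations on which two raters assigned the same category."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same number of observations")
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

# The example above: raters agree on 8 of 10 observations -> 0.8, i.e. 80%.
print(percent_agreement(list("ABABABABAB"), list("ABABABABBA")))  # 0.8
```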
Intra-rater reliability is an internal factor; its main aim is
consistency within a single rater. For example, if a rater
(teacher) has many examination papers to mark and does not have
enough time to mark them, s/he might take much more care with the
first, say, ten papers than with the rest. This inconsistency will affect the
students' scores; the first ten might get higher scores. In other
words, while inter-rater reliability involves two or more raters, intra-
rater reliability is the consistency of grading by a single rater.
Scores on a test are rated by a single rater/judge at different times.
When we grade tests at different times, we may become
inconsistent in our grading for various reasons. Some papers that are
graded during the day may get our full and careful attention, while
others that are graded towards the end of the day are very quickly
glossed over. As such, intra-rater reliability determines the
consistency of our grading.
b. Teacher-Student factors
c. Environment factors
Because students' grades are dependent on the way tests are
administered, test administrators should strive to provide clear and
accurate instructions, sufficient time and careful monitoring of tests to
improve the reliability of their tests. A test-retest technique can be
used to determine test reliability.
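The test-retest idea can be sketched as a correlation between scores from two administrations of the same test. The scores below are invented for illustration, and statistics.correlation requires Python 3.10 or later:

```python
import statistics

# Hypothetical scores for the same six students on two sittings of one test.
first_sitting = [55, 62, 70, 48, 80, 66]
second_sitting = [58, 60, 73, 50, 78, 69]

# Pearson correlation between the two sittings; a value near +1 suggests the
# test yields consistent scores over time (here roughly 0.98).
print(round(statistics.correlation(first_sitting, second_sitting), 2))
```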
e. Marking factors
4.5 VALIDITY
writing ability, one might ask students to write as many words as they can in 15
minutes, then simply count the words for the final score. Such a test is
practical (easy to administer) and the scoring quite dependable (reliable).
However, it would not constitute (represent) a valid test of writing ability
without taking into account its comprehensibility (clarity), rhetorical discourse
elements, and the organisation of ideas.
Content validity: Does the assessment content cover what you want to
assess? Have satisfactory samples of language and language skills been
selected for testing?
Construct validity: Are you measuring what you think you're measuring?
Is the test based on the best available theory of language and language
use?
Concurrent (parallel) validity: Can you use the current test score to
estimate scores of other criteria? Does the test correlate with other existing
measures?
Figure 4.5: Types of Validity
4.5.2 Content validity
What are the different types of validity? Describe any three types and
cite examples.
http://www.2dix.com/pdf-2011/testing-and-evaluation-in-esl-pdf.php
4.5.6 Practicality
4.5.7 Objectivity
4.5.9 Authenticity
4.6.0 Interpretability
TOPIC 5 DESIGNING CLASSROOM LANGUAGE TESTS
5.0 SYNOPSIS
Topic 5 exposes you to the stages of test construction, the preparation of a
test blueprint/test specifications, the elements in test specification guidelines,
and the importance of following the guidelines for constructing test items.
Then we look at the various test formats that are appropriate for language
assessment.
3. draw up a test specification that reflects both the purpose and the
objectives of the test
4. compare and contrast Bloom's taxonomy and the SOLO taxonomy
5. categorise test items according to Bloom's taxonomy
6. discuss the elements of test items of high quality, reliability and
validity
7. identify the elements in test specification guidelines
8. demonstrate an understanding of the importance of following the
guidelines for constructing test items
9. illustrate test formats that are appropriate and meet the
requirements of the learning outcomes
CONTENT
i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating
5.3.1 Determining
The essential first step in testing is to make oneself perfectly
clear about what it is one wants to know and for what purpose. When
we start to construct a test, the following questions have to be
answered.
5.3.2 Planning
The first form that the solution takes is a set of specifications for
the test.This will include information on: content, format and timing,
criteria,levels of performance, and scoring procedures.
In this stage, the test constructor has to determine the content by
answering the following questions:
- describing the purpose of the test;
- describing the characteristics of the test takers, i.e. the nature of the
population of examinees for whom the test is being designed;
- defining the nature of the ability we want to measure;
- developing a plan for evaluating the qualities of test usefulness (the
degree to which a test is useful for teachers and students), which
includes six qualities: reliability, validity, authenticity, practicality,
interactiveness, and impact;
- identifying resources and developing a plan for their allocation and
management;
- determining the format and timing of the test;
- determining levels of performance; and
- determining scoring procedures.
5.3.3 Writing
Although writing items is time-consuming, writing good items is an art.
No one can expect to be able consistently to produce perfect items.
Some items will have to be rejected, others reworked. The best way to
identify items that have to be improved or abandoned is through
teamwork. Colleagues must really try to find fault; and despite the
seemingly inevitable emotional attachment that item writers develop to
items that they have created, they must be open to, and ready to
accept, the criticisms that are offered to them. Good personal relations
are a desirable quality in any test writing team.
5.3.4 Preparing
One has to understand the major principles, techniques and experience
of preparing test items. Not every teacher can make a good tester.
To construct different kinds of tests, the tester should observe some
principles. In production-type tests, we have to bear in mind that no
comments are necessary. Test writers should also try to avoid test
items which can be answered through test-wiseness. Test-wiseness
refers to the capacity of examinees to utilise the characteristics and
formats of the test to guess the correct answer.
5.3.5 Reviewing
Principles for reviewing test items:
The test should not be reviewed immediately after its construction,
but after some considerable time.
Other teachers or testers should review it. In a language test, it is
preferable if native speakers are available to review the test.
5.3.6 Pre-testing
After reviewing the test, it should be submitted to pre-testing.
The tester should administer the newly-developed test to a group of
examinees similar to the target group and the purpose is to analyse
every individual item as well as the whole test.
Numerical data (test results) should be collected to check the
efficiency of each item; this should include item facility and item
discrimination.
5.3.7 Validating
Item Facility (IF) shows to what extent an item is easy or difficult. Items
should be neither too easy nor too difficult. To measure the facility, or
easiness, of an item, the following formula is used:

IF = c / N, where c is the number of correct responses and N is the total
number of candidates.

To measure item difficulty, the number of wrong responses (w) is used
instead: w / N.

The results of these equations range from 0 to 1. An item with a facility
index of 0 is too difficult, and one with an index of 1 is too easy. The ideal
item has a value of 0.5, and the acceptable range for item facility is
0.37 to 0.63; an item below 0.37 is difficult, and one above 0.63 is
easy.
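A minimal sketch of the IF formula and the acceptability range above (the function names are assumptions for illustration):

```python
def item_facility(num_correct, num_candidates):
    """IF = c / N: 0 means no one answered correctly, 1 means everyone did."""
    return num_correct / num_candidates

def is_acceptable(if_value, low=0.37, high=0.63):
    """The acceptability range for item facility suggested above."""
    return low <= if_value <= high

print(item_facility(15, 30))  # 0.5 -> the ideal value cited in the text
print(is_acceptable(0.5))     # True
print(is_acceptable(0.2))     # False: below 0.37, the item is too difficult
```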
Thus, tests which are too easy or too difficult for a given sample
population, often show low reliability. As noted in Topic 4, reliability is
one of the complementary aspects of measurement.
However, what exactly is an item in a test? An item is a tool, an
instrument, instruction or question used to get feedback from test-
takers, and it serves as evidence of something that is being measured. An
item is an instrument used to get feedback, which is useful information
for consideration in measuring or asserting a construct. Items can be
classified as recall or thinking items. A recall item requires one to recall
information in order to answer, while a thinking item requires test-takers
to use their thinking skills to attempt it.
For instance, consider a grammar unit test that will be administered at
the end of a three-week grammar course for high-beginning adult
learners (Level 2). The students will be taking a test that covers verb
tenses and two integrated skills (listening/speaking and reading/writing),
and the grammar class they attend serves to reinforce the grammatical
forms that they have learnt in the two earlier classes.
Based on the scenario above, the test specs that you design
might consist of the four sequential steps:
1. a broad outline of how the test will be organised
2. which of the eight sub-skills you will test
3. what the various tasks and item types will be
4. how results will be scored, reported to students, and used in future
class (washback)
Besides knowing the purpose of the test you are creating, you
are required to know as precisely as possible what it is you want to test.
Do not conduct a test hastily. Instead, you need to examine the
objectives for the unit you are testing carefully.
5.5 Bloom's and SOLO Taxonomies
5.5.1 Bloom's Taxonomy (Revised)
Bloom's Taxonomy is a systematic way of describing how a
learner's performance develops from simple to complex levels in the
affective, psychomotor and cognitive domains of learning. The Original
Taxonomy provided carefully developed definitions for each of the six
major categories in the cognitive domain. The categories were
Knowledge, Comprehension, Application, Analysis, Synthesis, and
Evaluation. With the exception of Application, each of these was
broken into subcategories. The complete structure of the original
Taxonomy is shown in Figure 5.1.
Taxonomy by allowing these two aspects, the noun and verb, to form
separate dimensions, the noun providing the basis for the Knowledge
dimension and the verb forming the basis for the Cognitive Process
dimension as shown in Figure 5.2.
reflects different forms of thinking, and since thinking is an active
process, verbs were used instead of nouns.
Level 1 (C1)

Level 2 (C2)
- Exemplifying (illustrating, instantiating): finding a specific example or
illustration of a concept or principle
- Classifying (categorising, subsuming): determining that something
belongs to a category
- Summarising (abstracting, generalising): abstracting a general theme
or major point(s)
- Inferring (concluding, extrapolating, interpolating, predicting): drawing
a logical conclusion from presented information
- Comparing (contrasting, mapping, matching): detecting
correspondences between two ideas, objects, and the like
- Explaining (constructing models): constructing a cause-and-effect
model of a system

Level 3 (C3)
On the other hand, SOLO, which stands for the Structure of the
Observed Learning Outcome, is a taxonomy that systematically
describes how a learner's performance develops from simple to
complex levels in their learning. Biggs and Collis first introduced it in
their 1982 study. There are five stages, namely Prestructural,
Unistructural and Multistructural, which are in a quantitative phase, and
Relational and Extended Abstract, which are in a qualitative phase.
Figure 5.3: SOLO Taxonomy
It may also be helpful in providing a range of techniques for differentiated
learning (Anderson, 2007; Hook & Mills, 2012).
The most powerful model for understanding these three levels and
integrating them into learning intentions and success criteria is the
SOLO model.
It can be effectively used for students to deconstruct exam questions to
understand marks awarded and as a vehicle for self-assessment and peer-
assessment.
What is an example of ...?
What are the good characteristics of a test item?
Explain each characteristic of a test item in a graphic organiser.
http://books.google.com.my/books/about/Constructing_Test_Items.html?id=Ia3SGDfbaV
6.0 Test format
What is the difference between test format and test type? For example,
when you want to introduce a new kind of test, say a reading test, which
is organised a little differently from the existing test items, what do you say:
test format or test type? Test format refers to the layout of questions on a
test. For example, the format of a test could be two essay questions, 50
multiple-choice questions, etc. For the sake of brevity, I will consider
providing the outlines of some large-scale standardised tests.
UPSR
IELTS is a test of all four language skills: Listening, Reading, Writing and
Speaking. Test-takers sit the Listening, Reading and Writing tests all on
the same day, one after the other, with no breaks in between. Depending on
the examinee's test centre, one's Speaking test may be on the same day as
the other three tests, or up to seven days before or after that. The total test
time is under three hours. The test format is illustrated below.
Figure 6: IELTS Test Format
6.0 SYNOPSIS
Topic 6 focuses on ways to assess language skills and language
content. It defines the types of test items used to assess language skills
and language content. It also provides teachers with suggestions on
ways a teacher can assess the listening, speaking, reading and writing
skills in a classroom. It also discusses the concepts of and differences
between discrete-point, integrative and communicative tests.
CONTENT
in a context of longer stretches of spoken language (such as classroom
directions from a teacher, TV or radio news items, or stories).
Assessment tasks in selective listening could ask students, for example,
to listen for names, numbers, grammatical categories, directions (in a map
exercise), or certain facts and events.
iv. Extensive: listening to develop a top-down, global
understanding of spoken language. Extensive performance
ranges from listening to lengthy lectures to listening to a
conversation and deriving a comprehensive message or
purpose. Listening for the gist or the main idea and making
inferences are all part of extensive listening.
b. Speaking
In the assessment of oral production, both discrete feature
objective tests and integrative task-based tests are used. The first
type tests such skills as pronunciation, knowledge of what
language is appropriate in different situations, language required in
doing different things like describing, giving directions, giving
instructions, etc. The second type involves finding out if pupils can
perform different tasks using spoken language that is appropriate
for the purpose and the context. Task-based activities involve
describing scenes shown in a picture, participating in a discussion
about a given topic, narrating a story, etc. As in the listening
performance assessment tasks, Brown (2010) cited four categories
for oral assessment.
to participate in an interactive conversation. The only role of
listening here is in the short-term storage of a prompt, just long
enough to allow the speaker to retain the short stretch of
language that must be imitated.
2. Intensive. The production of short stretches of oral language
designed to demonstrate competence in a narrow band of
grammatical, phrasal, lexical, or phonological relationships.
Examples of intensive assessment tasks include directed
response tasks (requests for specific production of speech),
reading aloud, sentence and dialogue completion, limited picture-
cued tasks including simple sentences, and translation up to the
simple sentence level.
3. Responsive. Responsive assessment tasks include interaction
and test comprehension, but at the somewhat limited level of very
short conversations: standard greetings, small talk, simple
requests and comments, and the like. The stimulus is almost
always a spoken prompt (to preserve authenticity), with one or two
follow-up questions or retorts.
c. Reading
Cohen (1994) discussed various types of reading and the meanings
assessed. He describes skimming and scanning as two different types
of reading. In the first, a respondent is given a lengthy passage and is
required to inspect it rapidly (skim) or read to locate specific
information (scan) within a short period of time. He also discusses
receptive or intensive reading, which refers to "a form of
reading aimed at discovering exactly what the author seeks to
convey" (p. 218). This is the most common form of reading, especially
under test or assessment conditions. Another type of reading is reading
responsively, where respondents are expected to respond to some
point in a reading text through writing or by answering questions.
A reading text can also convey various kinds of meaning, and reading
involves the interpretation or comprehension of these meanings. First,
grammatical meanings are meanings that are expressed through
linguistic structures, such as complex and simple sentences, and the
correct interpretation of those structures. A second meaning is
informational meaning, which refers largely to the concepts or messages
contained in the text. Respondents may be required to comprehend
merely the information or content of the passage, and this may be
assessed through various means such as summary and précis writing.
Compared to grammatical or syntactic meaning, informational meaning
requires a more general understanding of a text rather than having to
pay close attention to the linguistic structure of sentences. A third
meaning contained in many texts is discourse meaning. This refers to
the perception of rhetorical functions conveyed by the text. One typical
function is discourse marking, which adds cohesiveness to a text.
These words, such as unless, however, thus and therefore, are crucial
to the correct interpretation of a text, and students may be assessed on
their ability to understand the discoursal meaning that they bring to the
passage. Finally, a fourth meaning which may also be an object of
assessment in a reading test is the meaning conveyed by the writer's
tone. The writer's tone, whether it is cynical, sarcastic or sad, is
important in reading comprehension but may be quite difficult to
identify, especially for less proficient learners. Nevertheless, there can
be many situations where the reader is completely wrong in
comprehending a text simply because he has failed to perceive the
correct tone of the author.
d. Writing
Brown (2004) identifies three different genres of writing, namely
academic writing, job-related writing and personal writing, each of
which can be expanded to include many different examples. Fiction,
for example, may be considered personal writing according to
Brown's taxonomy. Brown (2010) identified four categories of written
performance that capture the range of written production and can
be used to assess writing skill.
the primary focus while context and meaning are of secondary
concern.
2. Intensive (controlled). Beyond the fundamentals of imitative
writing are skills in producing appropriate vocabulary within a
context, collocation and idioms, and correct grammatical features
up to the length of a sentence. Meaning and context are
important in determining correctness and appropriateness but
most assessment tasks are more concerned with a focus on form
and are rather strictly controlled by the test design.
3. Responsive. Assessment tasks require learners to perform at a
limited discourse level, connecting sentences into a paragraph
and creating a logically connected sequence of two or three
paragraphs. Tasks relate to pedagogical directives, lists of criteria,
outlines, and other guidelines. Genres of writing include brief
narratives and descriptions, short reports, lab reports, summaries,
brief responses to reading, and interpretations of charts and
graphs. Form-focused attention is mostly at the discourse level,
with a strong emphasis on context and meaning.
4. Extensive. Extensive writing implies successful management of all
the processes and strategies of writing for all purposes, up to the
length of an essay, a term paper, a major research project report,
or even a thesis. Focus is on achieving a purpose, organizing and
developing ideas logically, using details to support or illustrate
ideas, demonstrating syntactic and lexical variety, and in many
cases, engaging in the process of multiple drafts to achieve a final
product. Focus on grammatical form is limited to occasional
editing and proofreading of a draft.
There are many examples of each type of test. Objective-type tests
include multiple-choice tests, true-false items and matching items,
because each of these is graded objectively. In these examples of
objective tests, there is only one correct response, and the grader
does not need to assess the response subjectively.
Two other terms, select-type tests and supply-type tests, are related
terms when we think of objective and subjective tests. In most
cases, objective tests are similar to select-type tests, where students
are expected to select or choose the answer from a list of options.
Just as a multiple-choice test is an objective-type test, it
can also be considered a select-type test. Similarly, tests involving
essay-type questions are supply-type, as the students are expected
to supply the answer through their essay. How then would you
classify a fill-in-the-blank test? For this type of test,
the students definitely need to supply the answer, but what is supplied
is merely a single word or a short phrase, which differs tremendously
from an essay. It may therefore be helpful to once again consider a
continuum with supply-type and select-type items at each end of the
continuum respectively.
It is not by accident that we find there are few, if any, test formats that are
either supply-type and objective or select-type and subjective. Select-type
tests tend to be objective, while supply-type tests tend to be subjective.
In addition to the above, Brown and Hudson (1998) have also suggested
three broad categories to differentiate tests according to how students are
expected to respond. These categories are selected-response tests,
constructed-response tests, and personal-response tests. Examples of
each of these types of tests are given in Table 6.1.
according to how students respond, are useful when we wish to
determine what students need to do when they attempt to answer test
questions.
b. Communicative Test
As language teaching has emphasised the importance of
communication through the communicative approach, it is not surprising
that communicative tests have also been given prominence. A
communicative emphasis in testing involves many aspects, two of
which revolve around communicative elements in tests and meaningful
content. Both these aspects are briefly addressed in the following
subsections.
interview, editing the results, and engaging in spontaneous, but flawed,
discourse" (Alderson & Banerjee, 2002: 99), all of which are inauthentic
when viewed in terms of real-life situations. Alderson himself argues
that because candidates in language tests are interested not in
communicating but in displaying their language abilities, the test situation
is a communicative event in itself and therefore cannot be used to
replicate any real-world event (p. 98).
involve performance;
are authentic; and
are scored on real-life outcomes.
In short, the kinds of tests that we should expect more of in the future
will be communicative tests in which candidates actually have to
produce the language in an interactive setting involving some degree of
unpredictability, which is typical of any language interaction situation.
These tests would also take the communicative purpose of the
interaction into consideration and require the student to interact with
language that is actual and unsimplified for the learner. Fulcher finally
points out that in a communicative test, the only real criterion of
success is the behavioural outcome, or whether the learner was able
to achieve the intended communicative effect (p. 493). It is obvious
from this description that the communicative test may not be so easily
developed and implemented. Practical reasons may hinder some of the
demands listed. Nevertheless, a solution to this problem has to be
found in the near future in order to have valid language tests that are
purposeful and can stimulate positive washback in teaching and
learning.
Exercise 1
TOPIC 7 SCORING, GRADING AND ASSESSMENT CRITERIA
7.0 SYNOPSIS
Topic 7 focuses on scoring, grading and assessment criteria. It provides
teachers with brief descriptions of the different approaches to scoring,
namely objective, holistic and analytic.
CONTENT
7.2.1 Objective approach
Table 7.1: Holistic Scoring Scheme
Source: S.S. Moya, Evaluation Assistance Center (EAC)-East, Georgetown
University, Washington
Rating | Criteria

5 | Vocabulary is precise, varied, and vivid. Organization is appropriate to
the writing assignment and contains a clear introduction, development of
ideas, and conclusion. Transition from one idea to another is smooth and
provides the reader with a clear understanding that the topic is changing.
Meaning is conveyed effectively. A few mechanical errors may be present
but do not disrupt communication. Shows a clear understanding of writing
and topic development.

4 | Vocabulary is adequate for grade level. Events are organized logically,
but some parts of the sample may not be fully developed. Some transition
of ideas is evident. Meaning is conveyed but breaks down at times.
Mechanical errors are present but do not disrupt communication. Shows a
good understanding of writing and topic development.

3 | Vocabulary is simple. Organization may be extremely simple or there
may be evidence of disorganization. There are few transitional markers, or
repetitive transitional markers. Meaning is frequently not clear. Mechanical
errors affect communication. Shows some understanding of writing and
topic development.

2 | Vocabulary is limited and repetitious. Sample is comprised of only a
few disjointed sentences. No transitional markers. Meaning is unclear.
Mechanical errors cause serious disruption in communication. Shows little
evidence of discourse understanding.

1 | Responds with a few isolated words. No complete sentences are
written. No evidence of concepts of writing.

0 | No response.
The six-point scale above includes broad descriptors of what a student's
essay reflects at each band. It is quite apparent that graders using this scale
are expected to pay attention to vocabulary, meaning, organisation, topic
development and communication. Mechanics such as punctuation are
secondary to communication.
Bailey also describes another type of scoring related to the holistic approach,
which she refers to as primary trait scoring. In primary trait scoring, a
particular functional focus is selected based on the purpose of the writing, and
grading is based on how well the student is able to express that function. For
example, if the function is to persuade, scoring would reflect how well the
author has been able to persuade the grader rather than how well organised
the ideas were or how grammatical the structures in the essay were. This
approach to grading emphasises functional and communicative ability rather
than discrete linguistic ability and accuracy.
Components   | Weight
Content      | 30 points
Organisation | 20 points
Vocabulary   | 20 points
Language Use | 25 points
Mechanics    | 5 points
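A minimal sketch of analytic scoring with the component weights above (the function name and the sample scores are assumptions for illustration):

```python
# Component weights from the table above; each component is marked out of
# its weight, so the total is out of 100.
WEIGHTS = {"Content": 30, "Organisation": 20, "Vocabulary": 20,
           "Language Use": 25, "Mechanics": 5}

def analytic_total(scores):
    """Sum the component scores after checking none exceeds its weight."""
    for component, score in scores.items():
        if score > WEIGHTS[component]:
            raise ValueError(f"{component} exceeds its maximum of {WEIGHTS[component]}")
    return sum(scores.values())

sample = {"Content": 24, "Organisation": 15, "Vocabulary": 16,
          "Language Use": 20, "Mechanics": 4}
print(analytic_total(sample))  # 79 out of 100
```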
Each of the three scoring approaches has its own advantages
and disadvantages, as illustrated in Table 7.2.
EXERCISE
TOPIC 8 ITEM ANALYSIS AND INTERPRETATION
8.0 SYNOPSIS
Topic 8 focuses on item analysis and interpretation. It provides teachers with
brief descriptions of basic statistical terminology such as mode, median,
mean, standard deviation and standard scores, and of the interpretation of
data. It will also look at item analysis, which deals with item difficulty and item
discrimination. Teachers will also be introduced to distractor analysis in
language assessment.
CONTENT
Let us assume that you have just graded the test papers for your class. You
now have a set of scores. If a person were to ask you about the performance
of the students in your class, it would be very difficult to give all the scores in
the class. Instead, you may prefer to cite only one score.
Or perhaps you would like to report on the performance by giving some
values that would help provide a good indication of how the students in your
class performed. What values would you give? In this section, we will look at
two kinds of measures, namely measures of central tendency and measures
of dispersion. Both these types of measures are useful in score reporting.
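A minimal sketch of both kinds of measures using the Python standard library (the sample scores are invented for illustration):

```python
import statistics

scores = [10, 12, 12, 15, 18]
print(statistics.mean(scores))    # central tendency: arithmetic mean (13.4)
print(statistics.median(scores))  # central tendency: middle score (12)
print(statistics.mode(scores))    # central tendency: most frequent score (12)
print(max(scores) - min(scores))  # dispersion: range (8)
```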
Standard deviation refers to how much the scores deviate from the mean.
There are two methods of calculating standard deviation, the deviation
method and the raw score method, illustrated by the following formulae
(using N − 1, the convention that yields the worked value of 5 below):

Deviation method: s = √[ Σ(x − x̄)² / (N − 1) ]
Raw score method: s = √[ (Σx² − (Σx)² / N) / (N − 1) ]

To illustrate this, we will use the scores 20, 25 and 30. Using the deviation
method, we come up with the following table:
Using the raw score method, we can come up with the following:
Table 8.2 : Calculating the Standard Deviation Using the Raw Score Method
Both methods result in the same final value of 5. If you are calculating
standard deviation with a calculator, it is suggested that the deviation
method be used when there are only a few scores and the raw score
method be used when there are many scores. This is because when
there are many scores, it will be tedious to calculate the square of the
deviations and their sum.
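A minimal sketch of both methods (the function names are assumptions); with the scores 20, 25 and 30, each returns the value of 5 arrived at above:

```python
import math

def sd_deviation_method(scores):
    """Sum the squared deviations from the mean, divide by N - 1, take the root."""
    mean = sum(scores) / len(scores)
    squared_deviations = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(squared_deviations / (len(scores) - 1))

def sd_raw_score_method(scores):
    """Work directly from the raw scores and their squares, avoiding deviations."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x ** 2 for x in scores)
    return math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

scores = [20, 25, 30]
print(sd_deviation_method(scores))  # 5.0
print(sd_raw_score_method(scores))  # 5.0
```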
i. The Z score
The Z score is the basic standardised score. It is referred to as the
basic form because other standardised scores are computed from the
Z score. The formula used to calculate the Z score is as follows:

Z = (raw score − mean) / standard deviation

Z score values are very small and usually range only from −2 to +2.
Such small values make the Z score inappropriate for score reporting,
especially for those unaccustomed to the concept. Imagine what a
parent might say if his child came home with a report card showing a
Z score of 0.47 in English Language! Fortunately, there is another form
of standardised score, the T score, with values that are more palatable
to the relevant parties.
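A minimal sketch of the formula (the raw score, mean and standard deviation below are invented for illustration):

```python
def z_score(raw, mean, sd):
    """How many standard deviations a raw score lies above or below the mean."""
    return (raw - mean) / sd

# e.g. a raw score of 62 in a class with mean 55 and standard deviation 15
print(round(z_score(62, 55, 15), 2))  # 0.47, the value cited in the text
```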
ii. The T score
The T score is a standardised score which can be computed using the
formula 10(Z) + 50. As such, the T scores for students A, B, C, and D in
Table 4.3 are 10(−1.28) + 50; 10(−0.23) + 50; 10(0.47) + 50; and
10(1.04) + 50, or 37.2, 47.7, 54.7, and 60.4 respectively. These values
seem perfectly appropriate compared to the Z score. The T score
average or mean is always 50 (i.e. a Z score of 0), which connotes
average ability and the midpoint of a 100-point scale.
How can En. Abu solve this problem? He would have to have
standardised scores in order to decide. This would require the following
information:
Using the information above, En. Abu can find the Z score for each raw
score reported as follows:
Based on Table 8.4, both Ali and Chong have a negative Z score as
their total score for both tests. However, Chong has a higher Z score
total (i.e. −1.07 compared to −1.34) and therefore performed better
when we take the performance of all the other students into
consideration.
THE NORMAL CURVE
a. Item difficulty
Item difficulty refers to how easy or difficult an item is. The formula
used to measure item difficulty is quite straightforward. It involves
finding out how many students answered an item correctly and
dividing that by the number of students who took the test. The formula
is therefore:

Item difficulty = number of students answering correctly / number of
students who took the test
go on to describe that if the purpose of the test is for selection, then we
should utilise items whose difficulty values come closest to the desired
selection ratio; for example, if we want to select 20%, then we should
choose items with a difficulty index of 0.20.
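A minimal sketch of that selection heuristic (the function names, the tolerance and the sample values are assumptions for illustration):

```python
def item_difficulty(num_correct, num_test_takers):
    """Proportion of test-takers who answered the item correctly."""
    return num_correct / num_test_takers

def select_items(difficulties, selection_ratio, tolerance=0.05):
    """Indices of items whose difficulty is close to the desired selection ratio."""
    return [i for i, d in enumerate(difficulties)
            if abs(d - selection_ratio) <= tolerance]

difficulties = [0.18, 0.22, 0.50, 0.75, 0.20]
print(select_items(difficulties, 0.20))  # [0, 1, 4]: items suited to selecting 20%
```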
b. Item discrimination
Let's use the following instance as an example. Suppose you have just
conducted a twenty-item test and obtained the following results:
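The results table is not reproduced here. As a minimal sketch, one common form of the discrimination index compares an upper and a lower group of scorers (the names and figures below are assumptions for illustration):

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """D = (upper-group correct - lower-group correct) / students per group.
    +1 means only strong students answered correctly; values near or below 0
    flag an item that fails to separate strong from weak test-takers."""
    return (upper_correct - lower_correct) / group_size

# e.g. 8 upper-group and 1 lower-group student answered the item correctly
print(discrimination_index(8, 1, 8))  # 0.875 -> the item discriminates well
```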
c. Distractor analysis
Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in their
role as distractors. Similarly, if 15 students selected D and another 15
selected B, then C is not an effective distractor and should be replaced.
Therefore, the ideal situation would be for each of the three distractors to be
selected by an equal number of the students who did not get the answer
correct, in this case 10 students each. The effectiveness of each distractor
can then be quantified as 10/100 or 0.1, where 10 is the number of students
who selected the distractor and 100 is the total number of students who
took the test. This technique is similar to a difficulty index, although the
result does not indicate the difficulty of the item but rather the
effectiveness of the distractor. In the first situation described in this
paragraph, options A, B, C and D would have indices of 0.7, 0, 0,
and 0.3 respectively. If the distractors worked equally well, the indices
would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty of an
item, the value of this formula for a distractor must be interpreted in
relation to the indices of the other distractors.
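A minimal sketch of this per-option index, using the 100-student example above (the function name is an assumption):

```python
def option_shares(counts, total):
    """Proportion of all test-takers choosing each option (key and distractors)."""
    return {option: n / total for option, n in counts.items()}

# Key A chosen by 70 of 100 students; the remaining 30 split evenly.
print(option_shares({"A": 70, "B": 10, "C": 10, "D": 10}, 100))
# {'A': 0.7, 'B': 0.1, 'C': 0.1, 'D': 0.1} -> all three distractors work equally well
```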
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper groups and lower groups
would still remain, but the analysis and expectation would differ slightly from
the regular item discrimination that we have looked at earlier. Instead of
expecting a positive value, we should logically expect a negative value as
more students from the lower group should select distractors. Each
distractor can have its own item discrimination value in order to analyse how
the distractors work and ultimately refine the effectiveness of the test item
itself.
Item   | A  | B  | C  | D
Item 1 | 8* | 3  | 1  | 0
Item 2 | 2  | 8* | 2  | 0
Item 3 | 4  | 8* | 0  | 0
Item 4 | 1  | 3  | 8* | 0
Item 5 | 5  | 0  | 0  | 7*
(* indicates the key)
For Item 1, the discrimination index for each distractor can be calculated
using the discrimination index formula. From Table 8.5, we know that all the
students in the upper group answered this item correctly and only one student
from the lower group did so. If we assume that the three remaining students
from the lower group all selected distractor B, then the discrimination index for
item 1, distractor B will be:
This negative value indicates that more students from the lower group
selected the distractor compared to students from the upper group. This result
is to be expected of a distractor and a value of -1 to 0 is preferred.
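A minimal sketch of the calculation (the group size of 4 is an assumption for illustration, as the module does not state it):

```python
def distractor_discrimination(upper_picks, lower_picks, group_size):
    """(upper-group picks - lower-group picks) / group size; negative values
    are desirable because more weak students should choose a distractor."""
    return (upper_picks - lower_picks) / group_size

# Item 1, distractor B: 0 upper-group and 3 lower-group students chose it.
print(distractor_discrimination(0, 3, 4))  # -0.75, within the preferred -1 to 0
```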
EXERCISE
1. Calculate the mean, mode, median and range of the following set of
scores:
23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.
2. What is a normal curve and what does this show? Does the final
result always show a normal curve and how does this relate to
standardised tests?
9.0 SYNOPSIS
Topic 9 focuses on reporting assessment data. It provides teachers with brief
descriptions on the purposes of reporting and the reporting methods.
CONTENT
Assessing and reporting a student's achievement and progress in
comparison to other students.
iii. An outcomes approach
Acknowledges that students, regardless of their class or grade, can be
working towards syllabus outcomes anywhere along the learning
continuum.
programming and discussing samples of student work and
achievements within and between schools. Teacher judgement
based on well defined standards is a valuable and rich form of
student assessment.
Is time efficient and manageable
Effective and informative assessment practice is time efficient and
supports teaching and learning by providing constructive feedback to
the teacher and student that will guide further learning.
Teachers need to plan carefully the timing, frequency and nature of their
assessment strategies. Good planning ensures that assessment and
reporting is manageable and maximises the usefulness of the strategies
selected (for example, by addressing several outcomes in one
assessment task).
Recognises individual achievement and progress
Effective and informative assessment practice acknowledges that
students are individuals who develop differently. All students must be
given appropriate opportunities to demonstrate achievement.
Effective and informative assessment and reporting practice is
sensitive to the self-esteem and general well-being of students,
providing honest and constructive feedback.
Values and attitudes outcomes are an important part of learning that
should be assessed and reported. They are distinct from knowledge,
understanding and skill outcomes.
Involves a whole school approach
An effective and informative assessment and reporting policy is
developed through a planned and coordinated whole school approach.
Decisions about assessment and reporting cannot be taken
independently of issues relating to curriculum, class groupings,
timetabling, programming and resource allocation.
Actively involves parents
Schools and their communities are responsible for jointly developing
assessment and reporting practices and policies according to their local
needs and expectations.
Schools should ensure full and informed participation by parents in the
continuing development and review of the school policy on reporting
processes.
Conveys meaningful and useful information
Reporting of student achievement serves a number of purposes, for a
variety of audiences. Students, parents, teachers, other schools and
employers are potential audiences. Schools can use student
achievement information at a number of levels including individual, class,
grade or school. This information helps identify students for targeted
intervention and can inform school improvement programs. The form of
the report must clearly serve its intended purpose and audience.
Effective and informative reporting acknowledges that students can be
demonstrating progress and achievement of syllabus outcomes across
stages, not just within stages.
Good reporting practice takes into account the expectations of the
school community and system requirements, particularly the need for
information about standards that will enable parents to know how their
children are progressing.
Student achievement and progress can be reported by comparing
students' work against a standards framework of syllabus outcomes,
comparing their prior and current learning achievements, or comparing
their achievements to those of other students. Reporting can involve a
combination of these methods. It is important for schools and parents to
explore which methods of reporting will provide the most meaningful and
useful information.
10.0 SYNOPSIS
Topic 10 focuses on the issues and concerns related to assessment in the
Malaysian primary schools. It will look at how assessment is viewed and used
in Malaysia.
Use the different types of assessment in assessing language in school
(cognitive-level, school-based and alternative assessment).
CONTENT
SESSION TEN (3 hours)
that evaluates the performance of Malaysia's education
system against historical starting points and international
benchmarks. The Blueprint also offers a vision of the
education system and students that Malaysia both needs and
deserves, and suggests 11 strategic and operational shifts that
would be required to achieve that vision. The Ministry hopes
that this effort will inform the national discussion on how to
fundamentally transform Malaysia's education system, and will
seek feedback from across the community on this preliminary
effort before finalising the Blueprint in December 2012.
oral tests (for languages) that assess subject learning. LP develops
the test questions and marking schemes. The tests are, however,
administered and marked by school teachers;
Psychometric assessment refers to aptitude tests and a
personality inventory used to assess students' skills, interests,
aptitude, attitude and personality. Aptitude tests are used to assess
students' innate and acquired abilities, for example in thinking and
problem solving. The personality inventory is used to identify key traits
and characteristics that make up the student's personality. LP develops
these instruments and provides guidelines for use. Schools are,
however, not required to comply with these guidelines; and
Physical, sports, and co-curricular activities assessment
refers to assessments of student performance and participation
in physical and health education, sports, uniformed bodies, clubs,
and other non-school sponsored activities. Schools are given the
flexibility to determine how this component will be assessed.
10.4 Cognitive Levels of Assessment
Knowledge
Learning objectives at this level: know common terms, know specific facts,
know methods and procedures, know basic concepts, know principles.
Question verbs: Define, list, state, identify, label, name, who? when? where?
what?
Comprehension
The ability to grasp the meaning of material. Translating material from one
form to another (words to numbers), interpreting material (explaining or
summarizing), estimating future trends (predicting consequences or effects).
Goes one step beyond the simple remembering of material, and represents
the lowest level of understanding.
Application
The ability to use learned material in new and concrete situations. Applying
rules, methods, concepts, principles, laws, and theories. Learning outcomes
in this area require a higher level of understanding than those under
comprehension.
problems, construct graphs and charts, demonstrate the correct usage of a
method or procedure.
Question verbs: How could x be used to y? How would you show, make use
of, modify, demonstrate, solve, or apply x to conditions y?
Analysis
The ability to break down material into its component parts. Identifying parts,
analysis of relationships between parts, recognition of the organizational
principles involved. Learning outcomes here represent a higher intellectual
level than comprehension and application because they require an
understanding of both the content and the structural form of the material.
Synthesis
The ability to put parts together to form a new whole. This may involve the
production of a unique communication (theme or speech), a plan of
operations (research proposal), or a set of abstract relations (scheme for
classifying information). Learning outcomes in this area stress creative
behaviors, with major emphasis on the formulation of new patterns or
structure.
Learning objectives at this level: write a well organized paper, give a well
organized speech, write a creative short story (or poem or music), propose a
plan for an experiment, integrate learning from different areas into a plan for
solving a problem, formulate a new scheme for classifying objects (or events,
or ideas).
Evaluation
The ability to judge the value of material (statement, novel, poem, research
report) for a given purpose. The judgments are to be based on definite
criteria, which may be internal (organization) or external (relevance to the
purpose). The student may determine the criteria or be given them. Learning
outcomes in this area are highest in the cognitive hierarchy because they
contain elements of all the other categories, plus conscious value judgments
based on clearly defined criteria.
Centralised Assessment
Conducted and administered by teachers in schools using instruments,
rubrics, guidelines, time line and procedures prepared by LP
Monitoring and moderation conducted by PBS Committee at School,
District and State Education Department, and LP
School Assessment
The emphasis is on collecting first-hand information about pupils' learning
based on curriculum standards
Teachers plan the assessment, prepare the instrument and administer the
assessment during the teaching and learning process
Teachers mark pupils' responses and report their progress continuously.
Table 10.1: Contrasting Traditional and Alternative Assessment
Source: Adapted from Bailey (1998: 207) and Puhl (1997: 5)

Traditional | Alternative
Summative   | Formative
Intrusive   | Integrated
Judgmental  | Developmental
Alternative assessments are suggested largely due to a growing concern that
traditional assessments are not able to accurately measure the ability we are
interested in. They are also seen to be more student centred as they cater
for different learning styles, cultural and educational backgrounds as well as
language proficiencies.
Physical demonstration
Pictorial products
Reading response logs
K-W-L (what I know/what I want to know/what I've learned) charts
Dialogue journals
Checklists
Teacher-pupils conferences
Interviews
Performance tasks
Portfolios
Self assessment
Peer assessment
Portfolios
Self-appraisals are also thought to be quite accurate and are said
to increase student motivation. Puhl (1997) describes a case
study in which she believes self-assessment forced the students
to reread and thereby make necessary edits and corrections to
their essays before they handed them in. Nevertheless, in order
for self-assessment to be useful and not a futile exercise,
learners need to be trained and initially guided in performing their
self-assessment. This training involves providing students with
the rationale for self-assessment, how it is intended to work,
and how it is capable of helping them.
3. I have difficulty with some questions, but I generally get the meaning
EXERCISE
In your opinion, what are the advantages of using portfolios as
a form of alternative assessment?
REFERENCES
Biggs, J. B., & Collis, K. F. (1991). Multimodal learning and the quality
of intelligent behaviour. In H. Rowe (Ed.), Intelligence:
Reconceptualization and measurement (pp. 57-75). Hillsdale, NJ:
Lawrence Erlbaum.
Biggs, J. B., & Tang, C. (2009). Applying constructive alignment to
outcomes-based teaching and learning. Training material, Quality
Teaching for Learning in Higher Education Workshop for Master
Trainers. Ministry of Higher Education, Kuala Lumpur.

Black, P., & Wiliam, D. (2009). Developing the theory of formative
assessment. Educational Assessment, Evaluation and
Accountability, 21(1), pp. 5-31.
Available at: http://eprints.ioe.ac.uk/1119/. (Retrieved 23 August
2013)

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., &
McNamara, T. (1999). Dictionary of language testing. Cambridge:
University of Cambridge Local Examinations Syndicate and
Cambridge University Press.
Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S.,
Miller, J., & Newton, D. (2005). Frameworks for Thinking: A
handbook for teaching and learning. Cambridge: Cambridge
University Press.
English Publications.
Smith, T. W., & Colby, S. A. (2007). Teaching for Deep Learning. The
Clearing House, 80(5), pp. 205-211.

Stansfield, C., & Reed, D. (2004). The story behind the Modern
Language Aptitude Test: An interview with John B. Carroll
(1916-2003). Language Assessment Quarterly, 1, pp. 43-56.
Websites
http://www.catforms.com/pages/Introduction-to-Test-Items.html
(Retrieved 9.8.2013)
http://myenglishpages.com/blog/summative-formative-
assessment/ - (Retrieved 10.8.2013)
http://www.teachingenglish.org.uk/knowledge-database/objective-
test - (Retrieved 12.8.2013)
http://assessment.tki.org.nz/Using-evidence-for-learning/Concepts/Concept/Reliability-and-validity
MODULE WRITERS PANEL
TEACHER GRADUATE PROGRAMME (PROGRAM PENSISWAZAHAN GURU)
DISTANCE EDUCATION MODE
(PRIMARY EDUCATION)

NURLIZA BT OTHMAN (othmannurliza@yahoo.com)
QUALIFICATIONS:
M.A. TESL, University of North Texas, USA
B.A. (Hons) English, North Texas State University, USA
Graduate Teacher Training Certificate (Ministry of Education Malaysia)
WORK EXPERIENCE:
4 years as a secondary school teacher
21 years as a lecturer at an Institute of Teacher Education (IPG)

ANG CHWEE PIN (chweepin819@yahoo.com)
QUALIFICATIONS:
M.Ed. TESL, Universiti Teknologi Malaysia
B.Ed. (Hons.) Agricultural Science/TESL, Universiti Pertanian Malaysia
WORK EXPERIENCE:
23 years as a secondary school teacher
7 years as a lecturer at an Institute of Teacher Education (IPG)