Prof. Dr. Dinçay Köksal
Çanakkale Onsekiz Mart University, ELT Department, 2011

Testing and Evaluation

LANGUAGE TEACHING, LEARNING AND TESTING

Heaton (1990:5) emphasizes that there has been a tendency to separate testing from teaching in a large number of examinations in the past, and that it is impossible to work in either field without being constantly concerned with the other, because teaching and testing are closely interrelated. We must construct tests as devices to reinforce learning and motivate the learners; we must also construct them as a means of assessing the learner's performance in the target language. He states that in the former case testing is geared to teaching, while in the latter the case is the reverse.
Testing is an important part of the learning and teaching process. Madsen (1983:3-4), discussing how testing helps learners of English, stresses that well-made English tests help learners in at least two ways:
1. They create positive attitudes toward instruction and classroom experiences, by giving learners a sense of success and a feeling that the teacher's evaluation of them matches what they have been taught.

When discussing prejudices and problems concerning assessment, Harris and McCann (1994:2) stress that assessment may cause many students to feel panic and confusion, and that there is often great pressure on them to succeed: those who cannot succeed become branded as failures. Such competition creates more losers than winners. Many of these negative attitudes towards assessment come from a generalised feeling of a divorce between learning and teaching on the one hand, and assessment on the other. The fundamental reason for this is that assessment often does not feed back into the learning and teaching process.

2. They help learners learn the language, by requiring them to study hard for exams and by raising their awareness of course objectives and personal language needs.

Some teachers may feel that tests can be used as a way of motivating students to work harder, and virtually all of us feel insecure and uncomfortable when we have to pass or fail students. However, tests are not the only way to motivate students.

Reasons for testing


Let us look in detail at why we should test our students. Firstly, whenever a test is administered, there is a decision to be made. Harris (1969:2-3) lists the chief objectives of language testing as follows:
1. To determine readiness for instructional programmes.
2. To classify or place individuals in appropriate language classes.
3. To diagnose the individual's specific strengths and weaknesses.
4. To measure aptitude for learning.
5. To measure the extent of student achievement of the instructional goals.
6. To evaluate the effectiveness of instruction.

Harris and McCann (1994:26), giving the reasons for testing, state that we language teachers may want:
to find out about a candidate's suitability to follow a course of study;
to compare a student's performance with that of other students;
to find out how much a student has learned during the course or academic year, i.e. to compare what students can do at the end of the course with what they could do at the beginning;
to find out how a student is progressing during a course of study, and possibly to identify problem areas before the course ends.

In all these cases, we need to make decisions about the students. We must have a clear idea of the kind of decision to be made about our students, so that we can identify the most appropriate kind of test. Harris and McCann (1994:26) maintain that there is a common misconception, held by students and teachers alike, that a test is something which is done to or at people, rather than something which is done by them and for them. We should view the decisions which are made about students as decisions which are taken for them. Formal testing should be seen as a complement to other forms of assessment, e.g. self-assessment and informal assessment.
The basic difference is that a well-designed, reliable and valid test will measure students' ability in a more objective way than more subjective forms of assessment such as informal observation and self-assessment. This is not to say that all formal testing is objective; nor should we take the view that subjective is necessarily bad and objective necessarily good. For example, all oral tests are subjectively marked, while all multiple-choice tests are objectively marked.


Within a practical, comprehensive philosophy of language instruction and testing, every test becomes a natural rung in the ladder toward the instructional goal, that is, toward some desired degree of proficiency in the target language, and every instructional activity in which students participate becomes a language-testing activity. In such a comprehensive theory, tests express the essence of the instructional process as well as, or perhaps better than, any other activity. In other words, teaching itself is a testing procedure as much as it is an instructional one.


Tests in any classroom setting have a variety of functions that we must understand before we choose and administer any test. Language tests in the classroom may serve the following purposes:
Instructional. Good tests instruct students and enable them to improve their proficiency in the target language.
Managerial. Such tests provide feedback to both teachers and students and help them manage instruction and study practices, for instance by providing a sensible basis for grading.
Motivational. The tests serve as rewards or as goals, urging students and teachers toward higher achievement relative to well-defined goals.

Diagnostic. The tests help teachers and students identify specific instructional problems.
Curricular. Good tests define the curriculum as a whole.

Inappropriate language testing may result in, reflect, or even constitute ineffective language teaching. From such a pragmatic perspective, it might be argued that language testing is language instruction and, conversely, that language instruction is language testing.


DEFINITION OF BASIC TERMS


Here it will be useful to define three confusing terms, which are often used synonymously (Bachman 1995:18): measurement, test and evaluation.

MEASUREMENT
Bachman defines measurement as the process of quantifying the characteristics of persons according to explicit procedures and rules, and states that this definition contains three distinguishing features:

(i) Quantification. Measurement involves the assigning of numbers, which distinguishes measures from qualitative descriptions such as verbal accounts or non-verbal, visual representations. Non-numerical categories or rankings, such as letter grades ('A, B, C...') or labels (for example, 'excellent, good, average...'), may have some of the characteristics of measurement. However, when we actually use categories or rankings such as these, we frequently assign numbers to them in order to analyze and interpret them, and technically it is not until we do this that they constitute measurement.
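The distinction can be illustrated with a small sketch: the letter grades are mere categories until an explicit rule assigns numbers to them, at which point averages and other analyses become possible. The grade-to-point mapping below is an invented example, not a standard:

```python
# Letter grades ('A', 'B', ...) are categories, not yet measurement.
# Assigning numbers to them by an explicit, replicable rule turns them
# into measures that can be analyzed, e.g. averaged.

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}  # hypothetical rule

def quantify(grades):
    """Apply the explicit rule to a list of letter grades."""
    return [GRADE_POINTS[g] for g in grades]

points = quantify(["A", "B", "B", "C"])
print(points)                     # [4, 3, 3, 2]
print(sum(points) / len(points))  # 3.0 -- a class mean is now meaningful
```

Only once the explicit rule exists do operations such as computing a class mean, or comparing two classes, become defensible.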
(ii) Characteristics. We can observe physical attributes such as height and weight directly. In testing, however, we are almost always interested in quantifying mental attributes and abilities, sometimes called traits or constructs, which can only be observed indirectly. These mental attributes include characteristics such as aptitude, intelligence, motivation, field dependence/independence, attitude, native language, fluency in speaking, and achievement in reading comprehension. Whatever attributes or abilities we measure, it is important to understand that it is these attributes or abilities, and not the persons themselves, that we are measuring. That is, we must be aware that no single measure, or even battery of measures, can adequately characterize individual human beings in all their complexity.
(iii) Rules and procedures. Quantification must be done according to explicit rules and procedures; that is, the 'blind' or haphazard assignment of numbers to characteristics of individuals cannot be regarded as measurement. In order to be considered a measure, an observation of an attribute must be replicable, for other observers, in other contexts and with other individuals. Practically anyone can rate another person's speaking ability, for example. But while one rater may focus on pronunciation accuracy, another may find vocabulary to be the most salient feature. Or one rater may assign a rating as a percentage, while another might rate on a scale from zero to five. Ratings such as these can hardly be considered anything more than numerical summaries of the raters' personal conceptualizations of the individual's speaking ability, because the different raters did not follow the same criteria or procedures for arriving at their ratings.
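The scale problem in the last example can be sketched in a few lines: a percentage and a zero-to-five band are not comparable as raw numbers, but an explicit, shared conversion rule makes them so. The scores below are invented for illustration:

```python
# One rater reports a percentage, another a 0-5 band score. The raw
# numbers (80 vs. 4) look wildly different, yet may describe the same
# level. An explicit, shared procedure makes the two ratings comparable.

def to_common_scale(score, scale_max):
    """Explicit rule: express any rating as a proportion of its maximum."""
    return score / scale_max

rater_a = 80  # a percentage rating, out of 100 (invented)
rater_b = 4   # a band rating, out of 5 (invented)

print(to_common_scale(rater_a, 100))  # 0.8
print(to_common_scale(rater_b, 5))    # 0.8 -- the two ratings now agree
```

Note that a shared scale is only half of the requirement: the raters must also apply the same criteria when arriving at their raw scores.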
Measures, then, are distinguished from such 'pseudo-measures' by the explicit procedures and rules upon which they are based. There are many different types of measures in the social sciences, including rankings, rating scales, and tests.

TEST
Carroll provides the following definition of a test: 'a psychological or educational test is a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual' (1968:46). From this definition, it follows that a test is a measurement instrument designed to elicit a specific sample of an individual's behavior. As one type of measurement, a test necessarily quantifies characteristics of individuals according to explicit procedures. What distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior.
EVALUATION
Evaluation can be defined as the systematic gathering of information for the purpose of making decisions (Weiss 1972). The probability of making the correct decision in any given situation is a function not only of the ability of the decision maker, but also of the quality of the information upon which the decision is based. Everything else being equal, the more reliable and relevant the information, the better the likelihood of making the correct decision. Few of us, for example, would base educational decisions on hearsay or rumor, since we would not generally consider these to be reliable sources of information. Similarly, we frequently attempt to screen out information, such as sex and ethnicity, that we believe to be irrelevant to a particular decision.

One aspect of evaluation, therefore, is the collection of reliable and relevant information. This information need not be, and indeed seldom is, exclusively quantitative. Verbal descriptions, ranging from performance profiles to letters of reference, as well as overall impressions, can provide important information for evaluating individuals, as can measures such as ratings and test scores.

Evaluation, therefore, does not necessarily entail testing. By the same token, tests in and of themselves are not evaluative. Tests are often used for pedagogical purposes, either as a means of motivating students to study or as a means of reviewing material taught, in which case no evaluative decision is made on the basis of the test results. Tests may also be used for purely descriptive purposes.


Basic terms
Assessment
This term is used to refer to a variety of ways of collecting information on learners' language ability or achievement. Testing and assessment are often used interchangeably; however, the latter is an umbrella term which encompasses measurement instruments administered on a one-off basis, such as tests, as well as qualitative methods of monitoring and recording student learning, such as observation, simulations, or project work.
Assessment is also distinguished from evaluation, which is concerned with the overall language programme and not just with what individual learners have learnt. Here it will be useful to mention the kinds of assessment.
Proficiency assessment vs. achievement assessment: Proficiency assessment refers to the assessment of learners' general language abilities, independent of a course of study; such assessment is often done by administering standardised language proficiency tests. Achievement assessment, by contrast, is frequently carried out by the teacher, based either on the specific course content or on the course objectives, to establish what a learner has learnt in relation to a particular course.

Formative assessment vs. summative assessment: Formative assessment is carried out by teachers during the learning process, while summative assessment is carried out at the end of a course to establish what has been learnt.


APPROACHES TO LANGUAGE TESTING

When we take a quick glance at the history of language testing, we see that shifts in emphasis in language teaching have inevitably had consequences for language testing. Testing techniques and theories, however, have been rather more resistant to change than theories about methodology and course design. Changes in approaches to language teaching inevitably resulted in attempts to develop testing techniques appropriate to the new pedagogy. Such fundamental questions as 'What makes a test a good test?' and 'How should we go about constructing a test?' will receive quite different answers from adherents of different schools; procedures acceptable to one approach may be anathema to another, and so on.

The approaches to language testing can be categorized into five main groups, each tied to an era, a learning theory, a theory of language, and a teaching methodology:

Era | Psychology (learning theory) | Linguistics (theory of language) | Methodology (language teaching) | Testing approach
Pre-scientific | Faculty psychology | Traditional grammar | Grammar-Translation | Essay-translation
Scientific/modern | Behaviouristic | Structuralism | Audio-lingual | Discrete-point
Scientific/modern | Cognitivistic | Generativism | Cognitive-code learning | Integrative
Communicative/innovative | Humanistic | Semanticism | Communicative Approach | Pragmatic
Communicative/innovative | Psycholinguistic | Pragmatism | Functional-notional | Functional

PRE-SCIENTIFIC/TRADITIONAL APPROACH
Essay-translation approach
The essay-translation approach, commonly referred to as the pre-scientific stage of language testing, requires no special skill or expertise in testing and is best characterized by the use of tests such as essay writing, translation, and grammatical analysis.

Based on the language teaching philosophy of the grammar-translation method, learners are required to produce open-ended written work: passages for translation from and into the target language (often with a heavy literary and cultural bias), free compositions in the target language, and selected items of grammatical, textual or cultural interest. One must rely completely on the subjective judgement of the experienced teacher; it is assumed that a person who can teach can also judge the proficiency of his or her learners. There is a lack of concern for statistical matters or for such notions as objectivity and reliability.

STRUCTURALIST APPROACH
Discrete-point approach (psychometric-structuralist trend)

This approach, commonly referred to as the psychometric-structuralist trend, or the scientific stage of language testing, tries to identify and test the learner's mastery of separate elements of the target language (phonology, vocabulary, and grammar), divorcing words and sentences from context on the grounds that this makes it possible to cover a larger sample of language forms in a comparatively short time. The communicative skills (listening, speaking, reading and writing) are also tested separately, because it is considered important to test one thing at a time.

Unlike the essay-translation approach, the key concern here is to provide objective measures, using various statistical techniques to ensure reliability and certain kinds of validity.
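The objectivity claimed for discrete-point items can be illustrated with a minimal sketch: once an answer key is fixed, scoring requires no rater judgement at all. The items and key below are invented for illustration:

```python
# With a fixed answer key, scoring a discrete-point multiple-choice
# section is purely mechanical: any scorer applying the key arrives
# at exactly the same total.

ANSWER_KEY = {1: "b", 2: "a", 3: "d"}  # invented three-item key

def score(responses):
    """Count items whose chosen option matches the key."""
    return sum(1 for item, choice in responses.items()
               if ANSWER_KEY.get(item) == choice)

print(score({1: "b", 2: "c", 3: "d"}))  # 2 -- items 1 and 3 correct
```

This mechanical scoring is exactly what gives discrete-point tests their high reliability, and also why they say nothing about integrated language use.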

In this period of language testing, on the theoretical side, it was agreed that language learning is chiefly the systematic acquisition of a set of habits; on the practical side, testers wanted, and structuralists knew how to deliver, long lists of small items that could be tested objectively.

Based on the structural linguists' view of language and the testers' notion of discrete skills to be measured, we witness the flourishing of standardised tests with an emphasis on the so-called discrete-point item. Such tests aimed to serve the following goals:
1. diagnosing learner strengths and weaknesses;
2. prescribing curricula aimed at particular skills; and
3. developing specific teaching strategies to help learners overcome particular weaknesses;

4. planning remedial instruction;
5. helping educators to design effective programmes;
6. facilitating test administration and scoring procedures.

Discrete-point tests are considered disadvantageous because they have outdated and weak psychological and linguistic backgrounds; they are not efficient in measuring the learner's proficiency, since they have little or no relevance to the actual use of language in real-life situations; and they do not give any idea of the contribution of each single language element to overall language use.

The assumption that language components and skills cannot operate independently, and that language elements should be tested as a whole since they all function together to make language operate as a means of communication, gave rise to the holistic approach. The need to assess the practical language skills of foreign students who wanted to study in the UK and the USA, in parallel with the need within the communicative teaching movement for tests that measured productive language skills, led to a demand for language tests requiring an integrated performance from the language user.

INTEGRATIVE APPROACH
Integrative/Holistic Approach
This approach, primarily concerned with the total communicative effect of discourse and with meaning, involves the testing of language in context. Integrative tests, best characterized by the use of cloze tests, dictation, oral interviews, translation and essay writing, are designed to test the test-taker's ability to use two or more skills simultaneously. The arguments for tests based on this approach are:
1. They have stronger linguistic and psychological backgrounds than discrete-point tests.
2. They deal with the contextualised form of language.
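Of the integrative formats listed above, the cloze test is the easiest to sketch: every nth word of a passage is deleted and the test-taker must restore it from context. The fixed deletion rate is a common convention, assumed here for illustration:

```python
# A cloze generator: delete every nth word and keep the deleted words
# as the answer key. Restoring the gaps requires grammatical, lexical
# and contextual knowledge at once, which is what makes cloze integrative.

def make_cloze(text, n=7):
    """Return (gapped text, list of deleted words); n is the deletion rate."""
    gapped, answers = [], []
    for i, word in enumerate(text.split(), start=1):
        if i % n == 0:
            answers.append(word)
            gapped.append("_____")
        else:
            gapped.append(word)
    return " ".join(gapped), answers

passage, key = make_cloze("the quick brown fox jumps over the lazy dog", n=3)
print(passage)  # the quick _____ fox jumps _____ the lazy _____
print(key)      # ['brown', 'over', 'dog']
```

Note the interdependence of items that the criticisms below point to: a learner who fails one gap may be deprived of the context needed for the next.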

The arguments against such tests are as follows:
1. They are time-consuming to administer and score.
2. Their reliability and validity are a matter of discussion.
3. The items in dictation and cloze tests are interdependent.
4. Scores on dictation tests depend on understanding the spoken form of the language and on the ability to write quickly.
5. They require trained raters.

COMMUNICATIVE APPROACH
Communicative Approach (functional tests)
Communicative tests, primarily concerned with how language is used in communication, aim to include tasks that approximate those facing the learner in real-life situations. Hymes' theory of communicative competence has changed the scope of language testing. Hymes (1970) maintains that knowing a language requires more than knowing its rules: there are culturally specific rules of use, which relate the language used to features of the communicative context.
The model of communicative competence developed by Canale and Swain (1980) involves four types of competence:
Grammatical competence: knowledge of the systematic features of grammar, lexis and phonology (the focus of discrete-point testing).
Socio-cultural competence: knowledge of the rules of language use, in terms of what is appropriate with different types of interlocutor, in different settings, and on different topics.
Strategic competence: the ability to compensate in performance for incomplete or imperfect linguistic resources in a second language.
Discourse competence: the ability to deal with extended use of language in context.
In communicative language tests, language is used in a manner appropriate to its context; that is, the language introduced in tests should be useful and authentic. There are two important features of communicative language tests:
They are performance tests, requiring assessment to be carried out while the learner (or candidate) is engaged in the act of communication, whether receptive, productive, or both.
They pay attention to the social roles candidates are likely to assume in real-world settings and offer a means of specifying the demands of such roles in detail.

The British IELTS (International English Language Testing System) and the American TOEFL (Test of English as a Foreign Language) are good examples of widely used tests of English.


KINDS OF FOREIGN LANGUAGE TEST AND TESTING
PROGNOSTIC TEST
PLACEMENT/ENTRY TEST
SELECTION TEST
DIAGNOSTIC TEST
PROGRESS TEST
ACHIEVEMENT TEST
PROFICIENCY TEST
LANGUAGE APTITUDE TEST
DIRECT AND INDIRECT TESTS
NORM-REFERENCED AND CRITERION-REFERENCED TESTS
INTEGRATIVE AND PRAGMATIC TESTS
COMMUNICATIVE TESTS
TEACHER-MADE TESTS VS. STANDARDIZED TESTS

PROGNOSTIC TEST

The term refers to tests used to predict the test-takers' future course of action, making decisions about the most appropriate educational or occupational path for them, and about their likely future success, on the basis of their present capabilities. These tests fall into two main groups: (i) selection tests and (ii) placement tests.


PLACEMENT/ENTRY TEST

The term placement test refers to the measurement of the test-taker's capability to pursue a certain path of language learning. Placement tests are administered to learners at the beginning of a course or academic year to identify their levels. This type of test indicates at which level a learner will learn most effectively in a situation where there are different levels or streams. The aim is to produce groups which are homogeneous in level, so that institutional and teacher time is used most effectively. The larger the number of groups to be formed, the more homogeneous each group will need to be, and therefore the more reliable the entry or placement test will need to be. This type of test is less useful where students are grouped alphabetically or by age rather than by ability. Entry or placement tests are not very common in state-run institutions.
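The grouping logic described above can be sketched as follows. The cut-off scores and level labels are invented for illustration, since real placement cut-offs depend on the institution and the test:

```python
# Placement sketch: explicit cut-off scores assign each learner to the
# level group at which they are likely to learn most effectively.

CUTOFFS = [(70, "advanced"), (40, "intermediate"), (0, "elementary")]  # invented

def place(score):
    """Return the level whose cut-off the score first meets, top down."""
    for cutoff, level in CUTOFFS:
        if score >= cutoff:
            return level

groups = {}
for name, result in {"Learner A": 82, "Learner B": 55, "Learner C": 30}.items():
    groups.setdefault(place(result), []).append(name)

print(groups)  # {'advanced': ['Learner A'], 'intermediate': ['Learner B'], 'elementary': ['Learner C']}
```

The reliability requirement mentioned above translates directly here: a learner whose true level sits near a cut-off must not be flipped between groups by measurement error.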

SELECTION TEST

These tests are used to provide information concerning the test-takers' acceptance onto a particular programme. For instance, according to the scores the test-takers obtain on the TOEFL, they are accepted onto academic programmes at universities in the USA.


DIAGNOSTIC TEST

As the name suggests, this type of test is used to find out problem areas. Where other types of test are based on success, diagnostic tests are based on failure: we want to know in which areas a student or group of students is having problems, and which parts of a course or which learning objectives those students cannot cope with. One way of looking at this type of test is as a technique based on eliciting errors, rather than correct answers or correct language. Diagnostic information is vital for teachers in designing further course activities and working out remedial activities. The information can also be useful for learners, as they can analyze their own problems. Diagnostic testing is present in many progress tests, for the simple reason that progress tests identify problem areas. However, a reliable diagnostic test is difficult to design.

PROGRESS TEST

Progress tests are administered during courses, often after certain blocks of study (e.g. after a number of units, at the end of each week, or at the end of each term). They aim to find out how well classes as a whole and individual students have grasped the learning objectives, how well the course content is functioning within the specified aims and objectives, and what this implies for future course design.

These tests help teachers identify easily how well students are progressing in a very short period of time; e.g. a progress test of half an hour can give a great deal of information about the class, if the test is well designed and samples widely from the course content. Progress tests can also provide important feedback to the learners. When linked with self-assessment, this feedback can help learners to identify their own problems and to set their own goals for future learning.
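How a progress test can yield this kind of diagnostic information for the class can be sketched as follows: item scores are grouped by the learning objective each item samples, revealing which objectives the class has not grasped. The objectives, items and responses below are invented:

```python
# Item-level progress-test analysis: the facility value (proportion of
# correct answers) is computed per objective rather than per item,
# pointing the teacher to problem areas in the course content.

ITEM_OBJECTIVE = {1: "past tense", 2: "past tense", 3: "articles", 4: "articles"}

# 1 = correct, 0 = incorrect; one row of answers per student (invented)
responses = [
    {1: 1, 2: 1, 3: 0, 4: 0},
    {1: 1, 2: 0, 3: 0, 4: 1},
    {1: 1, 2: 1, 3: 1, 4: 0},
]

def facility_by_objective(responses, item_objective):
    """Proportion of correct answers per objective across the class."""
    totals, correct = {}, {}
    for row in responses:
        for item, mark in row.items():
            obj = item_objective[item]
            totals[obj] = totals.get(obj, 0) + 1
            correct[obj] = correct.get(obj, 0) + mark
    return {obj: correct[obj] / totals[obj] for obj in totals}

result = facility_by_objective(responses, ITEM_OBJECTIVE)
print({obj: round(v, 2) for obj, v in result.items()})
# {'past tense': 0.83, 'articles': 0.33} -- articles need remedial work
```

A summary like this only carries weight if, as noted above, the test samples widely from the course content.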


ACHIEVEMENT TEST

This type of test is designed to measure how much a learner has learnt from a particular course or syllabus. For instance, an achievement test may be a reading comprehension test based on the reading passages in a coursebook. It helps the teacher to judge the success of his or her teaching and to determine the learners' weaknesses. A proficiency test may use similar test items, but, as mentioned before, proficiency tests are not linked to any particular textbook or language syllabus. The achievement test serves two purposes: determining progress and diagnosing learner weaknesses.

PROFICIENCY TEST

This type of test is designed to measure the test-taker's general level of language mastery. It is not linked to a particular course of instruction. Some proficiency tests have been standardized for world-wide use, such as the American TOEFL, used to test the English language proficiency of foreign students who wish to study in the United States.

Proficiency tests aim to describe what students are capable of doing in a foreign language and are usually set by external bodies such as examination boards. They enable students to have some proof of their ability in a language. They also provide potential employers with some guarantee of proficiency, because examination boards are seen as bodies which set standards in an impartial way, and boards' examinations are generally considered to be reliable and valid. Some proficiency tests, while claiming to be communicative, often have a large language component such as grammar or vocabulary. This can have a negative washback effect on teaching, in terms of examination preparation.

DIRECT AND INDIRECT TESTS


Test formats and procedures that involve the use of language for communication in real-life situations are referred to as direct tests; oral interviews, for instance. Testing is said to be direct when it requires the test-takers to perform precisely the skill we intend to measure. That is, if we want to know how well they speak, we need to get them to speak, and if we want to know how well they write compositions, we need to get them to write compositions. The tasks and texts should be as authentic and realistic as possible.

The advantages of direct testing:
1. Provided that we are clear about what abilities or skills we wish to assess, it is relatively simple to create the conditions which will elicit the behaviour on which we can base our judgements.
2. In the case of the productive skills, the assessment and interpretation of the test-taker's performance is also quite straightforward.
3. Direct testing may have a positive washback effect, since practice for the test involves practice of the skills we want to develop.
Direct tests are usually contrasted with indirect tests.

DIRECT AND INDIRECT TESTS


When a test attempts to measure the abilities underlying the skills we are interested in, it is called an indirect test. For instance, one section of the TOEFL, in which the test-takers have to identify erroneous or inappropriate elements in formal standard English, was developed as an indirect measure of writing ability. Another example is testing pronunciation ability by a paper-and-pencil test in which test-takers have to identify pairs of words which rhyme with each other. Since the relationship between performance on these tests and performance of the skills in which we are usually more interested tends to be weak in strength and uncertain in nature, they do not give us feedback.

DIRECT AND INDIRECT TESTS


Should you aim for direct or indirect testing? To help with this decision you may find the following helpful. Indirect testing makes no attempt to measure the way language is used in real life, but proceeds by means of analogy. Some examples that you may have used are: most, if not all, of the discrete-point tests mentioned above; cloze tests; and dictation (unless on a specific office-skills course). Indirect tests have the big advantage of being very test-like. They are popular with some teachers and most administrators because they can be easily administered and scored; they also produce measurable results and have a high degree of reliability.

DIRECT AND INDIRECT TESTS


Direct tests, on the other hand, try to introduce authentic tasks which model the students' real-life future use of language. Such tests include: role-playing; information-gap tasks; reading authentic texts; listening to authentic texts; writing letters, reports, form-filling and note-taking; and summarising. Direct tests are task-oriented rather than test-oriented, they require the ability to use language in real situations, and they should therefore have a good formative effect on your future teaching methods and help you with curriculum writing. However, they do call for skill and judgement on the part of the teacher.
Testing and Evaluation anakkale Onsekiz Mart University ELT Department 52

DISCRETE POINT AND INTEGRATIVE TESTING


You may see or hear these terms when being asked to assess or even write a test, so let's see what they mean. Discrete-point tests are based on an analytical view of language: language is divided up so that its components may be tested. Discrete-point tests aim to achieve a high reliability factor by testing a large number of discrete items. From these separated parts, you can form an opinion which is then applied to language as an entity. You may recognise some of the following discrete-point tests:
1. Phoneme recognition.
2. Yes/No, True/False answers.
3. Spelling.
4. Word completion.
5. Grammar items.
6. Most multiple-choice tests.
Such tests have a downside in that they take language out of context and usually bear no relationship to the concept or use of whole language.

DISCRETE POINT AND INTEGRATIVE TESTING


Integrative tests
In order to overcome the above defect, you should consider integrative tests. Such tests usually require the testees to demonstrate simultaneous control over several aspects of language, just as they would in real language-use situations. Examples of integrative tests that you may be familiar with include:
1. Cloze tests
2. Dictation
3. Translation
4. Essays and other coherent writing tasks
5. Oral interviews and conversation
6. Reading, or other extended samples of real text
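Cloze, the first format in the list above, is typically constructed by deleting every nth word from a passage. The sketch below illustrates that fixed-ratio procedure; the function name and passage are invented for illustration, not taken from the works cited here.

```python
def make_cloze(text, n=5, lead_in=3):
    """Blank out every nth word, leaving the first `lead_in` words intact,
    and return the gapped passage together with the answer key."""
    words = text.split()
    answers = []
    for i in range(lead_in, len(words), n):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers

passage = ("Testing is an important part of the learning and teaching "
           "process because it gives both teachers and learners feedback.")
cloze, key = make_cloze(passage)
print(cloze)   # gapped passage for the testee
print(key)     # ['important', 'and', 'gives', 'feedback.']
```

Scoring can then accept either the exact deleted word or any acceptable word, which is the usual design choice in cloze marking.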

COMMUNICATIVE TESTS
These tests require test-takers to perform communicatively, in tasks like those of real-life situations. Communicative tests are concerned not only with these different aspects of knowledge but also with the test-takers' ability to demonstrate them in actual situations. So, how should you go about setting a communicative test? Firstly, you should attempt to replicate real-life situations, within which communicative ability can be tested as representatively as possible. There is a strong emphasis on the purpose of the test. The importance of context is recognised. There should be both authenticity of task and genuineness of texts. Tasks ought to be as direct as possible.

COMMUNICATIVE TESTS
When engaged in oral assessment we should attempt to reflect the interactive nature of normal speech and also assess the pragmatic skills being used.

Communicative tests are both direct and integrative. They attempt to focus on the expression and understanding of the functional use of language rather than on the more limited mastery of language form found in discrete-point tests.

The theoretical status of communicative testing is still subject to criticism in some quarters, yet as language teachers see the positive benefits accruing from such testing, such tests are becoming more and more acceptable.

COMMUNICATIVE TESTS
They will not only help you to develop communicative classroom competence but also help to bridge the gap between teaching, testing and real life. They are useful tools in curriculum development and in the assessment of future needs, as they aim to reflect real-life situations. For participating teachers and students this can only be beneficial.


COMMUNICATIVE TESTS
While preparing a communicative language test the following criteria should be taken into consideration:
Be precise about the skills and performance conditions.
Identify the most important components of language use in particular contexts.
Ensure that the sample of communicative language ability in our tests is as representative as possible.
Ensure that tests meet the performance conditions of a context as fully as possible.
Stress the important role of context, because language without a context cannot be meaningful.
Use authentic materials and genuine texts, because using inauthentic materials may cause problems.
Make tests of communicative language ability as direct as possible, attempting to reflect the real-life situation.
Use unsimplified language as input.
Conduct the tasks under normal time constraints.
Minimize any detrimental effect that may cause disturbance for the candidate.
Utilize holistic and qualitative assessment of the productive skills.


TEACHER-MADE TESTS VS.STANDARDIZED TESTS


The differences between teacher-made tests and standardised tests can be listed as follows:
1. As the name suggests, teachers determine the content of teacher-made tests, whereas the content of standardised tests is determined by the curriculum; that is, what to teach and what to test is defined in the syllabus.
2. The directions in teacher-made tests may be understood only by the students of the teacher who gives them, and may mean nothing to students from other schools, while the directions in standardised tests can be understood by students from all schools and nationalities, because they follow a uniform procedure and are culture-free.


TEACHER-MADE TESTS VS.STANDARDIZED TESTS


3. The norms in teacher-made tests are local and may be interpreted differently by different teachers, while in standardised tests the norms are described by specialists so as to be neither too low nor too high. If the standards are too high, the learners will be discouraged; if they are too low, the learners will find the tests very easy and will not put forth their maximum effort. The standards must be uniform, well balanced and clearly defined according to the expectations, objectives and level of proficiency to be attained by the learners.
4. Teacher-made tests, generally hurriedly constructed by a single teacher, depend on the teacher's intuition, knowledge and experience, while standardised tests, constructed by a group of experts, are pretested after the test items are prepared. Then, upon item analysis, the test items are revised before the test is administered.


TEACHER-MADE TESTS VS.STANDARDIZED TESTS


5. In teacher-made tests reliability is usually unknown; it can be high if they are carefully constructed. In standardised tests reliability is high, frequently above 0.90.


NORM-REFERENCED (NRTs) AND CRITERION-REFERENCED TESTS (CRTs)


Norm-referenced tests answer such questions as "How does student A compare with student B?" Attainment or achievement tests should be specifically designed to answer this question. Criterion-referenced tests answer such questions as "How much has student Y learnt?" or "How much does student X know?" Proficiency tests should be designed to answer such questions. NRTs thus differ from CRTs in focus, timing, purpose and theoretical motivation, and they reflect different perspectives and goals. Brown (1995) classifies CRTs and NRTs according to their test characteristics and logistical dimensions.

NORM-REFERENCED (NRTs) AND CRITERION-REFERENCED TESTS (CRTs)


Differences between NRTs and CRTs (Brown, 1995, p. 12):

Test Characteristics       CRTs                               NRTs
Underlying purposes        Foster learning                    Classify/group students
Types of decisions         Diagnosis, progress, achievement   Aptitude, proficiency, placement
Levels of generality       Classroom specific                 Overall, global
Students' expectations     Know content to expect             Do not know content
Score interpretations      Percent                            Percentile
Score report strategies    Tests and students' answers        Only scores go to students
                           returned to students

Logistical Dimensions      CRTs                               NRTs
Group size                 Relatively small group             Large group
Range of abilities         Relatively homogeneous             Wide range of abilities
Test length                Relatively few questions           Large number of questions
Time allocated             Relatively short administration    Long administration (2-4 hours)
Cost                       Teacher time & duplication         Test booklets, tapes, proctors
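The score-interpretation row of the table can be made concrete: a CRT reports a percent score against the tested material, while an NRT reports a percentile rank within the norming group. A minimal sketch with invented scores:

```python
from bisect import bisect_left

def percent_score(correct, total):
    """CRT-style interpretation: share of the tested material mastered."""
    return 100.0 * correct / total

def percentile_rank(score, group_scores):
    """NRT-style interpretation: share of the group scoring below this candidate."""
    ranked = sorted(group_scores)
    return 100.0 * bisect_left(ranked, score) / len(ranked)

group = [42, 55, 61, 68, 73, 80, 88, 95]   # invented norming-group scores
print(percent_score(73, 100))     # 73.0 -> mastered 73% of the items
print(percentile_rank(73, group)) # 50.0 -> outscored half of the group
```

The same raw score of 73 thus yields two different readings, which is exactly the CRT/NRT contrast in the table.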

PRAGMATIC TESTS


BASIC CONSIDERATIONS IN TEST DESIGN: CHARACTERISTICS OF GOOD LANGUAGE TESTS


The concept of validity: testing the test
1. Construct validity
2. Content validity
3. Face validity
The concept of reliability
Practicality


The concept of validity


The purpose of a test is to measure a learner's knowledge or ability, but it can only be considered valid if it actually measures what it is intended to measure. The concept of validity is significant for all aspects of the design of all types of test, whatever the approach or methods used for learning. A consideration of the various types of validity will include the following:
Construct validity
Content validity
Face validity
Washback validity
Criterion-related validity

The concept of validity: testing the test


Construct Validity
Mousavi (1997) defines construct validity as a form of validity which is based on the degree to which the items in a test reflect the essential aspects of the theory on which the test is based (the construct). Thus, if a test is based on a different construct from that used during the learning process, it cannot be said to have construct validity. Weir (1990) tells us that some researchers consider construct validity to encompass the other types of validity (Anastasi, 1982; Cronbach, 1971). A test is like a definition of a construct and must therefore adequately reflect the construct in order to be valid.

The concept of validity: testing the test


We can discover what a test measures by relating test scores to other external data. Weir (1990) views such data as a necessary condition for establishing test validity, but not of itself sufficient. Test designers should attempt to establish a priori construct validity for a test, even if there is no adequate framework for the construction of tests, as is the case with the communicative approach. Following the test, statistical procedures can be applied to determine the extent to which the test is valid.


The concept of validity: testing the test


Convergent and discriminant validity are two processes used to investigate the statistical relationship between a test and the construct being measured. Convergent validity means that the same trait is tested in different ways and there is a high level of correlation between the results. Discriminant validity requires that the results of two different methods of testing the same construct have a high correlation, whereas the same method of testing applied to different constructs should have a lower correlation value.
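This multitrait-multimethod logic can be sketched numerically. All the scores below are invented: two methods measuring the same trait (speaking) should correlate highly (convergent evidence), while the same candidates' scores on a different construct (grammar) should correlate less (discriminant evidence).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

speaking_interview = [55, 62, 70, 74, 81, 90]   # trait A, method 1 (invented)
speaking_roleplay  = [58, 60, 72, 75, 79, 92]   # trait A, method 2 (invented)
grammar_mcq        = [88, 51, 64, 90, 55, 70]   # trait B           (invented)

convergent = pearson(speaking_interview, speaking_roleplay)   # expected: high
discriminant = pearson(speaking_interview, grammar_mcq)       # expected: lower
```

A validation study would run this comparison over a full correlation matrix of traits and methods, not just two pairs.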


The concept of validity: testing the test


Statistical a posteriori validation thus occurs if a test

correlates closely with the behaviour it is intended to measure. However, another school of thought emphasises the need for a priori construct validation of test design. Since validity is based on the construct, and the construct is in turn based on a theory, the theory must be developed as fully as possible, so that clear statements can be made about what should be measured. Following the test, statistical investigations can then be used to confirm validity.


The concept of validity: testing the test


Content Validity
Content validity refers to the extent to which a test contains representative samples of, and measures, those aspects of behaviour it is designed to measure. Some scholars consider that construct and content validity overlap to a great extent, especially in the case of tests of general proficiency. Communicative tests are often intended to pinpoint areas of deficiency, and content validity is very useful in this case, as it determines the extent to which the tasks used in the test reflect the universe of tasks being tested. Anastasi (in Weir, 1990) gives guidelines for establishing content validity:

The concept of validity: testing the test


1. the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions;
2. the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared;
3. content validity depends on the relevance of the individual's test responses to the behaviour area under consideration, rather than on the apparent relevance of item content.

The concept of validity: testing the test


Defining the area of language to be tested is thus of

paramount importance in establishing content validity, but this can sometimes be very difficult, especially if the area to be covered is very wide or imprecise. Testers should, however, aim to design a test which is as relevant as possible. It is very important to specify clearly what exactly is being tested and to construct items accordingly. One way of investigating whether this has been achieved is to carry out an introspective study with some of the test candidates. This can help establish whether the processes used by candidates actually conform to the test specifications.

The concept of validity: testing the test


Face Validity
If, to an observer, a test appears to measure what it purports to measure, then it is said to have face validity. This is a subjective evaluation which thus has no place in objective assessments of validity, and it is therefore rejected as being of no relevance by some scholars. However, the question of face validity may be of significance to test-takers, administrators and other interested parties. If there is no perceived validity, and the test seems unacceptable to candidates, their performance may be affected. Test designers must not overemphasise the importance of face validity to the detriment of objective validity, but it seems it should be taken into consideration for practical reasons.


The concept of validity: testing the test


Washback Validity
Washback validity is defined by Mousavi (1997) as the degree to which a test satisfies students, teachers and future users of test results. If tests are designed to provide feedback for teachers and students, and thus identify areas of weakness or strength, then the tests possess washback validity. They contribute to learning and can also have an effect on the curriculum. In the case of such tests, there is obviously a close relationship between the test and the teaching which preceded it.


The concept of validity: testing the test


Criterion-related Validity
Whereas the types of validity discussed so far are closely related to what the test measures, in the case of criterion-related validity this relationship is not so crucial. Criterion-related validity is concerned with the correlation of test scores with external performance criteria and can be divided into concurrent and predictive validity. Cohen (in Celce-Murcia, 1991) explains them as follows:


The concept of validity: testing the test


1. Concurrent validity: test results are compared with results from another test given at about the same time.
2. Predictive validity: test results are compared with results from another test or another type of measure obtained at a later date.


The concept of validity: testing the test


Although some researchers maintain that external validation is always preferable to internal analysis of a test, it is possible that a test valid in this way may not possess construct validity. Even if it does, it may be difficult to find similar tests to be used as concurrent criteria in this method of validation, especially in communicative testing. In such cases, it may be possible to use other non-test criteria, although it would be extremely difficult to establish these.

Criterion-related validity measures can be very useful, especially if the results of a test will have an effect on the candidates' futures. Criterion-related validity can indicate the equivalence of one test with another.

The concept of reliability


Reliability refers to the consistency of test scores. It simply means that a test would give similar results if it were given at another time. It is about producing precise and repeatable measurements on a clear scale of measurement units. Such tests give consistent results across a wide range of situations. This is achieved by carefully piloted trials of the test. Sometimes, several versions of the test may be used on a controlled population of testees. The outcomes of these trials are carefully analysed in order to establish consistency. When a consistent set of figures is achieved the test may be deemed reliable.
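The consistency established in such trials is usually reported as a reliability coefficient. One standard single-administration estimate (an illustrative choice; the text does not name a specific statistic) is Cronbach's alpha, sketched here on invented item data:

```python
from statistics import pvariance

# rows = test-takers, columns = items (1 = correct, 0 = incorrect); invented data
items = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]
k = len(items[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*items)]  # variance of each item
total_var = pvariance([sum(row) for row in items])   # variance of total scores

# Cronbach's alpha: internal-consistency estimate of reliability
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # 0.68
```

A real pilot would use far more testees and items; the teacher-made vs. standardised contrast later in these notes (reliability "frequently above 0.90") refers to coefficients on exactly this 0-1 scale.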


The concept of reliability


Three important factors affect test reliability.
Test factors, such as the format and content of the questions and the length of the exam, must be consistent.
Administrative factors are also important for reliability. These include the classroom setting (lighting, seating arrangements, acoustics, lack of intrusive noise, etc.) and how the teacher manages the exam administration.
Affective factors in the response of individual students can also affect reliability. Test anxiety can be alleviated by coaching students in good test-taking strategies.

Social issues in language testing


1. Ethics in language testing
2. Accountability
3. Washback
4. Test impact


Testing Ethics
The moral and social considerations involved in the construction and use of tests.


Washback
The influence of tests on teaching and learning is called the washback or backwash effect. If your students have to do a test or maybe a public examination at the end of the course, this will affect the syllabus. If we have a good test, this should affect teaching in a positive manner. If we have a bad test, this might affect teaching in a negative manner. What is a 'good' test and what is a 'bad' test? A test can have a positive influence if it contains authentic, real-life examples of the type of tasks which your learners will need to perform in the future.


Washback
Tests can have a negative influence if they contain artificial tasks not linked to real future needs. Teaching methods will probably reflect these tasks and the learning process could end up revolving around what we might term 'exam practice'. Your own tests will also have an effect on your students' learning. If you test mainly grammar, your students will assume that this is the most important thing to learn and may make less effort during other more communicative activities.

Washback
Hughes (2003: 53-57) makes the following suggestions for achieving beneficial or positive backwash.
We should test the abilities whose development we want to encourage. If the teacher wants to encourage oral ability, he must test oral ability. Teachers generally "tend to test what is easiest to test rather than what is most important to test" (p. 53), because an oral interview, a direct way of testing oral ability, requires subjective scoring, or marking. However, we know that it is possible to obtain a high level of reliability through analytic rating.

Washback
We should train teachers about innovations in language teaching and testing. Guidance and training have to be given to teachers when a new test is introduced; if they are not, the test will not produce the intended effect. New tests require changes in teaching, since test techniques must be parallel to teaching techniques.

Washback
We should sample widely and unpredictably. The sample should be representative of the full scope of the specifications. It is necessary to include a wide range of tasks when testing a skill if we want backwash to be beneficial. Similarly, the content of a test "shouldn't be predictable if we do not want teaching and learning to concentrate on what can be predicted" (p. 54). So tests must cover the full range of the specifications.

Washback
We should use direct testing. If we want learners to learn to write a scientific report, then we should get them to write a scientific report in the test. Direct testing refers to the testing of performance skills through the use of authentic texts and tasks.


Washback
We should make testing criterion-referenced. Test specifications should make clear what test-takers can do, and with what degree of success, if we want learners to have a clear picture of what they are expected to achieve. Learners who can perform the tasks at the criterial level are considered successful, regardless of how other learners perform. These things help motivate learners.
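This decision rule is simple enough to state in code. The names and the 80% criterial level below are invented for illustration:

```python
CRITERION = 80  # assumed criterial level: 80% of the specified tasks performed

scores = {"Ali": 85, "Banu": 78, "Cem": 92}   # invented task-completion percentages

# Each learner passes or fails against the fixed criterion,
# regardless of how the other learners perform (no ranking involved).
results = {name: ("pass" if score >= CRITERION else "fail")
           for name, score in scores.items()}
print(results)  # {'Ali': 'pass', 'Banu': 'fail', 'Cem': 'pass'}
```

Contrast this with a norm-referenced rule, where each learner's outcome would depend on where the whole group's scores fall.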

Washback
We should base achievement tests on objectives. To provide a better picture of what has actually been achieved in the learning and teaching process, it is necessary to base achievement tests on objectives, rather than on detailed teaching and coursebook content.

Washback
We should ensure that the test is understood by students and teachers. If we want a beneficial washback effect, language teachers and learners need to understand and know what the test demands of them. The rationale, the specifications and sample items should be made available to those responsible for test preparation. This kind of information will also increase test reliability.

Accountability
A final element is accountability. As professionals, teachers should be able to provide learners, parents, institutions and society in general with clear indications of what progress has been made and, if it has not, why that is so. We should be able to explain the rationale behind the way assessment takes place and how conclusions are drawn, rather than hiding behind a smoke screen of professional secrecy.

Test impact
The wider effect of tests on the community as a whole beyond the classroom, including the school, is referred to as test impact. Examples are tests such as TOEFL and IELTS, which are used as gatekeeping mechanisms for international education and administered to huge numbers of candidates all over the world.


Test impact
The impact of test use, through test taking and the use of test scores, operates at two levels:
Macro: society and the education system
Micro: individuals


Test impact
The impact of test use needs to be considered within the values and goals of society and of the educational programme in which it takes place, and according to the potential consequences of such use.


Holistic vs. Analytic Scoring of Writing


Holistic Scoring:
Often referred to as "impressionistic" scoring.
Involves the assignment of a single score to a piece of writing on the basis of an overall impression of it.
Individual features of a text, such as grammar, spelling, and organization, should not be considered as separate entities.
Has the advantage of being very rapid (Hughes 1989: 86).

"Since, in holistic scoring, the entire written text is evaluated as a whole, it is important to establish the specific criteria upon which the evaluation is to be based prior to undertaking the evaluation. This does not mean establishing a catalogue of precise individual errors that might appear, but rather deciding what impact the errors that are present have on the overall tone, structure, and comprehensibility of the writing sample" (Terry 1989: 49).

Holistic vs. Analytic Scoring of Writing


The holistic scoring scale used by Educational Testing Service for evaluating the Advanced Placement Examination in foreign languages:
works well and can be altered to fit the level of the students and the focus of instruction;
is a numerical scale that ranks performance at levels described as "superior," "competent," and "incompetent";
allows the descriptions for each level to be changed to reflect the kind of performance that teachers expect at a given level of language ability.
The reliability of this scoring method is considered good when the raters are trained to establish common standards based on practice with the kinds of writing samples that they will be evaluating (Cooper 1977).

Holistic vs. Analytic Scoring of Writing


Analytic Scoring:
A method of scoring that requires a separate score for each of a number of aspects of a task, such as grammatical accuracy, vocabulary, idiomatic expression, organization, relevance and coherence.
Disposes of the problem of uneven development of subskills in individuals.
Scorers are compelled to consider aspects of performance which they might otherwise ignore.
The very fact that the scorer has to give a number of scores will tend to make the scoring more reliable.
In some schemes, each of the components is given equal weight; in other schemes, the relative importance of the different aspects, as perceived by the tester (with or without statistical support), is reflected in weightings attached to the various components.
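The weighting idea in the last point can be sketched as follows. The component names and weights are invented for illustration, not a published scheme:

```python
# Illustrative weighted analytic scheme: each component is scored 0-100
# and the weights encode the tester's view of their relative importance.
weights = {"grammar": 0.25, "vocabulary": 0.20, "organization": 0.20,
           "coherence": 0.20, "relevance": 0.15}

def analytic_score(component_scores, weights):
    """Combine per-component scores into a single weighted total."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(weights[c] * component_scores[c] for c in weights)

essay = {"grammar": 70, "vocabulary": 80, "organization": 60,
         "coherence": 75, "relevance": 90}
print(analytic_score(essay, weights))  # 74.0
```

An equal-weight scheme is the special case where every weight is 1/k for k components; the weighted form makes the tester's priorities explicit and reportable to students.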

Holistic vs. Analytic Scoring of Writing


Disadvantages:
It takes more time than holistic scoring.
Concentration on the different aspects may divert attention from the overall effect of the piece of writing. Inasmuch as the whole is often greater than the sum of its parts, a composite score may be very reliable but not valid (Hughes 1989: 93-94).
"Students should receive a copy of the analytic scoring criteria in advance. They will then know what is expected of them and how to interpret the evaluation of their written work. They can readily see where their strengths and weaknesses lie and can, over time, visualize their progress with subsequent evaluated samples of their writing" (Terry 1992: 247).

REFERENCES
Harris, David. 1979. Testing English as a Second Language. New York: McGraw-Hill Book Company.
Birjandi, Parviz, Bagheridoust, Esmaeil, and Mossalanejad, Parviz. 1999. Language Testing: A Concise Collection for Graduate Applicants. An Alternative Resolution to the Complications of Language Testing. Tehran: Shahid Mahdavi Publications.
Brown, J. D. 1995. "Differences between norm-referenced and criterion-referenced tests." In J. D. Brown and S. O. Yamashita (Eds.), Language Testing in Japan (pp. 12-19). Tokyo: The Japan Association for Language Teaching.
Brown, James Dean. 1996. Testing in Language Programs. New Jersey: Prentice Hall.
Bynom, Anthony. 2001. "Testing: Basic Concepts: Basic Terminology." English Teaching Professional, Issue 20, July 2001.

Weir, Cyril J. 1990. Communicative Language Testing. New York: Prentice Hall.
Hughes, Arthur. 2003. Testing for Language Teachers. 2nd edition. Cambridge: Cambridge University Press.
Baker, David. 1989. Language Testing: A Critical Survey and Practical Guide. London: Edward Arnold.
Heaton, J.B. 1990. Writing English Language Tests. 3rd impression. London and New York: Longman.
McNamara, Tim. 2000. Language Testing. Oxford: Oxford University Press.

Bachman, Lyle F. 1995. Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, Lyle F. and Palmer, Adrian S. 2000. Language Testing in Practice. Oxford: Oxford University Press.
Genesee, Fred and Upshur, John A. 1998. Classroom-based Evaluation in Second Language Education. Third impression. Cambridge: Cambridge University Press.
Harris, Michael and McCann, Paul. 1994. Assessment. Heinemann.
Mousavi, Abbas Seyyed. 1999. A Dictionary of Language Testing. 2nd edition. Rahnama.

Terry, Robert M. 1989. "Teaching and Evaluating Writing as a Communicative Skill." Foreign Language Annals 22 (1): 43-54.
Terry, Robert M. 1992. "Improving Inter-rater Reliability in Scoring Tests in Multisection Courses." AAUSC Issues in Language Program Direction: Development and Supervision of Teaching Assistants in Foreign Languages. Boston: Heinle & Heinle Publishers.
Cooper, Charles R. 1977. "Holistic Evaluation of Writing." In C.R. Cooper and L. Odell (Eds.), Evaluating Writing. Urbana, IL: National Council of Teachers of English, 3-31.

Oller, John W., Jr. 1991. "Foreign Language Testing, Part 1: Its Breadth." ADFL Bulletin 22 (3): 33-38.
