
Quantitative research methods

in educational planning
Series editor: Kenneth N. Ross

Module 1
T. Neville Postlethwaite
Institute of Comparative Education
University of Hamburg

Educational research:
some basic concepts
and terminology

UNESCO International Institute for Educational Planning


Quantitative research methods in educational planning

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible
for the educational policy research programme conducted by the Southern and
Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ).

The publication is available from the following two Internet Websites:


http://www.sacmeq.org and http://www.unesco.org/iiep.

iiep/web doc/2005.01

International Institute for Educational Planning/UNESCO


7-9 rue Eugène-Delacroix, 75116 Paris, France
Tel: (33 1) 45 03 77 00
Fax: (33 1) 40 72 83 66
e-mail: information@iiep.unesco.org
IIEP web site: http://www.unesco.org/iiep

September 2005 © UNESCO

The designations employed and the presentation of material throughout the publication do not imply the expression of
any opinion whatsoever on the part of UNESCO concerning the legal status of any country, territory, city or area or of
its authorities, or concerning its frontiers or boundaries.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means: electronic, magnetic tape, mechanical, photocopying, recording or otherwise, without permission
in writing from UNESCO (International Institute for Educational Planning).

Graphic design: Sabine Lebeau


Typesetting: Sabine Lebeau
Printed in IIEP’s printshop

Content
1. Introduction

2. Types of educational research

3. Three types of research questions in educational planning
   Descriptive questions
   Correlational questions
   Causal questions

4. Identifying research issues for educational planning

5. Sequential stages in the research process
   General and specific research questions
   Literature review
   Research design
   Instrumentation
   Pilot testing
   Data collection
   Data analysis
   Research report


6. Conclusion

Appendix A
Terminology used in educational research
   Formative and summative evaluation
   Assessment, evaluation, and research
   Measurement
   Surveys and experiments
   Tests
      1. Test items
      2. Sub-scores/Domain scores
   Variable
      1. Types of variables
   Validity and reliability
      1. Validity
      2. Reliability
   Indicator
   Attitude scales

Appendix B
Further reading suggestions
   Introductory texts
   Examples of educational research studies that aimed to have an impact on educational planning
   Encyclopedias and handbooks
   Journals

Appendix C
Exercises

1 Introduction

Research is the orderly investigation of a subject matter for the


purpose of adding to knowledge. Research can mean ‘re-search’
implying that the subject matter is already known but, for one
reason or another, needs to be studied again. Alternatively, the
expression can be used without a hyphen and in this case it
typically means investigating a new problem or phenomenon.

Within the realm of educational planning, many things are always


changing: the structure of the education system, curriculum
and textbooks, modes of teaching, methods of teacher training,
the amount and type of provisions to schools such as science
laboratories, textbooks, furniture, classroom supplies, and so on.
These changes may lead to an improvement, or a worsening, in
the quality of an educational system. Sometimes they may result
in no impact upon quality – in which case major government
expenditures on such changes have been wasted. The educational
planner working within this kind of environment must be able to
undertake assessments of the effects of major changes and then
provide policy advice that will consolidate and extend the most
productive courses of action, and also intercept and terminate
existing practices that are shown to be damaging and wasteful.


2 Types of educational research


There are many types of educational research studies and there are
also a number of ways in which they may be classified. Studies may
be classified according to topic whereby the particular phenomena
being investigated are used to group the studies. Some examples
of educational research topics are: teaching methods, school
administration, classroom environment, school finance, etc. Studies
may also be classified according to whether they are exploratory or
confirmatory.

An exploratory study is undertaken in situations where there is


a lack of theoretical understanding about the phenomena being
investigated so that key variables, their relationships, and their
(potential) causal linkages, are the subject of conjecture. In
contrast a confirmatory study is employed when the researcher has
generated a theoretical model (based on theory, previous research
findings, or detailed observation) that needs to be tested through
the gathering and analysis of field data.

A more widely applied way of classifying educational research


studies is to define the various types of research according to the
kinds of information that they provide. Accordingly, educational
research studies may be classified as follows:

1. Historical research generates descriptions, and sometimes


attempted explanations, of conditions, situations, and events
that have occurred in the past. For example, a study that
documents the evolution of teacher training programs since the
turn of the century, with the aim of explaining the historical
origins of the content and processes of current programs.

2. Descriptive research provides information about conditions,
situations, and events that occur in the present. For example,
a survey of the physical condition of school buildings in order
to establish a descriptive profile of the facilities that exist in a
typical school.

3. Correlational research involves the search for relationships


between variables through the use of various measures of
statistical association. For example, an investigation of the
relationship between teachers’ satisfaction with their job and
various factors describing the provision and quality of teacher
housing, salaries, leave entitlements, and the availability of
classroom supplies.

4. Causal research aims to suggest causal linkages between


variables by observing existing phenomena and then searching
back through available data in order to try to identify plausible
causal relationships. For example, a study of factors related to
student ‘drop out’ from secondary school using data obtained
from school records over the past decade.

5. Experimental research is used in settings where


variables defining one or more ‘causes’ can be manipulated
in a systematic fashion in order to discern ‘effects’ on other
variables. For example, an investigation of the effectiveness
of two new textbooks using random assignment of teachers
and students to three groups – two groups for each of the new
textbooks, and one group as a ‘control’ group to use the existing
textbook.

6. Case study research generally refers to two distinct


research approaches. The first consists of an in-depth study
of a particular student, classroom, or school with the aim of
producing a nuanced description of the pervading cultural
setting that affects education, and an account of the interactions


that take place between students and other relevant persons.


For example, an in-depth exploration of the patterns of
friendship between students in a single class. The second
approach to Case Study Research involves the application of
quantitative research methods to non-probability samples
– which provide results that are not necessarily designed to be
generalizable to wider populations. For example, a survey of the
reading achievements of the students in one rural region of a
particular country.

7. Ethnographic research usually consists of a description of


events that occur within the life of a group – with particular
reference to the interaction of individuals in the context of the
sociocultural norms, rituals, and beliefs shared by the group.
The researcher generally participates in some part of the normal
life of the group and uses what he or she learns from this
participation to understand the interactions between group
members.
For example, a detailed account of the daily tasks and
interactions encountered by a school principal using
observations gathered by a researcher who is placed in the
position of ‘Principal’s Assistant’ in order to become fully
involved in the daily life of the school.

8. Research and development research differs from the


above types of research in that, rather than bringing new
information to light, it focuses on the interaction between
research and the production and evaluation of a new product.
This type of research can be ‘formative’ (by collecting evaluative
information about the product while it is being developed with
the aim of using such information to modify and improve
the development process). For example, an investigation of
teachers’ reactions to the various drafts and redrafts of a new
mathematics teaching kit, with the information gathered at
each stage being used to improve each stage of the drafting


process. Alternatively, it can be ‘summative’ (by evaluating


the worth of the final product, especially in comparison to
some other competing product). For example, a comparison
of the mathematics achievement of students exposed to a new
mathematics teaching kit in comparison with students exposed
to the established mathematics curriculum.


3 Three types of research questions in educational planning
In research on issues concerned with educational planning, the
main educational research questions can be subsumed under three
categories: descriptive, correlational, and causal.

Descriptive questions
In the field of educational planning, the research carried out on
descriptive questions is often focused on comparing the existing
conditions of schooling with: (i) legislated benchmark standards,
(ii) conditions operating in several other school systems, or (iii)
conditions operating in several sectors of a single school system.

Some examples are:

• What is the physical state of school buildings in the country?


Do some districts or regions have better or worse school
buildings than others? (Behind these two questions are the
implications that the Ministry of Education wishes to ensure
that all schools have a minimum standard of school building
while at the same time ensuring that there are not large
differences among schools with respect to the state of their
buildings.)

• Do the supplies and equipment in classrooms in the schools
match the legislated standards set by the Ministry? (The
supplies and equipment might be textbooks, exercise books,
pencils, erasers, seats and desks. The Ministry may have norms
that each student in a particular grade must have one mother
tongue textbook, one math textbook, one science textbook and
one social studies textbook, four exercise books, three pencils
and one eraser in a year, and that each student must have
one seat and one writing place. The research required in this
situation then consists of undertaking a count of the supplies
and equipment in all schools, or in a scientific sample of schools
that can be used to estimate the situation in all schools, and
then matching every school or classroom against the Ministry’s
norms. The main aim of this kind of research study would be to
examine whether there are particular districts or regions which
are under-supplied or over-supplied.)

• Where systems of education have teacher housing, how


adequate is the housing? (Some Ministries take the view that if
the personal needs of teachers such as teacher housing do not
satisfy the teachers then they will not be committed teachers.
So, again, the research will consist of fact-finding about teacher
housing and about teachers’ satisfaction with these housing
conditions.)

• What is the level of achievement in the core subject areas at a


particular level of schooling? Does such achievement accord
with the Ministry’s view of what should have been learned by
all students or particular sub-groups of students? A further
question is often raised about student achievement – is it better,
worse, or the same as last year’s achievement for the particular
grade group? And, again, are there differences between regions,
or urban and rural children, and so on? (By the early 1990s
there was a thrust from some systems of education – especially
that of the United States of America – to have comparisons


of achievement (at the same age level) of different national


systems of education for countries at a similar level of economic
development.)

Correlational questions
Behind these kinds of questions, there is often an assumption
that if an association is found between variables then it provides
evidence of causation. However, care must be exercised when
moving between the notions of association and causation. For
example, an ‘association’ may be discovered between the incidence
of classroom libraries and average class reading scores. However,
the real ‘cause’ of higher reading scores may be that students from
high socio-economic backgrounds, while they tend to be in classes
with classroom libraries, read better than other students because
their home environments (in terms of physical, emotional, and
intellectual resources) facilitate the acquisition of reading skills.

Some examples are:

• Do students in poorer school buildings have lower achievement


scores than those in better buildings?

• Do students in better equipped classrooms have better


achievement scores than those in less well-equipped
classrooms?

• Do students in schools where the teachers have better teacher


housing have higher achievement than students in schools
where teachers have poorer teacher housing?

• Do male students do better than female students in the


scientific/technical subject areas?


Causal questions
Causal questions are usually the most important to educational
planners. For example, in some schools it is considered normal for
children to have a desk at which to sit. In other schools the children
sit on the ground and write on their laps. It is important to know if
schools (with a particular socio-economic background of children)
with a shortage of desks and seats achieve less well than schools
(with a similar socio-economic background of children) with an
adequate supply of desks and chairs. Or, to put the question in a
different way, is it the desks and chairs, or something else, which
really cause the better achievement? It may be a better supply
of books or better qualified teachers or, or, or.... It is, therefore,
important to disentangle the relative influence of each of the many
input and process factors in schools on achievement.

As will be seen from another module in this series on ‘Research


Design’ both survey and experimental designs can be used to assess
the relative influence of many factors on educational achievement.
It is unusual in education to find only one factor influencing student
educational achievement. It is rather the case that several, or even
many, factors from outside and inside the school influence how well
or poorly students achieve in school.

Thus, causal questions take one of two forms. Some examples are:

• All other factors being equal do students with Textbook A


achieve better than students with Textbook B?

• What is the relative effect on school achievement of the


following factors:

• the socio-economic level of students in the school;


• the general parental help given to the children with their
homework;


• peer group pressure;


• the condition of the school buildings;
• the supplies and equipment in the classroom;
• the curriculum;
• the quality of teaching, etc.
Given that many factors will affect student achievement, then it is
those factors that have a large influence which must be of concern
to educational planners. Once the important factors have been
identified then the planners can decide on the action they wish to
take. Let us assume that it is found that the existence of sufficient
desks and chairs does have a major influence on achievement, then
it is up to the Ministry of Education to ensure that sufficient desks
and chairs are made available. But, once this has been accomplished
it will be necessary to undertake further research to discover what
now is the most important factor influencing achievement.

4 Identifying research issues for educational planning
The reason why all educational planners should be prepared to
undertake research is that it is important to be sure of the facts
before making suggestions for changes in educational policies and
practices. The maxim must always be “when in doubt, find out”.

The questions listed above are only examples and are very general
in nature. It is up to each Ministry and educational planning office
to pose its own questions in order to remove doubt. The formulation
of research questions is, however, not an easy matter.

A wise educational researcher, educational planner, and ex-Minister


of Education, Dr. C. E. Beeby from New Zealand, once wrote:

“I have suggested areas of research that seem to me to be of special


importance. But not once have I asked a specific question to which I want an
answer… I know enough about research to be aware that the formulation of
the proper question may take as much skill and professional insight as the
finding of an answer to it, and it may be a skill in which the administrator
is not adept. So, the research workers must be involved in the asking of the
questions, and must be prepared, in turn, to play a necessary, but secondary,
part in devising the policies that may follow from the research, where
their expertise is limited”. (ACER Radford Memorial Lecture, Melbourne,
Australia 1987).


The examples of questions given earlier are general questions.


There is a lot of work involved in turning such general questions
into research questions. How this is done is dealt with in another
module in this series that is entitled: “Specification of Research
Aims”.

Before proceeding to a discussion of the sequential steps in the


research process, some examples are given of issues on which
research was undertaken by educational planners in the 1980s.

1. Indonesia national evaluation of grade nine


achievement levels
(Jiyono and Suryadi, 1992)
• A description of the achievement of the Grade nine students in
Mathematics, Science, Social Studies, Moral Education, Mother
Tongue, and English.
• The identification of the relative importance of the in- and
out-of-school factors associated with achievement in each
subject matter.
• A comparison of the achievement in 1981 with the
achievement of the students in the same grade in 1976 (Jiyono
and Suryadi, 1982).

2. Indonesia non formal learning behaviour


(Mappa, 1982)
• To compare the ‘booklet only’ learning groups with ‘booklet
plus radio’ learning groups in a distance education project on
the achievement of learners in: literacy, general knowledge,
and numeracy.
• To measure the attitudes of learners toward innovation, health
and nutrition, marriage, family planning and agriculture.


3. Thailand adult education project


(Thongchua et al, 1982)
• To measure the skills and knowledge gained by participants in
typing and sewing courses of different duration.
• To identify variables having an effect on the achievement of
participants at the end of the course.
• To investigate whether the graduates took up employment in
typing/sewing within six months of the end of the course.
• To assess how participants utilized the skills six months after
completing the course.

4. Thailand community secondary schools project


(Sawadisevee et al, 1982)
• To estimate the projected target of the number of teachers and
students involved in the Community Schools Project.
• To evaluate whether the vocational teaching and learning
program had responded to student needs and community
needs.
• To assess whether the schools had been able to provide
community development activities and services.

5. Malaysia remedial reading project


(Norisah bt Atan et al, 1982)
• To assess the impact of parent-teacher involvement on
students’ reading performance.
• To establish the reading objectives attained by the students.
• To describe the levels of parent participation in the reading
activities of their children.
• To describe the levels of teacher participation in the reading
activities of students.


6. Malaysia moral education project


(Asmah bt Mohd Taib et al, 1982)
• To compare attitudinal outcomes according to three methods
of moral education.
• To assess differential attitudinal outcomes of urban and rural
students.
• To measure to what extent teachers are able to apply the three
methods of moral education and to use the materials supplied
by the Ministry.

These are only a few selected examples of research conducted by


Ministries of Education in the 1980s. To take a final example from
the 1990s, a sixth grade survey in Zimbabwe had the following
aims.

7. Indicators of the quality of education: a summary of a


national study of primary schools in Zimbabwe
(Ross and Postlethwaite, 1992)
• What are the baseline data for the selected inputs to
Zimbabwe primary schools?
• What percentage of schools in Zimbabwe fall below the norms
for equipment and supplies?
• How equitably are these resource inputs distributed across
primary schools in Zimbabwe?
• What is the level of achievement in the schools, and to what
extent does achievement vary across the major administrative
regions of Zimbabwe?
• What are the linkages between selected inputs to Zimbabwe
schools and the learning outcomes of pupils? Which of these
inputs can be identified as the most likely to have a beneficial
impact on pupil achievement through the Ministry of Education’s
reallocation of, or increase in, input resources?


Indeed in the mid-1990s there were another seven southern African


countries that undertook sixth grade surveys with either identical or
very similar aims to the above.

Most studies were concerned with inputs to schools and the


relationship of inputs to achievement outcomes. In research studies
emanating from universities it often occurs that specific aspects
of education and different forms of outcomes are researched.
For example the self-esteem of students, the different kinds of
motivation of students, grade-repeating, and different modes of
teaching are favourite topics.


5 Sequential stages in the research process

General and specific research questions


The types of general research questions asked have been shown
above. In order for the research to proceed in a focused and
systematic manner, these questions must be refined to form more
specific research questions that indicate exactly which target
populations and which variables or factors should be included in
the research study.

Literature review
The review of literature aims to describe the ‘state of play’ in the
area selected for study. That is, it should describe the point reached
by the discipline of which the particular research study will form
a part. An effective literature review is not merely a summary
of research studies and their findings. Rather, it represents a
‘distillation’ of the essential issues and inter-relationships associated
with the knowledge, arguments, and themes that have been
explored in the area. Such literature reviews describe what has been
written about the area, how this material has been received by other
scholars, the major research findings across studies, and the major
debates in terms of substantive and methodological issues.

Research design
Given the specific research questions that have been posed, a
decision must be taken on whether to adopt an experimental design
for the study or a survey design. Further, if a survey design is to
be used, a decision must be taken on whether to use a longitudinal
design, in which data are collected on a sample at different points
of time, or a cross-sectional design, in which data are collected at a
single point of time.

Once the variables on which data are to be collected are known,


the next questions are: Which data collection ‘units’ are to be
employed? and Which techniques should be used to collect these
data? That is, should the units be students, the teachers, the school
principals, or the district education officers. And should data be
collected by using observations, interviews, or questionnaires?
Should data be collected from just a few hand-picked schools
(case study), or a probability sample of schools and students (thus
allowing inferences from the sample to the population), or a census
in which all schools are included? For a case study, the sample is
known as a ‘sample of convenience’ and only limited inferences can
be made from such a sample.

For research that aims to generalize its research findings, a more


systematic approach to sample selection is required. Detailed
information on the drawing of probability samples for both
experimental and survey design is given in another module in this
series entitled ‘Sample Design’.


Instrumentation
Occasionally, data that are required to undertake a research study
already exist in Ministry files, or in the data archives of research
studies already undertaken, but this is rarely the case. Where
data already exist, the analysis of them is known as “secondary
data analysis”. But, usually, primary data have to be collected.
From the specific research questions established in the first step
of a research study it is possible to determine the indicators and
variables required in the research, and also the general nature of
questionnaire and/or test items, etc. that are required to form these.
Decisions must then be taken on the medium by which data are
to be collected (questionnaires, tests, scales, observations, and/or
interviews).

Once these decisions have been taken, the instrument construction


can begin. This usually consists of the writing (or borrowing)
of test items, attitude items, and questionnaire items. The items
should be reviewed by experienced practitioners in order to ensure
that they are unambiguous, and that they will elicit the required
information. The broad issue of ‘Instrumentation’ (via both tests
and questionnaires) has been taken up in more detail in several
other modules in this series.


Figure 1   Stages in the research process

Stage 1  Research aims
   Identification of research issues in terms of general and specific
   research questions
      ↓
Stage 2  Literature review
   Search for, and review of, other previous studies that (a) identify
   controversies, debates, and knowledge gaps in the field; (b) elucidate
   theoretical foundations that need to be tested empirically; and/or
   (c) provide excellent models in terms of design, management, analysis,
   reporting, and policy impact
      ↓
Stage 3  Research design
   Development of overall research design, including specification of the
   information that is to be collected, from which individuals, under what
   research conditions
      ↓
Stage 4  Instrumentation
   Construction of operational definitions of key variables and selection/
   preparation of instruments (tests, questionnaires, observation schedules,
   etc.) to be employed in the measurement of these variables
      ↓
Stage 5  Pilot testing
   Pilot testing of instruments, data collection, and recording procedures
   and techniques. Use of results to revise instruments and to refine all
   data collection procedures
      ↓
Stage 6  Data collection
   Data collection and data preparation prior to main data analysis
      ↓
Stage 7  Data analysis
   Data summarization and tabulation
      ↓
Stage 8  Research report
   Writing of research report(s)


Pilot testing
At the pilot testing stage the instruments (tests, questionnaires,
observation schedules, etc.) are administered to a sample of the
kinds of individuals that will be required to respond in the final
data collection. For example, school principals and/or teachers and/
or students in a small number of schools in the target population.
If the target population has been specified as, for example, Grade
5 in primary school, knowledge should exist in the Ministry, or in
the inspectorate, about which schools are good, average, and poor
schools in terms of educational achievement levels or in the general
conditions of school buildings and facilities. A ‘judgement sample’
of five to eight schools can then be drawn in order to represent a
range of achievement levels and school conditions. It is in these
schools that the pilot testing should be undertaken.

The two main purposes of most pilot studies are:

a. To assess whether a questionnaire has been designed in


a manner that will elicit the required information from
the respondents. This process allows weaknesses in the
questionnaire to be detected so that they can be removed before
the final form is prepared.

Typical weaknesses that are found in questionnaires include:

• Ambiguities in the phrasing of questions.


• Excessive complexity in the language that has been used.
• Inappropriate response categories for some questions.
• Some questions are redundant.

b. To assess whether test items can be understood by the students,
whether the items are pitched at the appropriate level of complexity
(assessed by the ‘Difficulty Index’), whether they provide a stable
measure of student ability (assessed by the ‘Reliability Index’), and
whether they lead to the construction of total test scores that are
meaningful in terms of the student ability being examined (assessed by
the ‘Validity Index’).

Typical weaknesses that are found in tests include:

• Some items have either no correct answer or more than one


correct answer.
• Some distractors in multiple choice items are not functioning.
• Some items measure abilities different from the ability
measured by other items (assessed by the ‘Discrimination
Index’).
• Some items contain internal ‘tricks’ that result in high ability
students performing worse than low ability students.
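
To make the ‘Difficulty Index’ and ‘Discrimination Index’ mentioned above more concrete, the small Python sketch below shows one way they might be computed from pilot-test data. The scores, function names, and the simple upper/lower-group formula used for discrimination are illustrative assumptions only, not prescriptions from this module.

    # Illustrative item analysis for pilot-test data (invented scores).
    # Each row of `responses` holds one student's scored answers: 1 = correct, 0 = wrong.
    from statistics import mean

    def difficulty_index(responses, item):
        """Proportion of students answering the item correctly (higher = easier item)."""
        return mean(student[item] for student in responses)

    def discrimination_index(responses, item):
        """Difference in item success between the top and bottom thirds of students,
        ranked by total test score (a simple upper/lower-group index)."""
        ranked = sorted(responses, key=sum, reverse=True)
        third = max(1, len(ranked) // 3)
        top, bottom = ranked[:third], ranked[-third:]
        return mean(s[item] for s in top) - mean(s[item] for s in bottom)

    # Six pilot students, four items.
    responses = [
        [1, 1, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    for i in range(4):
        print(f"Item {i + 1}: difficulty = {difficulty_index(responses, i):.2f}, "
              f"discrimination = {discrimination_index(responses, i):.2f}")

An item showing a low or negative discrimination index in the pilot data would be a candidate for revision or removal before the final form of the test is prepared.
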
At the same time that the instruments are subjected to pilot testing,
it is desirable to assess the effectiveness of the data collection
procedures being used. These procedures include the steps to be
followed for ensuring that the correct number of instruments with
appropriate identification numbers on them for district, school, and
student arrive at the schools punctually. Furthermore, there are
procedures for selecting and then administering the questionnaires
to the school principal, the teachers (all selected teachers) and
the students (all students, one class of students, or a random
sub-sample of students within a selected school). These activities
address the following important questions: Are any problems
evident in the procedures? How can the procedures be improved?

The same can be said about the procedures for entering data,
cleaning data, and merging files. This work is usually undertaken
by the planning office data processing unit, but again the results
of the pilot testing experience can help to ‘de-bug’ the procedures.
Once the instruments and procedures have been finalized, the main
data collection can begin.


Data collection
When a probability sample of schools for the whole of the target
population under consideration has been selected, and the
instruments have been finalized, the next task is to arrange the
logistics of the data collection. If a survey is being undertaken in
a large country, this can require the mobilization of substantial
resources and many people.

The management of this research stage will depend on the existing


infrastructure for data collection. In many countries there are
regional planning officers and within each region there are district
education officers. These people are often used for collecting data.
However, the problem of transport can loom large, especially in
situations where there is a shortage of transportation and spare
parts, and where office vehicles are booked weeks in advance. In
such countries, two to three months of careful prior planning will
be required.

In countries where schools are inundated with requests for data


collection and where schools have the right to refuse to participate
in a data collection exercise, permission for the data collection must
be sought well in advance. In some cases, the sample design must
allow for replacement schools. This is a tricky matter as can be
seen in the module on ‘Sample Design’. The replacement of sample
schools is often incorrectly carried out, which can then cast doubt
on the results of the whole of the data collection.

In yet other countries, there is a practice of mailing the


questionnaires to the schools and hoping that they will be returned.
This often results in only a 40 to 60 percent response rate. This
is disastrous and should be avoided. All efforts must be made to
have the data collected by well-trained data collectors who visit the
schools.


It is also important to stress that it is the educational planners


who must control the selection of the students to be tested. From
previous experience, it can be shown that if the district education
officers are allowed to select the schools a bias will creep into the
results. If the school head selects classes within the school there will
be more bias, and if the teacher selects the students the bias will be
greatest.

The instructions for the completion of tests and questionnaires must


be clear. When the testing of students is involved, it is important
to have a special “Manual for Test Administration”. This manual
explains how to arrange the testing room, and provides standard
instructions that are given to the students about how to complete
the test, questionnaire, or attitude statements, when to start and
when to finish.

Finally, instructions must be clear on how the data collection


instruments are to be returned from the field. When data collection
instruments are returned to the National Planning Office, checks
must be undertaken to ensure that all instruments have the correct
identification numbers.

It is becoming popular in many countries to use optical scanning


systems. Many examination centres use these systems, and many
planning offices are beginning to use this mode of data collection.
However, success in this approach does require detailed prior
planning of the lay-out of pages, and that the appropriate type of
paper be used. In most data collections conducted by educational
planning offices, the data are entered onto computer diskettes
directly from a keyboard. In many cases, a standard data entry
program, such as a word-processing program or text-editing
program, can be used. However, more specialized data entry
programs are available which ensure more valid and verified forms
of data entry. These specialized computer programs can be adapted
to suit the needs of a variety of data collections. For example, for


open-ended questions, the expected minimum and maximum


values are entered and if a value outside of this range is entered,
the computer will give an error message. At this point the person
entering the data must check against the data collection instrument.
Usually he or she has made an error and can then correct it.

The net result of using such software is to optimize the accuracy of


the final data set. This prevents many problems from occurring at
the data analysis stage of a study.
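
As a simple illustration of the kind of range check such data entry software performs, the Python sketch below flags keyed values that fall outside expected minimum and maximum values. The variable names and ranges are invented for the example; they are not taken from any particular data entry package.

    # Illustrative range check during data entry (invented variables and ranges).
    VALID_RANGES = {
        "desks_per_classroom": (0, 60),
        "floor_space_m2": (10, 200),
        "enrolment": (1, 3000),
    }

    def check_record(record):
        """Return a list of out-of-range fields so the person entering the data
        can re-check each value against the original data collection instrument."""
        problems = []
        for field, value in record.items():
            low, high = VALID_RANGES[field]
            if not (low <= value <= high):
                problems.append(f"{field} = {value} is outside the expected range {low}-{high}")
        return problems

    record = {"desks_per_classroom": 75, "floor_space_m2": 48, "enrolment": 420}
    for problem in check_record(record):
        print("Check against the source instrument:", problem)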

After the data have been entered, cleaned, and merged – which
often requires the student, teacher, and school files for a particular
school to be joined into one record – the data analysis can begin.

Data analysis
If there are unequal probabilities of selection for members of the
sample, or if there is a small amount of (random) non-response,
then the calculation of sampling weights has to be undertaken.
For teacher and school data there are choices which can be made
about weighting. For example, if one is conducting a survey and
each student in the target population had the same probability of
entering the sample, then the school weights can either be designed
to reflect the probability of selecting a school, or school weights
can be made proportional to the weighted number of students in
the sample in the school. In this latter case, the result for a school
variable means the school value given is what the ‘average student’
experiences. This matter has been discussed in more detail in the
module on ‘Sample Design’.
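
The arithmetic behind such weights can be illustrated with a short sketch. Assume, purely for illustration, a two-stage design in which schools are selected with known probabilities and students are then sampled within each selected school; all the figures below are invented.

    # Illustrative sampling weights for a two-stage sample (invented figures).
    # Base weight = inverse of the overall selection probability; a simple
    # adjustment is then made for a small amount of (random) non-response.
    schools = [
        # (school id, prob. school selected, prob. student selected within school,
        #  students sampled, students who responded)
        ("S01", 0.10, 0.50, 20, 18),
        ("S02", 0.05, 0.25, 20, 20),
    ]

    for school_id, p_school, p_student, sampled, responded in schools:
        base_weight = 1.0 / (p_school * p_student)   # inverse selection probability
        nonresponse_adjustment = sampled / responded
        student_weight = base_weight * nonresponse_adjustment
        print(school_id, round(student_weight, 1))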


a. Descriptive
Typically, the first step in the data analyses is to produce descriptive
statistics separately for each variable. These statistics are often
called univariates. Some variables are continuous – for example
‘size of school’ which can run from, say, 50 to 2,000. In this case
the univariate statistics consist of a mean value for all schools,
the standard deviation of the values, and a frequency distribution
showing the number of schools of different sizes. Other variables
are proportions or percentages. Such a variable could be the
percentage of teachers with different types of teacher training.

These descriptive statistics describe the characteristics of the


students, teachers, and schools in the sample. If a good probability
sample has been drawn, then generalizations (within narrow limits)
can be made about the target population.

Often comparisons between the Ministry norms and sample


averages are made. For example, if the Ministry has stated that each
student should have 1.25 square meters of space in the classroom,
then this norm can be examined for each school by dividing the
total number of square meters of classroom floor space in the school
by the total enrolment of students in the school. This statistic may
then be used to give direct feedback to the educational planners in
charge of buildings about the extent to which their norms are being
met.
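
A minimal sketch of this kind of norm comparison, together with the univariate statistics described above, is shown below. The schools, enrolments, and floor areas are invented; only the 1.25 square metre norm is taken from the example in the text.

    # Illustrative comparison of schools against a floor-space norm, plus simple
    # univariate statistics for school size (all figures invented, except the
    # 1.25 m2 per student norm used as the example above).
    from statistics import mean, stdev

    NORM_M2_PER_STUDENT = 1.25

    schools = [
        {"name": "A", "enrolment": 320, "floor_space_m2": 450.0},
        {"name": "B", "enrolment": 610, "floor_space_m2": 700.0},
        {"name": "C", "enrolment": 150, "floor_space_m2": 160.0},
    ]

    for s in schools:
        space_per_student = s["floor_space_m2"] / s["enrolment"]
        print(f"School {s['name']}: {space_per_student:.2f} m2 per student, "
              f"meets norm: {space_per_student >= NORM_M2_PER_STUDENT}")

    sizes = [s["enrolment"] for s in schools]
    print(f"School size: mean = {mean(sizes):.0f}, s.d. = {stdev(sizes):.0f}")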

A further use of univariates is to examine the means or percentages


for particular groups in the sample. This may be urban vs. rural
schools, or the schools in different regions in the country, or for
boys’ schools vs. girls’ schools vs. co-educational schools. This
procedure is known as cross-tabulation or break-downs. In other
words, the data are cross-tabulated (or cross-classified) or broken
down into segments. A simple example is shown below.


Table X   Educational facilities provided for primary schools in Country X

                              All schools      Rural schools     Urban schools
Educational provision         Mean    S.D.     Mean    S.D.      Mean    S.D.
Desks per classroom            X       X        X       X         X       X
Chairs per classroom           X       X        X       X         X       X
Floor space per student        X       X        X       X         X       X
Pens/Pencils per student       X       X        X       X         X       X

The first pair of columns presents the mean value and standard
deviation for all schools in the sample for desks per classroom,
chairs per classroom, floor space per student, etc. However, the total
sample is ‘broken down’ in the second and third pairs of columns
into rural and urban schools.
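
The sketch below illustrates how such a break-down might be computed, using the layout of Table X as a guide. The schools and values are invented, and a real analysis would apply the sampling weights discussed earlier.

    # Illustrative break-down of one provision variable by school location
    # (invented data; a real analysis would apply sampling weights).
    from collections import defaultdict
    from statistics import mean, stdev

    schools = [
        {"location": "Rural", "desks_per_classroom": 18},
        {"location": "Rural", "desks_per_classroom": 22},
        {"location": "Urban", "desks_per_classroom": 35},
        {"location": "Urban", "desks_per_classroom": 31},
    ]

    groups = defaultdict(list)
    for s in schools:
        groups["All schools"].append(s["desks_per_classroom"])
        groups[s["location"] + " schools"].append(s["desks_per_classroom"])

    for group, values in groups.items():
        print(f"{group:15s} Mean = {mean(values):5.1f}   S.D. = {stdev(values):5.1f}")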

b. Correlational
In this case product moment correlations or cross tabulations can
be calculated. There are statistical tests which can be applied to
determine whether the association is more than would occur by
chance. When the association between two variables is examined,
this is known as ‘bivariate’ analysis.
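
As an illustration, the sketch below computes a product-moment (Pearson) correlation between two invented variables, using the scipy library, and applies a test of whether the observed association is larger than would be expected by chance. The variable names and values are assumptions made for the example.

    # Illustrative bivariate analysis: product-moment correlation with a
    # test of statistical significance (invented data).
    from scipy.stats import pearsonr

    classroom_equipment = [2, 4, 5, 7, 8, 10, 11, 13]     # e.g. an equipment index
    mean_reading_score  = [41, 44, 47, 52, 50, 58, 61, 63]

    r, p_value = pearsonr(classroom_equipment, mean_reading_score)
    print(f"r = {r:.2f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("The association is larger than would be expected by chance at the 5% level.")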

c. Causal
If the research design used is an experimental one, then tests can be
applied to see if the performance of the experimental group (that is,
the group subjected to the new treatment) is better than the control
group.


There are statistical techniques for determining this. However, the


use of this approach depends on the application of randomization
in order to ensure that the two groups are ‘equivalent’ in all other
respects.
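
One common statistical technique for such a comparison is the t-test for independent groups, sketched below with invented post-test scores. This is offered only as an illustration of the kind of test meant here; the module on ‘Research Design’ treats the analysis of experiments in more detail.

    # Illustrative comparison of an experimental group (new treatment) with a
    # control group on a post-test, assuming random assignment (invented scores).
    from scipy.stats import ttest_ind

    experimental = [62, 58, 71, 66, 64, 69, 73, 60]
    control      = [55, 59, 61, 52, 58, 63, 57, 54]

    t_statistic, p_value = ttest_ind(experimental, control, equal_var=False)
    print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("The difference between the groups is unlikely to be due to chance alone.")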

If the research design is based on a survey, then it is possible


to calculate the influence of one variable on another with other
variables being “held statistically constant”. Where calculations are
made of the relationships among more than two variables at the
same time, this is known as ‘multivariate analysis’. It is possible
to build causal models using the technique of path analysis. This
technique requires the development of a causal model which
describes not only the variables (or indicators) in the model, but also
the pattern of causation among them. Analyses can be conducted to
estimate the ‘fit’ of the data to the model. (See Figure 2)

An example of a path model could be:

Figure 2. Example of a path model

   Possessions in home       →  Attitudes to school
   Attitudes to school       →  Motivation
   Attitudes to school       →  Achievement
   Parent-child interaction  →  Motivation
   Parent-child interaction  →  Achievement
   Motivation                →  Achievement


In this example, it is posited, that the wealth of the home


(represented by “Possessions in the Home”) influences ‘Attitudes
to School’ which in turn influences both ‘Motivation’ and
‘Achievement’. ‘Parent-child Interaction’ influences ‘Motivation’
which also influences ‘Achievement’ but ‘Parent-child Interaction’
also has a direct effect on ‘Achievement’.
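
To give a flavour of how the paths in Figure 2 might be estimated, the sketch below computes standardized regression weights for each dependent variable in the model, using randomly generated stand-in data. A real analysis would use survey data and, usually, dedicated structural equation modelling software; everything in this sketch is an illustrative assumption.

    # Illustrative path analysis: path coefficients estimated as standardized
    # regression weights for each dependent variable in the Figure 2 model.
    # The data are randomly generated stand-ins, not real survey results.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    possessions  = rng.normal(size=n)
    parent_child = rng.normal(size=n)
    attitudes    = 0.5 * possessions + rng.normal(size=n)
    motivation   = 0.4 * attitudes + 0.3 * parent_child + rng.normal(size=n)
    achievement  = 0.3 * attitudes + 0.4 * motivation + 0.2 * parent_child + rng.normal(size=n)

    def standardize(x):
        return (x - x.mean()) / x.std()

    def path_coefficients(outcome, causes):
        """Standardized regression weights of the outcome on its assumed causes."""
        X = np.column_stack([standardize(c) for c in causes])
        beta, *_ = np.linalg.lstsq(X, standardize(outcome), rcond=None)
        return np.round(beta, 2)

    print("Attitudes   <- Possessions:", path_coefficients(attitudes, [possessions]))
    print("Motivation  <- Attitudes, Parent-child:",
          path_coefficients(motivation, [attitudes, parent_child]))
    print("Achievement <- Attitudes, Motivation, Parent-child:",
          path_coefficients(achievement, [attitudes, motivation, parent_child]))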

In survey type designs – even when causal models and path


diagrams are used – association does not prove causality. However,
if there is no association, then this casts doubt upon the original
assumption of causality. If there is a strong association, and if a
strong association is found repeatedly in several studies, then there
is reasonable ground for assuming causality.

Research report
There are three major types of research reports. The first is the
Technical Report written in great detail and showing all of the
research details. This is typically read by other researchers. It is
this report that provides evidence that the research was conducted
soundly. This is usually the report which is written first.

The second report is for the senior policy makers in the Ministry
of Education. It is in the form of an Executive Summary of about 5
or 6 pages. It reports the major findings succinctly and explains, in
simple terms, the implications of the findings for future action and/
or policy.

The third General Report is usually in the form of a 50 to 100


page booklet and is written for interested members of the public,
teachers, and university people. This report presents the results in
an easily understood and digestible form.

6 Conclusion
Each system of education has its political goals, its general
educational goals, and its specific educational objectives. For
example, some political goals stress equality of opportunity, others
stress quality of education, and many stress both.

In every system of education changes are made by educational


planners with the aim of improving the quality of education. These
changes can include a revised curriculum, new methods of teacher
training, increasing the amount of provisions to schools, changing
the structure from a selective to a comprehensive system, reducing
class size and many other changes. In some cases, innovations
need to be tried out to identify their likely shortcomings, effects,
and side effects before they are implemented. In other cases,
student achievement over time in one or more subjects needs to be
monitored, or where there are optional subjects the percentage of a
grade or age group selecting such subjects needs to be known. Or,
the attitudes and perceptions of students need to be assessed.

Educational planners are responsible for the planning of the various


component parts of a system of education. Decisions must be
taken on what to do to improve equality, quality, or both of each
component part. As much information as possible is required to
help decision-makers to operate successfully within the temporal,
financial and political constraints within which they work.

This module has provided an introductory overview of the meaning


of research, different types of research questions that are posed in
many countries, and the sequential stages in the research process.
This overview has also illustrated the main steps involved in this
process and has provided examples of the kind of work that is
required at each stage.


Appendix A

Terminology used in educational research
Educational research, like many other fields of research, has its own
jargon. This appendix attempts to present a brief explanation of
some of the terms commonly used in educational research studies
undertaken by educational planners. Some have already been
explained. The list provided below is not exhaustive; however, it
aims to help educational planners to begin to read the research
literature that impacts upon their work.

Operational research aims


Operational research aims are used to provide clear guidance
concerning the conduct of educational research studies. These aims
emerge from the general and specific research questions that are
formulated for the study. In particular, operational research aims
make it clear exactly what aspects of the educational environment
should be measured.

For example, we may have a general research question that asks:


“What are the linkages between inputs to schools and student
achievement outcomes?” This question can be refined to form
several specific research questions of which the following is one
illustration: “What are the effects of teacher housing, school
facilities, and classroom supplies on student achievement in
reading?”

The operational research aims emerging from these general and
specific research questions could include the following:

“To determine the likely effects of the following variables on


students’ reading achievement in Grade 6: ‘Teacher Housing’ (as
measured by the existence of water, existence of electricity, number
of persons per square meter, extent of repairs required, and distance
in travel time to school); ‘Quality of School Building’ (as measured
by the number of complete classrooms, number of incomplete
classrooms, number of open air classrooms, and amount of repair
required); and ‘Classroom Equipment’ (as measured by the number
of desks and chairs per student, existence of a blackboard, number of
textbooks per student, and number of exercise books per student)”.

Formative and summative evaluation


Formative evaluation is conducted during the development or
improvement of an educational program or product. It is usually
conducted, and used, by personnel who are part of the team working
on the program or product. However, it may sometimes be undertaken
by an internal or external evaluator, or a combination of both.

Summative evaluation is conducted after the program or


product has been completed. It is usually conducted by a researcher
who was not part of the team working on the program or product,
and is often undertaken for the benefit of an external audience or
decision-maker. It, too, may be undertaken by internal or external
evaluators or a combination of both.

Both types of evaluation require research rigor. An example can be


taken from the field of curriculum. If new textbooks or teaching/
learning units are to be developed, it is important that the units
are written according to certain agreed specifications, that teachers


are trained to teach the new units, and that the units are tried out
in a range of schools. Examples of the kinds of questions asked in
formative evaluation would be: Do the specific objectives which
have been developed cover the general objectives to be learned
which are in the curriculum? Can the teachers cope with the new
units? Are there any ‘gaps’ in the curriculum units which result in
a poor coverage of some of the specific objectives? Can the layout of
the curriculum units be changed so as to make the material more
interesting for students?

In the curriculum development area, it is often specified that each


objective should be mastered by 80 percent of students, and for
a whole curriculum unit 80 percent of students should master 80
percent of the objectives. These performance levels may be used as
benchmarks which can guide the identification of weaknesses in
new curriculum units. If weaknesses are apparent then action can
be taken to revise the procedures, the content, and/or the way in
which the content is presented.

Summative evaluation, in this example, could focus on two areas of


student performance. The first area would involve an evaluation of
the achievement of students with respect to the material contained
in the textbook as a whole or on all of the curriculum units together.
This would involve the testing of all students on either all objectives,
or on a selection of objectives from all of the curriculum units.
Again, criteria of acceptability must be set in order to judge whether
the totality of units are ‘working’. The second kind of summative
evaluation could be to compare the learning from ‘Textbook A’ with
that from ‘Textbook B’. In this case, it is assumed that the objectives
to be learned from both textbooks are the same.

In a situation where funds for evaluation are limited, it makes


more sense to place the emphasis on formative evaluation. If a
new curriculum is not developed by using systematic formative


evaluation then the application of a summative evaluation becomes


meaningless.

Assessment, evaluation, and research


Assessment usually refers to persons. It covers activities such as
grading (formal and informal), examining, certifying, and so on.
Most educational systems record student achievement in some way:
with a number, letter code, or comment such as ‘good’, ‘satisfactory’,
or ‘needs improvement’. One exception to this generalization is
the use of the word ‘assessment’ as in National Assessment of
Educational Progress (NAEP) in the United States. Such assessments
are based on the testing of probability samples within the nation. In
this case, the objective is not to know how any individual student is
achieving, but to discover more about the achievement of groups of
students in different regions of the country.

Evaluation involves the general weighing of the value or worth of


something in terms of the objective sought, or in comparison with
other programs, curricula, or organizational schemes.

Research is the orderly and systematic investigation of a


phenomenon for the purpose of adding to knowledge.

Measurement
Measurement is a process that assigns a numerical description to
some attribute of an object, person, or event. Just as rulers and
stopwatches can be used to measure, for example, height and speed, so
can other quantities of educational interest be measured indirectly
through the use of achievement tests, questionnaires and the like.


Surveys and experiments


Surveys involve the collection of information at one or several
points in time from scientifically designed probability samples of
students, teachers or schools. The information is usually collected
by means of questionnaires and tests, but also sometimes by means
of observation schedules and interviews. The information collected
is usually from a probability sample selected from a tightly defined
population, but can also be from a full count of schools in the form
of census data. The data are generally for descriptive purposes, but
may also be analyzed for relationships among variables. In some
cases, causal path models are developed and tested.

Although surveys can never prove causality, it is assumed that if


sufficient replications of a study or set of relationships are made, all
of which show a particular relationship which is generally deemed
to be causal, then it is reasonable to infer causality.

There are two main groups of surveys. The most common is a


“cross-sectional survey”. This involves the collection of data at
a specific point in time and is rather like taking a photograph
on one day. The data collected can involve what the situation is
now (for example, the number of books in the school library or
students’ achievement in one or more subject areas) or retrospective
information about what level of education a student’s parents had
or prospective information about what type of further education a
student wishes to pursue.

‘Longitudinal Surveys’ involve following a particular group of


students or schools over a period of time. Some longitudinal studies
follow persons from birth to death, others over a certain number of
years, others over the period of one school year and others over a
period of 3 or 6 months. If students leave school and are followed
up for a year or two into their employment, this kind of longitudinal
study is often known as a ‘Tracer Study’.


Longitudinal studies tend to cost more than cross-sectional surveys


and, although in theory they are superior to cross-sectional surveys
in terms of determining causal patterns, the costs and problems of
knowing the students’ names and addresses (in countries with strict
data protection laws) often debar researchers and planners from
undertaking this kind of work.

Experiments usually involve data collections where schools and


students have been randomly assigned to different experimental
treatments. Examples of treatments could be “Having a Classroom
Library vs. Not Having a Classroom Library” or “Textbook A vs.
Textbook B” or “Teaching Method X vs. Teaching Method Y”. In
the first case the experiment would consist of randomly assigning
students and teachers to classrooms with libraries and classrooms
without libraries. If the aim of the study is to see to what extent
classroom libraries cause ‘better’ reading comprehension, then
an appropriate test of reading comprehension is used and if, after
a period of time, those with libraries achieve more on a reading
comprehension test, then this is said to be due to the existence of
classroom libraries.

If there is no difference in reading comprehension achievement


between the two groups, then the existence of classroom libraries
(and how they were used) is said not to make a difference.

In conducting an experiment the researcher must be careful


about the factors that may limit the validity of the experiment. In
particular, the following four questions need to be posed: (a) Does
an empirical relationship between the operationalized variables
exist in the population (statistical conclusion validity)?, (b) If the
relationship exists, is it causal (internal validity)?, (c) Can the
relationship between the operationalized variables be generalized to
the treatment and outcome constructs of interest (construct validity,
and cause and effect)?, and (d) Can the sample relationship be
generalized to or across the populations of units, settings, and times
(external validity)?


Much has been written about experiments (see the module on ‘Research Design’) but, in practice, it is often difficult to conduct experiments in educational settings, because Ministries of Education and School Principals are loath to permit the researcher to use ‘randomization’ as fully as is required for a valid experiment.

Tests
A test is an instrument or procedure that presents a sequence of tasks to which a student is to respond. The results are then used to form measures of the student’s relative standing on the trait to which the test refers.

1. Test items
A test may be an achievement, intelligence, aptitude, or practical
test. A test consists of questions, known as items.

An item is divided into two parts: the stem and the answer. Stems
pose the question. For example, a stem could be:
• What is the sum of 40 and 8?
• or in Reading Comprehension it could be a reading passage
followed by specific questions.
The answer could be an ‘open-ended’, a ‘closed’, or a ‘fill-in’ answer.
For example, in the first stem given above an open or fill in answer
could require the student to write the answer in a box. Or, it could
be put into multiple choice format as follows:
• What is the sum of 40 and 8?
a. 84 b. 50 c. 48 d. 408
In this closed format the student is requested to tick the correct
answer.


2. Sub-scores/Domain scores
The score of a student on the whole test is known as the “total test
score”. A sub-score refers to the achievement of the students on a
sub-set of items in the overall test. Thus, for example, in a Science
test it may be considered desirable to classify the items into Biology
items, Chemistry items, and Physics items. Each of these constitutes
a domain of the test and the scores on each are known as ‘sub-
scores’ or ‘domain scores’.

Another approach to reclassifying items is into ‘information’ items, items measuring the ‘comprehension’ of a principle, and items where the ‘application’ of skills is required. These are the first three categories of Bloom’s Taxonomy of Educational Objectives (Cognitive Domain) and are widely used. Again, the scores on each of these sub-classifications are known as sub-scores.
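As a minimal sketch (with invented item labels and responses), total and domain scores might be computed as follows once each item has been marked 1 (correct) or 0 (incorrect):

    # Each item is classified into one domain; 1 = correct, 0 = incorrect.
    item_domains = {
        "q1": "Biology", "q2": "Biology",
        "q3": "Chemistry", "q4": "Chemistry",
        "q5": "Physics", "q6": "Physics",
    }

    student_responses = {"q1": 1, "q2": 0, "q3": 1, "q4": 1, "q5": 0, "q6": 1}

    total_score = sum(student_responses.values())

    domain_scores = {}
    for item, domain in item_domains.items():
        domain_scores[domain] = domain_scores.get(domain, 0) + student_responses[item]

    print("Total test score:", total_score)   # 4
    print("Domain scores:", domain_scores)    # {'Biology': 1, 'Chemistry': 2, 'Physics': 1}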

Variable
The term variable refers to a property whereby the members of a
group being studied differ from one another. Labels or numbers
may be used to describe the way in which one member of a group is
the same or different from another.

Some examples are:

• ‘Sex’ is a variable with two values: the individuals being measured may be either male or female. The values attached for computer processing purposes could be 1 or 2.

• ‘Occupational Category’ is a variable that may take a range of values depending on the occupational classification scheme that is being used.


With variables like ‘height’, ‘age’, and ‘intelligence’, the measurement is carried out by assigning descriptive numerical values. Thus an individual may be 1.5 meters tall, 50 years of age, and have an Intelligence Quotient of 105.

1. Types of variables
Variables may be classified according to the type of information
which different classifications or measurements provide. There are
four main types of variables: nominal, ordinal, interval, and ratio.

a. Nominal
This type of variable permits statements to be made only about
equality or difference. Therefore we may say that one individual is
the ‘same as’ or ‘different from’ another individual. For example,
colour of hair, religion, country of birth.

b. Ordinal
This type of variable permits statements about the rank ordering
of the members of a group. Therefore we may make statements
about some characteristics of an individual being ‘greater than’ or
‘less than’ other members of a group. For example, physical beauty,
agility, happiness.

c. Interval
This type of variable permits statements about the rank ordering
of individuals. It also permits statements to be made about the
‘size of the intervals’ along the scale which is used to measure the
individuals and to compare distances at points along the scale.
It is important to note that interval variables do not have true
zero points. The numbering of the years in describing dates is an
interval scale because the distance between points on the scale is
comparable at any point, but the choice of a zero point is a socio-
cultural decision.


d. Ratio
This type of variable permits all the statements which can be made
for the other three types of variables. In addition, a ratio variable
has an absolute zero point. This means that a value for this type of
variable may be spoken of as ‘double’ or ‘one third of’ another value.
For example, physical height or weight.

Note: It is not legitimate to apply simple arithmetic operations to nominal or ordinal variables. Addition and subtraction are possible with interval variables. Where ratio variables are used it is permissible to multiply and divide as well as to add and subtract.
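The sketch below illustrates this note with invented data: a mode is a meaningful summary for a nominal variable such as coded ‘Sex’, whereas means and ratios only make substantive sense for interval and ratio variables such as height.

    import statistics

    # Nominal: 'Sex' coded 1 = male, 2 = female for computer processing.
    sex_codes = [1, 2, 2, 1, 2]
    print(statistics.mode(sex_codes))   # 2 -> the most common category (meaningful)
    # statistics.mean(sex_codes) would run, but the result (1.6) has no substantive meaning.

    # Ratio: height in metres has a true zero, so means and ratios are legitimate.
    heights = [1.50, 1.62, 1.75, 1.80]
    print(statistics.mean(heights))     # 1.6675
    print(heights[3] / heights[0])      # 1.2 -> "20% taller" is a valid statement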

Validity and reliability


Whenever individuals are measured on variables in educational
research, there are two main characteristics of the measurement
which must be taken into consideration: Validity and Reliability.

1. Validity
Validity is the most important characteristic to consider when
constructing or selecting a test or measurement technique. A valid
test or measure is one which measures what it is intended to measure.
Validity must always be examined with respect to the use which
is to be made of the values obtained from the measurement
procedure. For example, the results from an arithmetic test may
have a high degree of validity for indicating skill in numerical
calculation, a low degree of validity for indicating general reasoning
ability, a moderate degree of validity for predicting success in future
mathematics courses, and no validity at all for predicting success in
art or music.


There are three important types of validity in educational research: content validity, criterion-related validity, and construct validity.

a. Content validity
This type of validity refers to the extent to which a test measures
a representative sample of subject-matter content and behavioural
content from the syllabus which is being measured. For example,
consider a test which has been designed to measure “Competence
in Using the English Language”. In order to examine the content
validity of the test one must initially examine the subject-matter
knowledge and the behavioural skills which were required to
complete the test, and then after this examination compare these
to the subject-matter knowledge and behavioural skills which
are agreed to comprise correct and effective use of the English
language. The test would have high content validity if there was a
close match between these two areas.

b. Criterion-related validity
This type of validity refers to the capacity of the test scores to
predict future performance or to estimate current performance
on some valued measure other than the test itself. For example,
‘Reading Readiness’ scores might be used to predict a student’s
future reading achievement, or a test of dictionary skills might be
used to estimate a student’s skill in the use of the dictionary (as
determined by observation).

In the first example, the interest is in prediction and thus in the relationship between the two measures over an extended period of time. In the second example the interest is in estimating present status and thus in the relationship between the two measures obtained concurrently.
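Criterion-related validity is usually summarized as the correlation between the test and the criterion measure. The sketch below computes a Pearson correlation for a few invented ‘Reading Readiness’ scores and later reading achievement scores; a real validation study would, of course, use a much larger sample.

    import math

    def pearson_r(x, y):
        """Pearson product-moment correlation between two equal-length lists."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    readiness = [12, 15, 9, 20, 17, 11]      # predictor: Reading Readiness scores
    achievement = [34, 41, 28, 55, 46, 33]   # criterion: later reading achievement

    print(f"Predictive validity coefficient: {pearson_r(readiness, achievement):.2f}")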


c. Construct validity
This type of validity is concerned with the extent to which test
performance can be interpreted in terms of certain psychological
constructs. A construct is a psychological quality which is assumed
to exist in order to explain some aspect of behaviour. For example,
“Reasoning Ability” is a construct. When test scores are interpreted
as measures of reasoning ability, the implication is that there is
a quality associated with individuals that can be properly called
reasoning ability and that it can account to some degree for
performance on the test.

2. Reliability
Reliability refers to the degree to which a measuring procedure
gives consistent results. That is, a reliable test is a test which would
provide a consistent set of scores for a group of individuals if it was
administered independently on several occasions.

Reliability is a necessary but not sufficient condition for validity. A test which provides totally inconsistent results cannot possibly provide accurate information about the behaviour being measured. Thus low reliability can be expected to restrict the degree of validity that is obtained, but high reliability provides no guarantee that a satisfactory degree of validity will be present.

Note that reliability refers to the nature of the test scores and not to
the test itself. Any particular test may have a number of different
reliabilities, depending on the group involved and the situation in
which it is used. Reliability is usually reported as a ‘Reliability Coefficient’ (for groups of individuals) or as a ‘Standard Error of Measurement’ (for individuals).
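As an illustration only (the text above does not prescribe a particular coefficient), the sketch below computes Cronbach’s alpha, one commonly used reliability coefficient for a multi-item test, together with a standard error of measurement derived from it. The item scores are invented.

    import math
    import statistics

    # Rows are students, columns are item scores (invented data).
    item_scores = [
        [1, 1, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]

    k = len(item_scores[0])                            # number of items
    totals = [sum(row) for row in item_scores]         # total score for each student
    item_variances = [statistics.variance(col) for col in zip(*item_scores)]

    # Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)
    alpha = (k / (k - 1)) * (1 - sum(item_variances) / statistics.variance(totals))

    # Standard error of measurement: SD of total scores * sqrt(1 - reliability)
    sem = statistics.stdev(totals) * math.sqrt(1 - alpha)

    print(f"Cronbach's alpha: {alpha:.2f}")              # 0.89 for these invented data
    print(f"Standard error of measurement: {sem:.2f}")   # about 0.61 score points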


Indicator
An indicator generally refers to one or more pieces of numerical
information related to an entity that one wishes to measure. In
some cases, it consists of information about only one variable
and this information may be gathered by only one question on a
questionnaire. For example, consider an indicator of classroom
library availability. In this case the indicator may be assessed by a
single variable (which has only two values) that is measured by one
question on a questionnaire:

Do you have a classroom library?


• Yes
• No

In some cases, an indicator may be made up of several variables. For example, an indicator of the ‘Material Possessions of the Home’ may need the addition of several variables. But, these variables may come from one question in a questionnaire:

Which of the following exist in your home?
(Check once each row)

Possession      No      Yes
Car
Refrigerator
T.V.
Video
etc.


In other cases, several variables coming from several questions will need to be summed. For example, in a study in Indonesia (Mappa, 1982) an indicator of the ‘Material Conditions of the Home’ was produced. Four variables were used: quality of floor, wall, roof, and lighting.

• Quality of floor was scored as: bamboo = 1, wood = 2, cement = 3, and tiled = 4;
• Quality of walls was scored as: woven palm leaves = 1, bamboo = 2, wood = 3, and cement = 4;
• Quality of roof was scored as: palm leaves = 1, tile = 2, zinc = 3, and shingle = 4;
• Quality of lighting was scored as: candle = 1, kerosine lamp = 2, hurricane lamp = 3, electricity = 4.

In this case there were four variables measured by four questions on a questionnaire. An indicator for each home was produced by adding the values for each of the four questions.
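A minimal sketch of how this summing could be done for one (hypothetical) home, using the scoring schemes listed above:

    # Scoring schemes for the four home-condition questions (from the text above).
    floor_scores = {"bamboo": 1, "wood": 2, "cement": 3, "tiled": 4}
    wall_scores = {"woven palm leaves": 1, "bamboo": 2, "wood": 3, "cement": 4}
    roof_scores = {"palm leaves": 1, "tile": 2, "zinc": 3, "shingle": 4}
    lighting_scores = {"candle": 1, "kerosine lamp": 2, "hurricane lamp": 3, "electricity": 4}

    # One home's (hypothetical) questionnaire responses.
    home = {"floor": "cement", "wall": "wood", "roof": "zinc", "lighting": "electricity"}

    indicator = (floor_scores[home["floor"]]
                 + wall_scores[home["wall"]]
                 + roof_scores[home["roof"]]
                 + lighting_scores[home["lighting"]])

    print("Material Conditions of the Home indicator:", indicator)   # 3 + 3 + 3 + 4 = 13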

Attitude scales
Probably the least debated definition of an attitude is: “a moderate
to intense emotion that prepares or predisposes an individual to
respond consistently in a favourable or unfavourable manner when
confronted with a particular object” (Anderson, 1985). In education,
attitudes which are often measured are ‘Like School’, ‘Interest
in Subject-matter’, and ‘Teacher Satisfaction with Classroom
Conditions’. Each of these titles implies a high to low measure.
Thus, ‘Like School’ implies a measure that ranks students from
those who love school to those who hate school.


Several ways of measuring and scaling attitudes have been devised. These are Thurstone Scales, Likert Scales, Guttman Scales, and the Semantic Differential. These have been covered in detail in another module entitled “Questionnaire and Attitude Scale Construction”.

Since the Likert Scale is the one most frequently used in educational research, a short explanation is given here. For example, consider the development of a ‘Like School’ scale to be used with 14-year-old students. The researcher must first of all listen carefully to how 14-year-old students describe their like or dislike of school. Both positive and negative statements are used, after editing, to form a set of statements about ‘Like School’. An example is given below:

                                                 Strongly                                   Strongly
                                                 disagree   Disagree   Uncertain   Agree   agree

1. School is not very enjoyable
2. I enjoy everything about school
3. I am bored most of the time at school
4. There are many subjects I don’t like
5. The most enjoyable part of my life is the time I spend at school
6. I generally dislike my schoolwork, etc.

The respondent is asked to check Strongly Disagree, Disagree, Uncertain, Agree, or Strongly Agree for each statement. Positive statements (like the second item) would be given the values 1 for Strongly Disagree to 5 for Strongly Agree. Negative statements (like the first item) are coded to have 5 for Strongly Disagree and 1 for Strongly Agree, in order to have all reactions to statements scored in the same direction. Only six statements have been given in the example above and hence the lowest score for a student would be 6 (6 x 1) and the highest would be 30 (6 x 5).
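A minimal sketch of this scoring rule, using the six statements above and one invented set of responses (positively worded statements keep their values, negatively worded statements are reverse-coded, and the item values are summed):

    RESPONSE_VALUES = {"Strongly disagree": 1, "Disagree": 2, "Uncertain": 3,
                       "Agree": 4, "Strongly agree": 5}

    # True = positively worded statement, False = negatively worded (reverse-coded).
    positively_worded = [False, True, False, False, True, False]   # statements 1-6 above

    # One student's (hypothetical) responses to the six statements.
    responses = ["Disagree", "Agree", "Uncertain", "Disagree", "Agree", "Strongly disagree"]

    score = 0
    for positive, response in zip(positively_worded, responses):
        value = RESPONSE_VALUES[response]
        score += value if positive else 6 - value   # reverse-code negative statements

    print("'Like School' score:", score)   # 24 for this student; possible range 6 to 30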


Appendix B

Further reading suggestions


Introductory texts
Borg, W.R. and Gall, M.D. (1989). Educational research: An introduction. New York: Longman.

Hopkins, C.D. and Antes, R.L. (1990). Classroom measurement and evaluation (3rd ed.). Itasca, Illinois: Peacock.

Oppenheim, A.N. (1992). Questionnaire design, interviewing and attitude measurement. London: Pinter.

Keeves, J.P. (Ed.). (1988). Educational research, methodology, and measurement: An international handbook. Oxford: Pergamon.

Thorndike, R.L. and Hagen, E. (1977). Measurement and evaluation in psychology and education (4th ed.). New York: John Wiley.

Wolf, R.M. (1991). Evaluation in education (4th ed.). New York: Praeger.

Examples of educational research studies
that aimed to have an impact on educational
planning
Asmah bt Mohd Taib, Ahmad, Siatan, Solehan bin Remot, & Nordin, Abu Bakar. (1982). Moral education in Malaysia. Evaluation in Education, 6 (1), 109-136.

Jiyono & Suryadi, Ace. (1982). The planning, sampling, and some preliminary results of the Indonesian Repeat 9th Grade survey. Evaluation in Education, 6 (1), 5-30.

Mappa, Syamsu. (1982). Radio broadcasting program for out of school education in South Sulawesi (Indonesia). Evaluation in Education, 6 (1), 31-51.

Murimba, S., et al. (Ed.). (1995). The analysis of educational research data for policy development: An example from Zimbabwe. International Journal of Educational Research, 23 (4).

Norisah bt Atan, H. Naimah bt Haji Abdullah, Nordin, Abu Bakar, and Solehan bin Remot. (1982). Remedial reading support program for children in Grade 2 in Malaysia. Evaluation in Education, 6 (1), 137-160.

Ross, K.N. and Postlethwaite, T.N. (1992). Indicators of the quality of education: A summary of a national study of primary schools in Zimbabwe. Paris: UNESCO (IIEP).

Sawadisevee, Amara, Padungrat, Jitsai, and Sukapirom, Rungruong. (1982). Community secondary schools project in Thailand. Evaluation in Education, 6 (1), 83-107.

Viboonlak Thongchua, Phaholvech, Nonglak, and Jiratatprasoot, Kanjani. (1982). Adult education project – Thailand. Evaluation in Education, 6 (1), 53-81.


Encyclopedias and handbooks


Alkin, M. (Ed.). (1992). American Educational Research Association
sixth encyclopedia of educational research (6th ed.). New York:
Macmillan.

Husen, T. and Postlethwaite, T.N. (1994). International encyclopedia of education: Research and studies (2nd ed.), Volumes 1-12. Oxford: Elsevier Science.

Keeves, J.P. (Ed.). (1988). Educational research, methodology and measurement: An international handbook. Oxford: Pergamon.

Walberg, H.J. and Haertel, G. (1990). International encyclopedia of educational evaluation. Oxford: Pergamon.

Journals
American Educational Research Journal
Applied Measurement in Education
Assessment in Education
Comparative Education Review
Educational Assessment
Educational Evaluation and Policy Analysis
International Journal of Educational Research
International Journal of Educational Development
International Review of Education
Journal of Education Policy
Research Papers in Education: Policy and Practice
Review of Educational Research
Studies in Educational Evaluation

Appendix C

Exercises

The following exercises are concerned with examining the general aims of an education system, establishing specific and operationalized aims, and then proposing research activities that will assess to what extent the education system is achieving its stated aims.

Five general aims are taken from a small publication, “Planning for Successful Schooling”, which was prepared by the Ministry of Education in the State of Victoria in Australia during 1990:

1. To expand educational opportunities for all students.

2. To encourage excellence in all areas of learning and to assist all students to develop their full potential.

3. To strengthen community participation in and satisfaction with the state school system.

4. To develop and improve the skills, potential and performance of school principals, teachers, and administrative and support staff.

5. To manage and control financial and physical resources in ways which maximize educational benefits for all students.


EXERCISE 1 (INDIVIDUAL WORK)

Select one of the five general aims above that you believe would
probably receive a high priority in your country. For that general
aim write five specific research questions. For each of the five
specific research questions, prepare several operationalized
research aims that focus on the performance of the education
system in meeting these aims. Then, write down a broad outline
of the sequence of activities that would need to be undertaken in
order to assess the system’s performance with respect to these
aims.

EXERCISE 2 (SMALL GROUP WORK)

Collate the work of the individual members separately for each group. Compare and refine (or change) the wording of the specific research questions and operationalized aims for each general aim.

When this has been completed, discuss and write down, in outline
form only, the sequence of activities to be undertaken in the
research study or studies.

EXERCISE 3 (PLENARY GROUP WORK)

Collate the work of the small groups. Again refine the wording. Then write down in detail the sequential activities to be undertaken in a research study for each general aim covered, in order to provide valid, reliable, and useful information that decision-makers can use to assess to what extent the general aims have been addressed.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana, Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

2
Module

Ian D. Livingstone

From educational policy issues


to specific research questions
and the basic elements
of research design

UNESCO International Institute for Educational Planning


Content

1. Introduction 1

2. The systematic identification of educational policy research issues 4
   Preparing the ground 4
   1. Policy makers 5
   2. Researchers, on the other hand 6
   Models of research utilization 9
   Defining the issues 13
   The content of policy research 15
   1. Learning goals and curriculum 16
   2. Assessment, guidance and selection 19
   3. Demography, enrolment and structure 21
   4. Finance and administration 23
   5. Selection and training of educators 25
   6. Monitoring and inspection 27
   Filling out the grid 31

3. Setting priorities for educational policy research issues 35

4. Clarifying priority educational policy research issues and developing specific research questions and research hypotheses 40
   The research question 41
   The research hypothesis 50
   1. The hypothesis should be relational 50
   2. The hypothesis should be non-trivial 51
   3. The hypothesis should be testable 51
   4. The hypothesis should be clear and concise 52
   Generating research hypotheses 57
   A note on qualitative approaches 61

5. Moving from specific research questions and research hypotheses to the basic elements of research design 64
   Putting it into operation 64

6. Summary 76

7. Annotated bibliography 77

Introduction 1

In an ideal world, educational research has a vital role to play in the improvement of education, whether this be in the development of theory to better explain why things occur the way they do in particular learning situations, or in stimulating ideas for innovative practices, or in developing new procedures and materials to enhance the efficiency and effectiveness of instruction. Educational research also has the role of providing attested information to improve the quality of decision-making for educational policy. It is this last role which forms the focus of this module.

At the outset it should be admitted that this happens all too rarely, for reasons which will be explained in the next section. Policy decisions are often taken in the absence of good research, and sometimes in spite of the findings of available research. Furthermore, creating a well-researched policy does not mean that any action will be taken on that policy! But at least it is a beginning. It is the objective of this module to assist researchers to interact with policy makers in fruitful ways, so that gaps are bridged and research results are made available in forms which are helpful to all.

Numerous definitions of policy research exist. A simple and useful one is ‘research undertaken by qualified researchers in order to produce evidence on issues of importance to policy-makers’. The hope is that such evidence can then be used to help in formulation or revision of laws or educational policy guidelines. The intention always is that decision-oriented research should provide results which are useful for resolving current problems in education.


The distinction is sometimes made between ‘basic’ research and ‘applied’ (or mission-oriented or decision-oriented) research.

The following definitions may be helpful here (Wallen, 1974):

“The findings of a basic research study:

• should apply to a great many people and/or situations;

• should be related to many other studies and/or theories;

• need not have obvious or immediate implications for practice.

In contrast, the findings of an applied research study:

• are applicable to a specific situation (they may or may not apply elsewhere);

• need not relate to a broader field of knowledge or other research;

• have immediate and obvious implications for practice”.

The distinction is not a hard-and-fast one, of course, and both types of research have an important place. Both have their own standards of rigour and validity. But it is likely that research carried out to inform policy makers will lie towards the applied, decision-oriented or mission-oriented end of the research spectrum.

This module concentrates on such decision-oriented research, and seeks to help researchers identify important issues needing attention, through a systematic ‘mapping’ of the educational territory. It then proceeds to find ways to establish priorities, using a consensus-building approach to select projects from the infinite number of problems which exist ‘out there’. Finally it comes down to specifics, with a discussion of ways to develop specific aims from general aims, and operationalize these through the use of research questions and hypotheses. The last section gives some illustrations of exactly how this can be carried out in a systematic way.


2. The systematic identification of educational policy research issues

Preparing the ground


Since most policy research budgets are limited, it is desirable to
have a sound procedure for identifying general issues needing
research, to ensure that all important problem areas are considered
and to allow for the setting of research priorities. For this to occur, it
is helpful to know right at the beginning exactly who are the parties
involved, and what their various expectations of the research may
be.

Who are the policy makers?


They are likely to be:

• politicians, senior government officials, or Members of Parliament;

• chief executives and senior administrators in ministries and government agencies;

• influential people in national associations representing various interest groups, e.g., Employers Associations, Chambers of Commerce, Trade Unions.

Who are the researchers?
They could be:

• staff of research divisions which are part of government departments, ministries or agencies;

• staff of universities or other tertiary institutions who carry out research, either as part of their own work or on contract;

• consultants employed by international agencies;

• staff of private research institutions or R&D establishments;

• individual, independent, self-employed research workers.

Bringing people together


It is very necessary to bring the people involved in the research
together at an early stage, and create a climate of dialogue. The
different conditions under which policy makers and researchers
work can often lead to tension between the two groups. This must
not be allowed to occur, if the outcomes of the research are to be
fruitful.

1. Policy makers
• want research to deal with their own particular problems, and
may not necessarily be interested in the relationship of these
issues to the broader socio-political context, the ‘fabric’ of
society.

• are not usually trained as social science researchers, and are likely to be unfamiliar with the content, methods or jargon of educational research. This is one important reason for early consultation.


• characteristically want results immediately! They tend to work on a different time scale, and are impatient of the slowness of the educational research process, when urgent decisions are needed.

• need to realise that research cannot answer value questions. Some examples of such value questions which are politically determined would be: how to design the ‘best possible’ entry examination to junior high school; whether comprehensive secondary schools are ‘better’ than, say, separate academic and vocational schools; and whether the ‘mainstreaming’ of handicapped children into regular classes should be introduced.

2. Researchers, on the other hand


• are specialists, skilled in a relatively narrow range of
paradigms, or ways of approaching research problems.
Some prefer intensive qualitative methods, others are more
comfortable with quantitative research techniques. It is
advisable not to get involved in intense debates over such
matters. Most problems can be answered best by a blend of a
variety of approaches, making full use of the expertise, and
resources of both time and money available.

• may be somewhat remote from the ‘real’ world of social conflict, political pressure and financial constraint, and come up with recommendations offering solutions which are not feasible in practical terms. Or they may offer short-term solutions, addressing immediate problems, where alternate recommendations may be more appropriate to solve likely future problems.

• often write in a particular research jargon, which means that they find difficulty in communicating to the policy makers who are interested in their findings. Training is needed in writing simply and clearly, and in presenting findings in such a way that they are not misinterpreted and distorted, and yet at the same time do bring out what the outcomes of the research are, however cautiously.

• may not be used to tight deadlines. They may come from an academic environment of scholarly research, and find they under-estimate the time needed to do a thorough job, and so require time extensions.

• find their role unclear. For example: Should they present their findings as objectively as possible, and leave others to interpret them, and do the necessary lobbying for any changes recommended? Or should they accept that true impartiality is impossible? Even in the choice of problem, selection of measures, methods of data analysis, and interpretation of findings, value judgements are being made. According to this view, researchers should constantly be concerned about the effect of their work on society, and set out to have an impact. In any event, what is certain is that for true research objectivity, the investigator must approach problems with no preconception of what they want to find (as distinct from what they expect to find). This is a subtle but extremely important distinction.

Anticipating difficulties
If the ground is not properly prepared by a good dialogue between
the policy maker and the researcher, major difficulties can arise.

• Suppression of research results can occur. Nothing will sour the relationship between a policy maker and a researcher more than the suppression of the findings of the research, because they did not demonstrate what the policy maker wanted to find, or proved an embarrassment to them and their cherished policies. The independence of the researcher needs to be made very clear at the beginning, and accepted by both parties.


• Delaying tactics are another way of avoiding potential embarrassment, if those in authority feel that there may be even a hint of criticism in the research report and recommendations. They deliberately ‘drag their feet’, and there are substantial delays in the final editing, publication and acceptance of the report. It appears to have disappeared down a ‘black hole’, never to see the light of day.

• Attempted interference with the research process can also occur when it is suggested that certain recommendations should not be made. It is always good practice for the researcher to show an early draft of the recommendations to the policy makers commissioning the research. This avoids the suggestion of possibly naive and unworkable solutions. But a wholesale modification of a series of recommendations, under pressure from the commissioner of the research, is thoroughly undesirable.

The above discussion indicates something of the different worlds from which the policy maker and researcher come, and highlights the question of ‘ownership’. Who owns the data, the research, the outcomes, the final report? The policy maker doing the commissioning, the institution paying for it (these may not be the same), or the researcher doing the work? Who is allowed to publish and disseminate the results? In what form may the results be released? When can this occur?

It is crucial to determine all these things at the outset, in a spirit of co-operation, so that mutual tension and distrust are avoided.


Models of research utilization


It is common to criticise policy makers for failing to take research
findings sufficiently into account when formulating policy. Such
criticisms often fail to acknowledge the complexity of the whole
process. Policy makers are required to take into account social,
political, economic, and educational realities, as well as the values
and attitudes of interest groups and the manifestos of political
parties. Research-based information is only one of the inputs into
the policy-making process. It would be simplistic to believe that it
was the major one. It may simply be one means of contributing to
a general discourse on the nature of society, and its current and
potential problems.

To avoid the frustrations and weaknesses noted earlier, and to promote an effective contribution from policy-oriented research, it is desirable that:

• the subject of the research be relevant to the concerns of the policy makers. Dialogue between researchers and policy makers is a necessary, but not a sufficient, condition for effective policy-oriented research.

• policy makers should be prepared and able to identify the issues which they wish the researchers to address, and the types of information they seek; this will often relate to variables which can be manipulated by policy.

• information from the research should be provided in time, that is, before the policy decisions have to be made, in a form readily understandable by those who have to make the decisions, preferably in non-technical language and in summary form.

• multi-disciplinary approaches should be used, where appropriate, since social, political or economic considerations are likely to be important.


• policy researchers should be alert to the fact that their research may alter the balance in the ‘power structure’ between interested parties, and change their levels of influence. An awareness of political and administrative realities is therefore critically important.

In order for research to be utilised, it is therefore necessary to understand exactly how findings are presented and disseminated. This can conveniently be described in terms of ‘models’, idealised ways in which a complex process can be defined. There is a very large literature on the ways in which research knowledge is disseminated and used (Husen and Kogan, 1984).

Two very simple such models are shown in Figures 1 and 2. The first, the
linear model, may be useful in the physical sciences, but has not
generally been found to be appropriate in the social sciences, and
education in particular. One does not begin with a problem, devise
a research strategy to solve it, obtain a solution, and promulgate the
results as Figure 1 would indicate.

Figure 1. Linear model of research utilization


In the social sciences, the diffusion model is more appropriate,
in which research knowledge is disseminated over a period of
time, and gradually seeps into the consciousness of all the parties
affected, perhaps in ways not fully appreciated or recognised.
Researchers build on previous research, and action eventually
follows a growing perception on the part of policy makers, driven
by public pressure, that something should be done. This is often
a matter of political expediency, and often in the face of economic
constraint. It is altogether a much more fuzzy process, as shown in
Figure 2.


Figure 2. Diffusion model of research utilization


A much more complex series of seven models is given by Carol
Weiss (1979). They can be summed up as follows:

1. The first model is the linear one referred to above, in which basic
research will lead to applied research, which in turn will lead to
development, and then application.

2. The second model is the problem solving one, in which missing knowledge is identified, and then social science research findings are gathered, either from existing knowledge or from specially commissioned research. The research findings are then interpreted in the context of various decision options which are possible, and the best policy chosen. Typically, this model leads to rather optimistic expectations about what research can actually do in solving real life problems.

3. The third model is the interactive one, which assumes that some
sort of back-and-forth dialogue will take place between policy
makers and researchers (often through intermediaries), and
that this will result in a compromise acceptable to all parties,
and allow sound policy directions to be determined. This model
too tends to err on the optimistic side.

4. The fourth model is the political one, in which researchers produce certain findings which are then used as political ammunition both by the ruling party in power and by the opposition (where such a political system exists). A less desirable variant of this model occurs when the politicians make up their minds about what policy they want, and then commission research to ‘justify’ their conclusions!

5. The fifth is the tactical one. Here the policy makers delay making a decision on a matter about which they are uncomfortable by commissioning a long research study, or maybe several research studies, on the issue. They thus ‘bury the problem’ under the guise of doing research, and say that they cannot act until research results are forthcoming!

6. The sixth is the enlightenment model, in which research findings slowly filter through to the public, and gradually shape the way people think about particular issues or problems. In many societies, particularly open, democratic ones in which the government is prepared to release research findings, even unpalatable ones, an informed public is a very powerful lobby group, and can influence policy decisions gradually over a period of time. The existence of scholarly journals and informed discussion of policy issues through the mass media are characteristic of the enlightenment model.

7. The seventh and last is the embedded model. Research is part of the whole intellectual enterprise of the society, embedded in its ways of thinking and behaving. It is only one of the many influences in policy development, and must take its due place alongside many other considerations, political, social and economic.

EXERCISE 1

Go back over the seven models of research utilization above, and find examples of as many as you can from your own country, as they relate to specific areas of policy making in education.

Which model (or models) do you think would be most likely to apply in your own country now? Why?


Defining the issues


Before any issues become the object of research, or any particular
project is decided upon, there are some important matters to be
taken into account.

• The issue needs to be an important one, and not something trivial. Students seeking a post-graduate qualification at a university may have more scope here to choose something of particular interest to them (or perhaps to the professor supervising their research dissertations). The topic they choose may push back the frontiers of knowledge a little, but not necessarily be a matter of national concern. But research designed for the consumption of policy makers commonly draws on public funds, or funds from international agencies, and so needs to be seen to be of significance for the education system as a whole. The following questions need to be asked: Does the research need to be done so that the system can be improved, either quantitatively with an increased throughput of educated students, or qualitatively by providing for better teaching? Will it result in better classroom practice? Will it generate higher levels of achievement, greater equality of opportunity, or increased equity of outcomes? Will it provide a better-equipped work-force, or more socially aware and responsive citizens in the future? Will it highlight ways to bring about increased efficiency?

• The issue needs to be researchable. Many problems which are not researchable exist in education systems. Some of these, such as whether or not a comprehensive secondary schooling system is ‘best’, have been alluded to above. Others in this category would be questions of whether moral or religious instruction should be given in school, or what values should be incorporated in the curriculum, or whether corporal punishment should be allowed, or whether class grouping or ‘streaming’ is desirable for instructional purposes. Issues such as these lie outside the province of empirical, ‘testable’, educational research. While it is possible to gather information on them, to trace the history of past practices, to survey current public opinion, and make some assessment of what future moves might be acceptable, most of the necessary decisions are essentially philosophical or ‘political’ ones. Such issues are not easily researchable in the strict, empirical sense.

• The project needs to be manageable and workable. The financial and human resources need to be available, so that an outcome is possible within a reasonable time frame. Staff with an interest in the area, and possessing the appropriate methodologies, or consultants with the necessary expertise and sensitivity, need to be available. Other questions which need to be answered are: Will it be possible to obtain access to a large enough sample to allow reliable and valid results? Are appropriate methodologies known and data analysis facilities available? Is there reason to believe that something useful will emerge from the research, an ‘answer’ to the problem?

• Timeliness is another important criterion which is easily overlooked. Will results be obtained within a suitable time scale so that they will be of some practical value to the policy maker? Educational researchers characteristically want to carry out a large, long and thorough study, taking as many variables as possible into account, so as to ‘milk’ the research project of as much information as possible. Policy makers, on the other hand, are usually under political pressures, and want the results immediately, if not sooner! The tough question, then, is: Will there be a payoff from this piece of research, and if so what will it be, and when will it occur?

• Other considerations which are probably more relevant in a university setting, and less important for research directed at policy makers, are theoretical value, critical mass, and personal interest.


• Firstly, does the problem fill a gap in the literature, and contribute to the underlying theory in a particular area of education? Will it contribute to the advancement of knowledge in the field, and will others recognise its significance? Does it improve on the ‘state-of-the-art’?
• Secondly, does it have a critical mass? In other words, is its size and scope sufficiently large to allow something really important to be said? Or is it rather insignificant, with only small sample sizes, few variables, and a lack of potential results?
• Thirdly, is it the sort of project that will generate enthusiasm on the part of the researchers, so that they will be committed to it, and willing to work long hours on it? Will it excite their imagination and ‘turn them on’? Will it provide them the opportunity to learn further useful skills, and extend their research competence?

The content of policy research


With these comments as a preliminary, it is useful to examine the
types of research which policy makers are likely to be interested in.
Out of the infinite number of possible research topics, it is necessary
to decide which ones are most important, and should be pursued.
This means the range of possible projects must be narrowed, and
for this to be done systematically so that no possible area of concern
is omitted, a classification system is desirable. Numerous such
classification systems are possible, and that which follows is only
one of many. It appears to be reasonably comprehensive, and allows
projects to be classified in a logical way. The procedure described
is a modification of one that has been tried out in practice in
Indonesia, and there is evidence that it seems to work (Postlethwaite
and Ross, 1986).


It begins by establishing six broad categories:

1. Learning goals and curriculum.
2. Assessment, guidance and selection.
3. Demography, enrolment and structure.
4. Finance and administration.
5. Selection and training of educators.
6. Monitoring the education system.

1. Learning goals and curriculum


The goals of education can be considered at three main levels. At
the first level, very general statements can be made, such as ‘to
produce democratic citizens’, ‘to produce literate and numerate
workers’, or the five principles of the Indonesian state philosophy
Panca Sila (Belief in One God, Humanity, Unity of Indonesia,
Democracy and Social Justice). Such broad general goals are
often seen in five-year plans. However, these goals need to be
transformed into more detailed curriculum goals, for each year
of schooling. These curricular goals are characteristically seen in
subject syllabuses. This is the second level. Finally, at the third level
we have detailed and specific goals, which might be related to what
was intended should be covered in a unit or module of work in a
particular subject.

The first and second level goals are the ones of interest to policy
makers and planners, and are the ones with which researchers in
education ministries are most likely to be concerned.

This area relating to the teaching/learning process is clearly a crucial one, and may consume a large proportion of the research budget. Recent studies in mathematics conducted under the auspices of the International Association for the Evaluation of Educational Achievement (IEA) (Travers and Westbury, 1989) have highlighted the complex nature of curriculum. There are at least three aspects to be considered. First, there is the intended curriculum, at the system level – the general goals and intentions of the authorities who prescribe what should be taught, and determine the course outlines, syllabuses, textbooks and so on. Then, at the level of the institution, there is the implemented curriculum – the material that is actually taught in the classroom by the teacher or lecturer. This may differ from the intended curriculum, depending on the degree of control exercised by the central authorities over teachers in what is taught. Then, at the student level, there is the attained curriculum – the body of knowledge, skills and attitudes which a student has actually acquired from the educative process.

Some examples of research topics which would fall into this area of
learning goals and curriculum are:

a. Needs assessment surveys


If an education ministry is not alert to the needs of its society, and
responsive to those needs, the school system may be producing
learning which is irrelevant. Students may drop out of school early,
and be unable to find suitable work; others who continue on to the
upper classes in secondary school may be bored and disaffected,
and become a disruptive influence. At the tertiary level, such
surveys are vital in a rapidly changing world if industry is to obtain
the skilled and up-to-date work force which is necessary for high
national productivity.

b. Curriculum development
This follows on naturally from needs assessment surveys. Once
the needs of society are known, including both the needs of
citizens and the needs of employers, it is necessary to translate
these requirements into actual curriculum statements of what
shall be taught. A good feedback mechanism between employers and the education system can help to ensure that what is taught
is appropriate, that the curriculum is up-to-date and relevant
and not imported from some other, quite different society, or ten
years behind the times! There is much research to be done here,
and many different curriculum development models exist to
guide the researcher. Furthermore, if a policy is made to bring in
a curriculum innovation without careful trialing including sound
on-going formative evaluation, the new curriculum may be poorly
implemented and eventually unsuccessful in bringing about desired
results.

c. Provision of resources
Again, following on naturally from curriculum development is
the provision of appropriate curriculum resources to allow the
curriculum to be implemented as intended. In many developing
countries, a very large amount of time and energy has been spent
on the production of textbooks, learning packages, and other
curriculum materials, and on the setting up of libraries and resource
centres. More recently, some countries have set up educational radio
and/or TV networks, and have introduced computers to schools,
with all the necessary hardware and software. Apart from these,
there are the obvious needs for school buildings - classrooms,
laboratories, with all their necessary science equipment, and sports
facilities and equipment for cultural activities.

d. Special needs students


Another possible topic for research which could be classified
under this heading is that of provisions for special needs students;
children whose first language is not that of the country in which
they are living, disadvantaged ethnic minorities, displaced persons,
the physically, intellectually and emotionally handicapped, and the
highly gifted. For all of these types of students advanced education
systems will attempt to make special provision, on the grounds
of catering for the needs of the individual. The policy makers entrusted with the task will need research to guide them in the
sorts of provisions they might make, and the likely costs of those
provisions.

2. Assessment, guidance and selection


An integral part of curriculum development is assessment, because
it is not possible to know whether the curriculum is appropriate
without some form of feedback from the students and teachers.
Assessment can be formative, occurring at intervals during the
learning process and designed to assist and guide the learner,
perhaps with some diagnostic elements. Or it may be summative,
occurring at the end of a learning experience, and designed to
provide feedback to the individual, educational institution or the
community about what learning has been achieved (Livingstone,
1990).

a. Examinations
Virtually every country has written examinations in one form or
another. Some have national examinations at several points in
the education system, which determine the rate of promotion of
students. Results on such examinations provide an indication of the
level of education received by the student, and also an indication
of attainment relative to other students at this level. But they also
act as a filter, a form of selection, a mechanism for rationing of
scarce resources, to control the entry of students to higher levels of
education, and eventually their career paths into the occupational
hierarchy in the world of work. Although examinations may take
various forms (a single national examination, a number of regional
examinations, teacher-based assessments, or a combination of
these), virtually every country has them, at a higher or lower level.
In many developing countries, the first such examination is that
for selection for entry to secondary school. Whatever the form
of the examinations may be, their validity (particularly content and predictive validity, and freedom from bias), their reliability (precision, replicability), and their comparability (use of moderation,
etc.) are all topics for serious and ongoing research. So many
important promotion and selection decisions hang on their results,
that they must be seen to be entirely fair by all involved: students,
teachers, parents, employers and community. A related topic for
consideration is the value of positive discrimination, and ‘targeted’
provisions for disadvantaged communities, e.g., remote rural areas.

b. Other forms of assessment


There are many other types of assessment which are worthy of
research. For example, the development of forms of diagnostic
assessment, perhaps using computer technology, the preparation
of item banks, the development and use of standardised tests of
achievement for guidance and placement in particular courses,
mastery and competency testing for specialist vocational skills,
objectives-referenced or criterion-referenced approaches to
assessment and their appropriate uses, and different ways of
reporting and certification of levels of achievement. This is a vast
field for research.

c. Guidance and selection


Guidance services for students are an important part of a well-
developed education system, aiming to cater for the needs of the
individual. Some of this will occur informally within an institution,
without the need for specialist guidance personnel, simply on the
basis of personal acquaintance, and information from tests, and
other forms of assessment. For children with special needs, either
those who are handicapped or highly gifted, special provisions
will be needed, and specialist guidance will be required. The
recent trend in some advanced systems to ‘mainstream’ children
with special needs, removing handicapped children from ‘special
schools’ back into regular classrooms, will doubtless lead to


evaluation studies. Vocational or career guidance, the development


of suitable instruments and methodologies, and methods of
selection for higher education, are all likely to be areas for research.

3. Demography, enrolment and structure


Education systems are located in particular demographic settings,
in countries with particular population distributions and patterns
of enrolment. More especially in countries where universal primary
education has yet to be attained, basic statistical information is
necessary, and will form a very necessary part of the research
information base for a ministry of education establishing primary
schools. It is also vital information for those planning secondary
and higher level educational institutions.

Topics for investigation should include:

a. Basic demographic statistics


These should be prepared for the whole country, for each age
group from birth onwards, with particular attention being given to
accuracy, comprehensive coverage, geographic breakdown (rural/
urban, by province), birth-rate and migration trends.

b. Educational enrolments
Once the population base is determined, political decisions are
likely to determine the extent of provision for education. But
politicians need guidance on what is possible (e.g., in providing a
pre-school service, or expanding a system of secondary schools),
and basic statistical research can provide that guidance. Information
needs to be available (in relation to both the statutory school
beginning and leaving age) on such matters as: the percentage
of children who are not in school, or some other educational
institution, at every level, the rural/urban and male/female balance,


participation of ethnic minorities, special community requirements


(e.g., for girls to care for children in the home, and for boys to work
on the farm), and their impact on present and future enrolments.

c. Educational structures
Once the characteristics of the relevant population base have
been ascertained, it is then possible to proceed to consider the
educational structures necessary to cater for those who wish (and
are able) to take advantage of them. This is likely to involve studies
of the location of schools and other educational institutions (school
‘mapping’), the provision of various alternative types of secondary
and tertiary education (comprehensive secondary schools,
vocational training institutions, teachers colleges, universities).
It will be necessary to advise on the likely effects of automatic
promotion or grade repetition policies, which are in turn linked to
examination pass-rates. Investigations are needed on the prevalence
of school truancy, ‘stop-out’ and ‘drop-out’ and the reasons for
them. Studies should be undertaken on the retention rates of
various institutions (including tertiary institutions such as teachers
colleges and universities). All of these will have a major bearing on
the quantity of education which must be provided.

The structure of a school system is always a matter of debate. There


are many matters here which are subject to value judgements, of
course. The value of intermediate schools, whether selective or
comprehensive secondary schools are better for students, whether
or not single sex schools or private schools should be encouraged,
the effects on learning and student attitudes of large schools, or
large class sizes, or ‘streaming’ into more homogeneous teaching
groups, and so on. Some of the findings to date in these areas of
research are equivocal, e.g., the class size issue (Glass, 1985). And
yet even here, it is possible for specific, in-country research to be
carried out, and for its results to provide an input into the decision-
making process, over a period of time.


4. Finance and administration


In most countries, educational costs are growing rapidly. At the
same time, there is a concern over educational standards, and a
desire to increase levels of performance. This tension has given
rise to a particular concern for both effectiveness and efficiency,
and a desire to see ‘lean’ administrative structures in place which
will contribute to these two very desirable goals in any education
system. When curriculum innovations are being made, too, there
are often implied, hidden costs, and trade-offs need to be made
when budgets are limited.

Research can therefore address issues such as the following:

a. Unit costs
Policy makers need to know the cost of particular forms and levels
of education, and their various economic rates of return, both
private and social. This is desirable if the demand is to be estimated
accurately. At the same time, it should be appreciated that human
beings do not always behave rationally, and traditional rate of
return analysis makes strong assumptions. Its results should always
be placed alongside other information which takes political and
social realities into account before major financial decisions are
made. It is also helpful if the actual costs of running institutions
are known, to ascertain whether economies of scale are possible
(e.g., with small rural schools), and whether the marginal costs of
bringing in extra students are likely to be relatively small.
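To make the arithmetic behind rate of return analysis concrete, the short Python sketch below estimates a private rate of return to an extra stage of schooling by finding the discount rate at which the discounted costs (fees plus forgone earnings) just balance the discounted earnings differential. It is offered only as an illustration of the method: every figure in it is invented, and a real study would need country-specific cost and earnings data.

# Illustrative only: all figures are invented. Costs (fees plus forgone
# earnings) are borne for three years; an annual earnings differential of
# 300 then follows for forty years of working life.
cash_flows = [-1200, -1200, -1200] + [300] * 40

def npv(rate, flows):
    # Net present value of a stream of annual amounts at a given discount rate.
    return sum(amount / (1 + rate) ** year for year, amount in enumerate(flows))

def rate_of_return(flows, low=0.0001, high=1.0, tolerance=1e-6):
    # Find, by bisection, the discount rate at which the NPV falls to zero.
    while high - low > tolerance:
        mid = (low + high) / 2
        if npv(mid, flows) > 0:
            low = mid    # NPV still positive, so the rate can rise further
        else:
            high = mid
    return (low + high) / 2

print(f"Estimated private rate of return: {rate_of_return(cash_flows):.1%}")

A social rate of return could be sketched in the same framework by adding public costs (teacher salaries, buildings, subsidies) to the cost side; either way, as noted above, the assumptions built into the figures should be made explicit before the results are used in financial decisions.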

b. Resource allocation
In most countries, education is seen as a public good, to be provided
for all its citizens as one of their rights, at least up to a certain
level. But when times are tough, there is likely to be some pressure
towards user-pays, particularly if it is believed that some students
(e.g., those attending university) are receiving an undue share of the


country’s tax income to pay for their advanced education. Student


loans and the use of ‘voucher’ systems are examples of mechanisms
which have been recommended to achieve increased equity. In
the long run, it is a value decision on whether education should be
cross-subsidised, and will depend on a number of factors, including
how much the country needs highly qualified people for its
economic growth, and how much it will go out of its way to locate
able students, encourage them by way of bursaries and scholarships,
living and boarding allowances, and subsidise their educational
costs.

Some more affluent countries are prepared to consider differential


resource allocation to disadvantaged groups (ethnic minorities,
children from low socio-economic backgrounds, handicapped, etc.)
to make their levels of achievement more similar to those of the
population at large. The particular way in which such resources are
to be targeted is a legitimate subject for research (Ross, 1985), and
the outcomes can be of material assistance to policy makers.

When new curricula or teaching/learning strategies (e.g. computers,


open plan education) are being tried out, a good estimate of costs
needs to be obtained, to find out whether the innovation will be
cost-effective or not. It is risky simply to follow other countries, and
assume that what works in one setting will work in another, often
very different, setting.

Perhaps the most crucial area related to resource allocation is the


salary component, which commonly swallows up at least 80% of
the education budget in developed countries, and as much as 90%
in developing countries. Reliable information on the qualifications,
experience, movement and dropout of teachers is a vital ingredient
in the costing of an education system, and one upon which good up-
to-date information is vital.


c. Administrative structures
Some education systems are highly centralised, others devolve a
large amount of responsibility to the local level in administrative
matters, and sometimes in curriculum as well. Most achieve a blend
between the two. Research can be valuable in determining the
best compromise, in determining the cost-effectiveness of various
alternative patterns of administration, and ascertaining the effects
of these upon teachers, principals, members of school governing
bodies, and parents (Wylie, 1990). The reward and incentive
structures for teachers also have a considerable bearing upon the
quality of education, and the evenness of its spread across rural and
urban areas in any country. Research is also needed on alternative
teaching strategies and delivery systems (e.g., distance learning,
small group, problem-based enquiry learning) and their impact on
costs and the physical environments for learning.

5. Selection and training of educators


A vast amount of research has been carried on into exactly what
makes a good teacher, from the pre-school level right through to
university. This is to be expected, because teaching and learning
form the core of all education. A good teacher can make us happy
and inspire us to continue learning, a bad teacher can make us
miserable and ‘turn us off’ further learning altogether. And effective
teaching behaviours vary widely, at different age levels, in different
subject areas, and in different situations. Teaching also merits study
because it uses up a lot of money, and an inefficient teaching service
is a heavy drain on the country’s educational budget.

a. Teacher selection
Every country must pay considerable attention to the way in which
it selects its teaching force, because the lives of its future citizens
and its own economic welfare are in their hands. Research is


therefore highly desirable on the qualifications of the pool from


which student teachers are traditionally drawn, the competencies
and qualifications which are sought in prospective teachers, and
in those who are eventually chosen for training. It is important to
have good information on the incentives to teaching, the reputation
of teaching as a career and the motives of those who select it, as
well as the screening processes (overt and covert) which are used to
select teachers and other educators.

b. Teacher education
The settings in which pre-service education of teachers is carried
out, whether at a university, a teachers college or both, the level
and length of training, and the balance between education theory
and classroom practice are all legitimate topics for research. In-
service education is another issue which is becoming of increasing
importance, as new and updated curricula and teaching methods
are introduced (e.g., in science) which make much heavier demands,
both upon teachers’ knowledge of their subjects, their ability to
use new equipment and new approaches (e.g., discovery, problem-
solving methods), and on their ability to cater for individual needs.
They may be required to work in team teaching situations in open
plan classrooms, and generally cope with a much more flexible and
less-structured teaching environment, in which the traditional ‘lock-
step’ rote learning is no longer acceptable.

c. Teacher effectiveness
All the matters mentioned above have a bearing on the general
issue of teacher effectiveness. In spite of the vast amount of
research into this area, we still do not know enough about
what makes for effective teaching, in any global, international
sense. This probably differs at different class levels, in different
countries, under different teaching conditions, and with different
community expectations. But it is important to have evidence about
effective and ineffective teaching behaviours to plan the content


of pre-service and in-service teacher education programmes,


because there is no sense in training teachers to adopt ineffective
teaching strategies! If apparently successful teacher behaviours
are so variable (as they seem to be) a search for general findings
is probably unproductive. Specific, in-country research may be
necessary to guide teacher educators in their very important task.

6. Monitoring and inspection


One form of assessment takes place at the student level. This has
been considered under Topic 2. Another form of evaluation is as
an accountability mechanism at the system level. It consists of
gathering measures of performance so that policy makers know
whether their expectations of the system outputs are being fulfilled,
whether standards are being improved (or at least maintained), and
(perhaps) how their own country’s education system compares with
other similar systems.

Some countries have very detailed and formalised ways of


obtaining this information, others do it much more informally, but
preparation of suitable instruments and evaluating the results is an
important research exercise.

a. Monitoring achievement
It is not usual to undertake monitoring of achievement at every
grade level, because of the sheer expense of the operation.
Some countries do not even undertake formal monitoring at
all, through the use of tests or other assessments, because they
are not convinced that it is a cost-effective way to maintain
standards. They may prefer to use informal methods of quality
assurance by concentrating upon teacher in-service training, or
by providing standardised tests of achievement to guide teachers
in their curriculum and assessment decisions. But many countries
do select important ‘check-points’ in the system at which to


administer various assessment measures on a nation-wide basis,


so that policy makers and the general community can obtain
some idea of the standards that are being maintained throughout
the education system (Livingstone, 1985). Tests of literacy and
numeracy commonly form the basis of such assessments, but they
can go much broader than this. A further argument is that without
monitoring, the policy makers will not know how to improve the
internal efficiency of the system, because they do not know how
efficient or inefficient it is, nor exactly what are its outputs of well-
qualified students.

b. Comparative evaluation
On a slightly broader front, system evaluation studies are desirable,
to consider such topics as the following. Is my country investing
more or less in the education of its population than other similar
countries, seen as a proportion of its GDP? How does the country
fare in relation to these other countries on a range of social
indicators, such as school enrolment ratios, graduation rates,
proportion of enrolment in higher level science and mathematics
courses, etc.? Are we producing a sufficient supply of highly-
qualified persons to compete with the output from other rapidly
developing countries? And even, what proportion of the total
educational budget should be devoted to educational research!

To answer such questions will require careful, comparative


interpretation of basic statistics against those of other countries,
studies of labour market trends, and ‘tracer’ studies of graduates in
particular fields.

c. Inspection
Another traditional way in which educational systems maintain a
quality check has been through regular inspection of its teachers, at
least at the primary and secondary levels. The quality of education
provided in any country is crucially affected by the quality of the


teaching and lecturing force, and research studies of the ways in


which such evaluations can best be carried on in a sensitive, on-
going way are called for. Such evaluations commonly go along with
some inspection of the schools, or other educational institutions,
themselves, to ensure that they are well-equipped and capable of
delivering the high-quality education which is required.

d. Miscellaneous
An additional Miscellaneous category has been added at the bottom,
to cater for such things as research on methods of information
dissemination, the preparation of research bibliographies, clearing-
house activities, research methodologies, and any other topic which
may not fit neatly into the grid. A summary of all these topics is
contained in the table below, grouped according to the classification
given, and expanded to indicate the sorts of activities which might
fall into each category.


An expanded content table

1. Learning goals and curriculum

• Needs assessment surveys – societal goals, employment requirements


• Curriculum development/revision
• Provision of resources – textbooks, learning materials, classrooms, libraries, laboratories,
sports and cultural facilities
• Special needs students – disadvantaged, gifted

2. Assessment, guidance and selection

• Examinations – validity, reliability, internal assessment


• Other forms of assessment - diagnostic, mastery, competency-based; standardised tests,
item-banks, certification
• Guidance and selection – career guidance, special needs

3. Demography, enrolment, structure

• Basic demographic statistics – birth-rate, migration trends, rural/urban balance


• Educational enrolments – coverage, retention, male/female and urban/rural balance
• School structure – comprehensive/single sex/private, school and class size, streaming

4. Finance and administration

• Unit costs – rate of return, economies of scale


• Resource allocation – equity, targeted aid, ‘user-pays’, bursaries and scholarships
• Administrative structures – decentralisation, methods of delivery

5. Selection and training of educators

• Teacher selection – qualifications, competencies, procedures


• Teacher education – pre-service, in-service training
• Teacher effectiveness – teaching and learning strategies

6. Monitoring and inspection

• Monitoring achievement – system accountability, national assessment, literacy and


numeracy, check-points
• Comparative evaluation – cost effectiveness, output of highly-qualified people, educational
budgets
• Inspection – teachers, institutions

7. Miscellaneous

• Information dissemination, research methodologies


Filling out the grid


One more stage in the process is necessary before all possible
projects which might be undertaken can be accurately and neatly
classified. It is necessary to decide at which levels of the system
the issue is important, and thus at which levels research should
be carried out. Some issues are of specific concern to secondary
students (examinations and career guidance, for example), while
others may relate only to pre-school education, or to some form
of tertiary education. Some issues may span several levels, and
research projects, particularly where they involve transitions from
one level to the other, must take this into account. The final two-
way table therefore includes five levels of education, plus an Other
category coded as below:

A Pre-school
B Primary School
C Secondary School
D Tertiary Education
E Non-formal Education
F Other
These are not the only possible categories, of course. Some
countries with selective secondary schools might wish to divide the
Secondary School category up into two or more. Others with no
organised pre-school services may not need to include this category.
But the pattern will remain the same. Every cell in the table now has
its own code (e.g. 1.3B would refer to a project on the provision of
resources at primary school level, 5.2D would refer to a study of the
selection of lecturers for some form of tertiary education (perhaps
teacher education), the code 7.0F might refer to a miscellaneous
project on establishing an education index for the whole education
system, and so on). Projects can span several levels of the system,
and should be entered under each relevant level. Occasionally
a project may fit more than one content category. In this case, it
should be allocated to the category it fits best, cross-referencing it to
another category if this is thought desirable.
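For research offices that keep their project registers on a computer, the small Python sketch below shows one way the coding scheme just described could be generated and decoded automatically. It is a hypothetical illustration only: the category and level labels are taken from the tables in this section, but the data structures and function names are invented.

# Hypothetical sketch of the content/level coding scheme (e.g. 1.3B).
CATEGORIES = {
    1: "Learning goals and curriculum",
    2: "Assessment, guidance and selection",
    3: "Demography, enrolment, structure",
    4: "Finance and administration",
    5: "Selection and training of educators",
    6: "Monitoring and inspection",
    7: "Miscellaneous",
}
LEVELS = {
    "A": "Pre-school",
    "B": "Primary School",
    "C": "Secondary School",
    "D": "Tertiary Education",
    "E": "Non-formal Education",
    "F": "Other",
}

def make_code(category, subtopic, level):
    # Combine category number, sub-topic number and level letter,
    # e.g. (1, 3, "B") -> "1.3B".
    if category not in CATEGORIES or level not in LEVELS:
        raise ValueError("unknown category or level")
    return f"{category}.{subtopic}{level}"

def describe(code):
    # Expand a code such as "1.3B" back into readable labels.
    category, rest = code.split(".")
    subtopic, level = rest[:-1], rest[-1]
    return (f"Category {category} ({CATEGORIES[int(category)]}), "
            f"sub-topic {subtopic}, level {level} ({LEVELS[level]})")

print(make_code(1, 3, "B"))
print(describe("1.3B"))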


The content/level grid

Pre-school (A)   Primary (B)   Secondary (C)   Tertiary (D)   Non-formal (E)   Other (F)

1. Learning goals and curriculum


· Needs assessment surveys
· Curriculum development
· Provision of resources
· Special needs students

2. Assessment, guidance and selection


· Examinations
· Other forms of assessment
· Guidance and selection

3. Demography, enrolment, structure


· Basic demographic statistics
· Educational enrolments
· School structure

4. Finance and administration


· Unit costs
· Resource allocation
· Administrative structures

5. Selection and training of educators


· Teacher selection
· Teacher education
· Teacher effectiveness

6. Monitoring and inspection


· Monitoring achievement
· Comparative evaluation
· Inspection

7. Miscellaneous


EXERCISE 2

Below are ten research projects which you may assume have been suggested to
your Ministry of Education. Classify each one onto the grid above, by giving it
the correct code. Compare your answers with those of other members of your
group.

1. To establish criteria for the identification and categorisation of handicapped


children, with a view to deciding which ones should be integrated into
normal school programmes (‘main-streamed’).

2. To develop diagnostic materials in mathematics for primary school children,


and a set of strategies for teachers to apply these materials.

3. To study teacher dropout in isolated and rural areas in order to identify


those areas where there is, or is likely to be, a severe teacher shortage.

4. To develop indicators of family socio-economic circumstances which would


be suitable for making adjustments to school and tertiary fees on the basis
of need.

5. To conduct a ‘tracer’ study of vocational education graduates in order to


obtain feedback to improve the relevance of higher education courses.

6. To compile an annotated bibliography of all the reports produced by the


research division of your Ministry of Education since 1970.

7. To follow a cohort of teacher education students through their courses


of training and out into their first year in the classroom, to find out their
perceptions of how well their courses prepared them for teaching.

8. To construct a national test of basic literacy to be used as a benchmark for


future tests, and a guide for programmes in non-formal education.

9. To survey local communities on their wishes regarding the extension into


their areas of national secondary schools offering a full curriculum up to the
pre-tertiary year.

10. To find out the effects of building a well-stocked ‘community’ library, readily


accessible to secondary school students, on their reading comprehension and
vocabulary skills and breadth of reading interests.


EXERCISE 3

Now consider whether there are any variations which you think
should be made to the grid to fit the education system in your own
country, either to the content categories or the levels.

Think up ten new projects which you think would appeal to policy
makers in your country, and use your own grid to classify them.
(You can stay with the grid above if you think it is suitable for
application in your country)

3. Setting priorities for educational policy research issues
When establishing a policy-oriented research programme, it is
highly desirable to establish a mechanism to determine national
priorities. Researchers are not good judges of what is important
nationally. They tend to see research projects in terms of ideas,
models or methodologies which are of interest to them. On the
other hand, some administrators lack foresight, and are only able
to identify projects when problems arise in Parliament or there is a
national outcry. It is usually then too late to initiate research which
will deliver the desired results on time.

A good procedure is to poll major interest groups well ahead of


time, so that the likely problem areas are identified in advance.
This will give the necessary lead time to get the required research
under way. It is also good to consult the Ministries of Education
in other countries at a similar or slightly more advanced stage of
development, to find out what problems they have experienced,
locate any relevant research they may have done or commissioned,
and generally pave the way for sound policies.

The grid should be a helpful device to give a broad picture of the


total scope of the research effort in the country, and help to ensure
that there is not an undue concentration of resources in just a few
areas, with a large number of gaps elsewhere.


EXERCISE 4

Make a list of exactly which groups in your country you would


consult in such a polling exercise.

Which education ministries in other countries would you think


about consulting? Give reasons for your choice.

It is one thing to have a number of possible projects outlined. It is


another thing altogether to decide which projects should be tackled,
with what priorities and with what resources. There are really two
issues here: importance and feasibility. Presumably only important
projects are likely to be tackled, particularly if funds are scarce (as
they usually are!). And even if the matter is important, it may not be
feasible if the human, financial and other resources are not present.
Exercise 5 provides a way to work through these issues, first by a
consensus rating method to find out which projects should have
priority, and then by a simple statistical technique.


EXERCISE 5

1. Examine the list of ten projects given in Exercise 2 above, which you
labelled according to their location on the content grid.

2. Now gather together between 8 and 12 other fellow-students on the


course from your own country (or one at a similar stage of development)
and by discussion (and argument if necessary!) attempt to obtain agreement
on which projects you might consider carrying out in your respective
countries next year. List them in order, with the highest priority first, going
down to the lowest priority.

This is the consensus approach to determining research priorities, and may


work well if there is little disagreement within the group.

3. Next, try a slightly more systematic approach. Go back by yourself and


make an assessment of each project, individually, without consulting other
group members. As far as possible, do not take into account the results of
the last discussion. Make your own ratings, as outlined below, describing
what you yourself think of each of the ten projects.
First, rate each of these projects on a 5-point scale of importance in your
country as follows:
Use these criteria for your ratings:
a. The project deals with a relevant educational issue in your country;
b. The issue is currently a persistent problem being faced by your Ministry
of Education;
c. The project will provide information that can be used in policy
decisions.


… EXERCISE 5

Extremely important
    5   An absolutely crucial study; top priority
    4   An important project; should be done
    3   A moderately important project; could be done
    2   A rather unimportant project; probably not a high priority right now
    1   An unimportant project; need not be done at this time
Not important

Next consider the feasibility of the project, rating each project in a similar way,
from 5 down to 1. Note that importance is not the same as feasibility. A project
is important if we ought to do it. A project is feasible if we can do it, i.e., it is
possible, within the limitations of our resources of personnel, equipment and
finance, or those that we can obtain.

Thoroughly feasible
    5   An easy project to carry out; could be started without delay
    4   A relatively straightforward project; not too much difficulty here
    3   A moderately easy project to mount; some difficulty could be experienced
    2   A difficult project to get underway
    1   A quite impossible project to carry out at this time
Not feasible

Use these criteria to evaluate the feasibility of each project:


a. We have the human resources to carry it out, or we can get them;
b. We have the financial resources to do this project, or we can get them;
c. We have access to the required methodologies and facilities;
d. There appear to be no other practical difficulties (e.g., in getting co-
operation from school principals, education authorities);
e. Conditions are right in my country to carry out this type of research.


… EXERCISE 5

4. Now add together your ratings to obtain a total score out of 10 for
each project.

5. Come together with your group again, and on the basis of your ratings, see
if you can obtain a consensus on the order in which the projects should be
placed, from top priority down to bottom. It is suggested you enter the
results of the ratings for the whole group on a chart like the one below.
Using a desk calculator, find the mean and standard deviation of the ratings
in each row, i.e., for each project, and use this as the basis for discussion.

Project     Importance rating, by course student no.
number      1    2    3    4    5    6    7    8    9    10   11   12   …      Mean    SD

1.
2.
3.
… etc.

Did you find the same five projects at the top of the list as you did previously?
If you did not, can you explain why?

6. Note that the standard deviation (SD) gives a rough measure of how
strong the agreement was within the group. A large standard deviation
(over 2, say) means there was quite a bit of difference in viewpoint; a small
standard deviation means you were all basically in agreement. (A short
computational sketch of these statistics follows this exercise.)

7. Discuss with the group which method seemed to be most useful


(Consensus or Rating). What were the advantages and disadvantages of
each. Can you suggest any improvement in the rating method which would
help?
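For groups who would rather let a computer do the arithmetic in steps 4 to 6 of the exercise above, the short Python sketch below computes the mean and standard deviation of the ratings for each project and applies the rough rule of thumb given in step 6. All of the ratings shown are invented, purely for illustration.

# Invented importance-plus-feasibility totals (out of 10) given to three
# projects by a group of eight course students; illustration only.
from statistics import mean, stdev

ratings = {
    "Project 1": [8, 9, 7, 8, 10, 9, 8, 7],
    "Project 2": [5, 9, 3, 8, 4, 9, 6, 2],
    "Project 3": [6, 6, 7, 6, 5, 7, 6, 6],
}

for project, scores in ratings.items():
    average = mean(scores)
    spread = stdev(scores)            # sample standard deviation
    verdict = "broad agreement" if spread <= 2 else "considerable disagreement"
    print(f"{project}: mean = {average:.1f}, SD = {spread:.1f} ({verdict})")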


4. Clarifying priority educational policy research issues and developing specific research questions and research hypotheses
Once the general aims of the piece of research have been agreed
upon, it is necessary to get down to more specific aims, and
establish a suitable methodology for the research. The particular
approach chosen will depend to some extent on the experience
and preferences of the researcher, but very largely upon the type of
problem faced.

To quote David J. Fox (1969) in The Research Process in Education, p. 45:

In considering the research approach, we must consider two


separate and underlying dynamics or dimensions, along which we
can structure our research thinking. The first dimension is a kind of
time line reflecting whether we believe the answer to the research
question is in the past, present or future. The second dimension
is an intent dimension reflecting what we intend to do with the
completed research.

In the time dimension, if we believe the answer is in the past, we


resort to what is called the historical approach, a research approach

in which the effort is made to cast light on current conditions and
problems through a deeper and fuller understanding of what has
already been done. If we believe the answer exists somewhere in
the present, we use the survey approach. In this approach we seek
to cast light on current problems by a further description and
understanding of current conditions. In other words, we seek more
fully to understand the present through a data-gathering process
which enables us to describe it more fully and adequately than is
now possible.

If, on the other hand, our interest is in predicting what will happen
in the future, that is, if we undertake something new and different,
or make some changes in the present condition, we have the
experimental approach, which is experimental in that it seeks to
establish on a trial (or experimental) basis a new situation. Then,
through a study of this new situation under controlled conditions,
the researcher is able to make a more generalised prediction of what
would happen if the condition were widely instituted.

The research question


Usually the matters of concern to researchers in ministries of
education are in the form of questions to be answered. This is part
of the statement of the problem, which simply indicates what the
researcher is trying to find out.

A research question is not the same as a question which you could


ask an individual who might be part of your investigation. A
research question is a way of formulating a problem so that you
are directed to the answers. If one person could give the answer
to a research question there would be no need to set up a research
project to establish the facts.


Usually, the topics of concern to policy makers, or those selected


by students for degree dissertations, are too vague to begin with.
This does not mean that studying them is not worthwhile or even
necessary. It simply means that the topic needs refining.

Examine the following example of a discussion between a policy


maker (P) and a researcher (R), limiting and clarifying the topic of
concern, and making it more easily researchable.

P. We need to do some research on why boys are not performing


as well as girls in mathematics.

R. What exactly is the problem here?

P. Well, teachers I meet with keep telling me that girls do better.

R. Do you have any hard evidence to show this is so?

P. Only the examination results at the end of primary school. The


girls score on average about 5% higher than boys.

R. Do all children take this examination?

P. Most of those who are still in school at that time take it.

R. What proportion of these are girls?

P. I’ll have to check the figures. I would estimate about two-thirds


would be boys, one-third girls.

R. So there is a higher drop-out rate amongst girls in primary


school?

P. Yes. There always has been.

R. Why do you think this is?


P. Social factors, largely, I would think. Many girls have to stay at


home to help with the younger children; they don’t see a career
as to be as important as boys do.

R. Would it be the girls with a better home background who stay


on longer?

P. Yes, I would think so.

R. Well, perhaps that might be part of the reason. They might have
more support and resources at home to do well at school and
pursue their education. How could you check it out?

P. I suppose it would be possible to find out the proportion of


high-level occupations amongst the parents of boys and girls
sitting the primary final examinations, and see whether there
was a difference.

R. Yes, that would be useful. Do you have any other information


from lower down the primary school which might be of
assistance?

P. I think some schools administer standardised tests of


achievement at Grade 4 level.

R. Have many girls dropped out by then?

P. No. Most of them are still in school.

R. Do you have universal primary education in your country?

P. Well, more or less, at least up to Grade 4 (that’s about age 10 or


so). The dropout starts after that.

R. So there should be a roughly equal number of boys and girls in


primary school at Grade 4 level.


P. Yes.

R. Could you get the results from boys and girls separately from
those schools which administer the tests in mathematics at this
level?

P. No central record is held, but I suppose they could be obtained


from the schools.

R. Would you have many small rural schools administering the


test?

P. Yes, quite a number, but not so many as in large, relatively well-


off urban schools.

R. You would need to be careful in drawing your sample so that


there was an appropriate balance of various types of school,
before you compared the mean scores of boys and girls.

P. Yes, my staff could do that.

R. And then you could compare the mean scores of boys and girls
on the standardised test to see whether the same differences in
favour of girls occurred at a younger age.

P. Yes I suppose so.

R. Incidentally, why do you want to do this piece of research?

P. Well, actually, now you mention it, I’ve had a suspicion about
the mathematics examinations for Form 3 over the last three
years. I have a feeling that these examinations have been
prepared in order to favour the performance of girls!
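Before leaving this dialogue, it may help to see roughly what the comparison suggested earlier by the researcher (balancing rural and urban schools before comparing the mean scores of boys and girls at Grade 4) could look like once the data were assembled. The Python sketch below is purely illustrative: the scores, pupil numbers and enrolment shares are all invented.

# Invented Grade 4 mathematics results, broken down by school type and sex.
grade4_means = [
    # (school type, sex, mean score, pupils tested)
    ("rural", "girls", 54.0, 400),
    ("rural", "boys",  52.5, 420),
    ("urban", "girls", 61.0, 900),
    ("urban", "boys",  57.5, 880),
]

# Assumed shares of national Grade 4 enrolment in each school type, used to
# weight the sample so that small rural schools are not under-represented.
population_share = {"rural": 0.45, "urban": 0.55}

def weighted_mean(sex):
    # Mean score for one sex, weighting each school type by its enrolment share.
    return sum(population_share[school_type] * score
               for school_type, group, score, _pupils in grade4_means
               if group == sex)

girls, boys = weighted_mean("girls"), weighted_mean("boys")
print(f"Weighted mean for girls: {girls:.1f}")
print(f"Weighted mean for boys:  {boys:.1f}")
print(f"Difference in favour of girls: {girls - boys:.1f} points")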


E XERCISE 6

Now that the real reason for the request for research has
emerged, try to continue the dialogue between the policy maker
and researcher. See if you can arrive at a plan of action to solve the
problem, which you have started to clarify by this discussion. You
will note it was not the problem you first thought it was!

Then pick another completely different topic, one which may be


of importance in your own country, and write a similar dialogue
between a policy maker and researcher, clarifying the issue little
by little, and proceeding in the end to the outline of a research
proposal.

This procedure of clarifying the issue and making it researchable


is a very important and necessary step before any research is
embarked upon. In many circumstances, it can be made more
systematic.

Consider the research question below, which relates to a small


(imaginary) island called Marino, with a population of about 5000,
part of an island nation in the South Pacific.

Research question
What is the need for a secondary school on the island of Marino?

We will assume that the answer to this question is not already


known, and that it could not be satisfactorily answered by just one
person, e.g., the tribal head on the island. The answer will depend
on information from various sources, such as the Minister for
Education, policy makers within the Ministry who are responsible
for interpreting the Minister’s policies and finding the financial


and personnel resources, the views and opinions of people on the


island, the views of the principals of the ‘feeder’ primary schools,
population trends in the area, distance from secondary schools on
neighbouring islands, ease of transport on the island, and so on.

When surveying opinion, the researcher will not simply want to


repeat the research question above. Rather, it will be necessary
to ask a series of more specific questions related to it. It may be
possible to obtain answers to some of these questions from official
records or government statistics. Perhaps some could be obtained
from a postal questionnaire. But the most reliable way would
usually be to interview key people likely to be affected by the
decision. These are sometimes called the ‘stake-holders’, because
they have an interest or ‘stake’ in the question.

To begin with, the following four questions might be asked by the


interviewer:

Is there a need?

Where is the need?

Why is there a need?

What kind of need is there likely to be in the future?

The outline in Exercise 7, which you will be asked to complete, is


for an interview with the chief officer of the Ministry of Education
on the island of Marino. Notice how the general research question
has been broken down into separate, more detailed questions.
The interview questions help to answer one or more of these
smaller questions. Notice also how the interview questions match
the detailed research questions and allow informed guesses
(hypotheses) to be made about the answers.


Detailed research questions and examples of interview questions

Is there a need?
    How many children attend primary school on the island of Marino?
    How many of these are in their final year of primary school?
    ...
    ...

Where is the need?
    Where are the primary schools on the island located?
    ...
    ...

Why is there a need?
    What is the policy of the Minister of Education on the setting up of secondary schools?
    ...
    ...

What kind of need is there likely to be in the future?
    What is the birth rate in the whole country?
    Is the birth rate on the island of Marino likely to be any different?
    Is there much internal migration to or from Marino?
    ...
    ...

Note that the research question also leads to the definition of


terms. What is intended by secondary school? Is it a full national
secondary school with courses going right up to pre-tertiary level, or
is it a regional secondary school with courses terminating at a lower
level?


EXERCISE 7

Make an expanded version of the above list of questions. Then fill


in all the detailed interview questions you can think of that might
correspond to these four research questions.

Then add your own additional research questions, e.g., on funding,


personnel, transport, etc. and make up interview questions to
fit them. You should end up with a complete interview schedule,
dealing with all the important issues which the Minister would
want to know.

EXERCISE 8

Now imagine you were interviewing the head man of the local
tribe on the island. What research questions would you want to
ask him? (Many of them will be different!)

Make up a table in the same way, listing the research questions


down the left, and the corresponding interview questions on the
right.

Discuss this exercise with your group, and compare your answers.

Beware of unresearchable questions!


Try answering: Is early childhood education a good thing?

The problem is that it is impossible to tell what would count as


evidence to support any one answer over another. Some people
would not accept, ‘It leads to a child’s all-round development’.


Some would not accept ‘It prepares a child for primary school’.
Some would not accept ‘It has economic benefit’. The answer to
this question lies in the field of values and is not really a matter for
scientific investigation. However, if someone told you what kind of
answer they would accept as evidence (e.g., it prepares a child for
school) then the question could be researched, within those terms.

Faced with an unresearchable question, a researcher will sometimes


define what answer he or she would accept as evidence, and
structure the research around that. For example, an economist
might assume that the most important benefit from early childhood
education was an economic one. The economist might then evaluate
early childhood education in relation to its costs and its benefits.
One type of early childhood service might be compared to another,
early childhood services might be compared with other services
in relation to what the service costs and what benefits result to the
users and/or to society as a whole.

EXERCISE 9

Begin with an unresearchable question which people in your


country might ask, and turn it into a researchable question by
defining the terms of reference more closely. You might like to
use as an example (if you wish) the question ‘Is having many small
rural schools a good thing?’ and consider it from the viewpoint of
an economist, an educational administrator and a local community
leader. For this example (if you chose to use it), you would need
three separate, and different, research questions to reflect the
three different viewpoints.


The research hypothesis


Sometimes it is helpful to state the issue in the form of a hypothesis.

Put simply, a hypothesis is a suggested answer to a problem,


expressed in the form of a brief sentence.

It should satisfy at least four criteria:

• It should be relational; that is, the hypothesis should state an


expected relationship between two or more variables. The
researcher will attempt to verify this relationship.

• It should be non-trivial; the hypothesis should be sensible


and worthy of testing, a likely possibility and not just an idea
dreamed up for the sake of having a hypothesis.

• It should be testable; that is, it should be possible to state it in an


operational form which can be evaluated on the basis of data to
be gathered.

• It should be clear and concise; the hypothesis should be in the


form of a brief, unambiguous sentence.

1. The hypothesis should be relational


In correlational studies, that is, those in which data on two or more
variables are collected on the same individuals, a direct relationship
is usually stated in the hypothesis. In experimental studies, where
an experimental treatment is administered to one group of students
but not to another group, differences between the treatments are
usually hypothesised, based on means and standard deviations.


In addition to stating a relationship, the hypothesis may also briefly


identify the variables and the population from which the researcher
intends to select the sample. As a rule, however, it is best not to
include too much information of this type in the actual hypothesis,
because it makes it too lengthy and less clear.

2. The hypothesis should be non-trivial


After completing the necessary review of the literature, you will
have detailed knowledge of any previous work relating to your
research investigation. In many cases you will find conflicting
results, but they will at least give you some leads so that your
hypotheses are sensible and reasonable. It is best to have some basis
in theory, fact, or past experience for your hypotheses. A ‘shot-gun’
approach which gathers large amounts of information unsupported
by any underlying rationale is not to be recommended.

3. The hypothesis should be testable


The relationship or difference under consideration should be such
that the measurement of the underlying variables can be made
reliably and validly, in order to see whether the hypothesis as stated
is supported by the research. Do not state any hypothesis which
you do not have good reason to believe can be tested by some
objective means. The hypotheses of inexperienced researchers
often fail to meet the criterion of testability, either because the
required measures do not exist, because many other likely factors
are at work, or because it would take far too long to obtain results.
For example, a hypothesis that taking a particular course in moral
education would lead to a reduction in adult crime statistics is
unlikely to be easily testable.


4. The hypothesis should be clear and concise


In stating hypotheses, the simplest and most concise statement of
the problem is probably the best. Brief, clear hypotheses are easier
for the reader to understand, and also easier to test. It is better to
have a larger number of simple hypotheses than a few complex
ones. Care and precision in the use of language are necessary in
order to define the variables and samples clearly. Each separate
relationship drawn from the research question needs its own
hypothesis, because usually some will be supported by the data and
some will not.

In dealing with many education policy issues, researchers are


likely to find it helpful to use questions rather than hypotheses for
their problem statements. Questions may be more useful because
they provide less mechanistic, more holistic guidance in the
framing of a research project. They are closer to real life enquiries,
and allow a variety of broader approaches to gathering information,
including qualitative methods. The advantage of the research
hypothesis lies in the direction and precision which it gives to
research. There is no room for sloppy thinking in framing research
hypotheses.

In the past, students writing research dissertations have generally


been encouraged to frame their questions in the form of hypotheses,
in many cases based upon underlying theories. On the other hand,
many projects carried out by research staff within ministries of
education do not call for hypotheses. They may be baseline studies
describing the situation as it exists, or evaluations of particular
curriculum innovations, and simple descriptive statistics are all that
is required. No cause-effect relationships are being teased out, as
with an experimental study.

In passing, it should be emphasised that a demonstrated statistical


association is not the same as a proof of causality. Just because


there is a statistically significant relationship between, say, whether


or not the school has a science laboratory and the proportions of
students achieving high examination pass-rates, does not prove that
the presence of laboratories causes high achievement, simply that
it is related to it. There may be other important factors at work as
well. Association is not the same as causality, but strong statistical
techniques such as path analysis can go some way towards teasing
out the relative effects of different variables, and combinations of
variables, and allow some cautious generalisations to be made about
which hypotheses are more likely to be true. Even here, there is a
strong body of literature suggesting the ‘situation-specific’ nature
of much behaviour. It depends to a considerable extent on the
particular class, school, teacher, and circumstances prevailing at the
time.

Researchers commonly use two different kinds of hypotheses. A


research hypothesis indicates what the researcher expects to find,
and substantiate with evidence. It is framed in a positive way, but it
is important to note that it is not what the researcher wants to find,
but what he or she expects to find that is the basis of the statement.
No researcher operates in a value-free environment. All of us
have our own preconceptions about education; the frameworks in
which we conceive the research and even the very questions which
we ask will reflect that. But the important point here is that there
is a ‘procedural neutrality’, and the researcher is as objective as
possible in gathering the necessary data to see whether what he or
she expects to find is indeed the case. Researchers sometimes use
another type of hypothesis, the null hypothesis, for reasons of ease
in statistical testing. The null hypothesis states that no difference
or relationship exists among the variables, regardless of whether
or not the researcher believes this to be true. If, as a result of the
research, the null hypothesis is rejected, the investigator concludes
that differences do exist, and will then set out to identify those
differences, and if possible, their causes.


Two examples will make the difference clear.

Example 1
• Topic: The relationship between age of entry to primary school
and subsequent school success.

• Research Question: How is performance in the English language


affected by whether children enter primary school at age 6 or
age 7?

• Research Hypothesis: Children who enter primary school at age


6 will perform better on standardised tests of English reading
comprehension at age 12 than children who enter at age 7.

• Null Hypothesis: There are no differences in achievement on


standardised tests of English reading comprehension at age 12
between children who enter primary school at age 6 and at age 7.

Note that the general topic has been made more specific by the
research question, in that school success has been defined in
terms of the English language only. Clearly there are many other
definitions of school success, and these would all need their own
research questions.

Note next that the research hypothesis has further narrowed the
research question, in specifying that performance in English is
defined in operational terms as written performance only (not
spoken performance or listening skills) and that this performance is
limited to what can be measured on a standardised test of reading
comprehension (not vocabulary, for example). Other hypotheses
would be needed to cover other aspects of English.

Furthermore, the age of 12 years has been set as the point at which
the measurement is to be done. If it was desired to see whether
the advantage persisted to a later age, it may be necessary to test


students again when they were 14 years or 16 years. These would


require further hypotheses.

The null hypothesis matches exactly the research hypothesis, and


the same precision is called for in the framing of the statement.

Here, however, no assumption is made as to whether beginning


school earlier has a positive effect on learning or not. The
hypothesis is entirely neutral.
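To see what a statistical test of the null hypothesis in Example 1 might involve once the data had been collected, the hedged sketch below applies an independent-samples t-test, using the ttest_ind function from the SciPy library, to two sets of reading comprehension scores. Every score is invented for illustration; a real study would use properly sampled data and report the test alongside the group means.

# Invented reading comprehension scores at age 12 for two small groups of
# children; a real study would use a properly drawn sample.
from scipy import stats

entered_at_six   = [68, 72, 59, 81, 74, 66, 77, 70, 63, 79]
entered_at_seven = [64, 70, 58, 69, 71, 60, 66, 73, 61, 65]

t_statistic, p_value = stats.ttest_ind(entered_at_six, entered_at_seven)
print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")

# At a chosen significance level (say 0.05) the null hypothesis of
# 'no difference' is either rejected or retained.
if p_value < 0.05:
    print("Reject the null hypothesis: a difference between the groups is indicated.")
else:
    print("Retain the null hypothesis: no difference has been demonstrated.")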

Example 2
• Topic: The use of micro-computers in diagnosing errors in basic
arithmetic.

• Research Question: Can micro-computers be used effectively to


diagnose errors in basic addition and subtraction with primary
school pupils?

• Research Hypothesis: The number of errors in addition and


subtraction correctly located by a computer diagnostic
arithmetic programme is more than the number located by a
classroom teacher.

• Null Hypothesis: The number of errors in addition and


subtraction correctly located by a computer diagnostic
arithmetic programme is no different from the number located
by a classroom teacher.

Note that a number of other hypotheses would need to be tested


here as well. It would be important to know how long it took the
teacher to do the error diagnosis task, compared with the computer,
whether or not the same errors were detected, how much extra
information was obtained from the computer printout about the
false starts made by the pupil before he or she got the right answer,
and so on. (Livingstone, Eagle, Laurie, 1988).


EXERCISE 10

Choose three topics as research possibilities. They can be any


topics you like.

Now make three problem statements for each, as for the


Examples above.

Topic under investigation

Problem statement as research question

Problem statement as research hypothesis

Problem statement as null hypothesis

As you go, check each of your statements against the four criteria
given above: relational, non-trivial, testable, clear and concise.
When you have finished, exchange your problem statements with
those of another member of your group. Can you improve on one
another’s problem statements? Discuss and critique them together.


Generating research hypotheses


Where do hypotheses come from?

There are three main sources:

• Observation

• Theory

• Literature Review

Hypotheses are often derived as the end result of a series of observations. But they are not to be confused with observations. An observation refers to what is, and can be seen; a hypothesis refers to something that can be inferred, or expected, or assumed from what is seen.

For example, some researchers could visit a primary school, and note that there is no library, there are very few bookshelves round the classroom walls, and there are hardly any books on them. Though they do not yet know whether the school’s achievement results are poor (that is, they have no data on examination success at this stage), they expect that, in general, children from that school will not perform well.

They could then make an explicit hypothesis, setting out an anticipated relationship between two variables: the number of books found in the school and success in examinations for entry to secondary school. This hypothesis could be tested by visiting a number of different schools, observing the number of books, and whether or not the school had a library, and relating these observations to the proportion of pupils who were successful in entering secondary school. A generalised hypothesis could be framed on the basis of the evidence.


But in visiting the various neighbourhoods, the researchers also observed that some of them showed obvious signs of poverty. There were many broken down old sheds, and the ground seemed not to be cultivated. They wondered whether this might have an even more significant effect on school achievement. Perhaps the majority of people living in these poor neighbourhoods could not afford a newspaper, or any books in their homes. There was no ‘literate culture’ in the homes, to reinforce what the school taught, and this was the main reason why the children did not do well at school.

This idea would give rise to some alternative hypotheses, relating first, total home income, and secondly, the number of books, newspapers, magazines, etc. in the home, to the success of children in their examinations for entry to secondary school. More generalised hypotheses could now be framed and tested, provided that data on home income and home literacy could be obtained and put into an operational form to allow the necessary correlations to be calculated.
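As an illustration only, the sketch below shows how observations such as these might be put into an operational form and correlated. It assumes Python with the scipy library, and every figure (the book counts and the entry rates) is invented for the purpose of the example.

    # Hypothetical sketch: relating the number of books in each school to the
    # proportion of its pupils who entered secondary school. Figures are invented.
    from scipy import stats

    books_in_school = [40, 350, 120, 15, 600, 80, 260, 30]               # books counted in each school
    entry_rate      = [0.22, 0.61, 0.35, 0.18, 0.72, 0.30, 0.55, 0.25]   # proportion entering secondary school

    r, p_value = stats.pearsonr(books_in_school, entry_rate)
    print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")

    # A strong positive r would be consistent with the hypothesis, but it would
    # not, by itself, rule out rival explanations such as home income.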

And so the process would continue, with more likely hypotheses being generated and proposed for testing, requiring new information to be collected in a form which allowed it to be evaluated with reasonable objectivity.

A hypothesis, then, is an expectation about events, based on a generalisation of an assumed relationship between variables. Hypotheses are abstract and concerned with theories and concepts, whereas the observations used to test them are specific and based on facts.

Another way of generating a hypothesis is from an underlying theory, which has been built up over a large number of previous studies by researchers over a long period of time. The theory of mastery learning would be one such example. This states that if learners possess the necessary entry behaviours (prerequisite knowledge, skills and attitudes) for a new learning task, and if the quality of instruction is adequate, then they should all learn the task, given sufficient time. Two typical hypotheses following from this theory would be:

• Hypothesis 1
Following corrective instruction (mastery learning) the relationship between original learning and corrected learning will be zero.

• Hypothesis 2
Given equal prior learning, corrective instruction (mastery learning) will produce greater achievement than non-corrective instruction.

Theory thus provides a good source from which hypotheses may be derived, for particular sorts of research, because testing the hypotheses will confirm, elaborate or disconfirm the theory, and a new and cohesive body of knowledge will begin to emerge. This source of hypothesis generation may, however, be more suitable for an academic thesis than for the work of someone involved in policy research for a Ministry of Education.

A third, and very common, way of generating hypotheses is through a literature review. The object of a literature review is to learn from what is already known on the topic, and build on it. There is no point in ‘re-inventing the wheel’. Reviewing and synthesising the work of others can provide suitable research questions or hypotheses which they have used, suggest forms of wording which have proved successful in eliciting what might be sensitive information, and alert you to additional questions which could be asked if the circumstances of previous research are different from those under which you are working. A literature review can identify suitable target populations, suggest important predictor and outcome variables, and demonstrate tried and tested measurement techniques.

A thorough literature review has another advantage as well; by learning from the work of others, you can avoid repeating their mistakes, if they are honest enough to report them! Finally, you may discover some studies which give clear and unequivocal answers to the very questions you are concerned about. You may not need to do the research you had planned at all, or you may only need to do a small-scale study to check that the results already found somewhere else hold equally true in your own country.

One particular form of literature review is the relatively recent development of meta-analysis, which can provide a fruitful field for the generation of research hypotheses. Meta-analysis refers to a method that combines a large number of similar studies testing essentially the same hypothesis in different settings, and thus helps to reveal the overall size of the effect or relationship between the variables involved. In effect, meta-analysis is the analysis of analyses, a distillation of past research on an issue. Not only is this a good way to generate hypotheses, but it also provides an opportunity to pass judgement on the overall quality of the research on a particular issue. It also helps identify variables which ought to be included in future studies, because they have been found to be important in other studies in the past.
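The arithmetic behind a simple fixed-effect meta-analysis can be sketched as follows. This is an illustrative example only, not drawn from Glass's work: it assumes that each study reports a standardised effect size with a known variance, and all of the numbers are invented.

    # Hypothetical sketch of a fixed-effect meta-analysis: the pooled effect is
    # an inverse-variance weighted average of the study effect sizes.
    import math

    effect_sizes = [0.20, 0.35, 0.10, 0.28, 0.15]      # one standardised effect per study
    variances    = [0.010, 0.020, 0.015, 0.008, 0.012] # variance of each effect size

    weights = [1.0 / v for v in variances]
    pooled_effect = sum(w * d for w, d in zip(weights, effect_sizes)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))

    print(f"Pooled effect = {pooled_effect:.3f} (standard error {pooled_se:.3f})")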

A well-known illustration of meta-analysis is that of Gene V. Glass (1985), referred to above, on the effects of class size on school performance. This is a very vexed question, and the last word has by no means been spoken on it. Most national evaluations in developing countries include a measure of class size. Because teacher salaries use up between 80% and 90% of the education budget in such countries, it is a crucial variable for economists of education. But it is a very hard one to interpret. For example, some high quality urban schools may have large classes because they are so popular, and parents want to send their children to them. And if the method of instruction is the same in large classes as in small classes (teacher writes on the blackboard, children copy and learn) the size of the class makes little difference to learning, although it may make a difference to the amount of marking the teacher has to do, the amount of time he or she can spend helping individual children with their difficulties, and general teacher morale.

A note on qualitative approaches


Researchers and those who consult research information want
trustworthy information, whether this is desired to guide
educational practice or inform educational policy. In the past there
has been a division of opinion over whether qualitative datasets are
sound, trustworthy and generalizable. Today, it is generally accepted
that quantitative and qualitative research are complementary.
Each type of research data has its own authority and rules for
establishing validity.

Quantitative research relies upon measurement, using such techniques as questionnaires, interviews, and observational studies which involve the counting of scores, tallying frequencies, and estimating statistical differences and relationships between sets of variables.

Qualitative research refers to a range of activities, but they do have some characteristics in common. Robert Burgess lists four:

• “The researcher works in a natural setting ... and much of the investigation is devoted to obtaining some understanding of the social, cultural and historical setting”.

• In certain styles of research, “Studies may be designed and redesigned ... [and] researchers may modify concepts as the collection and analysis of data proceeds”.

• “The research is concerned with social processes and with meaning ... the kinds of studies that are conducted using this perspective involve focusing on how definitions are established by teachers and pupils, and how teacher and pupil perspectives have particular implications for patterns of schooling”.

• In some styles, “Data collection and data analysis occur simultaneously.... Hypotheses and categories and concepts are developed in the course of data collection. The theory is therefore not superimposed on the data, but emerges from the data that are collected”. (Burgess, 1985)

In the latter, more qualitative approach, with its emphasis on ‘context’ and ‘meaning’, the researcher collects the data without a preconceived framework for analysis. Writing a running record of the behaviour of a child in a pre-school institution is an example. A researcher could start with a very general question, without formal hypotheses, collect as much information as possible, and let the data ‘speak for themselves’, as they gradually accumulate.

In the former, more quantitative approach, the data are fairly well structured, as for example, with a structured observational schedule or an interview schedule. Researchers often start with the second approach, by recording conversations with people on the topic of study, before moving on to a more structured analysis.

There are hazards with each method. The more qualitative approach works best if the researcher has the ability to see the underlying ‘meaning’ beneath the surface of events, but is very time-consuming. The quantitative method allows for standardised data-collection across subjects and samples, and more ready generalisation, but the choice of what to observe and measure can be a source of bias. Something important may be overlooked, or the kind of data collected may not be appropriate to important sub-groups within the population. However, it is likely that this approach will be most commonly used by researchers working in Ministries of Education and providing information for policy makers, and it is assumed in what follows.

5 Moving from specific research questions and research hypotheses to the basic elements of research design

Putting it into operation


Let us suppose that you have now, after careful consideration of your research priorities, established your general aims for a research project. Following consultation with all the stake-holders, you have outlined your specific research aims and set out some basic research questions or hypotheses. How then do you proceed to put these research aims into practice, or ‘operationalize’ them?

The research questions or hypotheses are of crucial importance in the design of the research. They will determine every facet of the methodology.

1. Research questions determine the type of study which should be carried out: e.g., descriptive, relational or experimental. Is there likely to be some control over the research setting, so that random assignment of students to differing experimental ‘treatments’ is possible, or must the research simply describe the situation as it is, and attempt to draw conclusions? To what extent are the outcomes likely to be ‘situation-specific’, and how much generalisation to other settings is hoped for?

2. Research questions identify the target population from which
any sample will be drawn. Only when it is known exactly about
whom the policy decisions are to be made will it be possible to
decide the subjects to consult and study. If you do not identify
the people you are most interested in, those who are most
able to provide the information required, you risk omitting
important respondents from the project.

3. Research questions determine the level of aggregation of the data to be obtained. Are differences between pupils within classes, with their differing ages, abilities and home backgrounds, the important consideration? Or is it information gathered at the classroom level about levels of performance of groups of children under different instructional environments which is the major focus of the study? Or are you largely concerned with the schools themselves, their location within a particular urban or rural area, their organizational structure, and the effects this may have on attitudes or achievement? If this question is not considered carefully at the outset, there is a risk that there will not be enough data at the right level of aggregation to answer the research questions you want to answer.

4. Research questions identify the outcome variables. For example, precisely what definition of adult literacy is being used in a particular study? Does it only include written literacy, or is oral literacy included? Is it in the first spoken language or only in English, which may well be a second language for most people in your country? It is possible to determine appropriate outcomes only if you know exactly what you want to know. Otherwise, there is a danger that you fail to collect data on some significant outcomes.

5. Research questions identify the key predictor variables. Does performance in computer-based instruction differ by sex, or reading ability, or attitude to technology in the home? What other variables may be important? Thinking carefully about all the most likely things which may be associated with your outcomes will help to avoid missing vital predictor variables.

6. Research questions influence the measuring instruments which will be used in the study, and the ways in which data are collected, whether by access to archival records, questionnaire, interview schedule, observational record, or other means. Do the official records exist in a form which allows for ready analysis? Are there published instruments or other existing measures which are suitable for the task, or will new ones have to be prepared, validated and used? Will a variety of approaches be necessary to gain the rich information necessary to describe this phenomenon?

7. Research questions determine the sample size, the number of people you must consult or study. Different questions require different modes of analysis, which in turn require different sample sizes to ensure adequate statistical power to detect effects. Low or biased response rates and clustering effects play havoc with sample sizes, and considerable care is needed to ensure that the research gives definitive and generalizable results, useful to the policy makers who will be interpreting its results (an illustrative sample-size calculation is sketched below).

(Light, Singer and Willett, 1990)
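The following is a minimal sketch of the kind of sample-size reasoning referred to in point 7. It is not a procedure prescribed by the module: it uses the familiar formula for estimating a proportion from a simple random sample, and the design effect and response rate shown are assumed values chosen only for illustration.

    # Hypothetical sketch: inflating a simple-random-sample size for an assumed
    # design effect (clustering) and an assumed response rate.
    import math

    z = 1.96                  # 95 per cent confidence
    p = 0.5                   # most conservative assumed proportion
    margin_of_error = 0.05
    design_effect = 2.0       # assumed inflation due to cluster sampling
    response_rate = 0.85      # assumed proportion of selected pupils who respond

    n_srs = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    n_required = math.ceil(n_srs * design_effect / response_rate)
    print(f"Simple random sample size: {math.ceil(n_srs)}; adjusted size: {n_required}")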

Below are two examples of how these procedures to ‘operationalize’ the specific research aims can be carried through systematically. Every research project is different, and different types of research require different approaches. But there is good value in being thoroughly systematic about the way in which the project is planned. A tabular approach can be helpful, and the two examples which follow illustrate how this can be done for two different research projects.

Example 1
This is drawn from the illustration in Exercise 7 above, about the
new secondary school on the island of Marino.

In studies such as this one, it is first necessary to decide:

• The type of research which would be most appropriate. (It could be a multi-faceted or ‘triangulated’ study, using several methodologies).

• The target population, i.e. who you are going to consult, interview or survey. (This may be different for different sub-questions).

• The sample size for the study (where this is appropriate).

• The data sources, measuring instruments, etc. which will be required.

Research question:

What is the need for a secondary school on the island of Marino?

Sub-questions:

Is there a need?
Where is the need?
Why is there a need?
What kind of need is there likely to be in the future?

Type of study:

Needs-based opinion survey, drawing on archives, official and local viewpoints.

Table 1. Summary of data sources

Code   Target population                                  Sample size                         Data source

A      Official government records                        -                                   Official records
B      Ministry of Education officials: (a) national      5                                   Interview
C      Ministry of Education officials: (b) local         3                                   Interview
D      All principals of primary schools on Marino        6                                   Questionnaire and interview
E      Sample of parents of primary school children       5 (random) from each of             Interview
       on Marino                                          6 schools = 30
F      Tribal head on Marino, and other local             3                                   Interview
       dignitaries

It is helpful to set out your information in the form shown in Table 1 above, coding each of the data sources, and filling in the entries across the page.

Once such a table is complete, you will have some idea of the
size and scope of the project, the number of interviews and
questionnaires which will be required, and a rough estimate of the
amount of time and money which will be needed. The sample sizes
can be modified at a later stage in the planning, of course, when
the statistical treatment of the data is finally determined. But it is
helpful to have some guide at this early stage.

It is now possible to operationalize the process one stage further, by examining each of the individual questions which you intend to answer and deciding the best way of obtaining the information. In general, the best way is the most efficient way, the one which will provide the information at least cost, maximum accuracy, and least intrusion on the valuable time of other busy people.

The source of the data for a few of the questions in the sample has
been indicated by the codes in the final column of Table 2 below.

Table 2. Survey questions linked to data sources

Question                                                                 Data source

1. How many children attend primary school on the island of Marino?     D, A as cross-check
2. How many of these are in their final year at primary school?         D, A as cross-check
3. Where are the primary schools on the island located?                 D, observation
and so on…

EXERCISE 11

1. Go back to your own answers to this sample question, and expand and complete the two tables above as appropriate by adding in all your extra interview and other questions.

2. Then return to the three research topics you picked in Exercise 10, and choose one of these topics, somewhat different from that about the secondary school on Marino.

3. Write out the main research question again, and then at least four sub-questions which follow from it, as in Exercise 7. These give your general and specific aims for the project.

4. Now devise some actual questionnaire, interview, or other questions which will give you the information you need on the problem. Prepare two grids like those above, and enter your information on them in the same way.


Example 2
For some research projects, particularly experimental studies, it
is possible to go one stage further, and actually list the outcome
variables and key predictor variables which are to be considered.
This was less relevant for the question on the secondary school on
the island of Marino, because the outcome was simply whether or
not a school would be built, and many of the so-called ‘predictor’
variables were really background demographic information,
political intentions by government, and central and local
perceptions of need. It would be difficult to carry out an empirical,
statistical analysis on such data.

But consider this example, drawn from an actual research project conducted in Fiji (Elley and Mangubhai, 1981).

The project arose from findings in other studies that children with
high achievement levels invariably came from schools with large
libraries and/or homes with many books. Access to books seemed
to be important for language learning, and it was hypothesised that
a substantial increase in the supply of books available to children
might improve their language learning. This was actually done by
a ‘book flood’, the donation of a large number of books for use in
school classrooms.

It turned out that the resources available were only sufficient to provide 16 classrooms with 250 books each. Eight schools were selected to receive books at two class levels, Classes 4 and 5. These were divided up into two groups of four schools each, one to adopt what was known as a Shared Book approach to reading, the other to adopt a Silent Reading approach. (These terms are described in the reference above.) A third, matched group of four schools was used as a control group in the experiment.
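As a purely hypothetical sketch (not the analysis actually reported by Elley and Mangubhai), an experiment of this kind might compare mean pre-test to post-test gains across the three groups. It assumes Python with the scipy library, and all of the gain scores below are invented.

    # Hypothetical sketch: one-way analysis of variance on mean gain scores
    # for the three groups of schools. Figures are invented.
    from scipy import stats

    gains_shared_book    = [12, 15, 9, 14]   # one mean gain per Shared Book school
    gains_silent_reading = [10, 13, 8, 11]   # one mean gain per Silent Reading school
    gains_control        = [4, 6, 3, 5]      # one mean gain per control school

    f_statistic, p_value = stats.f_oneway(gains_shared_book,
                                          gains_silent_reading,
                                          gains_control)
    print(f"F = {f_statistic:.2f}, p = {p_value:.3f}")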


Here the general research question could have been framed as follows:

Research question

What are the effects of a ‘book flood’ in school classrooms on the achievement in English language of primary school children in Fiji?

More detailed research questions followed, which are not repeated here.

Finally the various procedures and instruments were selected, and a timetable for the research drawn up (see Table 3). This is another use of a table in the form of a ‘timeline’ to describe how the research was to be ‘operationalized’, or put into practice.

Tables like this bring system to the research, and show critical
points in the timing. It is important that the research does not get
behind schedule, particularly if (as in this case) the fieldwork has
to take place at particular times to fit in with school holidays. The
delay of even a week or two here could set the whole project back a
whole term, or even a whole year.

A further use of a table (Table 4) would be to describe:

• The level of aggregation of data;

• The main outcome variables;

• The key predictor variables.


Table 3. Project timetable

Experimental group   Shared Book               Silent Reading            Control

August               Development of general research aims following discussion with
                     Ministry of Education officials
September            Preparation of specific research questions and hypotheses
                     Planning research design, including sampling plans
October              Construction and trialling of instruments
January              Briefing of teachers involved in research
February             Pre-tests                 Pre-tests                 Pre-tests
March                3-day workshop            No workshop               1-day workshop
                     Marking and analysis of pre-tests for all classes
April                250 books supplied        250 books supplied        Usual programme
                     Teacher observation in all classrooms
November             Post-tests                Post-tests                Post-tests
December             Marking and analysis of post-test for all classes
January              Further data analysis, comparing pre-test and post-test results
March                Preparation of research report, including policy recommendations


Table 4. Relationship between outcome and predictor variables

Class 5
  Predictor variables:  Pre-test: Reading comprehension (sentence completion test of 35 multiple-choice items)
  Outcome variables:    Post-test: STAF Reading Comprehension (Form Y): reading test of 6 passages and 32 multiple-choice items
                        STAF Listening Comprehension (Form Y): 36 multiple-choice items based on 7 short passages read aloud
                        English Structures Test: 20 open-ended questions
                        Composition Test: short story of at least five sentences on a set topic

Class 4
  Predictor variables:  Pre-test: As for Class 5
  Outcome variables:    Post-test: Same as the pre-test
                        English Structures Test: 35 multiple-choice questions
                        Word Recognition Test: Individual interview about background and attitude to English; then, for every second pupil only, an orally administered test to pronounce words from a 50-word graded list
                        Oral Sentence Repetition: Pupils repeat orally after the examiner a series of 28 English sentences, graded according to complexity of structure


There is no standard pattern for such tables. Each research project will have its own requirements. The important thing is that it should allow the specific aims of the project to be developed and carried out in a systematic way.

A final use of what are sometimes known as ‘dummy’ tables would be the setting up of a series of ‘empty’ tables like the ones which will appear in the final version of the report. These tables force the researcher to think very early in the research project about exactly what information will be reported, and how it will be reported.

For example, in the research question outlined in Example 1 above, it would be possible to set up a ‘dummy’ table like Table 5. This table would then be gradually filled in as the information was obtained, and would provide the basis for discussion in the report of exactly how many students were involved. A similar type of ‘dummy’ table (Table 6) could be used for the second question, ‘Where is the need?’

Table 5. Student enrolments in Marino primary school, by grade level and tribal region

                           Tribal region
Grade level     A       B       C       D       E

K
1
2
3
4
5
6

Total


Table 6. Location of secondary school preferred by various stakeholders

                                  Source of data
Tribal        Ministry of Education         Principals of
region        National        Local         primary school      Parents      Tribal heads

Region A
Region B
Region C
Region D
Region E

For the third research question, Table 7 might give a convenient way
of recording the information, by classifying the reasons given by
various groups in their efforts to justify the location of a secondary
school. These reasons would be drawn from open-ended questions
asked at interviews, and coded afterwards into a few convenient
and logically coherent categories.

Table 7. Reasons given by various stakeholders to justify need for a secondary school

                                  Source of data
              Ministry of Education         Principals of
Reason        National        Local         primary school      Parents      Tribal heads

Reason 1
Reason 2
Reason 3
Reason 4

6 Summary
You have now come to the end of this module. In it you have traced
the path that a research worker engaged in policy research must
tread. You have:

• prepared the ground, by establishing exactly for whom the research is being done, and to whom its results are to be directed;

• examined the issues, and decided what is researchable and what is not, through discussion with the policy makers;

• established priorities for what can be done, and defined the content and level of the proposed research in a systematic way;

• considered appropriate research questions and hypotheses, and decided on a suitable methodology for the projects to be undertaken, after a literature search;

• planned the research projects in outline, and made some decisions on how to operationalize them and present the results.

Other modules in this series will dwell in more detail on some of these matters, such as carrying out a literature search, working out suitable experimental designs, designing interview schedules, analysing results, and so on.

But you should now have a good grip on the aims of research into
educational policy issues, and how to get started on it.

7 Annotated bibliography
Borg, Walter R. and Meredith D. Gall (1983). Educational Research:
An Introduction. London, Longman. [This text contains a good
section on developing the research proposal and planning the
research.]

Burgess, Robert G. (ed) (1985). Strategies of Educational Research:


Qualitative Methods. London, The Falmer Press.

Charles, C. M. (1988). Introduction to Educational Research. New York, Longman. [This introductory text gives a straightforward treatment of issues relating to the initial planning of educational research.]

Elley, Warwick B. and Francis Mangubhai (1981). The Impact of


a Book Flood in Fiji Primary Schools. Studies in South Pacific
Education, No.1. Wellington, New Zealand Council for
Educational Research, and Institute of Education, University of
South Pacific. [This small booklet contains a clear account of an
actual piece of well-planned, experimental research carried out
in the South Pacific. It is referred to in the text.]

Fox, David J. (1969). The Research Process in Education. New York,


Holt, Rinehart and Winston.

Glass, G.V. (1985). Class Size. International Encyclopaedia of


Education. Oxford, Pergamon Press. [This well-known example
of meta-analysis of the effects of class size concludes that there
are some advantages in smaller class sizes, provided the method
of instruction is varied to suit, but only for quite small class
sizes. The issue is still under debate.]


Husén, Torsten and Maurice Kogan (eds) (1984). Educational Research and Policy: How do they Relate? Oxford, Pergamon Press. [This is a standard text on the topic, drawn from the proceedings of a four-day symposium at Wijk, Lidingö-Stockholm in June 1982, and containing a number of country experiences on the way this relationship has been worked out, preceded by an expanded introduction by Torsten Husén on more general issues.]

Light, Richard J., Judith D. Singer and John B. Willett (1990). By


Design: Planning Research on Higher Education. Cambridge,
Harvard University Press. [This book contains a very good
section on preparing research questions.]

Livingstone, Ian D. (1985). Standards, National: Monitoring.


International Encyclopaedia of Education. Oxford, Pergamon
Press. [This article summarises the various ways in which
levels of performance are monitored nationally in a number of
countries.]

Livingstone, Ian D., Barry Eagle and John Laurie (1988). The
Computer as a Diagnostic Tool in Mathematics. Study 13 of the
Evaluation of Exploratory Studies in Educational Computing.
Wellington, New Zealand Council for Educational Research.

Livingstone, Ian D. (1990). Assessment for What? The Director’s


Commentary in Fifty Fifth Annual Report. Wellington, New
Zealand Council for Educational Research. [This essay provides
a framework for considering various types of assessment, in
relation to the time at which the assessment takes place and the
outcome which is intended.]

Nisbet, John and Patricia Broadfoot (1980). The Impact of Research on Policy and Practice in Education. Aberdeen, Aberdeen University Press. [This study focusses mainly on the developed countries of Europe and the United States, and finds that in these settings, educational research has not made a large impact on policy formation.]

Nisbet, John and Stanley Nisbet (eds) (1985). Research, Policy


and Practice. World Yearbook of Education 1985. London,
Kogan Page. [Part 1 of this standard work contains reviews
of educational research in 14 countries, in which the authors
describe the organisations responsible for educational research
and development and outline their recent history, describe
how research is funded, and analyse underlying assumptions
about the nature and function of research in education. In Part
2, five authors describe new models and styles of educational
research and development, and review recent thinking on the
relationship between teacher, researcher and policy maker. The
book contains an extensive bibliography.]

Postlethwaite, T.N. and K.N. Ross (1986). Indonesia: Joint Assignment


Report. Office of Educational and Cultural Research and
Development, Ministry of Education and Culture, Jakarta.
[This report, the result of a consultancy under UNDP/Unesco
Project INS/85/022, and including a four day workshop, is full
of very practical suggestions on how research and development
can be useful to the policy maker. The present module has
drawn heavily from this source, which is hereby gratefully
acknowledged.]

Ross, Kenneth (1985). The Measurement of Disadvantage. set No.1,


Item 3. Wellington, New Zealand Council for Educational
Research.

Ross, Kenneth N. and Lars Mählck (eds) (1990). Planning the Quality of Education: The Collection and Use of Data for Informed Decision-Making. Paris, UNESCO/International Institute for Educational Planning. [This book was prepared from papers and discussions associated with an international workshop on Issues and Practices in Planning the Quality of Education organised by the International Institute for Educational Planning in November 1989. Amongst other useful contributions, it contains a chapter on improving the dialogue between the producers and consumers of educational information, and an agenda for international action which has led to the preparation of the current series of modules.]

Ross, Kenneth N. (1990). The potential contribution of research


to educational policies for the poor in Asia. Prospects, Vol XX,
No.4. [This article contains summaries of the research and
implications for action of five relatively recent investigations of
national policy issues. The author argues that there is a great
deal of valuable information ‘locked up’ in rarely read reports of
important educational research studies.]

Travers, Kenneth J. and Ian Westbury (eds) (1989). The IEA Study
of Mathematics I: Analysis of Mathematics Curricula. Oxford,
Pergamon Press, for the International Association for [the
Evaluation of] Educational Achievement. [This is the first of a
three-volume work on the massive collaborative, international
study of mathematics carried out under the auspices of the IEA
during the 1980s. This volume on the mathematics curriculum
breaks new ground in the way in which it conceptualises and
measures the various curriculums. See pp.5-10.]

Tuckman, Bruce W. (1988). Conducting Educational Research. Third


edition. London, Harcourt, Brace Jovanovich. [This standard
reference contains sections on selecting a problem and
constructing hypotheses.]

Wallen, N.E. (1974). Educational Research: A Guide to the Process.


Belmont, California, Wadsworth, pp.1-2. [Contains a simple
definition of the difference between basic and applied research.]


Weiss, Carol (ed) (1977). Using Social Research in Public Policy-


Making. Lexington Books, Lexington, Mass.

Weiss, Carol (1979). The many meanings of research utilization.


Public Administration Review, Sept/Oct, pp.426-431. [These and
other publications by Carol Weiss outline a very comprehensive
taxonomy of seven models describing the way in which
educational research results can make an impact on policy
makers.]

Wylie, Cathy (1990). The Impact of “Tomorrow’s Schools” in Primary


Schools and Intermediates: 1990 Survey Report. Wellington, New
Zealand Council for Educational Research.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana,


Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa,
Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed


by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

3
Module

Kenneth N. Ross

Sample design for


educational survey
research

UNESCO International Institute for Educational Planning


Quantitative research methods in educational planning

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible
for the educational policy research programme conducted by the Southern and
Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ).

The publication is available from the following two Internet Websites:


http://www.sacmeq.org and http://www.unesco.org/iiep.

International Institute for Educational Planning/UNESCO


7-9 rue Eugène-Delacroix, 75116 Paris, France
Tel: (33 1) 45 03 77 00
Fax: (33 1 ) 40 72 83 66
e-mail: information@iiep.unesco.org
IIEP web site: http://www.unesco.org/iiep

September 2005 © UNESCO

The designations employed and the presentation of material throughout the publication do not imply the expression of
any opinion whatsoever on the part of UNESCO concerning the legal status of any country, territory, city or area or of
its authorities, or concerning its frontiers or boundaries.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means: electronic, magnetic tape, mechanical, photocopying, recording or otherwise, without permission
in writing from UNESCO (International Institute for Educational Planning).

Graphic design: Sabine Lebeau


Typesetting: Sabine Lebeau
Printed in IIEP’s printshop
Module 3 Sample design for educational survey research

Content

1. Basic concepts of sample design


for educational survey research 1
Populations : desired, defined, and excluded 2
Sampling frames 3
Representativeness 4
Probability samples and non-probability samples 5
Types of non-probability samples 6
1. Judgement sampling 7
2. Convenience sampling 7
3. Quota sampling 8

Types of probability samples 8


1. Simple random sampling 9
2. Stratified sampling 10
3. Cluster sampling 11

The accuracy of estimates obtained from probability samples 13


1. Mean square error 13
2. The accuracy of individual sample estimates 15
3. Comparison of the accuracy of probability samples 16

Sample design for two-stage cluster samples 17


1. The design effect for two-stage cluster samples 18
2. The effective sample size for two-stage cluster samples 19

Sample design tables for two-stage cluster samples 20


2. How to draw a national sample of students:


a hypothetical example for ‘country x’ 27

3. How to draw a national sample of schools


and students: a real example for Zimbabwe 48

4. The estimation of sampling errors 64


The Jackknife Procedure 66

5. References 70

Appendix 1
Random number tables for selecting a simple random sample
of twenty students from groups of students of size 21 to 100 73
Appendix 2
Sample design tables (for roh values of 0.1 to 0.9) 78
Appendix 3
Estimation of the coefficient of intraclass correlation 81



1 Basic concepts of sample design for educational survey research

Sampling in educational research is generally conducted in order to permit the detailed study of part, rather than the whole, of a population. The information derived from the resulting sample is customarily employed to develop useful generalizations about the population. These generalizations may be in the form of estimates of one or more characteristics associated with the population, or they may be concerned with estimates of the strength of relationships between characteristics within the population.

Provided that scientific sampling procedures are used, the selection of a sample often provides many advantages compared with a complete coverage of the population: for example, reduced costs associated with gathering and analyzing the data, reduced requirements for trained personnel to conduct the fieldwork, improved speed in most aspects of data summarization and reporting, and greater accuracy due to the possibility of more intense supervision of fieldwork and data preparation operations.

The social science research situations in which sampling is used may be divided into the following three broad categories: experiments – in which the introduction of treatment variables occurs according to a pre-arranged experimental design and all extraneous variables are either controlled or randomized; surveys – in which all members of a defined target population have a known non-zero probability of selection into the sample; and investigations – in which data are collected without either the randomization of experiments or the probability sampling of surveys.

Experiments are strong with respect to internal validity because they are concerned with the question of whether a true measure of the effect of a treatment variable has been obtained for the subjects in the experiment. Surveys, on the other hand, are strong with respect to external validity because they are concerned with the question of whether the findings obtained for the subjects in the survey may be generalized to a wider population. Investigations are weak with respect to both internal and external validity and their use is due mainly to convenience or low cost.

Populations: desired, defined, and excluded
In any educational research study it is important to have a precise
description of the population of elements (persons, organizations,
objects, etc.) that is to form the focus of the study. In most studies
this population will be a finite one that consists of elements
which conform to some designated set of specifications. These
specifications provide clear guidance as to which elements are to be
included in the population and which are to be excluded.

In order to prepare a suitable description of a population it is essential to distinguish between the population for which the results are ideally required, the desired target population, and the population which is actually studied, the defined target population. An ideal situation, in which the researcher had complete control over the research environment, would lead to both of these populations containing the same elements. However, in most studies, some differences arise due, for example, to (a) noncoverage: the population description may accidentally omit some elements because the researcher has no knowledge of their existence, (b) lack of resources: the researcher may intentionally exclude some elements from the population description because the costs of their inclusion in data gathering operations would be prohibitive, or (c) an ageing population description: the population description may have been prepared at an earlier date and therefore it includes some elements which have ceased to exist.

The defined target population provides an operational definition which may be used to guide the construction of a list of population elements, or sampling frame, from which the sample may be drawn. The elements that are excluded from the desired target population in order to form the defined target population are referred to as the excluded population.

Sampling frames
The selection of a sample from a defined target population requires
the construction of a sampling frame. The sampling frame is
commonly prepared in the form of a physical list of population
elements – although it may also consist of rather unusual listings,
such as directories or maps, which display less obvious linkages
between individual list entries and population elements. A
well-constructed sampling frame allows the researcher to ‘take hold’
of the defined target population without the need to worry about
contamination of the listing with incorrect entries or entries which
represent elements associated with the excluded population.

Generally the sampling frame incorporates a great deal more structure than one would expect to find in a simple list of elements. For example, in a series of large-scale studies of Reading Literacy carried out in 30 countries during 1991 (Ross, 1991), sampling frames were constructed which listed schools according to a number of stratification variables: size (number of students), program (for example, comprehensive or selective), region (for example, urban or rural), and sex composition (single sex or coeducational). The use of these stratification variables in the construction of sampling frames was due, in part, to the need to present research results for sample data that had been drawn from particular strata within the sampling frame.

Representativeness
The notion of ‘representativeness’ is a frequently used, and often
misunderstood, notion in social science research. A sample is often
described as being representative if certain percentage frequency
distributions of element characteristics within the sample data are
similar to corresponding distributions within the whole population.

The population characteristics selected for these comparisons are referred to as ‘marker variables’. These variables are usually selected from among those demographic variables that are readily available for both population and sample. Unfortunately, there are no objective rules for deciding which variables should be nominated as marker variables. Further, there are no agreed benchmarks for assessing the degree of similarity required between percentage frequency distributions for a sample to be judged as ‘representative of the population’.

It is important to note that a high degree of representativeness in a set of sample data refers specifically to the marker variables selected for analysis. It does not refer to other variables assessed by the sample data and therefore does not necessarily guarantee that the sample data will provide accurate estimates for all element characteristics. The assessment of the accuracy of sample data can only be discussed meaningfully with reference to the value of the mean square error, calculated separately, for particular sample estimates (Ross, 1978).
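For readers unfamiliar with the term, the mean square error of a sample estimate is conventionally the sum of its sampling variance and the square of its bias. The tiny sketch below simply illustrates that relationship with invented numbers; it is not a calculation taken from this module.

    # Illustrative sketch only: mean square error = sampling variance + bias squared.
    sampling_variance = 4.0   # variance of the estimator over repeated samples (invented)
    bias = 1.5                # average difference between estimate and true value (invented)

    mean_square_error = sampling_variance + bias ** 2
    print(f"Mean square error = {mean_square_error}")   # 4.0 + 2.25 = 6.25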


The most popular marker variables in the field of education have commonly been demographic factors associated with students (sex, age, socio-economic status, etc.) and schools (type of school, school location, school size, etc.). For example, in a series of educational research studies carried out in the United States during the early 1970’s, Wolf (1977) selected the following marker variables: sex of student, father’s occupation, father’s education, and mother’s education. These variables were selected because their percentage frequency distributions could be obtained for the population from tabulations prepared by the Bureau of the Census.

Probability samples and non-probability samples
The use of samples in educational research is usually followed
by the calculation of sample estimates with the aim of either
(a) estimating the values of population parameters from sample
statistics, or (b) testing statistical hypotheses about population
parameters. These two aims require that the researcher has some
knowledge of the accuracy of the values of sample statistics as
estimates of the relevant population parameters. The accuracy of
these estimates may generally be derived from statistical theory –
provided that probability sampling has been employed. Probability
sampling requires that each member of the defined target
population has a known, and non-zero, chance of being selected
into the sample.

In contrast, the stability of sample estimates based on non-probability sampling cannot be discovered from the internal evidence of a single sample. That is, it is not possible to determine whether a non-probability sample is likely to provide very accurate or very inaccurate estimates of population parameters. Consequently, these types of samples are not appropriate for dealing objectively with issues concerning either the estimation of population parameters or the testing of hypotheses.

The use of non-probability samples is sometimes carried out with the (usually implied) justification that estimates derived from the sample may be linked to some hypothetical universe of elements rather than to a real population. This justification may lead to research results which are not meaningful if the gap between the hypothetical universe and any relevant real population is too large.

In some circumstances, a well-planned probability sample design can be turned accidentally into a non-probability sample design if subjective judgement is exercised at any stage during the execution of the sample design. Some researchers fall into this trap through a lack of control of field operations at the final stage of a multi-stage sample design.

The most common example of this in educational settings occurs when the researcher goes to great lengths in drawing a probability sample of schools, and then leaves it to the initiative of teaching staff in the sampled schools to select a ‘random sample’ of students or classes.

Types of non-probability samples


There are three main types of non-probability samples: judgement,
convenience, and quota samples. These approaches to sampling
result in the elements in the target population having an unknown
chance of being selected into the sample. It is always wise to treat
research results arising from these types of sample design as
suggesting statistical characteristics about the population – rather
than as providing population estimates with specifiable confidence
limits.


1. Judgement sampling
The process of judgement, or purposive, sampling is based on the
assumption that the researcher is able to select elements which
represent a ‘typical sample’ from the appropriate target population.
The quality of samples selected by using this approach depends
on the accuracy of subjective interpretations of what constitutes a
typical sample.

It is extremely difficult to obtain meaningful results from a judgement sample because no two experts will agree upon the exact composition of a typical sample. Therefore, in the absence of an external criterion, there is no way in which the research results obtained from one judgement sample can be judged as being more accurate than the research results obtained from another.

2. Convenience sampling
A sample of convenience is the terminology used to describe a
sample in which elements have been selected from the target
population on the basis of their accessibility or convenience to the
researcher.

Convenience samples are sometimes referred to as ‘accidental samples’ for the reason that elements may be drawn into the sample simply because they just happen to be situated, spatially or administratively, near to where the researcher is conducting the data collection.

The main assumption associated with convenience sampling is that the members of the target population are homogeneous. That is, that there would be no difference in the research results obtained from a random sample, a nearby sample, a co-operative sample, or a sample gathered in some inaccessible part of the population.


As for judgement sampling, there is no way in which the researcher may check the precision of one sample of convenience against another. Indeed the critics of this approach argue that, for many research situations, readily accessible elements within the target population will differ significantly from less accessible elements. They therefore conclude that the use of convenience sampling is likely to introduce a substantial degree of bias into sample estimates of population parameters.

3. Quota sampling
Quota sampling is a frequently used type of non-probability
sampling. It is sometimes misleadingly referred to as ‘representative
sampling’ because numbers of elements are drawn from various
target population strata in proportion to the size of these strata.

While quota sampling places fairly tight restrictions on the number of sample elements per stratum, there is often little or no control exercised over the procedures used to select elements within these strata. For example, either judgement or convenience sampling may be used in any or all of the strata. Therefore, the superficial appearance of accuracy associated with proportionate representation of strata should be considered in the light that there is no way of checking either the accuracy of estimates obtained for any one stratum, or the accuracy of estimates obtained by combining individual stratum estimates.

Types of probability samples


There are many ways in which a probability sample may be drawn from a population. The method that is most commonly described in textbooks is simple random sampling. This method is rarely used in practical social research situations because (a) the selection and measurement of individual population elements is often too expensive, and (b) certain complexities may be introduced intentionally into the sample design in order to address more appropriately the objectives and administrative constraints associated with the research. The complexities most often employed in educational research include the use of stratification techniques, cluster sampling, and multiple stages of selection.

1. Simple random sampling


The selection of a simple random sample is usually carried out
according to a set of mechanical instructions which guarantees the
random nature of the selection procedure.

For example, Kish (1965) provides the following operational


definition in order to describe procedures for the selection of a
simple random sample of elements without replacement from a
finite population of elements:

From a table of random digits select with equal probability n different


selection numbers, corresponding to n of the N listing numbers of the
population elements. The n listings selected from the list, on which each
of the N population elements is represented separately by exactly one
listing, must identify uniquely n different elements. (p. 36)

Simple random sampling, as described in this definition, results in


an equal probability of selection for all elements in the population.
This characteristic, called ‘epsem sampling’ (equal probability of
selection method), is not restricted solely to this type of sample
design. Equal probability of selection can result from either the use
of equal probabilities of selection throughout the sampling process,
or from the use of varying probabilities that compensate for each other
through several stages of multistage sampling. Epsem sampling
is widely applied in educational research because it usually leads
to self-weighting samples in which the simple arithmetic mean
obtained from the sample data is an unbiased estimate of the
population mean.
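
For readers who wish to check these ideas with a few lines of code, the following minimal sketch (in Python, using an invented list of population elements rather than data from any real study) draws a simple random sample without replacement; every element has the same probability of selection, n/N.

import random

# Invented population of N = 24 elements labelled a to x.
population = list("abcdefghijklmnopqrstuvwx")

n = 4                                  # required sample size
sample = random.sample(population, n)  # simple random sampling without replacement

print("Selected elements:", sample)
print("Selection probability for every element:", round(n / len(population), 3))  # 1/6, about 0.167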


2. Stratified sampling
The technique of stratification is often employed in the preparation
of sample designs because it generally provides increased accuracy
in sample estimates without leading to substantial increases in
costs. Stratification does not imply any departure from probability
sampling – it simply requires that the population be divided
into subpopulations called strata and that probability sampling
be conducted independently within each stratum. The sample
estimates of population parameters are then obtained by combining
information from each stratum.

In some studies, stratification is used for reasons other than


obtaining gains in sampling accuracy. For example, strata may be
formed in order to employ different sample designs within strata, or
because the subpopulations defined by the strata are designated as
separate ‘domains of study’ (Kish, 1987, p. 34).

Variables used to stratify populations in education generally


describe demographic aspects concerning schools (for example,
location, size, and program) and students (for example, age, sex,
grade level, and socio-economic status).

Stratified sampling may result in either proportionate or


disproportionate sample designs. In a proportionate stratified
sample design the number of observations in the total sample is
allocated among the strata of the population in proportion to the
relative number of elements in each stratum of the population.

That is, a stratum containing a given percentage of the elements in


the population would be represented by the same percentage of the
total number of sample elements. In situations where the elements
are selected with equal probability within strata, this type of sample
design results in epsem sampling and therefore ‘self-weighted’
estimates of population parameters.


In contrast, a disproportionate stratified sample design is associated


with the use of different probabilities of selection, or sampling
fractions, within the various population strata. This can sometimes
occur when the sample is designed to achieve greater overall
accuracy than proportionate stratification by using ‘optimum
allocation’ (Kish, 1965:92). More commonly, disproportionate
sampling is used in order to ensure that the accuracy of sample
estimates obtained for stratum parameters is sufficiently high to be
able to make meaningful comparisons between strata.

The sample estimates derived from a disproportionate sample


design are generally prepared with the assistance of ‘weighting
factors’. These factors, represented either by the inverse of the
selection probabilities or by a set of numbers proportional to
them, are employed in order to prevent inequalities in selection
probabilities from causing the introduction of bias into sample
estimates of population parameters. The reciprocals of the selection
probabilities, sometimes called ‘raising factors’, refer to the number of
elements in the population represented by a sample element (Ross,
1978).

In the field of educational research, the weighting factors are often


calculated so as to ensure that the sum of the weighting factors over
all elements in the sample is equal to the sample size. This ensures
that the readers of research reports are not confused by differences
between actual and weighted sample sizes.
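
A minimal sketch of this weighting procedure is given below (in Python, with invented selection probabilities rather than values from any real study). The raising factor of each element is the reciprocal of its selection probability, and the raising factors are then rescaled so that they sum to the sample size.

# Invented selection probabilities for a disproportionate stratified sample
# of six elements (three from each of two strata).
selection_probabilities = [0.02, 0.02, 0.02, 0.05, 0.05, 0.05]

# 'Raising factors': the number of population elements represented by each
# sample element (the reciprocals of the selection probabilities).
raising_factors = [1.0 / p for p in selection_probabilities]

# Rescale the factors so that the weights sum to the actual sample size.
n = len(selection_probabilities)
scale = n / sum(raising_factors)
weights = [w * scale for w in raising_factors]

print("Raising factors :", raising_factors)
print("Rescaled weights:", [round(w, 3) for w in weights])
print("Sum of weights  :", round(sum(weights), 3))   # equal to the sample size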

3. Cluster sampling
A population of elements can usually be thought of as a hierarchy
of different sized groups or ‘clusters’ of sampling elements. These
groups may vary in size and nature. For example, a population of
school students may be grouped into a number of classrooms, or it
may be grouped into a number of schools. A sample of students may
then be selected from this population by selecting clusters of students


as classroom groups or school groups rather than individually as


would occur when using a simple random sample design.

The use of cluster sampling in educational research is sometimes


undertaken as an alternative to simple random sampling in order to
reduce research costs for a given sample size. For example, a cluster
sample consisting of the selection of 10 classes – each containing
around 20 students – would generally lead to smaller data collection
costs compared with a simple random sample of 200 students. The
reduced costs occur because the simple random sample may require
the researcher to collect data in as many as 200 schools.

Cluster sampling does not prevent the application of probability


sampling techniques. This may be demonstrated by examining
several ways in which cluster samples may be drawn from a
population of students. Consider the hypothetical population,
described in Figure 1, of twenty-four students distributed among six
classrooms (with four students per class) and three schools (with
two classes per school).

A simple random sample of four students drawn without


replacement from this population would result in an epsem sample
with each element having a probability of selection, p, equal to 1/6
(Kish, 1965:40). A range of cluster samples, listed below with their
associated p values, may also be drawn in a manner which results
in epsem samples.

• Randomly select one class, then include all students in this class
in the sample. (p = 1/6 x 4/4 = 1/6).

• Randomly select two classes, then select a random sample of


two students from within these classes. (p = 2/6 x 2/4 = 1/6).

• Randomly select two schools, then select a random sample


of one class from within each of these schools, then select a
random sample of two students from within these classes. (p =
2/3 x 1/2 x 2/4 = 1/6).


Figure 1. Hypothetical Population of 24 students

School 1: Class 1 (students a b c d); Class 2 (students e f g h)
School 2: Class 3 (students i j k l); Class 4 (students m n o p)
School 3: Class 5 (students q r s t); Class 6 (students u v w x)
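
The equality of these selection probabilities can also be checked empirically. The short sketch below (in Python) repeatedly applies the third design to the hypothetical population of Figure 1 – two schools, one class per school, two students per class – and estimates each student's chance of entering the sample; every estimate should be close to 1/6.

import random
from collections import Counter

# The hypothetical population of Figure 1: 3 schools x 2 classes x 4 students.
schools = {
    "School 1": {"Class 1": list("abcd"), "Class 2": list("efgh")},
    "School 2": {"Class 3": list("ijkl"), "Class 4": list("mnop")},
    "School 3": {"Class 5": list("qrst"), "Class 6": list("uvwx")},
}

counts = Counter()
trials = 100000
for _ in range(trials):
    for school in random.sample(list(schools), 2):                       # 2 of the 3 schools
        chosen_class = random.choice(list(schools[school]))              # 1 of the 2 classes
        counts.update(random.sample(schools[school][chosen_class], 2))   # 2 of the 4 students

# Each student should appear with relative frequency close to 1/6 = 0.167.
for student in "abcdefghijklmnopqrstuvwx":
    print(student, round(counts[student] / trials, 3))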

The accuracy of estimates obtained from probability samples
The degree of accuracy associated with a sample estimate derived
from any one probability sample may be judged by the difference
between the estimate and the value of the population parameter
which is being estimated. In most situations the value of the
population parameter is not known and therefore the actual
accuracy of an individual sample estimate cannot be calculated in
absolute terms. Instead, through a knowledge of the behaviour of
estimates derived from all possible samples which can be drawn
from the population by using the same sample design, it is possible
to estimate the probable accuracy of the obtained sample estimate.

1. Mean square error


Consider a probability sample of n elements which is used to
calculate the sample mean as an estimate of the population mean. If
an infinite set of samples of size n were drawn independently from


this population and the sample mean calculated for each sample,
the average of the resulting sampling distribution of sample means
would be referred to as the expected value.

The accuracy of the sample mean as an estimator of the population


parameter may be summarized in terms of the mean square error
(MSE). The MSE is defined as the average of the squares of the
deviations of all possible sample estimates from the value being
estimated (Hansen et al, 1953).

A sample design is unbiased if the expected value of the sample


mean is equal to the population mean. It is important to remember
that ‘bias’ is not a property of a single sample, but of the entire
sampling distribution, and that it belongs neither to the selection
nor the estimation procedure alone, but to both jointly.

For most well-designed samples in educational research the


sampling bias is usually very small – tending towards zero with
increasing sample size. The accuracy of sample estimates is
therefore generally assessed in terms of the magnitude of the
variance term in the above equation.
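
These definitions can be illustrated by a small simulation. The sketch below (in Python, with an artificial population of scores invented for the purpose) draws a large number of simple random samples of the same size and then computes the expected value, bias, variance, and mean square error of the sample mean; it also shows that the MSE is (apart from simulation error) the variance plus the square of the bias.

import random
import statistics

random.seed(1)

# Artificial population of 1000 scores.
population = [random.gauss(50, 10) for _ in range(1000)]
population_mean = statistics.mean(population)

# Draw many independent simple random samples of size n and keep the sample means.
n, replications = 20, 5000
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(replications)]

expected_value = statistics.mean(sample_means)
bias = expected_value - population_mean
variance = statistics.pvariance(sample_means)
mse = statistics.mean((m - population_mean) ** 2 for m in sample_means)

print("Expected value   :", round(expected_value, 3))
print("Bias             :", round(bias, 4))      # close to zero
print("Variance         :", round(variance, 3))
print("Mean square error:", round(mse, 3))       # approximately variance + bias squared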


2. The accuracy of individual sample estimates


In educational settings the researcher is usually dealing with
a single sample of data and not with all possible samples from
a population. The variance of sample estimates as a measure
of sampling accuracy cannot therefore be calculated exactly.
Fortunately, for many probability sample designs, statistical theory
may be used to derive formulae which provide estimates of the
variance based on the internal evidence of a single sample of data.

For a simple random sample of n elements drawn without


replacement from a population of N elements, the variance of the
sample mean may be estimated from a single sample of data by
using the following formula:

var(sample mean) = ((N - n) / N) × (s² / n)

where s² is the usual sample estimate of the variance of the element values in the population (Kish, 1965:41).

For sufficiently large values of N, the value of the finite population


correction, (N - n)/N, tends toward unity. The variance of the
sample mean in this situation may be estimated to be equal to s²/n.

The sampling distribution of the sample mean is approximately


normally distributed for many educational sampling situations. The
approximation improves with increased sample size even though
the distribution of elements in the parent population may be far
from normal. This characteristic of sampling distributions is known
as the Central Limit Theorem and it occurs not only for the sample
mean but also for most estimators commonly used to describe
survey research results (Kish, 1965).


From a knowledge of the properties of the normal distribution we


know that we can be “68 per cent confident” that the population
mean lies within the range specified by:

(the sample mean ± one standard error of the sample mean)

where the standard error of the sample mean is equal to the square
root of the variance of the sample mean. Similarly, we can be “95 per cent confident” that the population mean lies within the range
specified by:

(sample mean ± two standard errors of the sample mean)

While the above discussion has concentrated mostly on sample


means derived from simple random samples, the same approach
may be used to set up confidence limits for many other population
values derived from various types of sample designs. For example,
confidence limits may be calculated for complex statistics such as
correlation coefficients, regression coefficients, multiple correlation
coefficients, etc. (Ross, 1978).
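
A minimal sketch of these calculations is given below (in Python, using an artificial sample of scores rather than real survey data). It estimates the variance of the sample mean with the finite population correction and then forms the approximate 68 per cent and 95 per cent confidence limits for the population mean.

import math
import random
import statistics

random.seed(2)

N = 5000                                              # population size (assumed known)
sample = [random.gauss(50, 10) for _ in range(200)]   # artificial sample of n = 200 scores
n = len(sample)

mean = statistics.mean(sample)
s2 = statistics.variance(sample)          # usual sample estimate of the element variance
var_mean = ((N - n) / N) * s2 / n         # variance of the sample mean (with fpc)
se = math.sqrt(var_mean)                  # standard error of the sample mean

print("Sample mean          :", round(mean, 2))
print("Standard error       :", round(se, 3))
print("68% confidence limits:", round(mean - se, 2), "to", round(mean + se, 2))
print("95% confidence limits:", round(mean - 2 * se, 2), "to", round(mean + 2 * se, 2))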

3. Comparison of the accuracy of probability samples
The accuracy of probability samples is usually compared by
considering the variances associated with a particular sample
estimate for a given sample size. This comparison has, in recent
years, been based on the recommendation put forward by Kish
(1965) that the simple random sample design should be used as a
standard for quantifying the accuracy of a variety of probability
sample designs which incorporate such complexities as stratification
and clustering. Kish (1965:162) introduced the term ‘deff’ (design
effect) to describe the ratio of the variance of the sample mean for a
complex sample design, denoted c, to the variance of the sample mean for a simple random sample, denoted srs, of the same size.


deff = var(c) / var(srs)

The potential for arriving at false conclusions in educational


research by using incorrect sampling error calculations has been
demonstrated in a study carried out by Ross (1976). This study
showed that it was highly misleading to assume that sample size
was, in itself, an adequate indicator of the sampling accuracy
associated with complex sample designs. For example, Ross (1976:
40) demonstrated that a two-stage cluster sample of 150 students
(that was selected for a study conducted in Australia by randomly
selecting 6 classes followed by the random selection of 25 students
within these classes) had the same sampling accuracy for sample
means as would a simple random sample of 20 students.
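
The following short sketch (in Python, with invented variance estimates rather than the actual figures from the Ross study) shows how deff is obtained from the two variances and how it can be read as a statement about ‘equivalent’ simple random sample sizes.

# Invented variance estimates of the sample mean, for the same sample size.
var_complex = 0.75   # variance under the complex (e.g. two-stage cluster) design
var_srs = 0.10       # variance under a simple random sample of the same size

deff = var_complex / var_srs
print("Design effect (deff):", round(deff, 3))                        # 7.5 in this illustration

# A complex sample of n_c elements has the accuracy of a simple random
# sample of about n_c / deff elements.
n_c = 150
print("Equivalent simple random sample size:", round(n_c / deff, 1))  # 20 students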

Sample design for two-stage cluster samples
The two-stage cluster sample is probably the most commonly used
sample design in educational research. This design is generally
employed by selecting either schools or classes at the first stage of
sampling, followed by the selection of either clusters of students
within schools or clusters of students within classes at the second
stage. In many studies the two-stage cluster design is preferred
because this design offers an opportunity for the researcher to
conduct analyses at more than one level of data aggregation. For
example, the selection of students within classes at the second
stage of sampling would, provided there were sufficient numbers
of classes and numbers of students selected within classes, permit
analyses to be carried out at (a) the between-student level (by


using data describing individual students), (b) the between-class


level (by using data based on class mean scores), or (c) both levels
simultaneously using ‘multilevel analysis’ procedures.

In the following discussion some properties of two-stage


cluster samples have been described by using the simplifying
assumption that the clusters in the population are of equal size.
This assumption permits the use of ‘sample design tables’ (see
Appendix 2) which provide an excellent aid for designing two-stage
cluster samples with pre-specified levels of sampling accuracy.

1. The design effect for two-stage cluster samples
The value of the design effect (Kish, 1965:257) for a two-stage
cluster sample design depends, for a given cluster size, on the value
of the coefficient of intraclass correlation.

deff = var(c) / var(srs) = 1 + (b - 1) × roh

where b is the size of the selected clusters, and roh is the coefficient of intraclass correlation.

The coefficient of intraclass correlation, often referred to as roh,


provides a measure of the degree of homogeneity within clusters.
In educational settings the notion of homogeneity within clusters
may be observed in the tendency of student characteristics to be more homogeneous within schools, or classes, than would be the case
if students were assigned to schools, or classes, at random. This
homogeneity may be due to common selective factors (for example,


residential zoning of schools), or to joint exposure to the same


external influences (for example, teachers and school programs), or
to mutual interaction (for example, peer group pressure), or to some
combination of these.

2. The effective sample size for two-stage cluster samples
The “effective sample size” (Kish, 1965:259) for a given two-stage
cluster sample is equal to the size of the simple random sample
which has a level of sampling accuracy, as measured by the variance
of the sample mean, which is equal to the sampling accuracy of
the given two-stage cluster sample. A little algebra may be used to
demonstrate that the actual sample size, nc, and the effective sample
size, n*, for a two-stage cluster sample are related to the design
effect associated with that sample in the following manner (Ross,
1978:137-138).

nc = n* × deff

From previous discussion, we may replace deff in this formula by an


expression which is a function of the cluster size and the coefficient
of intraclass correlation.

nc = n* × (1 + (b - 1) × roh)

For example, consider a two-stage cluster sample based on a sample


of 10 schools followed by the selection of 20 students per school.
In addition, consider a student characteristic (for example, a test
score or attitude scale score) for which the value of the coefficient
of intraclass correlation is equal to 0.1. This value of roh would


be typical for clusters of students selected randomly from within


secondary schools in Australia (Ross, 1983). In this situation, the
above formula simplifies to the following expression.

200 = n* × (1 + (20 - 1) × 0.1)

Solving this equation for n* gives a value of 69 for the value of


the effective sample size. That is, given the value of 0.1 for roh, a
two-stage cluster sample of size 200 that is selected by sampling 10
schools followed by sampling 20 students per school would have
sampling accuracy equivalent to a simple random sample of 69
students.
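
A minimal sketch of this calculation in Python follows; it simply reproduces the example above (10 schools, 20 students per school, roh = 0.1).

def effective_sample_size(n_clusters, cluster_size, roh):
    """Effective sample size of a two-stage cluster sample."""
    deff = 1 + (cluster_size - 1) * roh   # design effect
    n_c = n_clusters * cluster_size       # actual sample size
    return n_c / deff

# 10 schools, 20 students per school, roh = 0.1.
print(round(effective_sample_size(10, 20, 0.1)))   # about 69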

For a given population of students, the value of roh tends to be


higher for clusters based on classes rather than clusters based
on schools. Ross (1978) has obtained values of roh as high as 0.5
for mathematics test scores based on classes within Australian
secondary schools.

Sample design tables for two-stage cluster samples
Sample design tables are often prepared for well-designed research
studies in which it is intended to employ two-stage cluster
sampling. These tables present a range of sample design options
– each designed to have a pre-specified level of sampling accuracy.
A hypothetical example has been presented in the following
discussion in order to illustrate the sequence of calculations and
decisions involved in the preparation of these tables.

Consider an educational research study in which test items are


administered to a two-stage cluster sample of students with the aim


of estimating the percentage of students in the population that are


able to obtain correct answers. In addition, assume that a sampling
accuracy constraint has been placed on the design of the study so
as to ensure that the sample estimate of the percentage of students
providing the correct answer, p, will provide p ± 5 per cent as 95
per cent confidence limits for the value of the percentage in the
population.

For reasonably large samples it is possible to assume normality of


the distribution of sample estimates (Kish, 1965:13-14) and therefore
confidence limits of p ± 5 per cent are approximately equivalent to an
error range of plus or minus two standard errors of p. Consequently,
the error constraint placed on the study means that one standard
error of p needs to be less than or equal to 2.5 per cent.

Consider a simple random sample of n* students selected from this


population in order to calculate values of p. Statistical theory may
be employed to show that, for large populations, the variance of the
sample estimate of p as an estimate of the population value may be
calculated by using the following formula (Kish, 1965:46).

var(p) = p × (100 - p) / (n* - 1)

The maximum value of p(100 - p) occurs for p = 50. Therefore, in


order to ensure that we could satisfy the error constraints described
above, the following inequality would need to be valid.

50 × (100 - 50) / (n* - 1) ≤ (2.5)²


That is, the size of the simple random sample, n*, would have to be
greater than, or equal to, about 400 students in order to obtain 95
per cent confidence limits of p ± 5 per cent.
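
A short Python sketch of this calculation (using the worst case of p = 50 and a maximum standard error of 2.5 percentage points):

import math

p = 50.0        # worst case: p(100 - p) is largest when p = 50
max_se = 2.5    # one standard error of p must not exceed 2.5 percentage points

# From var(p) = p(100 - p) / (n* - 1) <= max_se squared:
n_star = p * (100 - p) / max_se ** 2 + 1
print("Required simple random sample size:", math.ceil(n_star))   # 401, i.e. about 400 students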

Now consider the size of a two-stage cluster sample design which


would provide equivalent sampling accuracy to a simple random
sample of 400 students. The design of this cluster sample would
require knowledge of the numbers of primary sampling units
(for example, schools or classes) and the numbers of secondary
sampling units (students) which would be required.

From previous discussion, the relationship between the size of the


cluster sample, nc, which has the same accuracy as a simple random
sample of size n* = 400 may be written in the following fashion. This
expression is often described as a ‘planning equation’ because it
may be used to explore sample design options for two-stage cluster
samples.

nc = 400 × (1 + (b - 1) × roh)

The value of nc is equal to the product of the number of primary


sampling units, a, and the number of secondary sampling units
selected from each primary sampling unit, b. Substituting for
nc in this formula, and then transposing provides the following
expression for a in terms of b and roh.

a = (400 / b) × (1 + (b - 1) × roh)


As an example, consider roh = 0.1, and b = 20. Then we have the


following value for a.

a = (400 / 20) × (1 + (20 - 1) × 0.1)
  = 58

That is, for roh = 0.1, a two-stage cluster sample of 1160 students
(consisting of the selection of 58 primary sampling units followed
by the selection of clusters of 20 students) would have sampling
accuracy equivalent to a simple random sample of 400 students.
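
The planning equation is easily turned into a short function. The sketch below (in Python) reproduces the example above and can be used to generate the entries of a sample design table for any value of roh.

import math

def two_stage_design(n_star, cluster_size, roh):
    """Number of clusters (a) and total sample size (nc) for a two-stage
    cluster sample with an effective sample size of n_star."""
    deff = 1 + (cluster_size - 1) * roh
    required = n_star * deff / cluster_size
    a = math.ceil(round(required, 6))   # round the number of clusters up (guarding against float noise)
    return a, a * cluster_size

# The example from the text: n* = 400, b = 20, roh = 0.1.
print(two_stage_design(400, 20, 0.1))     # (58, 1160)

# A few rows of a sample design table for roh = 0.4.
for b in (1, 5, 15, 50):
    print(b, two_stage_design(400, b, 0.4))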

In Table 1 the planning equation has been employed to list sets
of values for a, b, deff, and nc which describe a group of two-stage
cluster sample designs that have sampling accuracy equivalent
to a simple random sample of 400 students. Three sets of sample
designs have been listed in the table – corresponding to roh values
of 0.1, 0.2, and 0.4. In a study of school systems in ten developed
countries, Ross (1983: 54) has shown that values of roh in this range
are typical for achievement test scores obtained from clusters of
students within schools.

The most striking feature of Table 1 is the rapidly diminishing


effect that increasing b, the cluster size, has on a, the number of
clusters that are to be selected. This is particularly noticeable when
the cluster size reaches 10 to 15 students.


Table 1   Sample design table for two-stage cluster samples with sampling
          accuracy equal to a simple random sample of 400

Cluster        roh = 0.1            roh = 0.2            roh = 0.4
size b     deff    nc     a     deff    nc     a     deff    nc     a

   1        1.0    400   400     1.0    400   400     1.0    400   400
   2        1.1    440   220     1.2    480   240     1.4    560   280
   5        1.4    560   112     1.8    720   144     2.6   1040   208
  10        1.9    760    76     2.8   1120   112     4.6   1840   184
  15        2.4    960    64     3.8   1530   102     6.6   2640   176
  20        2.9   1160    58     4.8   1920    96     8.6   3440   172
  30        3.9   1560    52     6.8   2730    91    12.6   5040   168
  40        4.9   1960    49     8.8   3520    88    16.6   6640   166
  50        5.9   2400    48    10.8   4350    87    20.6   8250   165

Consider, for example, two sample designs applied in a situation


where a value of roh = 0.4 may be assumed: (a) a total sample of
2640 students obtained by selecting 15 students per cluster from
176 clusters, and (b) a total sample of 8250 students obtained
by selecting 50 students per cluster from 165 clusters. From
Table 1, it may be observed that both of these sample designs have sampling accuracy equivalent to a simple random sample of 400 students. These two sample designs have equivalent sampling accuracy; however, there is a striking difference between the two designs in terms of total sample size. Further, the magnitude of this
difference is not reflected proportionately in the difference between
the number of clusters selected.


This result illustrates an important point for the planning of


educational research studies that seek to make stable estimates
of student population characteristics: for cluster sizes of 10 or more, the sampling accuracy of two-stage cluster sample designs tends to be greatly influenced by small changes in the
number of clusters that are selected at the first stage of sampling,
and relatively less influenced by small changes in the size of the
selected clusters.

The main use of sample design tables like the one presented in Table 1 is to permit the researcher to choose, for a given value of roh,
one sample design from among a list of equally accurate sample
design options. The final choice between equally accurate options
is usually guided by cost factors, or data analysis strategies, or a
combination of both of these.

For example, the cost of collecting data by using ‘group


administration’ of tests, questionnaires, etc. often depends more
on the number of selected schools than on the number of students
surveyed within each selected school. This occurs because the
use of this methodology usually leads to higher marginal costs
associated with travel to many schools, compared with the marginal
costs of increasing the number of students surveyed within each
selected school. In contrast, the cost of collecting data by using
‘individual administration’ of one-to-one tests, interviews, etc.,
normally depends more on the total sample size than on the
number of selected schools.

The choice of a sample design option may also depend upon the
data analysis strategies that are being employed in the research. For
example, analyses may be planned at both the between-student and
between-school levels of analysis. In order to conduct analyses at
the between-school level, data obtained from individual students
may need to be aggregated to obtain files consisting of school
records based on student mean scores. This type of analysis


requires that sufficient students be selected per school so as to


ensure that stable estimates are able to be made for individual
schools. At the same time, it requires that sufficient schools are
available so as to ensure that meaningful results may be obtained
for estimates of population characteristics.



2. How to draw a national sample of students: a hypothetical example for ‘Country X’
In this section a hypothetical example has been presented in order
to illustrate the 12 main steps involved in selecting a national
sample of schools and students for ‘Country X’. The numbers of
schools and students in the example have been made very small so
that all calculations and tabulations can be carried out by hand. In
the next section of this module a ‘real’ example has been presented
in which the required calculations and tabulations must be carried
out by computer.

The desired target population for the example consists of 1100


students attending 20 schools located in two administrative regions
of Country X. The sample design requires that five schools be
selected across four strata with probability proportional to the
number of students in the defined target population. Within the
five selected schools a simple random sample of 20 students is to be
selected. The 12 steps required to complete the sample design cover
the following five areas.

• Specification of Sample (Steps 1 to 4)

The development and listing of ‘specifications’ for the sample


design – including target population, sample size, sampling stages,
stratification, sample allocation, cluster size, and pseudoschool


rules. In most research situations the specifications are prepared


following a detailed investigation of the availability of information
that is suitable for the construction of sampling frames, and
an evaluation of the nature and scope of the data that are to be
collected.

• Stratification and Sample Allocation (Steps 5 to 7)

A description of the stratification variables and an analysis of the


breakdown of the defined target population across strata. The
breakdown is used with the sample design specifications to guide
the allocation of the sample across strata.

• Marker Variables (Step 8)

This is an ‘optional’ area. It provides information that can be used to


make comparisons between the population and sample. However, it
has no real impact on the scientific quality of the final sample.

• Construction of Sampling Frame (Steps 9 and 10)

The construction of a separate list of schools for each stratum.


Each school is listed in association with the number of students
in the defined target population. Schools that are smaller than a
specified size are linked to a similar (and nearby) school to form a
‘pseudoschool’.

• Sample Selection (Steps 11 and 12)

The selection of the sample of schools within strata (by using a


‘lottery tickets’ approach), and the selection of a sample of students
within schools (by using a table of random numbers).


EXERCISE A

The reader should work through the 12 steps of the sample design.
Where tabulations are presented, the figures in each cell should be
verified by hand calculation using the listing of schools that has been
provided in step 3. After working through all steps, the following
questions should be addressed:

1. What is the probability of a student in the first school of the


first stratum being selected into the sample? Is this probability
the same for students in other schools in the first stratum? Is
it the same for students in other strata?

2. How would the sampling of students be undertaken for a


pseudoschool that was selected at the first stage of sampling?

3. What other variables might be used to stratify the defined


target population? Are any of these ‘better’ than the ones used
in the example?

4. What would Table 3 and Table 4 look like if the stratification


variables were ‘Region’ and ‘School Type’?


Step 1
List the basic characteristics of the sample design
1. Desired target population: Grade Six students in Country X.

2. Stratification variables: ‘Region’ and ‘School size’.

3. Sampling stages: First Stage: Schools selected within strata with


probability proportional to size. Second Stage: Students selected
within schools by using simple random sampling.

4. Minimum cluster size: Fixed cluster of twenty students per


school.

5. Number of schools to be selected: A total of five schools.

6. Allocation of the sample across strata: Proportionate allocation


across strata. That is, the size of the sample within a stratum
should be proportional to the total size of the stratum.

7. Pseudoschools: Each school in the defined target population


with less than 20 students in the defined target population is to
be combined with another similar (and nearby) school to form a
‘pseudoschool’.

8. Selection equation

Probability = ah × (Nhi / Nh) × (nhi / Nhi) = (ah × nhi) / Nh

where ah = the number of schools selected in stratum h,


Nh = the total number of students in stratum h,
Nhi = the total number of students in school i in stratum h, and
nhi = the number of students selected from school i.
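
A short Python sketch of this selection equation, using illustrative values from the first stratum (School_A, with 100 of the 400 defined target population students in the stratum):

def selection_probability(a_h, N_h, N_hi, n_hi):
    """Overall selection probability for a student in school i of stratum h."""
    first_stage = a_h * N_hi / N_h     # school selected with probability proportional to size
    second_stage = n_hi / N_hi         # simple random sample of students within the school
    return first_stage * second_stage  # equal to a_h * n_hi / N_h

# Stratum 1: a_h = 2 schools, N_h = 400 students; School_A has N_hi = 100
# students, of whom n_hi = 20 are to be selected.
print(selection_probability(2, 400, 100, 20))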


Step 2
Prepare brief and accurate written descriptions of
the desired target population, the defined target
population, and the excluded population
1. Desired Target Population: “Grade Six students in Country X”.

2. Defined Target Population: “All students at the Grade Six level


who are attending registered Government or Non-government
primary schools in all administrative regions of Country X”.

3. Excluded Population: “All students at the Grade Six level


attending special schools for the handicapped in Country X”.


Step 3
Locate a listing of all primary schools in Country_X
that includes the following information for each
school that has students in the desired target
population
1. Description of listing

a. School name. The official (unique) name of the school or a


suitable school identification number.
b. Region. The major state, province, or administrative region in
which the school is located.
c. District. The district within an administrative region that is
normally supervised by a District Education Officer.
d. School type. The name of the authority that administers
the school. For this example, a simple classification into
‘Government’ and ‘Non-government’ schools. (Many studies
use more detailed subgroups.)
e. School location. The location of the school in terms of
urbanisation. For this example, ‘Urban’ or ‘Rural’ schools.
f. School size. The size of the total school enrolment. For this
example, a simple classification into ‘Large’ and ‘Small’
schools. (Many studies use more detailed subgroups and
some use the ‘exact’ enrolment.)
g. School program. A description of school program that is
suitable for identifying schools in the ‘excluded population’.
For this example, ‘Regular’ and ‘Special’ schools.
h. Target population. The number of students within the school
who are members of the desired target population. For this
example, the total enrolment of students at the Grade Six
level.


2. Contents of Listing

School_A Reg_1 Dist_1 Govt Urban Large Regular 100


School_B Reg_1 Dist_1 Govt Urban Large Regular 60
School_C Reg_1 Dist_1 Non-Govt Urban Small Regular 60
School_D Reg_1 Dist_2 Govt Urban Large Special 60
School_E Reg_1 Dist_2 Non-Govt Urban Large Regular 70
School_F Reg_1 Dist_3 Govt Rural Small Regular 40
School_G Reg_1 Dist_3 Govt Rural Large Regular 80
School_H Reg_1 Dist_4 Non-Govt Urban Small Regular 50
School_I Reg_1 Dist_4 Non-Govt Urban Small Regular 50
School_J Reg_1 Dist_4 Govt Urban Large Regular 90
School_K Reg_2 Dist_5 Govt Urban Large Regular 60
School_L Reg_2 Dist_5 Non-Govt Urban Large Regular 40
School_M Reg_2 Dist_6 Govt Urban Small Regular 10
School_N Reg_2 Dist_6 Govt Urban Small Regular 80
School_O Reg_2 Dist_6 Non-Govt Urban Large Regular 50
School_P Reg_2 Dist_7 Govt Rural Small Regular 30
School_Q Reg_2 Dist_7 Govt Rural Large Regular 50
School_R Reg_2 Dist_8 Govt Urban Large Special 40
School_S Reg_2 Dist_8 Non-Govt Urban Small Regular 50
School_T Reg_2 Dist_9 Non-Govt Urban Small Regular 30


Step 4
Use the listing of schools in the desired target
population to prepare a tabular description of
the desired target population, the defined target
population, and the excluded population
There are 1100 students attending 20 schools in the desired target
population. Two of these schools are ‘special schools’, and therefore their 100 students are assigned to the excluded population
– leaving 1000 students in 18 schools as the defined target
population.

Table 2 The desired, defined and excluded populations

Desired Defined Excluded

Schools Students Schools Students Schools Students

20 1100 18 1000 2 100


Step 5
Select the stratification variables
The stratification variables are ‘Region’ (which has two categories:
‘Region_1’ and ‘Region_2’) and ‘School Size’ (which has two
categories: ‘Large’ and ‘Small’ schools). These two variables
combine to form the following four strata.

Stratum_1: Region_1 Large Schools


Stratum_2: Region_1 Small Schools
Stratum_3: Region_2 Large Schools
Stratum_4: Region_2 Small Schools


Step 6
Apply the stratification variables to the desired,
defined, and excluded population
Table 3 Schools and students in the desired, defined
and excluded populations listed by the four
strata

Desired Defined Excluded


Stratum
Schools Students Schools Students Schools Students

Stratum_1 6 460 5 400 1 60


Stratum_2 4 200 4 200 0 0
Stratum_3 5 240 4 200 1 40
Stratum_4 5 200 5 200 0 0

Country X 20 1100 18 1000 2 100


Step 7
Establish the allocation of the sample across strata
The sample specifications in Step 1 require five schools to be
selected in a manner that provides a proportionate allocation of
the sample across strata. Since a fixed-size cluster of 20 students
is to be drawn from each selected school, the number of schools
to be selected from each stratum must be proportional to the total
stratum size.

Proportionate Number of Schools. The proportionate number of


schools required from the first stratum is 5 x (400/1000) = 2
schools. The other three strata have 200 students each and so the
proportionate number of schools for each is 5 x (200/1000) = 1
school.
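
A minimal Python sketch of this proportionate allocation calculation:

schools_to_select = 5
stratum_sizes = {"Stratum_1": 400, "Stratum_2": 200, "Stratum_3": 200, "Stratum_4": 200}
total_students = sum(stratum_sizes.values())   # 1000

for stratum, size in stratum_sizes.items():
    exact = schools_to_select * size / total_students
    print(stratum, "exact:", exact, "rounded:", round(exact))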

Table 4   The sample allocation for the defined target population

              Population of            Sample allocation
                students            Schools           Students
Stratum      Number   Percent    Exact   Rounded   Number   Percent

Stratum_1      400      40.0      2.0       2         40      40.0
Stratum_2      200      20.0      1.0       1         20      20.0
Stratum_3      200      20.0      1.0       1         20      20.0
Stratum_4      200      20.0      1.0       1         20      20.0


Step 8 (an optional step)


Prepare tabular displays of students and schools
in the defined target population – broken down
by certain variables that may be of interest for the
construction of ‘marker variables’ after the data
have been collected
This is an optional step in the sample design because it has no
impact upon the scientific quality of the final sample. The main
justification for this step is that it may provide useful information
(sometimes referred to as ‘marker variables’) that can be used to
compare several characteristics of the final sample and the defined
target population.

Table 5   Schools with students in the defined target population listed by
          categories of school size, school location, and school type

               School size      School location        School type
Stratum      Large    Small     Rural    Urban      Govt    Non-Govt    Total

Stratum_1       5        0         1        4          4        1          5
Stratum_2       0        4         1        3          1        3          4
Stratum_3       4        0         1        3          2        2          4
Stratum_4       0        5         1        4          3        2          5

Country X       9        9         4       14         10        8         18


Table 6   Students in the defined target population listed by categories of
          school size, school location, and school type

               School size      School location        School type
Stratum      Large    Small     Rural    Urban      Govt    Non-Govt    Total

Stratum_1     400        0        80      320        330       70        400
Stratum_2       0      200        40      160         40      160        200
Stratum_3     200        0        50      150        110       90        200
Stratum_4       0      200        30      170        120       80        200

Country X     600      400       200      800        600      400       1000


Step 9
For schools with students in the defined target
population, prepare a separate list of schools for
each stratum with ‘pseudoschools’ identified by a
bracket ( [ )
The sample design specifications in Step 1 require that 20 students
be drawn for each selected school. However, one school in the
defined target population, School_M, has only 10 students. This
school is therefore combined with a similar (and nearby) school,
School_N, to form a ‘pseudoschool’.

The pseudoschool is treated as a single school of 90 students during


the process of selecting the sample. If a pseudoschool happens to
be selected into the sample, it is treated as a single school for the
purposes of selecting a subsample of 20 students.

Note that School_D and School_R do not appear on the list because
they are members of the excluded population.


Stratum 1. Region_1 Large schools

School_A Reg_1 Dist_1 Govt Urban Large Regular 100


School_B Reg_1 Dist_1 Govt Urban Large Regular 60
School_E Reg_1 Dist_2 Non-Govt Urban Large Regular 70
School_G Reg_1 Dist_3 Govt Rural Large Regular 80
School_J Reg_1 Dist_4 Govt Urban Large Regular 90

Stratum 2. Region_1 Small schools

School_C Reg_1 Dist_1 Non-Govt Urban Small Regular 60


School_F Reg_1 Dist_3 Govt Rural Small Regular 40
School_H Reg_1 Dist_4 Non-Govt Urban Small Regular 50
School_I Reg_1 Dist_4 Non-Govt Urban Small Regular 50

Stratum 3. Region_2 Large schools

School_K Reg_2 Dist_5 Govt Urban Large Regular 60


School_L Reg_2 Dist_5 Non-Govt Urban Large Regular 40
School_O Reg_2 Dist_6 Non-Govt Urban Large Regular 50
School_Q Reg_2 Dist_7 Govt Rural Large Regular 50

Stratum 4. Region_2 Small schools

School_M Reg_2 Dist_6 Govt Urban Small Regular 10 [
School_N Reg_2 Dist_6 Govt Urban Small Regular 80 [
School_P Reg_2 Dist_7 Govt Rural Small Regular 30
School_S Reg_2 Dist_8 Non-Govt Urban Small Regular 50
School_T Reg_2 Dist_9 Non-Govt Urban Small Regular 30


Step 10
For schools with students in the defined target
population, assign ‘lottery tickets’ such that each
school receives a number of tickets that is equal
to the number of students in the defined target
population
Note that the pseudoschool made up from School_M and School_N
has tickets numbered 1 to 90 because these two schools are treated
as a single school for the purposes of sample selection.

Stratum 1. Region_1 Large schools

Lottery
tickets

School_A Reg_1 Dist_1 Govt Urban Large Regular 100 1-100


School_B Reg_1 Dist_1 Govt Urban Large Regular 60 101-160
School_E Reg_1 Dist_2 Non-Govt Urban Large Regular 70 161-230
School_G Reg_1 Dist_3 Govt Rural Large Regular 80 231-310
School_J Reg_1 Dist_4 Govt Urban Large Regular 90 311-400

Stratum 2. Region_1 Small schools

Lottery
tickets

School_C Reg_1 Dist_1 Non-Govt Urban Small Regular 60 1-60


School_F Reg_1 Dist_3 Govt Rural Small Regular 40 61-100
School_H Reg_1 Dist_4 Non-Govt Urban Small Regular 50 101-150
School_I Reg_1 Dist_4 Non-Govt Urban Small Regular 50 151-200


Stratum 3. Region_2 Large schools

Lottery
tickets

School_K Reg_2 Dist_5 Govt Urban Large Regular 60 1-60


School_L Reg_2 Dist_5 Non-Govt Urban Large Regular 40 61-100
School_O Reg_2 Dist_6 Non-Govt Urban Large Regular 50 101-150
School_Q Reg_2 Dist_7 Govt Rural Large Regular 50 151-200

Stratum 4. Region_2 Small schools

Lottery
tickets

School_M Reg_2 Dist_6 Govt Urban Small Regular 10 [
School_N Reg_2 Dist_6 Govt Urban Small Regular 80 [ 1-90
School_P Reg_2 Dist_7 Govt Rural Small Regular 30 91-120
School_S Reg_2 Dist_8 Non-Govt Urban Small Regular 50 121-170
School_T Reg_2 Dist_9 Non-Govt Urban Small Regular 30 171-200


Step 11
Select the sample of schools
• Selection of two schools for stratum 1

In the first stratum two schools must be selected with probability


proportional to the number of students in the defined target
population. This is achieved by allocating a number of ‘lottery
tickets’ to each school so that the first school on the list, School_A,
with 100 students in the defined target population receives lottery
tickets that are numbered from 1 to 100. The second school on the
list, School_B, with 60 students receives lottery tickets that are
numbered from 101 to 160, and so on – until the final school in the
first stratum receives tickets numbered 311 to 400.

Since two schools must be selected there is a need to identify two


‘winning tickets’. The ratio of the total number of tickets to the
number of winning tickets is known as the ‘sampling interval’.
For the first stratum the sampling interval is equal to 400/2 = 200.
That is, each ticket in the first stratum should have a 1 in 200
chance of being drawn as a winning ticket. Note that within the
other three strata, one winning ticket must be selected from a total
of 200 tickets – which gives the same value for the sampling interval
(200/1 = 200).

The ‘winning tickets’ for the first stratum are drawn by using a
‘random start – constant interval’ procedure whereby a random
number in the interval of 1 to 200 is selected as the first winning
ticket and the second ticket is selected by adding an increment of
200. Assuming a random start of 105, the winning ticket numbers
would be 105 and 305. This results in the selection of School_B
(which holds tickets 101 to 160) and School_G (which holds
tickets 231 to 310). The probability of selection is proportional to
the number of tickets held and therefore each of these schools is
selected with probability proportional to the number of students in
the defined target population.


• Selection of one school each for the other three strata

Only one winning ticket is required for the other strata and
therefore one random number must be drawn between 1 and 200
for each stratum. Assuming these numbers are 65, 98, and 176, the
selected schools are School_F, School_L, and School_T.

• Sample of Schools

                                                                   Winning
                                                                   ticket

Stratum 1:  School_B Reg_1 Dist_1 Govt Urban Large Regular 60        105
Stratum 1:  School_G Reg_1 Dist_3 Govt Rural Large Regular 80        305
Stratum 2:  School_F Reg_1 Dist_3 Govt Rural Small Regular 40         65
Stratum 3:  School_L Reg_2 Dist_5 Non-Govt Urban Large Regular 40     98
Stratum 4:  School_T Reg_2 Dist_9 Non-Govt Urban Small Regular 30    176
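
The ‘random start – constant interval’ procedure used in this step can also be checked with a few lines of code. The sketch below (in Python) reproduces the selection of the two stratum 1 schools, keeping the random start of 105 from the text so that the result can be compared directly with the description above.

# Schools in stratum 1 with their numbers of defined target population students.
stratum_1 = [("School_A", 100), ("School_B", 60), ("School_E", 70),
             ("School_G", 80), ("School_J", 90)]

schools_to_select = 2
total_tickets = sum(size for _, size in stratum_1)    # 400 lottery tickets
interval = total_tickets // schools_to_select          # sampling interval = 200

# In practice the start is drawn at random from 1..interval (for example with
# random.randint); here the value 105 from the text is used.
start = 105
winning_tickets = [start + k * interval for k in range(schools_to_select)]   # [105, 305]

# Allocate the lottery tickets cumulatively and find the schools holding the winners.
selected, upper = [], 0
for name, size in stratum_1:
    lower, upper = upper + 1, upper + size
    selected.extend(name for ticket in winning_tickets if lower <= ticket <= upper)

print("Winning tickets :", winning_tickets)   # [105, 305]
print("Selected schools:", selected)          # ['School_B', 'School_G']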


Step 12
Use a table of random numbers to select a simple
random sample of 20 students in each selected
school
Within a selected school a table of random numbers is used to
identify students from a sequentially numbered roll of students in
the defined target population. The application of this procedure
has been described in detail by Ross and Postlethwaite (1991). A
summary of the procedure has been presented below.

1. Obtain Grade Six register(s) of attendance

These registers are obtained for all students in the defined


target population – in this case Grade Six students. In multiple
session schools, both morning and afternoon registers are
obtained.

2. Place a sequential number beside the name of each Grade Six


student

For example, consider a school with one ‘shift’ and a total of 48


students in Grade Six. Commence by placing the number ‘1’
beside the first student on the Register; then place the number
‘2’ beside the second student on the Register; …etc…; finally,
place the number ‘48’ beside the last student on the Register.

As another example, consider a school with 42 students in the


morning ‘shift’ and 48 students in the afternoon session of Grade
Six. Commence by placing the number ‘1’ beside the first student
on the Morning Register; …etc…; then place a ‘42’ beside the last
student on the Morning Register; then place a ‘43’ beside the first
student on the Afternoon Register; …etc…; finally place a ‘90’ beside
the last student on the afternoon register.


3. Locate the appropriate set of selection numbers

In Appendix 6 sets of “random number tables” have been listed


for a variety of school sizes. (Note that only the sets relevant for
school sizes in the range 21 to 100 have been presented in this
Appendix.) For example, if a school has 48 students in Grade
Six, the appropriate set of selection numbers is listed under the
‘R48’ heading. Similarly, if a school has 90 Grade Six students
then the appropriate set of selection numbers is listed under the
‘R90’ heading.

4. Use the appropriate set of random numbers

After locating the appropriate set of random numbers, use the


first random number to locate the Grade Six student with the
same sequential number on the Register. Then use the second
random number to locate the Grade Six student with the same
sequential number on the Register. Continue with this process
until the complete sample of students has been selected.

For example, in Step 11 the first school selected in Stratum_1 is


School_B which has 60 students in the defined target population.
Within this school the students selected would be those with the
following sequential numbers: 1, 5, 6, 7, 9 …, 52, and 54. (These
numbers are obtained from the set of random numbers labelled
‘R60’ in Appendix 6).
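
Where a computer is available, the same within-school selection can be made with a few lines of code instead of the printed random number tables. A minimal sketch in Python for a school with 60 Grade Six students:

import random

school_size = 60     # number of Grade Six students on the register(s)
cluster_size = 20    # number of students to be selected

# Register numbers 1..60; choose 20 of them at random without replacement.
selected_numbers = sorted(random.sample(range(1, school_size + 1), cluster_size))
print(selected_numbers)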


3. How to draw a national sample of schools and students: a real example for Zimbabwe
In Section Two of this module a hypothetical example was presented
in order to provide a simple illustration of the 12 main steps
involved in selecting a national sample of schools and students.
In this section a ‘real’ example has been presented by applying
these 12 steps to the selection of a national sample of schools and
students in Zimbabwe. For most of these steps, an ‘exercise’ has
been given which requires the use of a computer to carry out the
kinds of calculations and tabulations that were possible to conduct
by hand for the hypothetical example. The example draws upon
computer-based techniques that were developed for a large-scale
study of the quality of education at Grade Six level in Zimbabwe
(Ross and Postlethwaite, 1991). The study was undertaken in
1991 by the International Institute for Educational Planning (IIEP,
UNESCO) and the Zimbabwe Ministry of Education and Culture.

In order to work through the exercises in this section it will be


necessary to employ the SAMDEM software (Sylla et al., 2004)
in association with a sampling frame that was constructed from
the 1991 ‘school forms’ survey of primary schools in Zimbabwe.
The database (described in Step 3 of this section, and entitled
‘Zimbabwe.dat’) was developed in association with George Moyo,
Patrick Pfukani, and Saul Murimba from the Policy and Planning
Unit of the Zimbabwe Ministry of Education and Culture.



The 12 steps described in this section have been used in many
countries to select national samples of schools and students (Ross,
1991). However, it should be emphasized that the general ideas are
applicable to samples that might be selected for smaller studies such
as surveys of an administrative region or educational district.

Step 1
List the basic characteristics of the sample design
1. Desired Target Population: Grade Six students in Zimbabwe.

2. Stratification Variables: The following two stratification


variables were used.

a. ‘Region’: This variable referred to the nine major


administrative regions of Zimbabwe and took the following
nine values: ‘Harare’, ‘Manica’ (Manicaland), ‘Mascen’
(Mashonaland Central), ‘Masest’ (Mashonaland East),
‘Maswst’ (Mashonaland West), ‘Matnor’ (Matabeleland
North), ‘Matsou’ (Matabeleland South), ‘Maving’ (Masvingo),
‘Midlnd’ (Midlands).
b. ‘School size’: This variable referred to the size of the school in
terms of total school enrolment and took the following two
values: ‘L’ (Large) and ‘S’ (Small).

3. Sampling Stages: First Stage: Schools selected within strata


with probability proportional to size. Second Stage: Students
selected within schools by using simple random sampling.

4. Minimum Cluster Size: A fixed cluster of twenty students per


school was accepted because this represented a manageable
cluster size for data collection requirements within any single
school.


5. Number of Schools to be Selected: The number of schools


required for this example was governed by the requirement
that final sample should have an effective sample size for the
main criterion variables of at least 400 students. That is, the
final sample was required to have sampling accuracy that was
equivalent to, or better than, a simple random sample of 400
students.

a. Sample design tables


The general sample design framework adopted for the
example consisted of a stratified two-stage cluster sample
design. This permitted the use of sample design tables
(Ross, 1987) to provide estimates of the number of schools
and students required to obtain a sample with an effective
sample size of 400. In order to use the sample design tables,
it is necessary to know: the minimum cluster size (the
minimum number of students within a school that will
be selected for participation in the data collection), and
the coefficient of intraclass correlation (a measure of the
tendency of student characteristics to be more homogeneous
within schools than would be the case if students were
assigned to schools at random).
From above, it was known that the minimum cluster size
adopted for the example was 20 students. Also, from a
previous study of Reading Achievement conducted in
Zimbabwe (Ross and Postlethwaite, 1991) it was found that
the value of roh was around 0.3.
In Appendix 2 a set of sample design tables has been
presented for various values of the minimum cluster size,
and various values of the co-efficient of intraclass correlation.
The construction of these tables has been described by
Ross (1987). In each table, ‘a’ has been used to describe
the number of schools, ‘b’ has been used to describe the


minimum cluster size, and ‘n’ has been used to describe the
total sample size.
The rows of Table 4 that correspond to a minimum cluster
size of one refer to the effective sample size. That is, they
describe the size of a simple random sample which has
equivalent accuracy. Therefore, the pairs of figures in the
fourth and fifth columns in the table all refer to sample
designs which have equivalent accuracy to a simple random
sample of size 400. The second and third columns refer to
an effective sample size of 1600, and the final two pairs
of columns refer to effective sample sizes of 178 and 100,
respectively.
The most important columns of figures in Appendix 2 for this
example are the fourth and fifth columns because they list a
variety of two-stage samples that would result in an effective
sample size of 400.
To illustrate, consider the intersection of the fourth and
fifth columns of figures with the third row of figures in the
first page of Appendix 2. The pair of values a=112 and n=560
indicate that if roh is equal to 0.1 and the minimum cluster
size, b, is equal to 5, then the two-stage cluster sample design
required to meet the required sampling standard would be
5 students selected from each of 112 schools – which would
result in a total sample size of 560 students.
The effect of a different value of roh, for the same
minimum cluster size, may be examined by considering the
corresponding rows of the table for roh=0.2, 0.3, etc. For
example, in the case where roh=0.3, a total sample size of
880 students obtained by selecting 5 students from each of
176 schools would be needed to meet the required sampling
standard.


b. The number of schools required for this example


In this study the value selected for the minimum cluster
size was 20, and the estimated value of the coefficient of
intraclass correlation was 0.3. From the sample design
tables in Appendix 2, it may be seen that, in order to obtain
a two-stage cluster sample with an effective sample size of
400, it is necessary to select a sample of 134 schools – which
results in a total sample size of 2680 students. (A small
computational sketch of this calculation is given after the
selection equation below.)
6. Allocation of the sample across strata: Proportionate allocation
across strata. That is, the size of the sample within a stratum
should be proportional to the total size of the stratum.

7. Pseudoschools: Each school with fewer than 20 students in the
defined target population is to be combined with another
similar (and nearby) school to form a ‘pseudoschool’.

8. Selection Equation

\[ \text{Probability} \;=\; a_h \times \frac{N_{hi}}{N_h} \times \frac{n_{hi}}{N_{hi}} \;=\; \frac{a_h \times n_{hi}}{N_h} \]

where ah = the number of schools selected in stratum h,


Nh = the total number of students in stratum h,
Nhi = the total number of students in school i in stratum h, and
nhi = the number of students selected from school i.
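The two calculations set out in this step, the number of schools needed to reach a given effective sample size and the overall probability of selecting a student, can be sketched in a few lines of Python. This is only an illustrative sketch (it is not part of the SAMDEM software), and it assumes the usual design-effect relationship, deff = 1 + (b − 1) × roh, on which the sample design tables in Appendix 2 are based; the function names are invented for the example. With the values used here (roh = 0.3, b = 20, effective sample size 400) it reproduces the figures quoted above: 134 schools and 2680 students in total.

import math

def required_schools(n_effective, b, roh):
    """Schools needed so that a two-stage cluster sample with b students
    per school matches the accuracy of a simple random sample of
    n_effective students: a = n_effective * (1 + (b - 1) * roh) / b."""
    deff = 1 + (b - 1) * roh              # design effect for clusters of size b
    return math.ceil(n_effective * deff / b)

def selection_probability(a_h, N_h, N_hi, n_hi):
    """Overall probability of selecting a student from school i in stratum h:
    a_h * (N_hi / N_h) * (n_hi / N_hi) = a_h * n_hi / N_h."""
    return a_h * (N_hi / N_h) * (n_hi / N_hi)

a = required_schools(400, 20, 0.3)        # 134 schools for the Zimbabwe example
print(a, a * 20)                          # -> 134 2680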


Step 2
Prepare brief and accurate written descriptions of
the desired target population, the defined target
population, and the excluded population
1. Desired target population: “Grade Six students in Zimbabwe”.

2. Defined target population: “All students at the Grade Six level


in September 1991 who are attending registered Government
or Non-government primary schools that offer the ‘regular’
curriculum, that contain at least 10 Grade Six students, and that
are located in all nine administrative regions of Zimbabwe”.

3. Excluded Population: “All students at the Grade Six level in
September 1991 who are attending (a) special schools for the
handicapped in Zimbabwe (these schools do not offer the
‘regular’ curriculum), or (b) primary schools in Zimbabwe with
fewer than 10 Grade Six students (these schools are smaller than
the specified minimum size)”.


Step 3
Locate a listing of all primary schools in Zimbabwe
that includes the following information for each
school that has students in the desired target
population

EXERCISE B

Use the SAMDEM software to list the following groups of schools in the
Zimbabwe.dat file:

1. the first ten schools in the file,

2. schools that contain students in the ‘excluded population’


(special and small schools) and,

3. schools that must be combined to form `pseudoschools’.

Description of Listing (Stored in ‘Zimbabwe.dat’ File)

• Schlname (Columns 1-15). This variable describes the official


name given to each school by the Zimbabwe Ministry of
Education and Culture.

• Region (Columns 17-22). This variable describes the


administrative regions of Zimbabwe. The variable takes nine
values: ‘Harare’, ‘Manica’ (Manicaland), ‘Mascen’ (Mashonaland
Central), ‘Masest’ (Mashonaland East), ‘Maswst’ (Mashonaland
West), ‘Matnor’ (Matabeleland North), ‘Matsou’ (Matabeleland
South), ‘Maving’ (Masvingo), ‘Midlnd’ (Midlands).

• District (Columns 24-29). This variable describes the


administrative district within each administrative region of
Zimbabwe. The variable takes 55 values ranging from ‘Harare’
(which is congruent with the region of Harare) to ‘Zvisha’
(which is located in the region of Midlands).


• Schltype (Columns 31-33). This variable describes schools


according to the authority responsible for school administration.
The variable takes two values: ‘Gov’ (Government Schools),
and ‘Non’ (Non-government schools administered by District
Councils, Rural Councils, Missions, Farm School Councils,
Trust Councils, etc.).

• Schllocn (Column 35). This variable describes the location


of the school in terms of the degree of urbanization of the
surrounding community. The variable takes two values ‘R’
(rural location) and ‘U’ (urban location).

Contents of Listing (First Ten Schools in ‘Zimbabwe.dat’ File)

Alpha Brick Harare Harare Non R L Reg 86


Royden Harare Harare Non R L Reg 34
St Marnocks Harare Harare Non R L Reg 125
Chizungu Harare Harare Non R L Reg 302
Nyabira Harare Harare Non R L Reg 102
Kubatana Harare Harare Non R L Reg 43
Kintyre Harare Harare Non R L Reg 61
Gwebi Harare Harare Non R L Reg 90
Henderson Harare Harare Non R L Reg 36
Beatrice Harare Harare Non R L Reg 24

• Schlsize (Column 37). This variable describes each school in


terms of the size of its enrolment at the Grade Six level. The
variable takes two values: ‘S’ (a ‘small’ school) and ‘L’ (a ‘large’
school).

• Schlprog (Columns 39-41). This variable differentiates between


school programs according to whether schools are regular
primary schools or special schools designed for handicapped
students. The variable takes two values ‘Reg’ (regular schools)
and ‘Spe’ (special schools).

• Enrolg6 (Columns 43-45). This variable describes the Grade Six


enrolment for each Zimbabwe primary school in 1991.
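Readers who wish to inspect the listing outside SAMDEM can read Zimbabwe.dat as a fixed-width text file using the column positions given above. The following Python sketch is an illustration only; it assumes that the columns are exactly as described (counted from 1 and inclusive), that the file contains one plain-text record per school, and that the dictionary keys simply reuse the variable names of the listing.

def parse_school_record(line):
    """Split one fixed-width record of the school listing into named fields,
    using the column positions described above (1-based and inclusive)."""
    return {
        "Schlname": line[0:15].strip(),   # columns 1-15
        "Region":   line[16:22].strip(),  # columns 17-22
        "District": line[23:29].strip(),  # columns 24-29
        "Schltype": line[30:33].strip(),  # columns 31-33 ('Gov' or 'Non')
        "Schllocn": line[34:35],          # column 35     ('R' or 'U')
        "Schlsize": line[36:37],          # column 37     ('L' or 'S')
        "Schlprog": line[38:41].strip(),  # columns 39-41 ('Reg' or 'Spe')
        "Enrolg6":  int(line[42:45]),     # columns 43-45 (Grade Six enrolment)
    }

with open("Zimbabwe.dat") as data_file:
    schools = [parse_school_record(record) for record in data_file if record.strip()]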


Step 4
Use the listing of schools in the desired target
population to prepare a tabular description of
the desired target population, the defined target
population, and the excluded population

EXERCISE C

Use the SAMDEM software to prepare frequency distributions for the


desired, defined, and excluded populations. Then complete Table 7.

Table 7    Schools and students in the desired, defined and excluded
           populations (Zimbabwe Grade Six 1991)

            Desired              Defined              Excluded:           Excluded:
                                                      Special schools     Small schools
            Schools   Students   Schools   Students   Schl     Stdt       Schl     Stdt

            4487      283781     -         -          -        -          65       393
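Continuing the illustrative sketch begun in Step 3, the school and student counts requested in Exercise C can be tallied by applying the Step 2 definitions to the parsed school list: regular schools with at least 10 Grade Six students belong to the defined population, while special schools and schools with fewer than 10 Grade Six students are excluded. The sketch below assumes the 'schools' list created in the Step 3 sketch.

from collections import Counter

def classify(school):
    """Assign a school in the desired population to the defined or excluded
    population, following the Step 2 definitions."""
    if school["Schlprog"] == "Spe":
        return "excluded: special school"
    if school["Enrolg6"] < 10:
        return "excluded: fewer than 10 Grade Six students"
    return "defined"

school_counts, student_counts = Counter(), Counter()
for school in schools:                       # 'schools' comes from the Step 3 sketch
    group = classify(school)
    school_counts[group] += 1
    student_counts[group] += school["Enrolg6"]

print(school_counts)                         # numbers of schools in each group
print(student_counts)                        # numbers of students in each group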


Step 5
Select the stratification variables

EXERCISE D

Use the names of the stratification variables presented in Step 1


to complete the following list.

Stratum_1: Harare Region Large Schools


Stratum_2: Harare Region Small Schools
Stratum_3: Manica Region Large Schools
Stratum_4: Manica Region Small Schools
..................................
..................................
..................................
..................................
..................................
..................................
..................................
..................................
..................................
..................................
..................................
..................................
Stratum_17: Midlnd Region Large Schools
Stratum_18: Midlnd Region Small Schools


Step 6
Apply the stratification variables to the desired,
defined, and excluded population

EXERCISE E

Use the SAMDEM software to prepare frequency distributions for


the desired, defined, and excluded populations – separately for each
stratum. Then complete Table 8.

Table 8    The desired, defined, and excluded populations
           (Zimbabwe Grade Six 1991)

Stratum (No, Region, Sch. size) | Desired (Sch., Stu.) | Defined (Sch., Stu.) |
Excluded Special (Sch., Stu.) | Excluded Very-small (Sch., Stu.)

01 Harare Large 140 25054 140 25054 0 0 0 0


02 Harare Small 82 3367 72 3298 8 61 2 8
03 Manica Large - - - - - - - -
04 Manica Small - - - - - - - -
05 Mascen Large - - - - - - - -
06 Mascen Small - - - - - - - -
07 Masest Large - - - - - - - -
08 Masest Small - - - - - - - -
09 Maswst Large - - - - - - - -
10 Maswst Small - - - - - - - -
11 Matnor Large - - - - - - - -
12 Matnor Small - - - - - - - -
13 Matsou Large - - - - - - - -
14 Matsou Small - - - - - - - -
15 Maving Large - - - - - - - -
16 Maving Small - - - - - - - -
17 Midlnd Large - - - - - - - -
18 Midlnd Small - - - - - - - -

Zimbabwe 4487 283781 4399 23 275 65 393


Step 7
Establish the allocation of the sample across strata

EXERCISE F

Use the SAMDEM software to establish a proportionate allocation of


the sample across the strata. Then complete Table 9.

Table 9    The sample allocation for the defined population
           (Zimbabwe Grade Six 1991)

Stratum (No, Region, Sch. size) | Defined students (Number, Pct.) |
Sample schools (Exact, Planned) | Sample students (Number, Pct.)

01 Harare Large 25054 8.8 11.9 12 240 8.9


02 Harare Small 3298 1.2 1.6 2 40 1.5
03 Manica Large - - - - - -
04 Manica Small - - - - - -
05 Mascen Large - - - - - -
06 Mascen Small - - - - - -
07 Masest Large - - - - - -
08 Masest Small - - - - - -
09 Maswst Large - - - - - -
10 Maswst Small - - - - - -
11 Matnor Large - - - - - -
12 Matnor Small - - - - - -
13 Matsou Large - - - - - -
14 Matsou Small - - - - - -
15 Maving Large - - - - - -
16 Maving Small - - - - - -
17 Midlnd Large - - - - - -
18 Midlnd Small - - - - - -

Zimbabwe Total 283113 100.0 134.0 135 2700 100.0
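The arithmetic behind each completed row of Table 9 can be sketched as follows. This is a simplified illustration rather than the SAMDEM algorithm itself: it assumes a total of 134 schools (from the sample design tables), a cluster size of 20 students, and a rounding rule that takes each stratum's exact allocation up to the next whole school, which would explain why the planned totals (135 schools and 2700 students) slightly exceed the exact total of 134. Only the two Harare strata shown above are included.

import math

total_schools  = 134       # from the sample design tables (roh = 0.3, b = 20)
cluster_size   = 20        # students to be selected per school
total_students = 283113    # Grade Six students in the defined population

defined_students = {       # defined-population enrolment per stratum
    "01 Harare Large": 25054,
    "02 Harare Small": 3298,
}

for stratum, students in defined_students.items():
    exact   = total_schools * students / total_students   # proportionate allocation
    planned = math.ceil(exact)                             # round up to whole schools
    print(stratum, round(exact, 1), planned, planned * cluster_size)
# -> 01 Harare Large 11.9 12 240
# -> 02 Harare Small 1.6 2 40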


Step 8 (an optional step)


Prepare tabular displays of students and schools in
the defined target population – broken down by
School Size, Location, and Type

EXERCISE G

Use the SAMDEM software to prepare ‘marker variable’ information.


Complete Tables 10 and 11.

Table 10 Schools with students in the defined target population listed by school
size, school location, and school type (Zimbabwe Grade Six 1991)

Stratum (No, Region, Sch. size) | School size (Large, Small) |
School location (Rural, Urban) | School type (Govt, Non-Govt) | Stratum Total

01 Harare Large 140 0 8 132 99 41 140


02 Harare Small 0 72 26 46 17 55 72
03 Manica Large - - - - - - -
04 Manica Small - - - - - - -
05 Mascen Large - - - - - - -
06 Mascen Small - - - - - - -
07 Masest Large - - - - - - -
08 Masest Small - - - - - - -
09 Maswst Large - - - - - - -
10 Maswst Small - - - - - - -
11 Matnor Large - - - - - - -
12 Matnor Small - - - - - - -
13 Matsou Large - - - - - - -
14 Matsou Small - - - - - - -
15 Maving Large - - - - - - -
16 Maving Small - - - - - - -
17 Midlnd Large - - - - - - -
18 Midlnd Small - - - - - - -

Zimbabwe Total 1142 3257 4036 363 270 4129 4399


Table 11 Students in the defined target population listed by school size, school
location, and school type (Zimbabwe Grade Six 1991)

Stratum (No, Region, Sch. size) | School size (Large, Small) |
School location (Rural, Urban) | School type (Govt, Non-Govt) | Stratum Total

01 Harare Large 25054 0 1175 23879 17798 7256 25054


02 Harare Small 0 3298 1062 2236 913 2385 3298
03 Manica Large - - - - - - -
04 Manica Small - - - - - - -
05 Mascen Large - - - - - - -
06 Mascen Small - - - - - - -
07 Masest Large - - - - - - -
08 Masest Small - - - - - - -
09 Maswst Large - - - - - - -
10 Maswst Small - - - - - - -
11 Matnor Large - - - - - - -
12 Matnor Small - - - - - - -
13 Matsou Large - - - - - - -
14 Matsou Small - - - - - - -
15 Maving Large - - - - - - -
16 Maving Small - - - - - - -
17 Midlnd Large - - - - - - -
18 Midlnd Small - - - - - - -

Zimbabwe Total 139642 143471 239428 43685 35763 247350 283113


Step 9
For schools with students in the defined target
population, prepare a separate list of schools for
each stratum with ‘pseudoschools’ identified with a
bracket ( [ )

EXERCISE H

Use the SAMDEM software to list the schools having students in the
defined target population and located in ‘stratum 2: Harare region small
schools’.

Step 10
For the defined target population, assign ‘lottery
tickets’ such that each school receives a number of
tickets that is equal to the number of students in
the defined target population

EXERCISE I

Use the SAMDEM software to assign ‘lottery tickets’ for schools having
students in the defined target population and located in ‘stratum 2:
Harare region small schools’. Repeat this exercise for one other stratum.


Step 11
Select the sample of schools.

EXERCISE J

Use the SAMDEM software to select the sample of schools from
schools having students in the defined target population and located in
‘stratum 2: Harare region small schools’. Repeat this exercise for one
other stratum.

Step 12
Use a table of random numbers to select a simple
random sample of 20 students in each school.

EXERCISE K

Use the random number tables (see Appendix 1) to identify student
selection numbers for the schools that were selected from ‘stratum 2:
Harare region small schools’. Repeat this exercise for one other stratum.
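Steps 10 to 12 can also be illustrated in code. The sketch below is a rough approximation of the ‘lottery ticket’ procedure rather than a description of what SAMDEM actually does: schools receive tickets equal to their Grade Six enrolment and are selected with probability proportional to size using a random start and a fixed interval on the cumulative ticket list, after which 20 students are drawn at random within each selected school (Python's random module stands in for the printed random number tables of Appendix 1).

import random

def select_schools_pps(schools_in_stratum, a_h):
    """Systematic PPS selection: give each school 'lottery tickets' equal to
    its Grade Six enrolment and take a_h equally spaced hits on the
    cumulative ticket list, starting from a random point."""
    tickets  = sum(s["Enrolg6"] for s in schools_in_stratum)
    interval = tickets / a_h
    start    = random.uniform(0, interval)
    hits     = [start + k * interval for k in range(a_h)]
    selected, cumulative, next_hit = [], 0, 0
    for school in schools_in_stratum:
        cumulative += school["Enrolg6"]
        while next_hit < a_h and hits[next_hit] <= cumulative:
            selected.append(school)   # a school larger than the interval could be hit twice
            next_hit += 1
    return selected

def select_students(enrolment, n=20):
    """Simple random sample of n student positions from a school of the given
    enrolment (the tables in Appendix 1 serve the same purpose by hand)."""
    return sorted(random.sample(range(1, enrolment + 1), n))

# Example: 'stratum 2, Harare region small schools' requires 2 schools of 20 students.
# (Assumes pseudoschools have already been formed, so every unit has at least 20 students.)
harare_small = [s for s in schools if s["Region"] == "Harare" and s["Schlsize"] == "S"]
for school in select_schools_pps(harare_small, 2):
    print(school["Schlname"], select_students(school["Enrolg6"]))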


4 The estimation of sampling errors
The computational formulae required to estimate the variance of
descriptive statistics, such as sample means, are widely available
for simple random sample designs and some sample designs that
incorporate complexities such as stratification and cluster sampling.
However, in the case of more complex analytical statistics, such
as correlation coefficients and regression coefficients, the required
formulae are not readily available for sample designs which depart
from the model of simple random sampling. These formulae are
either enormously complicated or, ultimately, they prove resistant to
mathematical analysis (Frankel, 1971).

In the absence of suitable formulae, a variety of empirical


techniques have emerged in recent years which provide
“approximate variances that appear satisfactory for practical
purposes” (Kish, 1978:20). The most frequently applied empirical
techniques may be divided into two broad categories: Subsample
Replication and Taylor’s Series Approximation.

In Subsample Replication a total sample of data is used to construct


two or more subsamples and then a distribution of parameter
estimates is generated by using each subsample. The subsample
results are analyzed to obtain an estimate of the parameter, as well
as a confidence assessment for that estimate (Finifter, 1972). The
main approaches in using this technique have been Independent
Replication (Deming, 1960), Jackknifing (Tukey, 1958), and
Balanced Repeated Replication (McCarthy, 1966).



Alternatively, Taylor’s Series Approximation may be used to provide
a more ‘direct’ method of variance estimation than these three
approaches. In the absence of an exact formula for the variance, the
Taylor’s Series is used to approximate a numerical value for the first
few terms of a series expansion of the variance formula. A number
of computer programs have been prepared in order to carry out the
extensive numerical calculations required for this approach (Wolter,
1985, Appendix ).

In the remainder of this section the Jackknife procedure has been


described. This procedure offers the following two important
benefits.

• Greater Flexibility
The Jackknife may be applied to a wide variety of sample
designs whereas (i) Balanced Repeated Replication is designed
for application to sample designs that have precisely two
primary sampling units per stratum, and (ii) Independent
Replication requires a large number of selections per stratum
so that a reasonably large number of independent replicated
samples can be formed.

• Ease of Use
The Jackknife does not require specialized software systems
whereas (i) Balanced Repeated Replication usually requires
the prior establishment by computer of complex Hadamard
matrices that are used to ‘balance’ the half samples of data that
form replications of the original sample, and (ii) Taylor’s Series
methods require specific software routines to be available for
each statistic under consideration.


The Jackknife Procedure

The development of the Jackknife procedure may be traced back to


a method used by Quenouille (1956) to reduce the bias of estimates.
Further refinement of the method (Mosteller and Tukey, 1968) led to
its application in a range of social science situations where formulae
are not readily available for the calculation of sampling errors.

The Jackknife procedure requires that an initial estimate, yall, of a


statistic, y, be made on the total sample of data. The total sample is
then divided into k subgroups and yi is computed, using the same
functional form as yall, but based on the reduced sample of data
obtained by omitting the ith subgroup. Then k ‘pseudovalues’ yi*
(i=1,2,...,k) can be defined – based on the k reduced samples:

\[ y_i^{*} \;=\; k\,y_{all} \;-\; (k-1)\,y_i \qquad (i = 1, 2, \ldots, k) \]

Quenouille’s estimator (also called the ‘Jackknife value’) is the mean


of the k pseudovalues:

\[ y^{*} \;=\; \frac{1}{k}\sum_{i=1}^{k} y_i^{*} \]

Quenouille’s contribution was to show that, while yall may have bias
of order 1/n as an estimate of y, the Jackknife value, y*, has bias of
order 1/n2.


The variance of y* may be estimated from the pseudovalues.


\[ \operatorname{var}(y^{*}) \;=\; \frac{1}{k(k-1)}\sum_{i=1}^{k}\left(y_i^{*} - y^{*}\right)^{2} \]
Tukey (1958) set forward the proposal that the pseudovalues could
be treated as if they were approximately independent observations
and that Student’s t distribution could be applied to these estimates
in order to construct confidence intervals for y *. Later empirical
work conducted by Frankel (1971) provided support for these
proposals when the Jackknife technique was applied to complex
sample designs and a variety of simple and complex statistics.

Substituting for yi* in the expression for var(y*) permits the variance
of y* to be estimated from the k subsample estimates, yi, and their
mean – without the need to calculate pseudovalues.

\[ \operatorname{var}(y^{*}) \;=\; \frac{k-1}{k}\sum_{i=1}^{k}\left(y_i - \bar{y}\right)^{2}, \qquad \text{where } \bar{y} = \frac{1}{k}\sum_{i=1}^{k} y_i \]

Wolter (1985, p. 156) has shown that replacing y* by yall in the right
hand side of the first expression for var(y *) given above provides a
conservative estimate of var(y*) – the overestimate being equal to
(yall-y*)2/(k-1).


In practice, these expressions for var(y*) have also been used to


estimate the variance not only of Quenouille’s estimator y*, but also
of yall (Wolter, 1985, pp. 155, 172).

Substituting yall for y* provides the following estimator for the


variance of y* (and also a conservative estimator for yall).

\[ \operatorname{var}(y^{*}) \;\approx\; \frac{k-1}{k}\sum_{i=1}^{k}\left(y_i - y_{all}\right)^{2} \]
Wolter (1985, p.180) and Rust (1985) have presented an extension
of these formulae for complex stratified sample designs in which
there are kh primary sampling units in the hth stratum (where h =
1,2,...,H). In this case, the formula for the variance of yall employs yhi
to denote the estimator derived from the same functional form as
yall – calculated after deleting the ith primary sampling unit from
the hth stratum.

\[ \operatorname{var}(y_{all}) \;=\; \sum_{h=1}^{H}\frac{k_h - 1}{k_h}\sum_{i=1}^{k_h}\left(y_{hi} - y_{all}\right)^{2} \]

where K = Σ kh is the total number of samples that are formed.

In many educational research studies, schools are used as the


primary sampling units, and therefore the K estimates of yhi are
obtained by removing one school at a time from the total sample
and then applying the same functional form used to estimate yall for
each reduced sample. In studies where geographical areas are used


as the primary sampling units, followed by the selection of more


than one school per area, the estimates of yhi are based on reduced
samples formed by omitting one geographical area at a time.
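As an illustration of the procedure described in this section, the sketch below computes a jackknife variance for the simplest possible case: the statistic is an unweighted mean of student scores, the schools are the primary sampling units in a single stratum, and each replicate estimate is obtained by dropping one school at a time. It applies the conservative formula var(y*) ≈ ((k − 1)/k) Σ (y_i − y_all)² given above; a real survey analysis would also re-weight the remaining schools when one is removed, which this sketch ignores.

def jackknife_variance(school_scores):
    """Jackknife variance of the overall mean when schools are the primary
    sampling units.  school_scores is a list of lists: one list of student
    scores per school."""
    all_scores = [score for school in school_scores for score in school]
    y_all = sum(all_scores) / len(all_scores)            # full-sample estimate
    k = len(school_scores)                               # number of schools
    replicates = []
    for dropped in range(k):                             # re-estimate with one school removed
        kept = [score for j, school in enumerate(school_scores) if j != dropped
                      for score in school]
        replicates.append(sum(kept) / len(kept))
    return (k - 1) / k * sum((y_i - y_all) ** 2 for y_i in replicates)

# Three illustrative 'schools' of student scores.
print(jackknife_variance([[10, 12, 14], [9, 11, 13], [15, 16, 17]]))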


5 References
Brickell, J. L. (1974). “Nominated samples from public schools and statistical bias”. American Educational Research Journal, 11(4), 333-341.

Deming, W. E. (1960). Sample design in business research. New York: Wiley.

Finifter, B. M. (1972). “The generation of confidence: Evaluating research findings by random subsample replication”. In H. L. Costner (Ed.), Sociological Methodology. San Francisco: Jossey-Bass.

Frankel, M. R. (1971). Inference from survey samples. Ann Arbor, Michigan: Institute for Social Research.

Haggard, E. A. (1958). Intraclass correlation and the analysis of variance. New York: Dryden.

Hansen, M. H.; Hurwitz, W. N.; Madow, W. G. (1953). Sample survey methods and theory (Vols. 1 & 2). New York: Wiley.

International Association for the Evaluation of Educational Achievement (IEA) (1986). Technical committee meeting: sample design and sample sizes. Paper distributed at the Annual General Meeting of the IEA, Stockholm, Sweden.

Jessen, R. L. (1978). Statistical survey techniques. New York: Wiley.

Kalton, G. (1983). Introduction to survey sampling (Sage Series in Quantitative Applications in the Social Sciences, No. 35). Beverly Hills, CA: Sage.

Kish, L. (1965). Survey sampling. New York: Wiley.

Kish, L. (1978). “On the future of survey sampling”. In N. K. Namboodiri (Ed.), Survey sampling and measurement. New York: Academic Press.

McCarthy, P. J. (1966). Replication: An approach to the analysis of data from complex surveys. Washington: United States National Centre for Health Statistics.

Mosteller, F. and Tukey, J. (1968). “Data analysis including statistics”. In G. Lindzey and E. Aronson (Eds.), The handbook of social psychology (2nd ed.). Reading, Massachusetts: Addison-Wesley.

Quenouille, M. J. (1956). “Notes on bias in estimation”. Biometrika, 43, 353-360.

Ross, K. N. (1976). Searching for uncertainty: Sampling errors in educational survey research. Hawthorn, Victoria: Australian Council for Educational Research.

Ross, K. N. (1978). “Sample design for educational survey research” (Monograph). Evaluation in Education, 2, 105-195.

Ross, K. N. (1987). “Sample design”. International Journal of Educational Research, 11(1), 57-75.

Ross, K. N. (1991). Sampling manual for the IEA International Study of Reading Literacy. Hamburg: IEA International Study of Reading Literacy Coordinating Centre.

Ross, K. N. and Postlethwaite, T. N. (1992). Indicators of the quality of education: A national study of primary schools in Zimbabwe. Paris: International Institute for Educational Planning.

Rust, K. (1985). “Variance estimation for complex estimators in sample surveys”. Journal of Official Statistics, 1(4), 381-397.

Sylla, K.; Saito, M. and Ross, K. (2004). SAMDEM: Sample Design Manager. Paris: International Institute for Educational Planning.

Tukey, J. W. (1958). “Bias and confidence in not-quite large samples” (Abstract). Annals of Mathematical Statistics, 29, 614.

Wolf, R. M. (1977). Achievement in America: National report of the United States for the International Educational Achievement Project. New York: Teachers College Press.

Wolter, K. M. (1985). Introduction to variance estimation. New York: Springer-Verlag.



Appendix 1

Random number tables for


selecting a simple random
sample of twenty students
from groups of students
of size 21 to 100
Size of Group

R21 R22 R23 R24 R25 R26 R27 R28 R29 R30

1 1 1 2 1 1 1 1 1 4
2 2 2 3 2 2 2 2 2 6
3 3 3 4 3 3 3 3 3 7
4 5 5 5 4 5 4 4 4 8
5 6 6 7 5 6 6 6 5 10
6 7 7 8 6 7 7 7 6 12
7 8 9 9 7 8 8 8 9 15
8 9 11 10 9 9 11 11 10 16
9 10 12 11 11 10 12 12 11 17
10 11 13 12 12 12 13 13 12 18
11 12 14 13 13 13 15 15 14 19
12 13 15 15 14 14 17 17 16 20
13 14 16 16 15 15 18 18 21 22
14 15 17 17 16 16 19 19 22 23
16 16 18 18 18 18 20 20 23 24
17 17 19 19 20 19 21 21 24 25
18 18 20 21 21 21 22 22 25 26
19 20 21 22 23 22 25 25 27 27
20 21 22 23 24 25 27 27 28 29
21 22 23 24 25 26 28 28 29 30


Size of Group

R31 R32 R33 R34 R35 R36 R37 R38 R39 R40
1 4 1 2 1 4 1 1 1 2
4 6 2 3 2 5 3 4 2 3
5 7 4 4 7 7 6 7 3 5
8 9 5 5 8 8 7 9 5 6
10 10 7 6 9 10 8 11 6 8
11 11 10 710 11 13 9 12 7 12
14 12 12 12 14 16 10 14 8 13
15 13 13 13 15 17 11 16 12 16
16 14 14 14 16 19 12 17 14 17
17 15 16 16 17 21 13 18 15 18
19 16 18 18 21 22 17 19 17 23
20 17 19 19 22 24 18 22 20 24
22 21 21 21 23 25 19 23 24 25
23 23 23 23 24 26 24 24 26 31
24 24 25 25 25 27 25 25 27 33
25 25 26 26 26 28 28 26 28 34
27 26 28 28 32 29 29 28 29 37
29 28 31 31 33 31 30 31 31 38
30 30 32 32 34 33 34 32 36 39
31 31 33 33 35 36 36 37 38 40

Size of Group

R41 R42 R43 R44 R45 R46 R47 R48 R49 R50
1 2 1 1 3 1 3 1 1 3
4 3 2 3 5 11 5 3 2 4
5 4 6 4 7 12 10 6 4 5
7 5 11 6 8 15 17 15 6 6
8 6 13 7 10 17 24 19 8 7
10 8 14 8 11 21 25 20 13 8
12 10 15 9 12 22 27 24 15 11
13 13 17 11 16 23 28 26 17 13
15 16 20 13 19 24 30 27 20 14
17 20 25 15 23 25 31 29 23 16
20 21 26 18 24 26 32 30 24 20
21 23 28 19 25 28 33 31 27 22
25 24 31 24 32 29 34 32 31 24
27 27 32 25 34 32 36 34 32 29
28 29 33 32 35 34 37 36 34 32
29 31 37 35 37 36 38 37 37 33
32 34 38 38 38 40 39 38 41 36
34 35 39 40 40 41 42 40 43 37
35 36 40 43 43 44 44 44 44 40
39 41 42 44 44 45 45 48 47 45


Size of Group

R51 R52 R53 R54 R55 R56 R57 R58 R59 R60
3 4 3 1 1 2 2 3 3 1
4 5 6 2 6 5 6 5 4 5
8 6 8 4 7 7 7 6 5 6
9 10 9 8 9 10 10 8 8 7
10 13 11 10 10 12 17 16 10 9
12 17 12 11 12 15 19 17 11 11
13 18 14 16 14 18 23 21 15 14
15 19 15 20 16 24 29 23 16 16
17 20 18 24 29 25 34 27 17 19
26 22 19 25 31 27 35 32 20 21
27 25 23 27 36 32 37 37 24 24
29 27 26 28 38 34 38 38 26 28
31 28 28 29 40 38 43 42 29 29
32 32 33 32 42 43 44 44 31 39
36 35 38 37 44 45 46 46 36 40
39 40 42 39 47 47 48 48 38 49
40 42 43 41 48 49 52 49 42 50
44 49 48 46 52 50 53 50 48 51
47 51 51 47 53 52 55 52 50 52
48 52 53 49 54 55 57 56 53 54

Size of Group

R61 R62 R63 R64 R65 R66 R67 R68 R69 R70
2 9 8 5 3 1 4 2 6 5
5 12 12 8 4 2 5 4 9 6
7 14 14 10 7 8 6 5 11 7
8 22 15 17 10 9 10 10 14 9
11 23 16 19 14 11 14 16 23 11
18 25 18 21 18 19 17 17 24 13
19 227 24 25 19 22 18 21 29 17
21 28 32 26 20 23 20 22 31 18
22 29 33 27 28 25 21 23 36 21
27 33 34 29 37 26 25 28 40 22
30 41 38 33 40 27 29 31 44 36
34 42 39 35 42 30 30 32 48 39
35 43 40 37 46 28 34 33 49 41
39 46 42 39 47 46 37 37 50 47
46 50 45 45 49 50 39 42 52 49
48 51 46 48 51 52 41 49 55 50
49 53 48 49 58 53 47 55 59 53
50 57 54 54 59 54 57 58 60 55
53 59 61 58 61 61 58 61 67 61
54 62 63 61 65 66 63 67 68 67


Size of Group

R71 R72 R73 R74 R75 R76 R77 R78 R79 R80
9 3 1 4 2 6 3 12 8 6
15 6 5 15 7 20 5 13 10 10
17 10 9 19 9 21 6 23 14 17
20 12 11 23 15 22 7 24 15 19
22 18 12 24 16 24 10 28 21 21
25 20 13 27 23 25 16 32 23 23
27 22 14 28 29 30 24 46 31 26
28 26 21 35 36 31 31 47 32 31
34 29 28 38 37 35 32 48 24 32
35 32 29 42 41 37 36 49 41 36
37 38 34 47 43 39 38 52 46 41
39 41 37 49 45 50 45 53 48 42
46 43 42 50 46 51 48 57 54 44
48 44 48 51 50 58 62 59 58 45
50 47 54 52 53 61 64 63 61 46
53 48 57 56 60 63 70 64 62 50
60 50 62 62 66 65 71 67 68 61
61 52 63 64 67 67 72 70 71 71
62 61 68 65 69 68 73 75 75 73
64 66 72 72 72 73 75 76 77 76

Size of Group

R81 R82 R83 R84 R85 R86 R87 R88 R89 R90
9 9 2 4 3 1 1 6 2 1
13 15 4 6 6 8 3 8 12 2
16 16 14 7 7 13 7 16 14 5
33 23 17 13 25 24 14 23 15 10
40 27 30 14 27 34 16 26 39 22
41 28 35 19 30 35 17 29 48 24
42 29 41 25 32 39 19 32 52 25
44 33 42 31 36 45 20 35 56 27
45 42 49 34 45 47 26 42 57 31
46 43 52 39 48 50 28 43 60 35
54 48 53 40 50 52 30 48 62 40
59 50 58 44 54 61 40 50 64 46
64 56 63 59 58 64 53 55 66 51
71 57 67 62 63 65 54 59 67 53
72 60 69 70 64 66 60 63 68 54
73 61 71 73 66 67 66 64 70 73
74 66 74 74 68 68 77 70 75 81
75 67 76 77 72 75 80 74 76 83
76 69 77 83 83 80 83 81 88 85
79 71 80 84 85 83 86 86 89 90


Size of Group

R91 R92 R93 R94 R95 R96 R97 R98 R99 R100
7 8 1 1 7 2 3 8 3 7
15 15 7 8 9 4 10 11 11 21
20 17 9 9 12 6 26 17 20 22
29 21 19 10 14 9 33 23 29 25
36 25 20 14 18 16 37 25 30 27
38 26 21 15 24 18 39 31 34 35
42 31 22 20 32 21 40 36 38 40
43 34 30 23 35 28 41 38 42 49
49 40 31 36 46 32 48 42 50 50
50 41 32 46 49 47 50 48 51 52
54 49 35 48 54 50 51 54 53 56
58 54 36 57 60 51 53 58 54 57
59 61 40 58 63 63 61 63 60 75
67 69 46 60 64 64 62 70 62 79
73 71 51 61 69 70 64 71 63 81
80 78 62 72 78 73 65 78 68 86
84 79 72 79 87 78 68 89 79 87
85 81 73 81 88 80 70 91 92 92
87 83 76 87 92 89 82 92 94 93
91 84 89 91 95 92 97 93 99 94


Appendix 2

Sample design tables (for roh values of 0.1 to 0.9)

95% confidence limits for means/percentages


Cluster size
±0.05s/±2.5% ±0.1s/±5.0% ±0.15s/±7.5% ±0.2s/±10.0%
b a n a n a n a n
roh = 0.1
1 (SRS) 1600 1600 400 400 178 178 100 100
2 880 1760 220 440 98 196 55 110
5 448 2240 112 560 50 250 28 140
10 304 3040 76 760 34 340 19 190
15 256 3840 64 960 29 435 16 240
20 232 4640 58 1160 26 520 15 300
30 208 6240 52 1560 24 720 13 390
40 196 7840 49 1960 22 880 13 520
50 189 9450 48 2400 21 1050 12 600
roh = 0.2
1 (SRS) 1600 1600 400 400 178 178 100 100
2 960 1920 240 480 107 214 60 120
5 576 2880 144 720 65 325 36 180
10 448 4480 112 1120 50 500 28 280
15 406 6090 102 1530 46 690 26 390
20 384 7680 96 1920 43 860 24 480
30 363 10890 91 2730 41 1230 23 690
40 352 14080 88 3520 40 1600 22 880
50 346 17300 87 4350 39 1950 22 1100
roh = 0.3
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1040 2080 260 520 116 232 65 130
5 704 3520 176 880 79 395 44 220
10 592 5920 148 1480 66 660 37 370
15 555 8325 139 2085 62 930 35 525
20 536 10720 134 2680 60 1200 34 680
30 518 15540 130 3900 58 1740 33 990
40 508 20320 127 5080 57 2280 32 1280
50 503 25150 126 6300 56 2800 32 1600



95% confidence limits for means/percentages
Cluster size
±0.05s/±2.5% ±0.1s/±5.0% ±0.15s/±7.5% ±0.2s/±10.0%
b a n a n a n a n
roh = 0.4
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1120 2240 280 560 125 250 70 140
5 832 4160 208 1040 93 465 52 260
10 736 7360 184 1840 82 820 46 460
15 704 10560 176 2640 79 1185 44 660
20 688 13760 172 3440 77 1540 43 860
30 672 20160 168 5040 75 2250 42 1260
40 664 26560 166 6640 74 2960 42 1680
50 660 33000 165 8250 74 3700 42 2100
roh = 0.5
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1200 2400 300 600 134 268 75 150
5 960 4800 240 1200 107 535 60 300
10 880 8800 220 2200 98 980 55 550
15 854 12810 214 3210 95 1425 54 810
20 840 16800 210 4200 94 1880 53 1060
30 827 24810 207 6210 92 2760 52 1560
40 820 32800 205 8200 92 3680 52 2080
50 816 40800 204 10200 91 4550 51 2550
roh = 0.6
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1280 2560 320 640 143 286 80 160
5 1088 5440 272 1360 122 610 68 340
10 1024 10240 256 2560 114 1140 64 640
15 1003 15045 251 3765 112 1680 63 945
20 992 19840 248 4960 111 2220 62 1240
30 982 29460 246 7380 110 3300 62 1860
40 976 39040 244 9760 109 4360 61 2440
50 973 48650 244 12200 109 5450 61 3050


95% confidence limits for means/percentages


Cluster size
±0.05s/±2.5% ±0.1s/±5.0% ±0.15s/±7.5% ±0.2s/±10.0%
b a n a n a n a n
roh = 0.7
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1360 2720 340 680 152 304 85 170
5 1216 6080 304 1520 136 680 76 380
10 1168 11680 292 2920 130 1300 73 730
15 1152 17280 288 4320 129 1935 72 1080
20 1144 22880 286 5720 128 2560 72 1440
30 1136 34080 284 8520 127 3810 71 2130
40 1132 45280 283 11320 126 5040 71 2840
50 1130 56500 283 14150 126 6300 71 3550
roh = 0.8
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1440 2880 360 720 161 322 90 180
5 1344 6720 336 1680 150 750 84 420
10 1312 13120 328 3280 146 1460 82 820
15 1302 19530 326 4890 145 2175 82 1230
20 1296 25920 324 6480 145 2900 81 1620
30 1291 38730 323 9690 144 4320 81 2430
40 1288 51520 322 12880 144 5760 81 3240
50 1287 64350 322 16100 144 7200 81 4050
roh = 0.9
1 (SRS) 1600 1600 400 400 178 178 100 100
2 1520 3040 380 760 170 340 95 190
5 1472 7360 368 1840 164 820 92 460
10 1456 14560 364 3640 162 1620 91 910
15 1451 21765 363 5445 162 2430 91 1365
20 1448 28960 362 7240 162 3240 91 1820
30 1446 43380 362 10860 161 4830 91 2730
40 1444 57760 361 14440 161 6440 91 3640
50 1444 72200 361 18050 161 8050 91 4550



Appendix 3

Estimation of the coefficient of intraclass correlation

The coefficient of intraclass correlation (roh) was developed earlier this


century in connection with studies carried out to measure ‘fraternal
resemblance’ – such as in the calculation of the correlation between the
heights of brothers. To establish this correlation there was generally
no reason for ordering the pairs of measurements obtained from any
two brothers. Therefore the approach used was to calculate a product-
moment correlation coefficient from a symmetrical table of measures
consisting of two interchanged entries for each pair of brothers. In the
days before computers, these calculations became extremely arduous
when large numbers of brothers from large families were being studied.
Some computationally simpler formulae for calculating estimates of this
coefficient were eventually developed by several statisticians (Haggard,
1958). These formulae were based either on analysis of variance
methods, or on the calculation of the variance of elements and the
variance of the means of groups, or clusters, of elements.

In the field of educational survey research, the clusters of elements


which define primary sampling units often refer to students that are
grouped into schools. The value of roh then provides a measure of the
tendency for students to be more similar within schools (on some given
characteristic such as achievement test scores) than would be the case
if students had been allocated at random among schools. The estimate
of roh is prepared by using schools as ‘experimental treatments’ in
calculating the between-clusters mean square (BCMS) and the within-
clusters mean square (WCMS). The term b refers to the number of
students per school.


\[ \text{estimated roh} \;=\; \frac{BCMS - WCMS}{BCMS + (b-1)\,WCMS} \]

If the numerator and denominator on the right hand side of the above
expression are both divided by the value of WCMS, the estimated roh
can be expressed in terms of the value of the F statistic and the value
of b.

\[ \text{estimated roh} \;=\; \frac{F - 1}{F + (b-1)} \]

The application of the following alternative formula which is based


upon variance estimates for elements (students) and cluster means
(school means) has been discussed by Kish (1965: 176). The term sc2
is the variance of cluster means, and s2 is the variance of element
values.

\[ \text{estimated roh} \;=\; \frac{b\,s_c^{2} - s^{2}}{(b-1)\,s^{2}} \]

Note that, in situations where the number of elements per cluster varies,
the value of b is sometimes replaced by the value of the average cluster
size.
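For readers who wish to check the value of roh on their own data, the analysis-of-variance formula given above can be applied directly, as in the following Python sketch. It is an illustration only and assumes clusters of equal size (otherwise, as noted above, b is replaced by the average cluster size).

def estimate_roh(school_scores):
    """Estimate the coefficient of intraclass correlation from equal-sized
    clusters using roh = (BCMS - WCMS) / (BCMS + (b - 1) * WCMS)."""
    k = len(school_scores)                      # number of schools (clusters)
    b = len(school_scores[0])                   # students per school (assumed equal)
    grand_mean   = sum(sum(s) for s in school_scores) / (k * b)
    school_means = [sum(s) / b for s in school_scores]
    bcms = b * sum((m - grand_mean) ** 2 for m in school_means) / (k - 1)
    wcms = sum((x - m) ** 2
               for s, m in zip(school_scores, school_means)
               for x in s) / (k * (b - 1))
    return (bcms - wcms) / (bcms + (b - 1) * wcms)

# Scores that cluster strongly within schools give a roh close to 1.
print(estimate_roh([[10, 11, 12], [20, 21, 22]]))   # about 0.98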



Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana,


Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa,
Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed


by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

4
Module

Richard M. Wolf

Judging educational research


based on experiments and
surveys

UNESCO International Institute for Educational Planning


Quantitative research methods in educational planning

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible
for the educational policy research programme conducted by the Southern and
Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ).

The publication is available from the following two Internet Websites:


http://www.sacmeq.org and http://www.unesco.org/iiep.

International Institute for Educational Planning/UNESCO


7-9 rue Eugène-Delacroix, 75116 Paris, France
Tel: (33 1) 45 03 77 00
Fax: (33 1 ) 40 72 83 66
e-mail: information@iiep.unesco.org
IIEP web site: http://www.unesco.org/iiep

September 2005 © UNESCO

The designations employed and the presentation of material throughout the publication do not imply the expression of
any opinion whatsoever on the part of UNESCO concerning the legal status of any country, territory, city or area or of
its authorities, or concerning its frontiers or boundaries.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means: electronic, magnetic tape, mechanical, photocopying, recording or otherwise, without permission
in writing from UNESCO (International Institute for Educational Planning).

Graphic design: Sabine Lebeau


Typesetting: Sabine Lebeau
Printed in IIEP’s printshop
Module 4 Judging educational research based on experiments and surveys

Content
1. Introduction and purpose 1

2. Research as a way of knowing 4

3. Types of educational research 8

4. Association and causation 10

5. The main characteristics of experimental


and survey studies 12

6. Qualitative and quantitative studies 16

7. Experimental studies and some factors
that often threaten their validity 18
The basic structure of experimental studies 18
Validity of experimental studies 20
1. History 21
2. Maturation 21
3. Testing 22
4. Instrumentation 22
5. Statistical regression 23


6. Selection 24
7. Drop-out 24
8. Interaction 25
9. Diffusion of experimental treatment 26
10. Resentful demoralization of students receiving less desirable
treatments 26
11. Generalizability 27

8. Survey studies and some factors


that often threaten their validity 28
The basic structure of survey studies 28
The validity of survey studies 29
1. The scope of the data collection 30
2. The sample design 31
3. Instrumentation 35

9. Other issues that should be considered


when evaluating the quality of educational
research 39

10. A checklist for evaluating the quality


of educational research 42
1. Problem 44
2. Literature review 44
3. Hypotheses and/or questions 45
4. Design and method 45
5. Sampling 46
6. Measures 47
7. Statistics 47
8. Results 48
9. Discussion 48
10. Write-up 49


11. Summary and conclusions 51

Appendix
Adult Education Project: Thailand 53
Objectives of the evaluation project 54
Design 55
The construction of measures 62
Data collection 70
The participants’ achievement 74
Regression analyses 80
Conclusions and recommendations 89
References 93
Additional readings 94

Introduction and purpose 1

Educational planners are continually being asked to participate in,
and to provide information that can be used to guide, administrative
decisions. These decisions may range from developing a set of
detailed procedures for some aspect of an educational enterprise, to
a major reorganization of an entire education system. The quality of
such decisions is the major determinant of successful administrative
practice and, eventually, these decisions define the long-term nature
of educational organizations.

The quality of educational planning decisions depends, in turn,


on the quality of the information upon which they are based.
This information provides the best possible guidance for decision
making when it is based on sound educational research combined
with expertise derived from a comprehensive knowledge of the
innermost ‘workings’ of the education system. The purpose of this
module is to furnish educational planners with information on how
to read and judge research reports in education so that they can use
the information contained in them wisely.

While educational research reports provide an essential source


of information for making decisions, there are other sources of
information that educational planners need to consider. These are:
costs, local customs and tradition, the views of various individuals
who have a stake in an educational enterprise, governmental
policies, laws, and the like. In making almost any decision, an
educational administrator will need to consider research results
alongside issues associated with some or all of these other sources
of information.


Using the results of research in decision making is not an easy task.


First, the educational planner needs to be able to distinguish good
educational research from bad educational research. Currently,
a great deal of research is being carried out all over the world.
Some of it is of extremely high quality while some, unfortunately,
is unquestionably poor. The educational planner needs, first of
all, to be familiar with the key characteristics of research design
and execution that will permit valid judgements to be made
about research quality. Much of this module is directed towards
developing an understanding of these characteristics. Second,
just as most people have increasingly recognized the complexity
of education, so have educational researchers. Accordingly,
educational research has become more and more complex. This can
easily be seen when one compares current research reports with
those that were produced thirty or forty years ago. Contemporary
educational research studies generally consider far more variables
in a single study and employ more complex analytic procedures
than their counterparts of a generation or two ago. This makes
the task of reading and extracting information from research
reports much more difficult for the reader. While one welcomes
the more extensive understanding that has arisen from the
increasing sophistication of much modern educational research, the
problems that this sophistication creates for the reader must also be
acknowledged. This module aims to discuss a number of issues in
this area in an attempt to ameliorate such difficulties for the readers
of research reports.

Having described the major focus of this module, it seems equally


important to state what it is not intended to provide. It does not
purport to be a substitute for specific training in the planning and
conduct of research. Neither is it in any way a substitute for courses
in statistics. That kind of training is best acquired in university
courses and through gaining applied experience in those areas.
This module is not intended to be technical in nature. Accordingly,
discursive language has been used and specialized terminology


has been avoided wherever possible. When it has been necessary


to introduce technical concepts, these have been explained as fully
as possible and in as non-technical a way as ordinary language
will permit. No special training should therefore be needed to read
and understand the material presented throughout. A short list of
references and additional readings is presented at the end of this
module in association with references linked to research studies
that have been cited in the text.


2 Research as a way of
knowing
Research is a way of knowing the world and what happens in it. Its
application is governed by various principles, canons, and rules. It
can be contrasted with several other ways of knowing: authority,
tradition, and personal experience. Each of these has its limitations
and advantages.

Authority is a primary way of knowing used by many people. When


an individual uses a reference work to obtain information, he or
she is drawing on authority. Much of the education of the young is
directed at teaching them how to make wise use of authority. The
reason for this is very simple. It is too costly and inefficient for an
individual to go out and obtain all the information that one needs to
know through direct experience. It is also impossible. How is one to
know anything about history except through reliance on authority
in the form of published (written or oral) histories? Educational
planners also routinely make use of authority when they draw on
information in published works.

Tradition is a way of knowing in both developed and emerging


societies. All societies are guided by accumulated knowledge
of ‘what is’ and ‘how things are to be done’. Such knowledge is
often established through a process of trial and error. Methods of
agriculture and manufacturing are but two examples of knowledge
that is acquired over a long period of time and that serves as a
guide to human endeavours. Sometimes tradition provides a useful
guide to the conduct of human affairs and sometimes it does not.
For example, methods of crop cultivation may, in fact, have adverse

long-term consequences such as soil depletion despite apparent
short-term benefits. That is, the existence of traditional knowledge
does not necessarily ensure that it is also useful knowledge.
However, traditions can often be strong and educational planners
and policy-makers need to take them into account.

Personal experience has served as a guide to conduct throughout


human history. Virtually everyone relies on his or her own
experiences to make decisions about their own actions. While
personal experience can be a useful guide to behaviour, this is not
necessarily always the case. If one encounters truly novel situations,
there may not have been any prior experiences that are available
for guidance. Furthermore, personal experiences may be based on
such limiting conditions that they become more of an impediment
than an aid to conduct in new situations. Educational planners
should therefore try to use care and caution when using personal
experience to guide their planning decisions.

Research is a way of knowing based on systematic and reproducible


procedures that aim to provide knowledge that people can depend
on. It is, however, a somewhat expensive way of knowing since
it demands that people who engage in research follow particular
canons that usually require the use of special procedures,
instruments, and methods of analysis. A major advantage of
research as a way of knowing is that it is both deductive and
inductive. It is self-correcting in that knowledge produced through
research is public and subject to verification by others. Of the
various ways of knowing, it probably produces the most dependable
knowledge. This has been its major appeal in modern society and
is probably most responsible for the high status that it is accorded
– however it is not without problems in terms of application
and interpretation. Throughout this module the reader will be
continuously alerted to some of the problems that occur frequently
with research.


In broad terms, research is generally concerned with the study


of relationships among variables. A variable is a characteristic
that can take on a number of values. Height, for example, is a
variable that can take on a number of values, depending on the
stature of the individual being measured. Achievement, attitudes,
interests, and aspects of personality are all variables because they
can take on a number of values, depending on the individual
being measured. Variables do not refer only to characteristics of
individuals. Variables can also refer to ‘treatments’ that might be
applied to a group of individuals. For example, school subjects
such as mathematics can be taught in very different ways to classes
of students in schools. Thus, ‘method of teaching mathematics’
is a variable and each different way of teaching mathematics is a
different value of this variable. ‘Type of school organization’ is also
a variable since students can be grouped in many different ways for
learning. Each way of organizing students would then represent a
different value of ‘type of school organization’. Many educational
research studies are concerned with studying the relationship
between a variable that describes a particular instructional
intervention or method of organization and a student outcome
variable, such as achievement, attitudes and behaviours developed,
length of job search, etc.

At this stage, it would also seem important to indicate what


research is not. Research does not provide fixed and immutable
knowledge. That is, knowledge gained through research is relative
and probabilistic in nature, not absolute and certain. For example,
a researcher may have conducted a study and found a particular
relationship between two quantities, say pressure and volume
of certain gases – but this relationship is not necessarily static.
It probably varies with temperature and, even with a known
temperature, the relationship may not be exact because there is
invariably some error in any research finding. The key message here
is that knowledge produced through research can vary depending
on the conditions under which it was obtained. Researchers,


accordingly, often attach probabilities to their findings. This point


will be emphasized throughout this module. Suffice it to say for
now that research knowledge is not absolute, but relative and with a
particular likelihood (or probability) of occurrence.


3 Types of educational research


There are many types of educational research studies and there are
also a number of ways in which they may be classified. Studies may
be classified according to ‘topic’ whereby the particular phenomena
being investigated are used to group the studies. Some examples
of topics are: teaching methods, school administration, classroom
environment, school finance, etc. Studies may also be classified
according to whether they are ‘exploratory’ or ‘confirmatory’.
An exploratory study is undertaken in situations where there
is a lack of theoretical understanding about the phenomena
being investigated so that the main variables of interest, their
relationships, and their (potential) causal linkages are the subject
of conjecture. In contrast, a confirmatory study is employed when
the researcher has generated a theoretical model (based on theory,
previous research findings, or detailed observation) that needs to be
‘tested’ through the gathering and analysis of field data.

A more widely applied way of classifying educational research


studies is to define the various types of research according to the
‘kinds of information’ that they provide. Accordingly, studies
may be classified as: (1) historical, (2) case, (3) longitudinal, (4)
survey, and (5) experimental. Within each major type of study
there are other types of studies. For example, case studies are
often ‘ethnographic’ studies that focus on detailed investigations
of an individual or group’s socio-cultural activities and patterns.
Historical studies deal with past events and depend heavily
on the use of source documents. Case studies seek to study an
individual or particular group of individuals and are therefore
not always intended to lead to inferences that are generalizable to
wider populations. Longitudinal studies are concerned with the

study of individuals over time in order to describe the process of
development and the stability and change in various characteristics.
Survey studies furnish a picture of a group of individuals or an
organization at a particular point in time. They often contain a
number of items of information for describing the individuals
or organization. Finally, experimental studies assess the effects
of particular kinds of interventions on various outcomes of
individuals.

Each type of research study has its own particular canons,


procedures, techniques, and methods of analysis. This module
has been restricted to judging research reports that are produced
from survey studies and experiments. These comprise the major
portion of educational research and are of the greatest relevance
to educational planners. This focus is not intended to demean
historical, case, and longitudinal studies. These three types of studies
are very important and may well enter into the decision-making
process.


4 Association and causation


One of the most important distinctions that readers of research
reports must be aware of in order to judge them properly is the
distinction between association and causation. As described
above, research involves the study of the relationships among
variables. The relationship between two variables may be one of
association, or of causation, or of both of these. The difference
between the concepts of association and causation is critical to the
understanding of research. An association between two variables
states that there is a relationship. It does not necessarily mean that
one variable causes another (or vice versa). Causation, on the other
hand, means that one variable is the cause of another.

In many studies, investigators are simply able to establish an


association between two variables, say, method of instruction and
student achievement – which does not necessarily mean that one
variable (method of instruction) has caused the other (student
achievement). On the other hand, a study that establishes a causal
relationship between two variables is stating that one variable is
responsible for changes in another variable. Properly conducted
experimental studies come closest to establishing the existence of
causal relationships. Survey studies can only establish associations
and ‘suggest’ causal linkages, but they cannot establish causal
relationships.

Sometimes investigators who conduct survey studies attempt to


claim that they have established causal relationships. They have
not. No matter how elegant the methods of analysis that are used,
survey or descriptive studies cannot establish causal relationships.
The most that can be claimed for survey or descriptive studies, no

matter how carefully planned and carried out, is a presumption of
causation. The presumption may be quite strong, however, if there
is considerable previous evidence and a strong theory favouring a
certain conclusion. For example, the evidence on the relationship
between cigarette smoking and lung cancer in human beings is
based on associational studies (it would be highly unethical to
conduct experimental studies in this area with human beings).
However, the weight of evidence from many studies and a strong
physiological theory makes the conclusions from such studies
strongly presumptive of a causal relationship even though these
studies do not definitively establish a causal relationship.


5 The main characteristics of experimental and survey studies
The main characteristics of experimental studies are: (1) active
manipulation of treatment variables by the researcher, and (2)
the use of random assignment of units (usually students) to each
type of treatment. These characteristics constitute the essential
controls exercised by a researcher to establish a causal relationship.
For example, consider a situation where a researcher is interested
in studying the effect of two methods of teaching multiplication
of decimals on student achievement as measured by a test of
multiplication of decimals. In a true experiment, the researcher
selects the method of teaching to be studied, instructs two groups
of teachers (each in one of the selected methods), assigns students
in a random fashion to one of the two types of classes, follows each
class to see that it is following the prescribed method of instruction,
and tests each student at the end of the period of instruction on
a common test of multiplication of decimals. The resulting data
are then analyzed and, if the average level of performance of students taught by the two methods differs sufficiently, one comes closer to establishing a causal
relationship than in a situation where pre-existing conditions are
merely compared. Such a study is experimental in nature because
the researcher was able to exercise full control over the selection
of methods to be studied, the random assignment of teachers to
each method of instruction, and, finally, the random assignment
of students to each method of instruction. Any study that does not exercise this level of control cannot be termed an experimental
study and any causal conclusions from it must be regarded as
presumptive.

Survey studies also typically report relations among variables. These relationships are associational and not causal. For example,
Coleman, Hoffer and Kilgore (1987) report the results of a large-
scale study comparing the academic performance of students
in public (government) and private (non-government) schools.
Coleman, Hoffer and Kilgore found that students in private schools
outperformed students in public schools in various tests of school
achievement. Clearly, there is a relationship between type of school
(public versus private) and school achievement. Is the relationship
a causal one? The answer to the causal question is not known since
Coleman, Hoffer and Kilgore were not able to assign students at
random to the two types of schools. In fact, an examination of the
backgrounds of the students shows that they were quite different
at their time of entrance into their type of school. Students who
attended private schools came from more affluent families, had
higher levels of material resources in the home, were higher in
achievement when they entered their private school, and held
higher levels of expectation about what they would achieve in
secondary school than students who attended public schools. The
fact that the two groups of students differed so greatly at the start
of the study and that it was not possible to equalize the groups
attending the two types of schools through randomization makes it
impossible to establish a causal relationship between type of school
attended and academic performance. Causal inferences in this case
would be, at best, presumptive.

Survey studies are characterized by the study of relationships among variables in already existing units. No attempt is made to
randomly assign individuals to groups and groups to treatments.
This restricts such investigations to studies of associations among variables. However, this does not mean that the possibility of causal
relationships cannot be explored. They often are. In recent years,
a number of highly technical statistical procedures have been
developed that are used to explore possible causal relationships
among variables. The general term for such procedures is causal
modelling. Briefly, such procedures allow one to set up a network of
hypothesized causal relationships among variables and to test the
tenability of these relationships.

These models and the associated statistical procedures that are used
are often quite complex. They are based on a rather simple notion, however. Although the existence of an association between two variables does not mean that the variables are causally related, a causal relationship cannot exist without an association. The researcher can therefore examine the information collected in a study to see whether an association exists between the two variables in question. If
there is no association, then the researcher’s theory is disconfirmed.
However, if the two variables are associated, then the possible
causal relationship between the variables remains tenable. This is
not to say that a causal relationship has been established. It is only
that the existence of a causal relationship remains a possibility.
A great deal of rather complex analytic work goes on in the area
of causal modelling and readers of research reports often have
difficulty in following it. The basic goal of causal modelling should
be clearly kept in mind though. Causal modelling, at best, tests the
tenability of causal relationships; it does not establish them.
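
To make this logic concrete, the short sketch below (written in Python, with invented variable names and data) checks whether an association exists between two variables before a hypothesized causal link is even entertained. It illustrates the reasoning described above, not any particular causal modelling package.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Invented survey data: weekly homework hours and an achievement score.
    homework_hours = rng.normal(5, 2, size=300)
    achievement = 40 + 3 * homework_hours + rng.normal(0, 10, size=300)

    r, p_value = stats.pearsonr(homework_hours, achievement)

    if p_value >= 0.05:
        print("No association: the hypothesized causal link is disconfirmed.")
    else:
        # An association keeps the causal hypothesis tenable; it does not establish it.
        print(f"Association found (r = {r:.2f}); a causal link remains merely tenable.")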

There is a third class of studies that lies somewhere between experimental and survey studies. These studies are called quasi-
experimental. These are studies in which the researcher does not
exercise full control over the selection and scheduling of treatments
and the assignment (random) of students to groups for purposes of
study. The researcher exercises only partial control over the study.
For example, the researcher may not have the power to assign
students to groups, but may be able to schedule which groups
receive particular treatments. Strictly speaking, such studies are not experimental studies because the researcher lacks full control over the research situation. However, they clearly represent a better state of affairs
than exists in a survey study. Causal conclusions drawn from the
results of such studies are still presumptive, but the presumptions
are often fairly strong. It is up to the reader of the reports of quasi-
experimental studies to decide whether the causal speculations
of the researcher are warranted. Unfortunately, there is no simple formula for making such a judgement.


6 Qualitative and quantitative studies
Experimental, quasi-experimental and survey studies are regarded
as quantitative studies because of the collection of information that
is quantifiable and subjected to statistical analysis. Many research
studies are quantitative in nature. They are designed to expose
relationships among variables. In contrast, there are studies that are
basically qualitative in nature. The information collected in such
studies is usually not quantified and not subjected to statistical
analysis. Usually, case studies and historical studies are qualitative
in nature although, on certain occasions, they may employ
quantitative procedures. Some years ago, for example, a quantitative
study was undertaken to resolve the question of authorship of some
important historical papers in the USA. The researchers in that
study had samples of the writings of two individuals who were quite
prominent in their time and, based on the characteristics of the
writings of each author, the researchers were able to use statistical
procedures to resolve the issue of authorship. Such examples of the
use of quantitative methods in historical studies are rare.

It is often difficult to distinguish between qualitative and quantitative studies at the level of research technique or data
collection procedures. Both qualitative and quantitative studies
may use the same techniques or procedures. Thus, for example,
interviews and direct observation can be, and are, used in both kinds
of studies, often with excellent results. The information that is
obtained, however, is treated quite differently in the two kinds of
studies. In quantitative studies, information obtained through the
use of interviews or direct observation is typically subjected to statistical analysis while in qualitative studies, such information is
not subjected to these procedures.

Currently, there is considerable debate regarding quantitative and qualitative studies. The issue is being fought out at a philosophical
as well as a methodological level. The details of that debate are not
of concern here. What is important is that the reader recognize
the distinction between the two types of studies. In addition, it is
important to have a sense of what each type of study can contribute
to education.

Quantitative studies, when properly conducted, can establish relationships among variables. However, they often tell us little
about how causal relationships work. They may tell us, for example,
that the use of a certain procedure, say peer tutoring, leads to
higher levels of achievement among both tutors and tutees.
However, the precise mechanism by which peer tutoring results
in higher achievement is not ascertained in such studies. It often
requires finely detailed qualitative studies such as ethnographic
studies of individuals to determine the way in which peer tutoring
leads to higher achievement. While some researchers and even
philosophers of science see quantitative and qualitative studies
as being in opposition to one another, their functions can be
complementary as long as one does not expect some kind of
philosophical purity in research.

This module deals exclusively with quantitative studies that are used to establish relationships among variables. Educational
planners are usually required to deal with quantitative studies since
they provide most of the information on which planning decisions
are based. Qualitative studies do play a useful role in attempting to
understand the mechanisms by which relations among variables are
established. They are often undertaken after quantitative studies
have established the existence of an important relationship, or they
may be undertaken for purely exploratory purposes in order to gain
an understanding of a particular process.


7 Experimental studies and some factors that often threaten their validity

The basic structure of experimental studies
As noted in the previous discussion, properly designed and
conducted experimental studies provide a powerful means of
establishing causal relationships among variables of interest. The
reasons why this is so are inherent in the basic structure of an
experimental study. To illustrate, we may represent the basic design
of a classic experimental study as follows:

Subjects

R    O1    T1    O3
R    O2    T2    O4

In the diagram, ‘R’ denotes that a random assignment procedure
has been used to assign students to groups and groups to treatment
conditions. ‘O’ denotes an observation of performance. Such
observation may consist of tests that are administered to students.
Note that observations ‘O’ are obtained both before the introduction
of the treatment conditions and after completion of the treatment
period. While it is considered desirable to have both before (pre-
test) and after (post-test) observations, the former are not considered
crucial in an experimental study and are sometimes omitted
without jeopardizing the integrity of the design. The symbol ‘T1’
denotes the major treatment condition that is being studied. It may
be a particular instructional treatment or method of teaching, a
type of organizational structure, or some other intervention that
is being studied. ‘T2’ denotes the alternative treatment to which it
is being compared. In some studies, T2 is often a condition of no
treatment at all. In this way, researchers can assess the effect of a
particular intervention in relation to no intervention at all. The ‘no
intervention’ alternative is only occasionally studied in the field of
education since there are usually legal regulations that prevent very
uneven treatment of students – unless, of course, all treatments are
considered equally beneficial.

To assess the effect of ‘T1’, a comparison is usually made between the average levels of the post-tests for the two groups (‘O3’ and
‘O4’ in the above diagram). If ‘O1’ and ‘O2’ are present, a somewhat
more complex analytic procedure, analysis of covariance, is usually
employed to adjust for any existing random or chance differences
between groups before the treatments are introduced.
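
The analysis implied by this design can be illustrated with a brief Python sketch. The data, group sizes, and assumed treatment effect below are invented for the example; the sketch simply shows a post-test comparison of the two randomly formed groups, followed by a covariance adjustment for the pre-test, and is not a prescription for any particular analysis package.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 120                                                # hypothetical number of students

    # Random assignment of students to treatment T1 or the alternative T2.
    group = rng.permutation(np.repeat(["T1", "T2"], n // 2))
    pre = rng.normal(50, 10, size=n)                       # O1 and O2
    effect = np.where(group == "T1", 5.0, 0.0)             # assumed true effect of T1
    post = pre + effect + rng.normal(0, 8, size=n)         # O3 and O4

    # Simple comparison of post-test means (O3 versus O4).
    t, p = stats.ttest_ind(post[group == "T1"], post[group == "T2"])
    print(f"Difference in post-test means: t = {t:.2f}, p = {p:.3f}")

    # Analysis of covariance: adjust the post-test for pre-test differences.
    df = pd.DataFrame({"post": post, "pre": pre, "group": group})
    print(smf.ols("post ~ pre + C(group)", data=df).fit().params)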

As mentioned above, experimental studies can be used to establish causal relationships as they employ random assignment to ensure
comparability of the groups being studied. The random assignment
of students to groups and the random assignment of groups to
treatments serves as a way of equalizing groups, except for minor
random or chance differences, before the initiation of treatment.


If there are differences in performance between the groups at the completion of treatment, the most likely (and possibly the only)
reason for the difference is the differential effectiveness of the
treatments. The use of a group other than the one receiving the
treatment of interest, technically called a ‘control group’, provides
the necessary comparability to estimate the effectiveness of the
major treatment.

In theory, experimental studies are the preferred way of estimating the effectiveness of educational treatments and organizational
structures. In practice, experimental studies are subject to
various limitations. These limitations stem from the way in which
experimental studies are actually conducted in various settings.
The limitations of experimental studies fall under the two general
headings of internal invalidity and external invalidity. Internal
invalidity refers to the influence of extraneous factors that can mar
a well-designed study. External invalidity refers to the inability
to generalize the findings of a particular study to other settings.
Each is important and will be considered in turn.

Validity of experimental studies


The most difficult task in conducting an experimental research
study in the field of education is to hold all variables in the
educational situation constant except for the treatment variable.
The degree to which these ‘extraneous variables’ may be controlled
by the researcher is often referred to as the ‘internal validity’ of the
experiment.

Campbell and Stanley (1963) wrote a classic paper that provided a comprehensive list of the factors that threaten the validity of experiments. These factors, and some examples of how they might influence and/or distort research findings, are listed below.


1. History
In educational research experiments, events other than the
experimental treatment can occur during the time between the
pre-test and the post-test. For example, an in-service programme
to improve the knowledge and proficiency of teachers of reading
may be undertaken by a particular school. At the same time, some
of the teachers may be enrolled in university courses leading to
an advanced degree. As part of the programmes, these teachers
may be taking a course in the teaching of reading. It is certainly
possible to assess the teachers’ knowledge and proficiency in the
teaching of reading at the conclusion of the in-service programme,
but it would be virtually impossible to determine how much of their
performance is due to the in-service programme and how much
to their graduate course. This inability to determine the source of
an effect, namely, the enhanced knowledge and proficiency in the
teaching of reading renders the results of the study uninterpretable.
History, as a source of internal invalidity, opens the results of a
study to alternative interpretations. Readers of research reports
should routinely ask themselves whether a demonstrated effect is
due to the intervention under study or to something else.

2. Maturation
While an experiment is being undertaken, normal biological
and psychological growth and development processes are almost certain
to continue to occur. These processes may produce changes in the
experimental subjects that are mistakenly attributed to differences
in treatment. Maturation effects are often noticed in long-term
experiments in which students learn a great deal through the
natural processes of exposure to stimuli that are a normal part
of their socio-cultural environment. For example, students at a
particular grade level in primary school who have mastered some of
the rudiments of reading will undoubtedly improve in their reading ability simply as a result of being confronted with printed material in a variety of situations – magazines, newspapers, and the like. The
problem for the researcher is to determine to what extent reading
improvement for such students is due to the effects of instruction
and to what extent it is due to growing up in a culture where one is
constantly exposed to reading material.

3. Testing
In most educational experiments a pre-test is administered before
the experimental treatment which is then followed by a post-test.
The very administration of the pre-test can improve performance
on the post-test in a manner that is independent of any treatment
effect. This occurs when pre-testing enhances later performance
through providing practice in the skills required for the post-test, by
improving the ‘test-wiseness’ (or test-taking skills) of the students,
or by sensitizing the students to the purposes of the experiment.
The effect of retesting can sometimes be reduced by making sure
that students are given a different set of questions to answer when
a test is readministered. There are two ways to eliminate a testing
effect. The first would be to test only once, at the completion of
the treatment period. This can be troublesome since it would
deprive the researcher of information about the proficiency of
students at the beginning of the programme (‘O1’ and ‘O2’ in the
above diagram). The second way to eliminate a testing effect is to
randomly divide each group of students in half and administer the
test to one half of the group before the period of treatment and to
the other half of the group after instruction.
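
A minimal sketch of this second approach, with an invented class list, is given below; the function name and group sizes are purely illustrative.

    import random

    def split_for_testing(student_ids, seed=0):
        """Randomly divide a group: one half sits the pre-test, the other only the post-test."""
        ids = list(student_ids)
        random.Random(seed).shuffle(ids)
        half = len(ids) // 2
        return ids[:half], ids[half:]

    pre_test_half, post_test_half = split_for_testing(range(1, 31))
    print("Pre-test half: ", pre_test_half)
    print("Post-test half:", post_test_half)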

4. Instrumentation
A difference in the pre-test and post-test scores for an experiment
may sometimes occur because of a change in the nature or
quality of the measurement instrument during the course of the experiment. For example, the scores of essay tests may change from
pre-test to post-test because different standards are used by two
sets of scorers on different occasions. If the scorers of the pre-test
essays are particularly stringent in their grading while the scorers of
the post-tests are lenient, then gains in the essay scores may all be
due to the differences in standards used by the scorers rather than
the exposure of students to effective teaching. The same situation
may hold for more objective measures of student performance. For
example, the researcher might simply ask easier questions on a post-
test than on a pre-test. Instrumentation problems also often arise
when the amount of proficiency required to go from, say, a score
of six to twelve is different from the amount required to go from
a score of twelve to eighteen. Test scores are typically treated as if
the difference between score points is uniform throughout the test,
and therefore the research worker must be sensitive to the nature
of the instruments that are used and the units of measurement that
express performance.

5. Statistical regression
When students are selected for a treatment on the basis of extreme
scores, later testing invariably shows that these students, on
average, perform somewhat closer to the average for all students.
This phenomenon was observed by the psychologist Lewis Terman in his studies of gifted children over half a century ago.
Terman sought to identify a group of gifted children. His major
criterion for classifying them as gifted was the score obtained on
the Stanford-Binet Intelligence Test. As part of his initial follow-
up of these children, Terman had them retested and found, to his
surprise, that the average intelligence test score had ‘regressed’
rather dramatically (eight points) toward the average. More recently,
remedial educational programmes have been developed in a
number of countries to help disadvantaged students. A common
practice in such programmes is to select individuals who score extremely low on some test. On later testing, these students show a considerably higher average level of performance. While some
increment in performance may have occurred, much of the apparent
improvement is simply due to statistical regression: in this case the
direction is upward instead of downward, as in the case of Terman’s
studies of gifted children. In both cases the phenomenon is the
same: individuals initially selected on the basis of extreme scores
will, on retesting, show less extreme scores.
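
The regression effect is easy to demonstrate by simulation. The Python sketch below, using invented numbers, selects the lowest-scoring ten per cent of students on one test and then retests them with the same underlying ability; their average moves back toward the population mean even though no intervention has taken place.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 10_000

    ability = rng.normal(100, 10, size=n)           # stable underlying ability
    test_1 = ability + rng.normal(0, 8, size=n)     # each testing adds independent error
    test_2 = ability + rng.normal(0, 8, size=n)

    selected = test_1 < np.percentile(test_1, 10)   # lowest 10 per cent on the first test

    print(f"Selected group, first test mean : {test_1[selected].mean():.1f}")
    print(f"Selected group, second test mean: {test_2[selected].mean():.1f}")
    print(f"Population mean                 : {test_1.mean():.1f}")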

6. Selection
In a study that seeks to compare the effects of treatments on
different groups of students, the group receiving one treatment
might be more able, older, more receptive, etc. than a group
receiving another, or no treatment. In this case, a difference
between groups on post-test scores may be due to prior differences
between the groups and not necessarily the differences between
treatments. For example, if students volunteer to participate in an
experimental learning programme, they can differ considerably
from students who decide to continue in a conventional programme.
In this case, the act of volunteering may indicate that the volunteer
students are looking for new challenges and may approach their
lessons with greater zeal. If differences favouring the experimental
programme are found, one faces the task of trying to decide how
much such results reflect the effects of the programme and how
much the special characteristics of the volunteer students.

7. Drop-out
In experiments that run for a long period of time there may be
differential drop-out rates between the groups of students receiving
the experimental treatments. For example, random allocation of
students to several educational programmes may have ensured comparable groups for the pre-test – but if one group incurs a loss
of low-performing students during the course of the experiment,
that group’s average performance level will increase, regardless of
the effectiveness of the programme to which it was exposed.

8. Interaction
It is possible for some of the above factors to occur in combination.
For example, a source of invalidity could be selection-maturation
interaction whereby, due to a lack of effective randomization, major
age differences occur between the treatment groups – which, in
turn, permits the possibility of differential rates of maturity or
development between the groups. These latter differences may
result in differences in post-test scores independently of treatment
effects. Another example illustrates how the joint operation
of several factors can lead one erroneously to conclude that a
programme has been effective. A study was conducted on the
effects of the ‘Sesame Street’ educational television programme
in the USA. In the first year of the study of that programme’s
effectiveness, four different groups of children were examined
to judge the instructional effects of the programme. The groups
were established on the basis of the amount of time spent viewing
‘Sesame Street’. This ranged from rarely or never watching the
programme to viewing it more than five times a week. Scores on
the ‘Sesame Street’ pre-tests were found to be highly related to the
amount of time spent viewing the programme. That is, the higher
the initial score, the more time was spent watching the programme.
Scores on the post-test, and hence the gains, were likewise highly related to the time spent viewing the programme. The
combination of pre-test performance and self-selection into viewing
category made it impossible to assess the effectiveness of the first
year of ‘Sesame Street’ from the data collected.


9. Diffusion of experimental treatment


In some experiments ‘the treatment’ is perceived as highly desirable
by members of the control group. This may lead the control subjects
to seek access to the treatment – either by communicating with
the treatment subjects or by some means that were not anticipated
during the design of the experiment. This problem is a source of
major concern in the evaluation of new curriculum programmes
where the teachers and students in the treatment group are given
access to an attractive curriculum based on exciting and innovative
teaching materials. During the course of the experiment, curiosity
might lead many of the teachers of the control group to discuss
the new programme with the treatment group teachers – even if
they have been instructed not to do so. Subsequently, the students
in one group may learn the information intended for those in the
other groups. Thus, the study may become invalid because there
are, in fact, no real differences between the treatment and control
curricula. While it may not be possible to have complete control
over such contact in some instances, the monitoring of programme
implementation in both the group receiving the experimental
treatment and the group not receiving that treatment should reveal
how serious a threat the diffusion or imitation of treatments poses to the validity of the study.

10. Resentful demoralization of students receiving less desirable treatments
The members of the group not receiving the treatment that is
being studied may perceive that they are in an inferior status
group and either ‘lose heart’ or become angry and ‘act up’. This
could lead to an after treatment difference between groups that
may not be a consequence of treatment effectiveness but rather of
resentful demoralization by the students receiving the alternative
treatment. Some monitoring of the group receiving the alternate treatment should reveal how plausible this threat is to validity. The threat can be controlled somewhat by planning that separates the
group receiving the treatment of interest from the group receiving
the alternate treatment in either time or space. Alternatively,
arrangements can be made to enliven or ‘spice up’ the alternative
treatment so that it appears as desirable as possible to participating
students.

11. Generalizability
The ten threats to validity described above focus upon the need
to ensure that the effect of the experimental treatment is not
confounded by extraneous variables. All of these threats can
be managed through exerting firm controls over the design and
execution of an experiment. However, the implementation of these
controls may lessen the ‘realism’ of the environment in which
an experiment is conducted and consequently may affect the
generalizability of the research findings to other populations, other
treatments, other measurement instruments, and other social/
economic/cultural/environmental settings. Any research study
is conducted in a particular time and place and with particular
students, treatment variables, and measurement variables. To what
extent can the results of any one study be generalized to other cases?
Strictly speaking, this question is unanswerable. However, at the
very least, the reader of research reports must decide how similar
the group under study is to the groups he/she is responsible for and
how similar the conditions in his/her setting are to the one in which
the study was conducted. In order to make these decisions it will
be extremely helpful if the writer of a research report has carefully
described the setting in which the study occurred, the students
who were studied, the particular treatment that was investigated,
and the types of measurements that were taken. In short, the
generalizability of a study’s findings for a different setting is a
matter that requires careful consideration, not automatic acceptance.


8 Survey studies and some factors that often threaten their validity
The basic structure of survey studies
There is no single definition that can be used to provide a
comprehensive description of the structure of survey studies. There
are many types of survey studies but they all have one key feature
in common: they all obtain measures from a scientific sample of
subjects selected from a well-defined target population. In a cross-
sectional survey these measurements are used to prepare summary
statistics and then make inferences from these about the nature of
the target population. In a longitudinal survey the focus is on the
use of a series of time-related measurements of the same sample
of individuals. Both cross-sectional and longitudinal surveys may
be used for descriptive purposes, or for examining relationships
between important variables, or for exploring conceptual models
derived from proposed networks of variables. The following
discussion of factors that threaten validity has been limited to the
use of cross-sectional studies for descriptive purposes.

A survey study may be regarded as a snapshot of a situation at a particular time. Descriptive studies, because of cost, are rarely
conducted for an entire population and therefore a sub-set of a
population, called a sample, is chosen for closer study. The selection
of a sample for a survey is a critical part of such a study since the sample has to be chosen in such a way that it is representative of the
larger population of which it is a part.

As noted earlier, survey studies can be used to establish associations between variables, and do not permit the drawing
of causal relationships. Despite their limitations, survey studies
play an important role in education. They can result in useful
descriptions of the current state of affairs in a situation and
have often been used as the basis for introducing changes,
especially when the state of affairs that is described is considered
unacceptable. Thus, for example, a study of the school achievement
of students in a particular locality, or even a nation, may reveal
levels of achievement that are deemed unacceptable to educational
authorities.

Survey studies have also been used for comparative purposes. The
studies conducted by the International Association for the Evaluation
of Educational Achievement (IEA) have been used to compare the
performance of students at various age and grade levels in different
nations. The identification of achievement differences has often led
to a closer examination of various nations’ educational systems with
a view to improving them. Within nations, comparisons have often
been made between various types of schools, for example, single-sex versus coeducational, in order to assess the relationship
between the sex composition of schools and achievement.

The validity of survey studies


Survey studies require particular attention to be given to the scope
of data collection and the design and management of data collection
procedures (especially sampling, instrumentation, field work, data
entry, and data preparation). If a survey study has problems with
any of these areas, then the validity of the study’s findings may be
threatened.


In the following discussion some of the factors that often threaten the validity of survey studies have been presented. Since these
factors are mainly concerned with ‘generalizability’ they also have
the capacity to threaten the validity of experimental studies.

1. The scope of the data collection


The first step in the conduct of a survey study is specifying the
entity that is to be described. This, of course, will depend on the
purpose of the researcher. In some cases, one may wish to describe
some features of a single locality such as the size of classes or
the qualifications of teachers at a particular grade level. In other
cases, one may want to assess the attitudes and achievement of
students at a particular level in some school subject. Usually,
researchers who carry out survey studies gather information on
a number of different variables. The reason for doing this is that
once one undertakes the collection of information from people, it
is usually a matter of little additional time to collect information
on a large number of variables rather than on just a few variables.
However, there is a danger here. Sometimes researchers who are not
completely clear as to their research objectives collect information
on a large number of variables without knowing why they are doing
so. This is often referred to as ‘shotgun’ research. The hope of such
researchers, usually unfounded, is that if they collect information
on as many variables as possible, they are apt to include in their
list of variables some that may turn out to be important. Readers
of research reports should generally be suspicious of studies that
collect information on large numbers of variables – often hundreds.

Data collection efforts in education systems in some countries often give insufficient attention to whether it is really necessary
to study the whole population of students, teachers and schools.
The coverage of a whole population, because of the breadth/depth
tradeoff, usually results in little information about many units. In
this situation, important variables may be omitted and/or measured with insufficient attention to reducing measurement error. For most purposes, sample surveys, when designed and executed
appropriately, can provide as much information as complete
censuses at considerably less cost. For example, sample surveys are
often adequate for providing accurate estimates of participation and
repetition rates, and are virtually mandatory for estimating national
achievement levels, particularly for students in grades not regularly
examined for selection purposes.

2. The sample design


The first step in the preparation of a sample design for a survey
study is to develop descriptions for the desired target population
(the population for which results are ideally required), the defined
target population (the population which is actually studied and
whose elements have a known and non-zero chance of being
selected into the sample), and the excluded population (the
population comprised of the elements excluded from the desired
target population in order to form the defined target population).
A population is defined by specifying the characteristics that all
elements of the population have in common. For example, one
may define a population as all students between age ten years
zero months and ten years eleven months attending full-time
government schools in Budapest, Hungary. Similarly, one may
define a population as consisting of all students enrolled in a first
year course in French in Swedish schools. In each of the above
instances, the specification of the common characteristics of the
members of the population defines the population.

Since it is usually not possible to study an entire population, because of cost and logistical considerations, a sub-set of the
population is selected for actual study. This sub-set is called a
sample. One of the challenges that researchers face is to select
a sample from a population for study in such a way that it will
provide precise estimates of the defined target population characteristics. Unfortunately, in many survey studies the sample estimates provide very poor estimates of defined target population
characteristics because of the following five problems.

• The defined target population and the excluded population are never clearly defined. This may arise because the researcher
either does not bother to specify the size and nature of these
populations or, due mainly to lack of precise information,
is unable to provide precise definitions. Unfortunately, this
problem often goes hand-in-hand with the researcher making
generalizations about a desired target population that, upon
careful scrutiny, is quite different from the defined target
population.

• The participants in the study are nominated rather than sampled. This approach is often justified in terms of cost or accessibility considerations; however, both of these
‘constraints’ can usually be addressed by adjusting the defined
target population definition and then applying appropriate
stratification procedures. These nonprobability samples,
sometimes referred to as ‘nominated samples’, are generally
described in scientifically meaningless terms such as ‘quota’,
‘representative’, ‘purposive’, ‘expert choice’, or ‘judgmental’
samples. Kish (1965) characterized data collections based
on this approach as ‘investigations’ and pointed out that
they should not be confused with appropriately designed
experiments or surveys. The main problems associated with the
use of nominated samples are that it is not possible to estimate
the sampling errors or to have any idea of the magnitude of the
bias associated with the selection procedures (Brickell, 1974).
Consequently, nominated samples should be used only for the
trial-testing of instrumentation or new curriculum materials
because in these activities it is sometimes desirable to employ
a ‘distorted’ sample that has, for example, a disproportionately
large number of students at the extremes of a spectrum of
ability, ethnicity, socio-economic status, etc.


• The sampling frame used to list the defined target population is faulty because it is out of date and/or is incomplete and/or
has duplicate entries. The construction and maintenance of
a comprehensive sampling frame for schools, teachers, and
students may be neglected because it is considered to be
too expensive or because the systematic collection of official
statistics in a country is error-prone. This is sometimes the
situation in countries where population growth rates are high
and where large and uncontrolled movements of population
from rural to urban settings are commonplace. However, there
are also a number of countries that are unable to provide
accurate information in this area because the management and
financing of schooling is undertaken by local communities, or
because there is an independently managed non-government
school sector. The researcher faced with these difficulties often
proceeds to use a faulty sampling frame based on poor quality
official statistics in the mistaken belief that there are no other
alternatives. In fact, there are well-established solutions to
these problems that employ ‘area sampling’ (Ross, 1986) and,
provided that a trained team of ‘enumerators’ is available to list
schools within selected areas, it is possible to prepare a high
quality sample design without having access to an accurate
sampling frame based on a listing of individual schools.

• Confusion surrounding the terms ‘total sample size’ and ‘effective sample size’ results in the total sample size for a
complex cluster sample being set at the wrong level either
by the use of simple random sampling assumptions or, quite
frequently, by guesswork. In school systems that are highly
‘streamed’, either explicitly on the basis of test scores or
implicitly through the residential segregation of socio-economic
groups, the use of complex cluster sampling can have dramatic
effects on the total sample size that is required to reach a
specific level of sampling precision. This occurs because the
streaming causes larger differences in mean scores between classes than would be the case if students were assigned at random to classes. (The magnitude of these differences can be
measured by using the coefficient of intraclass correlation).
Researchers with a limited knowledge of this situation often
employ simple random sampling assumptions for the estimation
of the required total sample size. In order to illustrate the
dangers associated with a lack of experience in these matters,
consider the following two examples based on schools in a
country where the intraclass correlation for achievement scores
at the Grade 6 level is around 0.6 for intact classes. A sample
of 40 classes with 25 students selected per class would provide
a total sample size of 1,000 students – however, this sample
would only provide similar sampling errors as a simple random
sample of 65 students when estimating the average population
achievement level. Further, a sample of 50 classes with
4 students selected per class would provide a total sample size
of ‘only 200 students’ but would nevertheless provide estimates that are more precise than the above sample of 1,000 students. (A minimal calculation illustrating these design effects is sketched after this list.)

• The wrong formulae are used for the calculation of sampling errors and/or for the application of tests of significance. This
usually occurs when the researcher employs a complex cluster
sample (for example, by selecting intact classes within schools)
and then uses the sampling error formulae appropriate for
simple random sampling to calculate the sampling errors
(Ross, 1986). The most extreme form of this mistake occurs
when differences in means and/or percentages are described
as being ‘important’ or ‘significant’ without providing any
sampling error estimates at all – not even the incorrect ones.
These kinds of mistakes are quite common – especially where
‘treatment versus control’ comparisons are being made in order
to compare, for example, current practices with new curriculum
content or new teaching materials (Ross, 1987).
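
To make the arithmetic behind the class-sampling example above concrete, the short Python sketch below applies the usual design-effect approximation for cluster samples, Deff = 1 + (b - 1) x rho, where b is the number of students selected per class and rho is the coefficient of intraclass correlation. The figures are those used in the example; the function name is simply illustrative.

    def effective_sample_size(classes, students_per_class, rho):
        """Approximate effective sample size of a cluster sample: n / (1 + (b - 1) * rho)."""
        n_total = classes * students_per_class
        deff = 1 + (students_per_class - 1) * rho
        return n_total, deff, n_total / deff

    for classes, per_class in [(40, 25), (50, 4)]:
        n_total, deff, n_eff = effective_sample_size(classes, per_class, rho=0.6)
        print(f"{classes} classes x {per_class} students: total n = {n_total}, "
              f"design effect = {deff:.1f}, effective n = {n_eff:.0f}")

    # 40 classes of 25 students give an effective sample of about 65;
    # 50 classes of 4 students give about 71, despite a far smaller total sample.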


3. Instrumentation
Once one has decided what population to study, the next step is
deciding what items of information should be collected via the data
collection instruments (tests, questionnaires, etc.) that have been
designed for the study. One may choose to study a very limited set
of variables or a fairly large set of variables. However, the research
questions governing a study should determine the information
that is to be collected. Some variables may be obtained at very low
cost. For example, asking a member of a sample to report his or her
sex will require only a few seconds of the respondent’s time. On
the other hand, obtaining an estimate of a student’s proficiency in
mathematics or science may require one or more hours of testing
time. Time considerations will be a major determinant of how much
information is collected from each respondent. Usually, a researcher
or a group of researchers will need to make compromises between
all the information that they wish to collect and the amount of
time available for the collection of information. The data collection
instruments should be clear in terms of the information they seek,
retain data disaggregated at an appropriate level, and permit the
matching of data within hierarchically designed samples or across
time. Furthermore, they must be designed to permit subsequent
statistical analysis of data for reliability and (if possible) validity.
The basic requirements are that the questions posed do not present
problems of interpretation to the respondent, and that, when forced
choice options are provided, the choices are mutually exclusive and
are likely to discriminate among respondents.

The presentation of test scores and sub-test scores should be accompanied by appropriate reliability and validity information. At
the most minimal level for norm-referenced tests, a traditional item
analysis should be undertaken in order to check that the items are
‘behaving’ in an acceptable manner with respect to discrimination,
difficulty level, and distractor performance. An attempt should
be made to establish the validity of tests where this has not been carried out previously. The quality of the instruments prepared for a survey study is heavily dependent upon the time and effort that has
been put into the pre-testing of tests, questionnaires, etc. The usual
result of a failure to pre-test is that respondents can be confused
and therefore answer inappropriately. When obtaining measures
of student achievement, pre-testing is absolutely necessary so
that items may be checked for their difficulty and discrimination
levels. If items are either too hard or too easy, there will be little
discrimination in the resulting test score.
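
As an illustration of the kind of traditional item analysis referred to above, the Python sketch below computes, from an invented matrix of right/wrong item scores, each item's difficulty (the proportion answering correctly) and a simple discrimination index (the correlation between the item and the total score on the remaining items). It sketches the idea only and does not reproduce any particular item-analysis package.

    import numpy as np

    rng = np.random.default_rng(3)
    n_students, n_items = 200, 10
    ability = rng.normal(0, 1, size=(n_students, 1))      # invented latent ability
    thresholds = rng.uniform(-1, 1, size=n_items)         # invented item difficulties
    scores = (ability + rng.normal(0, 1, (n_students, n_items)) > thresholds).astype(int)

    for item in range(n_items):
        difficulty = scores[:, item].mean()               # proportion answering correctly
        rest_total = scores.sum(axis=1) - scores[:, item] # total score excluding this item
        discrimination = np.corrcoef(scores[:, item], rest_total)[0, 1]
        print(f"Item {item + 1:2d}: difficulty = {difficulty:.2f}, "
              f"discrimination = {discrimination:.2f}")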

Where open-ended responses are to be subsequently coded into categories, pre-testing can assist in the development of the
categories, or can even lead to eliminating the need for a separate
coding step. For example, in one international survey, students
were asked to indicate the total number of brothers and sisters in
their family. The question was asked in the form of a forced choice
response with the maximum value being ‘five or more’ and in
several countries more than 80 per cent of the children indicated
this category as their choice. Pre-testing would have indicated the
need to extend the number of categories to allow for the very large
family sizes in these countries, or to leave the question open-ended.

Many of the weaknesses and limitations of educational research stem from inadequacies in the measures that have been used.
Simply put, if the measures that are used to answer research
questions are deficient, it will not be possible to obtain correct
answers to the research questions. Four major questions that should
be asked in selecting or devising measures for research studies are:

• What validity evidence is available to support the use of a measure? Validity refers to whether an instrument is measuring
what it is supposed to measure. For published instruments
developed by others there should be adequate documentation
to justify their use with the intended target population. If
such information is not available, pilot studies may need to be undertaken to establish a basis for usage. For locally developed measures such as achievement tests, validity can be built into
the measure through the use of a carefully developed test plan.
However, the development of a test according to a detailed
test plan may not guarantee validity of the measure. For
example, a test of science achievement may contain so much
verbal material that for a student to score well he/she must
demonstrate a high level of reading comprehension as well as
the relevant science knowledge. This would invalidate the test
as a measure of science achievement. A suitable remedy in this
situation would be to rewrite the items of the test in a simpler
language.

• What reliability evidence is available to support the use of a measure? Reliability denotes the accuracy or precision with
which something is measured. For published measures, or
measures developed by others, it should be expected that
reliability information will be available. If such information
is not available, it will be necessary to conduct pilot studies
to determine the reliability of the instrument. For locally
developed instruments, a trial of the instrument will be needed
to determine reliability. In general, one should avoid using
instruments that test student achievement which have reliability
coefficients below 0.8, and definitely not use any achievement
test with a reliability lower than 0.7. Such instruments contain
so much measurement error that they cannot provide adequate
answers to research questions. (A minimal sketch of how one common reliability coefficient can be computed is given after this list.)

• Is the measure appropriate for the sample? Instruments are developed to be used with particular groups of people.
A science test, developed for use in one region, may be
inappropriate for use in another region where the curriculum
is different. Careful review of an instrument, along with some
pilot work, may be necessary to determine the suitability of an
instrument for a particular group of people.


• Are test norms appropriate? Sometimes norms are available for help in interpreting scores on various measures. If the
group to which a test is to be given is a sample from the target
population on which the norms were developed, norms can be
a useful aid in interpreting test performance. To do so, several
requirements must be met. First, the sample that takes the test
must be clearly a part of the population on which the norms
were developed. Second, the test must be given without any
alterations such as omission of certain items or changes in
directions for administration. Third, the time limits for the test
must be strictly followed. All of these conditions must be met if
norms are to be used.
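
For readers who want to see how a reliability coefficient of the kind referred to above might be obtained, the following Python sketch computes Cronbach's alpha, one common internal-consistency estimate, from an invented matrix of item scores. It is one possible reliability estimate rather than the only one, and the data are purely illustrative.

    import numpy as np

    def cronbach_alpha(item_scores):
        """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
        item_scores = np.asarray(item_scores, dtype=float)
        k = item_scores.shape[1]
        item_variances = item_scores.var(axis=0, ddof=1).sum()
        total_variance = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    rng = np.random.default_rng(11)
    ability = rng.normal(0, 1, size=(500, 1))
    items = (ability + rng.normal(0, 1, size=(500, 20)) > 0).astype(int)  # invented 0/1 data

    alpha = cronbach_alpha(items)
    print(f"Cronbach's alpha = {alpha:.2f}")
    print("Meets the preferred 0.8 level" if alpha >= 0.8 else "Below the preferred 0.8 level")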

9 Other issues that should be considered when evaluating the quality of educational research
Previous sections have identified specific threats to the integrity
of research studies. In this section more general considerations
regarding the quality of research studies are presented. Some of
these may strike the reader as being self-evident. If this is the
case, then the reader has a fairly good grasp of the basic canons of
research. Unfortunately, too many studies are carried out in which
these canons are violated.

Every research study should contain a clear statement of the purpose of the study. Furthermore, such a statement should appear
very early in the report. If specific hypotheses are being tested,
these too should be clearly stated early in the research report. If no
hypotheses are being tested but rather research questions are being
posed, then these too should be presented early in the report. The
frequency with which such dicta are violated is astonishing. Unless
the reader is informed early on as to the purpose of a study and
the questions to be answered, it is difficult to judge a presentation.
Usually, failure to state the purpose of a study early in the report
is an indication that the author is unclear about the nature of the
study. If this is the case, then it is likely that the research report will
contain little that is of value.


A second issue that bears some comment is the review of the literature. Reviews of the literature are intended to furnish the
reader with some background for the study. These can range from
just a few pages to a lengthy chapter. Practice varies considerably
and there are no firm rules to follow. Also, space in publications is
usually at a premium and writers are frequently asked to trim the
review of the literature to a bare minimum. Despite these factors,
some review of the literature is needed in a research report. The
review is intended to inform the reader of the existence of previous
work in the area and provide a foundation for the present study. It
also furnishes some minimal assurance of the writer’s familiarity
with the area being studied and the likelihood that whatever errors
occurred in previous studies are not apt to be repeated.

There are some types of errors that occur in a research situation that have come to be given particular ‘names’. They are generally associated with experimental studies. They include the following: the Hawthorne effect, the John Henry effect, the Pygmalion effect, and the Demand characteristics effect.

The Hawthorne effect refers to an effect detected in early studies of worker morale where the fact that subjects were aware of being
involved in an experiment resulted in increased output and morale,
regardless of the nature of the particular treatment to which they
were exposed.

The John Henry effect is the opposite of the previously described threat to internal validity referred to as ‘resentful demoralization’.
In the John Henry effect, students and teachers in the group
not receiving the experimental treatment, and knowing that
they are not receiving it, join together in putting forth greater
effort to perform well. Such increased effort would probably not
have occurred if an experiment was not being conducted. The
consequence of the John Henry effect is higher performance of
the control group leading to a misleading result of no difference
between the experimental and control groups.


The Pygmalion effect refers to experimenter expectancy effects that can influence student performance. The claim for this effect
originated in a study conducted some years ago in which teachers
were told that some of their students would have a spurt in mental
growth during the course of the school year. In fact, the students
who were supposed to show this increase were chosen at random
from students in the class. Some support for this effect was found
in the lowest two grades of the school, but the study’s results were
disputed by other researchers. Other studies have detected an
expectancy effect, especially when one group is identified as low
performers. If a group is labelled as low performers (whether they
actually are or are not), this can result in inferior treatment and
resulting low performance. While the available evidence for such an effect is not strong, the possibility that it will occur does exist in some cases.

The term Demand characteristics effect is concerned with all the cues
available to subjects regarding the nature of a research study. These
can include rumours as well as facts about the nature of the study,
the setting, the instructions given to subjects, and the status and personality of the researcher. These can influence the research
study and, more importantly, the results. At present, research is
underway to determine the conditions under which such factors
may influence the outcomes of studies.


10 A checklist for evaluating the quality of educational research
The following framework for evaluating research reports (see
Box 1) has been adapted from material developed by Tuckman
(1990) and Ross et al. (1990). It is intended to furnish educational
planners with a set of criteria for judging research reports. While
some readers may be tempted to use the criteria included in the
framework without attending to the rest of the material already
presented in this module, it is strongly recommended that this not
be done. The criteria represent a summarization of the concepts that
have been presented earlier and are therefore likely to lack meaning
unless the reader has read the entire document. Therefore, it is to be
hoped that readers will devote as much attention to understanding
the ideas that underlie the criteria presented below as to the criteria
themselves.

To assist the reader in understanding and using the criteria, a study from the literature will be examined with regard to these
criteria. The study that has been selected is ‘Adult Education Project
– Thailand’ by Thongchua, V.; Phaholvech, N.; and Jiratatprasoot,
K. It was published in Evaluation in Education: an international review
series, 1982 Vol. 6, pp. 53-81, and is reproduced in the Appendix.
This study sought to determine the effects of several courses of
vocational instruction, namely courses in typing and sewing of 150
hours and 200 hours duration.

Box 1 Research evaluation framework

1. Problem
a. is clearly stated and understandable;
b. includes the necessary variables;
c. has theoretical value and currency (impact on ideas);
d. has practical value and usability (impact on practice).
2. Literature review
a. is relevant and sufficiently complete;
b. is presented comprehensively and logically;
c. is technically accurate.
3. Hypotheses and/or questions
a. are offered, and in directional form where possible;
b. are justified and justifiable;
c. are clearly stated.
4. Design and method
a. is adequately described;
b. fits the problem;
c. controls for major effects on internal validity;
d. controls for major effects on external validity.
5. Sampling
a. gives a clear description of the defined target population;
b. employs probability sampling to ensure representativeness;
c. provides appropriate estimates of sampling error.
6. Measures
a. are adequately described and operationalized;
b. are shown to be valid;
c. are shown to be reliable.
7. Statistics
a. are the appropriate ones to use;
b. are used properly.
8. Results
a. are clearly and properly presented;
b. are reasonably conclusive;
c. are likely to have an impact on theory, policy, or practice.
9. Discussion
a. provides necessary and valid conclusions;
b. includes necessary and valid interpretations;
c. covers appropriate and reasonable implications.
10. Write-up
a. is clear and readable;
b. is well-organized and structured;
c. is concise.
Suggested guide for scoring
5 = As good or as clear as possible; could not have been improved.
4 = Well done but leaves some room for improvement.
3 = Is marginal but acceptable; leaves room for improvement.
2 = Is not up to standards of acceptability; needs great improvement.
1 = Is unacceptable; is beyond improvement.
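Readers who review several reports against the framework may find it convenient to record their ratings systematically. The following is a minimal sketch in Python (not part of the framework itself); the criterion names are taken from Box 1, but the summarise() helper and the layout of the output are purely illustrative.

```python
# Minimal sketch: recording ratings against the ten criteria in Box 1.
# The summarise() helper below is illustrative, not part of the module.

CRITERIA = [
    "Problem", "Literature review", "Hypotheses and/or questions",
    "Design and method", "Sampling", "Measures", "Statistics",
    "Results", "Discussion", "Write-up",
]

def summarise(ratings):
    """Print each criterion with its rating (1-5), or 'NA' where not applicable."""
    for criterion in CRITERIA:
        print(f"{criterion:30s} {ratings.get(criterion, 'NA')}")
    scored = [v for v in ratings.values() if isinstance(v, int)]
    if scored:
        print(f"{'Mean of rated criteria':30s} {sum(scored) / len(scored):.1f}")

# Example: the ratings given later in this section to the Thailand study.
summarise({
    "Problem": 4, "Hypotheses and/or questions": 4, "Design and method": 3,
    "Sampling": 2, "Measures": 4, "Statistics": 4, "Results": 4,
    "Discussion": 3, "Write-up": 5,
})
```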


The objectives of the evaluation project were as follows:

• “measuring the skills and knowledge gained by participants in typing and sewing courses of different duration (e.g., 150 hours and 200 hours);
• identifying variables having an effect on the achievement of participants at the end of the course;
• investigating whether the graduates took up employment in typing/sewing within six months of the end of the course;
• assessing how participants utilized the skills they learned on the course six months after completing their courses.” (p. 54).

1. Problem
The first criterion in the framework involves the research problem.
The above statement of objectives was clear and understandable,
had some theoretical value and considerable practical value. While
the necessary variables were not explicitly stated, they were strongly
implied. The first objective referred to skills and knowledge to be
gained in typing and sewing courses. Furthermore, the third and
fourth objectives specified employment in typing or sewing within
six months of the end of the course and use of skills within six
months after completing the course. The second objective referred
to variables “...having an effect on achievement of participants” but
did not specify what these variables were. These variables were
presented later in the report of the study. A rating of 4 would seem
suitable for the statement of the problem.

2. Literature review
The study report contained no literature review. This may have been
omitted due to the extreme length of the report (28 pages) or the fact that the project had an extremely practical orientation. In any case, the omission of any literature review was troubling. If one
were to be generous, one could give the study a rating of NA (Not
Applicable). The alternative would be to give it a rating of 2.

3. Hypotheses and/or questions


Since the study was not an experimental one, no formal hypotheses
were stated. The research questions were, however, strongly implied
in the statement of objectives quoted above. They were further
elaborated in the text. The expectation of the investigators was
clear; training in typing and sewing was expected to have a positive
effect on the skill and knowledge of the participants. Furthermore,
the training course was expected to lead to employment in either
typing or sewing or, at the least, use of these skills in the future
(six months after the completion of the course). A rating of 4 would
seem appropriate here.

4. Design and method


The authors described the procedures they followed in the conduct
of the study in considerable detail (pp. 54-57). The organization
of the study was presented with great clarity, including the
problems encountered in selecting sites and participants. These
were considerable due to the fact that there was inadequate
information about what courses were offered in each of the 24
Lifelong Education Centres throughout the country. Accordingly,
the investigators had to conduct no less than three surveys in
order to find out what courses were being offered in what centres.
The survey results led to some major modifications in the study
plan. For example, it was originally envisaged that courses of 150
and 300 hours duration would be compared. However, the survey
results revealed that very few centres offered courses of 300 hours
duration so the study plan was adjusted to compare courses of 150
and 200 hours duration. In general, the design did fit the problem fairly well. However, the study was not able to control adequately
for either internal or external validity. The reason for this was
that the researchers had virtually no control over the selection of
participants for the study or the assignment of participants to the
different length courses. Another design issue that the authors should have addressed was the rather limited period of six months used for the tracer study. Given the prevailing economic conditions, perhaps at least one year would have been more appropriate. At best, a rating of 3 must be given on this criterion.

5. Sampling
There were a number of difficulties encountered in sampling. First,
the selection of which of the Lifelong Education Centres would be
included presented problems. According to the investigators, “The
selection of centres was made by purposive sampling rather than
simple random sampling as originally foreseen, because of time
limitations and in order to have a sufficient number of participants”
(p. 55). The use of ‘purposive sampling’ was questioned earlier
in this module. It is a somewhat elegant way of saying that the
sampling was less than desirable. Second, within each centre,
already established classes of 150 or 200 hours duration were
selected at random with the proviso that the number of participants
per centre should be at least 30. This condition was not always
met. More serious, however, was the use of intact classes. The inability to assign participants randomly to classes of different durations presents real problems, since it is possible that more able participants could be concentrated in courses of one particular length. Third,
the data collection was seriously compromised in some cases. The
authors report, “In some cases, centres had already completed the
course a week or two before the dates arranged with the centre for
the administration of the tests. The teachers tried to persuade the
participants to return to take the tests but, unfortunately, many of
them did not do so” (p. 55). In addition, “...participants enrolled but
never attended the course or dropped out of the course before it finished because they had found a job, were already satisfied with
what they had learned, became tired of the course, or were needed
on the land for seasonal work” (p. 55). The effect of these events
was to introduce a bias into the study whose influence is unknown.
The lack of a clear definition of a target population, the lack of
probability sampling, and the difficulties encountered in actually
obtaining the sample raise serious questions about the adequacy of
the groups that were studied. A rating of 2 on this criterion seems
warranted.
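The contrast between the purposive selection used in the study and the probability sampling called for in the framework can be made concrete. The sketch below is illustrative only and assumes that every centre could offer every course, precisely the condition that did not hold in practice; it shows how 20 of the 24 centres might have been allocated at random to the four course types.

```python
# Illustrative sketch of simple random selection of centres (not the procedure
# actually used in the study, which relied on purposive sampling).
import random

random.seed(1)  # fixed seed so the draw can be reproduced and documented

centres = [f"Centre_{i:02d}" for i in range(1, 25)]   # the 24 Lifelong Education Centres
random.shuffle(centres)                                # put the centres in random order

# Allocate the first 20 shuffled centres, five to each course type,
# as the original study plan intended (20 of the 24 centres involved).
plan = {
    "typing_150":  centres[0:5],
    "typing_200":  centres[5:10],
    "sewing_150":  centres[10:15],
    "sewing_200":  centres[15:20],
}

for course, chosen in plan.items():
    print(course, chosen)
```

In practice, such a draw is only possible when reliable information is available about which centres offer which courses; the lack of that information is exactly what forced the investigators into purposive sampling.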

6. Measures
There were several measures used to assess the achievement of
the participants. Tests were developed in both typing and sewing
for the study. In addition, questionnaires were developed to assess
the background of participants and other variables of interest.
A teacher questionnaire was also developed for use in the study.
The description of the development of the instruments was quite
thorough (pp. 57-64) and there is considerable evidence of content
validity. The reliabilities of the cognitive tests of typing were rather low (Thai + English = .60 and Thai = .64). The performance tests, in
contrast, showed high reliabilities (typing = .87 to .95 and sewing
total = .92). Clearly, the instruments are one of the outstanding
features of the study. A rating of 4 or 5 is clearly warranted.

7. Statistics
The authors used standard statistical procedures to analyze
their data. They presented the means obtained on each measure
for each subject for each course length and indicated whether
differences between groups were statistically significant or not.
Unfortunately, they did not present the standard deviations of the
scores. This made it somewhat difficult to interpret the results that
were obtained. For example, the authors report (p. 70) a difference of 0.78 points between the two course-length groups on the ‘basic knowledge’ section of the Thai + English typing knowledge test. This difference was indicated to
be statistically significant. This is all well and good, but one is also
concerned as to whether the difference is meaningful or not. The
inclusion of the standard deviations for the two groups would have
enabled a reader to judge the meaningfulness of the difference (the
means for the two groups were 7.23 and 6.45). A rating of 4 on this
criterion seems warranted.
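The point can be illustrated with a standardized effect size. The two means below are those reported in the study (7.23 and 6.45); because the standard deviations were not published, the value used here is purely hypothetical.

```python
# Illustration of why the standard deviations matter: the same 0.78-point
# difference looks quite different depending on the spread of the scores.
# The pooled standard deviation below is assumed, not taken from the report.

mean_200, mean_150 = 7.23, 6.45     # reported means for the two course lengths
assumed_pooled_sd = 2.0             # hypothetical value for illustration only

d = (mean_200 - mean_150) / assumed_pooled_sd   # standardized difference (Cohen's d)
print(f"difference = {mean_200 - mean_150:.2f} points, d = {d:.2f}")
# With a pooled SD of 2.0 the difference is about 0.4 of a standard deviation;
# with a pooled SD of 4.0 it would be only about 0.2, a much less impressive effect.
```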

8. Results
Some comments on the results were presented under the above
criterion. The general lack of differences between participants in the 150 hour course and those in the 200 hour course is notable and led to the conclusion that little was gained by having courses
of more than 150 hours. This is a finding that is likely to have
considerable impact on policy and practice. At the conclusion of the
report, the authors suggest that, “Serious consideration should be
given to abandoning the 200 hour sewing course ...” If adopted, such
a recommendation would have a great impact on policy and practice
and should result in a considerable saving of money. A rating of 4
seemed appropriate.

9. Discussion
The authors provided a thoughtful discussion of their results. At
the beginning of the study, the authors expressed the expectation
that graduates of the programme would be able to use their
newly acquired skills in typing or sewing to gain employment.
Subsequently, they found out that employment opportunities
were quite limited and that participants used their newly acquired
skills in sewing, for example, to sew clothes for themselves. The
absence of economic opportunities for programme graduates
was therefore not interpreted as a reflection on programme effectiveness but rather as a reflection of circumstances beyond the control of programme personnel. In contrast, the authors did not
give sufficient attention to the differences between programme
participants in the courses of different lengths. In the 150 hour
sewing course, for example, 63 per cent of the participants
were between ages eleven and twenty while only 37.5 per cent
were at that age level in the 200 hour course. Furthermore, 47
per cent of the participants in the 150 hour course reported not
having a sewing machine at home compared to 28 per cent in
the longer course. Whether these differences might have affected
performance in the course is simply not known. There were also
some substantial differences between participants in the short and
longer typing courses. Again, how these differences might have
affected performance is unknown. It seems that in the light of the
differences that existed between the groups before the start of the
courses, one must be extremely careful in attributing performance
differences to the treatment effect (duration of the course). It is
quite possible that the differences that were found could be due
to the differences that already existed between the groups. At the
least, one must be quite tentative in drawing conclusions about
treatment effects. It would seem that a rating of 3, at best, should be
accorded the study on this criterion.

10. Write-up
Many of the comments that have been made about this study could
only be given because the authors were so clear and thorough in
their write-up. Despite a few omissions – standard deviations for
the performance measures, for example – the authors described
their study in a clear and thorough way. Some of this may be due to
the fact that the authors were given a generous page allotment by
the editors of the journal (28 pages). In any case, the study is almost
a model for a clear, coherent presentation of a research endeavour. A
rating of 4 or 5 is clearly warranted on this criterion.


A summary of the ratings is presented below:

Criterion Rating

1. Problem 4
2. Literature review NA or 2
3. Hypotheses and/or questions 4
4. Design and method 3
5. Sampling 2
6. Measures 4
7. Statistics 4
8. Results 4
9. Discussion 3
10. Write-up 4 or 5

A summary of the type presented above provides a quick indication of the strengths and weaknesses of the study. Clearly, there were problems in the areas of design and method and sampling. These were fully described in the report of the study and commented on in the analysis presented above. A reader of the report, using the criteria in the framework, can easily detect the areas where problems occurred and direct attention to these areas in judging the adequacy of the study and deciding how much weight to give to the results. It is hoped that readers of educational research reports will find the criteria helpful in judging the research reports they are presented with.

11 Summary and conclusions
This module has sought to furnish a guide to readers of educational
research reports. The intended audience for this guide is educational
planners, administrators, and policy-makers. A sincere effort has been
made to make this module as non-technical as possible. It is doubtful, though, that this effort has fully succeeded. There are technical issues
involved in research studies and any attempt to avoid them would
be irresponsible. Rather, technical issues have been addressed when
necessary and an attempt has been made to address them in as simple
and direct a way as possible. It is felt that the ideas that underlie
technical issues in research are well within the grasp of the readers of
this module and that assistance with technical details can be sought
when needed.

Readers of educational research reports are not asked to suspend their own judgement when they undertake to read and understand
research reports. Rather, it is hoped that the same abilities
of analysis and interpretation that they routinely use in their
professional lives will be applied to their judgments of reports of
educational research. As a colleague in philosophy once noted,
“There is no substitute for judgement.” It is this same quality of
judgement, aided by some technical understanding, that should be
used when faced with reports of educational research. There is also
ample room for the reader’s commonsense. Too often, educational
research reports announce what are termed ‘significant effects’ as
though this is all that is needed to make the results educationally
important. The term ‘significant effect’ merely denotes that an effect is not likely to be due to chance. Whether the effect is large
enough to be educationally meaningful is another matter entirely
and depends on the judgement of the reader. It is at this point that the perspicacity of the reader is needed and no amount of technical competence will substitute for careful judgement.

The checklist that was presented for evaluating the quality of educational research in the previous section was intended to
serve as a guide. It is hoped that it will be a useful guide, helping
to raise important questions that should be addressed when
judging research reports. The guide needs to be used judiciously,
however. Some of the questions and criteria in the guide may not
be applicable to a particular research study. If so, the reader should
simply disregard them. Thoughtful use of the guide includes
being able to disregard parts of it when they seem irrelevant or
inappropriate.

Appendix

Adult Education Project: Thailand1
In 1976, a new adult education project was started within the
general area of non-formal education. The overall objectives of the
programme were to promote literacy skills, occupational skills,
spare-time earnings, knowledge, skills and attitudes for functioning in society and, thereby, to improve the living standard of the rural
population. The project is still continuing.

Nationally, the Project Office of the Department of Non-Formal Education is responsible for the overall co-ordination of the project
and looks after such matters as project administration, construction,
procurement, expert services, fellowships and apprenticeships, and
the implementation of radio correspondence programmes.

There are also four regional offices – in the north-east, north, south, and central Thailand. The regional offices are responsible
for servicing the various activities in non-formal education in their
region. They undertake curriculum development and the production
of materials relevant to the needs of their region, training, and some
research and evaluation.

One of the major activities is the work conducted by the Lifelong Education Centres. Each centre provides courses in adult continuing education, functional literacy, vocational education, and in topics of special interest to groups requesting them. Each centre also provides services for its immediate neighbourhood such as public libraries, village newspaper reading centres, certain audio-visual programmes and what are known as special activities (examples of which are special talks on the law and elections, and participation in special activities of the province).

1. Extract [(Chapter 3) by Viboonlak Thongchua, Nonglak Phaholvech, Kanjani Jiratatprasoot] in Evaluation in education: an international review series. Vol. 6, No. 1, 1982 (pp. 53-107). Sawadisevee, A.; Nordin, A.B.; Jiyono; Choppin, B.H.; Postlethwaite, T.N. (Eds.). Oxford: Pergamon Press.

The vocational education work is considered to be the most important activity of these centres. The two most popular courses
in 1980 were typing and sewing, and these were selected for
evaluation.

The participants in these courses are thought typically to be 14 to 25 years old from poor rural homes and seeking permanent
employment. At any one centre, a course is provided for a minimum
of 15 persons. The overall purposes of the typing and sewing
courses are that the participants will become proficient in the skills
of typing and sewing. For typing the hope is that they will take jobs
after the course and so increase their family income. For sewing
the hope is that they will either take jobs or will sew clothes for
their families thereby reducing the amount of money they spend on
clothes and thus increasing the family’s income. The participants all
come from regions where the daily per capita income is about US$2.

Objectives of the evaluation project


The general aim was to measure achievement of participants in
the typing and sewing courses, identify the variables that affected
achievement and also examine how participants utilized the skills
they had learned at the centres. These overall objectives were
broken down into the following specific objectives:

• measuring the skills and knowledge gained by participants in typing and sewing courses of different duration (e.g. 150 hours and 200 hours);

• identifying variables having an effect on the achievement of participants at the end of the course;

• investigating whether the graduates took up employment in typing and sewing within six months of the end of the course;

• assessing how participants utilized the skills they learned on the course six months after completing their courses.

Design
A cross-sectional survey was conducted throughout the 24 Lifelong
Educational Centres to acquire a sufficient number of courses for
both subject areas. Practical tests and a cognitive test (typing only)
were administered at the end of each course to assess the courses
and to identify variables associated with achievement. Six months
after the end of the courses, a tracer study was conducted to assess
how and to what extent the graduates were using the skills learned.

There are 24 Lifelong Education Centres throughout the country offering the vocational courses. It was decided to select five centres
offering the typing (150 hours) course and five for the typing (200),
five for the sewing (150) and five for the sewing (200). Therefore, 20
of the 24 centres were involved.

In each course at each centre, one class of 30 participants was chosen at random. If 30 participants could not be found in one class,
another class would be added to make up the 30 participants. Thus,
there would be 150 participants for each of the four courses making
a total of 600 students.

For the tracer study, a sub-sample of the graduates would be selected at random, to be followed up. This would be eight
participants from each course at each centre, making a total of 40
participants for each of the four types of courses.


To obtain the sample, three surveys at different times were conducted. The first survey was launched in early October, 1980,
in order to identify the centres offering 150 and 300 hours for
both subjects. Originally it had been thought desirable to compare
courses of 150 and 300 hours. However, the first survey indicated
that very few centres conducted courses of 300 hours. The idea of
comparing 150 and 300 hour courses was then discarded and a
second survey was conducted to assess whether it would be possible
to compare courses of 100 and 200 hours. The second survey (later
in October, 1980) showed that there were insufficient classes of 100
hours duration. An attempt was then made to identify sufficient 150
and 200 hour courses. A third survey was conducted in November/
December, 1980 and finally an appropriate number of courses was
identified.

However, it was difficult to identify sufficient centres providing courses which had the required number of participants and, at the
same time, fell within our time limits for data collection. The time
limit was therefore extended from the end of March to early June,
1981. The selection of centres was made by purposive sampling
rather than simple random sampling as originally foreseen, because
of time limitations and in order to have a sufficient number of
participants. Purposive sampling involved all centres which
completed their courses between March and early June, 1981. There
were simply not enough centres to make random sampling possible.

Although the sampling design called for 30 participants per centre, the final number fell short of 30. In some cases, centres had already
completed the course a week or two before the dates arranged
with the centre for the administration of the tests. The teachers
tried to persuade the participants to return to take the tests but,
unfortunately, many of them did not do so. The reason given was
that most of the participants were afraid to take the tests.


In other cases, participants enrolled but never attended the course or dropped out of the course before it finished because they had
found a job, were already satisfied with what they had learned,
became tired of the course, or were needed on the land for seasonal
work.

To sum up, the total achieved sample was 498, divided into 135
for sewing (150 hours), 147 for sewing (200 hours), 130 for typing
(150 hours), and 86 for typing (200 hours). The achieved sample is
presented in Table 1.

In order to obtain fairly equal numbers in both sewing groups, one centre (Samusakorn) was added to the sewing 150 hour course
and one centre (Ayuthaya) was removed from the sewing 200 hour
course.

Given the major constraints of money, time and manpower, we were unable to equate the groups any better. Because of the reduction
in sample size, the sub-sample size for the tracer study was also
affected given that about 25 per cent were to be followed up. The
sub-sample design for the tracer study became that presented in
Table 2.

In order to ensure that a 25 per cent sub-sample could be met, all course participants were invited to come to their centres. Those
who did not were visited in their homes, but some could not be
contacted. The final tracer study included 63 per cent of course
graduates as presented in Table 3.


Table 1 Total achieved samples in each center

Sewing

Center (150 hours)  Morning shift  Afternoon shift  Evening shift  Total | Center (200 hours)  Morning shift  Afternoon shift  Evening shift  Total

1. Chiengmai - 23 14 37 1. Nakornswan 14 12 15 41

2. Khonken 11 13 6 30 2. Petchaboon 47 - 7 54

3. Nakornratchasima 10 - 11 21 3. Ratburi 14 - 8 22

4. Nakronswan 7 8 - 15 4. Surin 26 - 4 30

5. Samusakorn - - 9 9

6. Ubonratchatanee 12 11 - 23

Total 40 55 40 135 Total 101 12 34 147

Typing

1. Chiengmai 9 5 19 33 1. Angthong 2 - 12 14

2. Khonken 9 12 9 30 2. Ayuthaya 5 1 3 9

3. Nakronratchasima 16 - 6 22 3. Petchaboon 15 5 - 20

4. Uthaithanee 12 - 1 13 4. Samusakorn - - 17 17

5. Ubonratchatanee 9 17 6 32 5. Surin 9 10 7 26

Total 55 34 41 130 Total 31 16 39 86


Table 2 Sub-sample for tracer study

Sewing Typing

150 200 150 200


No. Center hours hours hours hours

1. Angthong - - - 3

2. Ayuthaya - - - 2

3. Chiengmai 9 - 9 -

4. Khonken 8 - 8 -

5. Nakornratchasima 5 - 5 -

6. Nakornswan 4 10 - -

7. Petchaboon - 14 - 5

8. Ratburi - 6 - -

9. Samusakorn 2 - - 4

10. Surin - 7 - 7

11. Uthaithanee - - 3 -

12. Ubonratchathanee 6 - 8 -

Total 34 37 33 21


Table 3 Achieved tracer sub-sample

Sewing Typing

150 200 150 200


No. Center hours hours hours hours

1. Angthong - - - 11

2. Ayuthaya - - - 4

3. Chiengmai 29 - 19 -

4. Khonken 16 - 14 -

5. Nakornratchasima 14 - 8 -

6. Nakornswan 14 21 - -

7. Petchaboon - 49 - 15

8. Ratburi - 19 - -

9. Samusakorn 8 - - 13

10. Surin - 20 - 10

11. Uthaithanee - - 6 -

12. Ubonratchathanee 9 - 16 -

Sub-Total 90 (66.7%) 109 (74.1%) 63 (48.5%) 53 (61.6%)

Grand TOTAL 315 (63.25%)


According to information obtained from the instructors of the courses, they represented both good and poor performers on the
courses. The remaining students who did not come to the interview
session gave as reasons for being absent that they did not receive
the postcards, happened to be doing important business on that
day, or had moved to another area. A detailed analysis was carried
out to investigate whether there were important differences
between the characteristics of those students who attended the
tracer study interview (the tracer sample), and those who did
not attend and could not be contacted (the drop-out sample). We
compared these two groups on their background characteristics
(sex, age, occupation, level of education, family size, previous level
of skill, etc.), teacher information (sex, age, teacher experience,
teacher training, additional training, etc.) and information on the
students’ test performance.

The differences between the tracer sample and the drop-out sample
were small – being less than one third of a standard deviation
score on each student characteristic. The largest differences were
noted for hours of attendance in the sewing and typing courses.
The maximum difference in attendance hours was 4.8 hours for the
typing course.

This figure was relatively small in comparison to the total length of the courses (150 to 200 hours).

These analyses demonstrated that, although the drop-out sample consisted of some 37 per cent of the total sample, the loss of this
information (in the tracer study) had not created serious problems
of sample bias in our tracer study.
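The kind of attrition check described above can be expressed as a standardized mean difference between the tracer and drop-out groups. The sketch below is illustrative only; the group means and standard deviation are assumed, not taken from the report, which gives only the maximum difference of 4.8 attendance hours.

```python
# Sketch of an attrition-bias check: compare the tracer sample with the
# drop-out sample on a background characteristic, in standard deviation units.
# All numerical values here are hypothetical.

def standardized_difference(mean_tracer, mean_dropout, pooled_sd):
    """Difference between two group means expressed in SD units."""
    return (mean_tracer - mean_dropout) / pooled_sd

# Example: attendance hours, assuming a pooled SD of 20 hours.
d = standardized_difference(mean_tracer=141.0, mean_dropout=136.2, pooled_sd=20.0)
print(f"standardized difference = {d:.2f} SD")
# A value of 0.24 would fall below the one-third-of-a-standard-deviation
# benchmark used by the authors above.
```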


The construction of measures


In all, five measurement instruments were constructed. A teacher
questionnaire was developed by the team members. We took it
with us while conducting the field study and asked the teachers
to complete it. It contained questions on educational and teaching
qualifications, sex, age, teaching experience, additional training on
the subject taught within the past five years, number of participants
in the classes and their attendance record, and equipment facilities
(quantity and quality).

There was one cognitive test for typing and practical tests in both
typing and sewing with a one-page student questionnaire on
student background information (sex, education, age, previous
experience, motivation, siblings, father’s education, mother’s
education, father’s occupation, mother’s occupation and machine
at home). A three-day workshop was held in Bangkok for course
content analysis and test construction. Eight teachers from Lifelong
Education Centres, mobile Trade-Training Schools and the Non-
Formal Education Department at the Ministry of Education, as well
as two local experts, were invited to participate in this workshop.
Cognitive test of typing: content analysis of the curricula for the
150 and 200 hour courses was undertaken by the instructors
teaching the typing courses and team members. Table 4 presents the
topic areas and objectives, and the number of items per topic. The
number of items per topic represents the weights accorded to each
topic. The items were in multiple-choice form with four alternatives
per item, only one was the correct answer.


Table 4 Topics (objectives) and numbers of items in pilot and final cognitive tests for typing

Number of items

Topics Pilot Final

1. Basic knowledge of typing 6 11


2. The parts of a typewriter and how to handle them 5 6
3. Maintenance of a typewriter 2 1
4. Body position when typing 3 2
5. How to feed and release paper and set intervals 7 8
6. How to manipulate typewriter keyboard 7 6
7. Gaining typing speed 3 4
8. Carbon copy typing 2 2
9. Typing on stencil 2 2
10. Principles of typing official letters 13 13

Total 50 55

The pilot test was administered to 134 students in a polytechnic school in Bangkok and at Adult Education centres in Khonken
and Petchaboon. No difficulties were experienced with the
administration of the test. An item analysis was undertaken. Thirty-
five of the items were in the 20 to 80 per cent difficulty range and
had point-biserial correlations of over .30. Fifteen of the items were
either too easy or too difficult, or had low discrimination values.
The reliability of the pilot test was low (KR-20 = .62) but it was
hoped that, by substituting twenty better questions for the fifteen
poor ones, higher reliability would be obtained. This then made for
a final test composed of 55 items (45 items for Thai typing plus 10
items for English typing) as presented in Table 4. The psychometric characteristics of the final test were: for Thai + English typing with
a maximum score of 55: mean = 32.25, S.D. = 5.621, and a KR 21 of .596; for Thai typing with a maximum score of 45: mean = 23.298, S.D. = 5.408, and a KR 21 of .636.
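The KR 21 coefficient quoted above can be computed from the number of items, the test mean and the test standard deviation alone. The short sketch below (illustrative only) applies the standard KR 21 formula to the published figures for the Thai typing test; the result is close to the reported .636, with the small discrepancy attributable to rounding of the published mean and standard deviation.

```python
# KR 21 from summary statistics: k items, test mean, test standard deviation.

def kr21(k, mean, sd):
    variance = sd ** 2
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

# Thai typing test as reported above: 45 items, mean 23.298, S.D. 5.408.
print(round(kr21(45, 23.298, 5.408), 2))   # approximately 0.63
```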

Practical test of typing: since, in the curriculum, courses were offered for typing both in Thai and English, practical tests were constructed for both languages. In the event, very few participants took courses in English typing; however, both tests were analysed.

A Thai text consisting of 678 strokes and an English text consisting of 598 strokes were selected by the typing instructors. A participant
would type each text twice and would hand in the better version.

Two scores were calculated. The first was a combined speed and
accuracy score which was the number of correctly typed words per
minute. The formula used was:

No. of words = (No. of strokes) / (5 or 4)

No. of words per minute = [(No. of words) – (No. of wrong words × 10)] / (Time in minutes)

In the number-of-words formula, the denominator was 5 or 4 because this was the presumed average number of letters per word
in the English and Thai languages. This was the generally used
formula in non-formal education typing classes in the Lifelong
Education Centres in Thailand. In the number-of-words per minute
formula, the constant ‘x 10’ was used in the numerator because it
was assumed by the instructors that time spent in correcting one
mistake was equal to the time spent in typing 10 strokes. A separate
score was also given for a combined format and tidiness measure.
The test was piloted in the same schools as the cognitive typing test.
No problems were experienced in its administration. No changes
were made in the test. The psychometric characteristics of the final
tests are given in Table 5.
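The two formulas above can be combined into a single calculation. The sketch below follows the formulas exactly as printed, including the fixed penalty of 10 per wrong word and the zero floor for negative corrected scores mentioned later in the regression section; the input values in the example are hypothetical.

```python
# Corrected words-per-minute score, following the report's formulas.

def words_per_minute(strokes, wrong_words, minutes, letters_per_word=5):
    """letters_per_word is 5 for English and 4 for Thai text.
    Each wrong word attracts the fixed penalty of 10 used in the formula;
    negative corrected scores are set to zero, as the scorers did."""
    words = strokes / letters_per_word
    corrected = words - wrong_words * 10
    return max(corrected, 0) / minutes

# Hypothetical example: the 678-stroke Thai text, 3 wrong words, 8 minutes.
print(round(words_per_minute(678, 3, 8, letters_per_word=4), 1))   # about 17.4
```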


Table 5 Psychometric characteristics of the final tests

Maximum
score Mean S.D. KR 21

Thai + English typing

Speed + accuracy 60 16.193 10.724 .914

Format + tidiness 10 6.521 3.220 .867

Thai typing only

Speed + accuracy 30 7.298 7.009 .917

Format + tidiness 5 2.721 2.115 .948

Practical test of sewing: It was agreed that five types of garments where there was a good deal of variation in both the process and the product would be the subject of the sewing test. The garments
were: a blouse with long sleeves, a blouse with short sleeves, a
blouse with built-in sleeves, a skirt with one fold in the front
and one at the back, and a six-piece skirt. After the pilot work,
five garments were scored according to the criteria given in the
following paragraph. The ‘blouse with built-in sleeves’ (see Figure 1)
had a normal distribution of scores.


Figure 1 A blouse with built-in sleeves

The experts selected this for the final testing. The scoring criteria
were set by the sewing instructors and approved by the local
experts. The scoring system was set by examining the difficulty
and time consumed in the sewing process. Thus, the experts agreed
that out of a total of 100 points, body measurement should receive
10 points, building pattern 25 points, laying and cutting fabric 25
points, and sewing 15 points.

1. It was agreed that there were 10 basic body measurements needed for the blouse with built-in sleeves and there would be
one point for each measure. These were (i) shoulder to shoulder,
(ii) chest, (iii) neck to breast and breast to breast, (iv) waist, (v)
round shoulder and underarm, (vi) and (vii) between arm pits
(front and back), (viii) and (ix) neck to waist (front and back),
and (x) shoulder to length desired.


2. Building pattern (25 points) was divided into five sections; each
section was awarded five points. They were (i) front and back
pieces, (ii) collar, (iii) sleeve, (iv) bent lines, and (v) calculation.
In each section five points were awarded if the measurement
figures were correctly converted into pattern figures and the
pattern was drawn correctly. One point was subtracted for each
mistake.

3. Laying, tracing and cutting (25 points) was divided into three
sections. They were (i) front and back pieces (10 points), (ii)
collar (10 points), and (iii) interfacing of collar and sleeves (5
points). Scoring depended on how well students laid the cloth
in relation to its line (grain), on whether they left enough spare
cloth for sewing, and on whether they cut out all the pieces
correctly. Three points were subtracted for each mistake in
laying, cutting, front-and-back pieces, and collar; and two
points for incorrect interfacing.

4. Sewing (15 points) was divided into five sections and each
section was awarded three points. These were: (i) sewing the
shoulder and side seams, (ii) sewing the collar, (iii) sewing
the sleeves, (iv) sewing the button-holes and buttons, and (v)
hemming. If the participants did the sewing process correctly,
i.e., started from part 1 and went through to part 5 in the
correct order, they would get a full score. If they jumped one
step, they lost three points. Sub-scores were calculated for each
of the four major process sections.

There were two major sub-scores for product: goodness of fit (10
points) and tidiness of sewing (15 points). (See Figure 2).

1. Goodness of fit when the dress was fitted to the model was
also divided into five parts, each receiving two points. The
parts were body, sleeves, collar, length of blouse, and overall
goodness of fit. Two points were awarded for the fit in each part of the test.


2. Tidiness of sewing was sub-divided into five parts, each part receiving three points: collar, sewing on buttons, button-holes,
stitching, and hemming. The pilot work was undertaken using
40 participants in the Khonken and Petchaboon learning
centres. No serious problems were encountered. The time limit
was set at five hours and this was adequate for the blouse with
built-in sleeves.

Figure 2 Body measurements

At the final testing, two Ministry sewing experts were the scorers.
Each scorer scored every garment according to the criteria and
their average was calculated to represent a participant’s score.
Unfortunately, scorers were instructed only to give a global score for each of the six major sub-scores. If we were to repeat the exercise, we should score each separate item (for example, each item in the body measurement score). The reliability has been calculated for each sub-score using KR 21. The means, standard deviations and KR 21s are presented in Table 6. All items included in the total sewing score had loadings of at least .50.

Interview schedule for tracer study: This instrument was developed by the team members. The questions sought information on the centres from which respondents had graduated, on how much money they could earn, on how much money they could save, and on how they utilized the knowledge they gained from the course.

Table 6 Means, standard deviations, and reliability coefficients for the sewing test

Mean S.D. KR 21

A. Process

1. Body measurement (10) 9.222 1.518 .799


2. Building pattern (25) 19.294 3.751 .715
3. Laying and cutting fabric (25) 20.528 2.518 .862
4. Sewing (15) 10.209 4.440 .906

B. Product

1. Goodness of fit (10) 6.816 3.174 .895


2. Tidiness (15) 9.102 4.180 .867

Total score 75.41 14.51 .921


Data collection
The research team and staff visited and collected data at sample
centres during March and early June, 1981, at the end of the typing
and sewing courses. The course instructors were asked to fill out the
questionnaires and return them to the research team. At that time
the team administered both cognitive and practical tests for both
courses to the students.

About six months after giving the tests, the tracer study was carried
out by the team members and some additional staff. All data were
coded and punched in Bangkok. The analyses were undertaken at
the National Statistical Office in Bangkok and the D.K.I. Computer
Centre in Jakarta.

Results

Table 7 presents, in percentage form, the characteristics of both students and teachers for the six courses. The data showed
clearly that the students coming to the sewing courses were quite
different from those who attended the typing courses. In the sewing
courses we found only women (except that one man attended one
of the short courses), who tended to be older, less well-educated,
more often in manual jobs and from manual workers’ homes but
with more previous experience. They most often chose to do these
sewing courses for themselves and their families, rather than to
further their careers or to gain educational credits. More than half
of them had sewing machines in their homes.

The teachers on these sewing courses were all women who tended
to be older and more experienced than the typing teachers, but with
fewer formal qualifications.

The long (200 hours) sewing course attenders were an elite. They
were older women who already had some experience in sewing and
were seeking to reach a higher standard. Sixty per cent of them
wanted to sew for themselves and their families, and 72 per cent


Table 7 Student and teacher background variables

Columns: Typing Short (150 hrs) Thai+Eng | Typing Short (150 hrs) Thai | Typing Long (200 hrs) Thai+Eng | Typing Long (200 hrs) Thai | Sewing Short (150 hrs) | Sewing Long (200 hrs)

Student variables N=72 N=58 N=47 N=39 N=135 N=147

Sex : Percent women 43.1 55.2 40.4 48.7 99.3 100


Education
Lower primary 6.9 10.3 - - 51.9 46.9
Upper primary 25.0 20.7 29.8 - 30.4 23.1
Lower secondary 38.9 48.3 48.9 59.0 8.1 15.6
Upper secondary 20.8 13.8 14.9 23.1 1.5 8.2
Age
11 - 20 65.3 75.8 80.7 41.0 63.0 37.5
21 - 30 30.7 24.1 19.1 56.5 29.7 42.3
31 + 4.2 - - - 6.6 19.7

Previous experience : None 91.7 84.5 89.4 92.3 56.3 36.7


Occupation
Manual 19.4 17.2 17.0 30.7 55.6 47.6
Student 59.7 60.3 46.8 38.5 8.9 5.4
Motivation
Credit 16.6 44.8 44.7 7.7 4.4 0.7
Career 54.2 34.5 38.3 64.1 37.8 32.0
Self and family 23.6 12.1 14.9 23.1 53.3 59.9
Siblings : 7+ 47.3 24.1 40.5 36.0 34.8 34.7
Mother’s education
Primary 73.6 63.7 51.1 89.7 83.7 69.4
Secondary 8.4 1.3 - 2.6 - 0.7
Father’s education
Primary 68.1 63.8 48.9 66.6 78.5 64.7
Secondary 13.9 6.8 2.1 17.9 4.4 8.1
Mother’s occupation
Manual 63.9 62.1 46.8 74.3 77.0 55.7
Housewife 6.9 13.8 8.5 12.8 3.7 15.0
Father’s occupation: Manual 62.5 58.7 49.0 66.7 76.3 61.2
Machine at home : No 98.6 98.3 100.0 94.9 47.4 27.9


Table 7 (continued)

Columns: Typing Short (150 hrs) Thai+Eng | Typing Short (150 hrs) Thai | Typing Long (200 hrs) Thai+Eng | Typing Long (200 hrs) Thai | Sewing Short (150 hrs) | Sewing Long (200 hrs)

Teacher variables N=72 N=58 N=47 N=39 N=135 N=147

Sex : Female 43.1 43.1 57.4 100.0 100.0 100.0


Qualification: Certificate 100.0 100.0 100.0 100.0 72.6 77.6
Additional teacher training
within 5 years 100.0 77.6 - 35.9 54.1 85.7
Teacher experience
None - - 23.4 84.6 28.9 39.5
11 + years - - - - 11.1 35.4
Teacher's age
Over 35 - - - - 17.8 88.4

Course variables
Class size : Under 16 69.4 60.3 34.0 61.5 43.7 59.9
Shift
Morning 34.7 51.7 42.6 28.3 29.6 68.7
Afternoon 40.3 8.6 10.6 15.4 40.7 8.2
Evening 25.0 39.7 46.8 56.4 29.6 22.4
Equipment : Lacking 59.1 74.1 76.6 2.6 56.3 72.1
Attendance : 75% + 90.2 82.9 79.0 64.5 92.6 63.5
Quality of facility
Excellent - - 34.0 48.7 17.0 -
Fair 98.6 69.0 66.0 51.3 83.0 100
Poor 1.4 31.1 - - - -

Table 8 Mean achievement scores in sewing for 150 and 200 hour courses

Course Body Building


hours measure pattern Cutting Sewing Fit Tidiness Total
200 9.15 19.33 29.08 11.01 7.29 9.70 77.56
150 9.29 19.44 19.90 9.56 6.37 8.57 73.12
Difference -0.14 -0.11 ***1.18 **1.45 *.92 *1.13 **4.44


had machines at home. They were taught by the oldest and most
experienced teachers, generally during the morning shifts. As with all
longer courses, however, their attendance record was less complete.

The four varieties of typing courses were generally about half-and-half men and women, and most had at least some secondary
education. They were also much younger than the sewing course
attenders (for example, 81 per cent of the long Thai plus English
course were teenagers), and about half of them were students.

Their teachers, likewise, were about half-and-half men and women (though all the teachers on the long Thai typing course were
women) and tended to be much younger and better qualified than
the sewing teachers (all of them had at least a Certificate and some
had university degrees).

Students in these four courses varied a great deal, but to some extent the long Thai typing course stood out: participants in this
course had a better education, they were a little older (mostly in
their twenties), they were more often in manual occupations and
they were most often career-minded in their motives for doing the
course.

Their parents were slightly better educated, though more often in manual occupations than other typing students. More of them
attended evening shifts in rather better equipped classrooms where
they were taught by an all-women staff of mostly inexperienced
teachers. One got the impression here of a group that was striving
for upward social mobility by doing a longer Thai typing course in
their own time after work.

It was not clear why these latter students would not rather put
their energies into a Thai plus English typing course; perhaps their
knowledge of English was insufficient. Clearly, students did treat
these courses as graded steps in accomplishment: it was not true to say, for example, that students doing a combined Thai plus English
course were already able to type in Thai (about 90 per cent in both
the short and long Thai plus English courses lacked previous typing
experience). Quite a large number of students in these courses were
young teenagers picking up an additional credit while they were studying in the adult continuing education programme by attending day-time classes.

The participants’ achievement


The major objective of the study was to examine the relative
effectiveness of the two types of courses, 150 hour and 200 hour
sewing and typing. The same tests were administered to students
in both courses. Table 8 presents the results for the six sub-tests in
sewing.

As can be seen, the total score for the 200 hour course was
significantly higher than that for the 150 hour course. The largest
differences were for sewing and cutting. However, for body
measurement and building patterns, there was no difference. Table 9
presents the results for typing. As can be seen, the total scores on
the cognitive test for longer courses were higher than for shorter
courses. However, the difference was statistically significant only for the Thai typing course. The major difference was for the principles of typing official letters. For Thai plus English typing, the major differences were for ‘how to feed and release paper and set intervals’ and ‘principles of typing official letters’.

For the practical test, the scores for longer courses were not
significantly higher than for shorter courses on ‘speed’. Remarkably,
the scores on ‘format plus tidiness’ for longer courses were less than
for shorter courses.


The objectives of the tracer study were to find out whether the
graduates took employment within six months of the end of the
course, and how participants used the skills they learned in the
course.

The criteria used for these objectives were the amount of money
the participants earned, the money they saved, the amount of
time spent on typing and the reason why they did not utilize their
knowledge and skills. Table 10 presents this information.

In the sewing courses, we found over 90 per cent of the graduates for both courses used the skills they learned in the courses for
themselves and their families. Only 6.7 per cent of participants
(1.8 per cent for the long courses) said they did not use the skills
they had learned. The reasons they gave included the fact that they
had no time, no machine or that they lacked confidence. The data
showed that 63.5 per cent of participants from the short sewing
course and 46.3 per cent from the long sewing course saved less
than 50 Bahts (1 US Dollar is equal to 23 Bahts) per month.


Table 9 Mean achievement score in typing (150 and 200 hour courses)

I
Course | Basic knowledge | Parts | Maintenance | Body position | Feed/return | Manipulation
Thai
200 hours 2.18 2.54 0.69 1.77 2.74 4.90
150 hours 1.84 2.67 0.90 1.76 2.66 4.50
Difference 0.34 -0.13 **-0.21 0.01 0.08 0.40

Thai + English
200 hours 7.23 3.55 0.77 1.70 2.85 5.06
150 hours 6.45 4.14 0.47 7.78 3.97 4.07
Difference **0.78 -0.59 0.30 -0.08 ***-1.12 ***0.99

II
Course | Speed typing | Carbon | Stencil | Principal | Total | Practical: Speed | Practical: Format

Thai
200 hours 1.46 1.23 1.13 6.21 24.85 8.59 2.74
150 hours 1.29 1.16 0.90 4.59 22.26 7.47 3.02
Difference 0.17 0.07 0.23 ***1.62 *2.59 1.12 -0.28

Thai + English
200 hours 2.53 1.45 1.02 7.02 33.19 16.47 6.28
150 hours 2.07 1.36 1.44 5.97 31.84 16.01 6.68
Difference *0.46 0.09 ***-0.42 *1.05 1.35 0.46 -0.4


Table 10 Information on tracer study

Typing 150 hours: Thai+English, Thai | Typing 200 hours: Thai+English, Thai | Sewing: 150 hours, 200 hours

N = 32 N = 31 N = 32 N = 22 N = 90 N = 109

Earn money or not?


No 84.4 96.8 93.5 95.5 56.7 58.7
Yes 15.6 3.2 6.5 4.5 43.3 41.3
• If they earn, how much/month?
(N=5) (N=1) (N=2) (N=1) (N=39) (N=45)
1 - 50 Bahts 40 - 50 - 46.3 59.6
51 - 100 Bahts - 100.0 - - 12.9 13.2
101 - 150 Bahts - - - - 10.4 2.2
151 - 200 Bahts - - - - 10.2 -
200 + Bahts 60 - 50 100.0 20.5 24.3
• If they don’t earn money, why?
(N=27) (N=30) (N=29) (N=21) (N=51) (N=64)
No employment 29.6 36.6 44.8 38.1 19.6 7.8
No time 7.4 33.3 34.5 23.8 27.5 46.9
No machine 22.2 6.7 17.2 33.3 15.7 7.8
Lack of confidence 3.7 6.7 3.4 - 31.4 26.6
Continue studying 18.5 10.0 - - 3.9 1.6
Work free of charge - 6.7 - 4.8 2.0 9.4

Sew for self/family?


No - - - - 6.7 1.8
Yes - - - - 93.3 98.2
• If sewing, how much money saved per month?
(N=84) (N=107)
1 - 50 Bahts - - - - 63.5 46.3
51 - 100 Bahts - - - - 15.6 33.4
101 - 150 Bahts - - - - 1.2 11.7
151 - 200 Bahts - - - - 3.6 3.7
200 + Bahts - - - - 15.6 3.6


Table 10 (continued)

Typing 150 hours: Thai+English, Thai | Typing 200 hours: Thai+English, Thai | Sewing: 150 hours, 200 hours

N = 32 N = 31 N = 32 N = 22 N = 90 N = 109
• If they don't sew, why?
(N=6) (N=2)
No machine - - - - 33.3 -
Lack confidence - - - - 33.3 -
No time - - - - 33.3 100.0

After graduation, ever typed?


No 21.9 19.4 32.3 40.9 - -
Yes 78.1 80.6 67.7 59.1 - -
• If typed, how many minutes per week?
(N=25) (N=25) (N=21) (N=13)
1 - 60 64.0 52.0 81.3 61.5 - -
61 - 120 12.0 8.0 4.8 - - -
121 - 180 8.0 - 9.6 - - -
180 + 16.0 40.0 4.8 38.5 - -
• If don't type, why?
(N=7) (N=6) (N=10) (N=9)
No machine 85.7 66.7 80.0 85.9 - -
No employment - - - - - -
No time - 33.3 10.0 - - -
Isn't part of job 14.3 - 10.0 11.1 - -


In this research study, we defined participants who earned money from sewing or typing as those who took up either permanent or
part-time jobs. Only 43.3 per cent of the short course participants
and 41.3 per cent of those in the long course took up jobs. Of
those, 46 per cent from the short course and 60 per cent from
the long course earned less than 50 Bahts per month. Again, the
major reasons for not taking up a job were lack of time and lack of
confidence.

From the four types of typing courses, many participants stated that
they continued to type, i.e., use the skill six months after the end
of the course. Seventy-eight per cent from the short course (Thai +
English) and 59 per cent from the long course (Thai only) used their
typing skills. Those who did not use their skills gave as reasons that
they had no typewriter at home, no time or that their occupation did
not require the typing skills they had learned in the course. Only
a few (15.6 per cent) earned money from typing because many of
them were unable to find a job (there were very few employment
possibilities in their area) and they did not possess a typewriter at
home. Some stated that they lacked confidence or that they typed
free of charge for their friends. Many of them had no time because
they were still studying.

However, just over 50 per cent typed for their own pleasure up to
one hour a week and some of them up to three hours per week.
Those who had taken the Thai typing courses spent more time
typing than those who had taken the Thai plus English courses. In
fact, 39 per cent of Thai typing course participants typed more than
three hours a week.


Regression analyses
Two main sets of regression analyses were conducted. The first
concerned sewing and the second typing.

Sewing included three regression analyses. The first predicted end-of-course achievement and the criterion was the total sewing score.
The second concerned the tracer study and the criterion was the
amount of money earned per month six months after the end of the
course. The third was money saved per month six months after the
course.

For typing, it was intended that a measure of typing speed which incorporated a correction for typing accuracy, and a measure of typing knowledge, would provide criterion variables for regression analysis. However, a detailed investigation of the distributions and factorial structures of the two measures indicated that neither would be appropriate for use as a criterion variable. The measure of typing speed had a severely bimodal distribution, which occurred because the correction for typing accuracy was applied in such a fashion that many students whose ‘actual’ corrected score was negative were given a score of zero. The measure of typing knowledge was subjected to a principal components analysis followed by Varimax rotation, and it was discovered that there was no sound evidence for assuming that the total score on the test was assessing a single dimension which could be readily interpreted.

Consequently, the regression analyses for typing were limited to the use of only a single criterion: the number of minutes per week spent in typing six months after the end of the course. For each analysis the possible independent variables were scrutinized for skewness. Highly skewed variables were dropped. Correlations were calculated among independent variables and between these variables and the criterion.


At the same time, we had a chronological model in mind. We assumed that various pre-course variables would be important and that these would be included in one cluster or block of variables because the participants had been exposed to these before coming to the course. Secondly, there was a set of variables which characterized the course itself. These would be entered as a second cluster or block of variables.
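The blockwise (sometimes called hierarchical) entry of variables described in the last two paragraphs can be sketched as follows. The data are randomly generated and the variable groupings are only loosely modelled on those in the report; the point of the sketch is simply the increase in R-squared as the second block is added.

```python
# Sketch of blockwise regression: Block I (pre-course variables) entered first,
# then Block II (course variables); the gain in R-squared is examined.
# All data below are randomly generated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 282                                   # number of sewing participants with complete data
block1 = rng.normal(size=(n, 4))          # e.g. father's occupation, kits, age, previous experience
block2 = rng.normal(size=(n, 7))          # e.g. teacher training, facilities, shift, course length
y = 0.3 * block1[:, 2] + 0.3 * block2[:, 1] + 0.2 * block2[:, 2] + rng.normal(size=n)

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])       # add an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

r2_block1 = r_squared(block1, y)
r2_blocks12 = r_squared(np.column_stack([block1, block2]), y)
print(f"R2 after Block I = {r2_block1:.2f}; after Blocks I and II = {r2_blocks12:.2f}")
```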

Sewing: On the basis of correlation with the criterion and with other variables, the following were selected. For all regressions,
father’s occupation, participant’s age, kits (availability of a
sewing machine at home) and previous ability (meaning previous
experience in sewing) were included in Block I. For the tracer study
criteria, one extra variable was added, namely ‘objective’, which was
a coding of the reason why the participant wished to take the course
(1 = for a career, 2 = to gain a credit, and 3 = to sew for the family or
for herself).

For the end-of-course achievement, regression Block II included teacher training (the level of pre-service teacher training), additional
training, (0 = no additional training, 1 = some additional training
in the last five years), facility (0 = teachers perceived the facilities
for sewing at the centre to be inadequate, 1 = adequate), quality of
facility (1 = very poor, 2 = poor, 3 = fair, 4 = excellent), the shift in
which participants attended the centre (1 = morning, 2 = afternoon,
and 3 = evening), the age of the teacher, and whether the course
was a 150 hour or 200 hour course (1 = 150, 2 = 200). It had been
proposed to enter shift as a dummy variable, but because of the
correlations, it was decided to leave it coded as 1, 2, 3.

For the tracer study regression analyses, Block II consisted of teacher training, shift, 150-200 hour course, and size of class. For
the tracer study, the sewing test score at the end of the course was
entered as a third block.


Let us now examine the results. Tables 11 and 12 report the results for achievement in the course and for the tracer study respectively. The complex multi-stage sample designs employed in this study did not conform to the well-known model of simple random sampling. Consequently it was not possible to employ the standard error estimates provided by the SPSS computer package (Nie et al., 1975). Instead, it was decided to use the formula provided by Guilford and Fruchter (1973: p. 145) for the standard error of correlation coefficients to obtain approximate estimates of the standard error of a standardized regression coefficient. This decision provided more conservative error limits than the SPSS programme and consequently represented a more realistic approach to testing the significance of the standardized regression coefficients (Ross, 1978). Accordingly, as the number of students in Table 11 is 282, .12 represents two standard errors, while the number of students in Table 12 is 199, so 0.14 represents two standard errors. Only 28 per cent of the variance was explained by the variables in the model predicting achievement at the end of the course (Table 11). This is disappointing, and clearly more effort will have to be made to identify other variables which are likely to be influencing sewing achievement, and to include these in future studies of this kind. The only variables to survive the regression were participants' age, the additional training of the teacher, and the quality of the facilities as perceived by the teacher. Older participants had higher achievement scores, better quality facilities were associated with higher achievement, and participants whose teachers had received some in-service training in sewing during the last five years scored higher on the end-of-course achievement test. The latter two variables were clearly policy variables, and it would seem advantageous to attempt to supply machines and materials of adequate quality and to give teachers special in-service courses.
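The two-standard-error thresholds quoted in this section follow directly from the sample sizes. A minimal check, assuming the approximation se(ß) = 1/sqrt(N - 1), i.e. the standard error of a correlation coefficient when the population value is zero:

    from math import sqrt

    for n in (282, 199, 116):
        two_se = 2 / sqrt(n - 1)
        print(f"N = {n}: two standard errors are approximately {two_se:.2f}")
    # Gives 0.12, 0.14 and 0.19 -- the thresholds used in Tables 11, 12 and 13.
    # Coefficients smaller than the threshold are asterisked as non-significant.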


At the same time, it is interesting to note that being in the 150 or 200 hour course did not produce significantly different achievement scores, other things being equal. Nor did the shift in which the participant was enrolled. The initial pre-service training of the teachers was not associated with achievement.

Table 11   Simple correlations, Beta coefficients, and R-squared
           for sewing achievements

                          r       ß (Block I)   ß (Block II)
  Father's occupation   -.16          *              *
  Kits                   .20          *              *
  Participant's age      .30         .25            .15
  Previous experience    .16          *              *
  Teacher training      -.01                         *
  Additional training    .43                        .34
  Quality of facility    .16                        .21
  Facility              -.11                         *
  Shift                  .15                         *
  Teacher age            .10                         *
  150/200 hour course    .15                         *

  R2                                 .12            .28

  Note: N = 282, 2 se(ß) = 0.12.
  ß not exceeding two standard errors are asterisked (*).


Table 12   Simple correlations, Beta coefficients, and R-squared
           for earning and saving money

  Criterion: money earned
                          r       ß (Block I)   ß (Block II)   ß (Block III)
  Father's occupation    .00          *              *              *
  Participant's age      .00          *              *              *
  Objective             -.21        -.23           -.17             *
  Kits                   .18         .31            .34            .31
  Previous experience   -.09        -.24           -.25           -.26
  Teachers' training     .13                         *              *
  Size of class         -.03                         *              *
  Shift                 -.15                       -.23           -.29
  150/200 hours          .17                       -.26           -.30
  Total sewing score     .28                                       .35

  R2                                  .13            .21            .31

  Criterion: money saved
                          r       ß (Block I)   ß (Block II)   ß (Block III)
  Father's occupation   -.34        -.33           -.28           -.24
  Participant's age      .15          *              *              *
  Objective             -.23        -.26           -.19             *
  Kits                   .07          *              *              *
  Previous experience    .13          *              *              *
  Teachers' training     .38                        .15            .19
  Size of class         -.30                       -.24           -.26
  Shift                  .20                         *              *
  150/200 hours         -.13                       -.15           -.19
  Total sewing score     .38                                       .36

  R2                                  .20            .33            .44

  Note: N = 199, 2 se(ß) = 0.14.
  ß not exceeding two standard errors are asterisked (*).


Table 12 presents the results of two regression analyses, the first taking 'money earned' as the criterion, and the second taking 'money saved' as the criterion. Five variables were significantly associated with money earned. These were the possession of a sewing machine at home (as we would expect), previous experience in sewing (but this was negatively associated with earning, meaning that those who had no previous experience earned more after the course than those with previous experience), shift (those in the morning shift earned more), the course attended (those in the 150 hour course earned more than those in the 200 hour course; in fact, being in the 150 hour course was worth 289 baht per month more than being in the 200 hour course), and the end-of-course achievement as measured by the test.

Size of class, level of teacher training, the reason for joining the
class, age and father’s occupation were not associated with money
earned. Again, however, the regression accounted for only 31 per
cent of the variance.

For money saved, five variables were important. Participants whose fathers engaged in agriculture and labour saved more. Participants whose teachers had higher levels of pre-service education, those in smaller classes, those in the 150 hour course, and again, gratifyingly, those with higher end-of-course achievement scores saved more. In this case, the regression accounted for 44 per cent of the variance.

What can we glean about the adult education non-formal programme in sewing?

Firstly, the sewing course itself was important because those with
higher scores earned more and saved more, but we must remember
that higher scores were primarily a function of participants’ age,
additional teacher training and the quality of the facilities.


Secondly, there was no advantage for anyone to have been in the 200 hour course as opposed to the 150 hour course. There would seem to be good reason to abandon the 200 hour course and to provide additional training for teachers and improve the quality of facilities. Why size of class should be inversely related to 'money saved' (and shift to 'money earned') is not immediately clear. We suspect that classes might have been so large that the teachers were not able to supervise all participants, so that, after graduation from the course, they lacked the confidence to sew even for themselves and their families. Regarding the shift and 'money earned', the morning shift students might have been those who were in the waiting period for jobs and wanted to take sewing seriously as a career (as opposed to evening shift enrollees who entered to take it as a hobby).

Typing: As mentioned earlier, the criterion measure was limited to time spent per week on typing six months after the end of the course. We were hesitant about the use of this criterion variable because we believed that it would contain substantial measurement error. For example, time spent in typing could be influenced either by a desire to practise typing, or perhaps by the need to type personal and/or family papers and documents. It would therefore be likely that participants might spend a great deal of time typing in one week yet very little in the following week. The students' answers to the question might therefore have represented an 'average time' spent over the period since their course had finished, or an estimate of the amount of time they had spent in the few weeks before their interview.

Variables entered into the regression were arranged in two blocks, bearing in mind the same process by which we chose and grouped the variables for the sewing analysis. In fact, the two-block regression model was equivalent to the first two blocks used to examine money earned and money saved in sewing. The third block was omitted because the cognitive measure was unsuitable. Variables in Block I included father's occupation, participant's age, typing equipment at home (whether or not they or their family owned a typewriter) and personal study objective (for a career, for credit, or for self, family and others). Block II included teacher qualification, size of class, the shift attended (morning, afternoon, or evening), and finally the type of course participants enrolled in (150 or 200 hours).

The achieved sample size for these interviews was 116 participants (53.7 per cent of all typing participants). As a result, 0.19 was calculated as being about two standard errors for the beta coefficients in this analysis. The results of the analysis are presented in Table 13.

Table 13   Time spent typing per week

                          r       ß (Block I)   ß (Block II)
  Father's occupation    .07          *              *
  Objective              .09          *              *
  Participant's age      .19         .22            .22
  Kits                   .16          *              *
  Teacher training      -.04                         *
  Size of class         -.13                         *
  Shift                 -.14                         *
  150/200 hours         -.12                         *

  R2                                  .09            .14

  Note: N = 116, 2 se(ß) = .19.
  ß not exceeding two standard errors are asterisked (*).

The results showed that the simple correlations between the predictors and the criterion were all below two standard errors, except for the variable 'objective', which barely reached significance. This variable was again the only variable which continued to be significant in both blocks at the end of the analysis. This meant that typing students whose objective was to take the course for miscellaneous purposes, e.g. for self and family, typed more than those who did not have this objective. In the case of sewing, however, the result was reversed: participants in the sewing courses who used sewing skills six months after the course were those whose motivation for taking the course had been to make a career of sewing.

The total amount of variance predicted at both stages was rather low, reaching only 14 per cent when all predictors had been entered. In summary, at both the simple correlational level and at the multiple correlational level, we had very limited success in explaining the amount of time spent in typing by students after they had finished their courses.

It is extremely difficult to draw any policy recommendations from this section of the analysis, for the reasons outlined above. At best, we can make two suggestions. First, examine the personal study objective variable closely. We suggest that typing courses should be offered at three levels (basic, intermediate and advanced typing) so that students in the more advanced courses are those who want to continue to take typing for very obvious reasons, e.g. a career. This approach should then lead to a strategy for reducing drop-outs. Secondly, draw upon the experience we have gained to look more closely at the potential for extreme multidimensionality in tests of practical vocational skills such as typing. To the best of our knowledge, this field of research has received limited attention in our country. We believe it presents different problems from those covered in the research available from western countries because of the substantially different structure of the language.


Conclusions and recommendations

The study had four main aims. The first was to discover whether participants in the 200 hour course learned more than those in the 150 hour course. In sewing, the participants in the 200 hour course gained significantly higher scores on the end-of-course sewing test. In typing, there was no difference in scores between the 150 and 200 hour courses. Whereas we have confidence in our sewing test, we must express some concern about our typing test, so the typing results must be interpreted with caution.

The second aim of the study was to identify those variables influencing the achievement of participants at the end of their non-formal education course. As can be seen from the detailed discussion earlier, we failed to produce a good typing test and therefore present no multivariate analyses using this criterion. However, for sewing three variables were identified, two of which were important. The first was additional teacher training. Participants whose teachers had received at least one in-service teacher training course on sewing in the last five years obtained higher sewing scores than participants whose teachers had not attended such in-service courses. The second was the quality of facilities. The higher the teachers rated the quality of the facility, the higher the scores of the participants. Facilities included the sewing machines, zigzag machines and irons. The third was a characteristic of the participants: older participants tended to score slightly higher than younger ones. Of these three variables, it was 'additional teacher training' which was by far the most important.

The third and fourth aims were to discover whether participants in the study took up employment using their skills within six months of the end of the course. The definition of employment was whether participants earned money from either full- or part-time work. Using this definition, only 3.5 per cent of those in the Thai-only typing courses (short and long) took up employment. In the Thai plus English typing courses, 16 per cent from the 150 hour course and 7 per cent from the 200 hour course were employed. In sewing, 42 per cent became employed, with no difference between the short and long courses. The result for typing is disappointing, but participants were asked how much they typed, irrespective of whether this was for employment or not. A further 66 per cent from the Thai typing course indicated that they typed, but without reimbursement. From the Thai plus English course, a further 62 per cent from the short course and 53 per cent from the long course typed, but again without reimbursement. In general, typists were typing less than one hour per week after six months. Many of the participants in the typing courses were students in other adult continuing education courses held in the centres. These courses were for obtaining educational certification equivalent to full-time schooling. Fifty per cent of all participants in the Thai typing courses were students, as were just over 50 per cent in the Thai plus English courses. It could not be expected that those who were still students six months after the end of the course would be employed. Although only 42 per cent of the participants in the sewing classes were employed, another 50 per cent sewed for their families and their own use. Thus, the sewing course can be regarded as highly successful.

A further analysis was conducted to determine how much money the sewing participants earned and saved six months after their course, and which were the major determinants of earning and saving. Forty-two per cent were earning money, in amounts varying from 25 to 3,000 Baht per month. Five factors were important in determining money earned: the possession of a sewing machine at home, achievement at the end of the course, which course they attended (those in the 150 hour course earned more), which shift they attended (the morning shift earned more) and their previous experience (those with experience in sewing before the course earned less). Money saved was also mainly determined by five factors, only two of which were the same as for money earned. These two were the course (150 hour course participants saved more) and the score at the end of the course. The other three factors were being from farming families, being members of smaller classes at the centre, and having teachers with higher levels of pre-service training.

To summarize the findings for sewing, it would appear that the important variables which affect the work of the centres are teacher training, both pre- and in-service, the quality of facilities, the shift (the morning shift performed better) and the length of the course (150 hour participants performed better and earned and saved more). It would be unwise to comment on typing because it was impossible to evaluate it accurately, partly because so many of the participants were still studying.

On the basis of our conclusions, we make the following suggestions to the Department of Non-formal Education at the Ministry of Education:

• Serious consideration should be given to abandoning the 200 hour sewing course and concentrating on improving the 150 hour course by ensuring better quality of facilities and providing in-service training for all teachers.

• Repeat the typing study with two changes incorporated: first, construct psychometrically sound tests of typing and, secondly, make the tracer study cover a period of 18 months after the end of the course.


EXERCISE

Select two published articles from an educational research or educational evaluation journal that have the following general features.

• An article that describes the evaluation of a new textbook, teaching method, or curriculum reform by using an experimental approach, which includes 'treatment' and 'control' groups of students.

• An article that describes the use of a sample survey approach in which data are collected for the purposes of 'monitoring' and/or 'evaluating' student educational achievement and the general conditions of schooling.

Use the 'Checklist for Evaluating the Quality of Educational Research' described in this module to examine the quality of the two articles.


References

Brickell, J.L. (1974). Nominated samples from public schools and statistical bias. American Educational Research Journal, Vol. 11, No. 4, pp. 333-341.

Coleman, J.; Hoffer, T.; Kilgore, S. (1987). Public and private schools. Washington: National Center for Educational Statistics.

Kish, L. (1965). Survey sampling. New York: Wiley.

Ross, K.N. (1986). Sample design options for a multi-purpose survey of villages in Indonesia. Assignment report. Jakarta: Office of Educational and Cultural Research and Development, Ministry of Education and Culture.

Ross, K.N. (1987). Sample design. International Journal of Educational Research, Vol. 11, No. 1, pp. 57-75.

Ross, K.N.; Mählck, L. (eds.). (1990). Planning the quality of education. UNESCO: International Institute for Educational Planning. Oxford: Pergamon Press.

Thongchua, V.; Phaholvech, N.; Jiratatprasoot, K. (1982). Adult Education Project – Thailand. Evaluation in Education, Vol. 6, pp. 53-81.

Tuckman, B.W. (1990). A proposal for improving the quality of published research. Educational Researcher, Vol. 19, No. 9, pp. 22-25.


Additional readings

Borg, W.; Gall, M. (1989). Educational research: an introduction (Fifth edition). New York: Longman.

Cook, T.D.; Campbell, D.T. (1979). Quasi-experimentation: design and analysis issues for field settings. Boston: Houghton-Mifflin.

Hedges, L.V.; Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Keeves, J.P. (1988). Educational research, methodology, and measurement: an international handbook. Oxford: Pergamon Press.

Kerlinger, F. (1986). Foundations of behavioral research (Third edition). New York: Holt, Rinehart and Winston.

Millman, J.; Goroin, D.B. (1974). Appraising educational research: a case study approach. Englewood Cliffs, NJ: Prentice-Hall.

Walberg, H.J.; Haertel, G.D. (1990). The international encyclopedia of educational evaluation. Oxford: Pergamon Press.

Wolf, R.M. (1990). Evaluation in education (Third edition). New York: Praeger.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana, Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

5
Module

Graeme Withers

Item writing for tests


and examinations

UNESCO International Institute for Educational Planning


Module 5 Item writing for tests and examinations

Content

1. Item writing – art or science? 1

2. Test specifications or blueprints 5

3. Developing the detailed matrix 8

4. Setting parameters for layout and instructions 16

5. Item types or formats 18

6. Selecting item types for test purposes 29

7. Item writing as a creative act 34

8. Panelling or moderating drafted items 46

9. Stage One – editing or vetting 50

10. Advance preparation for final formatting 56

11. Field or trial testing 59


12. Item analysis 61

13. Stage Two – editing for publication 64

14. But is it a good test? 66

15. Training scoring teams 76

16. Further reading 81

17. Exercises 83

18. Glossary of terms 84

1 Item writing – art or science?

Within the field of test development, the tasks and/or questions that are used to construct tests and examinations are referred to as 'items', and the range of techniques involved in preparing those items is collectively referred to as 'item writing'.

Is item writing an art or a science? The best item development techniques combine elements of both these intellectual activities. On the one hand, there is a fair amount of experimental method, which we might recognize as scientific, incorporated within the whole set of procedures for developing a good item, or the sets of such things we call 'tests'. However, as this document will make clear, writing a good item is also a highly creative act. By the end of the process something new, powerful, and useful has emerged: a test instrument which has used words, symbols or other materials from a curriculum or a syllabus in a new way, often to serve a variety of educational purposes. In doing so, the item developer needs imagination and ingenuity as well as knowledge: form, structure and balance become important, as they are to a sculptor or a musician.

Why are these building-blocks of tests called 'items', anyway? Is it merely educational jargon? Why not call them 'questions'? The choice of the word 'item', in preference to 'question', draws attention to two matters. One is that items are often not in question format: the test-taker is required to perform a specific task, or reveal specific knowledge, which is implied in the words given on the test-paper, rather than explicitly offered as a direct question.


The other matter concerns the independence of items in a test. Like the items on a shopping list, they are discrete, or they should be. If you can't get one right, that should not stop you from having a fair chance of obtaining success on all of the others.

The term ‘item writing’, used in the title of this document, draws
attention to this essential independence – the separate skills,
abilities or pieces of knowledge which make up human learning are
considered individually in the test. This is the prime focus. However
the discussion (and the test development process) begins and ends
with consideration of a second focus. It considers what happens
when these items achieve additional significance or importance
by having been grouped or combined with other items to form a
test instrument. We must continually remember that our building
blocks are part of a larger whole.

Several times in the paragraphs above, the word 'development' has been used. This is not mere jargon either. It draws attention to the fact that items do not spring to life ready-made in an item writer's brain. A thorough developmental process, often in a defined and specialised sequence, occurs. Like the student knowledge to be tested, it occurs gradually, sometimes with false starts, and often with much wastage on the way, as the test writers clarify their initial ideas, add to them, try them out, and finally decide what will serve their purposes best. Note also the use of the plural 'writers': item writing is best done using a team approach at various stages of the exercise.

The purposes, or objectives, of the final test are crucial to the process too. This is where the activity of item writing really starts. What exactly do we want or need to find out about student knowledge? Textbooks on educational assessment offer long lists of possible purposes for testing: they cover mastery of processes or knowledge; assessing more general achievement within an educational domain; aptitude for further study; diagnosis of specific learning difficulties, and so on. However, the real objectives for any test come directly from considering two things in conjunction. The first is the actual educational context in which the results of the testing will be used. The second is the knowledge and understanding the test-taker will be expected (or able) to bring into the test-room. The objectives which are determined will therefore be local and specific, and determining what they are is the starting point for determining what the test and its items ought to look like.

Much of the material in this paper concerns one particular branch of item writing which is perhaps the most difficult to master: the development of multiple-choice items and tests. However, this emphasis should not obscure the fact that all items, from simple short-answer formats to extended response essays, need the same careful and methodical processes during their development if the final tests are to be reliable and valid measures of learning outcomes. The suggestions made as to process can (and should) be applied whether the user is a teacher in a school preparing a semester test or an administrator setting up a large national assessment program.

Where does basic test-writing expertise come from? Documents such as the present one can provide a survey of some of the intellectual processes required of a test writer, and the organisational structures which will assist the work. However, the key knowledge a test writer needs derives directly from professional experience as a teacher or educator: our knowledge of the courses we present and the learning habits and patterns of students as they make their way through those courses. This knowledge builds up gradually over years, and while we might not be able to articulate it very clearly to another person as we begin to write the test, we will use it constantly as we review what it is we want to test and how we might best do it. We will sometimes make into 'testing points' the sorts of problems that we know from classroom or lecture room experience students find in mastering the material. We don't do this from a negative position. A 'golden rule' presented later in this document firmly asserts: 'Avoid tricks and trivia'. We are not out to trap or trick as many of the student test-takers as we can. We do it because we need to be able to confirm that students (or a majority of them) have indeed mastered the material we have taught them.

There is no substitute for this detailed knowledge of the teacher's craft and the learning which goes on in classrooms. No matter how expert test writers might be in the art of test writing itself, they will produce poorer tests than students deserve if they are not equipped in this way. If we do not have this knowledge ourselves, we should involve people who do within the test development program as often and as decisively as possible. While the processes of testing will use many insights we have learned over the years as to the best ways of measuring human capacities, the measurement aspect should not be allowed to predominate. What must predominate are the specifically educational aspects of the testing. Involving practitioner teachers will help ensure this.

2 Test specifications or blueprints

Why specifications? In most professional endeavours, it is economical of people's time, and necessary from the point of view of sound practice, to have a thoroughly planned view of the whole exercise before one starts to cobble together bits and pieces (in this case, the items), hoping they will suffice to meet a professional objective or serve a professional purpose. The term sometimes used is 'blueprint', by analogy with an architect's need to design the whole structure in detail before the builders start work.

What do the specifications or blueprints consist of? What needs to be considered? The objectives mentioned in the previous section provide the foundation statement; once this statement is clarified and defined, the other steps may then follow. Other modules in this series will offer more theory and detail as to why and how specifications should be prepared, but they are particularly relevant to the item-writer's endeavours and so a summary is given here, too.

Figure 1 offers a summary of the whole sequence of such steps in the specification process.

Figure 2 presents an example of an initial specification for a recent piece of UNESCO-sponsored test development. The matrix (step 8) for this specimen appears later, as Figure 3a.


Figure 1

WHAT A TEST SPECIFICATION SHOULD INCLUDE

1. The test title.

2. A statement of the fundamental purpose for the test (e.g. testing prior achievement; developed ability; aptitude for further study).

3. The curriculum (or part thereof), or some statement of the learning experiences of the test-takers, which is to be covered by the instrument.

4. A brief description of the clientele or test population (age; educational level or background; assumed knowledge or skill level; any varieties or special groups within this commonality).

5. The range of appropriate assessment types to be used (both in terms of the formats within the proposed test and also any other assessment practices which might run alongside the test itself, such as oral assessments, interviews, practical work, etc.).

6. The intended uses that will eventually be made of the test scores (these need to be prepared in discussions with eventual users of what the test reveals).

7. The time and other relevant conditions available for testing.

8. A detailed matrix which shows how the test will be developed.


Figure 2

A SPECIMEN TEST SPECIFICATION

1. THE PACIFIC ISLANDS LITERACY LEVELS

2. A study of the achieved levels of literacy development by primary school students.

3. The test will encompass writing and reading comprehension skills in English and relevant vernacular languages, together with basic numeracy.

4. The test is intended for students in Class 4 in ten countries of the Pacific region. A national sample based on school types and geographical regions will be tested in each country. (It should be noted that the age of such students will vary from country to country, as will the amount of English and vernacular instruction students have received.)

5. The test will elicit samples of writing in each language, based on one task statement per language, together with short-answer comprehension responses based on two short story-passages, again one per language. Twenty-five numeracy items, testing basic computation skills only, will be given to all test-takers.

6. The student scores will be published in five levels for literacy and four for numeracy. Criterion descriptors of the levels will be published. School results will be aggregated. Administrators and policy-makers may then compare English and vernacular performance in each country, and determine overall levels of performance according to gender, geographical and other variables.

7. The total test will be administered on school sites by supervisors external to the school, and will take 45 minutes. Papers will be hand-scored by external assessors.

8. [see Figures 3-6]


3 Developing the detailed matrix
Once the early steps of the sequence of specification have been
taken, the test-writer is in a position to be more specific about what
the test will actually look like. It is probably easiest to do this in the
form of a matrix, with cells to be filled in progressively as work on
the test proceeds. The size and complexity of these matrices can
vary enormously – for a school end-of-semester examination in a
particular course or subject they might get very complex indeed.
Many different learning areas will need to be covered and many
different item-types used.

Who designs them? In a school, the teacher who is setting the test is
the designer, with help and a critical perspective being given where
appropriate by a senior teacher or subject leader. If the test is to be
given to more than one class taken by a number of teachers, each
teacher should participate in the design until agreement is reached
that the test would be fair or valid for each class.

At a systems level, such a matrix is designed by a panel or committee consisting of policy experts, curriculum experts and those who will eventually develop the test.

The basic matrix will look something like Figure 3.

Figure 3

Broad areas or objectives

Detailed
content

If we now 'translate' Figure 3 to represent the objectives of the UNESCO study used as an example in Figure 2, the matrix would look like Figure 3a.

Figure 3a

                      English     Vernacular     Numeracy
  Writing skills
  Reading ability
  Computation

The next addition to the matrix is the score or value weighting. In the specification which led to Figure 3a, a separate score or level rating was to be given for each of the three broad areas of English, the vernacular and numeracy. Hence the matrix becomes Figure 4, which shows a total of 100% in each column:


Figure 4

                      English     Vernacular     Numeracy
  Writing skills      50%         50%            nil
  Reading ability     50%         50%            nil
  Computation         nil         nil            100%

The word 'nil' in two of the Computation cells indicates that no reading or writing of words in any language was to be involved in the computational process; everything had to be done using numbers or symbols. In a more sophisticated test for a higher grade level, this decision may well have been changed.

The time weighting for the test now becomes easier to define. The specification for the whole test (Figure 2) shows that 45 minutes is available. Figure 5 shows the matrix with appropriate time values inserted.

Figure 5

                      English          Vernacular       Numeracy
  Writing skills      8 mins, 50%      8 mins, 50%      nil
  Reading ability     8 mins, 50%      8 mins, 50%      nil
  Computation         nil              nil              13 mins, 100%


One more element remains to be added to the matrix used in this example: the formats which have been chosen for the test items themselves.

In the UNESCO study being used as an example in this section, only a small amount of time was available for each of the cells of the matrix. Hence the solution to the format problem could not involve vast amounts of reading or writing by the test-takers, or large numbers of computational exercises to test their basic numeracy. For policy reasons, equal amounts of time had to be given to testing in English and testing in the student's vernacular language, and this suggested the selection of formats which were similarly parallel.

The time allocations were suggestive. Thirteen minutes for computations suggested 25 exercises at about two a minute. Eight minutes for reading comprehension suggested 8 short-answer questions (one a minute), and the free writing exercise suggested a sentence every two minutes. So long as the test-takers were given a few minutes to read through the paper before the test actually began, the stimulus material for the reading and writing exercises could be fully informative but not particularly extensive. More reading comprehension questions could perhaps have been asked if a multiple-choice format had been chosen, but this format was decided against: additional information about a student's writing skills would be elicited if they prepared their own sentences as responses to the comprehension questions.

Accordingly, the fully-developed version of the matrix looked like Figure 6.


Figure 6

                      English             Vernacular          Numeracy
  Writing skills      8 mins, 50%         8 mins, 50%         nil
                      5 sentences,        5 sentences,
                      free response       free response

  Reading ability     8 mins, 50%         8 mins, 50%         nil
                      8 questions,        8 questions,
                      short answer        short answer

  Computation         nil                 nil                 13 mins, 100%
                                                              25 questions,
                                                              fill-in-blank
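Blueprints of this kind are easy to hold in a small data structure so that the cell allocations can be checked mechanically against the specification. The sketch below encodes Figure 6 (the field names are ours) and verifies that the times add up to the 45 minutes set in step 7 of the specification:

    # Each cell: (area, skill, minutes, weight within the area, items, format)
    blueprint = [
        ("English",    "Writing skills",  8,  0.50, "5 sentences",  "free response"),
        ("English",    "Reading ability", 8,  0.50, "8 questions",  "short answer"),
        ("Vernacular", "Writing skills",  8,  0.50, "5 sentences",  "free response"),
        ("Vernacular", "Reading ability", 8,  0.50, "8 questions",  "short answer"),
        ("Numeracy",   "Computation",     13, 1.00, "25 questions", "fill-in-blank"),
    ]

    total_minutes = sum(cell[2] for cell in blueprint)
    assert total_minutes == 45, "cell allocations do not match the specified testing time"
    print("Total testing time:", total_minutes, "minutes")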

What the literacy test designers in the worked example in this section were doing was trying to achieve an economical and elegant solution to their particular problem of test design, as it had been presented in the early steps of their specification. Had they started at 'the other end' of the process, and tried writing items before they had a view of the whole, much of their time and effort may well have been wasted.

As the test development proceeds, many other aspects of test design will come into play, which Figures 3-6 merely hint at but which are detailed later in this document. One which might be mentioned immediately is the need for variety. Test-taking can be a desperately dull experience for the candidate: item writers should aim for some degree of variety in both the stimulus material and the formats for test response. Even in the 45 minute test used as an example above, there was considerable variety: two writing tasks, two comprehension passages to read (with a different response mode from the writing exercise), and a totally different format for the quantitative exercise.

Even now, the specification procedure is not quite complete. For example, if we take the bottom right-hand cell from Figure 6, dealing with computation, we might be a little more specific about what is to go on when students complete this part of the test.

Decades ago, Benjamin Bloom and others constructed what was called The Taxonomy of Educational Objectives. This list divided educational experiences into two main domains, cognitive and affective. Within the cognitive area, he pointed to matters such as acquiring factual knowledge, being able to comprehend, and developing the ability to analyse, to synthesise and to prepare evaluations as being a useful (and important) way to structure what we teach and hence what we test. We can use these elements to structure our specification too, to ensure that we obtain a good coverage of what went on (or should have gone on) during the learning process.

Hence our bottom right-hand cell might be expanded to look like Figure 6a. Remember that in this example we are dealing with very young children, so the topics will not be very sophisticated, and the objectives also will be the simpler ones in Bloom's list. We will not be testing the higher-order ones he mentions, such as Analysis, Synthesis or Evaluation. The totals could be expressed as the number of items, or the number of marks given for that topic or objective within the total sub-test, or both, as we did in Figure 6 itself, when we were looking at the full test.

Figures 6 and 6a also raise two other important issues with regard to test specification. The first is: 'Just how long should a test be?' There are, of course, no hard and fast rules, but common practice suggests that 45 minutes is a maximum for any sort of formal testing in middle primary school. This maximum might rise to an hour for upper primary and junior secondary, two hours for mid-secondary, and three hours for the most senior school students. The test-writer might comfortably achieve a satisfactory coverage of the topics and objectives to be tested in less time, and should aim to do so where possible. Two tests a few days apart will work better than one large, over-long one.

The second issue relates to the fact that it is very easy to write
items to test simple knowledge, and too often that is all that test-
writers do. They forget application, analysis, synthesis and so on.
Putting these categories clearly into a test specification reminds us
that they are there to be tested, and they need to be tested. They
may also require different mark weightings: one mark for a simple
factual recall item is fine, but an application or detailed analysis of
some learning may deserve more, as in Figure 6b. The number of
items per cell will depend on the available time, and your estimate
of a good balance to cover all the learning to be tested. Remember
too the test writer’s responsibilities to create a good impression on
teachers, and have a positive effect on learning. If the instrument
does no more than test factual recall, often that is all that will be
taught – teachers are great ones for scanning past papers to see
what is expected, in order to give their students the best chance. If
they find nothing but lower-order skills, then learning and teaching
in the whole education system may become the poorer for it.

A final review of the specifications (including the finished matrix) then needs to take place. The basic question during this review relates to the decisions which have been taken from the point of view of coverage of the curriculum. The design panel needs a positive answer to the following: "How adequate is the proposed design as a sample of the areas of learning or knowledge under review?" Even now, the item-writer might not be quite ready to start work on the actual items!


Figure 6a

  Classroom topics       Objective or behaviour
  or content             Knowledge    Comprehension    Application    TOTALS

  1. Addition
  2. Subtraction
  3. Multiplication
  4. Division
  TOTALS

Figure 6b

  Content              Knowledge       Comprehension    Application     TOTALS

  1. Addition          2 items         2 items          2 items         6 items
                       @ 1 mark        @ 1 mark         @ 2 marks       8 marks

  2. Subtraction       2 items         2 items          2 items         6 items
                       @ 1 mark        @ 1 mark         @ 2 marks       8 marks

  3. Multiplication    2 items         2 items          2 items         6 items
                       @ 1 mark        @ 1 mark         @ 2 marks       8 marks

  4. Division          3 items         2 items          2 items         7 items
                       @ 1 mark        @ 1 mark         @ 2 marks       9 marks

  TOTALS               9 items         8 items          8 items         25 items
                                                                        33 marks
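The totals in Figure 6b can also be recomputed from the individual cells, which is a useful safeguard when a specification is revised. A minimal sketch (the data structure is ours):

    # Each cell: (number of items, marks per item).
    cells = {
        "Addition":       {"Knowledge": (2, 1), "Comprehension": (2, 1), "Application": (2, 2)},
        "Subtraction":    {"Knowledge": (2, 1), "Comprehension": (2, 1), "Application": (2, 2)},
        "Multiplication": {"Knowledge": (2, 1), "Comprehension": (2, 1), "Application": (2, 2)},
        "Division":       {"Knowledge": (3, 1), "Comprehension": (2, 1), "Application": (2, 2)},
    }

    for topic, row in cells.items():
        items = sum(n for n, _ in row.values())
        marks = sum(n * m for n, m in row.values())
        print(f"{topic}: {items} items, {marks} marks")

    total_items = sum(n for row in cells.values() for n, _ in row.values())
    total_marks = sum(n * m for row in cells.values() for n, m in row.values())
    print(f"TOTALS: {total_items} items, {total_marks} marks")   # 25 items, 33 marks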


4 Setting parameters for layout and instructions
A start should now be made on developing a view of the layout of
the eventual test paper and sketching, before they get forgotten,
the instructions which the candidates and supervisors will need to
smooth the test process.

Questions of economy may well arise first, if resources are scarce. Here are a few:

• how much paper will be available for printing papers for the total number of candidates to be tested? Or will that not be an issue?

• will this mean a four- or eight-page test paper, or will a longer one be possible?

• will students answer on the test book (which means it is not re-usable) or on a separate answer sheet?

• will the test be hand-scored or machine-scored, by an optical mark-reader (OMR)?

• is colour printing of any illustrations on the paper feasible?

Final answers to these questions will not necessarily be decided at this stage, but preliminary answers certainly should be. Obviously with layout much will depend on the actual items used. However, the broad parameters ought to be ready for the item-writers to have in mind once they begin work.

Economy matters in a different way, too: economy of words and space. When the test writer receives the detailed specification and the layout design, it ought to contain all the detail which will be needed during the item writing process itself. Before this process can start, consideration should be given to including the following within the intended layout:

• an introductory section, where student information will be gathered. In a school, this may be no more than the candidate's name and class group. However, in a larger test program, school name, sex, and various other pieces of personal information may take up valuable space.

• a section of the paper where the criteria for assessment are specified for the students to read before the test begins.

• various procedural hints for the candidates on how they might do their best on the test.

• layout explanations, if the paper is in sections, or if students are expected to spend varying amounts of time on particular parts of the test.

In large test programs, particularly standardised ones, some of these might also need to be converted into instructions for supervisors to speak. This will help to ensure that all candidates (wherever they are sitting the test) have access to the same information. Printing the information is often effective, but having a supervisor say it aloud guarantees that everyone has been given the same chance.

In both test-paper instructions and supervisor's script, aim for the minimum number of words to convey the essential messages.


5 Item types or formats


In preparing Figure 6 above, the literacy item-writing team made
a selection from a wide range of options so far as appropriate
item-types or item-formats were concerned. Later in this section
(particularly in Figures 10 and 11) some view of this range is
specified.

However, first it would be useful to look at what all items might have in common: see Figure 7. This will help to distinguish between the various options when they are presented later.

Figure 8 offers some examples of stimulus material used in tests, while Figure 9 shows how the parts of an item are exemplified in one format in particular: the multiple-choice item.

Figure 7

THE PARTS OF AN ITEM

1. All items use stimulus material of one sort or another.
   • Sometimes it is no more than a sentence or a set of symbols which directs the student what to do.
   • Sometimes it is a passage, or a diagram, or an illustration, relating to a whole set of items, which informs the student.
   • Some items have both directive and informative material to stimulate candidates' thinking.

2. All items have at least one 'right' answer or response, in the sense that it would earn full credit from an assessor if offered by a candidate.
   • Sometimes the choice is 'closed', as in multiple-choice items. The right answers are actual and printed on the paper as an option.
   • Sometimes, as in essay tests, these responses are potential: not realised until someone reads the stimulus and makes the response. (Sometimes, of course, in reality nobody does get full credit; but all items should be written in such a way that somebody might.)
   • Some items are 'open'. They have more than one 'right' answer, such as two essays which each score full marks even though they are different. (Multiple-choice items never do.)

3. All items will have inadequate answers, which might earn partial credit from an assessor, or wrong ones which would earn no credit at all.
   • In most multiple-choice tests, inadequate or wrong answers get no credit at all.
   • In most essay tests, very few answers get no credit at all unless a blank page has been submitted.

ANSWERS OR RESPONSES, RIGHT OR WRONG, ACTUAL OR POTENTIAL, ARE ALWAYS CONSIDERED TO BE A PART OF THE ITEM.


Figure 8

SOME SAMPLES OF STIMULUS MATERIAL

informative
• a passage in a multiple-choice test to which a set of items refers
• a map in a Geography test to which candidates are expected to refer when answering some short-answer questions
• a photograph in an Art test about which candidates are expected to write an essay
• a diagram in a maths test which forms the basis for the solution of a problem

directive
• the leading sentence or 'stem' of a multiple-choice item, such as: "In the passage, who ate the cake?"
• a short-answer item such as: "Use the scale of the map to calculate the distance from the tower to the bridge".
• a specific essay topic, such as: "Compare and contrast the painting by Picasso in the photograph with two other paintings you have studied by the same artist".
• an extended response item such as: "Write a critical review of one novel you have studied this semester".
• a statement of a problem to be solved, such as: "What is the area, in square metres, of the shaded part of the diagram above? Show all your calculations".

STIMULUS MATERIAL IS INTENDED TO STIMULATE MENTAL PROCESSING DURING THE TEST, EITHER GENERALLY OR DIRECTLY.


Figure 9

THE PARTS OF A MULTIPLE-CHOICE ITEM

  INSTRUCTION            Read the following passage and answer the
  to candidates          question which follows.

  INFORMATIVE            Bob, Carol, Ted and Alice had just begun a
  stimulus material      friendly game of poker, but already Ted had much
                         the biggest pile of winnings. Carol had won a
                         small sum, and Alice had lost more than Bob.

  DIRECTIVE stimulus material:

  * the NUMBER           1  At this stage of the game, who had lost the
    and STEM                most money?

  * OPTIONS              A  Bob
    for answering        B  Carol
                         C  Ted
                         D  Alice
                         E  There is not enough information to say.

  KEYED RESPONSE         D  Alice
  or right answer

  DISTRACTORS            A, B, C and E
  or wrong answers
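Where items are stored in an electronic item bank, the parts labelled in Figure 9 map naturally onto a small record. The sketch below is one possible representation (the field names and the scoring rule are ours, not part of the figure):

    from dataclasses import dataclass

    @dataclass
    class MultipleChoiceItem:
        stem: str
        options: dict          # option label -> option text
        key: str               # label of the keyed (correct) response
        stimulus: str = ""     # informative stimulus material, if any

        def distractors(self):
            return [label for label in self.options if label != self.key]

        def score(self, response):
            return 1 if response == self.key else 0

    item = MultipleChoiceItem(
        stem="At this stage of the game, who had lost the most money?",
        options={"A": "Bob", "B": "Carol", "C": "Ted", "D": "Alice",
                 "E": "There is not enough information to say."},
        key="D",
    )
    print(item.distractors())   # ['A', 'B', 'C', 'E']
    print(item.score("D"))      # 1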


Test items can be usefully classified into three main categories:

1. Selected response items


A response is selected by the test-taker either from a given list of
possible choices, or from the stimulus material itself. The choice is
‘closed’: that is to say, only one option will receive credit. The type
includes the following formats:

a. True-false

According to the map, Bombay is the capital of India – true or false?

b. Matching items

Find the word in the passage which means HOLLOW.

Match these words with their antonyms:

INEBRIATED : SOMNOLENT : LUGUBRIOUS

CHEERFUL . . . . . . . . . . . . . . . . . . . . . . . . . . .

SOBER . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

WAKEFUL . . . . . . . . . . . . . . . . . . . . . . . . . . .

CALM . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

It is usual in items such as the above to include more options than the given number of matches required.


c. Classification items

Fill in the blanks:

FRANCE PARIS EUROPE

. . . . . . . . . . . . . . . . . . . . KENYA AFRICA

. . . . . . . . . . . . . . . . . . . BOGOTA . . . . . . . . . . . . . . . .

SRI LANKA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The terms of the classification can either be given (country/capital/continent) or left out, as in this example, to be inferred by the candidate from the samples given for the categories. Simpler, two-column classifications are of course possible.

d. Multiple-choice items
See Figure 9 for an example. It should be noted that multiple-choice
items can come in two styles. One might be called stimulus-related,
and the item in Figure 9 is an example of this kind. The other
might be called ‘stimulus-free’ in that options are printed, but the
candidate is expected to draw on prior knowledge, gained during
classwork, to select the keyed answer.

They can also vary according to structure. The example in Figure 9 has five options from which to choose. Other items may have only four options: these are simpler to write and often easier to answer. Three or six option items should be avoided: the former increase the chances of successful guessing, and the latter are simply too complex to handle in the circumstances of a test. Whichever structure (four or five options) is ultimately chosen, all items in the multiple-choice section or test should conform to that structure.
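The point about guessing can be quantified with simple arithmetic. For a hypothetical 25-item test answered entirely by blind guessing, the expected score depends on the number of options per item:

    n_items = 25   # illustrative test length
    for n_options in (3, 4, 5):
        expected = n_items / n_options
        print(f"{n_options} options per item: expected score from guessing = {expected:.1f} / {n_items}")
    # 3 options: 8.3, 4 options: 6.2, 5 options: 5.0 -- fewer options reward guessing more.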


In all testing using selected response items, it is important to ensure the independence or discreteness of each item. No later item in a series should depend, for understanding or for choosing the correct response, on the test-taker having got a previous item right.

2. Constructed response items


In items of this kind, the candidates construct or prepare their
own responses on the spot, based on knowledge developed during
a course, or by rewriting bits of given stimulus material in a new
way. Many such items are ‘open’, in that equal credit during scoring
might be given to a number of totally different responses. Others,
such as some of the examples below, will have only one right
answer.

a. Short-answer items

Write a good title for the story you have just read.

What did the three cousins do in the story?

Why do you think Tina was angry?

b. Fill-in-the-blank and sentence completion items

16 x 4 = . . . . . . . . or . . . . . . . . . . .

Saigon was re-named Ho Chi-minh City, because . . . . . . . . . . . .

The word . . . . . . . . . . . . . . . accurately describes Jo's behaviour during the argument.


c. Cloze items
One subset of completion items might be especially mentioned.
These are so-called ‘cloze items’, where candidates’ word knowledge
is tested by asking them to insert appropriate words in regularly
spaced gaps in a passage of prose (e.g. where every fifth word
has been deleted, and a space left). The format began in reading
research as a test of readability of prose, but has been adapted for
educational testing purposes. Here is a sample of a pure cloze task:

One subset of completion . . . . . . . . . . . . . . . . . . might be especially


mentioned. . . . . . . . . . . . . . . . . . . . . . . . are so-called ‘cloze items’,
. . . . . . . . . . . . . . . . . . . . . . . . . . candidates’ word knowledge is
. . . . . . . . . . . . . . . . . . by asking them to . . . . . . . . . . . . . . . .
appropriate words in regularly . . . . . . . . . . . . . . . . . . gaps in a
passage . . . . . . . . . . . . . . . . . . . . . . . . .prose (e.g. where every
. . . . . . . . . . . . . . . . . . word has been deleted, . . . . . . . . . . . . . . .
a space left).
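Pure cloze passages are straightforward to generate automatically. The sketch below deletes every fifth word of a source passage and keeps a scoring key; the function name and deletion interval are our own choices:

    import re

    def make_cloze(passage, interval=5, blank="________"):
        # Replace every interval-th word with a blank, keeping trailing punctuation.
        words = passage.split()
        answers = []
        for i in range(interval - 1, len(words), interval):
            word, punct = re.match(r"(.*?)([.,;:!?]*)$", words[i]).groups()
            answers.append(word)          # scoring key
            words[i] = blank + punct
        return " ".join(words), answers

    source = ("One subset of completion items might be especially mentioned. "
              "These are so-called cloze items, where candidates' word knowledge "
              "is tested by asking them to insert appropriate words in regularly "
              "spaced gaps in a passage of prose.")
    cloze_text, key = make_cloze(source)
    print(cloze_text)
    print("Key:", key)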

Some testers use a slightly different form, called 'modified cloze', where the omissions are not regular, but selected to test particularly important words or concepts, or those gaps where only one response is logically right. Here is the same passage in 'modified cloze': even here some alternative insertions are possible.

One subset of . . . . . . . . . . . . . . . . . items might be especially mentioned.


These are so-called ‘ . . . . . . . . . . . . . . . . . items’, where candidates’ .
. . . . . . . . . . . . . . . . knowledge is tested by asking . . . . . . . . . . . . . . .
to insert appropriate . . . . . . . . . . . . . . . . . in regularly spaced gaps
. . . . . . . . . . . . . . . . . a passage of prose (e.g. where . . . . . . . . . . . . . .
. . . fifth word has been deleted, . . . . . . . . . . . . . . . . . a space left).


d. Extended responses (paragraphs and essays)


These tasks can either be specifically tied to informational stimulus
given on the paper, or rely on the candidate’s bringing knowledge
and understanding of a topic into the test-room in order to respond
to the directive on the test-paper.

Read the material presented opposite and prepare a balanced and


detailed critical response to the ideas about education presented by
the various writers and cartoonists. Your piece of writing should be
between 600 and 1000 words.

Write a paragraph which summarises Fairbank’s view of the First


Opium War as expressed in the passage.

Write an essay which traces the impact of the economic theories


and ideas put forward by John Maynard Keynes on the current
economic policies of this country.

Extended responses give the item writer a much greater chance


to emphasise the complexity of knowledge and ideas and, where
appropriate, the sequence of events or logical connections between
ideas. In all these items, later cognitive operations in the minds of
the candidates may depend heavily on right choices having been
made early in the test experience. Hence it is important to stress
(in the instructions) the need for students to undertake planning
and drafting of their work, and if possible allow time for these
somewhere during the test session.

These formats often bring into sharp focus the problem of choice
between different tasks on differing aspects of course work. If it is
at all possible, choice should be avoided and all candidates asked to
perform the same task. Where, because of the extent or complexity
of course learning, choice is unavoidable, care should be taken
to make the various choices as nearly equivalent as possible in
conceptual difficulty and ease of execution. This may not be possible
without field testing of the various options beforehand.

3. Problems for solution


The problems which test-writers set for solution by candidates
actually represent a sub-category of constructed responses, but
seem important and common enough to be discussed separately. In
a sense, setting an essay task for extended response has a problem
element built into it, especially if the candidate is required to read
a large amount of informative stimulus and choose some sort of
personal response to be developed during the test. For example, we
might look again at an example (given earlier under the heading
‘Extended Responses’) with this in mind:

Read the material presented opposite and prepare a balanced and


detailed critical response to the ideas about education presented by
the various writers and cartoonists. Your piece of writing should be
between 600 and 1000 words.

Other problems may not require words as a response, but


quantitative reasoning and expressions for their solution. Or
perhaps it is manipulation of materials that is required, as in a test
conducted in a laboratory or workshop or art studio.

Mathematical problems are probably the most common kind,


from simple primary examples to more complex ones at higher
levels of schooling. As in the first example, they too might involve
manipulation of objects as an aid to their solution.


John’s boss has told him to set up six jars as a display for a jelly
bean promotion. John sets up six jars on a shelf, three full ones then
three empty ones.

“How does it look?” says John, about to leave for lunch.

“Well, I’d like it better if you alternated full and empty jars,” said the
boss.

John is hungry and in a hurry. What’s the least number of jars he
has to move?

Look at the photograph and plan of the room. You will notice none
of the corners is a right angle (90 degrees).

Prepare a measured drawing of a piece of furniture (cupboard,


table, etc.) which would be appropriate to the room, and would fit
the angle of one of the corners exactly.

6 Selecting item types for test purposes

As stated earlier, the basis for any selection from within the wide
range of possible formats broadly categorised in the previous
section lies in the learning to be covered. The test writer must
achieve the best possible match between this learning and the
potential instrument, as revealed in the specification.

All formats have their advantages and disadvantages. A summary


of these is given in Figures 10 and 11. It is, however, merely
a summary. Multiple-choice items, for example, have a large
number of other problems associated with their construction. For
example, plausible distractors are hard to write; the items present
wrong information (the distractors) as if it were right, perhaps
reinforcing a wrong view in the student’s mind which gets taken
away from the test room; the format ignores the students whose
developmental stage leads them to be half-right, half-way to full
understanding – students get no credit for this. Moreover, if a
curriculum suggests that we should be testing student performance
with regard to aesthetic and affective criteria, or the making of
critical or value judgements, multiple-choice item writers will find
it hard to represent these satisfactorily in their items. The format
also disallows a plurality of answers, to represent the complexity of
much student learning, unless the items themselves become very
strained.


Above all perhaps is the fact that multiple-choice items are


‘closed’. That is, they do not allow for the diversity of human
opinion which legitimately informs much of our appreciation,
apprehension and interpretation of the things we learn. A right
answer, one right answer only, is chosen for the candidate.
Sometimes this doesn’t matter, when there is only one possible
right answer, in the objective sense. But at other times the ‘right
answer’ might be a quite subjective matter of the test writer’s
opinion, and the candidate might have a different – and legitimate
– view. For instance, in making the critical judgements mentioned
above, human beings often differ in what they value or recognize
as being important. This is normal and legitimate – we are
entitled to our views if they are based on fact and reasonable
interpretations of what we read or see. Multiple-choice items
cannot handle this variety.

Many of these problems are solved by the selection of an ‘open’


extended response format, such as an essay. But these formats too
bring some of the problems associated with subjectivity, though
not when the student is preparing the response and can argue
for his or her own point of view. The problems come later, when
someone has to sit down at the task of assessing the value of the
work done. Any assessor brings to this task a view of what are
appropriate personal opinions and responses to the ‘topic’ of
the student’s writing, sometimes amounting to bias or outright
prejudice.

Before we take fright at all these problems, and allow our testing
to degenerate into mere assessment of general knowledge or the
most simplistic kinds of understanding and cognitive processing,
we should remember that testing of abilities in analysis, synthesis
and higher-order evaluations is possible, and should be done (as
an earlier section pointed out). It is just that one format can’t do
them all. We might never completely solve the guessing problem
(‘did the student really know this?’) but a range of formats in
one test will allow us to get a detailed picture of a wide range
of knowledge, understanding and ability – even creativity – as
learning outcomes. It’s not always easy to achieve this picture, but
it can be done, and it is often worth the attempt.

This is especially true of examinations. In these circumstances, there is a wide
range of purposes for the testing, some of which will require
different formats. Not only will variety help the students to make
their way more easily through the experience, but (as Figures 10
and 11 indicate) certain formats lend themselves more readily
to certain purposes than others might. For example, testing of
a range of detailed factual information is readily done using
matching or classification items (Figure 10), whereas testing the
candidates’ capacity to perform integrated, higher-order skills
such as synthesising knowledge or attempting evaluations might
mean setting a complex problem for solution, or need an extended
response to do those skills justice (Figure 11).


Figure 10

CRITERIA FOR CHOICE OF SELECTED RESPONSE ITEM FORMATS

A True-false
Advantages:
• easy to write
• easy to mark
• easy to sample variety within a course
Disadvantages:
• guessing factor very high (50%)
• limited to unequivocal choices
• cannot test higher order skills

B Matching items
Advantages:
• useful for testing relationships
• useful for testing factual information
• easy to construct a large number
Disadvantages:
• the cluster approach destroys item independence
• difficult to word instructions

C Classification items
Advantages:
• relatively easy to construct
• easy to mark
• useful for testing factual information
• useful for testing simple relationships
Disadvantages:
• the cluster approach destroys item independence to some degree
• limited to factual sorting
• limited to unequivocal facts

D Multiple-choice items
Advantages:
• reduces the guessing factor
• versatile – can be used to measure a wide range of cognitive processes
• reduces problem of subjective scoring
• analysis of results can provide much diagnostic information
• easy to mark
Disadvantages:
• little, if any, stimulus given to creative thought
• expensive and time-consuming to construct
• difficult to measure organization and presentation of ideas
• plausible distractors hard to write
• presents wrong information as if it was right


Figure 11

CRITERIA FOR CHOICE OF CONSTRUCTED RESPONSE ITEM FORMATS

A Short-answer items
Advantages:
• excellent for testing factual knowledge
• successful guessing is reduced
• easy to write
• easy to mark
Disadvantages:
• unsuitable for measuring complex learning
• easy to respond in inappropriate ways

B Fill-in-the-blank, sentence completion
Advantages:
• easy to test a range of factual knowledge
• guessing factor is reduced
• easy to write
• easy to mark
Disadvantages:
• hard to measure higher-order skills
• easy to respond in inappropriate ways

C Cloze, modified cloze
Advantages:
• easy to construct
• a good measure of word knowledge
• tests passage understanding
Disadvantages:
• some ambiguities hard to decide on
• many opportunities for choice have little value
• often little more than guesswork

D Extended responses
Advantages:
• a means of assessing higher-order skills
• relatively easy to construct
• stimulate creative and critical thought as well as learned responses
• can measure learning in affective domain
Disadvantages:
• sometimes lead to inadequate sampling of learning done
• time-consuming and expensive to mark
• difficult to achieve inter-marker reliability

E Problem solutions
Advantages:
• a means of assessing higher-order skills
• can measure complex learning outcomes
• relatively easy to construct
Disadvantages:
• can be time-consuming to mark
• sometimes difficult to establish stable assessment criteria

7 Item writing as a creative act


The introduction to this document suggested that item-writing was
a truly creative act. We have set our specifications, considered the
curriculum and its learning outcomes, reviewed the options for
format-types, and made a selection. We are now in a position to
start the creative process.

From the start, three sets of cardinal distinctions might be kept


quite clearly in mind, as well as one golden rule. The distinctions
– the first and third are particularly important for multiple-choice
test writing – are between:

• simple comprehension and higher-level interpretations;

• ‘open’ items and ‘closed’ items;

• factual knowledge and inferential reasoning.

These simple distinctions will help improve the test quality overall.
They will help us to remember to develop items which do not
merely test factual knowledge and simple understandings but tap
into higher order skills which students develop as they learn.

Some easy, knowledge-based, or simple comprehension items have


a place in any test. They help ease students into the test-taking. But
we should recognize that they sometimes don’t require much more
than simple cognitive processing – searching for facts which are
fairly obviously in the stimulus material. The answering process
is ‘closed’ – there is only one right answer. If the items never get
beyond this level an important opportunity has been missed: there
is more to learning than this.

We should also expect our test takers to be able to read ‘between


the lines’ as it were – for example, to be able to:

• draw inferences which rely on implied or logical connections


between events or facts;

• understand implications or key underlying assumptions which


might not be stated at all in the material but which nevertheless
are important things to learn or understand about it;

• analyse, summarise and develop a personal and evaluative


critical perspective.

All these might come under the heading of ‘interpretation’. The


stimulus material is printed. The candidate develops his or her own
interpretation of it, and uses this in answering these more ‘open’
items. There will be no one right answer – many responses from
quite differing personal perspectives will deserve, and rightly be
awarded, full credit.

The golden rule, for all test-writers, is:

AVOID TRICKS AND TRIVIA

As item-writers we are not out to trick the candidates in any way.


Our professionalism as educators should mean that we really want
to find out what they know, rather than what they don’t. ‘Trick’
questions are out. Of course some questions may well be so difficult
for a few students that they might regard them as trick questions,
but that is not what is meant. Our intention should always be that
any student who has engaged in the course being tested, or who has
developed the relevant skills, has a reasonable chance of performing
well on the item. By all means choose material or items which focus
on substantial misunderstandings or mistakes which you know
students might have or make, or which reflect on what you know
to be a common and important difficulty for students doing the
course. But make sure that it is common and important, not merely
a devious or slippery misinterpretation you have invented to trap
the unwary.

Trivial questions likewise should be avoided. In a test, we


have little enough time to test the important facts or concepts
or understandings of a course, without wasting this precious
commodity on inessential or peripheral or simply irrelevant matters.
This applies to distractors in multiple-choice items as well – don’t
waste everyone’s time in preparing a distractor which virtually
every test-taker is going to recognize as trivial, and hence to be
ignored.

Beyond these there are a number of other central matters which


an item writer needs to keep constantly in mind. These are
summarised in Figure 12.

The last two points in Figure 12 will assist in avoiding ‘test writer
bias’. A single view of the curriculum to be tested might be narrow,
or even faulty, even if the test constructor is experienced and an
acknowledged expert. In school-based test development it is often
impossible to achieve this variety, where only one person really
knows the course as taught. Elsewhere it often is possible, and is a
good principle.

Assuming that all that has been done, the team has been set up,
curriculum has been clarified and formats decided upon by the
group, the next stage of the work is individual – drafting and
revising the item material. As an example, we might look at an
individual working on development of items in one particular
format – perhaps the most difficult of all – multiple-choice.


Figure 12

PRINCIPLES AND PROCEDURES DURING ITEM DEVELOPMENT

• Use the specification, don’t ignore it – have it handy throughout
  the development.

• Allow (and take) as long as possible for the whole process.

• Once the format is chosen, select the material and give yourself
  time to ponder it: familiarise yourself with its main points and
  other minor ones which might form the basis for good items
  and distractors.

• Set up the scoring procedures and develop any assessment
  criteria simultaneously with the test development.

• Develop the items using co-professional input – other teachers
  who might be involved in the course, or who will use the results
  – during material selection, item review and editing.

• If possible arrange for more than one person to work on
  actually developing items, and use a range of items from these
  different sources in the final test product.


Figure 13

THE PROCESS OF MULTIPLE-CHOICE ITEM DEVELOPMENT

1. First, search for informative stimulus material related to the
course or the objectives of the test. In rare cases, the material
might have to be written or otherwise prepared by the item-
writers themselves. However, more usually it exists already
and can be chosen from books, periodicals and other sources.
Keep records of where you found the materials.
2. Look for a variety of stimulus material on a single topic. The
range of types of stimulus includes written, pictorial, graphical
and tabular material. Keep your candidate audience in mind.
3. A decision is made about whether one stand-alone piece of
stimulus material will do to test a curriculum element, or
whether several pieces in conjunction would provide a better
test, allowing possibilities of comparative items.
4. Extra, relevant pieces should be selected, in case the first ones
prove to be less impressive than they first appeared.
5. Read, and re-read (perhaps several times over) the material,
and make one-line notes about possible testing-points
discovered during these readings.
6. If no testing-points appear in some sections of the material,
then look at the possibility of cutting the material to remove
the extraneous section(s). But don’t hack it to pieces –
meaning tends to get lost!
7. When you feel that the material and its possibilities have been
fully comprehended or digested, the one-line notes from Step
5 are sketched as possible ‘stems’ for individual items.



8. Write these sketches at the top of individual sheets of paper
or cards (one per item), and just under them some preliminary
sketches for possible distractors might be made as they occur
to you. Do these in pencil, not ball-point.
9. It is not necessary to ‘finish’ one item (complete with stem
and distractors) before going on to the next. Ideas will emerge
at different times (sometimes quite inconvenient ones),
especially if enough time has been allowed and the process
isn’t rushed.
10. Once an item has been sketched, underneath it write a draft in
the correct format. If you’ve used separate pieces of paper the
items can be drafted in any order (or even left incomplete for
the time being, if a third or fourth distractor just won’t come
to mind). Do all this in pencil, too.
11. Assemble the various pieces of paper (or cards) into what
seems a reasonable order in terms of the difficulty of the
items – the general rule is “easiest to hardest”, but there are
often exceptions to this. Place the stimulus material on top of
the pile and have the sequence typed out.

Figure 13 summarises an enormously complex process, and needs


a little elaboration here and there. In point 1 the importance
of keeping bibliographical records of where you found your
material might be stressed – you will need this later when seeking
permissions to re-publish it, and such details easily get lost.

At point 2 you will need to consider the appeal of the material for
various groups of test-takers: will it appeal to both girls and boys?
Urban and rural dwellers? Is there some significant sub-group who
will fail to understand it completely because of language or dialect
problems? (Would it help to asterisk especially difficult words and
define them at the foot of the passage?) Or are there particular
emotional overtones (such as references to death or disasters) that
might disadvantage particularly susceptible individuals?

At point 5, you might ask yourself: ‘Why does this piece appeal to
me? What would I hope my candidate-readers would learn from
the experience of reading this piece or looking at this picture?’ In
these ways you might develop for yourself a fuller understanding
of what makes the stimulus tick, as it were. Simultaneously, it will
ensure that you see the central importance of the piece, as well as
exploring some of the details that later will come in handy for item
or distractor development.

At the stage of point 7, you will need to remember two things:
the candidate’s background knowledge is not expected to be an
alternative to, or substitute for, using the stimulus material. The
intention is that they cannot do the item without having read the
stimulus. Similarly, they should be able to understand the problem
from the stem directly, and not have to search through the options
before they understand what the question is actually about.

When you are sketching at point 8, if you are using a longish


passage of text, you will probably find yourself including line
references to help you ‘key’ your draft options back to the passage.
Give some consideration as to whether you might do this in the final
test also. Too often, test-takers find themselves involved in a search-
and-destroy mission trying to find where something mentioned in
an item is also mentioned in the passage. This creates an artificial
difficulty you can avoid for them. Save them the valuable time they
would otherwise lose: it doesn’t make it any easier to decide on the
right answer!


During the final assembly of your draft items (point 11), you will
also have an opportunity to review the content and quality of the
items you have written. You will be looking for a range of difficulty.
You will also be able to see the spread of the types of questions
you have prepared: are there enough global questions? Too many
particular questions focusing on no more than vocabulary? Is
inferential reasoning well represented? If the mood of a passage
is important, is it in an item in your pile? All these considerations
will help you develop an effective sequence for the material as you
present it during the panel meeting which is the next stage.

Here is an example of the process outlined in Figure 13. It should


be noted that the assumption is always that one passage yields a set
of items, not just one. Stand-alone multiple-choice items are hardly
an economic proposition in terms of space and time, and require
too many mental gymnastics on the part of the candidates who face
them. They also sometimes give the impression of learning as being
a matter of unconnected bits and pieces – hardly an impression we
want our students to carry away with them from the test room.

stimulus material – Steps 1-3

Passage

Multiple-choice item writing is a difficult art at the best of times


but how often we make it more difficult than it need be! We start
off thinking it will be easy and go at it very quickly before all the
matters we need to keep in mind have been considered. Rushing
the process won’t help – items do not spring fully-formed on to the
page. We need to ponder, review, get to know what it is we want to
ask and what the material might let us ask. Quickly prepared items
are often a waste of time – we find that out when we show them to
another person, and that person merely says: ‘trivial’ or ‘faulty’.


one-line note – Step 5

main point

sketch for a stem – Step 7

What is the main point of the passage?

sketches for distractors – Step 8

• the difficulty of doing it


• time
• logical order
• plausible distractors
• getting a good critic
• avoiding trivia

draft of stem and revised distractors – Step 10

1. What is the main point of the passage?

A the time it takes


B the logical order of item development
C plausible distractors hard to find
D getting a good critic to look at the item
E finding enough distractors
F avoiding trivial questions
G the difficulty of item writing


It would also be possible to construct a negative item using much


the same material, to emphasise something of the complexity and
variousness of the views expressed in the passage – but only if we
saw that to be a point worth testing:

draft of negative stem and revised distractors – Step 10

1. Which of the following aspects of item-writing


is NOT considered in the passage?

A the time it takes


B the logical order of item development
C plausible distractors hard to find
D getting a good critic to look at the item
E finding enough distractors
F avoiding trivial questions
G the difficulty of item writing

Opinion varies on the value of such negative approaches to


stimulus material. They are certainly harder for students than
straightforward approaches, but just occasionally they are
unavoidable. They should never be combined with another negative in
any of the options, which would create a double negative.

As an item-writer moves through the process and constructs a


number of items on the chosen material, a number of subsidiary
issues emerge as needing consideration. Some in the following
summary have been mentioned before, some are new:

• Is the passage too long for the items written?

• Are there words in the passage which are too hard for the
candidates?


• Are the stems clearly worded?

• Are the items independent or will the answer to number 4 be


dependent on getting some other item right?

• Are the items independent or will the distractors in item 10 give


away the answer to number 2?

• Is this distractor ambiguous?

• Do all the distractors share the same syntax and grammar?

• How plausible are these distractors?

• Do all the distractors in a given item refer directly to the stem


or are some wild?

This might also be the place to consider just how many options
(four or five) in a multiple-choice item one is going to have in the
final test. Professional opinion amongst item-writers varies about
this issue. Five options does cut down the correct-guessing rate, but
the number does make life harder for both the candidate and the
item writer. The feeling of the present writer is that four options
will do – a larger number of elegant, straightforward items might
appear on our test-papers, without the fifth, strained or irrelevant
distractor that is all we can think of. However, by all means present
five (or six, even) possibilities to the panel who will review your
drafts.

Two other hints are worth mentioning: one is to read through


the items aloud in the voice of a ‘student’ who might do the test,
thinking all the time of how they might respond to the material –
this helps one to distance oneself from the material over which one
has sweated long and hard. The second hint is to allow sufficient
time to leave the sequence of items alone for a few days, and then
come back to it afresh.


But to solve these (and many other) problems ultimately we need


help – fresh minds and critical eyes, just like those of the students
we will be testing. So begins the next stage of the development
process.


8 Panelling or moderating drafted items

Critical review of items is essential. And it is essential before the
items and the item-writer get locked into a situation where they
have to make do with what they’ve got. Two words are sometimes
used for this part of the process – ‘panelling’ draws attention to the
need for a panel of reviewers, or more than one critic; ‘moderating’
draws attention to the actual process of the meeting – moderating
the more extreme efforts of the writer by exposing the work to other
views.

Though it may seem wasteful, panels should always be supplied


(as far as is possible) with more than is eventually needed: more
items and often more distractors for each item. However the items
themselves should be in as good a condition as possible before they
are duplicated and sent to panel members:

• worked over as much as time permits, not just raw drafts

• typed, not hand-written;

• numbered and ordered, not random;

• complete with stimulus;

• any line references made have been indicated clearly in passage


and items;

• laid out in standard form.

The standard for multiple-choice items is as in:

Figure 14

STANDARD LAYOUT FOR A MULTIPLE-CHOICE ITEM

(The item number is printed in bold and the stem is indented; the option
letters are in bold capitals, and the options are indented further.)

23 How many people are in the group?

   A less than ten
   B between ten and twenty
   C more than twenty
   D The passage does not say.

• Any option which is a full sentence should have a full stop or
  period, as in D above.

• Option letters should not have full-stops or brackets.

• If a negative word is used in the stem, such as NOT or
  EXCEPT, it should be printed in bold capitals.

No indication should be given in advance to the panellists of the


author’s suggested right answer. Their task is to work through the
items as if they were doing a test, and come up with their view
of what is ‘right’. However the keys must be marked on the item-
writer’s copy to be used during the panel meeting – panellists get
justifiably restless if in the heat of the moment the item-writer can’t
remember what was to be keyed.


The number and choice of panellists will depend on circumstances


and resources – panellists wherever possible should be paid! The
need for confidentiality must be stressed. There will be a panel
chairperson (not an item-writer whose work is up for review – there
is too much else to do), who might also act as a co-ordinator for
distribution of the materials well in advance. Other panellists are
chosen for their expertness, the variety of viewpoints they can
contribute, and their number should include some representation
from those who will eventually use the results. Gender balance
should be maintained where possible. Each item writer whose work
is being reviewed acts as secretary to the panel while those items
are being discussed.

Running a panel is not as easy as it might seem. Confidentiality


must be ensured before and during the sessions – some programs
ask panellists to sign legal declarations that they have and will
observe this need. In the sessions themselves, time very rapidly
runs away as people present and defend their points of view, or
argue about detail. Chairpersons have to be very firm about closing
off trivial discussions!

The sessions often also give rise to quite sharp inter-personal


conflicts, especially where the item-writer is new to the business.
It can be very difficult for such a person not to feel personally
attacked – his or her intellectual capacity, or even personality! It
is natural for the item-writer to be intent on acting as defendant
of as many of those hard-won items as possible. But this should
be tempered by the knowledge that things will have gone wrong
– even the most experienced item-writers mis-read passages or leave
out essentials. When such matters do come to the attention of the
panel (and that is why it is there, after all) it is up to the chairperson
to keep tempers cool and comments constructive rather than
personal. Participants are there to criticise the work, not the person,
and comments from an outsider such as ‘I just don’t understand
how you [the item writer] think’ are totally unacceptable.


Where possible, tape-recording the session should be undertaken


as well as the keeping of written records. Sometimes, in the heat of
intellectual argument, points get made which might be missed by
the item writer-scribe but be quite valuable during the later editing
of the item concerned.

Opinion varies as to how long a panel meeting should last. The


standard of item criticism seems to fall away quite sharply after
about two hours. However, it is often more economical to go on a
little longer to finish the work rather than re-convene everyone at a
later time, and the chairperson might need to keep that in mind in
reaching a decision about when to stop.


9 Stage One – editing or vetting


The item writer uses the panel meeting records, written or taped,
to edit or vet the draft items as meticulously and slowly as possible.
The panel will have engaged in a variety of discussions and made
many suggestions: they will have offered hunches about the
validity or difficulty of items, have given their perceptions about
the plausibility of distractors, and have pointed out actual errors of
fact or language use. It is now up to the item writer to respond and
accommodate to as many of these comments as seem sensible from
a professional viewpoint. Not all will be – panellists are sometimes
wrong! – but most will be invaluable in getting the final form of the
items just right.

The vetting process involves some or all of the following:

• cutting or adding to the stimulus material where the panel feels


this is necessary;

• choosing words in the stimulus material for asterisking


and defining at the foot of the passage, if the panel thinks
the candidates won’t know them but they are vital for
understanding the passage;

• deleting whole items the panellists think are trivial or too


difficult;

• writing whole new items suggested as replacements or additions


by the panellists;

• deleting the less successful draft distractors, or ones the panel
feels are implausible;

• writing new distractors suggested by the panel;

• checking the language at every turn – accuracy, consistency,


clarity;

• establishing a final order for the items in the set or the whole
test.

Much of this activity will seem like unnecessary wastage, but it is


(on the contrary) essential – only the best distractors, keys, stems
and stimulus material should be taken out into the field for pre-
testing. Sometimes the circumstances of this field testing mean that
slightly variant versions of the same item can be tried on different
groups. This is sometimes a useful procedure where the item-writer
cannot make up his or her mind as to whose hunch to follow – one’s
own or a panel member’s.

As an example of what might happen to an item during the vetting


process, let’s look again at the stimulus and draft item presented
earlier.

Passage

Multiple-choice item writing is a difficult art at the best of times


but how often we make it more difficult than it need be! We start
off thinking it will be easy and go at it very quickly before all the
matters we need to keep in mind have been considered. Rushing
the process won’t help – items do not spring fully-formed on to the
page. We need to ponder, review, get to know what it is we want to
ask and what the material might let us ask. Quickly prepared items
are often a waste of time – we find that out when we show them to
another person, and that person merely says: ‘trivial’ or ‘faulty’.


1. What is the main point of the passage?

A the time it takes


B the logical order of item development
C plausible distractors hard to find
D getting a good critic to look at the item
E finding enough distractors
F avoiding trivial questions
G the difficulty of item writing

During the panelling process, we need to inspect these suggestions


very carefully. The item writer has offered us more options than we
will need, so we might begin with them. What does each offer as
a summary statement of the process we have learned about in the
passage?

Time is certainly mentioned (A), as is the logic of the process (B).


Even though we might know that plausible distractors are hard
to find (C) or indeed enough distractors (E), the passage doesn’t
actually mention these matters, so they are the first candidates
for removal – relevant to the issue of item writing, but not to the
passage. Trivial questions do get a mention (F), as does the difficulty
of item writing in general (G).

So our item now looks like this:

1. What is the main point of the passage?

A the time it takes


B the logical order of item development
D getting a good critic to look at the item
F avoiding trivial questions
G the difficulty of item writing


Next, we might take a cold, hard look at the language used. The
stem, for example: is it explicit enough? We might try a slightly
more elaborate wording, such as «What is the main point the
author is trying to make in the passage?» If we used that (or even
the original), A doesn’t make much sense – what is “it” that is
taking time? Both A and B would need a verb: ‘indicating’ for A
and ‘following’ for B would help us understand things more readily
– D and F both have such verbs. G doesn’t, so we might insert
‘recognising’. What does our item look like now?

1. What is the main point the author is trying to make in


the passage?

A indicating the time that item writing takes


B following the logical order of item development
D getting a good critic to look at the item
F avoiding trivial questions
G recognising the difficulty of item writing

However this is still not right – the five options need something to
hold them together. We could do this by extending the stem to point
out something all the options have in common. The passage is being
written for item writers, so we might adjust all our wording yet
again to include them, in what is called a run-on stem:

1. What is the main point the author is trying to make in


the passage? Item-writers should

A allow for the extensive time that item writing takes


B follow the logical order of item development
D get a good critic to look at the item
F avoid trivial questions
G recognize the difficulty of item writing


Now, what’s the right answer? B, D and F are all mentioned in the
passage so they are reasonably plausible as options. But is “the”
answer to be A or G? Back to the stem: we might ask ourselves
‘Is there only one main point in the passage?’ Time is certainly a
recurring problem – it gets mentioned or implied several times – but
beyond that it is the overall difficulty of the whole process which is
being indicated, with ‘not rushing’ being a major factor. If we only
need four options for our final item, we could collapse A and G into
a single option, such as: ‘preparing items slowly and methodically’,
and put the word ‘difficulty’ (or something like it) into the run-on
stem. Now we might have:

1. What is the main point the author is trying to make in


the passage?
Item-writers will overcome the difficulties
of item-writing by

B following the logical order of item development


D getting a good critic to look at the item
F avoid trivial questions
G preparing items slowly and methodically

A final check: does the run-on stem agree syntactically with each
of the options? No: we’ve forgotten to put back the participle-form
into F: it needs to be ‘avoiding’. And since the run-on stem forms
four different sentences, we’ll need full-stops for each option. We
also need to adjust the option letters now that the item is finished, with
the key now being D. Also note one other last-minute change:
options should all be about the same length in terms of words, so an
addition (‘all faulty’) to the new option C has been made from the
passage. The item now looks like this:


1. What is the main point the author is trying to make in


the passage?
Item-writers will overcome the difficulties
of item-writing by

A following the logical order of item development.
B getting a good critic to look at the item.
C avoiding all faulty or trivial questions.
D* preparing items slowly and methodically.

Figure 15 offers a summary of the processes (and processing) which


would enable an item-writer to check the edited or vetted items for
the trial testing which now follows.

It should be pointed out that although the discussion in this and the
previous section has focused on the multiple-choice format as the
basis for the process description, the same sequence of procedures
should be followed with regard to other item formats. They too need
intense and critical review about curricular relevance, the wording
used, and their position within the pattern of activities which runs
through the whole test.


10 Advance preparation for final formatting

Two other activities should also be taking place simultaneously with
detailed vetting. One is to begin obtaining publishers’ and/or authors’
legal permissions for use of any text or other material which is
copyright. Field testing of materials should not take place until these
have been approved – it would be a waste to test a passage or other
stimulus material in the field only to find later that permission was
not forthcoming and all the item-writing had been wasted!

The other activity is to design the test paper and answer sheet layout
and format, so that the field test version is as close as possible to the
real design envisaged for the final test. Layouts and instructions to
candidates need to be field-tested as well as items.

Figure 15

THE PROCESS OF MULTIPLE-CHOICE ITEM DEVELOPMENT

1. The stimulus material to which the item refers will contain
material which is relevant to the objectives of the test.
Considering and responding to the material will be a
worthwhile educational experience for the test-taker.
2. The stem and the keyed answer together will represent a
meaningful and worthwhile response to a key or central issue
in the stimulus material, not a merely peripheral one. Doing the
item will not be a merely trivial experience for the test-taker.


3. Each of the distractors will be plausible: that is, they will
represent a possibly relevant view of the matters raised in the
stimulus and the stem.
4. The item will be independent. Finding the keyed answer will
not depend on successful answering to any other item in the
test. No clues as to the keyed answer will be given anywhere
else in the test.
5. A successful response to the item-stem will depend on the
test-taker understanding a key issue in the stimulus, not
eliminating distractors to find a ‘best answer’ or merely
recognising a stated fact.
6. The question stated or implied in the item will be positively
worded. Where an important issue in the material unavoidably
requires negative wording if it is to be tested at all, this will be
in the stem, printed in bold capitals (‘NOT’; ’EXCEPT’). No
additional negatives will be used in any of the options.
7. The item will contain four or five options for answering, and
be laid out in standard form.
8. Each of the options will be roughly the same length. If this is
impossible, then two groups of options will be of similar length
(e.g. two short and three longer).
9. The item will have been trial-tested and found to have a facility
between 20 and 80 percent.
10. In trial-testing, the keyed answer to the item will have been
found to discriminate positively, and distractors to discriminate
negatively.


Once again a variety of layouts might be possible, so that later


options for choice are kept open. One or (better) two specimen or
‘practice’ questions should be prepared for the front cover, in case
candidates have not seen this style of question before – this is a
particularly important issue where multiple-choice items are used.

A first review of keyed answer order should also take place for
multiple-choice tests. Two consecutive items may have the same key
letter, but not more than two. Also, an approximately equal number
of items should be assigned to each option letter (A, B, C, D) over
the whole test. There is often a tendency amongst item-writers to try
to ‘bury’ keyed answers by assigning them to C or D. If these two
letters are overused, the candidate who guesses using them has a
more than 25 percent chance of getting each item right.
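
The two checks just described are mechanical enough to be scripted before the
paper is finalised. Here is a minimal Python sketch (an illustration only; the
key list and the rough tolerance used for ‘approximately equal’ are hypothetical
choices, not a prescribed standard).

    from collections import Counter

    def check_key_order(keys, letters="ABCD", max_run=2):
        # keys: the keyed answers in test order, e.g. ['C', 'D', 'D', ...]
        problems = []
        run = 1
        for previous, current in zip(keys, keys[1:]):
            run = run + 1 if current == previous else 1
            if run > max_run:
                problems.append(f"'{current}' is keyed more than {max_run} times in a row")
        counts = Counter(keys)
        expected = len(keys) / len(letters)
        for letter in letters:
            if abs(counts.get(letter, 0) - expected) > expected / 2:
                problems.append(f"option '{letter}' is keyed {counts.get(letter, 0)} times "
                                f"(about {expected:.0f} expected)")
        return problems

    print(check_key_order(list("CDDCDBCDDDCC")))  # a deliberately unbalanced example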

Incidentally, wise item writers don’t throw away the successive
drafts and re-workings of the items they have prepared. For example,
the final version of item 1 above may not ‘work’ in trial testing,
and may need to be later revised. The rejected wordings and extra
options may make that task easier if that happens.

11 Field or trial testing

In a school setting, a trial test of items on a population which will
not do the final test is often impossible – the only students who
could do the trial are the ones who have been taught the course, and
for whom the test has been written. However in larger programs,
field testing is indispensable, for four reasons:

• ‘debugging’ the whole test, to get rid of errors;

• as a check on the conceptual difficulty of the whole test;

• as a check on test-length and timing;

• as a check on the vocabulary level used in the test stimulus and


item material.

The sizes of the trial population needed for various tests will vary
considerably – the rule-of-thumb might be ‘the largest possible
population, given the available resources’.

Another rule-of-thumb might be to choose a population as close as
possible in make-up to the one for which the test is ultimately intended – age;
level of schooling; gender composition; experience of the matter
which forms the basis of the test. Security might play a part in the
decision about who to use, too. The more people who see the test,
the more likely copies are to disappear. Thus it is essential that only
people connected with, or hired by, the test project team, should
handle papers and supervise trial sessions.


Ample time should be allowed in the session for all students to


complete the test: trial data are needed on the items at the end of
the test as well as the early ones. If in early sessions it becomes
obvious that only a few students are reaching the end of the test
and doing these late items, some special arrangement might need
to be made in later sessions for some students to work through the
paper ‘backwards’ as it were.

What if trial testing is not possible? This throws additional


responsibility on the test developer to be ultra-meticulous in
checking the test before it’s used. Every effort must be made to
get other professionals to look at draft forms of the test and offer
critical comments.

12 Item analysis

There is not room in a paper of this size to canvass and describe
all the different options which exist for the analysis of item data.
Papers for other modules will explore the matter in some detail.
School-based tests will use simple rather than complex strategies
to obtain (and indices to express) this information. Larger test
programs will probably have the resources to engage in quite
lengthy and complex processing.

Broadly speaking, for multiple-choice items in large programs, the
analysis consists of statistics which show the following (a simple
computational sketch is given after the list):

• the facility of each item: the percentage of the whole test-taking
  population which got it right;

• the discrimination index of each item: how well the keyed
  answer distinguishes between students of high ability and
  those less able;

• the response level for each item: how many actually attempted
  it, right or wrong;

• the criterion score on each item: the mean score of all those who
  did attempt it;

• whether any distractors did not function well: attracted too few
  candidates, or a preponderance of those of high ability.
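
As an illustration of how these statistics might be computed, here is a
minimal Python sketch. It is not taken from any particular analysis package:
the data layout (one dictionary of responses per candidate), the use of the
simple upper and lower 27 per cent groups for the discrimination index, and
the function and variable names are all assumptions made for the example.

    def item_statistics(responses, key, item):
        # responses: list of dicts mapping item name -> option chosen (None = omitted)
        # key:       dict mapping item name -> keyed (correct) option
        # item:      the name of the item to analyse
        n = len(responses)
        totals = [sum(1 for it, ans in key.items() if r.get(it) == ans)
                  for r in responses]

        attempted = [(r, t) for r, t in zip(responses, totals) if r.get(item) is not None]
        n_correct = sum(1 for r in responses if r.get(item) == key[item])

        facility = 100.0 * n_correct / n              # percentage of all candidates correct
        response_level = 100.0 * len(attempted) / n   # percentage who attempted the item
        criterion_score = (sum(t for _, t in attempted) / len(attempted)
                           if attempted else 0.0)     # mean total score of those attempting

        # Simple discrimination index: proportion correct in the top 27 per cent of
        # candidates (ranked by total score) minus the proportion in the bottom 27 per cent.
        ranked = sorted(zip(totals, responses), key=lambda pair: pair[0], reverse=True)
        g = max(1, round(0.27 * n))
        upper = [r for _, r in ranked[:g]]
        lower = [r for _, r in ranked[-g:]]
        discrimination = (sum(1 for r in upper if r.get(item) == key[item]) -
                          sum(1 for r in lower if r.get(item) == key[item])) / g

        # How often each option, including each distractor, was chosen.
        option_counts = {}
        for r, _ in attempted:
            option_counts[r[item]] = option_counts.get(r[item], 0) + 1

        return {"facility": facility, "discrimination": discrimination,
                "response_level": response_level, "criterion_score": criterion_score,
                "option_counts": option_counts}

Items whose facility falls outside the 20 to 80 per cent band mentioned in
Figure 15, whose discrimination index is close to zero or negative, or whose
distractors attract almost no candidates would then be flagged for the editing
decisions discussed below.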


However, from the item-editor’s point of view, it should be noted


that at least five other things are under analysis during this vital
stage of the whole development process.

Information is obtained about:

1. the parts of an item, especially in multiple-choice format with


its stems, keyed answers and distractors;

2. the integrity or worthwhileness of the item as a whole;

3. the performance of the item as a discrete test element;

4. the performance of an item with regard to other items in the


same set or test;

5. the integrity or worthwhileness of the test as a whole.

The distinctions between 2, 3, and 4 (which seem on the surface


to be saying much the same thing) are important, especially for a
multiple-choice test. Each, for the item editor, might offer a slightly
different reason for rejecting or retaining an item in the final test.
An item might be worthwhile as a measure of a particular higher
order skill (2). It might operate well as a discrete test element,
discriminating satisfactorily between the most able and the less able
(3). But it might be simply far too hard by comparison with all the
other items in the test (4) and deserve exclusion on that ground.

The interpreter of the item analysis sheets is faced with these


sorts of trade-off situations all the time. Here’s another. Two items
may prove to have keyed answers which show excellent powers
of discrimination (3). Each item may be well within acceptable
boundaries of ‘easy to hard’ (4). One deals with a higher-order skill
and has two ‘dead’ distractors, the other tests no more than factual
knowledge, though each of the distractors contributed well to the
overall performance of the item (2). Keep one item only? Which?


Keep both? Exclude both? There are no decision rules to cover


adequately the permutations and combinations which occur, or the
wide range of choices which becomes necessary. However, a sharp
eye should always be kept on the curriculum and the objectives
to be tested – the existence of the specification should not be
forgotten.

The test length should also be checked. If a large number of trial
candidates are found to have omitted one particular item, or failed
to complete the full test, then this tells us something about test
length. If everyone completes all items, this suggests that the
test time might be shortened or the number of items increased,
particularly if the trial test supervisors confirm that large numbers
of candidates sat around after finishing early in the trial session.
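
A small addition to the analysis sketched earlier can make this check routine.
The fragment below (again only an illustration, using the same hypothetical
data layout as the item-statistics example above) reports the omission rate for
each item and the proportion of candidates who completed every item.

    def completion_check(responses, items):
        # responses: list of dicts mapping item name -> option chosen (None = omitted)
        # items:     the item names in the order they appear on the paper
        n = len(responses)
        omit_rates = {it: 100.0 * sum(1 for r in responses if r.get(it) is None) / n
                      for it in items}
        completed_all = 100.0 * sum(1 for r in responses
                                    if all(r.get(it) is not None for it in items)) / n
        return omit_rates, completed_all

High omission rates concentrated on the last items point to a paper that is too
long; a completion rate close to 100 per cent, together with supervisors’
reports of early finishing, points the other way.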


13 Stage Two – editing for publication


When the test has been taken into the field and tried out, and the
item analysis has been completed, a second stage of editing then
occurs. Using the analysis results, each item is scrutinised and
decisions made about rejection or retention. There may be a little
judicious re-writing, but this should be limited to minor changes to
distractors only. If the stem and keyed answer need change, then a
new item has resulted, and this would require further trial testing:
it would be better to reject the item and use one that did work.

Once every item is clean, making up the final form actually begins.
As in so many matters to do with test development, there is a
sequence of activity which should be followed, in order to ensure
that new bugs don’t appear and that the test meets the original
specification as completely as possible. Figure 16 suggests an
appropriate sequence.

Figure 16

EDITING FOR PUBLICATION – A CHECK LIST OF ACTIVITIES

1. Read the item analysis and sort the items (or groups of items)
into three piles:
a) ready to go;
b) needing editing;
c) possible rejects.

2. Edit items and establish a final pool of items for the test.
Check the omit rate to establish optimum test length.


3. Check the specification against this pool for the number and
qualities of the items available. Reinstate any usable ‘rejects’ if
all the objectives are not satisfactorily covered.
4. Assign the items to a tentative order for the whole test and
enter the scoring scheme at the end of each section of the
test.
5. Check this order for:
a) order of difficulty (for example, in a set of multiple-choice
items, make sure some easy ones occur early, to give the
candidate some confidence);
b) keyed answer order and distribution;
c) balance and variety of item type.

6. Write or insert appropriate instructions for candidates.
Include suggestions for time to be spent on each section.
7. Assign the items to a preliminary paging of the test-paper:
some may have to be moved to allow a more satisfactory
layout. Allow sufficient space for answering if there is to be no
separate answer sheet.
8. Number all separate items consecutively. Use letters to
distinguish sections of the test if necessary.
9. Make a mock-up of the test paper (and answer sheet if used).
Read the paper through from beginning to end, to check
language, numbering (pages and any line-numbers used), all
labelling and diagrams, and the layout in general.
10. Photocopy the mock-up and ask a colleague who knows the
subject (but who has not seen the test) to ‘do the test’ on this
copy.
11. Check the trial completion by the colleague.
12. If possible, put the test aside for a week and then do it
yourself as a final check. Only then can you send the copy to
the printer!


14 But is it a good test?


Figure 15 suggested some criteria by which we might judge whether
a multiple-choice item was a good one. The same sort of evaluation
should be made of whole tests, and so (by way of a summary of
the preceding sections) here are some suggestions as to criteria we
might use.

Hopkins and Antes (1990: 156-7) propose three main areas of


review, which they call ‘Balance’, ‘Specificity’ and ‘Objectivity’. They
attach to each area a pair of questions which reviewers (and test
writers!) should ask themselves to obtain a picture of how good the
test is. (The questions have been re-numbered from the original.)

balance
1. Are the items selected for the test representative of the
achievement (content and behaviours) which is to be assessed?
2. Are there enough items on the test to adequately sample the
content which has been covered and the behaviours as spelled
out by the objectives?

specificity
3. Do the test items require knowledge in the content or subject
area covered by the test?
4. Can general knowledge be used to respond to the test items?

objectivity
5. Is it clear to the test taker what is expected?
6. Is the correct response definite?

As a test-writer you rely on your test panel to verify positive
answers to many of these questions, but you will need to keep
them in mind as you go through the developmental process. The
specification will help you to meet the demands of Question 1.
Distinguishing content and objectives as horizontal and vertical
dimensions of your matrix (as in Figures 6, 6a and 6b) is an easy
way to start. Not every cell has to be filled, but ‘Balance’ requires
that every line and every column has something in it.

There is another point to be made about balance. Do we want our


test to consist wholly of items in only one format? We might decide
so: say, all essays or all multiple-choice items, for ease or economy
of marking. However, where variety of format is a feature, an easy
format (say, true/false) should not dominate the whole instrument
to the point where we miss out on testing other more complex
learning objectives.

The panel (or your own judgement) will help you accommodate to
Question 2. Where testing-time is insufficient to cover everything
taught (and there rarely is enough time), achieving the best
‘Balance’ means that the sample should consist of the most
important content and the most important behaviours.

Question 3 is really about relevance as an element of ‘Specificity’.


Here the panel will really help, by eradicating trivial items or
questions about matters which are merely peripheral and wasting
valuable testing time. As we said above with reference to multiple-
choice items, the stem and the keyed answer together should
represent a meaningful and worthwhile response to a key or central
issue in the stimulus material, not a merely peripheral one. Whether
multiple-choice or not, doing a test item should never be a merely
trivial experience for the test-taker.

In a similar way, panels will help with Question 4, with comments such as: ‘It’s relevant, but it’s not specific to this course – even a
child in a nursery knows that!’


Question 5 has a number of applications. Whether in the instructions to candidates, or in the wording of an item itself, the
language you use must be clear, concise and explicit. Where you
think that definitions will help understanding (but not give away
answers), then print the definitions. Where there are rules or word
limits to be observed, state them. This will help the candidates do
their best, again without giving away answers.

Clarity and explicitness will also help achieve a positive answer to Question 6. This doesn’t mean that all correct responses will
necessarily be the same: that is obviously an impossibility with
responses to an essay task, for example. But it does mean that all
candidates should know the boundaries for correct responses – how
one might be achieved, even if they themselves can’t.

To the overall appraisals above, we might add some more specific indications of good test practice. Most of these will apply to items in
any format, whether selected-response or constructed-response.

• Provide enough information for the candidates to be able to understand what is expected of them, but not so much that
they will become confused.

poor
Write all you know about the Spanish Civil War.

poor
Some people say that the Spanish Civil War was really a chance
for certain European countries to try out tactics and use new
weaponry in advance of the outbreak of a larger European War.
If you agree with this view, find some evidence for its truth and
write this in about a page. If you don’t agree, say why you don’t in
a piece of writing of about the same length.

better
“The Spanish Civil War was a dress rehearsal for World War II.”
Do you agree? In a short essay of about 500 words, support your
view with evidence.


• Indicate, by leaving space, or giving some other indication, the amount which is expected in response to the item.
If leaving space, be generous: some students have large
handwriting.

poor
Write a few lines about the floating markets of Bangkok.

better
Give two reasons why the floating markets of Bangkok are
important to the city’s economy.

1 ......................................................................................................................
.........................................................................................................................

2 ......................................................................................................................
.........................................................................................................................

• Group items of similar format or content, and where necessary label them. Arrange items in order from simple to
more complex.

poor
1. Find a word in the second paragraph of the second story that
means ‘briefly’.
2. Write a short character-sketch of one of the players in the first
story.
3. Why didn’t the captain perform very well?
4. Is it true that the Blue Team won the game?

better PASSAGE 1
1. The Blue Team won the game: true or false?
2. Write a five-line character-sketch of one of the football players
mentioned in this story.


better PASSAGE 2
3. Write the word in Paragraph 2 that means ‘briefly’.
4. In three sentences, explain why the captain did not perform
very well.

• Make all directions and questions explicit and precise: avoid ambiguity.

poor
Do you know what the word prescient means?
better
The word “prescient” means ...........................................................

. . . . . . . . . . . . . . . . . . . . . . .

poor
How did Tina feel in your own words?
better
Write two sentences which show, in your own words, how Tina
probably felt during the hold-up.

. . . . . . . . . . . . . . . . . . . . . . .

poor
Don’t attempt the essay until after you have written it out first.
better
Draft your piece of writing on the blank page, then write out a
good copy on the lined page.


• Where particular rules are to be used, or specific units, designate these in the instructions.

poor
What is the total elapsed time?
better
The journey took ................. hours, ................... minutes.

. . . . . . . . . . . . . . . . . . . . . . .

poor
Since John walked right round the court once, how far did he
walk?
better
In walking around the perimeter of the court, John travelled
................... meters.

. . . . . . . . . . . . . . . . . . . . . . .

poor
Solve the following problems.
better
Solve the following problems. Express your answers correct to
two decimal places.


• Wherever possible, indicate the criteria to be used in assessment of an extended response item such as an essay or a problem for solution.

samples
Up to five marks may be deducted if you do not show all your
working.

. . . . . . . . . . . . . . . . . . . . . . .

Your writing will be assessed according to the thought and content displayed in the piece, the structure and organisation of the whole, and your expression, style and mechanical accuracy.

• Wherever possible, present tasks which are new or unfamiliar to the student, but which remain centrally relevant to the learning which you expect to have been done.

(Even if the stimulus material to which an item or items refer is new to the candidate, it should contain material which is relevant to both the objectives and the content of the course.
Considering and responding to such new material should be a
worthwhile educational experience for the test-taker: it should
enhance their knowledge as well as providing fertile material for
testing purposes.)
samples
Here are two documents relating to the study of Caribbean
history you have done this year. Both will be new to you. Read
them carefully and answer the questions which follow.

. . . . . . . . . . . . . . . . . . . . . . .

Listen to the following extract from an early symphony by Mozart (No. 29, KV 201). Although this work was not set for study this
year, the questions which follow will require you to compare the
piece with the other symphonies which you did study.


• If appropriate, indicate by means of a set of headings the structure or organisation you expect or suggest that students use in completing the task.

samples
In 750-1000 words write a critical review of the various changes in
the overall direction of Pablo Picasso’s painting style from 1905 to
1945.
You might mention:
a. the early work;
b. the impact of Cubism;
c. post-Cubist developments.

. . . . . . . . . . . . . . . . . . . . . . .

Choose any two poems by Schiller which you have studied this
year. Write a critical comparison of the two poems, showing what
each reflects about Schiller’s poetic achievements, any essential
differences in language or tone between the two, and your
personal assessment of their qualities.

• Avoid choice of items as far as possible, though not necessarily choice within items. In the latter case, offer real, not meaningless, alternatives. If candidates may choose their own structure or form for presentation of their response, tell them so.

samples
“The recent economic history of Argentina might be well
described as the nation staggering from one crisis to another.”
In a piece of writing of 750-1000 words, give your view of the
country’s economic history since 1960, emphasising one or more
of the following:


• international trade;
• domestic monetary policy;
• the impact of the war in the Malvinas (Falkland) Islands.
. . . . . . . . . . . . . . . . . . . . . . .

Write about the man in the photograph above. You may write in
any form you like: for example, a story, a letter, or a conversation.

• If separate stimulus material is used to prompt candidates to a response, a successful answer to the item should depend
on the test-taker understanding a key issue or issues in the
stimulus, not eliminating irrelevant material in order to find a
‘best answer’ or merely recognising a set of stated facts. Test
taking should never become a search-and-destroy mission.

poor
List the names of the characters in the story you have just heard.
better
Name the character in the story you have just heard whose
actions contribute most to the build-up of suspense. In your own
words, tell what he or she did that was decisive in achieving that
build-up.

. . . . . . . . . . . . . . . . . . . . . . .

poor
Which of the towns on the map is the third-largest in terms of its
population?
better
Town X is the market town for the region shown on the map. One
reason is that it is a railway junction. Use the map and its legend to
identify three other reasons why it has become so important.


• Whenever you can provide helpful advice, print it, even at the expense of a few more words.

poor
Candidates must do the questions in order.
better
You will give yourself your best chance if you work through the
questions in the order of presentation. However, you may need to
leave any particularly hard questions to come back to later.

. . . . . . . . . . . . . . . . . . . . . . .

poor
TIME: One hour.
better
TIME: One hour.
Leave a few minutes at the end of the test-time in order to check
your work thoroughly.


15 Training scoring teams


The test has been printed and the real population for whom it
was intended has completed it. One further stage of the test
program remains to which the item-writer is well-placed to make
a contribution. This is the training of the team of people who
become responsible for assessing the candidates’ work. In some
test programs, of course, the scoring is done by machine which
scans and optically ‘reads’ the marks made by candidates in the
appropriate places on an answer sheet. However, in other programs
(particularly those which use extended essay responses as the prime
format for response) the scoring will be the result of intense human
activity, often by a large team of markers.

This team needs to be fully trained. And the training will need
the prior production of a set of criteria to be applied by all team-
members to the products of the testing, whether these be essays,
painted or sculpted works of art, diagrams, plans or computer-
output. To some extent all assessment is criterion-based in this
way. Someone exercises a judgement with some criteria or other in
mind. The need here is for the assessors to have common or shared
criteria, as far as such a thing is possible. The best assessment is
also criterion-referenced, in that the criteria not only determine the
award of credit by markers, but also underlie the reports of student
achievement which result from the assessment process. This is true
even if those results have to be norm-referenced or standardised
later for other purposes.

It is often possible to use trial test data to achieve the greater commonality of assessment criteria for training purposes. If, as should have happened, the extended response items were pre-tested, samples on each of the finally-chosen topics or questions will
be available. In addition, the item constructor will have had a clear
idea of what was intended or foreseen as a good or medium or poor
response to the item – these foresights should yield a basic list of
assessment criteria with which to begin the training sessions.

The governing principle for selection of an assessor should be expertise in and experience of the curriculum or course under
review. This will have yielded a personal set of assumptions about
learning and criteria for its assessment which the individual
brings to the first training session. At that session the task is to
get the item-writer’s criteria, the assessors’ criteria and the trial
test samples into a dynamic situation which eventually yields a
commonly held view about what should constitute the various levels
of performance to be decided for the candidate population.

A simple set of base criteria with which to start any session, in any
subject, might be the following:

1. the thought displayed by the candidate in preparing the work and the content offered in presenting the work;

2. the structure and organisation of the content of the piece of work, as finally presented for assessment;

3. the expression, style and mechanics of the finished piece.

The key words in these criteria (thought, content, structure and organisation, expression, style and mechanics) can be ‘translated’ to fit just about any criterion
set. In language tests, mechanics might mean accuracy of spelling
and punctuation and organisation might yield insight into
paragraphing skills. In mathematics tests, style might mean the
economy or elegance of the thinking which went into proposing
a particular solution to a problem. What this set (or a similar one)
might do is lead the assessors towards a fruitful discussion of


the important criteria for their particular task. It might also help
them to avoid a common assessment problem: concentrating on
the surface features, to the exclusion of deeper, more important
qualities of a student’s work.

Figure 17

TRAINING A SCORING TEAM

1. Find a venue which enables group discussions to take place.

2. Select only assessors who are expert and experienced in the appropriate subject or learning area.

3. Issue a basic list of criteria and a small set of student work from the trial test to each assessor in advance of the first training session.

4. At the first session, set up small-group interactions to review and revise the base criteria in the light of the test item and the advance reading.

5. In a large-group session, agree on a criteria list (or come as close to agreement as possible!)

6. Each individual applies the criteria to another small set of trial test materials (everyone does the same set).

7. In another small-group interaction, the results of the trial marking are discussed, and the whole group revises the criterion set as necessary.

8. Marking commences.


Figure 17 indicates a suggested sequence for organising and conducting a training program for assessors. The sequence
emphasises prior knowledge of shared criteria amongst the
assessors. Step 5, for example, can be easily done if a ‘bundling’
exercise is undertaken, where cards (each containing one possible
criterion statement) are sorted into piles of like criteria, redundant
cards are eliminated, and lists made of the remaining ‘live’ ones.

The sequence also emphasises the use of samples of work from ‘real life’ in assisting the definition of criteria. What it does not
do is to tease out the wealth of possible strategies in applying
those criteria to these samples. One such strategy would be for
an assessor to use all the criteria simultaneously or holistically in
coming up with a grade or score for each piece of work. Another
would be to assess each piece analytically, element by element (or
criterion by criterion), and come up with a set of part-scores which
would then be added together. Yet another strategy would be to
find and publish a criterion example or piece of work at each score
or achievement level, and ask assessors to match each new piece of
work to one of these given examples.

Whichever method is chosen (and each has its advantages and disadvantages), there must be provision of ample time for
discussion and agreement by those involved. There will also need to
be on-going analysis and follow-up of the actual performance of the
assessors during the marking period. This on-task monitoring might
also be accompanied by further occasions for group discussion
during the process. Individual variations in the interpretation of
criteria are only to be expected; score discrepancies are also to
be expected. The aim is not to eradicate them completely – this
would be impossible – but to lessen their eventual impact on the
reliability and validity of the whole marking process. (This might
mean, in extreme cases, removing poor assessors from the team.)
Reliability and validity might also be enhanced in various other ways; for example, reducing the physical and psychological impact of a large workload by staggering the sessions to allow for leisure, and by batching the data in smallish bundles so that a sense of achievement is felt fairly constantly.
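One practical way of following up assessor performance during the marking period is to have pairs of assessors double-mark a small sample of scripts at intervals and to summarise how far their scores diverge. The short sketch below is only an illustration of this idea (it is written in Python; the scores and the tolerance of one mark are invented for the example and are not taken from this module):

def discrepancy_report(scores_a, scores_b, max_mean_gap=1.0):
    """Summarise agreement between two assessors who double-marked the same scripts."""
    assert len(scores_a) == len(scores_b), "both assessors must mark the same scripts"
    gaps = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    mean_gap = sum(gaps) / len(gaps)
    exact_agreement = sum(1 for g in gaps if g == 0) / len(gaps)
    return {
        "mean_gap": mean_gap,                # average size of the score difference
        "exact_agreement": exact_agreement,  # proportion of scripts given identical scores
        "needs_discussion": mean_gap > max_mean_gap,
    }

# Invented essay scores (out of 10) awarded by assessors A and B to ten scripts.
assessor_a = [7, 5, 8, 6, 9, 4, 7, 6, 8, 5]
assessor_b = [6, 5, 7, 6, 9, 5, 7, 4, 8, 6]
print(discrepancy_report(assessor_a, assessor_b))

A rising mean gap, or a fall in exact agreement, would be a signal to call a further group discussion of the criteria rather than a reason to abandon the marking.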

In the section above discussing the choice of item-types for test programs, mention was made of the subjectivity involved in the
selection of ‘right’ answers for many multiple-choice questions.
Subjectivity, of course, has a large place in the allocation of grades
to pieces of student writing or other extensive responses to test
items. Despite our best efforts to get assessors to agree, share and
use criteria similarly, discrepancies of understanding and of grade
allocation will occur. They are unavoidable. The aim is to minimise
them. They do not invalidate the assessment process – they merely
form part of the ‘trade-offs’ we might have to make when we can
see no other way of assessing complex or extended responses to the
learning the students have done.

16 Further reading
Below are citations to six books. Reference to them will take your
understanding further. As the publication dates indicate some are
quite elderly: nevertheless they are mentioned because this may improve the chances of accessing them through libraries. Page numbers are given after each topic.

Gronlund, N.E. (1976: third edition) Measurement and Evaluation in Teaching. New York: Macmillan.
preparing instructional objectives 28-59
multiple-choice formats 188-209
measuring complex achievements 210-248

Hopkins, C.D., and Antes, R.L. (1990) Classroom Measurement and Evaluation. Itasca: Peacock.
principles for writing good items 210-18; 243; 265

Lien, A.J. (1976: third edition) Measurement and Evaluation of Learning. Dubuque: Wm C. Brown Company.
characteristics of good instruments 78-91
construction and use of classroom tests (principles and steps) 194-242

Lindquist, E.F. (ed.) (1955) Educational Measurement. Washington, D.C.: American Council on Education.
summary of test planning 175-184
ideas for test items 190-3
formats 193-212
writing items 213-227


Mehrens, W.A. and Lehmann, I.J. (1984: third edition) Measurement and Evaluation in Education and Psychology. New York: Holt, Rinehart and Winston.
objective and extended response tests compared 75-84
general considerations in item writing 84-89
essay testing 96-112
assembling, reproducing, administering and scoring 177-202

Miller, H.G., Williams, R.G., and Haladyna, T.M. (1978) Beyond Facts: Objective Ways to Measure Thinking. Englewood Cliffs: Educational Technology Publications.
multiple-choice items beyond the factual level 33-56
measuring predictive behaviours 73-96
measuring evaluative behaviours 97-116

17 Exercises

1. Select a curriculum area in which you are an expert or which is important for some professional reason.

   Using Figures 1 and 2 as a guide, write a brief specification for a test or test program in this area. One page will do – omit Step 8, the matrix.

   In a small-group interaction (2-4 persons), review the specifications prepared by individuals.

2. Using Figures 3-6 as a guide, use your specification, as revised after Exercise 1, to develop a full matrix for the test or test program.

   Include a set of multiple-choice items as one of the entries in the matrix.

   Repeat the small-group interaction undertaken in Exercise 1.

3. Use your revised matrix from Exercise 2 to draft a small set (3 or 4) of multiple-choice items appropriate to the specification you have prepared.

   If you cannot find a piece of stimulus material, you may need to prepare or write a piece yourself.

   Set up a small panel and review the items.


18 Glossary of terms
answer sheet or booklet
a piece of test stationery separate from the question booklet, on
which candidates record personal details and the answers to the test
items.

constructed response item


a test item which allows or requires candidates to produce
individual responses rather than merely select from a list of given
options.

criterion score
the mean facility of an item taking into account only the
performance of those candidates who actually attempted to answer:
can also be calculated for whole tests or groups of items, using the
same rule.

discrimination
the ability of an option to distinguish between those groups of
candidates who had greater and lesser ability as indicated by
their performance on the whole test. The indices used are usually
expressed as positive or negative fractions of 1.0 and can be derived
using a number of different formulae (e.g. point bi-serial; phi
coefficients, etc.)

distractor
in a multiple-choice item, an option for choice which is not the
keyed answer, but which has been written in such a way as to
distract weaker candidates from selecting that key.

editing
preparation of refined versions of tests or items after other key
stages in the development process, such as panelling or trial-testing.
It is usually performed by the original item-writer(s).

extended response item


any test item which requires the production of a personal response
by the candidate which is longer than a sentence or two.

facility
the index obtained by a multiple-choice item during testing which indicates the number of candidates who got it right, expressed as a percentage of the total number of candidates who sat the test. The index for an item to be used in a final test should always lie within the range 20-80 percent. (A short computational sketch illustrating facility and discrimination follows this glossary.)

final form
the test instrument after it has been trial-tested, analysed, edited
and prepared for publication.

instructions

• to candidates
information printed on the question paper or answer sheet which
candidates need to be able to complete the test satisfactorily,
but which does not actually form part of the stimulus material
or questions: in some cases, these may also be read aloud by a
supervisor.
• to supervisors
information provided for test supervisors or invigilators on how
to conduct the test session correctly: includes a script of any
instructions to candidates which are required to be read aloud.


item
an individual task which forms one component of a test instrument:
usually applied in the context of a multiple-choice test to indicate a
single question, but can be used more broadly.

key, keyed answer


in a multiple-choice item, the option which is designated to be
correct, and for which a score is awarded.

key order
in a multiple-choice test, the sequence of letters attached to keyed
answers, as in 1 D, 2 B, 3 C, etc. The same key letter should be used
for no more than two consecutive items: viz. 3 C, 4 C, 5 A.

moderation
sometimes used to describe the process whereby expert panels meet
to discuss and offer critical comment on test materials.

multiple-choice
an item format whereby a restricted number of optional responses
is offered to candidates, from which they must select one as their
answer.

omit rate
a tally of the number of candidates who did not answer a test item:
especially important in estimating the performance of trial test
candidates in the later items of the test, with a view to establishing
an acceptable test length.

option
in a multiple-choice item, one of the set of responses (usually four or five) from which the candidates select their answer.


panel, panelling
a group of experts called together to discuss and evaluate draft
items proposed for use in a test instrument.

question book(let)
a printed test instrument which contains instructions, stimulus
material and test items for students to work through during the test
session. Answers may be recorded in this book, or on a separate
answer sheet.

selected response item


any item which prints a limited range of options from which
candidates must select their answers.

specification
a document which specifies in some detail the nature and
composition of a test program or instrument: sometimes called a
‘blueprint’.

stem
in a multiple-choice item, the sentence(s) or part-sentence which
indicate the testing point or question, which candidates use to select
their answer from the options which follow.

stimulus
• directive
any information in a test which candidates need to understand the
specific task which they are being asked to perform (e.g. the stem of
a multiple-choice item, or a detailed essay topic).

• instructive
any information printed in a question booklet which candidates are
expected to refer to when answering the specific questions which
relate to it.


trial form
the test instrument after it has been developed, panelled and edited
ready for administration to a trial population in the field.

vetting
the process of editing and arriving at a draft test form, using the
discussions and evaluations of draft items by a panel as a guide.
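Several of the indices defined above, notably facility and discrimination, can be computed directly from a scored 0/1 response matrix once trial-test data are available. The sketch below is offered only as an illustration, in Python; the response data are invented, and the discrimination index is computed as a point-biserial correlation, one of the formulae mentioned under ‘discrimination’:

from statistics import mean, pstdev

# Invented trial-test data: rows are items, columns are candidates (1 = correct, 0 = wrong).
responses = [
    [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],   # item 1
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],   # item 2
    [1, 0, 1, 1, 0, 0, 0, 0, 1, 1],   # item 3
]
totals = [sum(col) for col in zip(*responses)]      # each candidate's total score

def facility(item_scores):
    """Percentage of all candidates who answered the item correctly."""
    return 100 * sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Pearson correlation between the 0/1 item score and the total test score.
    (Here the item itself is included in the total, as in simple item-total correlations.)"""
    mi, mt = mean(item_scores), mean(total_scores)
    si, st = pstdev(item_scores), pstdev(total_scores)
    cov = mean((i - mi) * (t - mt) for i, t in zip(item_scores, total_scores))
    return cov / (si * st) if si and st else 0.0

for number, item in enumerate(responses, start=1):
    f = facility(item)
    d = point_biserial(item, totals)
    flag = "" if 20 <= f <= 80 else "  <-- outside the 20-80 per cent range"
    print(f"Item {number}: facility {f:.0f}%, discrimination {d:.2f}{flag}")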

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana, Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa,
Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

Module 6
John Izard
Overview of test construction

UNESCO International Institute for Educational Planning


Module 6 Overview of test construction

Content

1. Assessment needs at different levels of an education system 1
   Student, teacher and parent assessment needs 2
   Regional and national assessment needs 3

2. What is a test? 5

3. Interpreting test data 8
   The matrix of student data 8
   Designing assessments to suit different purposes 10

4. Inferring range of achievement from samples of tasks 12
   Choosing samples of tasks 12
   The wider implications of choosing samples of tasks 15

5. What purposes will the test serve? 18
   Results used to compare students 19
   Results used to compare students with a fixed requirement 21
   Other labels for categories of test use 22

6. What types of task? 26
   Tasks requiring constructed responses 26
   Tasks requiring choice of a correct or best alternative 28
   Is one type of question better than another? 31

7. The test construction steps 32
   Content analysis and test blueprints 32
   Item writing 35
   Item review – the first form of item analysis: checking intended against actual 36
   Other practical concerns in preparing the test 37
   Item scoring arrangements 38
   Trial of the items 39
   Processing test responses after trial testing 40
   Item analysis – the second form involving responses by real candidates 40
   Amending the test by discarding/revising/replacing items 41
   Assembling the final test (or a further trial test) and the corresponding score key 41
   Validity, reliability and dimensionality 42

8. Resources required to construct and produce a test 45

9. Some concluding comments 48

10. References 50

11. Exercises 52
1 Assessment needs at different levels of an education system

Assessment of student learning provides evidence so that educational decisions can be made. We may use the evidence to
help us evaluate (or judge the merit of) a teaching programme or we
may use the evidence to make statements about student competence
or to make decisions about the next aspect of teaching for particular
students.

The choice of what to evaluate, the strategies of assessment, and the modes of reporting depend upon the intentions of the curriculum,
the importance of different parts of the curriculum, and the
audiences needing the information that assessment provides. For
example, national audiences for this information may include both
those who will be making decisions and those who wish or need to
know that appropriate decisions have been taken.

Educational decisions which require information about the success of learning programmes, or which require information
about which students have reached particular levels of skill and
knowledge, depend upon valid (and therefore reliable) measures
to inform those who make the decisions. The type of information
will depend upon whether the decisions are being made at the
personal, school, regional, or national level. Variables which are
seen to influence the outcomes of education may, or may not, be
within the province of school systems to alter. For example, socio-
economic circumstances are known to have influences on student
achievement, but teachers are not generally able to change the


socio-economic circumstances of the families in their school’s community. By contrast, other variables are able to be manipulated to produce changes in student achievement (we say these variables are malleable). For example, better teacher in-service
training and the provision of improved instructional materials can
improve student achievement.

In order to measure progress, tests need to be given more than once so that changes can be identified. For example, to assess the
impact of new programmes to improve schools, baseline measures
are needed to describe the effectiveness of the teaching provision
before the innovation, so that subsequent measures can be used to
judge the effectiveness of the implemented innovation.

Student, teacher and parent assessment needs
At the individual student level, students, teachers, and parents
need information about student performance expressed in ways
which not only identify strengths and weaknesses, but which
also suggest what might be done to capitalise on the strengths
and to overcome the weaknesses. Assessment data can only
be understood in the context in which they were collected. For
example, a score of 59% is meaningless without knowing what
teaching/learning situations have been provided, how long the
educational programme has been offered, whether the student
has actually been present for all or most of the programme, what
questions were asked, and what answers were expected. Such a
score also has implicit messages about precision – the accuracy is
implied to be to the nearest half of a percentage point, although
such precision is very rarely achieved in educational assessment.


School level assessment needs


At the school level, the school principal and senior administration
group generally require information about classes rather than
individual students. This information may be expressed in
association with information from classes in other schools in the
district, region, or nation. Such comparisons generally concentrate
mainly on relative standing. The relative standing of a particular
school may improve for reasons that are not related to the skills
of the teachers or the educational programme of the school. For
example, relative standing may improve because schools select
pupils who will do well even if the teaching is poor. Rather
than concentrating on relative standing, it is better to focus on
information expressed in terms of expected learning levels and
progress towards educational goals. Then actions taken can relate to
ensuring that accepted educational goals will be met for all students
in the school. In this case success would be judged by taking into
account the extent to which a school has ensured that every student
has made good progress.

Regional and national assessment needs


At the regional level (including state and provincial levels) the
information required is generally concerned with improving the
effectiveness of larger numbers of schools. Evidence of school
achievement might be based on a wider range of indicators, such
as effective use of resources provided to a school, provision of
educational programmes which meet policy guidelines, and the
extent to which the community where the school is placed is
involved in the educational programmes.


At national level the information required must relate more to policy issues, national planning, and the resource implications for
competing options in educational plans.

It is particularly important for National officials to be sensitive to long-term trends in their education system’s capacity to assist all students to
make progress towards achieving a high standard of physical, social and
cognitive development. In some circumstances these trends will call for
intervention in what is seen as an emerging and widespread inability of
students to achieve success in a specific part of the curriculum. In other
circumstances, the focus will be on the curriculum itself because it may
be seen as being in need of revision and restructuring in order to take
account of recent research and/or new social and economic conditions.
(Somerset & Eckholm, 1990, p.18)

Those who are taking action should also know the likely direct and
indirect effects of various action options, and the costs associated
with those options. They will include politicians, high level advisors,
senior administrators, and those responsible for curriculum,
assessment, teacher training (pre-service and in-service), and other
educational planners.

That is, those taking action need to be able to provide evidence that
their actions do ‘pay off’. For example, politicians have to be able to
convince their constituents that the actions taken were wise, and
senior administrators need to be able to show that programmes
have been implemented as intended and to show the effectiveness
of those programmes. It is important for such officials to realise
that effecting change requires more than issuing new regulations.
At the national level, action will probably be needed to train those
responsible for implementing change.

2 What is a test?
One valid approach to assessment is to observe everything that is
taught. In most situations this is not possible, because there is so
much information to be recorded. Instead, one has to select a valid
sample from the achievements of interest. Since school learning
programmes are expected to provide students with the capability
to complete various tasks successfully, one way of assessing each
student’s learning is to give a number of these tasks to be done
under specified conditions. Conventional pencil-and-paper test
items (which may be posed as questions) are examples of these
specially selected tasks. However other tasks may be necessary as
well to give a comprehensive, valid and meaningful picture of the
learning. For example, in the learning of science subjects practical
skills are generally considered to be important so the assessment
of science subjects should therefore include some practical tasks.
Similarly, the student learning music may be required to give a
musical performance to demonstrate what has been learned. Test
items or tasks are samples of intended achievement, and a test is a
collection of such assessment tasks or items.

Single, discrete items may not be reliable (or consistent) indicators of achievement. However, when a number of similar items or tasks
are combined as a test, we can look at patterns of success on the
test. Such patterns tend to be more dependable indicators because
they are based on multiple sources of evidence (the various separate
assessment tasks).

Clearly, the answer for one item should not depend on information
in another item or the answer to another item. Otherwise this notion of combining independent pieces of evidence would be lost. (The same idea extends to other tasks where results are combined
with test results to document learning achievement.)

This approach of giving all students of a particular age the same sample of assessment tasks is of value at both the individual student
and school levels. Teachers and school principals can examine two
types of profile to evaluate delivery of educational programmes.
The teacher might look at the performance of individual students
in each of the areas assessed in order to find out more about the
extent of progress since the previous assessment. The teacher and
the principal might look at the performance of the assessment tasks
themselves to identify those topics presenting special difficulties
within school classes or across classes.

Regional and national officials may wish to review performance on particular assessment tasks also, but the large volume of data
and the expensive resources needed to collect the data, process and
interpret it, preclude collection of this information on an individual
basis for every student. Well-designed probability samples of
students will provide more economical and quite accurate ways of
estimating regional or national performance. Recent advances
in testing technology mean that students in these samples need
not attempt identical test questions – and therefore the ‘coverage’
of the collected information can be extended to a wider range of
topics. When such data are collected it is important to ensure that
information is gathered on variables that are influential but which
(for schools) are not malleable as well as on variables which can
be influenced by schools. With care, regional and national officials
can take (statistical) account of the non-malleable factors when
assessing the impact of variables that can be influenced by schools.
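The point that sampled students ‘need not attempt identical test questions’ is usually put into practice by spreading a large item pool across several rotated test booklets and assigning one booklet to each sampled student. The sketch below shows the bare bones of such an arrangement (in Python; the pool size, the number of booklets and the assignment rule are all invented for illustration and are not prescriptions from this module):

import random

ITEM_POOL = [f"item_{i:03d}" for i in range(1, 61)]   # a pool of 60 items
BOOKLETS = 6                                          # six rotated forms

# Deal the items round-robin into booklets (10 items each); in practice some
# 'link' items would usually be repeated across booklets so the forms can be equated.
booklets = {b: ITEM_POOL[b::BOOKLETS] for b in range(BOOKLETS)}

def assign_booklets(student_ids, seed=1):
    """Randomly assign one booklet number to each sampled student."""
    rng = random.Random(seed)
    return {student: rng.randrange(BOOKLETS) for student in student_ids}

sample = [f"student_{i:04d}" for i in range(1, 25)]
assignment = assign_booklets(sample)
print(booklets[0])                     # the ten items appearing in booklet 0
print(assignment["student_0001"])      # the booklet this student will sit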

Generally traditional examinations are not appropriate for reviewing regional or national performance over a period of time.
National examinations cannot show the extent of improvements
of teaching skill, the extent to which all parts of the curriculum are working, or the magnitude of improvement which results from deployment of resources as a result of policy changes. The scoring
of national examinations which are multiple-choice in format is
likely to be consistent, but such questions are not likely to be used
in a recurring pattern because of the potential for breaches of test
security in such a high-stakes assessment context. The scoring
of open-ended and short-answer format questions will include
variation due to scorer behaviour as well as that due to candidate
behaviour, and the same examiners are not likely to assess a
comparable question in subsequent years.

These problems in traditional examining mean that assessing changes (in regional or national performance over time) requires a
range of specially developed low-stakes tests. In the development
of these tests, care needs to be taken that the questions used
on one occasion are comparable to those used on another, even
though they are not the same questions. This comparability must
be demonstrated empirically at some stage. Usually this means
that both sets of questions are given to another sample of students
representative of the range of achievement. Then questions
apparently similar with respect to content and coverage can be
checked to see whether students respond to the questions in a
comparable way. Questions that are comparable will have similar
ranges of difficulty, will reflect similar performance by significant
sub-groups of the population (such as males, females, ethnic
minorities, city and rural), and will have similar discrimination
patterns over the range of achievement. (In other words, low
achievers will have similar performances on both sets of questions,
middle level achievers will have similar performances on both,
and high achievers will have similar performances on both.
These specially developed tests can be used with relatively small
representative samples to assess the extent of changes for the
purposes of monitoring effects of additional funding, changes
in the provision of teachers, or the effects of introducing new
instructional materials.
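A first, rough empirical check of this kind can be made by giving both sets of questions to the same linking sample and comparing the difficulty (facility) of each pair of supposedly equivalent questions, before going on to examine sub-group performance and discrimination patterns. The sketch below illustrates only that first step (in Python; the scores and the ten-percentage-point tolerance are invented for the example):

def facility(scores):
    """Percentage of the linking sample answering a question correctly."""
    return 100 * sum(scores) / len(scores)

# Each entry pairs a question from the earlier test with its intended counterpart
# on the later test, scored 1/0 for the same eight linking students (invented data).
paired_scores = {
    "fractions":   ([1, 1, 0, 1, 1, 0, 1, 1], [1, 1, 1, 1, 0, 0, 1, 1]),
    "graphs":      ([0, 1, 0, 0, 1, 0, 1, 0], [1, 1, 1, 1, 1, 0, 1, 1]),
    "measurement": ([1, 0, 1, 1, 0, 1, 1, 0], [1, 0, 1, 1, 1, 1, 0, 0]),
}
TOLERANCE = 10   # percentage points of facility treated here as 'comparable'

for topic, (earlier, later) in paired_scores.items():
    fa, fb = facility(earlier), facility(later)
    verdict = "comparable" if abs(fa - fb) <= TOLERANCE else "check this pair"
    print(f"{topic}: earlier {fa:.0f}%, later {fb:.0f}% -> {verdict}")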


3 Interpreting test data

The matrix of student data


When data from a test are available, the requirements of the various
audiences interested in the results differ. This can be illustrated
using the matrix of information shown in Figure 1.

The students and their parents will focus on the total scores at
the foot of the columns. High scores will be taken as evidence
of high achievement, and low scores will be taken as evidence of
low achievement. However, in summarising achievement, these
scores have lost their meaning in terms of particular strengths and
weaknesses. They give no information about which aspects of the
curriculum students knew and which they did not understand.
Teachers, subject specialists, curriculum planners, and national
policy advisors need to focus on the total scores shown to the right
of the matrix. These scores show how well the various content areas
have been covered. Low scores show substantial gaps in knowledge
where the intentions of the curriculum have not been met. High
scores show where curriculum intentions have been met (at least for
those questions that appeared on the test).

Figure 1. Matrix of student data on a twenty-item test

Students
Items   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18   Total
1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 15
2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 14
3 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 14
4 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 11
5 1 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 13
6 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 11
7 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 1 1 8
8 0 0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 13
9 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 1 1 1 9
10 0 1 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 7
11 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 1 0 7
12 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 6
13 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 7
14 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 1 11
15 1 0 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 1 12
16 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 1 1 8
17 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 8
18 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 6
19 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 4
20 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 5
21 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 0 0 10
3 4 5 7 7 9 10 10 12 12 14 14 14 14 15 15 17 17 199

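The totals in Figure 1 come directly from the 0/1 matrix of item-by-student responses: summing along a row gives an item total, and summing down a column gives a student's total score. A minimal sketch of this, using only the first three items of Figure 1 (Python is used here purely for illustration), is:

# The first three rows of Figure 1: items as rows, the eighteen students as columns.
matrix = [
    [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # item 1
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # item 2
    [0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1],   # item 3
]

item_totals = [sum(row) for row in matrix]            # the figures at the right of each row
student_totals = [sum(col) for col in zip(*matrix)]   # the figures at the foot of each column

print("Item totals:   ", item_totals)       # [15, 14, 14] for these three items
print("Student totals:", student_totals)

Students and their parents attend to the second list; teachers and curriculum planners attend to the first.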

Designing assessments to suit different purposes
Traditional examinations give every student the same task so
that individuals can be compared, for example, by comparing the
column totals in Figure 1. As time is limited and the cost of testing
large numbers of candidates is high, the number of tasks used
has to be relatively small. As the costs of assessment are roughly
proportional to the number of cells in the matrix, the number
of questions asked in traditional examinations will be limited to
contain costs. The resulting matrix will be wide to cater for many
students but not very deep because of the limited number of test
items (see Figure 2).

Figure 2. Traditional examination data matrix
(a schematic wide but shallow matrix: columns for many students, rows for only a few items)
Information about many important issues cannot be collected because so many students have to be tested. Traditional
examination questions are a sample of those assessment tasks that
can be given to all students in a convenient format and they usually
ignore those assessment tasks that cannot readily be given to all
students. By contrast, tests used in national assessments are likely
to differ from the usual public examinations. National assessments,
which gather information on a much larger number of topics, will


need to limit the number of students to contain costs. The resulting matrix will be narrow, but very deep because of the larger number
of test items (see Figure 3).

National planners do not need to know every detail of every individual’s school performance. Just as a medical practitioner can
take a sample of tissue or body fluids under standard conditions,
subject the sample to analyses, and draw inferences about a person’s
health, a national assessment can take a sample of performance
by students under standard conditions, analyse the data and draw
inferences about the health of the educational system.

If the sample of evidence is not appropriate or representative the inferences made about current status in learning will be suspect,
regardless of how accurately the assessments are made.

Figure 3. National assessment test matrix
(a schematic narrow but deep matrix: columns for a limited sample of students, rows for many items)

If the number of questions in a national assessment is large, the testing required may be more than can be expected of typical students. It will be necessary to choose more than one representative sample of students so that evidence can be gathered on each important issue.







4 Inferring range of achievement from samples of tasks
Assessment involves selecting evidence from which valid inferences
can be made about current status in a learning sequence. If we do
not select an appropriate sample of evidence the conclusions we
draw will be suspect, regardless of how accurately we make the
assessments. It is possible to make consistent assessments which are
not meaningful in the context of the decisions we must make.

For example, we could weigh the students and give the highest
scores to those with the largest mass. This assessment could be very
consistent, particularly if our scales were accurate. However, this
assessment information is not meaningful when trying to judge
whether learning has occurred. To be meaningful in this context,
our assessment tasks (items) have to relate to what students learn.
The choice of what to assess, the strategies of assessment, and the
modes of reporting depend upon the intentions of the curriculum,
the importance of different parts of the curriculum, and the
audiences needing the information that assessment provides.

Choosing samples of tasks


Tasks chosen have to be representative so that:

• dependable inferences can be made about both the tasks chosen for assessment and the tasks not chosen;

• all important parts of the curriculum are addressed;

• achievement over the full range is assessed (not just the narrow
band where a particular selection decision might be required on
a single occasion).

Choosing representative tasks may be difficult. Remember that the tasks have to represent the whole curriculum, not just the parts
that can be tested with pencil-and-paper test items. Further, some
pencil-and-paper test items are better than others in assessing
interpretation and understanding, and in providing information
in a form that can be used to make teaching decisions. If these
qualities are required by the curriculum then they have to be
assessed by representative tasks. Pencil-and-paper test items
which only require memory are much easier to write but cannot
provide other essential evidence. For example, being able to give the
correct answers to number facts such as 6+3=?, 9+5=?, and 7x3=?
does not provide sound and dependable direct evidence about
whether a student can read a graph, or measure the length of a
strip of wood. If reading graphs is important, then the assessment
tasks should include some that involve the reading of graphs. If
measuring lengths is important, then length-measuring tasks must
be used to decide whether this curriculum objective has been met.
Constructors of tests often draw up a list of topics and types of skill
to specify what the test should cover.

On subsequent occasions it may be necessary to choose different representative tasks (otherwise the first group of tasks tested
may be the only ones taught and therefore will no longer be
representative of all the curriculum). The tests have to include easier
tasks as well as more difficult tasks. The easier tasks will allow
students to show more of what they have learned. The more difficult
tasks will allow the best students to show where they excel.


The function of the tasks is to provide meaningful evidence. The tasks have to be matched in difficulty (and complexity) to the level
of the students who are going to attempt the tasks. Tasks may be
too difficult. If students do not engage with the tasks, little or no
evidence is provided.

If students cannot do any of the tasks then they cannot provide any evidence of their achievements. If two such assessments are
made it will appear that these students have not learned anything
(because there will be no change of scores) even though they may
have learned a great deal of important knowledge and skills. Such
assessments are faulty in that they fail to recognise learning that
has occurred. (A test with items which do not allow the less able
students to show evidence of their learning may be referred to by
saying that the test has a ‘floor’ effect.)

Tasks may also be too easy. If all students can do all of the tasks
then the most able students will not be able to provide evidence of
their advanced achievements. If two such assessments are made
it will appear that these able students have not learned anything
(because their scores cannot improve) even though they may have
learned a great deal of important knowledge and skills. Such
assessments are also faulty in that they fail to recognise learning
that has occurred. (A test with many easy items which do not
allow more able students to show evidence of their learning may be
referred to by saying that the test has a ‘ceiling’ effect.)

The range of complexity of tasks should be at least as wide as the expected range of achievement for the students being assessed if
evidence of learning is required about all students. Writing test
tasks and items with desirable properties requires a great deal of
skill over and above knowledge about the curriculum, and about
how students learn. A team of trained item writers can usually
produce a better range of items to consider for trial than any
individual (or group of individuals working alone). Item writing

without the benefit of interaction with colleagues is generally inefficient and tends to be too idiosyncratic, representing only one
person’s limited view of the topic to be assessed. When inspiration
is lacking, the items written may degenerate to a trivial level.

The wider implications of choosing samples of tasks
Assessment has considerable influence on instruction. Topics
chosen for assessment and the items chosen for those topics convey
to students a view of what is considered important by those who
make the assessments. Where the assessments are external to the
schools, the items chosen convey a similar message to teachers
as well. Conversely, topics or items not chosen indicate what is
considered not important.

When assessment results have high stakes (as in the case where
results are used to select a small proportion for the next stage
of schooling or for employment), the chosen assessment tasks
have a high degree of influence on curriculum, teaching practice,
and student behaviour. When public examination papers are
published, teachers and students expend a great deal of effort
in analysing these papers, practising test-taking skills, and attempting to predict what topics will be examined so that the
whole curriculum does not have to be studied. These practices of
restricting learning to examinable topics may lead to high scores
being obtained without the associated (and expected) coverage of
the intended curriculum.

Narrow testing practices have undesirable influences on teaching. For example, tests which encourage memorizing facts rather
than understanding relationships lead to teaching which ignores
understanding regardless of national needs for people with such

skills. One possible consequence is for education authorities (and the general public) to lose confidence in the examination
system because individuals with high scores do not have the skills
and understanding required. Note that when this happens, the
authorities often increase the score for a pass hoping that this will
help regain public confidence. However requiring a higher score
on an inadequate test cannot solve a problem which depends on
more relevant items being asked. The only solution is to improve
the quality of the assessments so that they match the curriculum
intentions.

This may cost more at first because the levels of skill in writing
such tests are much higher. Generally teams of item-writers are
required rather than depending upon a very limited number of
individuals to write all of the questions. The pool of experienced
teachers with such skills will not increase in size if teachers are
not encouraged by the assessment procedures to prepare students
for higher quality assessments. Further, item writing skills develop
in part from extensive experience in writing items (especially
those that are improvements on previous items). Such experience
of item writing, of exploring the ways that students think, and
of considering the ways in which students interpret a variety
of evidence, is gained gradually. Many good item writers are
experienced and sensitive classroom teachers who have developed
the capacity to construct items which reveal (correct and incorrect)
thought processes.

Examination assessment may fail to sample tasks where there are multiple solutions, where problems may be solved in different
ways, or where more than pencil-and-paper skills are required. For
example, pencil-and-paper examinations are not good measures of
practical tasks, of a person’s skill in working in a group, or of the
capacity to develop alternatives if the first attempt fails to work. Yet
in many societies (if not all) being able to do practical tasks, to work
in groups, and to solve local problems are skills considered to be

16
Inferring range of achievement from samples of tasks

essential for survival and community participation. If an excessive


emphasis is placed on pencil-and-paper assessments then the result
may be to devalue some kinds of valuable skills that are essential in
many communities.


5. What purposes will the test serve?
Test results are interpreted in many ways. One important way
involves comparing each student’s score with the scores of a group
of students (who are supposed to be like the student for whom the
comparison is being made). Such comparisons can tell us how well
one student scored relative to another but they do not tell us which
students are competent in a chosen area or suggest what might be
done to increase performance.

A second important way of interpreting results involves comparing


each student’s results with a set of fixed requirements. Such
comparisons can tell teachers and administrators the proportion
of students with acceptable levels of skill and can identify those
topics needing extra or different teaching and/or learning strategies.
Tests prepared for this second purpose can be used to rank-order
students as well. (Tests prepared for rank-ordering tend to exclude
questions which many can answer successfully because there is
little interest in what skills people have or do not have if the only
purpose is to establish an ordered list of students.)

Describing changes in terms of total scores only is


counterproductive. The scores can only be understood in the
context in which they were collected. For example, a score of 60 per
cent on one test (with easy items) may be worth less in achievement
level terms than a score of 30 per cent on another test with difficult
items in the same content area.

Results used to compare students
Norm-referenced tests: These tests provide the results for
a reference group on a representative test and therefore scores
on the test are normally presented in terms of comparisons with
this reference group. If the reference group serves as a baseline
group, norm-referenced scores can provide evidence of learning
improvement (or decline) for the student population although this
is in terms of a change in score rather than an indication of what
students can now do (or not do) compared with what they could do
before. If changes in score are reported (for example, a difference
in average score), administrators have little evidence about the
strengths and weaknesses reflected in the results for particular
topics and may rely on rather limited experience (such as their
own experiences as a student) to interpret the changes. This could
result in increased expenditure on the wrong topic, wasting scarce
resources, and not addressing the real problems.

This evidence may be compromised by the actual test becoming


known to teachers. The result may be that they (quite naturally)
begin to emphasize the work covered in the test and therefore
scores may well rise. This rise does not provide evidence of
improved performance on the curriculum as a whole by teachers
and students. Where the rise is at the expense of studies in other
important parts of the curriculum not sampled in this particular
test the effect is to destroy the representative nature of the actual
test as a measure of progress in the curriculum.

The use of norm-referenced tests also depends on the curriculum


remaining static. If curriculum changes are introduced or time
allocations are changed, a representative ‘snapshot’ of the initial
curriculum may not be representative of the changed curriculum.
Comparisons with the original reference group are then not
appropriate.


Norm-referenced tests are often wide-range tests. They can be used


to provide an order of merit for competitive selection purposes
where the test chosen is relevant, in general terms, for the skills
needed. However the scores which are used to determine each
candidate’s standing provide little information of direct use to a
teacher. Other information is required with the test if advice to
students and teachers is considered to be one of the important
outcomes associated with using the test scores. Such tests are often
prepared with a particular curriculum in mind. It is important to
check each question against the curriculum to see whether there
is a good match between the curriculum and the norm-referenced
test authors’ ‘assumed curriculum’. For example, Australian
mathematics syllabuses introduce algebra and geometry in the
seventh year of schooling while some curriculum statements from
North America assume a much later introduction. Further, the
balance of items for the curriculum may not match the balance
shown in the chosen test. If some items in the chosen test are not
appropriate for the curriculum then the comparison tables of total
scores for the chosen norm-referenced test will not be appropriate or
meaningful. Some test publishers are able to re-calculate norm-
referenced tables for meaningful comparison purposes, but this
often depends on the availability of the full data and trained staff
who are able to undertake the re-calculation.

Very few tests provide the user with a strategy for making such
adjustments for themselves although some tests prepared using
Item Response Theory or Latent Trait Theory do enable qualified
and experienced users to estimate new norm tables for particular
sub-sets of items.


Results used to compare students with a fixed requirement
Criterion-referenced tests: These tests report performance in
terms of the skills and knowledge achieved by the students and do
not depend explicitly on comparisons with other groups of students.
Often a criterion-referenced test will include all of the criteria in the
curriculum that are of importance (rather than rely on a sample as
in the case of norm-referenced tests).

Criterion-referenced scores can provide evidence of learning


improvement (or decline) of the student population as an indication
of what students can now do (or not do) compared with what they
could do before. This evidence may be reported as proportions of
students who have achieved particular skills and is less susceptible
to curriculum changes (provided those skills are still required
in the changed curriculum). There is less likelihood of criterion-
referenced tests being compromised by the actual test becoming
known to teachers. If they emphasize the work assessed by each test
(rather than particular items being used for a test) they will have
covered all important objectives of the curriculum. A rise in the
proportion of successful students will provide evidence of improved
performance on the curriculum as a whole by teachers and students
(provided that the rise was not achieved by excluding students
on the basis of school performance or by being more selective in
enrolling students).

Mastery tests: These tests are generally criterion-referenced tests


with a relatively high score requirement. Students who meet this
high score are said to have mastered the topic. It is assumed that
the mastery test has sufficient items of high quality to ensure that
the score decision is well founded with respect to the domain of
interest.


For example, in mathematics the domain might be ‘all additions of


pairs of one-digit numbers where the total does not exceed 9’. A
mastery test of this domain should have a reasonable sample of all
possible combinations of one-digit numbers because the mastery
decision implies that all can be added successfully even though all
are not tested. This simple example should not be taken to imply
that mastery testing is limited to relatively trivial skills. A more
complex example is the regular testing of airline pilots. Safety
requirements result in high standards being set for mastery in many
areas. Failure to reach mastery will result either in further tuition
under the guidance of an experienced tutor or in withdrawal of the
permission to fly.
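To make the idea of sampling from a defined domain concrete, the short Python sketch below is a minimal illustration (the sample size of ten items is an arbitrary assumption): it enumerates the domain of all additions of pairs of one-digit numbers whose total does not exceed 9, and then draws a random sample of items for one form of a mastery test.

import random

# Enumerate the full domain: all ordered pairs of one-digit numbers
# (0-9) whose sum does not exceed 9.
domain = [(a, b) for a in range(10) for b in range(10) if a + b <= 9]

# Draw a random sample of items from the domain for one mastery test form.
# The sample size (10) is an arbitrary choice for illustration.
random.seed(1)
test_items = random.sample(domain, 10)

for a, b in test_items:
    print(f"{a} + {b} = ?")

print(f"Domain size: {len(domain)} possible items; sampled {len(test_items)}.")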

Other labels for categories of test use


Speed test and power test: Some tests have very easy items
but there is a limited amount of time to answer them. Such speed
tests are used to see how quickly students can work on skills they
have already mastered. One example is a test of keyboard skills.
The teacher may wish to find out how fast students can maintain
accurate work when typing data on a typewriter or computer
keyboard.

In contrast, power tests are concerned with identifying skills which


have been mastered. Power tests require adequate samples of
student behaviour – so having sufficient time to attempt most of the
items is an essential pre-requisite for such tests.

Aptitude or ability test and achievement test: Achievement


tests may be used to assess the extent to which curriculum
objectives have been met in an educational programme. Such tests
should have tasks which relate to the learning that students have to
demonstrate. Since future learning depends to some extent on past
learning, success on such achievement tests may provide evidence


of future success (provided that other conditions such as good


teaching, adequate health care, and stable family circumstances are
maintained).

Tests which are constructed specifically to gather evidence about


ability to learn are referred to as aptitude or ability tests. Results on
such tests are used to predict future success on the basis of success
on the specially selected tasks in the aptitude test. Often these tasks
differ from the usual school learning requirements and depend to
some extent on learning beyond the school curriculum. Of course,
teaching students the test items and the corresponding answers
may result in an increase in score without actually changing a
student’s (real) aptitude.

Objective test: The term ‘objective’ can have several meanings


when describing a test. It can mean that the score key for the test
needs a minimum of interpretation in order to score an item correct
or incorrect. In this sense, an objective test is one which requires
task responses which can be scored accurately and fairly from the
score key without having knowledge of the content of the test. For
example, a multiple-choice test can be scored by a machine or by
a clerical worker without either the machine or the clerical worker
having had to reach a high level of expertise on the material being
tested.

A less common usage relates to the extent of agreement between


experts about the correct answer. If there is less argument about the
correct answer the item is regarded as more objective. However the
choice of which items (whether objective in their answer format or
not) are to appear on a test is subjective, in that it depends on the
personal preferences and experiences of those constructing the test.

Standardised test: The term ‘standardised’ also has a number


of meanings with respect to testing. It can mean that the test has
an agreed format for administration and scoring so that the task


is as identical as possible for all candidates and there is little room


for deviation in the scoring of candidate responses to the tasks.
Another meaning refers to the way in which the scores on a test
are presented. For example, if each raw score is expressed as its
deviation from the mean divided by some measure of dispersion
such as the standard deviation, the resulting score scale is said to be
in terms of standardised scores (sometimes called standard scores).

Finally, the term can refer (loosely) to a published test which was
prepared by standard (or conventional) procedures. The usage of
‘standardised’ has become somewhat confused because published
tests often present scores interpreted in terms of deviation from the
mean (or average) and have a standard procedure for administering
tests and interpreting results.
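As an illustration of standard scores, the following minimal Python sketch converts a small set of raw scores (invented for the example) into standardised scores by expressing each score as a deviation from the mean in standard-deviation units.

from statistics import mean, stdev

# Invented raw scores for a small group of candidates (illustration only).
raw_scores = [23, 31, 28, 40, 35, 27, 33, 30]

m = mean(raw_scores)          # mean of the raw scores
s = stdev(raw_scores)         # standard deviation (measure of dispersion)

# A standard score expresses each raw score as its deviation from the
# mean in standard-deviation units: z = (x - mean) / sd.
standard_scores = [(x - m) / s for x in raw_scores]

for x, z in zip(raw_scores, standard_scores):
    print(f"raw = {x:3d}   standard score = {z:+.2f}")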

Diagnostic test: This term refers to the use made of the


information gained from administration of the test. The implication
is that the test results will assist in identifying both the topics which
are not known and in providing information on potential sources
of the student’s difficulty. Teachers may be expected to provide
appropriate teaching for each difficulty exposed by the use of a
diagnostic test. For example, a simple open-ended mathematics
question about area, given to junior secondary level classes provided
a range of correct and incorrect answers. The question was:

A farmer built a fence around a rectangular plot of land. The longer


sides were 5 metres and the shorter sides were 3 metres in length
(See diagram below). What is the area of the fenced land?

[Diagram: a rectangle with its two longer sides labelled 5 metres and its two shorter sides labelled 3 metres.]


Answers included 8 metres², 16 metres, 16 metres², 15 metres²,
30 metres², and 225 metres². Those who gave 8 metres², 16 metres
or 16 metres² as answers were confusing perimeter and area. They
probably added 3, 5, 3, and 5 to obtain 16 or added 5 and 3 to obtain
8. Those who gave 30 metres² probably multiplied 5 by 3 twice and
added the two results, while those who gave 225 metres² probably
multiplied 3 by 5 by 3 by 5. Being able to show how the wrong
answers were obtained may help the teacher to plan remediation
(or the curriculum developer to devise suitable activities which
will make the distinction between area and perimeter clearer, so
avoiding the problem). Some of those who gave the correct answer
of 15 metres² showed their understanding of the task by sketching
the 15 one-metre squares like this.

[Diagram: the same 5 metre by 3 metre rectangle divided into 15 one-metre squares.]
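One way of recording this kind of diagnostic information systematically is sketched below. The mapping from observed answers to likely misconceptions simply encodes the interpretation given above, and the class responses are invented for the example.

# Likely interpretations of answers to the 5 m x 3 m area question,
# encoded from the discussion above (illustrative only).
diagnosis = {
    "15 m2": "correct: area = 5 x 3",
    "8 m":   "confused perimeter and area (added 5 and 3)",
    "16 m":  "confused perimeter and area (added 3 + 5 + 3 + 5)",
    "16 m2": "confused perimeter and area (added 3 + 5 + 3 + 5)",
    "30 m2": "multiplied 5 by 3 twice and added the two results",
    "225 m2": "multiplied 3 x 5 x 3 x 5",
}

# Hypothetical responses from a class, used only to show the tally.
responses = ["15 m2", "16 m", "15 m2", "8 m", "30 m2", "16 m2", "225 m2"]

for answer in responses:
    print(answer, "->", diagnosis.get(answer, "needs individual follow-up"))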

Practical test: In some senses an essay test is a practical task.


The essay item requires a candidate to perform. This performance
is intended to convey meaning in a practical sense by writing
prose to an agreed format. However the term ‘practical test’ goes
beyond performance and other tasks used in traditional pencil-
and-paper examinations. The term may refer to practical tasks in
trade subjects (such as woodwork, metalwork, shipbuilding, and
leathercraft), in musical and dramatic performance, in skills such
as swimming or gymnastics, or may refer to the skills required to
carry out laboratory or field tasks in science, agriculture, geography,
environmental health or physical education.


6. What types of task?


The kinds of question to be used in a test depend upon the age
and learning experiences of the students, the achievements to be
measured, the extent of the answer required, and the uses to be
made of the information collected. The choice of tasks can also
be influenced by the number of candidates and the time available
between the collection of the evidence and the presentation of the
results.

Tasks requiring constructed responses


Some items require a response to be composed or constructed,
whether written, drawn, or spoken. An essay question, for example,
“Write three paragraphs describing the assessment context in your
own nation and identify the key issues that need to be addressed”,
generally requires the candidate to compose several written
sentences as the response. An oral test may have a similar task but
the candidate is required to respond orally instead of in writing.
The task may require production of a diagram, flow-chart, drawing,
manipulation of equipment (as in finding the greatest mass using
balance scales), or even construction (for example, weaving or
building a model). More extensive tasks such as projects and
investigations may require preparation of a report identifying the
problem and describing the approach to the problem as well as the
results obtained while attempting to solve the problem.

There are potential difficulties in scoring such prose, oral, drawn
and manipulative responses. An expert judge is required because
each response requires interpretation to be scored. Judges vary
in their expertise, vary over time in the way they score responses
(due to fatigue, difficulty in making an objective judgment without
being influenced by the previous candidate’s response, or by
giving varying credit for some correct responses over other correct
responses), and vary in the notice they take of handwriting,
neatness, grammatical usage and spelling.

One technique to avoid or minimise such problems is to train a


team of scorers. Such training often involves a discussion of what
is being looked for, the key issues that have to be identified by a
candidate. Then the scorers should apply what they have learned
by scoring the same batch of anonymous real samples of responses.
It is important to have a range of real samples. (The training is to
ensure that scorers can tell the difference between high quality,
medium quality, and low quality answers and assign marks so that
the higher quality answers will get better scores than the medium
quality answers, and medium quality answers in turn will get
better scores than low quality answers.) These results are then
compared (perhaps graphically) and discussed. The aim is not to get
identical results for each scorer. Rather, the aim is to improve the
agreement between scorers about the quality of each response. We
expect that there should be greater agreement between the scorers
where the responses are widely separated in quality. Making more
subtle distinctions consistently requires more skill. Members of the
scoring team may differ in the importance they place on various
aspects of a task and fairness to all candidates requires consistency
of assessment within each aspect. Even when team members agree
in the rank ordering of responses, the marks awarded may differ
because some team members are lenient while others are more
stringent. A more subtle difference occurs when some judges
see more ‘shades of grey’ or see fewer such gradations (as in the
tendency to award full-marks or no marks).
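A simple way of examining such agreement after a training session is sketched below, using invented marks from two scorers: the correlation between the two sets of marks reflects agreement about the relative quality of the responses, while the mean difference reveals leniency or stringency.

from statistics import mean

# Invented marks awarded by two trained scorers to the same ten essays.
scorer_1 = [8, 5, 9, 3, 7, 6, 4, 9, 2, 6]
scorer_2 = [7, 5, 8, 4, 6, 7, 3, 9, 3, 5]

m1, m2 = mean(scorer_1), mean(scorer_2)

# Pearson correlation between the two sets of marks: high values indicate
# that the scorers agree about the relative quality of the responses even
# if one scorer is consistently more lenient than the other.
num = sum((a - m1) * (b - m2) for a, b in zip(scorer_1, scorer_2))
den = (sum((a - m1) ** 2 for a in scorer_1) *
       sum((b - m2) ** 2 for b in scorer_2)) ** 0.5
print(f"Agreement (correlation) between scorers: {num / den:.2f}")

# The mean difference shows leniency or stringency: a non-zero value with
# a high correlation suggests consistent ranking but different standards.
print(f"Mean difference (scorer 1 - scorer 2): {m1 - m2:+.2f}")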


Short-answer items may require a candidate to recall knowledge


rather than recognise it (to produce an answer rather than make a
choice of an answer) or may be restricted to recognition. The former
may be something like miniature essays (or the oral or drawn
equivalent), or may require a word or phrase to be inserted (as
in cloze procedure or fill-the-gap). Recognition tasks may require
a key element of a drawing/photograph/diagram/prose passage
to be identified, as in the case of a proof-reading test of spelling
or choosing the part of a diagram or poster which has a safety
message.

Scoring short responses carries some of the same difficulties as


scoring more extended responses but it is generally easier for judges
to be consistent, if only because the amount of information to be
considered is smaller and likely to be less complex. However, the
quality assurance process is still a necessary part of the scoring
arrangements for short responses. Tests that have only short
responses may neglect the real world’s need for extended responses.

Tasks requiring choice of a correct or best alternative
Some items present a task and provide alternative responses. The
candidate’s task is to identify the correct or the best alternative.

Sometimes such tasks require items in one list to be matched with


items in another list but these tasks tend to be artificial; good
tasks of this type are difficult to construct. Also, scoring may
present problems when both lists are the same size. Those who are
successful in choosing some of the links have their task of choosing
the remaining links made easier. Those who are not successful with
some links are faced with a more difficult task. It is not usually
regarded as good practice to have success on one task influencing
success on another separate task.


Some good matching items can be constructed if the number


of links required in the answer is restricted. For example, this
mathematics task requires only one link to be made out of a
possible 6. (The six are A-B, A-C, A-D, B-C, B-D, and C-D. Note that
the links can be written in reverse too: B-A, C-A, and so on.)

Two of these shapes have the same area.

[Diagram: four shapes labelled A, B, C and D.]

Which two are they? ........... and ..............

Multiple-choice items present some information followed by three


or four responses, one of which is correct. The others, called
distractors, are unequivocally incorrect, but this should be obvious
only to candidates who ‘know’ that aspect of the work. An extreme
case is where there are only two choices (as in ‘true-false’, ‘yes-no’,
“feature absent-feature present”).

For example, the following multiple-choice item has four options,


with only one the correct response.

The term platyrrhini refers to a group of animals which includes:

A Platypus

B Marmosets

C Flatworms

D Plankton


There are some potential difficulties with multiple-choice items. For


example, it is possible to score this kind of test without knowing
any answers to the items. The so-called ‘correction for guessing’
does not work – those who are lucky in guessing correct answers
do not lose their advantage and those who are unlucky in their
guessing do not get any compensation. Those that do not guess may
be disadvantaged relative to those who have lucky guesses.

Further, using such a ‘correction’ increases the examiner’s work and


provides an opportunity for calculation errors which may reduce
the accuracy of the scores.

The probability of gaining high scores without knowledge is


greater if there are only two choices. This factor, combined with the
difficulty of constructing pairs of plausible choices, and the fact that
‘correction for guessing’ does not work, makes it unwise to use two-
choice items (like ‘true-false’) in tests.

However, with a well-constructed test with an adequate number


of items (each with three to five distractors), the probability of
achieving a high score by random guessing is very small. If all
items in a test are answered by all candidates, then applying a
correction formula does not alter the rank order of candidates.
In the educational context, most (if not all) tests should have
sufficient time for most students to attempt most items. In this way,
adequate time to attempt the items allows an adequate sample of
performance to be gathered.
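The conventional ‘correction for guessing’ referred to above is usually calculated as the number right minus the number wrong divided by one less than the number of options per item (R − W/(k − 1)). The minimal Python sketch below, using invented scores, applies the formula and shows why the rank order of candidates is unchanged when every candidate attempts every item.

def corrected_score(right, wrong, options=4):
    """Conventional correction for guessing: R - W/(k - 1)."""
    return right - wrong / (options - 1)

# Invented results for a 40-item test in which every candidate
# attempts every item (so wrong = 40 - right).
n_items = 40
raw_right = [36, 30, 24, 18, 12]

for right in raw_right:
    wrong = n_items - right
    print(f"right = {right:2d}  raw = {right:2d}  "
          f"corrected = {corrected_score(right, wrong):5.2f}")
# Because wrong = n_items - right for everyone, the corrected score is a
# fixed linear function of the raw score, so the rank order is unchanged.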


Is one type of question better than another?
One important advantage of multiple-choice items is that the
scoring is very consistent from marker to marker, relatively rapid,
and can be undertaken by machine or by clerical staff. By contrast,
performance tasks like essay items require markers skilled in
assessing essays in the appropriate content area, take more time,
and the markers may have problems in achieving consistency.

However, whatever type of question is used, the critical issue


is whether the test provides a valid assessment of skills and
knowledge in relation to the course objectives. It may be more
appropriate to have items of both types in one test or examination
(perhaps administered in separate sessions). It may also be
necessary to combine test results with other evidence from practical
tasks.


7. The test construction steps


Before deciding to construct a test, one needs to know what
information is required, how quickly it is needed, and the likely
actions that are to be taken according to the results on a test. The
crucial question is, “What information is needed about student
achievement?” A second important question is, “Can we afford
the resources needed to gather this information?” These resources
include the costs involved in providing human resources, word
processing and computing facilities, materials and equipment for
photocopying and printing. Figure 4 shows the stages in the test
construction process.

Content analysis and test blueprints


A content analysis provides a summary of the intentions of the
curriculum expressed in content terms. Which content is supposed
to be covered in the curriculum? Are there significant sections of
this content? Are there significant sub-divisions within any of the
sections? Which of these content areas should a representative test
include?

A test blueprint is a specification of what the test should cover
rather than a description of what the curriculum covers. A test
blueprint should include the test title, the fundamental purpose
of the test, the aspects of the curriculum covered by the test, an
indication of the students for whom the test will be used, the types
of task that will be used in the test (and how these tasks will fit in
with other relevant evidence to be collected), the uses to be made
of the evidence provided by the test, the conditions under which
the test will be given (time, place, who will administer the test, who
will score the responses, how accuracy of scoring will be checked,
whether students will be able to consult books (or use calculators)
while attempting the test, and any precautions to ensure that the
responses are only the work of the student attempting the test), and
the balance of the questions. An example is shown in Figure 5.

Figure 4. Stages in test construction

Decision to gather evidence
↓
Decision to allocate resources
↓
Content analysis and test blueprint
↓
Item writing
↓
Item review 1
↓
Planning item scoring
↓
Production of trial tests
↓
Trial testing
↓
Item review 2
↓
Amendment (revise/replace/discard)
↓
More items needed? → Yes: return to Item writing
↓ No
Assembly of final tests

This one-hour test is to assess prior knowledge of statistics of


teacher trainees before they commence an intensive course in test
construction and analysis. Items are to be multiple-choice in format
with 4 options being presented for each item. A passing score is
to be set for each content area; those below this cut-off score in an
area must attend additional classes to improve their skills in that
area. The 54-item test is to have several parallel forms and will be
administered on a secure basis by the lecturer in charge of the test
construction and analysis course. No books or calculators will be
permitted for the test. Results will be provided the day after the
testing. The test blueprint is shown in Figure 5 below.

Figure 5. Test blueprint for basic statistics

                                          Objectives
Content                    Recall of facts   Understanding   Computational skills   Total
Frequency distributions    2 items           -               4 items                  6
Means                      2 items           4 items         2 items                  8
Variances                  2 items           4 items         2 items                  8
Correlation                4 items           4 items         12 items                20
Relative standing          4 items           -               8 items                 12
Total                      14                12              28                      54


Comparing the test blueprint with the analysis of the curriculum


should show that the allocation of items across the cells in Figure 5
provides a reasonably representative sample of what the curriculum
is about (at least as far as content is concerned). Test blueprints
may include other dimensions too. For example, the blueprint may
indicate the desired balance between factual recall questions and
questions which require interpretation or application to a particular
context. Or the blueprint may show the desired balance between
different item formats (constructed responses as compared with
recognition responses).

When the test blueprint has several dimensions it is possible to see


how the evidence to be collected combines these dimensions with
various types of evidence by means of a grid (or series of grids) and
how account is taken of the importance of that evidence.
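Where such a grid is prepared on a computer, it can be helpful to keep the blueprint in a simple data structure so that the balance of items can be checked automatically while items are being written. The Python sketch below is illustrative only: it reproduces the Figure 5 grid and confirms the row and column totals.

# The Figure 5 blueprint: rows are content areas, columns are objectives.
blueprint = {
    "Frequency distributions": {"Recall of facts": 2, "Understanding": 0, "Computational skills": 4},
    "Means":                   {"Recall of facts": 2, "Understanding": 4, "Computational skills": 2},
    "Variances":               {"Recall of facts": 2, "Understanding": 4, "Computational skills": 2},
    "Correlation":             {"Recall of facts": 4, "Understanding": 4, "Computational skills": 12},
    "Relative standing":       {"Recall of facts": 4, "Understanding": 0, "Computational skills": 8},
}

# Row totals: items allocated to each content area.
for content, cells in blueprint.items():
    print(f"{content:<25} {sum(cells.values()):3d} items")

# Column totals: items allocated to each objective.
objectives = ["Recall of facts", "Understanding", "Computational skills"]
for obj in objectives:
    print(f"{obj:<25} {sum(cells[obj] for cells in blueprint.values()):3d} items")

total = sum(sum(cells.values()) for cells in blueprint.values())
print("Total items:", total)   # should be 54 for this blueprint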

Item writing
Item writing is the preparation of assessment tasks which can reveal
the knowledge and skill of students when their responses to these
tasks are inspected. Tasks which confuse, which do not engage the
students, or which offend, always obscure important evidence by
either failing to gather appropriate information or by distracting
the student from the intended task. Sound assessment tasks
will be those which students want to tackle, those which make
clear what is required of the students, and those which provide
evidence of the intellectual capabilities of the students. Remember,
items are needed for each important aspect as reflected in the test
specification. Some item writers fall into the trap of measuring what
is easy to measure rather than what is important to measure. This
enables superficial question quotas to be met but at the expense
of validity – using questions that are easy to write rather than
those which are important distorts the assessment process, and
therefore conveys inappropriate information about the curriculum
to students, teachers, and school communities.


Item review
The first form of item analysis: Checking intended against actual
Writing assessment tasks for use in tests requires skill. Sometimes
the item seems clear to the person who wrote it but may not
necessarily be clear to others. Before empirical trial, assessment
tasks need to be reviewed by a review panel (with a number of
people) with questions like:

• Is the task clear in each item? Is it likely that the person


attempting an item will know what is expected?

• Are the items expressed in the simplest possible language?

• Is each item a fair item for assessment at this level of education?

• Is the wording appropriate to the level of education where the


item will be used?

• Are there unintended clues to the correct answer?

• Is the format reasonably consistent so that students know what


is required from item to item?

• Is there a single clearly correct (or best) answer for each item?

• Is the type of item appropriate to the information required?

• Are there statements in the items which are likely to offend?

• Is there content which reflects bias on gender, racial, or other


grounds?

• Are the items representative of the behaviours to be assessed?


• Are there enough representative items to provide an adequate


sample of the behaviours to be assessed?

This review before the items are tried should ensure that we avoid
tasks which are expressed in language too complex for the idea
being tested, avoid redundant words, multiple negatives, and
distractors which are not plausible. The review should also identify
items with no correct (or best) answer and items with multiple
correct answers. Such items may be discarded or re-written.

Other practical concerns in preparing the test
• How much time will students have to do the actual test? What
time will be set aside to give instructions to those students
attempting the test? Will the final number of items be too large
for the test to be given in a single session? Will there be a break
between testing sessions when there is more than one session?

• Will the students be told how the items are to be scored? Will
they be told the relative importance of each item? Will they be
given advice on how to do their best on the test?

• What test administration information will be given to those


who are giving the trial test to students? Will the students be
told that the results will be returned to them? Are the tests to be
treated as secure tests (with no copies left behind in the venue
where the test is administered)?

• Do students need advice on how they are to record their


responses? If practice items are to be used for this purpose,
what types of response should they cover? How many practice
items will be necessary?


• Will the answers be recorded on a separate answer sheet


(perhaps so that a test booklet can be used again)? Will this
use of a separate sheet add to the time given for the trial test?
What information should be requested in addition to the actual
responses to the items? (This might include student name,
school, year level, sex, age, etc.)

• Has the layout of the test (and answer sheet if appropriate) been
arranged for efficient scoring of responses? Are distractors for
multiple-choice tests shown as capital letters (easier to score
than lower case letters)?

• Have the options in multiple-choice items been arranged in some


logical order (for example, from smallest to largest)? Have the items
been placed in order from easiest to most difficult (to encourage
candidates to continue through the test)? Has the layout of items
avoided patterns in the correct answers such as 3 or more of the
same letter in a row, or other patterns like ABCD or ABABAB
(which might lead to ‘correct’ responses for the ‘wrong’ reasons)?

Item scoring arrangements


Multiple-choice: Judgments of experts are needed to establish
which option is the best (or correct) answer for each item. Once
these correct answers have been decided, the score key can then be
used by clerical staff or incorporated in machine scoring.

Constructed response: What preparation do the scorers need?


Should they practice with a sample of papers to ensure that good
work is given due credit, poor work is recognised consistently, and
that each scorer makes use of similar ranges of the scale? Should
each paper (or a sample of papers) be remarked without knowledge
of the other assessment? If large differences occur in such a case,
what should be the next step?


Trial of the items


Item trial is sometimes called pilot testing – but in this context
it does not mean testing those who fly aeroplanes. As well as
considering the best efforts of item writers and item reviewers as
a means of eliminating faulty items and improving the quality of
items, it is necessary to subject the proposed items to empirical trial
with students similar to those who are going to use the final form
of the test. Since items involve communication with students, an
evaluation of this quality is required before the set of tasks can be
used with a larger group.

Each trial paper should be attempted by 150-250 persons who


are similar to those who will attempt the final forms of the test. It
is usual to allocate the trial forms on a random basis within each
trial examination room so that (on the average) each trial test is
attempted by candidates of comparable ability. The same form of
a test should not be given to candidates sitting in adjacent seats
so as to ensure that candidates do not improve their scores by
looking at another candidate’s paper. It is wise to have some visible
distinguishing mark on the front of each version of the test. Then
the test supervisor can see at a glance that the trial tests have been
alternated. If distinguishing marks cannot be used, then a different
color of cover page should be used for each version.
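One simple way of preparing such an allocation in advance is sketched below: the trial forms are assigned to seats in a repeating sequence that starts at a randomly chosen form for each room, so that adjacent candidates receive different forms and each form is used about equally often. The form labels and number of seats are invented for the example.

import random
from itertools import cycle, islice

forms = ["Form A", "Form B", "Form C"]     # trial test versions (illustrative)
seats_in_room = 25                         # invented room size

# Start the cycle at a random point for each room, then repeat the forms in
# order down the seats: adjacent candidates get different forms and each
# form is used about equally often.
random.seed(7)
start = random.randrange(len(forms))
allocation = list(islice(cycle(forms[start:] + forms[:start]), seats_in_room))

for seat, form in enumerate(allocation, start=1):
    print(f"Seat {seat:2d}: {form}")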

Undertaking trial testing requires sound planning. Institutions


which have agreed to allow trial testing to occur on their premises
need to be contacted in advance of the trial testing. The numbers
of trial candidates, the balance between males and females, the
diversity of age levels or schooling levels required for the trials, the
size of the rooms, and the availability of test supervisors are all
issues that need to be discussed. The test supervisor introduces the
test to the trial candidates, explains any practice items, and has to
ensure that candidates have the correct amount of time allowed to
attempt the test, that any last minute queries are answered (such


as informing those attempting the trial tests that their results will
be used to validate the items and will not have any effect on their
current course work), and has to gather all test materials before
candidates leave the room.

Processing test responses after trial testing
If the test needs to be scored before analysis, this scoring is done
next. If there are essays to be scored, it is good practice to mark the
first essay all the way through the stack of test papers. Then start
the stack again to score the next essay. When all items have been
marked, the scores on each item are entered into a computer file. If
the test is multiple-choice in format, the responses may be entered
into a computer file directly.

Item analysis
The second form of item analysis: involving responses by real candidates
Empirical trial can identify instances of confused meaning,
alternative explanations not already considered by the test
constructors, and (for multiple-choice questions) options which are
popular amongst those lacking knowledge, and ‘incorrect’ options
which are chosen for some reason by very able students.

This trial allows the gathering of evidence about each item
– whether items can distinguish those students who are
knowledgeable from those lacking knowledge, whether items are
of an appropriate difficulty (how many attempted each item and
what percentage responded correctly), and, in the case of multiple-
choice items, whether the various options, both ‘correct’ and
‘incorrect’, performed as expected. The item analysis also provides
an opportunity to collect information about how each item performs
relative to other items in the same test, and to judge the consistency
of the whole test.
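The evidence described here can be summarised from a simple matrix of scored responses (1 = correct, 0 = incorrect). The Python sketch below, which uses invented trial data, computes two common classical indices for each item: the facility (percentage answering correctly) and a discrimination index that compares the top-scoring and bottom-scoring groups of candidates.

# Scored responses from a (very small) invented trial: one row per candidate,
# one column per item; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

totals = [sum(row) for row in responses]
n_items = len(responses[0])
order = sorted(range(len(responses)), key=lambda i: totals[i])
group = len(responses) // 3                      # size of top and bottom groups
bottom, top = order[:group], order[-group:]

for item in range(n_items):
    correct = [row[item] for row in responses]
    facility = 100 * sum(correct) / len(correct)          # per cent correct
    p_top = sum(responses[i][item] for i in top) / group
    p_bot = sum(responses[i][item] for i in bottom) / group
    discrimination = p_top - p_bot                        # top minus bottom
    print(f"Item {item + 1}: facility = {facility:5.1f}%  "
          f"discrimination = {discrimination:+.2f}")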

Amending the test by discarding/revising/replacing items
Items which do not perform as expected can be discarded or
revised. However discarding questions when there is a shortage
of replacement questions can lead to distortions of the achieved
test specification. If the original specification represents the best
sampling of content, skills, and item formats, in the judgments of
those preparing and reviewing the test, then leaving some cells of
the grid vacant will indicate a less than adequate test. To avoid this
possibility, test constructors may prepare three or four times as
many questions as they think they will need for each cell in the
grid.

Assembling the final test (or a further trial test) and the corresponding score key
After trial, tasks may be re-ordered to take account of their
difficulty. Usually the easiest questions are presented first. This is to
encourage candidates to proceed through the test and to ensure that
the weaker candidates do not become discouraged before providing
adequate evidence of their achievements and skills. Minor changes
to items may have to be made for layout reasons (for example, to
keep all of an item on one page of the test, or to avoid obvious
patterns in the list of correct answers). Items representing a single
cell within a test specification should vary in item content and
difficulty. The position of the correct option in multiple-choice items
(A, B, C, D or E) should also vary and each position should be used
to a similar extent. Some questions may have minor changes to
wording, others may be replaced. The final test should be consistent
with the test blueprint. The item review procedures described above
are repeated (particularly important where stimulus material must
be associated with more than one question) and each reviewer
should work independently through the proposed test and
provide a ‘correct’ answer for each question. This enables the test
constructor’s (new) list of correct answers to be checked.
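Two of these assembly checks, ordering the questions from easiest to hardest and making sure that the correct-option positions are used to a similar extent, are easy to automate. The sketch below uses invented item records; the field names and values are assumptions made only for the illustration.

from collections import Counter

# Invented records for items that survived the trial: each has a facility
# value (per cent correct at trial) and the position of the correct option.
items = [
    {"id": "Q07", "facility": 85, "key": "B"},
    {"id": "Q12", "facility": 42, "key": "D"},
    {"id": "Q03", "facility": 71, "key": "A"},
    {"id": "Q19", "facility": 58, "key": "B"},
    {"id": "Q05", "facility": 90, "key": "C"},
    {"id": "Q14", "facility": 35, "key": "A"},
]

# Present the easiest questions first (highest facility = easiest).
ordered = sorted(items, key=lambda item: item["facility"], reverse=True)
print("Proposed order:", [item["id"] for item in ordered])

# Check how often each correct-option position is used.
key_counts = Counter(item["key"] for item in items)
print("Correct option positions:", dict(key_counts))
# A very uneven count (or long runs of the same letter in the proposed
# order) would suggest re-arranging the options in some items.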

Validity, reliability and dimensionality


Test validity can be interpreted as usefulness for the purpose. Since
purposes vary, it is important to specify which purpose applies
when making a comment about validity. Content validity refers
to the extent to which the test reflects the content represented in
curriculum statements (and the skills implied by that content). A
test with high content validity would provide a close match with the
intentions of the curriculum, as judged by curriculum experts and
teachers.

A test with high content validity for one curriculum may not be
as valid for another curriculum. This is an issue which bedevils
international comparisons where the same test is administered in
several countries. Interpretation of the results in each country has
to take account of the extent to which the comparison test is content
valid for each country. If two countries have curricula that are only
partly represented in the test, then comparisons between the results
of those countries are only valid for part of the data.


When test results are compared with an agreed external criterion


such as a direct measure of actual performance of tasks in the ‘real’
world, this type of validity is called criterion-related validity. If there
is little time delay between the test and the actual performance, the
criterion-related validity may be referred to as concurrent validity. If
there is a longer time delay between the test and subsequent actual
performance, the criterion-related validity may be referred to as
predictive validity.

If we think in terms of achievement as a generalized construct,


and our test tends to be consistent with other recognized measures
of that construct, we say that the test has construct validity as a
measure of achievement. Similarly, if we think in terms of aptitude
as a generalized construct, and our test tends to be consistent with
other recognized measures of that construct, we say that the test
has construct validity as a measure of aptitude. The higher the
degree of agreement, the higher the construct validity. However,
this is not a fixed state of affairs. Particular tests may have high
construct validity as achievement measures or predictive validity
as an indicator of later success. If circumstances change (such as
teachers teaching to that particular test or tests) the scores on the
test may well rise considerably without improving the predictions.
The assumed association between the test and the predicted
behaviour no longer holds, and raising the cut-off on the test will
not rectify the problem.

When tests have high construct validity we may argue that this
is evidence of dimensionality. When we add scores on different
parts of a test to give a score on the whole test, we are assuming
dimensionality without checking whether our assumption is
justified. Similarly, when item analysis is done using the total score
on the same test as the criterion, we are assuming that the test
as a whole is measuring a single dimension or construct, and the
analysis seeks to identify items which contradict this assumption.


Earlier in this discussion, it was argued that validity refers to


usefulness for a specified purpose and can only be interpreted
in relation to that purpose. In contrast, reliability refers to the
consistency of measurement regardless of what is measured.
Clearly, if a test is valid for a purpose it must also be reliable
(otherwise it would not satisfy the usefulness criterion). But a test
can be reliable (consistent) without meeting its intended purpose.
That is, it is possible to make consistent assessments which are not
meaningful in the context of the decisions to be made.
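For tests made up of items scored correct or incorrect, one widely used index of this kind of consistency is Kuder-Richardson formula 20 (equivalent to Cronbach’s alpha for dichotomous items). The sketch below shows the calculation from a small matrix of invented scored responses; the data serve only to illustrate the arithmetic.

from statistics import pvariance

# Invented scored responses (1 = correct, 0 = incorrect): rows are
# candidates, columns are items. Used only to show the calculation.
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

k = len(responses[0])                          # number of items
totals = [sum(row) for row in responses]       # total score per candidate

# Proportion correct (p) and incorrect (q = 1 - p) for each item.
pq_sum = 0.0
for item in range(k):
    p = sum(row[item] for row in responses) / len(responses)
    pq_sum += p * (1 - p)

# Kuder-Richardson formula 20:
# KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores).
kr20 = (k / (k - 1)) * (1 - pq_sum / pvariance(totals))
print(f"KR-20 reliability estimate: {kr20:.2f}")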

8. Resources required to construct and produce a test

Many teachers have some skills in preparing assessment tasks but


receive little feedback on which tasks are valid and useful. Those
preparing questions for national examinations may receive some
feedback on the quality of the assessment tasks they have prepared,
but only if the examining authority conducts the appropriate
analyses. Without such quality feedback the skill level of item
writers tends to remain low. Expertise in developing non pencil-
and-paper assessment tasks is an even more scarce resource.

Training in test construction requires an expertise-sharing approach


so that the test construction skills of the trainer are transferred
to those involved in assessment. The team writing the questions
has to be aware of actual candidate responses to those questions,
and to have an opportunity to discuss the subsequent analyses
of trial data. Development of expertise is incremental, requiring
an ability to distinguish between what was intended and what
actually happened in practice. As experience in developing tests,
administering them, and interpreting the resulting responses is
gained, there should be less involvement with external trainers (and
more involvement in sharing that developed expertise with novice
test constructors).

Producing a test has a number of costs, aside from the physical


provision of weather- and vermin-proof secure room space,
heating or cooling, furniture and secure storage. The costs relate
to developing a test specification, the test construction effort by


teachers (either set aside from classroom work to join the test-
construction team, or paid additional fees to work outside school
hours), class teacher and student time for trials, the paper on
which the test (and answer sheet if appropriate) is to be printed
or photocopied, the production of copies of the test materials,
distribution to schools, retrieval from schools (if teachers are
not to score the tests), and scoring and analysis costs. Figure 6
shows a possible time scale for developing two parallel forms of
an achievement test of 50 items for use during the sixth year of
schooling. Figure 6 also shows the resources that would need to be
assembled to ensure that the tests were produced on time. (Note
that this schedule assumes that the test construction team has
had test development training prior to commencing work on the
project.)


Figure 6. Example timescale and resource requirements for test construction

Task                                  Time (weeks)   Resources
Decision to gather evidence           1              (Depends on local circumstances)
Decision to allocate resources        2              (Depends on local circumstances)
Content analysis and test blueprint   1              Curriculum experts (national or regional); test construction team (see below); relevant text books as used in the sixth year of schooling.
Item writing                          5              Test construction team (3 to 4 teachers, full-time); text books, word-processing, and photocopying facilities and supplies.
Item review 1                         1              Curriculum experts & test construction team.
Planning item scoring                 1              Test construction team.
Production of trial tests             2              Test construction team; word-processing, and photocopying facilities and supplies.
Scoring and item analysis             3              Test construction team; PC for computing.
Item review 2                         1              Curriculum experts & test construction team.
Amendment (revise/replace/discard)    1              Word-processing, and photocopying facilities and supplies.
More items needed?                    No/Yes         If ‘yes’ go back to ‘Item writing’; if ‘no’ continue.
Assembly of final tests               2              Test construction team; word-processing, and photocopying facilities and supplies.

Total time                            20 weeks


9. Some concluding comments

At the beginning of this module it was asserted that the assessment


of student learning provides evidence so that sound educational
decisions can be made. This evidence should help us to evaluate
(or judge the merit of) a teaching programme or we may use the
evidence to make statements about student competence or to make
decisions about the next aspect of teaching for particular students.
Clearly the quality of the evidence is a critical factor in making
sensible decisions.

The procedures for test construction described in this module have


developed over many years of practical work in the development
of tests and similar instruments. Some of the advice has arisen
from research into test analysis and some advice has been derived
from the practical experience of large numbers of research and
development staff working at various agencies around the world.
Improving the quality of the evidence is not an easy task. And
reading a book about the procedures will not suffice – because
improving one’s skills as a test constructor requires working on the
construction of tests as part of a test construction team.

Preparation of final forms of a test is not the end of the work. The
data gathered from the use of final versions should be monitored
as a quality control check on their performance. Such analyses
can also be used to fix a standard by which the performance of
future candidates may be compared. It is important to do this as
candidates in one year may vary in quality from those in another
year.

It is customary to develop more trial forms so that some forms of
the final test can be retired from use (where there is a possibility of
candidates having prior knowledge of the items through continued
use of the same test).

The trial forms should include acceptable items from the original
trials (not necessarily items which were used on the final forms
but similar in design to the pattern of item types used in the final
forms) to serve as a link between the new items and the old items.
The process of linking tests using such items is referred to as
anchoring. Surplus items can be retained for future use in similar
test papers.


10. References
General measurement and evaluation
1. Hopkins, C.D. and Antes, R.L. (1990). Classroom measurement
and evaluation. Itasca, Illinois: Peacock.

2. Izard, J. (1991). Assessment of learning in the classroom. Geelong,


Vic.: Deakin University.

3. Mehrens, W.A. and Lehmann, I.J. (1984). Measurement and


evaluation in education and psychology. (3rd Ed.) New York: Holt,
Rinehart and Winston.

Content analysis and test blueprints


1. Izard, J. (1997). Content Analysis and Test Blueprints. Paris:
International Institute for Educational Planning.

Item writing
1. Withers, G. (1997). Item Writing for Tests and Examinations.
Paris: International Institute for Educational Planning.

Trial testing and item analysis


1. Izard, J. (1997). Trial Testing and Item Analysis in Test
Construction. Paris: International Institute for Educational
Planning.

Testing applications
1. Adams, R.J., Doig, B.A. & Rosier, M.J. (1991). Science learning
in Victorian schools: 1990. (ACER Research Monograph No. 41).
Hawthorn, Vic.: Australian Council for Educational Research.

2. Doig, B.A., Piper, K., Mellor, S. & Masters, G. (1994). Conceptual


understanding in social education. (ACER Research Monograph
No. 45). Melbourne, Vic.: Australian Council for Educational
Research.

3. Masters, G.N. et al. (1990). Profiles of learning: The basic skills


testing program in New South Wales, 1989. Hawthorn, Vic.:
Australian Council for Educational Research.

4. Ross, K.N. (1993). Issues and methodologies in educational


development: 8. Sample design procedures for a national survey
of primary schools in Zimbabwe. (International Institute
for Educational Planning) Paris, France: United Nations
Educational, Scientific and Cultural Organisation.

Information for decision-making


1. Somerset, A. and Ekholm, M. (1990). “Different information
requirements for different levels of decision-making”, in K.N.
Ross and L. Mählck, L. (eds.) (1990). Planning the quality of
education: The collection and use of data for informed decision-
making. Paris: United Nations Educational, Scientific and
Cultural Organization/Oxford: Pergamon Press.


11. Exercises

1. CONSTRUCTION OF A TEST PLAN

a) Choose an important curriculum topic or teaching subject


(either because you know a lot about it or because it is
important in your country’s education programme).

List the key content areas in that topic or subject.

Show (in percentage terms) the relative importance of each


key area.

Compare your key content areas and associated relative


importance with one or more persons attempting this
exercise.

b) Choose another appropriate dimension (such as skills
categories or item format categories) for the same curriculum
topic or subject (as in part (a) above).

List the important categories.

Show the relative importance (expressed as percentages) of


each category.

Compare your categories with one or more persons (as in
part (a)).

c) Construct a test plan (like the plan shown in Figure 5) which
has the content categories (from part (a)) at the left and the
skill or format categories (from part (b)) at the top.

Adjust the numbers of items in each cell to reflect the


percentage weightings you have chosen for each dimension.

2. TEXTBOOK ANALYSIS AND ITEM WRITING

(a) Review a classroom textbook used in your country. Using


your test plan as a guide, prepare a test plan for a test of the
material in the text book.

(b) Choose one cell of the test plan and write some items for this
cell.

3. REVIEW OF EXISTING CLASSROOM TEST

(a) Using the section on item review as a guide, review a


classroom test prepared by a teacher.

(b) Set up a panel of two or three to discuss the reviews.

(c) Choose some of the questions and re-write them to satisfy


the panel’s critical comments.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana,


Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa,
Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed


by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

7
Module

John Izard

Trial testing
and item analysis
in test construction

UNESCO International Institute for Educational Planning


Module 7 Trial testing and item analysis in test construction

Content

1. Introduction 1

2. Preparing for trial testing 3


Content analysis 3
Test blueprint 3
Item review 4
Other review issues 6
Review of trial test, presentation and layout 8

3. Planning the trial testing 11

4. Choosing a sample of candidates for the test trials 12
Preparing the codebook 13
What to consider in arranging for a test to be given 15
Preparing test administration instructions 16

5. Conducting the actual trial testing 18

1
Module 7 Trial testing and item analysis in test construction

6. Processing test responses after a trial


testing session 19
Scoring procedures 20
Scoring trial papers 22

7. Acknowledging co-operation 24

8. Analysis in terms of candidate responses 25


Introduction to test analysis strategies 27
Doing an item analysis ‘by hand’ 39

9. Item analysis approaches using the computer 45


Classical strategies for item analysis 47
Deciding whether an item is useful after trial with real
candidates (classical analysis) 54
Test reliability 57
Item response modelling strategies for item analysis 59
Deciding whether an item is useful after trial with real
candidates (item response modelling analysis) 64
Classical item analysis and item response modelling
compared 65

10. Maintenance of security 66

II
Content

11. Test review after trials 67


Cautions in interpreting item analysis data 68
Assembling the final test and the corresponding score key 69

12. Confidential disposal of trial tests 71

13. Using item analysis software 72


Computer software 72
References 73
Finding out more about trial testing and item analysis 73
Applications of Item Analysis 76

14. Exercises 77

© UNESCO III
Introduction 1

Assessment involves selecting evidence from which inferences can


be made about current status in a learning sequence. The tasks
that are chosen to provide that evidence have to be effective in
distinguishing between those who have the required knowledge
and those who do not. Trial testing (sometimes called pilot testing)
involves giving a test under specified conditions to a group of
candidates similar to those who will use the final test. Subsequent
analysis of the data from the trials examines the extent to which the
assessment tasks performed as expected under practical conditions.
Figure 1 below indicates the position of trial testing
and item analysis in the overall test construction process.

In order to assess the capacity of each question or task to


distinguish between those who know and those who do not, the
trial group of candidates should possess a range of knowledge
from those with good knowledge to those lacking it. Typically one
does not have definitive evidence on this (and if we did have, we
probably would not need to construct the trial test). Therefore we
need to depend on teacher advice and our experience to choose
suitable trial test candidates. Note that this applies to both criterion-
referenced and norm-referenced tests. In the former case we need
some candidates likely to meet the criterion and some who do not.
In the latter case we need some candidates who score well relative
to their peers, some who score around the average relative to their
peers, and some who score poorly relative to their peers.

© UNESCO 1
Module 7 Trial testing and item analysis in test construction

Figure 1. Trial and analysis in the context of test construction

Decision to gather evidence
↓
Decision to allocate resources
↓
Content analysis and test blueprint
↓
Item writing
↓
Item review 1
↓
Planning item scoring
↓
Production of trial tests
↓
Trials
↓
Item review 2
↓
Amendment (revise/replace/discard)
↓
More items needed? → Yes: return to Item writing
↓ No
Assembly of final tests

2
Preparing for trial testing 2
Before undertaking a trial test project, we need to make some
important checks. Trial testing uses time and resources so we must
be sure that the proposed trial test is as sound as possible so that
time and resources are not wasted. The team preparing the trial
tests should have prepared a content analysis and test blueprint. A
panel should review the trial test in terms of the content analysis
and test blueprint to make sure that the trial test meets the intended
test specifications. It is also necessary to review each test item
before trial testing commences.

Content analysis
A content analysis provides a summary of the intentions of the
curriculum expressed in content terms. Which content is supposed
to be covered in the curriculum? Are there significant sections of
this content? Are there significant subdivisions within any of the
sections? Which of these content areas should a representative test
include?

Test blueprint
A test blueprint is a specification of what the test should cover
rather than a description of what the curriculum covers. A test
blueprint should include the test title, the fundamental purpose
of the test, the aspects of the curriculum covered by the test, an

© UNESCO 3
Module 7 Trial testing and item analysis in test construction

indication of the students for whom the test will be used, the types
of task that will be used in the test (and how these tasks will fit in
with other relevant evidence to be collected), the uses to be made of
the evidence provided by the test, the conditions under which the
test will be given (time, place, who will administer the test, who will
score the responses, how the accuracy of scoring will be checked,
whether students will be able to consult books or use calculators
while attempting the test, and any precautions to ensure that the
responses are only the work of the student attempting the test), and
the balance of the questions.

Comparing the test blueprint with the content analysis of the


curriculum should show that the test is a reasonably representative
sample of what the curriculum is about (at least as far as content is
concerned). Test blueprints may include other dimensions too. For
example, the blueprint may indicate the desired balance between
factual recall questions and questions which require interpretation
or application to a particular context. Or the blueprint may show
the desired balance between different item formats (constructed
responses as compared with recognition responses). When the
test blueprint has several dimensions it is possible to see how the
evidence to be collected combines these dimensions with other
sources of information by means of a grid (or series of grids), and
how account is to be taken of the importance of that evidence.

Item review
Why should the proposed trial test be reviewed before trial? The
choice of what to assess, the strategies of assessment, and the modes
of reporting depend upon the intentions of the curriculum, the
importance of different parts of the curriculum, and the audiences
needing the information that assessment provides. If we do not
select an appropriate sample of evidence, then the conclusions we
draw will be suspect, regardless of how accurately we make the
assessments. Tasks chosen have to be representative so that:

4
Preparing for trial testing

• dependable inferences can be made about both the tasks chosen


for assessment and the tasks not chosen;

• all important parts of the curriculum are addressed;

• achievement over a range is assessed (not just the presumed


narrow band where a particular selection decision might be
required on a single occasion).

The review panel has the responsibility of ensuring that the


assessment tasks are appropriate, representative, and extensive.
For example, the range of complexity of tasks should be at least as
wide as the expected range of achievement for the students being
assessed if evidence of learning is required about all students.
Just as a team of item writers can produce a better range of items
to consider for trial, a team of item critics (including item writers
– they need the feedback) can provide better and more constructive
comments on proposed trial items. Item review without the benefit
of interaction with colleagues is generally inefficient and tends to
be too idiosyncratic, representing only one person’s limited view of
the topic to be assessed. The review of assessment tasks by a review
panel is essential before trial testing commences. Sometimes the
item seems clear to the person who wrote it – but the item may not
necessarily be clear to others. The review panel will ask questions
like:

• Is the task clear in each item? Is it likely that the person


attempting an item will know what is expected?

• Are the items expressed in the simplest possible language?

• Is each item a fair item for assessment at this level of education?


Is the wording appropriate to the level of education where the
item will be used?

• Are there unintended clues to the correct answer?

© UNESCO 5
Module 7 Trial testing and item analysis in test construction

• Is the format reasonably consistent so that students know what


is required from item to item?

• Is there a single, clearly correct (or best) answer for each item?

• Is the type of item appropriate to the information required?

• Are there statements in the items which are likely to offend?

• Is there content which reflects bias on cultural or other


grounds?

• Are the items representative of the behaviours to be assessed?

• Are there enough items to provide an adequate coverage of the


behaviours to be assessed?

This part of the review before the items are tried should help
avoid tasks which are expressed in language too complex for
the idea being tested, and/or contain redundant words, multiple
negatives, and distracters which are not plausible. The review
should also identify items with no correct (or best) answer and
items with multiple correct answers. Such items may be discarded
or re-written. Only good items should be used in a trial test. (The
subsequent item analysis helps choose the items with the best
statistical properties from the items that were good enough for
trial).

Other review issues


Some tests provide practice items for candidates to attempt under
supervision, so that they know how to record their responses; some
candidates will have had more experience of attempting tests than
others. In situations where tests are to be used for

6
Preparing for trial testing

selection purposes it may be necessary to provide more detailed


information about the test. For example, an information leaflet
about a test can be useful in reducing test anxiety, and in avoiding
some of the unsavoury effects of test coaching (by providing a
simple form of coaching for all candidates rather than advantaging
those who can afford to pay private tutors). Here are some of the
important questions.

Will the students be told how the items are to be scored? Will they be
told the relative importance of each item? Will they be given advice
on how to do their best on the test?

Will there be practice items? Do students need advice on how they are
to record their responses? If practice items are to be used for this
purpose, what types of response do they cover? How many practice
items will be necessary?

Will there be a separate answer sheet? Recording responses on a


separate answer sheet may allow a test booklet to be used again.
If there is to be a separate answer sheet, have plans been made to
recycle the test question booklets? (If so, resources may be required
to have each page checked very carefully to make sure that there
are no marks left by previous candidates who used the test). Will
this use of a separate sheet add to the time given for the trial test?
What information should be requested in addition to the actual
responses to the items? (This might include student name, school,
year level, sex, age, etc.).

Has the layout been arranged for efficient scoring (or coding) of
responses? Are distracters for multiple-choice tests shown as capital
letters (less confusing to score than lower case letters)? One long
column of answers is generally easier to score by hand than several
short columns.

© UNESCO 7
Module 7 Trial testing and item analysis in test construction

How much time will students have to do the actual test? What time
will be set aside to give instructions to those students attempting
the test? Will the final number of items be too large for the test to
be given in a single session? Will there be a break between testing
sessions when there is more than one session?

What type of score key will be used? Complex scoring has to be done
by experienced scorers, and they usually write a code for the mark
next to the test answer or on a separate coding sheet. Multiple-
choice items are usually coded by number or letter and the scoring
is done by a test analysis computer programme.

What test administration information will be given to those who are


giving the trial test to students? Will the students be told that the
results will be returned to them? Are the tests to be treated as
secure tests (with no copies left behind in the venue where the test
is administered)?

Review of trial test, presentation and


layout
Some very practical working rules should be adopted. The front
page should explain briefly which group has prepared the test, give
the purpose of the test, and give instructions to the candidate about:

• the number of items;

• the time available for them to attempt the test;

• how they are to show their answers (whether on the test paper,
or on a separate answer sheet); and

• what to do if they change their mind about an answer and wish


to alter it.

8
Preparing for trial testing

The options in multiple-choice items should be arranged in some


logical order (for example, from the smallest to the largest). The
items should be placed in order from the easiest to the most difficult
(to encourage candidates to continue the test).

The layout of items should avoid patterns in the correct answers


such as three or more of the same letter in a row, or other patterns
like ABCD or ABABAB (which might lead to ‘correct’ responses for
the ‘wrong’ reasons).
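As an illustration of how such a check can be mechanised, the short sketch below (Python, written for this module rather than drawn from it) scans a proposed score key for runs of three or more identical answer letters; the key shown is invented.

```python
def suspicious_runs(key, run_length=3):
    """Return (start, end, letter) for runs of run_length or more identical keyed answers."""
    runs = []
    start = 0
    for i in range(1, len(key) + 1):
        if i == len(key) or key[i] != key[start]:
            if i - start >= run_length:
                runs.append((start + 1, i, key[start]))   # 1-based item positions
            start = i
    return runs

# Invented 10-item key: items 3-5 are all keyed 'C' and would be flagged.
print(suspicious_runs("ABCCCDABAB"))   # -> [(3, 5, 'C')]
```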

Any materials required during the administration of the trial test


should be listed so the candidates know, explicitly, what they should
have for the testing session. Candidates must be informed that all
test materials must be returned to the testing supervisor.

If the test is to be expendable, there must be space for the trial


candidate’s name, location or department (so that scores can be
returned if appropriate, and to give those conducting the trial test,
information about the diversity of the trial sample). If the candidate
is not to write on the test but is to write on a separate answer sheet,
the personal identification details must appear on that answer sheet
rather than on the test itself.

When several trial tests are being given at the same time (and this is
usually the case) it is important to have some visible distinguishing
mark on the front of each version of the test. Then the test
supervisor can see at a glance that the tests have been alternated.
If distinguishing marks for each version cannot be used, then a
different colour of cover page for each version is essential.

The trial test pages should not be sent for reproduction of copies
until the whole team is satisfied that all possible errors have been
found and corrected. All corrections must be checked carefully to
be sure that everything is correct! [Experience has shown that
sometimes a person making the corrections may think it better to

© UNESCO 9
Module 7 Trial testing and item analysis in test construction

retype a page rather than make the changes. If only the ‘corrections’
are checked, a (new) mistake that may have been introduced will
not be detected.]

When those responsible for constructing the questions, assembling


the trial test, and reviewing it, are satisfied that each question meets
the criteria for relevant, reasonable, valid and fair items, the test is
ready for trial. Only items which have survived this review should
be subjected to trial with candidates like those who will eventually
attempt the final version of the test.

10
Planning the trial testing 3
Empirical trial testing provides an opportunity to identify
questionable items which have not been recognised in the process of
item writing and review. At the same time, the test administration
instructions are able to be refined to ensure that the tasks presented
in the test are as identical as possible for each candidate. (If the
test administration instructions vary then some candidates may
be advantaged over others on a basis unrelated to the required
knowledge which is being assessed by the test).

The trial testing will:

• establish the difficulty of each item;

• identify distracters which do not appear plausible;

• assist in determining the precision of the test and suggest the


number of test items for the final test;

• establish the contribution of each item to the discrimination


between candidates who achieve at a high level and those who
do not;

• check the adequacy of the administration instructions including


the function of any practice items and the time required for
most students to complete the test;

• identify misconceptions held by the students through analysis


of student responses and, where possible, the questioning of
some students as to their reasons for making these responses.

© UNESCO 11
Module 7 Trial testing and item analysis in test construction

4 Choosing a sample of candidates


for the test trials
The size of the trial testing group for each trial test should be
around 150 to 250 persons, covering a wide range of ability and
geographic dispersion, and should be roughly representative of
the various groups likely to attempt the final versions of the tests.
It is usual to try to have approximately equal numbers of male
and female candidates for the trials, with males and females each
meeting the target group requirements.

The target audience for the final form of the test should guide
the selection of a trial sample. If the target audience is to be a
whole nation or region within a nation, then the sample should
approximate the urban/rural mix, the sizes and types of school, and
the age levels in the target audience. This type of sample is called a judgment
sample, because we depend on experience to choose a sufficiently
varied sample for trial purposes. The choice of sample also has to
consider two competing issues: the costs of undertaking the trial
testing and the need to restrict the influence of particular schools.
The more schools are involved in the trial testing and the more
diverse their location, the greater the travel and accommodation
costs. The smaller the number of schools the greater the influence of
a single school on the results.

Judgment samples often have to take into account the following


categories of schools:

12
• Government/Private

• Co-educational/Boys/Girls

• Major Urban/Minor Urban/Outer Urban/Rural

• Primary/Secondary/Vocational

• Selective/Non-selective

As a consequence, those choosing the judgment sample need to


know how many students (at least approximately) there are in each
category so that the judgment sample can approximate the national
or regional target audience for the final form of the proposed test.
In some nations and regions, test security concerns result in trial
testing being conducted in another nation or region.
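Where approximate student counts are available for each category, the trial sample can be allocated in proportion to them. The sketch below is a minimal illustration in Python; the category names, enrolment figures and trial size are invented, not taken from any SACMEQ data.

```python
# Hypothetical enrolment figures by school category (invented numbers).
enrolment = {
    "Government urban": 120_000,
    "Government rural":  80_000,
    "Private urban":     30_000,
    "Private rural":     10_000,
}

trial_size = 200  # total trial candidates wanted for one trial form

total = sum(enrolment.values())
allocation = {cat: round(trial_size * n / total) for cat, n in enrolment.items()}
print(allocation)
# {'Government urban': 100, 'Government rural': 67, 'Private urban': 25, 'Private rural': 8}
```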

Preparing the codebook


When a trial test is prepared it is necessary to document where an
item appears on the test, which area of content and which skills are
being assessed, the name assigned to the item (if one is assigned),
the number of options, the code used for missing data, any
coding values for particular responses, and any notes that provide
necessary information about the item.

The document which is a collation of such item information and


associated trial sample description is known as a codebook. (This
label is also applied to a machine readable file with the same
information).
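A codebook can be held as one structured record per item, whether in a spreadsheet or in a small data structure such as the Python sketch below. The field names and values here are illustrative only; the keys, content areas and skills for items 1 and 2 follow the basic statistics example shown in Figures 3 and 4.

```python
# Illustrative codebook entries: one dictionary per trial test item.
codebook = [
    {"item": 1, "name": "STAT01", "content": "Frequency distributions",
     "skill": "Recall of facts", "options": 5, "key": "1",
     "missing_code": "6", "multiple_code": "7", "notes": ""},
    {"item": 2, "name": "STAT02", "content": "Means",
     "skill": "Recall of facts", "options": 5, "key": "5",
     "missing_code": "6", "multiple_code": "7", "notes": ""},
]

for entry in codebook:
    print(entry["item"], entry["content"], "key =", entry["key"])
```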

The test specification grid (part of the test blueprint) will help in the
preparation of this documentation (see Figure 2). For example, the
content and skill objectives of a basic statistics test are shown in the
grid below.

© UNESCO 13
Module 7 Trial testing and item analysis in test construction

Figure 2. Test specification grid for a basic statistics test

                                            Objectives
Content                     Recall of facts   Computational skills   Understanding   Total

Frequency distributions     2 items           -                      4 items         6
Means                       2 items           4 items                2 items         8
Variances                   2 items           4 items                2 items         8
Correlation                 4 items           4 items                12 items        20
Relative standing           4 items           -                      8 items         12

Total                       14                12                     28              54

Figure 3. Item specification grid for a basic statistics test

                                            Objectives
Content                     Recall of facts        Computational skills    Understanding                       Total

Frequency distributions     items 1, 4             -                       items 6, 9, 12, 16                  6
Means                       items 2, 7             items 8, 10, 19, 22     items 13, 18                        8
Variances                   items 3, 5             items 11, 15, 20, 24    items 14, 17                        8
Correlation                 items 21, 25, 32, 36   items 23, 27, 35, 41    items 28, 31, 34, 37, 39, 43,       20
                                                                           45, 47, 49, 50, 52, 54
Relative standing           items 30, 42, 44, 53   -                       items 26, 29, 33, 38, 40, 46,       12
                                                                           48, 51

Total                       14                     12                      28                                  54

14
Choosing a sample of candidates for the test trials

The code book should show which items appear in each cell. One
way of doing this is to show the specification grid with the item
numbers in place and show the score key below the grid (see
Figures 3 and 4).

Figure 4. Codebook details for Figure 3

Basic Statistics correct answers

Item number          1         2         3         4         5
            123456789012345678901234567890123456789012345678901234
Key         153242325254351415152313153234541452452133245432112315

Missing data: coded as 6. Multiple answers: coded as 7.

What to consider in arranging for a test


to be given
Experience has shown that those involved in the construction of
test items should also be involved in the trial of those items. Test
constructors need first-hand feedback on the qualities of their test
items; students attempting a new test can help in providing that
direct feedback.

The institutions which have agreed to allow trial tests to occur on


their premises should be contacted in advance. They should be
informed of the number of candidates that are required from that
institution, whether they be from different year levels (if in training)
or from different employment levels (or equivalent) if already
working. [It is usually wise to have as diverse a group as possible,
particularly in the context of testing for selection purposes.]

There may need to be a preliminary visit to each institution to


establish whether the trial tests will be done in one large room, and/

© UNESCO 15
Module 7 Trial testing and item analysis in test construction

or several smaller rooms (such as classrooms). Each testing room


needs a test supervisor! The supervisor introduces the test to the
trial candidates, explains any practice items, ensures that candidates
have the correct amount of time allowed to attempt the test and that
any last-minute queries are answered (such as informing trainees that
the results of this trial testing are to be used to validate the questions
and will not have any effect on their current course work), and gathers
all test materials before candidates leave the room.

Test materials should be sorted into bundles before entering the


testing room so that different trial test forms can be alternated.
All bundles should have three or four spare copies of each trial
form in case of printing or collating errors. No candidate should
be sitting beside another candidate doing the same form of the
trial test. Candidates may sit in front of or behind other candidates
attempting the same form of the trial test, unless the test is being
done in a sloping-floored lecture theatre which permits one person
to see the paper of the person in front.

Preparing test administration


instructions
A sample set of administration instructions is given in Panel 1.
These may be used as a model for writing such instructions. Other
issues to consider include the provision of practice examples
(particularly if the format of the test is expected to be unfamiliar
to those students in the trial group), provision of pens or pencils
in two colours so that after a given period of time candidates can
be instructed to change to the other colour (particularly if the
variation in the number of items completed in that time needs to be
determined), and advice to the test administrator on alternating the
versions of the trial tests so that adjacent candidates are attempting
different versions.

16
Choosing a sample of candidates for the test trials

Panel 1. A sample set of test administration instructions

Instructions for Administration

These instructions assume the candidates can read. The tester should have a
stopwatch, a digital watch showing minutes and seconds, or a clock with a
sweep-second hand. Make sure each candidate has a pen, ballpoint pen, or
pencil. All other materials should be put away before the test is started.
Give each candidate a copy of the test, drawing attention to the instruction
on the cover.
“Do not open this book or write anything until you are told.”
Instruct the candidates to complete the information on the front cover of
the test, assisting as necessary. Check that each candidate has completed
the information correctly. (Year of birth should not be this year; the number
of months in the age should be 11 or less; the first name should be shown in
full rather than as an initial.) Ensure that the test booklet remains closed.
Read these instructions (which are shown on the cover of the test), asking
candidates to follow while you read.

Say:
Work out the answer to each question in your head if you can. You
can use the margins for calculations if you need to. You will receive
one mark for each correct answer.
Work as quickly and accurately as you can so that you get as many
questions right as possible. You are not expected to do all of the
questions. If you cannot do a question do not waste time. Go on to
the next question. If there is time go back to the questions you left
out.
Write your answer on the line next to the question. If you change
an answer, make sure that your new answer can be read easily.

Check that everybody is ready to start the test. Tell candidates that they
have 30 (thirty) minutes from the time they are told to start to answer the
questions. Note the time and tell candidates to turn the page and start
question one.
After 30 (thirty) minutes tell candidates to stop work and to close their
booklets.
Collect the tests, making sure that there is one test for each candidate, and
thank the candidates for their efforts.

© UNESCO 17
Module 7 Trial testing and item analysis in test construction

5 Conducting the actual trial


testing
Finding the appropriate place where the testing is to be held in
an institution unfamiliar to the supervisor, may mean that the
supervisor has to arrive at that institution well in advance of the
planned testing time. Each testing room must have a supervisor.
The supervisor for each room has to have a complete set of testing
materials (since testing rooms may not be adjacent or even in the
same building). It is more efficient for all the supervisors to start the
testing at the same time, rather than go from room to room starting
the testing on a staggered timetable.

The supervisor makes sure that all candidates are seated, introduces
him/herself, explains briefly what will happen in the testing session
and answers queries, distributes the test and associated papers to
each person according to the agreed plan, and ensures that each
candidate has a fair chance of completing the trial test without
interruption. The supervisor must enforce the test time limits so
that candidates in each testing room have essentially the same time
to attempt the items.

After the test has been attempted, it is usual for all test materials to
be placed in an envelope (or several if need be) with identification
information about the trial group and the location where the
tests were completed. If there is time, the trial tests can be sorted
into the different test forms before being placed in the envelope.
The envelope should be sealed. The test supervisor for a room is
responsible for ensuring that all the test papers (used and unused)
are returned to those who will process the information.

18
Processing test responses 6
after a trial testing session
When the trial tests arrive back at the trial testing office they should
still be in their sealed envelopes or packages. Only one envelope is
opened at a time, as it is important to know the source of every test
paper. When an envelope is opened, the trial tests are sorted into
stacks according to the test version.

Identification numbers are assigned to the tests in the package, and


written clearly on the tests. For example, some digits of the numbers
may be assigned according to the institution that provided the
trial test candidates. The first institution numbers may be prefixed
with ‘1’, the second with ‘2’, and so on. It is important to check
whether the intended trial group became the actual trial group. If
the actual trial group differs substantially from the intended group,
interpretation of trial data will be made more difficult because the
group will be less representative. For example, trial groups should
have both urban and country representation. Data for country trials
may be slower in returning for processing. If country data are not
included, the analyses will not be representative of country and
urban groups. That is, there will be no evidence of the usefulness
of the items for distinguishing between more able and less able
respondents in country areas.

© UNESCO 19
Module 7 Trial testing and item analysis in test construction

Scoring procedures
• Multiple-choice

Multiple-choice items present a task and provide a number of


options from which the candidate has to choose. The candidate’s
task is to identify the correct or the best alternative. Judgments
of experts are needed to establish which option is the best (or
correct) answer for each item. Once these correct answers have
been decided, the score key can then be used by clerical staff or
incorporated in machine scoring. Scoring becomes a mechanical
task and many test analysis software packages for personal
computers can score and analyze test data in a single processing
run. The correct score key is crucial. Errors in score keys create
interpretation problems. In such a case the total score obtained is
not the best measure of what the test is measuring, items which are
sound are queried, and candidates do not receive appropriate credit
for their achievements. Further, since test analysis software packages
require the score key to be kept on disk, the computer containing
score keys needs to be kept in a secure place, with restrictions on
access to it.
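A minimal sketch of this mechanical scoring step is shown below (Python, written for illustration): a candidate's string of responses is compared with the score key, and the codes for omitted or multiple answers, 6 and 7 as in the Figure 4 codebook example, are scored as incorrect.

```python
def score_responses(responses, key, non_answer_codes=("6", "7")):
    """Return (total_score, list_of_0_or_1) for one candidate.

    responses and key are strings of equal length, one character per item.
    Codes in non_answer_codes (omitted / multiple answers) are scored 0.
    """
    if len(responses) != len(key):
        raise ValueError("response string and key differ in length")
    item_scores = [
        0 if r in non_answer_codes else int(r == k)
        for r, k in zip(responses, key)
    ]
    return sum(item_scores), item_scores

# Example with an invented 5-item key:
total, items = score_responses("52636", "52432")
print(total, items)   # 3 [1, 1, 0, 1, 0]
```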

• Constructed response

There are potential difficulties in scoring prose, oral, drawn and


manipulative responses. An expert judge is required because each
response requires interpretation to be scored. Judges vary in their
expertise, vary over time in the way they score responses (due
to fatigue, difficulty in making an objective judgment without
being influenced by the previous candidate’s response, or by
giving varying credit for some correct responses over other correct
responses), and also vary in the notice they take of handwriting,
neatness, grammatical usage and spelling.

One technique for avoiding or minimizing such problems is to train


a team of scorers. Such training often involves a discussion of the

20
Processing test responses after a trial testing session

key issues that have to be identified by a candidate. The scorers


should then apply what they have learned by scoring the same
batch of anonymous real samples of responses. It is important to
have a range of real samples. (The training is to ensure that scorers
can tell the difference between high quality, medium quality, and
low quality answers and assign marks so that the higher quality
answers will get better scores than the medium quality answers,
and medium quality answers in turn will get better scores than
low quality answers). These results are then compared (perhaps
graphically) and discussed. It is not expected that identical results
will be obtained by each scorer. Rather, the aim is to improve the
agreement between scorers about the quality of each response. We
expect that there should be greater agreement between the scorers
where the responses are widely separated in quality. Making more
subtle distinctions, consistently, requires more skill. To achieve
consistency, each paper (or sample of papers) should be remarked
without knowledge of the other assessment. If large differences
occur in such a case, training is required until the interpretations
tend to agree. Members of the scoring team may differ in the
importance they place on various aspects of a task and fairness
to all candidates requires consistency of assessment within each
aspect. Even when team members agree in the rank ordering of
responses, the marks awarded may differ because some team
members are lenient while others are more stringent.

A more subtle difference occurs when some judges see more “shades
of grey” while others see fewer such gradations (as in the tendency to
award full marks or no marks). Scorers should make use of similar
ranges of the scale.
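One simple way to examine agreement between two scorers is to compare their marks for the same batch of papers, looking both at the difference in means (leniency) and at the correlation (agreement on rank order). The Python sketch below uses invented marks for ten papers; it is an illustration, not a prescribed procedure.

```python
# Invented marks from two scorers for the same ten anonymous papers.
scorer_a = [12, 15, 9, 18, 14, 7, 16, 11, 13, 10]
scorer_b = [10, 14, 8, 17, 15, 6, 15, 10, 12, 9]

n = len(scorer_a)
mean_a = sum(scorer_a) / n
mean_b = sum(scorer_b) / n

cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scorer_a, scorer_b)) / n
sd_a = (sum((a - mean_a) ** 2 for a in scorer_a) / n) ** 0.5
sd_b = (sum((b - mean_b) ** 2 for b in scorer_b) / n) ** 0.5

print("mean difference (A - B):", round(mean_a - mean_b, 2))   # leniency
print("correlation:", round(cov / (sd_a * sd_b), 2))           # rank-order agreement
```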

Short-answer items may require a candidate to recall knowledge


rather than recognise it (to produce an answer rather than make a
choice of an answer) or may be restricted to recognition. The former
may be something like miniature essays (or the oral or drawn
equivalent), or may require a word or phrase to be inserted (as in

© UNESCO 21
Module 7 Trial testing and item analysis in test construction

cloze procedure or fill-the-gap). Recognition tasks may require


a key element of a drawing/photograph/diagram/prose passage
to be identified, as in the case of a proof-reading test of spelling,
or choosing the part of a diagram or poster which has a safety
message.

Scoring short responses (whether production or recognition in


format) has some of the difficulties of scoring more extended
responses but it is generally easier for judges to be consistent, if
only because the amount of information to be considered is smaller
and likely to be less complex. However item analysis is still a
necessary part of the scoring arrangements for short responses as a
quality assurance process.

Scoring trial papers


If the test needs to be scored by expert judges before analysis,
this scoring is done next. If there are essay-type items, two
approaches can be used. The first requires the marker to obtain
scores on distinct aspects such as completeness of evidence, logical
organization, and effectiveness of explanation. This analytic
method may be time consuming and errors may creep in if the
marks awarded to each aspect are not added correctly. The second
approach requires a general unanalyzed impression. This approach
depends upon rapid global judgments leading to sorting of samples
of work into a number of groups. For example, the first sorting
might set up three groups: poor, average, and good.

When this sorting has been finished the essays in each group are
checked quickly to ensure that they are in the correct group. The
essays in each group are then sorted into two further groups and
checked again. For both approaches essays should be assessed as
anonymously as possible.

22
Processing test responses after a trial testing session

Regardless of the approach that is chosen, it is necessary to


decide in advance what qualities are to be considered in judging
the adequacy of the answer. If more than one distinct quality is
required in an essay, separate assessments are needed. It may
be useful to prepare an answer guide in advance, showing what
points should be covered. Where there are several essays in an
examination paper it is good practice to mark the first essay all the
way through the stack of test papers. Then shuffle or rearrange the
papers before starting to score the next essay. Repeat this process
after each essay has been marked.

When all items have been marked, the scores are entered into a
computer file. If the test is multiple-choice in format, the responses
may be entered into a computer file directly. (The scoring of the
correct answers is done by the test analysis computer programme).
The next envelope of tests is not opened until the processing of
the first package has been assigned. This is to ensure that tests do
not get interchanged between packages. [Sending the wrong results
to an institution reflects very badly on those in charge of the test trials
and analysis.] Data entry can be done in parallel provided that each
package is the responsibility of one person (who works on that
package until all work on the tests it contains is completed). The
tests are then returned to their package until the analysis has been
completed, and the wrapping is annotated to show which range
of candidate numbers is in the envelope and the tests for which
the data have been entered. (If a query arises in an analysis, the
actual test items for that candidate must be accessed quickly and
efficiently).

The analysis can be done as soon as all of those particular trial


tests have been processed and the resulting data files have been
combined. (Remember to check that blank or duplicate lines have
been taken out of the combined data file. Leaving such lines in may
lead to spurious discrimination and difficulty indices).
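That clean-up step can be automated. The sketch below (Python; the file names are hypothetical) concatenates several response files into one, skipping blank lines and exact duplicate records before the analysis is run.

```python
def combine_response_files(filenames, output_name):
    """Concatenate response files, skipping blank lines and exact duplicate lines."""
    seen = set()
    kept = []
    for name in filenames:
        with open(name, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line.strip():       # blank line
                    continue
                if line in seen:           # duplicate record
                    continue
                seen.add(line)
                kept.append(line)
    with open(output_name, "w", encoding="utf-8") as out:
        out.write("\n".join(kept) + "\n")
    return len(kept)

# e.g. combine_response_files(["package1.dat", "package2.dat"], "combined.dat")
```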

© UNESCO 23
Module 7 Trial testing and item analysis in test construction

7 Acknowledging co-operation

Empirical trial is the only satisfactory method of finding the


difficulty of a test item for a particular group. Without the co-
operation of those managing the trial sites, and the trial group of
candidates, this information could not be obtained.

If the results from the trial tests are to be sent back to the
institutions which co-operated in the trials, the results should be
accompanied by some advice on interpretation. This advice should
include something like this.

These results are from the trial testing conducted on <date>.


Since these results are based on trial tests some caution should be
exercised in interpretation of the results. For example, the trial tests
administered may have differed in difficulty so the same score on
each test may not represent equivalent achievement.

Appropriate thanks for co-operation should also be given.

24
Analysis in terms of 8
candidate responses
When candidate responses are available for analysis, trial test
items can be considered in terms of their psychometric properties.
Although this sounds very technical and specialized, the ideas
behind such analyses are relatively simple. We expect a test to
measure the skills that we want to measure. Each item should
contribute to identifying the high quality candidates. We can see
which items are consistent with the test as a whole. In effect, we are
asking whether an item identifies the able candidates as well as can
be achieved by using the scores on the test as a whole.

Two main indices are obtained from a traditional analysis of


student responses to test items. These are an index of item difficulty
(or facility) and an index of item discrimination. Also, further
information can be gained from an analysis of the choices in a
multiple-choice context. Many software packages provide summary
statistics such as the mean, standard deviation, reliability or
internal consistency index, and a frequency distribution of scores,
for the test as a whole as well.

• Item difficulty

Empirical trial of a test is the only satisfactory method of finding


the difficulty of a test item for a particular group. The index of
difficulty, which is reported for a particular test administered
to a particular group, is a function of the skills required by the
questions and the skills achieved by those attempting the test. Item

© UNESCO 25
Module 7 Trial testing and item analysis in test construction

facility is the opposite of item difficulty. As the difficulty increases,


fewer candidates are able to give the correct response; as the facility
increases, more candidates are able to give the correct response. In
general, between 90 per cent and 100 per cent of students should
complete all items unless the purpose is to test speed itself, as in the
case of a speed of reading test.
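In classical analysis the facility of an item is usually reported as the proportion (or percentage) of the trial group giving the correct response. A minimal sketch, using item 1 of the worked example later in this module:

```python
def facility(item_scores):
    """Proportion of candidates answering the item correctly (0.0 to 1.0)."""
    return sum(item_scores) / len(item_scores)

# Item 1 of the 18-candidate example in Figure 5: 15 of 18 correct.
item_1 = [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(round(facility(item_1), 2))        # 0.83
# One common convention (not prescribed in this module): difficulty = 1 - facility.
print(round(1 - facility(item_1), 2))    # 0.17
```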

• Item discrimination

Traditional test analysis considers the extent to which a single item


distinguishes between able and less able candidates in a similar way
to the test as a whole. Items which are not consistent with the other
items in the way in which they distinguish between able and less
able candidates (as measured by this test) are considered for deletion,
amendment, or placement on a different test. Modern test analysis
techniques consider other factors as well. (These are discussed
later).

For a test of many items, it is common practice to assume that the


total score on the trial test is a reasonable estimate of achievement
for that type of test. Criterion groups may be selected on the basis
of total score (if that type of analysis is being done). When such an
assumption is made, we expect candidates with high total scores to
have high achievement and candidates with low total scores to have
low achievement.

The procedure investigates how each item distinguishes between


candidates with knowledge and skill, and those lacking such
knowledge and skills. Choosing items with an acceptable
discrimination index will tend to provide a new version of the test
with greater homogeneity. [However this process should not be
taken too far because a test measuring a more complex area will be
made less relevant if only one type of item is retained.]
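One simple classical index of discrimination (one of several possibilities; the module does not prescribe a particular formula) is the difference between the proportions correct in high-scoring and low-scoring criterion groups. A minimal sketch, using item 1 of the worked example that follows:

```python
def discrimination_high_low(high_scores, low_scores):
    """Difference in proportion correct between high and low criterion groups.

    high_scores and low_scores are lists of 0/1 scores on one item for the
    candidates in each group. Values well above zero indicate positive
    discrimination; values near zero or below flag a problem item.
    """
    p_high = sum(high_scores) / len(high_scores)
    p_low = sum(low_scores) / len(low_scores)
    return p_high - p_low

# Item 1 from the Figure 5 example: High group 6/6 correct, Low group 4/6 correct.
print(round(discrimination_high_low([1, 1, 1, 1, 1, 1], [0, 1, 0, 1, 1, 1]), 2))  # 0.33
```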

26
Analysis in terms of candidate responses

Introduction to test analysis strategies


In this introduction, you will analyze a set of data by hand. Scores
on a test are usually obtained by adding scores on each of the
tasks. The validity of adding task scores depends upon the tasks
belonging to some dimension that makes sense. Indeed, if those
who tend to be good at one task do not tend to be good at another
similar task, we question whether both tasks are assessing similar
qualities. If the tasks are not assessing similar qualities we have no
logical reason for adding the separate task scores together. If success
on a task tends to be consistent with success on other tasks, we may
infer that it is legitimate to add scores from each task, and that we
are able to give meaning to scores on the resulting scale.

A data set is shown below in Figure 5. The student identification


numbers are shown across the top of the columns. The item
numbers are shown down the left hand side. (This layout is
appropriate where each student answer strip is overlapped with
other answer strips for an analysis by hand. Later in this module
a different layout, appropriate for computer analysis of test items,
will be used). Each column of non-bold numerals represents the
responses of one student. The correct answers are shown by a 1;
incorrect answers are shown by a zero. For example, student 18
was correct on the first 3 items and incorrect on the fourth item.
Adding 1 mark for each correct item down the page gives the
score obtained by a person. For example, student 7 has 10 correct
responses. Adding 1 mark for each item across the page gives the
score obtained by an item. For example, 5 students were correct on
item 20.
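The same row and column additions can be done in a few lines of code once the responses are held as a 0/1 matrix. In the sketch below the matrix is stored item-by-row, as in Figure 5, so summing across a row gives an item score and summing down a column gives a person score (only the first three items of Figure 5 are typed in, to keep the example short).

```python
# First three rows (items 1-3) of the Figure 5 matrix: 18 students per row.
data = [
    [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # item 1
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # item 2
    [0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1],  # item 3
]

item_scores = [sum(row) for row in data]                                      # across the page
person_scores = [sum(row[s] for row in data) for s in range(len(data[0]))]   # down the page

print(item_scores)     # [15, 14, 14]
print(person_scores)   # each student's score on these three items
```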

Test analysis investigates the patterns of responses for both persons


and items. Some of the techniques will be illustrated initially with
this set of data (see Figure 5).

© UNESCO 27
Module 7 Trial testing and item analysis in test construction

Figure 5. Matrix of student data on a twenty-one item test

Students
Items 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 15
2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 14
3 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 14
4 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 11
5 1 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 13
6 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 11
7 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 1 1 8
8 0 0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 13
9 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 1 1 1 9
10 0 1 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 7
11 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 1 0 7
12 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 6
13 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 7
14 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 1 11
15 1 0 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 1 12
16 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 1 1 8
17 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 8
18 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 6
19 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 4
20 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 5
21 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 0 0 10
3 4 5 7 7 9 10 10 12 12 14 14 14 14 15 15 17 17 199

28
Analysis in terms of candidate responses

When the data were entered into the table, the data for the student
with the lowest score were entered first, then the student with the
next lowest score, and so on.

In Figure 6, the position of the rows (item scores) has been altered so
that the easiest item is at the top of the matrix and the other rows
are arranged in descending order. Notice that the top right corner
of the matrix has mostly entries of 1s, and the lower left corner has
mostly entries of 0s.

In Figure 6, the students have been assigned to 3 (equal) groups. The


highest 6 scorers will be called the High group; the lowest 6 scorers
will be called the Low group; the Middle group of 6 has been shown
underlined. (Note that to form three groups of equal size, the middle
group has some students with the same score as students in the
high group).

We can investigate the patterns of success for each item (in an


approximate way) by graphing the success rate of the Low group
and the corresponding success rate of the High group. (We will
ignore the Middle group for the moment). (See Figure 6).
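The grouping and counting just described can also be expressed directly in code. The sketch below assumes the students are already ordered from lowest to highest total score, as in Figure 5, and that the number of students divides evenly into three groups.

```python
def group_success_counts(item_row):
    """Split an ordered row of 0/1 responses into three equal groups and count successes."""
    n = len(item_row)
    third = n // 3                       # assumes n is divisible by 3, as in this example
    low = sum(item_row[:third])
    middle = sum(item_row[third:2 * third])
    high = sum(item_row[2 * third:])
    return low, middle, high

# Item 1 of Figure 5: students already ordered by total score (lowest first).
item_1 = [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(group_success_counts(item_1))     # (4, 5, 6)
```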

© UNESCO 29
Module 7 Trial testing and item analysis in test construction

Figure 6. Students divided into three groups according to score

Students
Items 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 15
2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 14
3 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 14
5 1 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 13
8 0 0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 13
15 1 0 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 1 12
4 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 11
6 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 11
14 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 1 11
21 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 0 0 0 10
9 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 1 1 1 9
16 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 1 1 8
17 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 8
7 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 1 1 8
13 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 7
10 0 1 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 7
11 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 1 0 7
12 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 6
18 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 6
20 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 5
19 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 4
3 4 5 7 7 9 10 10 12 12 14 14 14 14 15 15 17 17 199

30
Analysis in terms of candidate responses

Consider item 1 (with the data as shown in Figure 7).

Figure 7. Responses for item 1

Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 15

The Low group has 4 successes; the High group has 6 successes.
You can draw a graph like the one shown in Figure 8 for item 1.

Figure 8. Correct answer responses for item 1

[Bar chart: number of correct responses on item 1 (vertical axis, 1 to 6) for the Low, Middle and High groups; the Low group bar is at 4 and the High group bar is at 6.]

Next consider item 4 (Figure 9)

Figure 9. Responses for item 4

Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
4 0 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 11

The Low group has 2 successes; the High group has 4 successes.
You can draw the graph like the one shown in Figure 10.

© UNESCO 31
Module 7 Trial testing and item analysis in test construction

Figure 10. Correct answer responses for item 4

[Bar chart: number of correct responses on item 4 (vertical axis, 1 to 6) for the Low, Middle and High groups; the Low group bar is at 2 and the High group bar is at 4.]

Note that in each case, although the actual numbers differ, the low
group had less success than the high group. This is the expected
pattern for correct answers if the item measures the same skills as
the whole test. Now look at the pattern for item 19 (Figure 11).

Figure 11. Responses for item 19

Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 4

The Low group has 1 success; the High group has 1 success. You
can draw a graph like the one shown in Figure 12.

Figure 12. Correct answer responses for item 19

[Bar chart: number of correct responses on item 19 (vertical axis, 1 to 6) for the Low, Middle and High groups; both the Low and High group bars are at 1.]

32
Analysis in terms of candidate responses

In this case the columns are equal. If these data were from a larger
sample and gave this pattern, we could conclude that item 19 was
not consistent with the rest of the test. Further, if the low group
did better than the high group we would think that there was
something wrong with the item, or that it was measuring something
different, or that the answer key was wrong. Test analysis can
identify a problem with an item but the person doing the analysis
has to work out why this is so.

Look again at the graphs for correct answers for items 1, 4, and 19
(as shown in Figure 13 below). Trend lines have been added. Items
performing as expected have a rising slope from left to right for
the correct answers. Item 19 does not show a rise; the data for this
item show no evidence that the item distinguishes between those
who are able and those who are not (where the criterion groups are
determined from scores on the test as a whole). For item 21 (Figure
13 below) there is evidence that this item distinguishes between
those who are able and those who are not (as determined from the
test as a whole) but not in the expected direction. Those who are less
able are better on this item than those who are more able. It may
be that the score key has the wrong ‘correct’ answer, that the item
is testing something different from the other items, that the better
candidates were taught the wrong information, and/or only the
weaker candidates were taught the topic because it was assumed
(incorrectly) that able students already knew the work. Item analysis
does not tell you which fault applies. You have to speculate on
possible reasons and then make an informed judgment.

© UNESCO 33
Module 7 Trial testing and item analysis in test construction

Figure 13. Correct answer responses for items 1, 4, 19, and 21

[Four bar charts, one for each of items 1, 4, 19 and 21, showing the number of correct responses (vertical axis, 1 to 6) for the Low, Middle and High groups, with trend lines added. Items 1 and 4 show trend lines rising from Low to High, item 19 shows a flat trend line, and item 21 shows a falling trend line.]

34
Analysis in terms of candidate responses

Items with correct answer patterns like items 1 and 4 distinguish


between those who are low scorers on the test as a whole, and those
who are high scorers. Such items are called positive discriminators;
the gradient of the trend line is positive.

Items with patterns like item 19 fail to distinguish between high


and low scorers. Such items are called non-discriminators; the
gradient of the trend line is zero or close to zero.

Items with patterns like item 21 also distinguish between high


and low scores but in the wrong direction. Such items are called
negative discriminators; the gradient of the trend line is negative.

Now we return to considering the middle group. (If the middle


group was not exactly the same size as the low and high groups, we
would plot the proportion of candidates in each group). For item 1
there were 5 successes in the middle group; for item 4 there were 5
successes, and for item 19 there were 2 successes. For item 21 there
were 3 successes.

The additional information provided by the middle group data


allows us to consider how well the item distinguishes between the
Low and Middle groups, and between the Middle and High groups.
Items which perform as expected will have a correct answer option
graph with positive discrimination. Items which do not perform as
expected have a correct answer option graph with zero or negative
discrimination, or have a correct answer option graph with positive
discrimination in one part and not in another part.
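A rough mechanical classification along these lines is sketched below; the flatness tolerance is an arbitrary illustrative choice, and real decisions should still rest on inspecting the item and its graph.

```python
def classify_item(p_low, p_middle, p_high, flat_tolerance=0.05):
    """Rough classification of an item from group success proportions (0.0 to 1.0)."""
    gradient = p_high - p_low
    if abs(gradient) <= flat_tolerance:
        return "non-discriminator"
    if gradient > 0 and p_middle >= p_low - flat_tolerance and p_high >= p_middle - flat_tolerance:
        return "positive discriminator"
    if gradient < 0:
        return "negative discriminator"
    return "check item: erratic pattern"

# Items 1, 19 and 21 of the Figure 5 example (six candidates per group;
# the Low and High counts for item 21 are read off Figure 5).
print(classify_item(4/6, 5/6, 6/6))   # positive discriminator
print(classify_item(1/6, 2/6, 1/6))   # non-discriminator
print(classify_item(5/6, 3/6, 2/6))   # negative discriminator
```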

The graphs in Figure 14 illustrate some items with correct response


patterns where caution must be exercised. The initial information
based only on the high and low groups suggested that items 1 and
4 were acceptable but consideration of middle group responses
showed that item 4 was problematic.

© UNESCO 35
Module 7 Trial testing and item analysis in test construction

Figure 14. Correct answer responses for L, M and H groups on items 1, 4, 19, and 21

[Four bar charts for items 1, 4, 19 and 21 showing the number of correct responses (vertical axis, 1 to 6) for the Low, Middle and High groups, with the Middle group now included (item 1: 5 correct, item 4: 5, item 19: 2, item 21: 3).]

36
Analysis in terms of candidate responses

Figures 15 and 16 show some possible patterns from analyses. Figure


15 shows one pattern that is acceptable and two patterns that may
be acceptable, and Figure 16 shows patterns that raise concern.

Figure 15. Acceptable and may be acceptable correct answer response patterns

[Three panels labelled Acceptable, Ceiling effect and Floor effect, each plotting the success rate for the Low, Middle and High groups.]

In each of these patterns, success rate improves with ability. The


middle graph represents easy items where there is no evidence
of the item distinguishing between middle and higher groups.
The item may be acceptable but may not be; we reserve judgment
until further evidence is obtained. Such an item is said to have a
ceiling effect; the high group cannot distinguish their achievement
from the middle group because the item was so easy and a trend
line cannot go beyond 100 per cent correct. The right-hand graph
represents difficult items where there is no evidence of the item
distinguishing between lower and middle groups. Such an item is
said to have a floor effect; the middle group cannot distinguish their
achievement from the low group because the item was so difficult
and a trend line cannot show less than zero percent correct.

© UNESCO 37
Module 7 Trial testing and item analysis in test construction

Figure 16. Unacceptable correct answer response patterns

[Three panels labelled Not acceptable, Erratic and Erratic, each plotting the success rate for the Low, Middle and High groups.]

The patterns for correct responses are summarised diagrammatically


in Figure 17.

Figure 17. Patterns for correct responses

[Seven small trend-line sketches for correct responses across the Low, Middle and High groups, labelled from left to right: OK, ?, ?, ?, ?, OK?, OK?]

38
Analysis in terms of candidate responses

Doing an item analysis ‘by hand’


These relatively simple ideas can form the basis for understanding
item analysis. We will now look at some of these ideas with another
data set. Doing an analysis by hand may take a longer time but it
will help you understand the analysis process. (It is more efficient to
let the computer do the analysis provided that you know what you
are doing).

The data for analysis are shown below (Figure 18). In this figure
the candidates are listed in the left column. Each row shows
the responses to the items. Acceptable responses (correct and
incorrect) are 1, 2, 3, 4, 5 and 6. The first five acceptable responses
are multiple-choice options for each item. (In this example the
responses have been entered as numerals, but they could have been
entered as letters such as A, B, C, D, E and F). The key (the list of
correct answers in the correct order for this test) is supplied at the
bottom of the response data. The 6 indicates that the question was
omitted, but candidates had sufficient time to attempt all items. To
help line up columns, the last two lines show the item numbers.

On a copy of this table of data, use a coloured pencil to highlight


each correct answer. For example, in the first column after the
candidate identification code (item 1), each 5 should be highlighted.
Other responses such as 6 and 2 should not be highlighted. Repeat
this procedure for each item in turn. Then count the number of
highlighted numerals to obtain a total score for each candidate;
write each total at the right-hand end of each row. Then count the
number of highlighted numerals to obtain a total score for each
item; write each total at the bottom of the column for that item.

© UNESCO 39
Module 7 Trial testing and item analysis in test construction

Count the number of candidates. Use the candidate totals to


identify the top one-third; mark these to show they are in the High
group. Identify the bottom one-third; mark these to show they are
in the Low group. Identify the middle one-third; mark these to show
they are in the Middle group. (You might find it useful to cut up
your piece of paper into rows with one candidate’s results to a row.
Then paste each slip of paper in order of total score).

X03, X08, X19, X21, X22, X24, X26, X14, and X15 will be in the high
group; X05, X25, X09, X06, X12, X13, X20, X11, and X23 will be in
the middle group; and X16, X27, X07, X01, X02, X04, X10, X17, and
X18 will be in the lower group.

Make some tables like Figure 19. Use one table for each item. Taking
each item in turn, count how many from the High group chose 1,
how many chose 2, how many chose 3, and so on. As you complete
each count, write the result in your table for that item.
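
For readers who want to check their hand tally, the following short Python
sketch carries out the same steps. It is only an illustration (it is not part of
the module's procedures, and only two candidates' response strings are typed in
here); the variable names and the idea of storing each candidate's responses as
a 30-character string are assumptions made for the example.

# A sketch of the 'by hand' tally, assuming the Figure 18 responses have been
# typed in as strings of 30 characters (this excerpt shows only two candidates).

key = "524255543514255233354541424212"   # correct options for items 1-30 (Figure 18)

responses = {
    "X01": "624236543513232233524214321222",
    "X02": "524225513414251132995519313222",
    # ... type in the remaining candidates X03 to X27 from Figure 18 ...
}

# 1. Score each candidate: one mark for each response that matches the key.
totals = {cand: sum(r == k for r, k in zip(resp, key))
          for cand, resp in responses.items()}

# 2. Sort candidates by total score and split them into Low, Middle and High thirds.
ranked = sorted(totals, key=totals.get)
third = len(ranked) // 3
groups = {"Low": ranked[:third], "Middle": ranked[third:2 * third],
          "High": ranked[2 * third:]}

# 3. For a chosen item, count how many candidates in each group chose each option
#    (this reproduces one row of the tables in Figures 19 and 20).
item = 1                                   # item number, 1-based
for name, members in groups.items():
    counts = {}
    for cand in members:
        option = responses[cand][item - 1]
        counts[option] = counts.get(option, 0) + 1
    print(name, counts)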


Figure 18. Responses on a multiple-choice test of 30 items

X01 6 2 4 2 3 6 5 4 3 5 1 3 2 3 2 2 3 3 5 2 4 2 1 4 3 2 1 2 2 2
X02 5 2 4 2 2 5 5 1 3 4 1 4 2 5 1 1 3 2 9 9 5 5 1 9 3 1 3 2 2 2
X03 5 2 4 2 5 5 5 4 3 1 1 4 5 5 5 2 3 3 3 5 4 5 4 1 4 3 9 2 1 2
X04 2 2 4 1 2 5 1 5 3 1 1 1 2 5 5 2 3 5 5 4 2 5 5 4 3 2 2 2 1 2
X05 5 2 4 3 4 5 5 5 3 5 1 4 2 5 3 2 3 3 5 5 4 5 4 2 4 2 3 2 3 3
X06 5 2 4 2 3 9 1 1 3 1 1 4 2 5 4 2 3 3 5 4 4 5 4 1 3 2 2 2 1 2
X07 5 2 4 1 2 1 5 1 3 1 1 4 2 5 5 3 3 3 3 1 1 5 4 2 1 4 3 2 4 2
X08 5 2 4 2 5 5 5 4 3 5 2 4 2 5 5 2 3 3 3 5 4 3 4 1 5 4 5 2 1 2
X09 5 2 4 1 5 1 2 4 3 2 1 4 2 5 5 2 1 1 5 5 4 5 4 3 4 3 3 2 1 2
X10 5 1 4 1 5 1 2 4 3 1 3 4 2 3 2 1 3 3 4 1 5 5 2 1 3 2 4 3 1 2
X11 5 2 4 1 5 1 5 2 3 2 5 3 2 2 5 2 1 3 3 5 4 5 5 1 1 2 3 2 1 2
X12 5 2 4 2 5 5 1 4 3 1 1 4 2 5 5 1 3 3 5 5 4 2 4 2 1 4 1 4 1 2
X13 5 2 4 2 2 5 5 5 9 1 1 4 2 5 5 2 3 3 3 9 9 5 9 9 9 2 9 2 1 2
X14 5 2 4 2 2 5 5 4 3 5 1 4 2 5 5 2 3 3 4 9 4 5 4 9 9 9 1 2 1 3
X15 5 2 4 2 2 5 5 2 3 5 1 4 2 5 5 2 3 3 5 9 4 5 1 4 1 2 2 2 1 2
X16 5 2 4 2 9 5 5 5 3 4 1 4 2 5 5 9 3 5 5 9 4 5 4 9 1 9 9 2 1 2
X17 5 2 4 5 3 2 5 2 3 5 5 1 2 5 5 5 1 9 5 5 4 5 5 3 9 9 9 2 1 2
X18 5 2 4 1 1 5 1 4 4 5 1 4 2 3 2 1 3 3 5 9 9 2 5 1 2 2 2 4 1 1
X19 5 2 4 2 5 5 5 3 3 1 1 4 2 5 5 2 3 3 3 9 4 5 4 1 9 9 4 2 1 3
X20 5 2 4 2 4 5 5 2 5 9 1 4 2 5 5 1 3 3 3 9 4 5 4 9 9 3 9 2 1 2
X21 5 2 4 2 5 1 5 9 3 5 1 4 2 5 5 2 9 3 3 5 4 9 4 9 4 9 9 2 1 2
X22 2 2 4 2 5 5 5 4 3 5 1 4 2 5 5 2 3 3 3 1 4 2 1 1 5 2 2 2 1 2
X23 5 2 4 2 5 5 5 4 3 9 1 4 2 2 4 1 1 3 2 4 4 5 3 1 4 3 3 2 2 2
X24 5 2 4 2 5 5 5 4 3 1 1 4 2 1 5 2 3 3 5 2 4 5 4 1 1 3 5 2 1 2
X25 5 2 4 2 2 5 2 9 3 5 2 4 2 5 5 2 3 3 3 9 4 5 4 9 3 9 5 2 1 2
X26 5 2 4 2 5 3 5 4 5 5 1 4 2 5 5 2 2 3 2 3 4 5 4 3 4 3 2 2 1 2
X27 5 1 4 2 2 5 2 4 2 3 1 4 2 1 5 2 3 3 1 3 2 5 4 3 4 4 4 2 2 2

Key 5 2 4 2 5 5 5 4 3 5 1 4 2 5 5 2 3 3 3 5 4 5 4 1 4 2 4 2 1 2
Item                      1                   2                   3
Num  1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0


Figure 19. Blank data table for an item

Item No. —        Option
                  1     2     3     4     5     Other    Total
H                 —     —     —     —     —     —        —
M                 —     —     —     —     —     —        —
L                 —     —     —     —     —     —        —
Total             —     —     —     —     —     —        —

Figure 20. Data table for item 1

Item No. 1        Option
                  1     2     3     4     *5    Other    Total
H                 —     1     —     —      8    —         9
M                 —     —     —     —      9    —         9
L                 —     1     —     —      7    1         9
Total             —     2     —     —     24    1        27

Item 1 has been completed (Figure 20) to show you how the results
are recorded. The * indicates the option that was keyed as correct.

This processing of the data has to be accurate. If several people are


working together on this analysis, each person may process a sub-
set of items.


When all items have a completed table of data, the information for
the keyed responses can be graphed. A graph for item 1 is shown
below (Figure 21), together with a blank graph (Figure 22). These
graphs can be compared with those in Figure 17.

Figure 21. Graph for item 1

[Item 1: the number of correct responses (vertical axis, 1 to 10) plotted for the
Low, Middle and High groups, using the counts from Figure 20.]


Figure 22. Blank graph for an item

[Blank graph for completing by hand: vertical axis 1 to 10 correct responses,
horizontal axis Low, Middle and High groups.]

Item analysis approaches using the computer 9
There are two main types of approaches to item analysis used
extensively in test research and development organizations.
Some use one approach, some use the other, and some use both
approaches in conjunction with each other. In this module the
earlier approach will be called the Classical (or traditional) item
analysis, and the more recent approach will be called Item Response
Modelling.

The first step in an item analysis is to choose an appropriate


criterion measure, which can be used to make judgments
concerning whether an item discriminates between better
performing students and poorer performing students. Many test
research and development agencies assume that the total score on
the test is the best criterion measure available. Criterion groups
are set up on the basis of total scores on the test and each item’s
correlation with the total score is reported. (Note that there is a
built-in spurious correlation here because each item is included in
the total score. With tests of 20 items or more, the effect of the item
contribution is ignored in practice).

The older classical approach to item analysis seeks to identify


items which do not distinguish between high and low scorers
in a similar way to a criterion measure. The extent of agreement
between the item and the criterion measure in ordering the
candidates is reported as a correlation coefficient, often the point-
biserial correlation coefficient. (The phi coefficient, often estimated


by a graphical means, was used widely before the use of personal


computers became common). These correlation indices range
between -1 through 0 to +1.

Zero or low correlations and negative correlations identify items


to be queried, discarded, revised, or replaced. High positive
correlations identify items to be retained. The degree of success
or failure for a particular item is usually measured either by the
percentage of candidates correct or by the percentage of candidates
incorrect. Both percentages have been called the difficulty of the item
in various textbooks on measurement. For consistency and to avoid
confusion, percentage correct should be called facility and percentage
incorrect should be called difficulty.
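
The indices described above are easy to compute directly. The sketch below is an
illustrative Python fragment (not taken from the module, and using invented
data) that calculates the facility, the difficulty, and the point-biserial
correlation between success on one item and the total test score.

# Facility, difficulty, and point-biserial discrimination for a single item.
# item_correct holds 1/0 for each candidate on the item; totals holds each
# candidate's total test score.  The values here are illustrative only.

from math import sqrt

item_correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]
totals       = [24, 22, 15, 20, 14, 25, 21, 13, 19, 23, 18, 16]

n = len(item_correct)
facility   = 100.0 * sum(item_correct) / n      # percentage correct
difficulty = 100.0 - facility                    # percentage incorrect

# The point-biserial is the Pearson correlation between the 0/1 item score
# and the total test score.
mean_x = sum(item_correct) / n
mean_y = sum(totals) / n
cov  = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_correct, totals)) / n
sd_x = sqrt(sum((x - mean_x) ** 2 for x in item_correct) / n)
sd_y = sqrt(sum((y - mean_y) ** 2 for y in totals) / n)
point_biserial = cov / (sd_x * sd_y)

print(f"facility = {facility:.1f}%, difficulty = {difficulty:.1f}%, "
      f"point-biserial = {point_biserial:.2f}")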

The item response modelling approach to item analysis also seeks


to identify items which do not distinguish between high and low
scorers in a similar way to a criterion measure. However, this
approach takes a more detailed look at the capacity of the item to
distinguish between other subsets of the scorers. For example, to
distinguish between low and middle scorers, and between middle
and high scorers. Items are assigned a position on a scaled difficulty
continuum from easiest to most difficult.

Candidates are assigned a position on a scaled ability or


achievement continuum in the same metric as the item difficulty
continuum. High achievers among the candidates and difficult items
on the test are near the top end of the continuum; low achievers and
easy items are near the bottom end of the continuum.

The actual pattern of responses resulting from the interactions of


items with candidates is compared with a model pattern consistent
with the observed marginal totals. The extent of agreement between
the observed pattern and the model in ordering both the candidates
and the items is reported in terms of fit statistics. Candidates and
items with unusual patterns in the correct responses are identified


to be queried. Items may be discarded, revised, or replaced.


Explanations are sought for unusual candidate patterns. There
are several separate variations within item response modelling
(sometimes known as Item Response Theory or IRT). In this
module, only one of these variations will be used, the Rasch model
(named after the Danish statistician who published his research
findings in 1960).
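
For readers who wish to see the model written out, the probability of a correct
response under the Rasch model depends only on the difference between the
candidate's ability and the item's difficulty, both expressed in the same
(logit) metric. The sketch below is a minimal illustration; the actual
estimation of abilities and difficulties in this module is done with the QUEST
program.

# Illustrative sketch of the Rasch model: the probability that a candidate of
# ability 'theta' answers an item of difficulty 'b' correctly (both on the same
# logit scale, as in the variable map of Figure 31).

from math import exp

def rasch_probability(theta, b):
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return exp(theta - b) / (1.0 + exp(theta - b))

# When ability equals difficulty the probability is 0.5 (the item's threshold).
print(rasch_probability(0.0, 0.0))    # 0.5
print(rasch_probability(1.0, -1.38))  # an able candidate on an easy item -> high
print(rasch_probability(-1.0, 1.20))  # a weak candidate on a hard item  -> low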

We now look at these two types of analysis in turn, compare the


approaches, showing where they agree on item quality and where
they differ.

Classical strategies for item analysis


The high group/low group procedures used in the analysis of
data by hand in the section on the introduction to item analysis
strategies above are simplified examples of classical item analysis.
Personal computers have made the task of scoring the test,
counting the cases, calculating the percentages, and calculating
the correlations between success on items and total score, easier,
particularly for multiple-choice tests. The discussion of the clerical
approach presented below has used the ITEMAN computer
program to analyze the data presented in Figure 18.

The first part of the computer output from a traditional test analysis
report for a multiple-choice test might look like Figure 23.


Figure 23. Classical item analysis for data in Figure 18

MicroCAT (tm) Testing System


Copyright (c) 1982, 1984, 1986, 1988 by Assessment Systems Corporation

Item and Test Analysis Program – ITEMAN (tm) Version 3.00

Item analysis for data from file iiepitm.dat Page 1

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

1 0-1 0.889 0.264 0.159 1 0.000 -9.000 -9.000


2 0.074 -0.051 -0.027

3 0.000 -9.000 -9.000

4 0.000 -9.000 -9.000

5 0.889 0.264 0.159 *


Other 0.037 -0.531 -0.227

The interpretation of this part of the printout is now described.

Each item discrimination (the measure of the extent of agreement


between success on the item and success on the test as a whole) is
shown opposite the item number. For example, the discrimination
for item 1 is shown as Point Biser. = 0.159. (Usually reported as
0.16). The Alt. column shows the options. The first category can be 1
or A, the second 2 or B, and so on. The correct answer is shown
by * . The Prop. Endorsing column shows the proportion of
candidates who chose each option. The Point Biser. (Point biserial
correlation coefficient) statistic shows the extent of agreement
between the option and the test as a whole. The Biser. (biserial


correlation coefficient) statistic provides another statistic which also


shows the extent of agreement between the option and the test as a
whole. (Note that values of the point biserial correlation coefficient
tend to be smaller in magnitude than if the same data are analyzed
using the biserial correlation coefficient).

(-9.000 means Not Applicable. The extent of agreement cannot be


calculated where no candidate has chosen an option). The last
option (Other) indicates missing data – that is, no response at all.

The analysis for item 23 is shown in Figure 24. This item has
many good qualities. It is in an appropriate range of difficulty (the
proportion correct was 0.593) and those who were incorrect are
spread over each of the other options. The ‘correct’ option has a
substantial positive agreement (0.549) with the test as a whole.
All of the ‘incorrect’ options have negative agreements with the test as a
whole: option 1, -0.072; option 2, -0.287; option 3, -0.049; and option 5, -0.515.

Figure 24. Analysis results for item 23

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

23 0-23 0.593 0.695 0.549 1 0.148 -0.148 -0.072


2 0.037 -0.670 -0.287

3 0.037 -0.113 -0.049

4 0.593 0.695 0.549 *

5 0.148 -0.792 -0.515


Other 0.037 0.026 0.011


By contrast, item 27 (Figure 25) has a pattern of results suggesting


that either the item has been mis-keyed, or the candidates have
been taught incorrect information.

Figure 25. Analysis results for item 27

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

27 0-27 0.111 -0.145 -0.088 1 0.111 -0.086 -0.052


2 0.222 -0.071 -0.051
CHECK THE KEY 3 0.222 -0.298 -0.214
4 was specified,
5 works better 4 0.111 -0.145 -0.088 *

5 0.111 0.568 0.342


Other 0.222 0.155 0.111

The test analysis program has identified option 5 as a more likely


correct answer (because the measure of agreement for that option
is more positive than the keyed option). Note that the keyed option
has a negative agreement (-0.088) with the test as a whole, while
option 5 has a positive agreement (0.342). Either the item key is
correct and a substantial proportion of the better candidates are
misinformed, or the item key is incorrect. If an error in the item key is
found, it must be corrected and the analysis must be done again.

Item 25 (Figure 26) is similar to item 27, but identifying the problem
with the item may be difficult. The keyed option does have a
positive agreement (0.265) with the test as a whole. However other
options also have positive agreements (0.030 and 0.403). The test


analysis program has identified the largest positive agreement as


a likely correct answer. However this type of pattern may occur
when there is more than one correct answer. For item 25, it appears
that the best correct answer may be option 5 and that option 4
may be another correct answer (that is, if mis-information is not a
feasible explanation). Test construction experts often suggest that
amendment is required so that there is only one correct answer for
an item. If an error in the item key is found, it must be corrected and the
analysis must be done again.

Figure 26. Analysis results for item 25

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

25 0-25 0.259 0.358 0.265 1 0.222 0.042 0.030


2 0.037 -0.809 -0.347
CHECK THE KEY 3 0.222 -0.639 -0.457
4 was specified,
5 works better 4 0.259 0.358 0.265 *

5 0.074 0.753 0.403


Other 0.185 0.081 0.056

Sometimes an item has some options which work and some which
contribute nothing to distinguishing between those who have
knowledge and those who do not. In Item 7 (Figure 27), options 3
and 4 were not endorsed by any person, and no index of agreement
with the test as a whole could be calculated (as shown by -9.000). In
effect, only part of this item has worked; those who constructed the
item need to provide two more attractive options.


Figure 27. Analysis results for item 7

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

7 0-7 0.704 0.505 0.383 1 0.148 -0.500 -0.325


2 0.148 -0.256 -0.167

3 0.000 -9.000 -9.000

4 0.000 -9.000 -9.000

5 0.704 0.505 0.383 *


Other 0.000 -9.000 -9.000

In some cases, the item may have more than one fault. For example,
item 13 (Figure 28) appears to be mis-keyed (or the better candidates
are mis-informed) and some of the options do not attract.

Figure 28. Analysis results for item 13

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

13 0-13 0.963 -0.861 -0.369 1 0.000 -9.000 -9.000


2 0.963 -0.861 -0.369 *
CHECK THE KEY 3 0.000 -9.000 -9.000
2 was specified,
5 works better 4 0.000 -9.000 -9.000

5 0.037 0.861 0.369


Other 0.000 -9.000 -9.000


Item 3 (Figure 29) is an example of an item which every candidate


can do successfully. For this group of candidates, there is no
evidence that this item is useful in distinguishing between able and
less able candidates.

Figure 29. Analysis results for item 3

Item Statistics Alternative Statistics

Seq. Scale Prop. Point Prop. Point


No. -Item Correct Biser. Biser. Alt. Endorsing Biser. Biser. Key

3 0-3 1.000 -9.000 -9.000 1 0.000 -9.000 -9.000


2 0.000 -9.000 -9.000

3 0.000 -9.000 -9.000

4 1.000 -9.000 -9.000 *

5 0.000 -9.000 -9.000


Other 0.000 -9.000 -9.000

The next section provides a brief summary of the key aspects to


consider when evaluating a set of test items.


Deciding whether an item is useful after trial with real candidates (classical analysis)
The steps are:

1. Find the correct option.


This is indicated in the Key column with the *.

2. Is the agreement index (Point-Biserial) positive?


If Yes, continue to 3;
If No, this is an unexpected result! Check why! Probably you need
to change or reject the item. Check that the score key is correct!

In practice, some positive agreement index values are small.


Some are so small as to be effectively zero. The position of the
cut-off between zero and non-zero index values depends upon
the size of the candidate group. With a candidate group of 60,
values less than about 0.249 are traditionally regarded as zero.
The corresponding approximate values for larger group sizes are
80 (0.217), 100 (0.194), 120 (0.178), 140 (0.165), 160 (0.154), 180
(0.145), and 200 (0.138). If your trial test involved 200 candidates,
then items with a correct-option point-biserial index of less than
0.138 would be rejected.

[Some classical test analysis programs provide a probability


value associated with each option (sometimes called a p-value).
The p-value shows the probability of the agreement index value
occurring by chance. If the probability is higher than a chosen
value, we treat the correlation as approximately zero. Traditional
chosen values for the cut-off between ‘zero’ and ‘acceptable’ are
p=0.05, p=0.01, and p=0.001. For p=0.05 we take a risk that for
1 in 20 cases we may accept an item as in agreement when it is
only a chance agreement. For p=0.01, the risk is 1 in 100 and for
p=0.001 the risk is 1 in 1000.


More conservative risk values result in more items being


rejected. Many test developers use the correlation associated
with p=0.05 as the cut-off; a lower point-biserial value (that is
a higher p value) leads to rejection of the item. If the program
provides a p-value then an additional question is asked: Is the
p-value 0.05 or less? If Yes, continue to 3; If No, probably change
or reject the item.]

3. Are the wrong option agreements negative?


If Yes, keep the item and continue to 4;
If No, consider each wrong option in turn. If an incorrect option
has a positive correlation about the same as the correct option
or higher, check the score key (this option may be an alternative
correct answer that has not been credited as such). Options that
are not chosen by any candidate are often replaced and the item
is then retained for further trial. If there is no serious problem,
keep the item and continue to 4; otherwise change or reject the
item.

4. Assembling final forms of the test


We consider the position where the item belongs in the test
specification table and the difficulty of the item. Each cell in the
test specification table should have several discriminating items
and a range of difficulties.
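
The logic of steps 1 to 3 can be written as a small screening routine. The
sketch below is only an illustration (the function name and structure are
invented); the cut-off is approximated by 1.96 divided by the square root of
the number of candidates, which reproduces the tabled values closely (for
example, 0.139 for 200 candidates and 0.196 for 100), and the final decision
always rests with the test constructor. The two example calls use the values
reported for items 23 and 27 in Figures 24 and 25.

# Illustrative screening of one item after classical analysis (assumed names).
# correct_pb:   point-biserial of the keyed (correct) option
# wrong_pbs:    point-biserials of the other options
# n_candidates: number of candidates in the trial

from math import sqrt

def screen_item(correct_pb, wrong_pbs, n_candidates):
    cutoff = 1.96 / sqrt(n_candidates)   # approximates the tabled cut-offs above
    if correct_pb <= 0:
        return "check the key; probably change or reject"
    if correct_pb < cutoff:
        return "discrimination too close to zero; change or reject"
    if any(pb >= correct_pb for pb in wrong_pbs):
        return "a wrong option works as well as the key; check the key"
    return "keep; place the item in the test specification grid"

print(screen_item(0.549, [-0.072, -0.287, -0.049, -0.515], 27))  # item 23 -> keep
print(screen_item(-0.088, [-0.052, -0.051, -0.214, 0.342], 27))  # item 27 -> check the key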

The other test items are considered in the same way. The final page
of the ITEMAN test analysis looks like the information in Figure 30.
Comments on the printout have been added.


Figure 30. Classical item analysis summary statistics

MicroCAT (tm) Testing System


Copyright (c) 1982, 1984, 1986, 1988 by Assessment Systems Corporation
Item and Test Analysis Program – ITEMAN (tm) Version 3.00
Item analysis for data from file iiepitm.dat Page 7
There were 27 examinees in the data file.

Scale Statistics
Scale: 0 <-- This is the scale identification code.
N of items 30 <-- The number of items on this scale.
N of Examinees 27 <-- The number of candidates.
Mean 19.815 <-- The mean (or average) for this group of 27 persons (on 30 questions).
Variance 10.818 <-- A measure of spread of test scores for these candidates.
Std. Dev. 3.289 <-- Another measure of spread of test scores for these candidates.
(The standard deviation is the square root of the variance.)
Skew -0.111 <-- This index summarises the extent of symmetry in the distribution of
candidates' scores. A symmetrical distribution has a skewness of 0;
negative values indicate more high scores than low scores and positive
values indicate more low scores than high scores.
Kurtosis -0.893 <-- This index compares the distribution of candidate scores with a
particular mathematical distribution of scores known as the
Normal or Gaussian distribution. Positive values indicate a more
peaked distribution than the specified distribution; negative
values indicate a flatter distribution
Minimum 14.000 <-- This is the lowest candidate score in this group.
Maximum 26.000 <-- This is the highest candidate score in this group.
Median 20.000 <-- This is the middle score when all candidates' scores in this group
are arranged in order.
Alpha 0.543 <-- This index indicates how similar the questions are to each other.
The lowest value is 0.0 and the highest is 1.0. Provided that
candidates had ample time to complete each item, higher values
indicate greater internal consistency in the items.
(See Test Reliability below).
SEM 2.224 <-- We use this index to estimate how much the scores might change
if we gave the same test to the same candidates on several occasions
(See Test Reliability below).
Mean P 0.660 <-- This is the average proportion correct for these items with these
candidates.
Mean Item-Tot. 0.254 <-- This is the average point biserial correlation for these items.
Mean Biserial 0.338 <-- This is the average biserial correlation for these items.


Test reliability
The term validity refers to usefulness for a specified purpose and
can only be interpreted in relation to that purpose. In contrast,
reliability refers to the consistency of measurement regardless of
what is measured. Clearly, if a test is valid for a purpose it must also
be reliable (otherwise it would not satisfy the usefulness criterion).
But a test can be reliable (consistent) without meeting its intended
purpose. Test reliability is influenced by the similarity of the test
items, the length of the test, and the group on which the test is
tried. When we add scores on different parts of a test to give a score
on the whole test, we assume that the test as a whole is measuring
on a single dimension or construct, and the analysis seeks to
identify items which contradict this assumption. In the context of
test analysis, removing items which contradict the single-dimension
assumption should contribute to a more reliable test. Where trial
tests vary in length, the reliability index for one test cannot be
compared directly with another. An adjustment to a common-length
test of 100 items can be made using the Spearman-Brown formula:

                               k × reliability(original test)
reliability(100-item test) = -------------------------------------
                              1 + (k − 1) × reliability(original test)

where k = 100 / (number of items in the original test).
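
As a check on the formula, the short sketch below applies it to the 30-item
trial test analyzed above, whose Alpha index was reported as 0.543.

# Spearman-Brown adjustment to a common length of 100 items (worked example).

def spearman_brown_100(reliability, n_items):
    k = 100.0 / n_items                      # lengthening factor
    return (k * reliability) / (1.0 + (k - 1.0) * reliability)

print(round(spearman_brown_100(0.543, 30), 3))   # about 0.80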

If the group of candidates is more diverse, the index obtained will


be higher than for a less diverse group. For example, students at
the one age-level will be less diverse than a group with students of
several age-levels. A test reliability quoted for a sample of Grades
4, 5, 6 and 7 students is expected to have a higher value than a test
reliability for the same test given to a similar size sample of a single
Grade level (such as Grade 6).


There are a number of methods for estimating reliability; item


analysis software programs generally only use one of these
methods. There are four basic approaches.

· The same test can be given on two different occasions to the


same sample of candidates; the reliability coefficient could then
be calculated by correlating the scores on the two occasions.

· Two separate parallel tests can be given to the same sample of


candidates; the reliability coefficient could then be calculated by
correlating the scores on the two tests. (One variant is to delay
the second test to assess stability over time).

· A single test can be split into two parts; the reliability


coefficient could then be calculated by correlating the scores
on the two parts. (In this case each part test is not as long as
the complete test so an adjustment has to be made using the
Spearman-Brown formula).

· The reliability can be calculated as an internal consistency from


a single set of test data; this may be considered as equivalent
to the average of all possible adjusted split-half coefficients.
This is the approach used most often by item analysis computer
programs.

The last two approaches only assess on one occasion so there is no


assessment of stability over time.

Reliability is sometimes estimated in order to judge how precise


a candidate’s score might be. Various test analysis programs use
different measures of item consistency. The reliability index may
be described as an item homogeneity index, an internal consistency
index, a Kuder-Richardson Formula 20 index, or a (Cronbach)
Alpha index. For example, the ITEMAN program calculates an
Alpha index which is a measure of the internal consistency of the
test. In the item-analysis example above, the Alpha index is 0.543.
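
For readers who want to see how such an internal-consistency index is obtained,
the sketch below computes coefficient alpha from a small matrix of scored (1/0)
item responses. The data are invented for illustration; in practice the module
relies on programs such as ITEMAN or QUEST to report this index.

# Coefficient (Cronbach) alpha from a matrix of 0/1 item scores
# (rows = candidates, columns = items).  Data invented for illustration.

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(scores):
    k = len(scores[0])                                   # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
print(round(cronbach_alpha(scores), 3))    # about 0.67 for these invented data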


In practice, a reliability index for a test should be at least 0.7 and


preferably higher than 0.8.

By making some assumptions about a particular candidate being


similar to other candidates, the spread of scores of other candidates
can be combined with the estimate of reliability to estimate a
band of scores in which that candidate’s score might fall if the test
was given again. In the item-analysis example above this statistic
(with a value of 2.224) is called the SEM, the standard error of
measurement. For the item-analysis example above, we might
expect that two thirds of the time the ‘true score’ of a candidate
(the average score for an individual over an infinite number of test
administrations) will fall within the candidate’s observed score on the
test plus or minus 2.224. Doubling the error limit provides a score
range for the true score for 95 per cent of the time. To illustrate,
we would expect that 95 per cent of the time the true score for
a candidate who obtains 20 on the test would fall between 20
– (2x2.224) and 20 + (2x2.224).
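
Although the module does not state the formula, the SEM reported by ITEMAN can
be reproduced as the standard deviation multiplied by the square root of one
minus the reliability. The sketch below does this for the figures quoted above
and prints the 95 per cent band for an observed score of 20.

# Standard error of measurement and a 95% true-score band, reproducing the
# figures quoted above (SD = 3.289, alpha = 0.543, observed score = 20).

from math import sqrt

sd, alpha, observed = 3.289, 0.543, 20
sem = sd * sqrt(1 - alpha)                 # about 2.224
low, high = observed - 2 * sem, observed + 2 * sem
print(round(sem, 3), round(low, 1), round(high, 1))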

Item response modelling strategies for item analysis
Part of the computer output from an item response modelling test
analysis report for a multiple-choice test might look like Figure
31, showing how the items and the candidates are placed on the
continuum. This output was produced by applying the QUEST
computer program to an analysis of the data in Figure 18. Notice
that item 3 is not shown. [If every person is correct on an item (or
incorrect on an item), that item cannot be placed on the graph.
Similarly, a person who has every item correct cannot be placed
on the continuum. We know they are better than the next best
person, but we do not know how much better. A more demanding
test is needed to place such persons on the knowledge continuum.


A person with a zero score cannot be placed either. We have to


find what they know as well as what they do not know to locate
them on the graph.] Figure 32 shows another part of the output that
is a check on the fit to the model. Figure 33 shows details of the
individual items. Items are queried if they are well to the left of, or
well to the right of, the vertical dotted lines in Figure 32. They may
also be queried if the fit t-values in the last two columns of Figure 33
are large.

Figure 31 also shows how the development of trial tests can result in
more items in some difficulty ranges and fewer items in others. Most
of the candidates have attainments (as judged by this test) higher
than the average difficulty for the items. In other words, most items
have difficulties below the attainment levels of the candidates.

In effect, this test is more powerful at detecting differences between


candidates at lower levels within the range than at higher levels.
A greater number of valid items in a particular range of difficulty leads to more
precise distinctions between candidates within that range.


Figure 31. Variable map for test data in Figure 18

QUEST: The Interactive Test Analysis System


Item Estimates (Thresholds)
all on all (N = 27 L = 30)
  3.0                         27        <- The most difficult item
                        x               <- The top candidate
  2.0                   x     25 24 26
  1.0                   xx    20
                        xx    19
                        xxx   10
                        xx    5
                        xxxxx 8
                        xxx   23
                        x
                        x
  0.0                   x     6 16      <- Average item difficulty
                        xxxx  4 7
The lowest candidate -> x     14 15 21
                              17 22 29
                              9 11 18
                              12 30
 -1.0                         1 28
 -2.0                         2
 -3.0                         13        <- The easiest item
Each x represents one student


Figure 32. Item fit map for test data in Figure 18

QUEST: The Interactive Test Analysis System


Item Fit
all on all (N = 27 L = 30)
INFIT
MNSQ       0.63   0.71   0.83   1.00   1.20   1.40   1.60
[Each of the 30 items (item 1 to item 30) is plotted against this infit mean
square scale; items falling well outside the vertical dotted lines are queried.]


Figure 33 shows the raw score and maximum score for each item, together with
the ability level on the continuum at which the probability of success changes
from less likely to more likely to be correct. This point is called the
threshold for the item. Underneath
each threshold numeral there is another numeral indicating the error
associated with the threshold estimate.

Figure 33. Item estimates for test data in Figure 18 (part only)

QUEST: The Interactive Test Analysis System


Item Fit
all on all (N = 27 L = 30)
Item name        SCORE   MAXSCR   TRSH 1 (error)   INFT MNSQ   OUTFT MNSQ   INFT t   OUTFT t

 1  item 1         24      27     -1.38  (.63)        1.00        1.29        0.2      0.6
 2  item 2         25      27     -1.83  (.75)        0.96        0.66        0.1     -0.2
 3  item 3          0       0     Item has perfect score
 4  item 4         19      27     -0.13  (.44)        0.83        0.74       -0.9     -0.8
 5  item 5         12      27      1.04  (.41)        0.86        0.84       -1.1     -0.6
 6  item 6         18      27      0.05  (.43)        1.02        0.99        0.2      0.1
 7  item 7         19      27     -0.13  (.44)        0.95        0.86       -0.2     -0.3
 8  item 8         13      27     -0.88  (.41)        1.11        1.12        0.9      0.6
 9  item 9         22      27     -0.77  (.51)        1.07        1.04        0.3      0.2
10  item 10        11      27      1.20  (.41)        1.09        1.17        0.7      0.7
11  item 11        22      27     -0.77  (.51)        1.06        1.34        0.3      0.8
12  item 12        23      27     -1.04  (.56)        0.89        0.65       -0.2     -0.6
13  item 13        26      27     -2.55 (1.03)        1.10        5.51        0.4      2.3
14  item 14        20      27     -0.32  (.46)        0.94        0.88       -0.2     -0.2
15  item 15        20      27     -0.32  (.46)        0.87        0.77       -0.5     -0.6
16  item 16        18      27      0.05  (.43)        0.81        0.73       -1.2     -0.9
17  item 17        21      27     -0.53  (.48)        1.15        1.19        0.7      0.6
18  item 18        22      27     -0.77  (.51)        0.90        0.73       -0.3     -0.5


Deciding whether an item is useful after trial with real candidates (item response modelling analysis)
The steps are:

1. Look at the variable map


Are the items (on the right, shown with numerals) spread
over a similar range as the candidates (on the left, shown X)?
If Yes, continue to 2; If No, which are much higher, items or
candidates? If the items are much higher, additional less complex items are
required; if the candidates are much higher, additional more complex items
are required. (The
ranges of items and candidates should be similar).

2. Look at the item fit map


Are any of the items shown well to the left or well to the right
of the vertical dotted lines? If No, continue to 3; If Yes, this is
an unexpected result! Check why! You may need to change or
reject the outlying items. Check that the score key is correct! If
the score key is not correct, amend it and repeat the analysis. If
the score key is correct, go to 3.

3. Are the fit t-values in the last two columns of the item
estimates table larger than 3?
If No, keep the item and continue to 4; If Yes, probably change
or reject the item.

4. Assembling final forms of the test


We consider the position where the item belongs in the test
specification table and the threshold level (difficulty) of the
item. Each cell in the test specification table should have several
items over a range of difficulties.
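
The fit check in step 3 can also be expressed as a few lines of code. The
sketch below is illustrative only (the data structure is assumed, and the last
entry is an invented example of a misfitting item).

# Flag items whose fit t-values, taken from a table like Figure 33, are larger
# than 3.  Each entry gives the item name, its infit t, and its outfit t.

items = [
    ("item 1", 0.2, 0.6),
    ("item 13", 0.4, 2.3),
    ("item 99", 3.6, 4.1),   # invented example of a misfitting item
]

for name, infit_t, outfit_t in items:
    if infit_t > 3 or outfit_t > 3:
        print(name, "check: probably change or reject")
    else:
        print(name, "keep, subject to the test specification grid")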


Classical item analysis and item response modelling compared
In most situations, items rejected in the classical approach to
item analysis will also be rejected by the item response modelling
approach. However, it is the case that the item response modelling
approach sometimes rejects items that are acceptable using the
classical approach. This type of item is illustrated in Figure 34.

Figure 34. Correct answer response patterns where decisions vary

[Two trend-line graphs over the useful range of the scale: one rises steadily
across the whole range; the other has a very steep gradient over only a narrow
part of the range.]

The first type of item is usually considered acceptable regardless


of the analysis method. The second type of item is regarded as
unacceptable by the item response modelling approach. Very steep
gradients are regarded as inappropriate; such items are not useful
over a reasonable range and often may be concerned with trivial
content.


10 Maintenance of security
When trial tests are developed for secure purposes it is important
that the secure nature of the tests be preserved. Copies of the tests
for file purposes must be kept under lock and key conditions. The
computer control files for the test analysis include the score key for
each trial test so there has to be restricted access to the computers
where the test processing is done.

The analysis reports (such as the item analysis, and the summary
tables) will include the score keys and therefore those reports must
be kept secure.

Test review after trials 11

After the item analyses are complete, decisions have to be made


about the items that will be retained, the items that will be
modified, and the items that will be discarded. The test blueprint
and associated specification grid must be consulted to ensure that
enough items are retained to give a range of item difficulties within
each cell of the grid.

An item may be easy because


· Wrong choices are not plausible;
· Most candidates know the work on which the items were based.

An item may be difficult because


· You have the wrong ‘correct’ answer;
· More than one answer is correct;
· The content is rare or trivial;
· The task is not well stated; and/or
· Candidates did not reach the item (other items may have been
too complex, too lengthy, or too numerous).

An item may not discriminate because


· You have the wrong ‘correct’ answer;
· More than one answer is correct;
· The task is ambiguous;


· The ‘correct’ choice has a flaw;


· The ‘correct’ choice is too obvious;
· The task is too difficult and candidates are guessing;
· The item is testing something different from the other items;
· The better candidates were taught the wrong information; and/
or
· Only the weaker candidates were taught the topic because it
was assumed (incorrectly) that able students already knew the
work.

Cautions in interpreting item analysis data
Item analysis identifies questionable items which up until the trial
stage had met our criteria for relevant, reasonably valid, and fair
items. Item analysis may not necessarily identify faulty questions
which should not have been included in the trial test because those
criteria were not met. Some users of item analysis seek to reject
all items but those with the very highest discrimination values.
While this apparently gives increased reliability, this may be
gained at the expense of the validity of the final test. For example, a
test of computation may have addition, subtraction, multiplication
and division items. If items are progressively discarded through
continued analysis it is likely that only one of the operations will
remain (probably the one with the most items). The resulting test
will be an apparently more reliable test but, because only one of the
four operations is tested, it is no longer representative of all four
processes, and hence not valid for the purpose of assessing the four
processes.


Items which do not perform as expected can be discarded or


revised. Test constructors should be aware of the possibility of
distortion in the balance of questions when there are not enough
items to satisfy requirements in all cells of the specification grid. If
the original specification represents the best sampling of content,
skills, and item formats, in the judgments of those preparing and
reviewing the test, then leaving some cells of the grid vacant will
indicate a less than adequate test. To avoid this possibility, test
constructors may prepare three or four times as many questions
as they think they will need for each cell in the grid. Test
constructors have to avoid the tendency to test what is easy to test,
rather than what is important to test.

Assembling the final test and the corresponding score key
After trial, tasks may be re-ordered to take account of their
difficulty. Usually the easiest questions are presented first. This is to
encourage candidates to proceed through the test and to ensure that
the weaker candidates do not become discouraged before providing
adequate evidence of their achievements and skills. Minor changes
to items may have to be made for layout reasons (for example, to
keep all of an item on the one page of the test, or to avoid obvious
patterns in the list of correct answers). Items representing a single
cell within a test specification should vary in item content and
difficulty. The position of the correct option in multiple choice items
(A, B, C, D or E) should also vary and each position should be used
to a similar extent. Some questions may have minor changes to
wording, others may be replaced. The final test should be consistent
with the test blueprint. The item review procedures described above
are repeated (particularly important where stimulus material must
be associated with more than one question) and each reviewer
should work independently through the proposed test and


provide a ‘correct’ answer for each question. This enables the test
constructor’s (new) list of correct answers to be checked.

Preparation of final forms of a test is not the end of the work. The
data from use of final versions should be monitored as a quality
control check on their performance. Such analyses can also be used
to fix a standard by which the performance of future candidates
may be compared. It is important to do this as candidates in one
year may vary in quality from those in another year. In some
instances such checks may detect whether there has been a breach
of test security.

It is customary to develop more trial forms so that some forms of


the final test can be retired from use (where there is a possibility of
candidates having prior knowledge of the items through continued
use of the same test).

The trial forms should include acceptable items from the original
trials (not necessarily items which were used on the final forms
but in similar design to the pattern of item types used in the final
forms) to serve as a link between the new items and the old items.
The process of linking tests using such items is referred to as
anchoring. Surplus items can be retained for future use in similar
test papers.

Confidential disposal of trial tests 12
It is usual to dispose of the used copies of trial tests by confidential
destruction after a suitable time. [The ‘suitable’ time is difficult to
define. Usually, trial tests are destroyed about one month after all
analyses have been concluded and when the likelihood of further
queries about the analyses is very low.]


13 Using item analysis software


In practice, test research and development agencies use item
analysis software on a variety of computers to monitor the quality
of their tests. Some useful software packages are listed in Figure 35.
A • indicates that the software has the feature. Other software
packages may provide similar coverage.

Figure 35. Coverage of item analysis software discussed in this module

Name        MS-DOS version   Mac version   Classical analysis   Rasch (IRT) analysis
QUEST             •               •                 •                    •
ITEMAN            •                                 •
BIGSTEPS          •                                                      •

Computer software
The QUEST computer program is published by The Australian
Council for Educational Research Limited (ACER). Information
can be obtained from ACER, 19 Prospect Hill Road, Camberwell,
Melbourne, Victoria 3124, Australia.

The ITEMAN computer program is published by Assessment


Systems Corporation. Information can be obtained from
Assessment Systems Corporation, 2233 University Avenue, Suite
400, St Paul, Minnesota 55114, United States of America.

The BIGSTEPS program is published by MESA Press. Information
can be obtained from MESA Press, 5835 S. Kimbark Avenue,
Chicago, Illinois 60637, United States of America.

References
Adams, R.J. ; Khoo, S.T. (1993). QUEST: The interactive test analysis
system. Hawthorn, Vic.: Australian Council for Educational
Research.

Rasch, G. (1960). Probabilistic models for some intelligence and


attainment tests. Copenhagen: Danmarks Paedagogiske Institut.

Finding out more about trial testing and item analysis
1. General strategies
Coffman, W.E. (1971). Essay examinations. In R.L. Thorndike (Ed.).
Educational Measurement. (2nd Ed. (pp. 271-302). Washington,
DC: American Council on Education.

Hake, R. (1986). How do we judge what they write? In K.L.


Greenberg, H.S. Weiner, and R.A. Donovan (Eds.), Writing
assessment: Issues and strategies, (pp. 153-167). New York:
Longman.

Huot, B. (1990). The literature of direct writing assessment: Major


concerns and prevailing trends. Review of Educational Research,
60, 237-263.


Henrysson, S. (1971). Gathering, analyzing, and using data on test


items. In R.L. Thorndike (Ed.) Educational Measurement. (2nd
Ed.) (pp. 130-159). Washington, DC: American Council on
Education.

Hopkins, C.D. ; Antes, R.L. (1990). Classroom measurement and


evaluation. Itasca, Illinois: Peacock.

Hopkins, K.D. ; Stanley, J.C. (1981). Educational and psychological


measurement and evaluation. (6th Ed.) Englewood Cliffs, NJ:
Prentice-Hall.

Izard, J. (1991). Assessment of learning in the classroom. Geelong, Vic.:


Deakin University.

Izard, J. (1995). Module C.1 Overview of Test Construction. Paris:


International Institute for Educational Planning.

Low, B. ; Withers, G. (Eds.) (1990). Developments in school and public


assessment. (Australian Education Review, No. 31). Hawthorn,
Vic: ACER.

Mehrens, W.A. ; Lehmann, I.J. (1984). Measurement and evaluation


in education and psychology. (3rd Ed.) New York: Holt, Rinehart
and Winston.

Tinkelman, S.N. (1971). Planning the objective test. In R.L.


Thorndike (Ed.) Educational Measurement. (2nd Ed.) (pp. 46-80).
Washington, DC: American Council on Education.

Wright, B.D. ; Stone, M.H. (1979). Best test design: Rasch


measurement. Chicago, IL: MESA Press.

Wright, B.D. ; Masters, G.N. (1982). Rating scale analysis. Chicago,


IL: MESA Press.


2. Broadening trial testing and item analysis strategies
de Lange, J. (1992). “Assessment: No change without problems”.
In M. Stephens ; J. Izard. (Eds.) Reshaping assessment practices:
Assessment in the mathematical sciences under challenge. (pp.
46-76). Hawthorn, Vic.: Australian Council for Educational
Research.

Griffin, P. ; Forwood, A. (1991). Adult literacy and numeracy


competency scales. An International Literacy Year Project.
Melbourne, Vic.: Phillip Institute of Technology.

Haines, C.R., Izard, J.F., Berry, J.S. et al. (1993). “Rewarding student
achievement in mathematics projects”. Research Memorandum
1/93, London: Department of Mathematics, City University.
(54pp.)

Haines, C.R. ; Izard, J.F. (1994). “Assessing mathematical


communications about projects and investigations”. Educational
Studies in Mathematics, 27, 373-386

Izard, J.F. (1991). “Issues in the assessment of non-objective


and objective tasks”. in A.J.M. Luitjen (Ed.), Issues in public
examinations. (Proceedings of the 16th IAEA conference.
Maastricht, The Netherlands, 18-22 June 1990.) (pp73-83).
Utrecht, The Netherlands: Lemma, B.V.

Linacre, J.M. (1990). Modelling rating scales. Paper presented at


the Annual Meeting of the American Educational Research
Association, Boston, MA., USA, 16-20 April, 1990. [ED 318 803]

Wilson, M. (1992) Measurement models for new forms of


assessment education. In M. Stephens ; J. Izard. (Eds.) Reshaping
assessment practices: Assessment in the mathematical sciences
under challenge. (pp. 77-98). Hawthorn, Vic.: Australian Council
for Educational Research.


Applications of Item Analysis


Adams, R.J., Doig, B.A. ; Rosier, M.J. (1991). Science learning in
Victorian schools: 1990. (ACER Research Monograph No. 41).
Hawthorn, Vic.: Australian Council for Educational Research.

Doig, B.A., Piper, K., Mellor, S. ; Masters, G. (1994). Conceptual


understanding in social education. (ACER Research Monograph
No. 45). Melbourne, Vic.: Australian Council for Educational
Research.

Masters, G.N. et al. (1990). Profiles of learning: The basic skills testing
program in New South Wales, 1989. Hawthorn, Vic.: Australian
Council for Educational Research.

Exercises 14
1. Choose an important curriculum topic or teaching subject
(either because you know a lot about it or because it is
important in your country’s education programme).

• List the key content areas in that topic or subject.

• List the important skills or behavioural objectives.

• Show (in percentage terms) the relative importance of


each key area.

2. Construct a test plan which has the content categories (from


Exercise 1) at the left and the skills or behavioural objectives
(also from Exercise 1) at the top. Adjust the numbers of items
in each cell to reflect the percentage weightings you have
chosen for each dimension.

3. Review an examination or test used in your country for the


topic or teaching subject you chose in Exercise 1. Using your
test plan as a guide, compare the examination or test with
your test plan. Choose a topic in the curriculum which is not
addressed by the examination or test and write some sample
items to illustrate how item writers might satisfy this need.

4. Use the ITEMAN software to analyze the data given in


Figure 18. Discuss the characteristics of each item in the test.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana,


Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa,
Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed


by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

8
Module
Maria Teresa Siniscalco
and Nadia Auriat

Questionnaire
design

UNESCO International Institute for Educational Planning



Content
1. Introduction 1

2. Initial planning 3
Why a new questionnaire – and when? 3
Relationships between research problems,
research hypotheses, and variable construction 5
Characteristics of research hypotheses 6
Specifying variables and indicators 6
Operationalization of research questions 8
The problem of the cross-national validity of educational
concepts, definitions, and data collection instruments 12
1. What is formal education? 12
2. Distinguishing public and private service providers 14
3. What is a school? 16
4. What is a student? 17
5. What is a teacher? 18

3. The design of questions 22


Question structure 22
1. Closed questions 23
2. Open-ended questions 26
3. Contingency questions 28

Guidelines for writing questions 29


Specifying the characteristics of respondents 34


A checklist for reviewing questionnaire items 35

4. Examples of questions 38
Student background 38
1. Gender and age 39
2. Socio-economic background: occupation, education, and possessions 41

Teacher characteristics 47
School location 49
Learning, teaching, and school activities 51
1. Student reading activity 51
2. Teacher activities 52
3. School head activities 53

Attitudes, opinions, and beliefs 55


1. Likert scaling 55

5. Moving from initial draft to final version of the questionnaire 63
Two widely-used patterns of question sequence 63
General guidelines for item placement 65
Covering letters and introductory paragraphs 66
Drafting instructions for answering questions 68
Training of interviewers or questionnaire administrators 70
Pre-testing the questionnaire 72
Basic steps in pre-testing 75
Reliability and validity 76
1. Validity 76
2. Reliability 78


The codebook 79

6. Further reading 84

Introduction 1

• This module provides guidance for the design of standardized


questionnaires that are to be administered in school systems
to students, teachers, and school heads. The module is divided
into four sections that cover initial planning, the design of
questions, examples of question types, and moving from a draft
to a final questionnaire.

After reading this module, the reader should be able to design a


quality survey questionnaire that is suitable for addressing the
research issues at hand. He or she will know how to:

• Decide on the target population for the questionnaire.

• Identify the variables and indicators that will address the


research issues and hypotheses on which data are to be
collected.

• Develop demographic, knowledge, attitude, and practice


questions.

• ‘Close’ open ended quantitative and qualitative questions


and design skip, filter, and contingency questions, where
appropriate.

• Decrease response bias and maximize response rates.

• Design probe questions and interviewer or respondent


instructions on the questionnaire.


• Conduct a pilot test of the questionnaire, and adjust its final


design according to the results.

• Prepare a codebook for data entry.

Initial planning 2
This section reviews the steps required to determine the need for
a new questionnaire, and looks at how a general research problem
needs to be translated into a number of specific research questions
and hypotheses. It examines the problem of valid cross-national
instruments and provides helpful hints and recommendations for
using comprehensive and precise definitions of key educational
concepts.

Why a new questionnaire – and when?


This module addresses the planning and design of standardized
questionnaires. A formal standardized questionnaire is a
survey instrument used to collect data from individuals about
themselves, or about a social unit such as a household or a school.
A questionnaire is said to be standardized when each respondent is
to be exposed to the same questions and the same system of coding
responses. The aim here is to try to ensure that differences in
responses to questions can be interpreted as reflecting differences
among respondents, rather than differences in the processes that
produced the answers.

Standardized questionnaires are often used in the field of


educational planning to collect information about various aspects
of school systems. The main way of collecting this information is
by asking people questions – either through oral interviews (face
to face or telephone), or by self-administered questionnaires, or by
using some combination of these two methods.


Although survey research, by definition, implies the use of


some form of questionnaire to be administered to a sample of
respondents, the questionnaire is simply one instrument that can
be employed in the study of a research problem. As such, it may or
may not be the most suitable tool for the task at hand.

Hence, before deciding on the need for a new questionnaire, one


should consider whether or not some of the required information
may already be available from other sources, for example, from
statistics compiled by governments or research agencies, or from
survey research archives. One should also consider whether a
suitable questionnaire already exists that could be wholly or
partially used.

The planner should also consider whether other means of data


collection are more appropriate. These can be (a) field experiments,
where people in ‘treatment’ and ‘control’ groups respond to a
scenario devised by the investigators, (b) content analysis of
newspapers or articles, (c) direct observation (such as counting
the number of schools in a district, the number of blackboards in
a school or the number of students per teacher in a given area),
or (d) non-directive interviews where there are no pre-specified
questions and the interviewer has a great deal of freedom in
probing areas and specific issues during the course of the interview.

Among the types of information that can be collected by means of


a questionnaire are facts, activities, level of knowledge, opinions,
expectations and aspirations, membership of various groups, and
attitudes and perceptions. In the field of educational planning,
the information that is collected can be classified broadly into: (a)
inputs to education (such as school resources or various background
characteristics of schools, teachers or students), (b) learning and
teaching processes, and (c) the outcomes of education (such as pupil
achievement, attitudes towards school, and measures of school
efficiency such as survival rates etc.).


Relationships between research problems, research hypotheses, and variable construction
The development of a questionnaire commences with the
transformation of general educational research and policy concerns
into specific research questions for which the data are intended to
supply an answer. Some examples of general educational policy and
research concerns are: (a) policy-makers want to assess the supply
of resources in their primary schools, (b) a curriculum expert wants
to determine to what extent teaching methods explain differences
in reading literacy among 9-year-old students, and (c) a national
evaluation agency wants to investigate student attitudes towards
science at the end of secondary school.

In the case of the above three examples, it would be necessary to


establish empirical evidence for decisions through the collection of
data on facts (school resources), activities (teaching methods), and
attitudes (students’ views towards science), respectively.

A research hypothesis is a tentative answer to a research problem


expressed in the form of a clearly stated relation between
independent (‘cause’) and dependent (‘effect’) variables. Hypotheses
are built around a more general research problem.

Examples of educational research problems derived from the


general issue of ‘equity’ are:

• How large are differences in the stability of school staff


between schools in urban and rural areas?

• Is the provision of equipment and supplies distributed to


schools dependent on public and private funding?


These research problems can be translated into research hypotheses


as follows:

• The stability of school staff is greater in rural schools than in


urban schools.

• Equipment and supplies are more widely available in schools


dependent on private funding than they are in schools
dependent on public funding.

Characteristics of research hypotheses


Educational research hypotheses should have the following
characteristics.

• Describe clearly, and provide identification of the most


important variables in operational terms.

• Specify expected relationships among independent, dependent,


and control variables.

• Present a statement in a form that is testable with available


research methods.

• Be value free in the sense that they exclude the personal biases
of the researcher.

Specifying variables and indicators


Following the identification of the research problem and the
formulation of researchable hypotheses, it is necessary to prepare a
tentative list of variables and indicators for measuring the specific
research questions and hypotheses of interest.


A variable is a characteristic that can assume two or more


properties. If a property can change either in quantity or quality,
then it can be regarded as a variable. There are five main types of
variable.

• Dependent variables
Variables that the researcher is trying to explain (for example,
student achievement).

• Independent or explanatory variables


Variables that cause, or explain, a change in the dependent
variable.

• Control variables
Variables that are used to test for a spurious relationship
between dependent and independent variables. That is, to test
whether an observed relationship between dependent and
independent variables may be explained by the presence of
another variable.

• Continuous variables
Variables that take all values within a particular range.

• Discrete variables
Variables that take a number of specific values.

An indicator is an empirical, observable, measure of a concept.


When an indicator is composed of a combination of variables
involving only simple calculations (such as addition, subtraction,
division, multiplication, or a combination of these) it is called a
‘simple indicator’. When more complex analytical methods, such
as factor analysis or regression are used to develop an indicator,
the result is referred to as a ‘complex indicator’. Examples of simple
indicators are: number of school library books per pupil; or teacher/
pupil ratio. An example of a complex indicator is a factor score
entitled ‘emphasis on phonics’ in the teaching of reading formed
from three variables: learning letter-sound relationships; word
attack skills; and assessment of phonic skills.
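
To make the distinction concrete, the short sketch below (written here in
Python, with invented figures) shows how a simple indicator is obtained from
raw questionnaire variables using basic arithmetic only; a complex indicator
would instead require an analytical model such as factor analysis.

    # Hypothetical raw variables collected from one school questionnaire.
    library_books = 1200       # number of books in the school library
    total_enrolment = 480      # number of pupils enrolled in the school
    full_time_teachers = 16    # number of full-time teachers

    # Simple indicators: only basic arithmetic on the raw variables.
    books_per_pupil = library_books / total_enrolment
    pupils_per_teacher = total_enrolment / full_time_teachers  # inverse gives the teacher/pupil ratio

    print(f"School library books per pupil: {books_per_pupil:.1f}")
    print(f"Pupils per teacher: {pupils_per_teacher:.1f}")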

Operationalization of research
questions
When operationalizing a specific research question to identify
an appropriate indicator it is necessary to specify the indicator
according to the following components.

• The statistics that will be reported (for example, means or


percentages).

• The level of analysis at which the statistics will be calculated


(for example, at the student, teacher, or school level).

• The target population and, if any, the sub-populations


considered (for example, all primary school students, with
the data presented by region, and urban/rural location of the
school).

• The specific measure to be used (for example, the number of


school library books per student).

• The variables needed in order to calculate a measure on the


indicator to be obtained (for example, total school enrolment
and number of books in the school library).

Two different indicators of teacher stability were operationalized in


data collections conducted by UNESCO and the OECD during the
mid-1990s. The UNESCO study examined the conditions of primary
schools in the least developed countries (Schleicher et al., 1995, pp.
56-59) and the OECD study was focussed on the development of a
broad range of indicators (OECD, 1996, pp. 150-152). These studies
offer interesting examples of different approaches to indicator
construction. For example, staff stability was defined on the basis
of the number of years teachers had been at the school, but the
indicator was constructed differently in the two surveys.

In the UNESCO study it was hypothesized that most of the


participating countries would be characterized by a high level of
staff instability due to population growth and resulting increases
in school enrolment rates. Teachers were considered to be ‘stable’
if they had been at the school for at least three or more years. The
level of staff stability for nations was represented by the percentage
of teachers in each country who ‘had been at the same school for
three or more years’. The following variables were needed for this
calculation: the overall number of full-time equivalent teachers
in the school; the number of teachers having joined the school by
year; and, the year of construction of the school building – which
functioned as a validity check variable.

In contrast, the indicator of staff stability used by the OECD for


developed countries measured the percentage of primary school
students enrolled in schools where more than 75 percent of teachers
had been employed at the same school for at least five years. In
order to build this indicator the following variables were needed:
total enrolment per school, the number of teachers per school, and
the number of years each teacher had been employed at the school.

Three aspects distinguish these two indicators of school staff


stability. First, in the OECD indicator the percentage of stable
teachers was weighted by the number of students enrolled. This
approach was taken because the goal of the analysis was to provide
an indication of how many students were affected by the stability
of the teaching force – rather than concentrating on teachers as the
unit of analysis. In contrast, the UNESCO study aimed at giving
a picture of the teaching body as a whole and therefore employed
teachers as the unit of analysis. Second, stability was defined
as teachers being at the school for a minimum of five years in
the OECD indicator, and for at least three years in the UNESCO
indicator. The reason for this difference was that the OECD study
dealt with a group of the world’s most developed countries, whereas
the UNESCO study concerned developing
countries. Third, the OECD indicator defined ‘stable’ schools as
those where more than a certain percentage of teachers (75 percent)
were ‘stable’. That is, the OECD defined an indicator decision point
to distinguish between stable and unstable schools. On the other
hand, the UNESCO study aimed at giving a descriptive picture of
the conditions of schooling and therefore did not need to adopt an
indicator decision point.

The following table presents the components of the above-


mentioned indicators on teacher stability, and highlights the main
differences between them.


Table 1. Analysis of the teacher stability indicators’ components

Statistics
   UNESCO data collection: Percentages of teachers
   OECD data collection: Percentages of teachers

Unit of analysis
   UNESCO data collection: Teacher level
   OECD data collection: Student level

Target population (and sub-populations)
   UNESCO data collection: Primary school teachers, with reference to
   subgroups of schools defined by type (public/private) and location
   (urban/rural)
   OECD data collection: Primary school teachers

Operationalization of the indicator
   UNESCO data collection: Three years at the school
   OECD data collection: Five years at the school

Variables needed
   UNESCO data collection: (a) overall number of teachers; (b) number of
   teachers by number of years of permanence at the same school
   OECD data collection: (a) overall number of teachers; (b) number of
   teachers at the same school for at least 5 years

Indicator decision points
   UNESCO data collection: Not specified
   OECD data collection: Schools with at least 75% of stable teachers

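As an illustration only, the sketch below (Python, with a few invented school
records) shows how the two indicators summarized in Table 1 could be computed.
The three-year and five-year thresholds and the 75 per cent decision point are
taken from the text above; the data structure and field names are hypothetical
and do not reproduce the actual UNESCO or OECD computation procedures.

    # Each record describes one school: its total enrolment and, for every
    # teacher, the number of years that teacher has been at the school.
    schools = [
        {"enrolment": 300, "teacher_years": [1, 2, 6, 7, 8]},
        {"enrolment": 150, "teacher_years": [4, 5, 5, 9]},
        {"enrolment": 500, "teacher_years": [1, 1, 2, 3, 10]},
    ]

    # UNESCO-style indicator (teacher as unit of analysis): percentage of
    # teachers who have been at the same school for three or more years.
    all_years = [y for s in schools for y in s["teacher_years"]]
    unesco_pct = 100 * sum(y >= 3 for y in all_years) / len(all_years)

    # OECD-style indicator (student as unit of analysis): percentage of
    # students enrolled in schools where more than 75% of teachers have
    # been at the school for at least five years.
    def is_stable(school, min_years=5, cut_off=0.75):
        years = school["teacher_years"]
        return sum(y >= min_years for y in years) / len(years) > cut_off

    total_students = sum(s["enrolment"] for s in schools)
    oecd_pct = 100 * sum(s["enrolment"] for s in schools if is_stable(s)) / total_students

    print(f"UNESCO-style indicator: {unesco_pct:.1f}% of teachers stable")
    print(f"OECD-style indicator: {oecd_pct:.1f}% of students in stable schools")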

The problem of the cross-national validity of educational
concepts, definitions, and data collection instruments
The specification of variables and indicators presupposes a general
and common agreement on the exact meaning and scope of the
terms and concepts employed. However, given the diversity that
characterizes different education systems, not to mention the
disparities that can sometimes be found among regions, and even
among schools within the same system, there is a need for a clear
and comprehensive definition of these kinds of terms.

In the following paragraphs some key educational concepts and


terms are examined to exemplify the kind of definitional problems
that arise when dealing with education issues. Some solutions
that can be used to address problems in this area have also been
described. The definitions and classifications presented draw
mainly on UNESCO, OECD, and EUROSTAT work.

1. What is formal education?


a. Problem/issues to be resolved
A number of questions on the scope of education need to be
addressed before meaningful data can be collected on key
aspects of education systems. For example, when does formal
education start and should a data collection on education
statistics include the pre-primary level? How should activities in
the field of vocational education and training be accounted for?
Is special education provided within or outside regular schools,
and should it be covered by the data collection? Should adult
education be included in the statistics?


b. Helpful hints
Some directions helping to answer these questions can
be drawn from the following comprehensive definition
of education proposed within the International Standard
Classification of Education (ISCED).

Education is ‘organized and sustained communication designed


to bring about learning’ (UNESCO, 1976). ‘Communication’ in
this context refers to the relation between two or more persons
involving the transmission of information. ‘Organized’ means
planned in a sequence including established aims and/or
curricula and involving an educational agency that organizes
the learning situation and/or teachers who are employed
(including unpaid volunteers) to consciously organize the
communication. ‘Sustained’ means that the learning experience
has the characteristics of duration and continuity. ‘Learning’
indicates any change in behaviour, information, knowledge,
understanding, attitudes, skills, or capabilities that can be
retained and cannot be ascribed to physical growth or to the
development of inherited behaviour patterns.

According to this definition, pre-primary school should be included


within the specification of education because, not only does it serve
the purpose of giving the child daily care while the parents are at
work, it also contributes towards the child’s social and intellectual
development. One solution to keeping track of differences among
pre-primary programmes is to distinguish between ‘all pre-
primary programmes’ and ‘pre-primary programmes with special
staff qualifications requirements’. The first area covers all forms
of organized and sustained activity taking place in schools or
other institutional-settings (as opposed to services provided in
households or family settings). The second refers to programmes
where at least one adult has a qualification characterized by training
covering psychological and pedagogical subject matter.


In the area of vocational training one solution is to exclude


vocational and technical training that is carried out in enterprises,
with the exception of combined school and work programmes that
are explicitly deemed to be part of an education system.

The above definition of education also suggests that special


education programmes, whether provided by regular schools or by
special institutions, are to be included in the data collection as long
as the main aim of the programme is the educational development
of the individual.

‘Adult’ or ‘non-regular’ education programmes should be included


in the statistics only if they involve studies with a subject matter
content similar to regular education studies or whose qualifications
are similar to those of regular education programmes.

Each of the above points provides some idea of the kind of


definitional and classificatory work necessary to overcome national
and/or regional differences in definitions and thereby to construct
data collection instruments which guarantee the comparability of
the data that are collected.

2. Distinguishing public and private service providers
a. Problem/issues to be resolved
Most national and cross-national data collections gather
information that will enable schools to be classified according
to the education service provider. In many data collections for
school systems, this classification is often referred to as ‘school
type’. This is a complex task because of the need to take variety
into account. In some countries virtually all education activities
and institutions are public. In other countries private agencies
play a substantial role. However, the label ‘private’ covers
a number of different educational configurations. In some
countries ‘private schools’ are entirely or mostly funded by the
central authority, but they are run privately. In other countries
‘private schools’ are entirely or mostly funded and managed
privately.

When information concerning ‘school type’ is collected it is


not sufficient to ask the respondent (for example, the school
head) to classify the school either as public or private. When
developing questions in this area, whether the questionnaire
is to be addressed to a central authority or to school heads, it is
necessary to specify what it is intended by ‘private’ vs. ‘public’,
or by ‘government’ vs. ‘independent’.

b. Helpful hints
An approach often adopted is to distinguish between the
following three categories of schools: schools controlled by
public authorities; schools controlled by private authorities
but depending on substantial government funds; and schools
controlled and funded by private authorities.

Alternatively, it is helpful to distinguish between the ‘control’


and ‘funding’ of schools. The terms ‘public’ and ‘private’ can
be used to indicate control. That is, whether it is a public or a
private agency which has the ultimate power to make decisions
concerning the institution (in particular the power to determine
the general programme of the school and to appoint the
officers who manage the school). The terms ‘government’ and
‘independent’ can be used to indicate the source of funding.
For example, a government school could be defined as one that
receives more than 50 per cent of the funds to support its basic
educational services from government agencies and/or whose
teaching personnel are paid by a government agency; whereas
an independent school could be defined as one that receives less
than 50 per cent of its overall funding from government agencies.
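
A minimal sketch of this two-way classification (Python; the school records
are invented, and the 50 per cent funding rule is the one given above):

    # Hypothetical school records: who controls the school, and what share of
    # its funding comes from government agencies.
    schools = [
        {"name": "School A", "controlled_by": "public",  "govt_funding_share": 0.95},
        {"name": "School B", "controlled_by": "private", "govt_funding_share": 0.80},
        {"name": "School C", "controlled_by": "private", "govt_funding_share": 0.10},
    ]

    for s in schools:
        control = s["controlled_by"]   # 'public' or 'private' refers to control
        funding = "government" if s["govt_funding_share"] > 0.5 else "independent"
        print(f'{s["name"]}: control = {control}, funding = {funding}')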


3. What is a school?
a. Problem/issues to be resolved
A school is often difficult to define in a manner that is
consistent for a cross-national data collection. In some cases
a school consists of several buildings, managed by the same
head-teacher. In other cases, the same building hosts different
schools in different shifts at different times of the day. In
some cases a school has a well-defined structure, consisting of
separate classrooms – with each classroom being endowed with
one teacher table and chair, one desk and chair for each student,
and a chalkboard in each classroom. In other cases the school
is in the open air, perhaps under a tree, where teachers and
students sit on the ground, and the students use their knees as
writing places. When collecting comparative information on
schools, these different scenarios have to be taken into account.

b. Helpful hints
Suppose, for example, that ‘school crowdedness’ – expressed as
square metres of classroom space per pupil – is being measured.
The result obtained by dividing the number of square metres
by the total enrolment will be correct (and comparable across
schools) only in a situation where all schools have one shift. But
if some schools operate more than one shift, then the results
will be misleading.

One solution in this case would be to ask whether the school


has shifts, and how many students there are per shift. The
crowdedness measure could then be calculated by taking into
account the overall number of students for schools with no
shifts, but only the students in the largest shift for schools with
more than one shift.
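
A minimal sketch of this calculation (Python, with invented figures); the rule
of counting only the largest shift for multi-shift schools follows the
paragraph above.

    def classroom_space_per_pupil(classroom_m2, enrolment, shift_sizes=None):
        """Square metres of classroom space per pupil.

        In schools with more than one shift, only the pupils of the largest
        shift share the classroom space at any one time.
        """
        pupils = max(shift_sizes) if shift_sizes else enrolment
        return classroom_m2 / pupils

    # Single-shift school: 400 square metres shared by all 320 pupils.
    print(classroom_space_per_pupil(400, 320))              # 1.25

    # Two-shift school: 400 square metres, 500 pupils, largest shift has 260.
    print(classroom_space_per_pupil(400, 500, [260, 240]))  # about 1.54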


4. What is a student?
a. Problem/issues to be resolved
Suppose that student enrolment figures are being investigated.
How will the corresponding statistics be calculated and
reported? When the focus of the analysis is on rates of
participation, what should be done with repeaters, and how
should they be distinguished from students enrolling regularly
for the first time in a grade or year of study? All these issues
need to be taken into account when designing questions on
student enrolment figures for an education system.

b. Helpful hints
A distinction should be made between the number of students
and the number of registrations. The number of students
enrolled refers to the number of individuals who are enrolled
within a specific reference period, while the number of
registrations refers to the count of enrolments within a specific
reference period for a particular programme of study. The two
measures are the same if each individual is only enrolled in
one programme during the reference period, but the measures
differ if some students are enrolled in multiple programs. Each
measure is important: the number of students is used to assess
participation rates (compared to population numbers) and to
establish descriptive profiles of the student body. The number
of registrations is used to assess total education activities for
different areas of study.
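
The difference between the two counts can be illustrated with a short sketch
(Python, with hypothetical enrolment records): the number of students counts
distinct individuals, whereas the number of registrations counts every
programme enrolment.

    # Hypothetical enrolment records for one reference period:
    # (student identifier, programme of study).
    enrolments = [
        ("S001", "General secondary"),
        ("S002", "General secondary"),
        ("S002", "Evening vocational course"),   # same student, second programme
        ("S003", "Evening vocational course"),
    ]

    number_of_students = len({student for student, _ in enrolments})  # distinct persons
    number_of_registrations = len(enrolments)                         # programme enrolments

    print(number_of_students)        # 3
    print(number_of_registrations)   # 4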

One solution for calculating student enrolment figures would


be to choose a given date in the education programme of
interest and then to present the number of students enrolled
on that date. Another solution would be to report the average
number of students enrolled during the academic year. Yet a
third possibility would be to report the total number of students
enrolled during the academic year, with the possibility of double
counting multiple entrants and re-entrants.

With respect to identifying repeaters, one commonly applied


solution is that students are classified as repeaters if they are
enrolling in the same grade or year of study for a second or
further time.

5. What is a teacher?
a. Problem/issues to be resolved
How can teachers be defined in order to distinguish them from
other educational personnel? One approach would be to base
the definition on qualifications. However, this could result in an
overestimation of the number of teachers because a number of
personnel employed in schools may have a teacher qualification
but do not actually teach. Another approach would be to define
teachers on the basis of their activities within schools, but this
alone would not be sufficient to distinguish professionals from
those who may act as teachers occasionally or on a voluntary
basis. A further issue is the reduction of head-counts to full-
time equivalents (if part-time employment applies). How can
part-time teachers be converted into full-time equivalents?
No questionnaire concerning teacher characteristics can be
designed before these issues have been clarified.

b. Helpful hints
The following definition of a teacher provides a useful
framework for overcoming ambiguities:

A teacher is a person whose professional activity involves the


transmission of knowledge, attitudes, and skills to students
enrolled in an educational programme.


The above definition is based on the concepts of (a) activity


(excluding those without active teaching duties), (b) profession
(excluding people who work occasionally or on a voluntary capacity
in educational institutions), and (c) educational programme
(excluding people such as some school principals who provide
services other than formal instruction to students).

Note that according to this definition, principals, vice-principals,


and other administrators without teaching responsibilities as well
as teachers without active teaching responsibilities for students in
educational institutions are not classified as teachers.

For the reporting of head-counts, individuals who are linked to


multiple educational programmes, such as teachers who divide
their work between public and private institutions, or between
levels of education, or between different functions (for example,
teaching and administration) should be pro-rated between those
levels, types of institutions and functions. Suppose, for example,
that there are 100 full-time teachers who (on average) devote
80 per cent of their statutory working time to teaching and 20 per
cent to the function of headmaster. In this case 80 full-time teachers
should be reported under the category ‘teacher’ and 20 full-time
teachers should be reported under the category ‘other professional
personnel’. If countries cannot pro-rate educational personnel, the
classification could be based on the activity to which they devote the
majority of their working time.

With respect to part-time conversion, the distinction between full-


time and part-time teachers, as well as the calculation of full-time
equivalents, is based on the concept of statutory working time.
One solution to this conversion problem is to classify as ‘full-time’
those teachers employed for more than 90 percent of their statutory
working time, and as ‘part-time’ those teachers employed for less
than 90 percent of the statutory working time. The classification of
individuals linked to multiple educational programmes as full-time
or part-time teachers will depend on the total number of working
hours over all levels, types of institutions, and functions.
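
The sketch below (Python) works through the pro-rating logic and the 90 per
cent rule described above; the personnel records and field names are invented
for illustration.

    FULL_TIME_THRESHOLD = 0.90   # share of statutory working time

    # Hypothetical personnel records: share of statutory working time worked,
    # and how that time is split between teaching and other functions.
    staff = [
        {"worked_share": 1.00, "teaching_share": 0.80},  # e.g. a teaching head
        {"worked_share": 0.50, "teaching_share": 1.00},  # half-time classroom teacher
        {"worked_share": 0.95, "teaching_share": 1.00},
    ]

    fte_teachers = 0.0
    fte_other = 0.0
    for person in staff:
        status = "full-time" if person["worked_share"] > FULL_TIME_THRESHOLD else "part-time"
        # Pro-rate the person's working time between 'teacher' and other functions.
        fte_teachers += person["worked_share"] * person["teaching_share"]
        fte_other += person["worked_share"] * (1 - person["teaching_share"])
        print(status)

    print(f"Full-time equivalent teachers: {fte_teachers:.2f}")
    print(f"Full-time equivalent other professional personnel: {fte_other:.2f}")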

In some cases the solutions found will be used to define the target
population for which data will be collected – as shown in the
examples given in the paragraph on formal education. In other
cases the definitional work will contribute directly to the design
of questionnaire items – as shown in the examples given in the
previous section on service provider. In yet other cases definitions
and explanations will be used to prepare accompanying notes
that provide instructions on how to answer specific questions
– as shown in the examples given for the conversion of part-time
teachers and students into full-time equivalents.


EXERCISES

1. Prepare three research hypotheses concerning factors


that influence student achievement, and then identify some
appropriate independent, dependent, and control variables.

2. Specify the variables needed to construct each indicator in the


following list.
a. Time spent daily for homework by students.
b. Teacher academic education.
c. Availability of school library.
d. Yearly instructional time in Grade 1.
e. Average class size in school.
f. Teacher/pupil ratio.

3. An educational planner has defined an “Indicator of the


Level of Qualifications for Primary School Teachers” as the
percentage of primary school pupils in schools with at least
50 percent of teachers having completed secondary school.
For this indicator specify (a) the statistical units used, (b) the
unit of analysis, (c) the target population, and (d) the variables
required to construct the indicator.

4. Suppose you have prepared a questionnaire to be answered


by all teachers in a sample of schools. How would you specify
what is a teacher in the notes accompanying the questionnaire
in order to prevent the questionnaire being filled in by other
educational personnel?


3 The design of questions


Once the indicators and variables of interest have been identified
and their components have been defined, one may begin designing
the corresponding questionnaire items. It is important to note
that the number of questions in a questionnaire does not coincide
necessarily with the number of variables. Sometimes more than one
question needs to be asked to operationalize one variable.

This section is concerned with types of questions and response


formats. It examines and discusses the advantages and
disadvantages of three key types of question structure: open,
closed, and contingency. It then gives writing tips for structuring
questions and the response categories that accompany them. The
section ends with advice on how to avoid response bias and pitfalls
in question writing.

Question structure
Two important aspects of questionnaire design are the structure of
the questions and the decisions on the types of response formats for
each question. Broadly speaking, survey questions can be classified
into three structures: closed, open-ended, and contingency
questions.

1. Closed questions
Closed (or multiple choice) questions ask the respondent to choose,
among a possible set of answers, the response that most closely
represents his/her viewpoint. The respondent is usually asked to
tick or circle the chosen answer. Questions of this kind may offer
simple alternatives such as ‘Yes’ or ‘No’. They may also require that
the respondent chooses among several answer categories, or that
he/she uses a frequency scale, an importance scale, or an agreement
scale.

How often do your parents ask you about your homework?
(Please, circle one answer only)

Never . . . . . . . . . . . . . . . . . . 1
1 or 2 times a week . . . . . . . 2
3 or 4 times a week . . . . . . 3
Nearly every day . . . . . . . . 4

The main advantages of closed questions are:

• the respondent is restricted to a finite (and therefore more


manageable) set of responses,

• they are easy and quick to answer,

• they have response categories that are easy to code, and

• they permit the inclusion of more variables in a research study


because the format enables the respondent to answer more
questions in the same time required to answer fewer open-
ended questions.


The main disadvantages with closed questions are:

• they can introduce bias, either by forcing the respondent to


choose between given alternatives or by offering alternatives
that otherwise would not have come to mind,

• they do not allow for creativity or for the respondent to develop


ideas,

• they do not permit the respondent to qualify the chosen


response or express a more complex or subtle meaning,

• they can introduce bias, where there is a tendency for the


respondent to tick systematically either the first or last category,
to select what may be considered as the most socially desirable
response alternative, or to answer all items in a list in the same
way, and

• they require skill to write because response categories need to


be appropriate, and mutually exclusive.

The response format for closed questions can range from a simple
yes/no response, to an approve/disapprove alternative, to asking
the respondent to choose one alternative from 3 or more response
options.

The possibility of format effects or response bias for this type of


question can be reduced by changing the sequence of response
categories and values. For example, if responses to an item range
from 1 to 5, going from negative to positive, then a number of
items in the questionnaire can be designed to have 1 as the most
positive alternative and 5 as the most negative. This is a particularly
important technique for the construction of attitude scales.
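
The recoding implied by this technique can be sketched as follows (Python,
with hypothetical item responses): items worded so that 1 is the most positive
answer are reversed before the scale score is computed, so that high values
always point in the same direction.

    # One student's responses to a five-item attitude scale (values 1 to 5).
    # Items marked as reversed were worded so that 1 is the most positive answer.
    responses = [
        {"item": "A1", "value": 4, "reversed": False},
        {"item": "A2", "value": 2, "reversed": True},
        {"item": "A3", "value": 5, "reversed": False},
        {"item": "A4", "value": 1, "reversed": True},
        {"item": "A5", "value": 3, "reversed": False},
    ]

    def recoded(value, reversed_item, scale_max=5):
        # On a 1..scale_max scale, reversing maps 1 -> 5, 2 -> 4, ..., 5 -> 1.
        return (scale_max + 1 - value) if reversed_item else value

    scale_score = sum(recoded(r["value"], r["reversed"]) for r in responses)
    print(scale_score)   # 4 + 4 + 5 + 5 + 3 = 21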


Some closed questions may have a dichotomous response format,


which means only two mutually exclusive responses are provided.

What is your sex?
(Please tick one box only)

p Male
p Female

For the above example a dichotomous response format is


appropriate. However, this type of format should not be overused
in a survey because it elicits much less information than multiple
choice formats. For example, if seeking information on degree
of interest in public affairs, the question “Do you read a daily
newspaper?” yields a yes/no response. This could be reworded to:
“How many times per week do you read a daily newspaper?”, to
which multiple choice responses could be:

1. Seven times a week
2. Five to six times a week
3. Three to four times a week
4. One to two times per week
5. Less than once per week
6. Never

Such a multiple category response format would provide more


specific and more useful information than the dichotomous one.


2. Open-ended questions
Open-ended or free-response questions are not followed by any
choices and the respondent must answer by supplying a response,
usually by entering a number, a word, or a short text. Answers are
recorded in full, either by the interviewer or, in the case of a self-
administered survey, the respondent records his or her own entire
response.

What are your favourite TV programmes?
(Please specify their titles)
...........................................................
...........................................................
What do you like most about school?
.........................................................
.........................................................

The main advantages of open-ended questions are:

• they allow respondents to express their ideas spontaneously in


their own language,

• they are less likely to suggest or guide the answer than closed
questions because they are free from the format effects
associated with closed questions, and

• they can add new information when there is very little existing
information available about a topic.

The main disadvantages of open-ended questions are:

• they may be difficult to answer and even more difficult to


analyze,


• they require effort and time on behalf of the respondent,

• they require the development of a system of coded categories


with which to classify the responses,

• they require the respondent to have some degree of writing


ability, and

• respondent handwriting can be illegible.

There is always the possibility with open-ended questions that


responses may come in very different forms, and these may lead
to answers that cannot be systematically coded for analysis. For
example, if asked “When did you leave school?”, the respondent
may answer in a variety of ways: “Seven years ago”. “When I got my
first job”. “When my brother started going to high school”. “When
my parents moved into this house”.

If the survey is administered by an interviewer, appropriate probing


helps clarify such answers. In the case of a self-administered
survey, guidance by writing specific instructions on how to answer
the question can often minimize the number of responses that have
very different dimensions.

Care should be taken in writing open-ended questions so as to


avoid formats that elicit a dichotomous yes/no or agree/disagree
response. In addition, the wording of questions should seek to
reduce the possibility of eliciting responses that are aligned along
very different dimensions and therefore cannot be systematically
coded. For example, asking “What do you think about your school?”
can elicit responses such as ‘nothing’ or ‘school is useless’. However,
asking “What recommendations would you have for improving your
school?” would be more likely to elicit informative answers.

A good case for using open-ended questions is when the aim is to


have the respondents reply spontaneously, or when the investigator
is pilot testing the first version of the questionnaire, or when the
investigator wants to collect evidence on the parameters of an issue
with the aim of later formulating a multiple choice or closed version
of a question.

Generally, open-ended questions can produce useful information


in an interviewer administered survey, provided that the
interviewers are alert and trained to probe ambiguous responses.
In self-administered surveys, it is useful to provide instructions
on the format of the response that is required so as to minimize
opportunities for the respondents to answer the question according
to very different dimensions.

3. Contingency questions
A contingency question is a special case of a closed-ended
question because it applies only to a subgroup of respondents.
The relevance of the question for a subgroup is determined by
asking a filter question. The filter question directs the subgroup to
answer a relevant set of specialized questions and instructs other
respondents to skip to a later section of the questionnaire.

The advantage of contingency questions is that detailed data may


be obtained from a specific subgroup of the population. Some
questions may apply only to females and not to males; others may
apply only to people in school, and not to those who are employed.
At the base of good contingency questions are clear and specific
instructions to respondents.

The formats for filter and contingency questions can vary. One
option is to write directions next to the response category of the
filter question.


Are you enrolled in secondary school?

1. Yes (answer the following question)
2. No (skip to question 5)

Alternatively, the contingency question can be placed at the end of


the questionnaire set apart from ordinary questions that are to be
answered by everybody:

ANSWER THIS FINAL SET OF QUESTIONS ONLY IF YOU PLAN ON
ENTERING AN ADULT EDUCATION COURSE NEXT YEAR.

OTHERWISE, YOU HAVE NOW COMPLETED THE QUESTIONNAIRE.

Guidelines for writing questions


There are no all-purpose rules that, if followed, will automatically
result in a well-written questionnaire. There are, however, some
basic principles that, when violated, usually result in respondent
confusion, misunderstanding, lack of comprehension, or response
bias.

a. Keep the vocabulary simple


A first rule concerns the vocabulary used in writing questions and
answer categories. The rule is ‘keep it as simple as possible’. This
implies using simple words, avoiding acronyms, abbreviations,
jargon, technical terms, and abstract or general words.


• If a rare or technical term has to be used, then its meaning


should be explained. For example, a question concerning
the frequency with which teachers teach their students to
understand different styles of text should be accompanied by a
definition of each kind of text.

Narrative :
texts that tell a story or give the order in which things happen.

Expository :
texts that provide a factual description of things or people or
explain how things work or why things happen.

Documents :
tables, charts, diagrams, lists, maps.

• Acronyms and abbreviations should always be spelled out in


the questionnaire. Do not assume that respondents will or
should know what an acronym represents.

• When a general term is used, concrete examples should be


given to clarify its meaning. For example, a question on learning
activities in the IEA (International Association for the Evaluation
of Educational Achievement) Reading Literacy Teacher Questionnaire included the
following items, for which the respondent had to answer on a
four-point frequency scale.

How often are your students typically involved in the
following activities?

• silent reading in class
• learning new vocabulary systematically (for example, from lists)
• learning to use illustrations (for example, graphs, diagrams, tables)


The words ‘systematically’ and ‘illustrations’ were too general to


be understood in the same way by all respondents. Examples were
therefore provided to clarify their intended meaning.

Finally, it is recommended to avoid words that may have an


ambiguous meaning. In education, the word ‘hour’ may have
different meanings. For example, many education systems refer to
a lesson length or period as an hour even though the lesson is only
forty-five minutes long. In order to measure the yearly instructional
time at a given educational level, it is therefore necessary to know
the length (in minutes) of an ‘hour’ of instruction, the number of
minutes of instruction per week, and the number of school weeks
per year. If this information is known, then calculations can be
made later for instructional time per day, or week, or year.
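
The arithmetic involved is straightforward; the following sketch (Python, with
invented figures) combines the three pieces of information mentioned above.

    # Hypothetical figures reported by one school.
    minutes_per_lesson_hour = 45   # length of one 'hour' (period) of instruction
    lesson_hours_per_week = 30     # periods taught per week
    school_weeks_per_year = 38

    minutes_per_week = minutes_per_lesson_hour * lesson_hours_per_week
    yearly_hours = minutes_per_week * school_weeks_per_year / 60

    print(f"{minutes_per_week} minutes of instruction per week")
    print(f"{yearly_hours:.0f} sixty-minute hours of instruction per year")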

b. Keep the question short


Closely related to keeping vocabulary simple is avoiding lengthy
questions. Generally, it is recommended to hold questions to
25 words or less. If a longer sentence is used then it should be
broken up so that there will be several shorter sentences.

c. Avoid double-barrelled questions


These are single questions that ask for two things and therefore
require two answers. “Do you have your own table or your own
room to do your homework?” “Do you think it is a good idea for
children to study geography and history in primary school?” In
such instances, respondents do not know what to do if they want to
say ‘Yes’ to one part of the question but ‘No’ to the other.

d. Avoid hypothetical questions


Evidence has shown that hypothetical questions such as “Would
you use this resource in your class if it were available?” are not good
for the prediction of behaviour. People are generally poor predictors
of their own behaviour because of changing circumstances and
because so many situational variables intervene. Investigators are
able to collect more valid data if they question respondents about
their past behaviour and present circumstances, attitudes, and
opinions.

e. Don’t overtax the respondent’s memory


It is risky to ask the respondent to recall past behaviour over a long
retrospective period. This is true especially when recurrent events
or behaviours are concerned. No student, especially young students,
will be able to answer reliably a question such as “In the last month
how many hours of homework did you do on an average day?”
because the time is just too long to remember what happened in
detail. If such a question must be asked, a one-week recall period
might be more appropriate for this type of event.

f. Avoid double negatives


Double negatives, either in the question or an answer category (or
both), create difficulties for the respondent. For example a statement
such as ‘Student self-evaluation should not be allowed’ followed by
agree/disagree is problematic to answer for respondents who are in
favour of students’ self-evaluation, that is those who do not agree
that students’ self evaluation should not be allowed. It is usually
possible to solve problems of this kind by formulating the initial
statement in a positive way.

g. Avoid overlapping response categories


Answer categories should be mutually exclusive. It should not be
possible to agree with or choose more than one category – unless
the instructions explicitly allow the respondent to check more than
one alternative. Examples of questions with overlapping categories
are:


Do teachers generally receive their salaries:
(Check one only)

usually on time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
sometimes a week late . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
more than a week late . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

How old are you?

under 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
20-30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
30-40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
40-50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
50-60 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
60 or more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

The categories of the first question could be made mutually


exclusive by removing the qualifiers ‘usually’ and ‘sometimes’. In
order to avoid overlap in the second question it should be modified
as follows.

How old are you?

under 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
20-30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
31-40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
41-50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
51-60 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
61 or more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


h. Beware of ‘leading’ questions


A leading question is a question phrased in such a way that it seems
to the respondent that a particular answer is expected. For example:

“Do you favour or oppose school on Saturday morning?”
might read in a leading question as:
“You wouldn’t say that you were in favour of school on
Saturday morning, would you?”,
or in a more subtle form:
“Would you say that you are not in favour of school on
Saturday morning?”

Specifying the characteristics of respondents
Before beginning to write the questionnaire it is important to
consider the characteristics of the respondents.

A clear definition of the target population helps to adapt question


wording and response formats and also helps to ensure that
respondents have experienced what is being asked of them, or
at least have sufficient knowledge to be able to respond to the
questionnaire items.

In deciding on the sample design, and the population from which


the sample is to be drawn, it is helpful to consider whether the
population is of individuals, households, institutions, transactions,
or whatever. The source from which the data are to be collected is
not necessarily identical to the population definition. For example, if
a mail questionnaire is sent to school presidents asking about school
finance, the population is of schools and not of school presidents.


One of the most important considerations for the researcher is


whether respondents consist of a heterogeneous or a homogeneous
group. The former consists of individuals who differ from one
another in some way that might influence the phenomenon of
interest. A heterogeneous group may consist of people from
different ethnic backgrounds, different levels of income, and
different urban or rural areas. By contrast, homogeneous groups
consist of individuals from similar socio-spatial backgrounds.

Research has shown that response rates are usually higher for
homogeneous or select groups (for example, high school teachers,
university professors, physicians) because they are more likely to
identify with the goals of the study. Beyond this distinction, it is
known that interest and familiarity with the topic has a positive
effect on response rates.

A checklist for reviewing questionnaire items
The following list of questions provides a framework for reviewing
each item that is to be included in a questionnaire.

1. Will the item provide data in the format required by the


research questions or the hypotheses?

2. Is the item unbiased?

3. Will the item generate data at the level of measurement required


for the analysis?

4. Is there a strong likelihood that most respondents will answer


the item truthfully?


5. Do most respondents possess sufficient knowledge to answer


the item?

6. Will most respondents be willing to answer the item, or is it too


threatening or too sensitive?

7. Does the item avoid ‘leading’ respondents to a specific answer?

8. Is the language used in the questionnaire clear and simple – so


that all respondents are able to understand all of the questions?

EXERCISES

1. Explain the uses of closed, open, and contingency questions.

2. Draft five closed and open questions related to some aspect of


educational research.

3. Formulate a contingency question with accompanying


instructions.

4. Here are some ‘bad’ questions which contain some of the


problems presented in the above discussion. List the main
problems and then redraft each question to address these
problems and explain the changes that you have made.


… EXERCISES

Question 1
How many teachers are there in your school who have been
at the school for at least five years and who are involved in
special initiatives outside the normal class activities at least
once per week?
. . . . . . . . . . . . . . . teachers
Question 2
Do you enjoy studying English and Mathematics?
Yes . . . . . . . . . . . 1
No . . . . . . . . . . . 2

Question 3
If you could attend university which subjects would you like to
study?
..................................................

Question 4
In the last six months how many times did you teach your
students to read expository materials?
..................................................

Question 5
Sometimes teachers do not give me sufficient attention.
Definitely Mostly Mostly Definitely
Disagree disagree agree agree
1 2 3 4

Question 6
What is the condition of each of the following in your school?
Bad Good
Lighting 1 2
Water 1 2
Canteen 1 2
Water taps 1 2


4 Examples of questions
In the following discussions some examples have been presented
of the main types of questions that are often used in educational
planning and educational research data collections. These questions
cover the areas of student background, teacher characteristics,
school location, learning/teaching activities, and attitudes. The
examples related to attitude scales include a discussion of the
principles of Likert scaling and the method used for the calculation
of the discrimination power of attitude scale items.

Student background
Demographic questions are designed to elicit information from
respondents concerning their personal characteristics and social
background. This type of information is important for explaining
variations in educational outcomes and behavioural patterns. The
most frequently used demographic questions focus on gender, age,
level of education, income level, marital status, level of parents’
education, religion, and ethnic background. A number of these
areas cover sensitive and personal issues and therefore need to be
handled carefully.

1. Gender and age
Data on student gender is critically important for examining issues
of gender equity in all school systems. This information can be
gathered from the class attendance register or can be asked as part
of a student questionnaire.

While a question on gender can be asked irrespective of the


student’s age, for younger students the question can be more
suitably phrased in the following manner.

Are you a boy or a girl?
(Tick one box only)

p Boy
p Girl

If the same information is to be obtained from teachers, the


question form and the phrasing of the response alternatives can be
more direct.

Your sex:

p Male
p Female

Whether the question is phrased as a standard question or a more


direct enquiry, it is always important to seek advice concerning the
wording of this type of question so as to respect local customs and
culture.


Student age is another important variable for explaining the


structure and evolution of an education system, and for examining
the educational development of students over time. Data on student
age can also be obtained from the class register or from a question
included in a student questionnaire.

There are various ways in which information on student age can


be collected. One way is to ask for the age at a specific reference
date; another is to ask for the actual birth date. In this latter case,
depending on the degree of accuracy needed, the respondent can be
asked to specify the year, or the year and month, or the year, month,
and day as in the following example.

What is your date of birth?
(Please write the corresponding numbers in the spaces below)

Day Month Year
__ __ ____

Research has shown that the most accurate way to obtain


information about age is to ask both date of birth and age at last
birthday:

How old were you at your last birthday?

Age: . . . . . . . . . . . . . . . . .
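
When both the date of birth and the age at last birthday are collected, the
two answers can be checked against each other. A minimal sketch of such a
check (Python; the reference date and the responses are invented):

    from datetime import date

    def age_at(birth_date, reference_date):
        """Age in completed years at the reference date."""
        had_birthday = (reference_date.month, reference_date.day) >= (birth_date.month, birth_date.day)
        return reference_date.year - birth_date.year - (0 if had_birthday else 1)

    reported_age = 13              # answer to 'age at last birthday'
    birth = date(1992, 11, 20)     # answer to 'date of birth'
    reference = date(2005, 9, 1)   # date on which the survey was administered

    computed_age = age_at(birth, reference)
    print(computed_age)            # 12
    print("consistent" if computed_age == reported_age else "check this record")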


2. Socio-economic background:
occupation, education, and possessions
Another type of background characteristic concerns the socio-
economic background of the student. Indicators can be developed
using information obtained directly from the individual, or by using
either objective or subjective responses. Some common indicators
of student socio-economic background are the parents’ level of
income, their occupational status, their level of education, and the
personal possessions in the home. The measurement of parent
income is always a difficult task in all countries of the world – and
for many different reasons. Most school-aged children cannot
answer such questions accurately. Similarly, adults sometimes have
difficulty answering income questions because they are not in an
occupation with a regular salary or because questions in this area
represent an invasion of personal privacy. It is usually not useful
to include a question on parent income when the respondent is
under 15 years of age. For this reason, parents’ level of education,
occupational status, and home possessions are the most frequently-
used proxy indicators of household wealth.

a. Parent occupations
Parent occupations are usually grouped into ‘occupational status’
categories based on levels of education, skill, and income. The
categories are then ranked from lowest occupational status to
highest. The categories used to group the occupations must reflect
the range of occupations existing in society, and they must also be
comprehensible to the respondent. Terms such as white-collar, blue-
collar, professional, skilled, semi-skilled, unskilled are not easily
understood by younger children if left undefined.

The following is an example of a set of questions directed to


students on their father’s occupation. The question begins with
a filter to ascertain that questions 11 and 12 are asked only
of students whose father is working. In Question 11 the open
responses will later be coded by the survey researcher. The
third question in the series helps check for consistency between
responses.

10. Does your father work for pay now?
Yes, he now works full time . . . . . . . . . . 1 (GO TO 11)
Yes, he now works part time . . . . . . . . . 2 (GO TO 11)
No, he is now looking for work . . . . . . . 3 (GO TO 13)
No, he is not working at present
(unemployed, retired) . . . . . . . . . . . . . . . 4 (GO TO 16)

11. What is the occupation of your father (or the male
person responsible for your education)?
(Please, describe as clearly as possible)
.......................................................
.......................................................
.......................................................
.......................................................

12. In your opinion, how can the occupation of your father
be defined? (Please tick only one box)
p professional and managerial
p clerical and sales
p skilled blue-collar
p semi-skilled and unskilled

Before the non-compulsory level of education is reached (but after


the age of 10 years), it is often recommended to use a classification
of occupations with no more than 4 categories. For example, (a)
professional and managerial, (b) clerical, (c) skilled blue-collar, (d)
semi-skilled and unskilled. Each category should have a definition
and/or examples that correspond to the type of occupation.


Obtaining useful information on parent’s occupation is very


difficult in self-administered questionnaires. For most children
under the age of 14 years an interviewer or data collector should
assist with this question so as to improve the quality of the
children’s responses. Some examples of interviewer probes are
presented below.

Response:
My father works in a factory

Probe:
What kind of machine does he operate?

Response:
My father is a teacher

Probe:
What level does he teach (or, alternatively, what age students does
he teach?)

Response:
My father works in a shop

Probes:
Does he own the shop? Or does he manage the shop?
Or does he work for someone else in the shop?

In the absence of interviewers, the following guidelines can improve


response quality in self-administered questionnaires: (i) avoid
simply asking the name of the place where the parent works since
this is insufficient in detail. For example, a response ‘in a hospital’
could mean the father is a doctor, a nurse, an administrator, or
a janitor; (ii) avoid asking vague job titles, such as ‘engineer’, (iii)
look into the job classifications used in a recent census or national
population surveys and see if they can be adapted for use in a
student questionnaire, (iv) when asking the mother’s occupational
status, remember to include the option ‘housewife’ (or home duties),
and (v) a combination of open and closed questions is often more
effective since it permits a check to be made of the consistency of
the responses.

b. Parent’s education
Open-ended questions that ask directly for the number of years of
a parent’s education are very difficult to answer because they imply
that students remember not only the level of education completed
by their parents, but also the sequence of levels of education,
and that they know the duration of each of these levels. For these
reasons, questions on parent’s education should be given in multiple
choice format.

What is the highest level of education that your father
(or the male person responsible for your education) has
completed?
(Please tick one box only)
p Never went to school
p Completed some primary school
p Completed all of primary school
p Completed some secondary school
p Completed all of secondary school
p Completed some education/training after secondary school
p Don’t know

In this case the only task required is to recognize the correct


information, rather than to remember it. Once the responses to this
question are collected, the first six options can be ranked during
coding, from one to six, or they can be converted into the number of
years corresponding to each option (perhaps using a median value
for the second, and fourth option).
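
The recoding described above might be implemented as in the following sketch
(Python). The ranks follow the order of the response options; the
years-of-education equivalents are purely illustrative and would have to
reflect the structure of the school system concerned.

    # Response options for the question on father's education, each with a
    # rank and an (illustrative) equivalent in years of schooling.
    education_codes = {
        "Never went to school": (1, 0),
        "Completed some primary school": (2, 3),      # e.g. a median for the level
        "Completed all of primary school": (3, 6),
        "Completed some secondary school": (4, 9),
        "Completed all of secondary school": (5, 12),
        "Completed some education/training after secondary school": (6, 14),
    }

    response = "Completed some secondary school"
    rank, years = education_codes.get(response, (None, None))  # 'Don't know' -> missing
    print(rank, years)   # 4 9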


Where possible, the researcher should make a prior request for


information from the parents. This could be achieved by asking the
students to consult with their parents – or by having the students
take the following question home for completion by parents.

How many years of academic education has your father/
guardian completed?
..... years of primary school
..... years of secondary school
..... years of post secondary academic education

c. Possessions in the home


Asking about possessions in the home has become a useful alternative approach
to collecting information on socio-economic background from
students. The number of books in the home also provides
information about possessions that are more closely linked to the
educational level of parents. Variables based on this information
usually yield a strong relationship with educational outcomes – even
when reported by younger students.

The items included in the list must relate to the context in which
the questionnaire is administered, and to the level of development
and characteristics of the society. It is important that the list include
possessions that denote high, medium, and low economic status in
order to discriminate among students with different socio-economic
backgrounds.

One of the most important possessions related to the social (and


educational) climate of the home is the number of books. This information is usually collected in approximate categories – which must be defined with a detailed knowledge of prevailing societal conditions.


About how many books are there in your home?


(Do not count newspapers or magazines. Please tick one box only)
p None
p 1-10
p 11-50
p 51-100
p 101-200
p More than 200

The data gathered using this kind of question are at best very
approximate. However, experience has shown that these data are
generally highly correlated with educational outcomes.

An alternative approach is to develop a ‘checklist’ of possessions.


The items presented on the list need to acknowledge the general
economic development level of the countries where data are being
collected. For example, the following list of items would be less
useful in wealthy countries (because most homes have all of the
items) and much more useful in developing countries where a
summarization of the total number of items in each home would
provide a good level of discrimination among relatively poorer and
relatively wealthier homes.

Which of the following items can be found in your home?


(Please, circle one number for each line)
No Yes
Radio 1 2
TV set 1 2
Video cassette recorder 1 2
Telephone 1 2
Refrigerator 1 2
Car 1 2
Piped water 1 2
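A minimal Python sketch of the summarization described above, assuming the seven checklist items shown and the 1 = ‘No’, 2 = ‘Yes’ coding:

# Sum the 'Yes' responses on the possessions checklist into a 0-7 index.
ITEMS = ["radio", "tv_set", "vcr", "telephone", "refrigerator", "car", "piped_water"]

def possessions_index(responses):
    # responses: dict mapping item name to 1 (No) or 2 (Yes); other codes are ignored.
    return sum(1 for item in ITEMS if responses.get(item) == 2)

print(possessions_index({"radio": 2, "tv_set": 2, "car": 1, "piped_water": 2}))   # -> 3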


When investigating home conditions the possibility that some


students do not live at home while attending school needs to be
considered. In this case the term ‘home’ may need to be replaced with a more generic expression such as ‘the place where you stay during the school week’.

A number of issues have to be taken into account when using socio-


economic background data in educational research. One issue, already mentioned, is the limited capacity of young children to provide an accurate report of this kind of information. Another
issue has to do with the confidential nature of the information
requested. The kind of questions that can be asked in this area
without risking the validity of the data depends on local culture
and customs. Yet another issue is related to social changes that
have occurred over the last two or three decades. In particular,
the increasing participation of women in the labour force requires
a revision of occupational status scales (and of their meaning).
Existing scales were mostly developed with data from samples
of adult males in developed countries. Similarly, the increased
incidence of divorce and of single-parent families requires a major
revision in terminology when asking questions that refer to the
father and/or to the mother.

Teacher characteristics
Among the teacher’s characteristics of interest in educational
data collection are gender, age, education, and years of teaching
experience. At the school level this information can be collected
either from teachers themselves or from school heads. However,
asking teachers to answer a question such as ‘How many years of
education have you completed?’ provides very little information. Such a question neither specifies whether pre-service training is to be included in the ‘years of education’, nor provides information on years of grade repetition (if any), or on whether part-time years of attendance were converted into full-time equivalents.


In seeking information on a teacher’s educational background,


questions should distinguish between academic education and
pre-service teacher training and they should ask the respondent to
specify how many years were attended for each level of education.
Clear instructions also need to be provided on how to treat grade
repetition and part-time attendance.

1. How many years of academic education have you


completed?
(Do not count grade repetition years. Part-time years should be
converted into full-time years. For example, two half-years equals one
full year)
__ __ years of primary school
__ __ years of lower secondary school
__ __ years of upper secondary school
__ __ years of post secondary academic education

2. How many years of pre-service teacher training have you


received altogether?
(Please, circle one number only)
a. I did not receive any teacher training
b. I have had a short course of less than one-year duration
c. I have had a total equivalent of one year
d. I have had a total equivalent of two years
e. I have had a total equivalent of three years
f. I have had a total equivalent of more than three years

The first question reveals how many years of academic education


were completed altogether and the level of academic education
that was reached. The distinction between the different levels of
education helps to verify and improve the precision of responses.
The second question covers different scenarios, ranging from no
teacher training, up to more than three years of teacher training.


School location
The location of a school is often a key issue in data collections
because physical location is often strongly related to the
sociocultural environment of the school. In addition, the degree
of physical isolation of a school can have important impacts on
decisions related to staffing and infrastructure costs.

Consider the following question on school location.

In what type of community is your school located?


(Please tick one box only)
p A geographically isolated area
p A village or rural (farm) area
p On the outskirts of a town/city
p Near the centre of a town/city

In the above example, the first and second response categories


are not mutually exclusive: a village or rural area may also be in a
geographically isolated area. A second problem with the formulation
is the ambiguity in the fourth response category – it is not clear
what is meant by ‘Near the centre of a town or city’.

The following example shows a reformulation of the question that


should improve the quality of the information obtained:

What type of community is served by your school?


(Please tick one box only)
p A village or rural community
p A small town community
p A large town
p A city


The second, third and fourth alternatives in the above question


discriminate between urban centres of different sizes. One could add population-size indications for these response categories. For example, a small town community (between 50,000 and 150,000 inhabitants); a large town community (greater than 150,000 and under 1 million); and a city (1 million or more people). The number of
inhabitants per category will depend on the demography of the
country, and can be determined by looking at the geographical
population distribution as reported in the most recent census data.
If a dichotomy of urban/rural is to be made during the data analyses
then category 1 could be used for rural, and categories 2, 3 and 4
could be combined for urban.
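A minimal Python sketch of this recoding, assuming the four location codes shown above:

# Collapse the four school-location categories into an urban/rural dichotomy.
def urban_rural(location_code):
    if location_code == 1:
        return "rural"        # village or rural community
    if location_code in (2, 3, 4):
        return "urban"        # small town, large town, or city
    return None               # missing or invalid code

print(urban_rural(3))   # -> 'urban'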

In a less densely populated country the following question may


provide more accurate information concerning school location. The
aim here is to identify the location of the school with respect to
important external services.

How many kilometres by road is it from your school to the


places in the list below?:

(a) the nearest health centre/clinic . . . . . . . . . . . . . . . . kilometres


(b) the nearest asphalt/tarmac/tarred road. . . . . . . . . . kilometres
(c) the nearest public library . . . . . . . . . . . . . . . . . . . . . kilometres
(d) the nearest secondary school. . . . . . . . . . . . . . . . . . kilometres
(e) the nearest city . . . . . . . . . . . . . . . . . . . . . . . . . . . . . kilometres
(f) the nearest regional capital. . . . . . . . . . . . . . . . . . . . kilometres

The list of items for which the distance in kilometres is asked can
vary according to the focus of the survey and the characteristics of
the country. Whatever items are used, the number of kilometres can
be summed for all items and then divided by the number of items
as a measure of the degree of isolation of the school.
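A minimal Python sketch of this isolation measure, assuming the six road distances from the question above are available for each school:

# Mean road distance (in kilometres) to the listed external services.
def isolation_index(distances_km):
    valid = [d for d in distances_km if d is not None]   # ignore missing distances
    return sum(valid) / len(valid) if valid else None

print(round(isolation_index([3, 1, 12, 5, 40, 40]), 2))   # -> 16.83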


Learning, teaching, and school activities


Questions on activities, whether addressed to students, teachers,
or school heads, usually employ either a ‘yes-no’ response format,
or they ask for an evaluation of frequency or importance. Some
examples for students, teachers, and school heads have been
presented below.

1. Student reading activity

How often do you read these types of publications for personal interest
and leisure?

Response categories: 1 = Rarely; 2 = Less than once a month; 3 = One or two times a month; 4 = About once a week; 5 = Two or three times a week; 6 = Most days

Publication type

Mystery 1 2 3 4 5 6
Romance 1 2 3 4 5 6
Sport 1 2 3 4 5 6
Adventure 1 2 3 4 5 6
Music 1 2 3 4 5 6
Nature 1 2 3 4 5 6

In a question of this kind it is important to set the time points


defining the scale so that they make sense in relation to the specific
activity of interest, and to the purpose for which the data are
collected. In general the expressions ‘never’ and ‘always’ should be
avoided, as they are extremes that respondents tend to dislike. This
is why in the above example, categories were formulated as ‘rarely’
and ‘most days’. The categories are defined so that they do not
overlap, and are not too distant from one another.


2. Teacher activities

During the school year, how often do you teach comprehension of each
of the following kinds of text?
(Circle one number per line)

Response categories: 1 = Almost never; 2 = About 3-4 times a year; 3 = About once a month; 4 = At least once a week; 5 = Nearly every day

Kind of text

a. Narrative text 1 2 3 4 5
(that tells a story or gives the
order in which things happen)

b. Expository text 1 2 3 4 5
(that describes things or
people, or explains how things
work or why things happen)

c. Documents 1 2 3 4 5
(that contain tables, charts,
diagrams, lists, maps)

The categories in the above example are specified on the assumption


that teaching takes place regularly. Therefore, if a teacher employs
a specific teaching strategy more than a few times per year, he/she
probably does it on a monthly, weekly, or daily basis. A time scale
needs to be constructed in relation to the variable(s) of interest.

The question above provides a good example of how questions and


their components should be specified. Suppose the question was
formulated as: “How often do you teach reading of the following


kinds of texts?” In this case the information obtained would be


much less useful because reading is a complex activity, made up of
skills that range from basic decoding to sophisticated inferences.
It is therefore necessary to specify what aspect of reading is being
taught (for example, ‘understanding’) and the segment of teaching
time on which teachers should base their answer. Should they think
of an average class in an average year? If so, how representative is
that year and class of their teaching experience? In asking these
kinds of questions it is important to specify the class and year (for
example, ‘in your class this year’).

3. School head activities


The school head’s report of education-related activities within a
school is very important. There is ample research evidence which
shows that schools that foster a wide range of educational and
cultural activities outside the classroom also have more effective
reading programmes.

Does your school have any special programs or initiatives for


reading outside normal classroom activities?
(You may tick more than one)
Extra-class lessons in reading p
Extra-individual tuition in reading at school p
Special remedial reading courses p
Other p (specify . . . . . . . )
None p

This question is a modified form of a yes/no question. Ticking a


response category corresponds to answering ‘yes’, and leaving it
blank corresponds to answering ‘no’. It is important to include the
category ‘other’ if the items listed are not exhaustive.


The following question asks respondents to rank different items


related to the work of school heads.

Please rank the following activities in order of importance in


your work as a school head
(‘1’ is the most important activity, ‘6’ is the least important activity)

Importance ranking
(a) evaluating the staff ............
(b) discussing educational objectives with teachers ............
(c) pursuing administrative tasks ............
(d) organizing in-service teacher training courses ............
(e) organizing extra-class special programs ............
(f) talking with students in case of problems ............

This kind of question makes it impossible to score all items as ‘very important’: the respondent is forced to rank them in order of importance.

The advantages of rank order methods are basically that it is easy


for respondents to understand the instructions, and the questions
force discrimination among objects. One of the disadvantages is
that forced responses, may not yield a real degree of preference or
attitude, but rather information that the respondent prefers one
object over another.

Remember that ranking and rating are two different processes.


Ratings are assigned independently to each item. Ranking requires
that a set of items be placed in order, thus providing a comparison
of each item to all others.


Attitudes, opinions, and beliefs


An attitude is often defined as a state of readiness, or a tendency
to respond in a certain manner when confronted with particular
stimuli. Social psychologists consider that attitudes arise from
deeply rooted personality characteristics and value systems within
individuals, and that they become manifest in the form of opinions.
The main difficulties in measuring attitudes are that (a) the object
of an attitude can range from the very specific to the very general,
(b) attitudes are not static, and (c) attitudes are both shaped and
changed by socio-demographic circumstances and life experiences.

In the field of educational research, the measurement of attitudes


has become an important issue in attempts to monitor the ‘affective’
(that is non-cognitive) outcomes of schooling. The most popular
approach to attitude measurement has been via the use of attitude
scales.

Attitude scales usually consist of a number of attitude statements


which are presented to respondents with a request that they should
indicate whether they agree or disagree. Scaling techniques are
deployed to order respondents along some underlying attitudinal
continuum.

1. Likert scaling
Likert scaling is the most frequently applied attitude scaling
technique in educational research. It consists of six main steps.

Step 1 Determining the attitude to be measured


In the field of educational planning some of the more important
areas of attitude measurement include pupil attitudes towards
school, teachers, and school subjects. In addition, given the
importance of retaining good teachers within school systems, there


has been a growing interest in measuring the sources of teacher


satisfaction with, and attitudes towards, their professional work as
teachers.

Step 2 Listing possible scale items


Here, a set of statements, or a series of items, is devised that expresses a wide range of attitudes, from extremely positive to extremely negative. The statements are designed to reflect favorably or unfavorably on the object of the attitude.

One common approach for constructing these statements is to


organize a discussion focussed on the stimulus for the attitude (for
example, the quality of school life) with individuals or small groups
representative of the target population to whom the scale will be
administered. The various negative and positive comments and
statements made during this discussion may be selected and edited
for use as stimuli in the attitude scale. Another approach is to ask a
sample of respondents to respond to a set of open-ended statements
related to the attitude being investigated. These responses are then
used to construct attitude statements.

Each statement is followed by an agreement scale on which


respondents are requested to indicate the degree to which they
agree or disagree with each statement.

Although the scale may have only two choices (agree/disagree),


more choices may sometimes permit a finer distinction in the
intensity of the attitude. Generally, Likert scales have five categories (for example, strongly agree, agree, neutral, disagree, strongly disagree). Occasionally,
the neutral or middle category may be omitted, forcing respondents
to express an opinion for each statement.

It is usually recommended that an equal number of positive and


negative statements be used. For positively worded statements the scoring categories are as follows.


Strongly agree = 5   Agree = 4   Uncertain = 3   Disagree = 2   Strongly disagree = 1

For negatively worded statements the scoring is reversed so that


‘strongly agree’ would be scored as ‘1’, and so on, with ‘strongly
disagree’ being scored as ‘5’.

Step 3 Administering items to a sample


In this step a sample of respondents, selected randomly from the
population to be studied, is asked to indicate attitudes with respect
to the list of items drawn up in Step two. For this trial-testing phase
a sample of around 150 to 250 respondents from a wide range of
environments is normally required so as to provide stable statistical
analyses in the following steps.

Step 4 Computing a total score


The researcher calculates a total score for each respondent, by
summing the values of all items. Take the following example
adapted from a ‘Quality of School Life’ scale designed for 14 year-
old students. The scale aims to measure the attitudes of students
towards school in terms of their ‘well-being’.

Suppose respondent X had the following response pattern for the five items shown below: Agree, Disagree, Strongly agree, Neither agree nor disagree, Disagree. The total score computed for respondent X would be:

4 + 4 (negative item, reverse scored) + 5 + 3 + 4 (negative item, reverse scored) = 20


1. School is a place where I usually feel great


p Strongly agree
p Agree
p Neither agree nor disagree
p Disagree
p Strongly disagree

2. The teachers at my school are often unfair


p Strongly agree
p Agree
p Neither agree nor disagree
p Disagree
p Strongly disagree

3. I really like to go to school


p Strongly agree
p Agree
p Neither agree nor disagree
p Disagree
p Strongly disagree

4. Going to school makes me feel important


p Strongly agree
p Agree
p Neither agree nor disagree
p Disagree
p Strongly disagree

5. School is a place where I sometimes feel depressed


p Strongly agree
p Agree
p Neither agree nor disagree
p Disagree
p Strongly disagree
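A minimal Python sketch of the Step 4 computation, assuming the coding 5 = ‘Strongly agree’ down to 1 = ‘Strongly disagree’ and that items 2 and 5 (the negatively worded statements) are reverse-scored:

# Total score for the five 'Quality of School Life' items above.
NEGATIVE_ITEMS = {2, 5}   # negatively worded statements are reverse-scored

def likert_total(responses):
    # responses: dict mapping item number (1-5) to the raw code (1-5).
    total = 0
    for item, raw in responses.items():
        total += (6 - raw) if item in NEGATIVE_ITEMS else raw
    return total

# Respondent X: Agree, Disagree, Strongly agree, Neither agree nor disagree, Disagree
print(likert_total({1: 4, 2: 2, 3: 5, 4: 3, 5: 2}))   # -> 20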


Step 5 Analyzing the item responses (pre-testing the items)


In this step it is necessary to determine a basis for keeping some
items for the final version of the measurement scale, and discarding
others. This can be done either through correlational analysis or by
item analysis that yields a discrimination coefficient for each item.
The discrimination power of an item is a measure of its ability to
differentiate the high-scoring respondents (clearly positive attitudes)
from the low-scoring respondents (clearly negative attitudes).

Many standard computer packages, such as the ‘Reliability’ procedure in SPSS, facilitate this by calculating the correlation of each item with the total score. As a general set of benchmarks: items with a correlation (with the total score) of under 0.3 are considered to be ‘weak’, and ‘good’ items would have a correlation (with the total score) of around 0.5 or higher.
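A minimal Python sketch of such an item analysis, assuming the trial-test responses have already been reverse-scored where necessary; each item is correlated with the total of the remaining items (a ‘corrected’ item-total correlation):

# data: one list of item scores per respondent, e.g. [[4, 4, 5, 3, 4], [2, 3, 2, 2, 1], ...]
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5   # assumes both columns vary across respondents

def item_total_correlations(data):
    n_items = len(data[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in data]
        rest_total = [sum(row) - row[i] for row in data]   # total score excluding item i
        correlations.append(pearson(item, rest_total))
    return correlations   # values under about 0.3 would flag 'weak' items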

Items are often discarded if they show negligible or no variation


across respondents. For example, if almost all respondents
answered that they ‘strongly agree’ with item one (‘school is a
place where I usually feel great’), then this item is simply adding a
constant to all scores.

Step 6 Selecting the scale items


The final list of items for the attitude scale is selected from among
those trial items that have (a) high discrimination, and (b) a range
of mean response scores. The need for high discrimination has been
mentioned above. The need for a range of mean response scores
arises because this results in more reliable measurement of the
respondents along the full range of the total scores.

In writing attitude statements it is recommended that items be


worded positively and negatively so as to avoid the ‘response set’ – which is the tendency to respond to a list of items of the same format in a particular way, irrespective of content.


a. Problems in the design of rating questions


• Error of proximity: the tendency to rate items similarly because
they are near to each other in the questionnaire.

• Central tendency error: the tendency to rate most items in the


middle category (when the middle category is offered). Such
respondents either dislike extreme positions, or lack knowledge.

• Error of leniency: the tendency to give high ratings to most


items by liking or agreeing with everything.

• Error of severity: the opposite to the error of leniency:


respondents who dislike, or disagree, with most items.

• Halo effect error: the tendency to rate particular statements according to a general feeling about the topic rather than their specific content. For example, giving a very low rating to statements such as ‘I enjoy reading’, ‘I like to borrow library books’, and ‘I prefer to read something every day’ because of a dislike for the reading teacher.

b. Assumptions in Likert Scaling


It is important to note that the following assumptions underlie this scaling technique:

• That there is a continuous underlying dimension which is


assessed by total scores on the attitude scale and that each item
contributes in a meaningful way to the measurement of this
dimension.

• That a more favorable attitude will produce a higher expected


score, and vice-versa.


• That items are of equal value in that they each provide a


replicated assessment of the dimension measured by the total
score on the scale.

Other scaling techniques that rely on attitude statements are the


Thurstone Scale (1928) and the Guttman Scale (1950). The Osgood
Scale (1957), also referred to as the semantic differential technique,
is composed of pairs of adjectives to measure the strength and
direction of the attitude.

EXERCISES

1. Here are three questions on student age. Discuss their


suitability for students of primary, secondary and post-
secondary education level.

How old are you?


.........................

What is your date of birth?


.........................

What is your date of birth?


(Please write the corresponding numbers in the spaces below)
Day Month Year
__ __ __


… EXERCISES

2. Specify ten items that would be appropriate to include in a


question on home possessions in order to measure the socio-
economic background of pupils in your country.

Which of the following things can be found in your home?


(Please circle one number for each line)
No Yes
....................... 1 2
........................ 1 2
........................ 1 2
........................ 1 2
........................ 1 2
........................ 1 2
........................ 1 2
........................ 1 2
........................ 1 2
........................ 1 2

3. Consider the indicators (a) to (d) presented below.


(a) Teacher years of teaching experience.
(b) Primary school and grade enrolment by gender.
(c) Instructional time per year for Grades 1, 3, and 5.
(d) Pupils’ interest in reading.
• Decide whether one or more variables are required for
each indicator.
• Decide if one or more questions are required for each
variable.
• Write the questions.

4. Draft ten attitude statements (each with 5 scale response


categories) that could be used to construct a scale for
measuring student attitudes towards mathematics.

5. Moving from initial draft to final version of the questionnaire
This section looks at the ordering of questions in the questionnaire,
the training of interviewers and administrators, pilot testing, and
the preparation of a codebook. It gives advice on how to design the
layout of the questionnaire, including instructions to respondents,
interviewer instructions and introductory and concluding remarks.
Guidance is provided on how to trial-test the questionnaire and then use the results to improve its final form.

Two widely-used patterns of question sequence
Two widely-used patterns of question sequence in questionnaire
design have been found to motivate respondents to co-operate and
fully complete a questionnaire. They are called the funnel sequence
and the inverted funnel sequence.

The characteristic of the funnel sequence is that each question is


related to the previous question and has a progressively narrower
scope. The first question can be either open format, or multiple
choice. It should be very broad, and is used to ascertain something
about the respondent’s frame of reference on a topic. This ordering
pattern is particularly useful when there is a need to prevent
further specific questions from biasing the initial overall view of the
respondent.


1. Would you say that the general quality of education


provided by primary schools in your community is:
p very good
p good
p uncertain
p bad
p very bad

2. How would you rate the overall quality of the primary


school attended by your child?
p very good
p good
p uncertain
p bad
p very bad

3. Do you think your own child is receiving a good primary


school education?
p Yes
p No

4. Given the opportunity, would you have your child attend


another school in your area?
p Yes
p No

In the inverted funnel sequence, specific questions on a topic are


asked first, and these eventually lead to a more general question.
This sequence requires the respondent to think through his or her
attitude before reaching an overall evaluation on the more general
question. Such a question order is particularly appropriate when
there is reason to believe that respondents have neither a strong
feeling about a topic, nor a previously formulated view.


The placement of items in a questionnaire requires careful


consideration. Good item placement can increase the motivation of
respondents – which in turn results in more valid data.

General guidelines for item placement


1. Non-sensitive demographic questions should be placed at the
beginning of the questionnaire because they are easy to answer,
non-threatening, and tend to put the respondent at ease.

2. Items of major interest to the research study should be placed


next since there is greater probability of the respondent
answering or completing the first section of the questionnaire.

3. Sensitive items that cover controversial topics should be placed


last so that potential resentment that may be provoked by these
items does not influence responses to other questions.

4. Items on the same topic should be grouped together. However,


care should also be taken to prevent one item influencing
responses to later items.

5. Items with similar response formats should be grouped together


when several different response formats are being used within
a questionnaire.

6. Section titles should be used to help the respondent focus on


the area of interest.


Covering letters and introductory paragraphs
If the questionnaire is to be mailed, or distributed, for a respondent
to complete, it is important to have a covering letter. The purpose of
such a letter is to explain the object of the survey, and to encourage
respondents to complete the questionnaire. In an interview, one of
the tasks of the interviewer is to persuade the respondent to co-
operate. In a self-administered questionnaire, the covering letter is
the only instrument for overcoming resistance. For this reason, the
covering letter is important, and should do the following:

• Identify the organization conducting the study (for example, the


Ministry of Education).

• Explain the purpose of the study.

• Assure the respondent that information provided will be


managed in a strictly confidential manner and that all
respondents will remain unidentified.

• Explain WHY it is important that the respondent should


complete the questionnaire.

• Provide the name and contact numbers of the Principal


Researcher.

The following additional information should also be included in


both the introduction to the questionnaire and the covering letter:

• Brief detail on how the respondent was selected (for example,


‘Your name was randomly selected ....’).

• Expression of appreciation for the respondent’s help.

• Estimate of questionnaire completion time.


EXAMPLE COVERING LETTER

Date: 25 September 2000

To: Participants in Reading Literacy Teacher Questionnaire

The Ministry of Education would greatly appreciate it if you, as someone currently involved in the teaching of reading literacy, could spare a few minutes of your time to respond to the enclosed questionnaire.

The results of this study will determine the reading literacy levels
of primary school students and this information will be used as
part of a review of teacher pre-service and in-service training
programmes.

You were randomly selected from a pool of currently employed


primary school teachers. You will not be identified by name. All
information provided by you will be treated as strictly confidential.

The questionnaire should only take 15 minutes to complete. Please


return it in the enclosed postage-paid envelope by 20 December
2000.

Your participation is very much appreciated and will allow us to


focus on critical issues related to the teaching of reading literacy as
determined by experienced teachers.

Yours sincerely,
xxxxxxx

If a good covering letter is enclosed with the questionnaire, the


introductory paragraph on the questionnaire itself may be shorter
and contain some instructions for responding to questions. The
IEA Reading Literacy Questionnaire for Teachers had the following
introductory paragraph.


The following questionnaire is part of an International study of


Reading Literacy and attempts to identify differences in English
instruction. It is recognized that teachers are likely to respond
quite differently to the enclosed questions.

Please answer all questions in such a way as to reflect most clearly


your teaching practices. Most questions will require you to circle
your selected response. Others will require you to write down a
number. Do not leave blanks.

We thank you for your contribution to this important research.

Drafting instructions for answering questions
Writing instructions for answering questions is a very important
part of the questionnaire layout. If the questionnaire is to be
administered by an interviewer, then the instructions will be
addressed to him or her. Such instructions are usually written in
capital letters, as follows.

Who was your employer on your last job?


(PROBE FOR CORRECT CATEGORY)
p Private
p National Government
p City
p Self-employed
p Public, non profit
p Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
p Doesn’t know

In a mailed or self-administered questionnaire, it is very important


to provide clear instructions because there is no resource person to
help clarify respondents’ queries. Instructions can be for a single
question or for a set of questions.


INSTRUCTIONS TO A RESPONDENT FOR A SET OF QUESTIONS


INSTRUCTIONS : For each of the following questions, please
mark the answer that comes closest to the way you feel about
learning mathematics. There is no right or wrong answer. Answer
the questions in the order in which they appear on the paper.
Thank you for your co-operation.

INSTRUCTIONS TO A RESPONDENT FOR A SINGLE QUESTION


About how many different teaching positions have you held during
your life? (Count only those teaching positions that you have held for at
least one full academic year)

The following examples provide illustrations of different


instructions given for the same question. In the first example,
the instructions relate to an interview. In the second example
the instructions relate to a self-administered questionnaire. Note
that the question is multiple choice, followed by an open-ended contingency question.

INTERVIEW FORMAT
1. Thinking about government facilities provided for schools, do
you think your neighborhood gets better, about the same, or
worse facilities than most other parts of the city?
Better (ASK A) 1
About the same 2
Worse (ASK A) 3
Don’t know 8

1A. If better or worse: In your opinion, what do you think is the


main reason why your neighbourhood gets (better/worse)
facilities?
...................................................
...................................................
...................................................


SELF-ADMINISTERED FORMAT
1. Thinking about the government facilities provided for schools,
do you think your neighborhood gets better, about the same,
or worse facilities than most other parts of the city?
Better 1 (answer 1A below)
About the same 2
Worse 3 (answer 1A below)
Don’t know 8

1A. If better or worse: In your opinion, what do you think is the


main reason why your neighbourhood gets (better/worse)
facilities? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
...................................................
...................................................

Training of interviewers or
questionnaire administrators
Frequently, the testing of a questionnaire is undertaken by
interviewing respondents – even if the final version of the
questionnaire is to be self-administered. This implies, however,
that the questionnaire administrators and the interviewers receive
an appropriate level of basic training before setting out to pilot the
questionnaire.

All questionnaire administrators and interviewers should be given


written instructions so as to ensure that each respondent receives
the questions in the same format and with the same instructions. If
the interview is to be administered to young children or adults who
cannot read, the interviewer should be given a card on which this
information is written. The interviewer should be instructed to read
the card to each respondent in the same way.


The interviewer should be instructed on the amount of direction


to give to the respondent. This can range from very little to a large
amount. If possible, it is useful to have the interviewer participate
in simulation exercises by both answering respondent questions
and interviewing respondents. Having a few people observe the
simulation helps in giving comments and suggestions for improving
an interviewer’s techniques.

If the questionnaire is to be self-administered, then the data


collector should be given instructions on how to introduce the
questionnaire as follows.

1. Say: ‘I am Mr/Mrs. . . . . . . . . . . . . . from the (National Research


Centre for Educational Planning), and we are interested in
knowing your views on the (education profession)’.

2. Say: ‘We would greatly appreciate your completing this


questionnaire, which should only take 10 minutes. The
directions for filling it in are given on the front page’.

3. Hand the respondent the questionnaire (and a pen or a


pencil).

4. Clarify the questions that the respondent(s) may have about


the instructions.

5. If there are questions about particular items, simply respond:


‘Just answer the question as you interpret it’. Alternatively if
more guidance is necessary the interviewer or administrator
could be instructed to ‘clarify all questions about the items’.

6. Note on the back of this sheet any questions respondents had


about items, or any comments or remarks concerning the
questionnaire (for example, too long, too hard to understand,
too difficult).

7. Thank the respondent when he/she completes the


questionnaire.


Pre-testing the questionnaire


Pre-testing the questionnaire is an essential step before its
completion. The purpose of the pretest is to check question
wording, and to obtain information on open-ended questions
with a view to designing a multiple choice format in the final
questionnaire. Pre-testing has a number of very important
advantages.

1. Provides information on possible ethical problems overlooked


previously.

2. Helps determine if the research questions or hypotheses are


appropriate.

3. Helps determine if the levels of measurement are appropriate


for the selected variables.

4. Provides a check that the population is appropriately defined.

5. Provides information on the feasibility and the appropriateness


of the sampling method.

6. Helps determine sample size by allowing estimation of variance


from the pre-test sample.

7. Provides additional training for interviewers, instrument


administrators, experimenters, coders, and data editors.

8. Helps determine the length of the questionnaire.

After training the interviewers and questionnaire administrators,


the next step in pre-testing is to select a small pilot sample of
respondents that covers the full range of characteristics of the target
population. In the field of education this usually implies that the


pilot sample includes appropriate gender balance and covers a range


of richer/poorer and rural/urban communities.

Pre-testing should never be carried out on a ‘convenience sample’ (for example, the researcher’s friends or family, or schools in one neighbourhood of the capital city). For interview questionnaires, 50 interviews will provide solid material for verifying question wording, sequencing, instructions, and the general quality of the instrument. However, larger samples of around 200 are required to calculate various statistics such as discrimination coefficients.

Note that even questions ‘borrowed’ from existing questionnaires


need to be pre-tested to ensure that they will work as required
with the ‘new’ respondents. This is particularly the case with
questionnaires administered to schoolchildren and with questions
that are translated from other languages.

The first version of the pre-test questionnaire often contains


considerably more questions than the final questionnaire. This can
be upsetting for the respondents – especially if many questions are
asked in an unstructured and open form so that the amount of time
required to complete the questionnaire is considerable. If absolutely
necessary, the questionnaire could be divided in two or three parts
(of equal length and answering time) for the first tryout, so that
each respondent answers only a fraction of the questions. For each
form at least 50 respondents should be asked to participate. The
information collected in this first pre-test should provide sufficient
information to produce a second version of the questionnaire for
final pre-testing.

This second version of the questionnaire will then be administered in one single form in order to further verify the functioning of the items and answer categories, as well as the questionnaire’s overall structure, layout, and accompanying instructions.


This process of pre-testing has a number of goals:

• To reformulate or eliminate ambiguous or superfluous


questions

Take, as an example, a question soliciting information on how often the respondent speaks a language other than the official national language at home. If a high percentage of students respond that they speak a language different from the national one, in an area where demographic statistics show that the presence of foreigners is low, it may be that the data reflect the use of dialects rather than of other languages. In this case it would be necessary to reformulate the question to reflect this observation.

• To provide material to design the answer categories for open


questions that need to be closed

Take, for example, a question on the age of teachers. This could


be asked during pre-testing in open format with a view to
formulating categories for a closed question in the final version.
The pre-test could find that the ages of teachers in a particular country fall fairly evenly into categories that each cover about five years: 20-25, 26-30, 31-35, 36-40, 41-45, and 46 or more.
of the question it would be sufficient to have six categories of
teacher age in a closed format for the final version. A similar
exercise could be employed for open qualitative questions. For
example, the following main reasons could be identified in a
pre-test survey on teacher absenteeism: ‘own health problems’,
‘family sickness’, ‘maternity leave’, ‘other family matters’, and
‘on training course’. This question could be closed using these
five reasons, and adding a sixth category for ‘other’.


• To determine whether the questionnaire is balanced in its


structure, and to discover whether instructions were properly
followed

The design of the layout should be guided by concern for the


convenience and comprehension of the respondent, and in
consideration of the subsequent work of the data processors,
who will have to enter the data using computers. From the
perspective of the data processors it is more practical to
have numbers, rather than boxes, for the answer categories.
Alternatively, the numbers can be placed next to each box, in
such a way as to not confuse respondents and yet making it
easy to enter and check the numerically coded answers.

The following steps cover the process of pre-testing and the


main points to be examined during piloting:

Basic steps in pre-testing


1. Select a sample similar in socio-economic background and
geographic location to the one that will be used in the main
study. This sample will not be included in the final survey.
Make sure you have a sufficient number of copies of the
questionnaire for the pre-test.

2. Instruct interviewers or questionnaire administrators to note


all respondents’ remarks regarding instructions or question
wording.

3. Administer the questionnaires.


4. Debrief the interviewers and check the results:

a. Is each item producing the kind of information needed?


b. What role is the item going to play in the proposed analysis?
c. Are the questions meaningful to the respondents?
d. Are respondents easily able to understand the items?
e. Can respondents use the response format for each item?
f. Did the interviewers feel that they were receiving valid
information?
g. Was the question order logical and did the interview flow
smoothly?
h. Did some parts of the questionnaire arouse suspicion?
i. Did other parts of the questionnaire seem repetitive or
boring?
j. Were interviewers able to read the questions without
difficulty?
k. Were respondents able to follow all instructions?
l. Was the questionnaire too long?

Reliability and validity


1. Validity
Validity concerns the degree to which a question measures what it
was intended to measure (and not something else). Generally, there
are three main types of validity related to the use of questionnaires:
content, empirical, and concurrent validity.


• Content (or face) validity refers to whether a panel of judges or


experts on the topic agree that the statements do relate to what
they are supposed to measure. If agreement is obtained, then
the instrument has content or face validity.

• Empirical (or predictive) validity is usually tested using a


correlation coefficient which measures relationships between
questionnaire responses and other related behavioural
characteristics or outcomes. For example, a researcher could test
the validity of an intelligence test by comparing scores on the
test with the students’ grade point average on a range of school
subjects.

• Concurrent validity consists of measuring the degree to which


a variable correlates with another measure, already validated,
of the same variable. An example of concurrent validity is
given by a study designed to test the validity of questionnaire
items for use with 10-year olds in various countries (Wolf,
1993). The study compared the answers given by the children
in the questionnaire with those given by the mothers who
were asked the same questions either through an interview
or a mail-questionnaire. The results showed high concurrence
between the responses of children and those of their mothers
to questions related to home conditions, such as questions on
father’s occupation, student age, and where the child studied in
the home. However, considerable disagreement was observed on
questions that were retrospective or prospective in nature, such
as questions on how long the child had attended pre-school or
how much more education the parents wished to have for their
child. The conclusion was that it made sense to ask 10-year olds
about their present life situation, however questions about the
past or the future should be avoided as much as possible.


2. Reliability
Reliability concerns the consistency of a measure: that is, the tendency to obtain the same results if the measure were to be repeated using the same subjects under the same conditions.

There are two general approaches to establishing the reliability of


a questionnaire. The first is to ask the question again in a different part of the questionnaire in the same or slightly altered form, but in such a way as to yield the same information. This is a consistency check, but it does not take into account day-to-day variations. A second, and better, approach, called test-retest, is to re-administer the questionnaire to the same group of individuals several days later and to compare the results that were obtained.

This second approach was used in a small study of 9-year olds in


Sweden (Munck, 1991). A coefficient describing the strength of the agreement between responses at the two times of administration was calculated. Three items from the study are presented below along with the kappa coefficient.

Item Kappa
Are you a boy or a girl? 0.98
Do you speak Swedish at home? 0.77
How often do you read for somebody at home? 0.41

Although the kappa for the question on gender seems high (0.98),
for such a question one would expect the value to be 1. On a
question like this, agreement can be increased through more careful
supervision by the person who administered the questionnaire.
The relatively low coefficients for the other two questions suggest that multiple data sources on many questions may be required for children at this age.
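A minimal Python sketch of the kappa coefficient for one categorical item, assuming paired responses from the two administrations of the questionnaire:

from collections import Counter

def cohen_kappa(first, second):
    # first, second: responses from administrations 1 and 2, in the same respondent order.
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    p1, p2 = Counter(first), Counter(second)
    expected = sum((p1[c] / n) * (p2[c] / n) for c in set(first) | set(second))
    return (observed - expected) / (1 - expected)

first_round  = ["yes", "yes", "no", "yes", "no", "yes"]
second_round = ["yes", "yes", "no", "no",  "no", "yes"]
print(round(cohen_kappa(first_round, second_round), 2))   # -> 0.67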


The analysis of trial data should also focus on producing frequency


distributions of responses for each variable. Frequencies can
be used to eliminate or modify questions that produce unusual
response distributions. Items in which the rate of non-response or
of ‘don’t know’ responses exceeds 5 percent of the sample should
be examined. Such high rates are usually indicative of ambiguities
that are still inherent in items or inadequacies in the response
categories. If the variable that the problematic item is measuring
is central to the study, then further developmental work might be
needed.
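A minimal Python sketch of this check, assuming each respondent’s trial-test data are stored as a dictionary, with None marking missing answers and an illustrative code of 9 marking ‘don’t know’ responses:

from collections import Counter

DONT_KNOW = 9   # illustrative code; use the code defined in your own coding scheme

def check_items(data, variables):
    # data: list of dicts, one per respondent; variables: names of the items to check.
    for var in variables:
        values = [row.get(var) for row in data]
        freq = Counter(values)
        problem_rate = (freq[None] + freq[DONT_KNOW]) / len(values)
        flag = "  <-- examine wording/categories" if problem_rate > 0.05 else ""
        print(var, dict(freq), f"missing/don't know: {problem_rate:.0%}{flag}")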

As changes are made, it is usually desirable to trial test the


questionnaire again. It is not unusual that at least three trial-test
studies are required before the questionnaire is adequate for its
purpose.

The codebook
A codebook should be prepared in order to enter the data into a
computer. The codebook is a computer-based structure file designed
to guide data entry. It contains a field for every piece of information
which is to be extracted from the questionnaire – starting from the
identification code which allows each respondent in the sample to
be uniquely identified.

In the codebook, each question/variable is identified by a name and


is defined by a number of acceptable codes, or by a range of valid
values for open-ended questions.

A coding scheme should be prepared for the closed and open


ended-qualitative questions. The coding scheme is a set of
numerical codes which represent all response categories, with
additional codes to enter missing data (that is, questions left blank
by the respondent) and not-applicable data (that is, questions that
were not supposed to be answered by certain respondents).


The coding scheme for closed questions is easy to prepare. Codes


are usually assigned sequentially to the set of response alternatives.
They are often already printed on the questionnaire itself to identify
each alternative, or next to the box to be ticked by the respondent,
as in the following examples.

Are you a boy or a girl?


Boy . . . . . . 1
Girl . . . . . . 2

Are you a boy or a girl?


p1 Boy
p2 Girl

The coding scheme for the above question will be ‘1’ for ‘Boy’, ‘2’ for
‘Girl’, ‘8’ for ‘Not Applicable’ and ‘9’ for ‘Missing’. It is customary to
assign missing data to the highest possible value. That is ‘9’ for one-
digit questions, ‘99’ for two-digit questions, etc. The values of ‘8’,
‘88’ etc. can be used to code ‘Not Applicable’ data.

The following table gives an example of a codebook format.


CODEBOOK FORMAT

Variable name   Coding instructions                          Column numbers   Missing   Not applicable

IDNUMBER        Respondent identification number             1-3              -         -
                Code actual number (001-528)

Q1              Highest grade completed                      4                9         8
                1 = 1-8; 2 = 9-11; 3 = 12; 4 = 13-15;
                5 = 16; 6 = 17+

Q2              Gender of teacher                            5                9         8
                1 = Male; 2 = Female

Q3              Hours worked per week, current/last job      6-7              99        88
                Code in actual hours

Note that each variable is identified by its name, question content,


the coding scheme employed, column numbers, missing and non-
applicable values and any other special coding rules employed on a
variable-by-variable basis (for example, Q3). From the information
contained in the codebook, any researcher should be able to
reconstruct the computer-stored data files from the completed
questionnaires.


Open-ended quantitative questions are usually entered by recording


the number supplied by the respondent. When preparing the
codebook, attention should be paid to the number of fields needed to enter such questions. Some programs, such as the DataEntryManager software (DEM), allow the researcher to specify a range of valid values for open-ended quantitative questions, so that an internal filter is provided to check for the entry of invalid data.
the number of years of teacher education. When preparing the
codebook it is possible to specify the maximum and minimum
number of years required to become a teacher in a given education
system. The programme will later block or signal the entry of data
that fall outside this range.
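A minimal Python sketch of such a codebook-driven check, using illustrative valid ranges together with the missing and not-applicable codes shown above:

# Each variable: its valid codes (or numeric range), plus missing and not-applicable codes.
CODEBOOK = {
    "Q1": {"valid": set(range(1, 7)),  "missing": 9,  "not_applicable": 8},
    "Q2": {"valid": {1, 2},            "missing": 9,  "not_applicable": 8},
    "Q3": {"valid": set(range(0, 81)), "missing": 99, "not_applicable": 88},  # hours per week, assumed 0-80
}

def check_entry(variable, value):
    rule = CODEBOOK[variable]
    if value in rule["valid"] or value in (rule["missing"], rule["not_applicable"]):
        return True
    print(f"Invalid entry for {variable}: {value}")   # block or signal the entry
    return False

check_entry("Q3", 120)   # flagged: outside the assumed 0-80 hours range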

The preparation of a coding scheme for qualitative unstructured


items is a more laborious task. First, it is necessary to analyze the
responses obtained in order to identify a set of categories that
provide a meaningful classification system. Then each category is
assigned a numerical value. The set of numerical values into which
responses will be coded is the coding scheme. Codes for missing
and not applicable data should be prepared following the same
criteria as for closed questions.

EXERCISES

1. Explain the difference between funnel and inverted funnel


sequences.

2. Explain the concepts of validity and reliability.

3. List three aims that a good covering letter should address.

4. State the main objectives of a trial-testing programme.


… EXERCISES

5. Specify a coding scheme for the following three questions.

Are you a boy or a girl?


(Tick one box only)
p Boy
p Girl

About how many books are there in your home?


(Do not count newspapers or magazines. Please tick one box
only)
p None
p 1-10
p 11-50
p 51-100
p 101-200
p More than 200

Does your school have any special programs or


initiatives for reading outside the normal classroom
activities?
(You may tick more than one)
Extra-class lessons in reading p
Extra-individual tuition at school p
Special remedial reading courses p
Other p


6. Further reading


Centre for Educational Research and Innovation. 1996. Education at
a Glance. OECD Indicators. Paris: OECD.

Converse, J.M.; Presser, S. 1986. Survey questions: handcrafting the standardized questionnaire. Beverly Hills, CA: Sage.

Foddy, W. 1993. Constructing questions for interviews and


questionnaires. Cambridge: Cambridge University Press.

Guttman, L. 1950. A problem of attitude and opinion measurement. In: Stouffer, S.A. (ed.), Measurement and prediction. Princeton, NJ: Princeton University Press.

Johnstone, J.N. 1976. Indicators of performance of education systems.


Paris: IIEP/UNESCO.

Johnstone, J.N. 1988. Educational indicators. In J.P. Keeves


(ed.), Educational research, methodology and measurement: an
international handbook. New York: Pergamon Press, pp. 451-456.

Likert, R. 1932. A technique for the measurement of attitudes. In:


Archives of Psychology, 140:52.

Munck, I. 1991. Plan for a measurement study within the Swedish IEA
Reading Literacy Survey and some results for population A. Stockholm:
Institute of International Education, University of Stockholm.

OECD. 1995. Definitions, explanations and instructions. In:
UNESCO, OECD, EUROSTAT data collection instruments for the 1996
data collection on education statistics. Paris: OECD.

Oppenheim, A.N. 1992. Questionnaire design, interviewing and


attitude measurement. London: Pinter Publishers Limited.

Osgood, C.E.; Suci, G.; Tannenbaum, P. 1957. The measurement of


meaning. Urbana, Illinois: University of Illinois Press.

Payne, S.L. 1951. The art of asking questions. Princeton, NJ: Princeton University Press.

Schleicher, A.; Siniscalco, M.T.; Postlethwaite, N.T. 1995. The


conditions of primary schools. A pilot study in the least developed
countries. A report to UNESCO and UNICEF.

Sheatsley, P.B. 1983. Questionnaire construction and item writing.


In: P.H. Rossi; J.D. Wright; A.B. Anderson, Handbook of survey
research. New York: Academic Press.

Thurstone, L.L. 1928. Attitudes can be measured. In: American


Journal of Sociology, 33, 529-554.

UNESCO. 1976. International Standard Classification of Education.


Paris.

Wolf, R.M. 1993. Data quality and norms in international studies.


In: Measurement and Evaluation in Counselling and Development, 26:
35-40.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana,


Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa,
Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed


by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

Quantitative research methods
in educational planning
Series editor: Kenneth N.Ross

10
Module
Andreas Schleicher
and Mioko Saito

Data preparation
and management

UNESCO International Institute for Educational Planning


Module 10 Data preparation and management

Content
1. Introduction 1
Professional data management as an essential component
of educational surveys 1
Other related documentation 4

2. An overview of data management


for educational survey research 5
Integrating data management into the survey design 5
Setting-up a data management plan 6
Taking account of the data collection instruments 7
Taking account of field operations 8
The preparation of school and student record forms 9
1. School record forms 9
2. Student record forms 10
3. Planning coding and data entry 10

3. Data management and quality control 12


Common errors during data preparation 13
Preparation of a codebook 15
1. Datafiles, records, and variables 15
2. Identification, data, and control variables 16

The purpose and use of a codebook 17


4. The preparation of a codebook 20


Elements in codebook 20
1. Codebook information for the school identification code 20
2. Codebook information for the student identification code 23
3. Codebook information for question 1: student sex 23
4. Codebook information for question 2: student age 23
5. Codebook information for question 3: regularity of meals 24
6. Codebook information for question 4: availability of books 25
7. Codebook information for questions 5 and 6: books at home
and reading activities 25
8. Codebook information for question 7: student possessions 25

An example of a codebook listing 26

5. The data entry manager software system 30


File construction 31
1. Specifying a filename 31
2. Defining the variables 31
3. Saving the electronic codebook 38

Coding of missing data 38


1. Key requirements 39
2. Basic categories of missing data 39

Data entry 43
1. Basic approaches to data entry 43
2. Using a text editor for data entry 44
3. Using a computer-controlled approach for data entry: the DataEntryManager programme 47

WinDEM 50
1. Entering data 50
2. Reviewing your data 52


6. Data verification 64
Data verification steps 66
1. Verification of file integrity 67
2. Special recodings 68
3. Value validation 68
4. Treatment of duplicate identification codes 69
5. Internal validation of an hierarchical identification system 69
6. Verification of the linkages between datafiles 69
7. Verification of participation indicator variables against data variables 70
8. Verification of exclusions of respondents 71
9. Checking for inconsistencies in the data 71

Data verification procedures using WinDEM 74


1. Unique ID check 74
2. Column check 74
3. Validation check 74
4. Merge check 75
5. Double coding check 75

7. Database construction and database


management 76

8. Conclusion 78

1 Introduction

Professional data management as an essential component of educational surveys
Whenever data are collected in educational survey research studies,
two problems are often found at the data preparation phase. First,
errors can be introduced in the entry of data into computers and as
a result some data collections provide inaccurate and faulty results.
Second, the computer entry and cleaning of data prior to the main
data analyses can be extremely time consuming and therefore this
information can rapidly become “out of date” and consequently lose
its value to policy-makers.

The root causes of these two problems of “accuracy” and


“timeliness” are sometimes associated with the selection of
inappropriate research designs or the use of research designs that
are not manageable within prevailing economic, administrative
and socio-cultural constraints. In other cases, these two problems
arise from the lack of a systematic analysis of decision-making
requirements that can cause too many data of limited use to be
collected.

Unfortunately, in many cases, a study is successful in addressing


the need for an appropriate research design and the identification of the data required to be collected, but breaks down when the data reach the head office. For example, a high quality data collection in the


field can be ruined when: (a) coding and data entry teams are
insufficiently trained or supervised; (b) coding instructions and
codebook specifications are incomplete or inadequate; or (c) the
database management is inappropriate so that information is
lost, composite variables are created incorrectly, data are used at
the wrong level of analysis, or no attention is given to “adjusting”
estimates for the structure of the sample design used.

The issues presented above illustrate the need for a great deal
of thought to be given to the management of data prior to the
commencement of an educational survey research study. In
particular, close attention must be given to: (a) the type of data
collected, (b) the data collection methods, (c) the design of data
collection instruments, and (d) the administrative procedures and
field operations.

In addition, adequate field monitoring and survey tracking


instruments must be prepared, and good data entry and data
verification procedures must be developed to implement these
standards. Finally the organization of the data into adequate
data structures is required in order to facilitate the manipulation,
analysis, and reporting of information.

The following discussion places the spotlight on the broad field


of “data management” for many issues related to the planning of
large-scale educational survey research studies. In these studies,
data may be collected from thousands of students selected by using
quite complex sample designs. The discussion has been extended
to cover the situation where data have been collected at different
“levels” (for example, students, teachers, schools) and therefore may
need to be merged, aggregated, and disaggregated prior to the main
data analyses.


Scope and structure


There are several key steps in data management that are required
to ensure that the quality of collected data is adequate, that data
are turned into useful information, and that the more common
data management problems are avoided. These steps include: (a)
elaboration of data management issues during the preparation
of the survey design, (b) setting data quality standards and the
establishment of quality control mechanisms, (c) preparation of
codebooks; (d) data coding and data entry in computer-readable
format, (e) verification of data; and (f) database design and database
management.

Data management issues will be addressed in the following


discussion from a conceptual point of view and also through a set of
worked examples that lead step by step towards solving frequently-
occurring problems in data management. Several examples have
been presented to illustrate three of the most frequently collected
types of quantitative educational research data: (a) achievement
tests with multiple-choice or pre-coded free response items, (b)
questionnaire data, and (c) numerical measurements.

The discussion commences with an analysis of those aspects of data


management that need to be addressed in the initial phases of the
design of an educational survey. This is accompanied by an analysis
of approaches to the establishment of data quality standards and
mechanisms of quality control. The chapter then deals with how to
transform answers given to questions and achievement test items
into numerical codes that a computer can interpret, and how to
represent the data from questionnaires or achievement tests in a
datafile so that they can be processed and analyzed by a computer.
The chapter concludes with an examination of data entry and
verification procedures, followed by a brief overview of procedures
for organizing information into database systems.


Other related documentation


Some of the examples given below are concerned with the
entry, editing, and verification of data through the use of a
software system for data management called the Windows
DataEntryManager (WinDEM) program. A special version of this
program is available from the IEA. This software has easy-to-learn
features and comes with integrated file management and reporting
capabilities. Using this programme, the deviation of data values
from pre-specified validation criteria or data verification rules can
be detected quickly, thereby allowing the user to correct errors
shortly after the original survey materials arrive at the survey office.
The manual for the WinDEM programme describes how to create
new datafiles or to modify the structure of existing datafiles, and
how to change coding schemes and range validation criteria for
variables. The manual also contains an interactive tutorial through
which the user can learn how to transform a questionnaire into an
electronic codebook, how to set up a datafile, how to enter data into
this datafile, and how to make backup copies of data on diskettes.

2 An overview of data management for educational survey research

Integrating data management into the survey design
Often researchers start to solve data management issues only after
the field administration has been completed and the completed
survey instruments have been returned. In these cases, the data
management plan is prepared after the field administration
has been completed and usually only involves data entry, data
verification, and data analysis. However, in order to avoid
unexpected problems, unnecessary corrective steps, and delays
in data verification and data analysis, it is important to take data
management issues into account during all phases of the research
project.

From the very beginning of a survey, the following issues should


be considered: (a) the type of data collected; (b) the data collection
methods; (c) the design of the data collection instruments in terms
of the development of coding rules and coding instructions; (d) the
design of the administrative procedures including field monitoring
and instrument receipt control; (e) the data entry and the type of
data verification procedures required; (f) the timing and deadlines;
(g) the data processing environment; and (h) the database design.
It is therefore important that staff responsible for managing


educational survey research data by computer are consulted from the very beginning of a study on all issues involving costs, administrative and practical constraints, timelines, and the technical and personnel resources that will be needed.

Setting-up a data management plan


In the planning stage of a survey a detailed “data management plan”
needs to be developed which recognizes that action will be required
with respect to four major components.

First, the resources required for field operations, data entry, and
data processing generally depend on the sample size that is to be
used. In situations where there are severe constraints on resources,
this will often require trade-offs to be made concerning various
factors which influence the quality with which the survey can be
carried out.

Second, the procedures for coding and data entry will depend to a
great extent on the types of response required of the questions in
the data collection instruments.

Third, the establishment of identification codes for data collection


instruments depends upon the units of sampling and the units of
analyses that are to be used. The resources required to establish linkages between information gathered from different units (for example, school heads, teachers, and students) also need to be considered.

Finally, the complexity of data verification procedures depends


on the nature of the response patterns in the data collection
instruments. Special care needs to be taken in dealing with “filter”
or “branching” questions because these can lead to substantial
inconsistencies in responses which must then be dealt with during
the analysis of the data.


Taking account of the data collection instruments
When designing data collection instruments, it is essential to
have a clear picture of the desired information and intended
analyses, including the necessary analyses of reliability and
validity. The amount and type of data preparation required before
information can be used in the data analyses depends on the type
of questions asked and on the kind of data collection instruments
used. A variety of formats exist for asking questions. These
range from simple pre-coded multiple-choice questions which
can be transcribed directly into a computer-readable form, up to
open-ended and free response questions of various kinds which
require highly qualified coding personnel in order to transform
responses into pre-defined categories and numerical codes.

It is important to evaluate the implications of the use of different


types of questions and response formats. For example, asking
students to specify the occupation of their parents in free response
format may result in a large variety of (often very confusing)
answers, and these may be difficult and time consuming to classify
and code. It should be remembered here that the use of open-ended
test items usually requires the steps of coding and data entry to
be separated, whereas with multiple-choice questions or test items
these steps can be addressed in a single operation.

Thought also needs to be given to the physical layout of the


instruments. For example, the instruments can be printed with
codes for each response, coding columns, and control information.
Such improvements to the layout often speed up data entry and improve its accuracy.


Taking account of field operations


The integration of the procedures for the selection of a survey
sample into the procedures for data management can often facilitate
survey operations and reduce survey costs. For example, a list of
selected sampling units based on a computerized sampling frame
can be used to generate address labels, name-lists, and registration
forms – all of which can be used for the purposes of instrument
preparation, field monitoring, data entry, and data verification. The
establishment of proper identification (ID) codes for schools, classes,
teachers, and students is thereby critical, especially when the survey
design requires the linkage of students to their schools, classes and
teachers. The system for assigning these identification codes must
ensure that students, teachers, classes, and schools are identified
uniquely and that there is sufficient information that will permit
verification to be made at the various stages of the survey.

In educational surveys involving different levels of data aggregation,


it is often advantageous to identify respondents through a
hierarchical compound numbering system. In such a numbering
system the first section classifies respondents within the next higher
level of aggregation and, at the same time, identifies respondents
within the classification units. For example, students in a survey
may be assigned School IDs, Class IDs, and Student IDs. The Class
ID would consist of the School ID plus an identification of the class
within the school, and the Student ID in turn could then consist of
the Class ID plus an identification of the student within the class.

When such an identification system is used, the internal


consistency of the identifications can be verified by computer and
the probability of an incorrect identification of respondents can be
reduced because, during data entry, these identification codes can
be automatically cross-validated on the basis of their common parts.
Such a system can, for example, help to ensure that student data is
linked reliably to teacher and school data.
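
To make the idea of a hierarchical compound numbering system concrete, a minimal Python sketch is given below. It builds Class and Student IDs from their components and checks that the codes agree on their common parts. The field widths used (three digits for schools, two for classes within a school, two for students within a class) are illustrative assumptions only, not a prescribed convention.

    # A minimal sketch of a hierarchical compound ID system (illustrative field widths).
    def make_class_id(school_id: int, class_no: int) -> str:
        # Class ID = School ID (3 digits) + class number within the school (2 digits)
        return f"{school_id:03d}{class_no:02d}"

    def make_student_id(school_id: int, class_no: int, student_no: int) -> str:
        # Student ID = Class ID + student number within the class (2 digits)
        return make_class_id(school_id, class_no) + f"{student_no:02d}"

    def ids_are_consistent(school_id: str, class_id: str, student_id: str) -> bool:
        # The identification codes must agree on their common (leading) parts.
        return class_id.startswith(school_id) and student_id.startswith(class_id)

    school, class_no, student_no = 17, 2, 31
    sid = f"{school:03d}"
    cid = make_class_id(school, class_no)
    stid = make_student_id(school, class_no, student_no)
    print(sid, cid, stid)                      # 017 01702 0170231
    print(ids_are_consistent(sid, cid, stid))  # True

During data entry, a check of this kind can be run on every record so that an inconsistent Student ID is flagged as soon as it is typed in.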


It is crucial that unique identification codes are assigned to students,


teachers, and classes, and that these identification codes are
carefully written onto all instruments that are prepared before the
instruments are sent out. This should be supplemented by the use
of survey tracking instruments by means of which respondents can be followed throughout the survey.

The following illustration provides an example of the design of


school and student record forms based on a two-stage sample design in which schools are selected first and then intact classes are selected within those schools.

The preparation of school and student record forms
1. School record forms
This form should include the following items: (i) the official
identification number of the school, (ii) the name, full address,
and telephone number of the school, (iii) the name and telephone
number of the person co-ordinating the assessment in the school,
(iv) the number of classes in the target population in the school, and
(v) the number of students in the target population in the school.

Schools that, despite all efforts, do not co-operate in the assessment


are often replaced with “similar” schools, for example, with schools
of a similar type, size, location and context. This is accomplished
through the association of each sampled school with a replacement
school derived from a separately drawn replacement sample.
Though the use of replacement schools should be discouraged
because it may introduce a bias of unknown magnitude, it is
important to ensure that if replacement schools are used, the school


record form allows the researcher to trace such schools and to


identify schools as replacement schools.

2. Student record forms


For the selected schools, Student Record Forms should be prepared.
These will be of critical importance in various phases of the field
trial. In particular, they provide information relating to: (i) the
identification of students, (ii) the checking of the age and sex of
students, (iii) which test booklets should be given to which students,
(iv) the participation status of students in the test and questionnaire
administration, (v) the students who have been excluded from
the testing, (vi) the instruments that have been lost, and (vii) the
checking of instruments against persons.
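
As a purely illustrative aid, the short Python sketch below writes a minimal Student Record Form to a CSV file. The column names are assumptions based on the items listed above; they do not represent a prescribed SACMEQ or WinDEM layout.

    import csv

    # Illustrative Student Record Form with columns based on the items listed above.
    COLUMNS = ["IDSTUD", "date_of_birth", "sex", "booklet",
               "participation_status", "excluded", "instrument_returned"]

    students = [
        {"IDSTUD": "0170231", "date_of_birth": "1994-03-12", "sex": "girl",
         "booklet": "A", "participation_status": "present", "excluded": "no",
         "instrument_returned": "yes"},
    ]

    with open("student_record_form.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(students)

A form of this kind can be generated directly from the sampling frame before the instruments are sent out, and then completed by hand or on screen during the field administration.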

3. Planning coding and data entry


Once the data collection instruments are finalized the codebook can
be prepared. The codebook provides a comprehensive description of
the contents and layout of the data that are entered into a computer
for analysis.

The largest part of the data collection costs is often caused by
the coding, entry, and verification of the data. Careful thought
must therefore be given to the establishment of consistent coding
schemes that are easy to apply and that cover the potential
responses and different instances of missing data in an exhaustive
and mutually exclusive way. It is important that there are enough
personnel and enough technical resources in order to complete
the entering and cleaning of the data in a timely fashion. What is
especially important is that coders are well trained and that there is
a head-coder to whom queries can be directed and who can decide
what to do when there are problems with the coding.


Sometimes, the tasks of coding and data entry may be separated,


especially where open-ended questions or test items are involved
which require specially-trained personnel. For both coding and
data entry, it is important to test all procedures on a sub-sample of
questionnaires so that the researcher knows how much time will be
required to complete this work.


3 Data management and quality control
It is essential that, prior to the collection of data, a common
framework of data management standards be agreed upon.
Standards in this context comprise: (a) principles to which the
results of data collections and data collection operations should
conform, (b) measures by which the quality and accuracy of
results and procedures can be judged, and (c) steps that must be
undertaken to obtain adequate data in a timely manner.

The main reasons for establishing data management standards are


to ensure the quality of the data, so as to guarantee the integrity of
the data analyses, and to be confident of the adequacy of the results
of these data analyses for answering the intended research and
policy questions.

There are five main elements which must be addressed in order to


ensure that data adhere to quality standards.

• A detailed prior analysis of potential fieldwork problems.

• The specification of data verification rules.

• Adequate training of field administrators.

• The implementation of quality standards during data


verification.

• The development of procedures for the analytical treatment and
reporting of deviations from the quality standards.

The careful preparation of administrative procedures including


manuals, survey tracking instruments, and identification systems
are of critical importance for the verification of data quality
standards.

Common errors during data preparation


The errors that can occur during data preparation are usually
linked with the procedures adopted for instrument design, coding
procedures, and the data collection and data entry methods. For
example, the kinds of errors that occur when free response data
are manually coded and transcribed into computer readable form
differ from the kinds of errors that are likely to occur when data
are entered directly into computers from machine-readable answer
sheets. In the first situation, errors can occur when coders misread
or misinterpret the answers of respondents, when coding rules are
not correctly applied, or when the data are incorrectly transcribed,
such as when data values are omitted, shifted or otherwise wrongly
entered into the computer. For some of these problems it is often
impossible to verify whether the errors have been caused by the
respondent, during the field administration, the coding process, or
during the data transcription.

The ten most common problems in terms of quality standards are listed below (a sketch showing how several of them can be detected automatically follows the list):

• Respondents may have been assigned invalid or wrong


identification codes either during instrument preparation, field
administration or data transcription. This can lead to difficulties
if later analyses require linkages between different respondents
or between different levels of data aggregation.


• Questions may have accidentally been misprinted due to


technical or organizational imperfections, thereby preventing
respondents from giving appropriate answers.

• Questions may have been skipped, or not reached, by the respondents, either in a randomized fashion or in a systematic way, resulting in “gaps” in the data that can lead to misleading results.

• Respondents may give two or more responses when only one


answer was allowed, or questions may have been answered in
other unintended ways.

• Certain data values may not correspond to the coding


specifications or range validation criteria.

• Answers to open-ended questions may contain outlier codes,


that is, there may be respondents with codes which are
improbably low or high even though they could be the valid
answers.

• The values for certain data variables might not correspond to


the values of certain control variables. (For example, the value
of a control variable may state that a particular student did not
respond to a particular question set, whereas the data variables
for this question set indicate actual responses).

• Data from a respondent may contain inconsistent values. (That


is, the values for two or more variables may not be in accord).

• Inconsistencies between data values from different respondents


which belong to a certain group may occur for questions which
are related to this group. (For example, for students in the same
class there may be different values for variables which are
related to the class).

• Inconsistencies may also occur between data values of different


but related datafiles or levels of aggregation.
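
A minimal Python sketch of how a few of the problems listed above (duplicate identification codes, invalid codes, and out-of-range values) can be detected automatically is given below. The records, variable names, and code values are hypothetical and serve only to make the checks concrete.

    # Illustrative automated checks for some of the quality problems listed above.
    records = [
        {"IDSTUD": "0170231", "SSEX": "1", "SAGEY": 11},
        {"IDSTUD": "0170232", "SSEX": "3", "SAGEY": 41},   # invalid sex code, implausible age
        {"IDSTUD": "0170231", "SSEX": "2", "SAGEY": 12},   # duplicate student ID
    ]

    def duplicate_ids(recs):
        seen, duplicates = set(), []
        for r in recs:
            if r["IDSTUD"] in seen:
                duplicates.append(r["IDSTUD"])
            seen.add(r["IDSTUD"])
        return duplicates

    def invalid_codes(recs, variable, allowed):
        return [r["IDSTUD"] for r in recs if r[variable] not in allowed]

    def out_of_range(recs, variable, low, high, missing_codes=()):
        return [r["IDSTUD"] for r in recs
                if r[variable] not in missing_codes and not (low <= r[variable] <= high)]

    print("Duplicate IDs:     ", duplicate_ids(records))
    print("Invalid sex codes: ", invalid_codes(records, "SSEX", {"1", "2", "8", "9"}))
    print("Out-of-range ages: ", out_of_range(records, "SAGEY", 8, 16, missing_codes=(98, 99)))

Checks of this kind are most useful when they are applied as soon as the data are entered, so that the original instruments are still at hand when an error is flagged.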


Preparation of a codebook
1. Datafiles, records, and variables
Data are stored in computers in the form of units called datafiles.
In general terms, a datafile can be described as a collection of
related information. For example, a datafile can contain a number
that identifies each member of a sample of students and gives the
student responses for each item of an achievement test, and, in
addition, provides descriptive background information for each
student. Each datafile is referenced by a unique filename.

The most common form of a datafile is an ASCII raw datafile. In


such a datafile, data are stored in fixed form ASCII format where
ASCII refers to the American Standard Code for Information
Interchange. (If you use a computer other than a Personal Computer
then other interchange format standards may be used which can
usually be transformed into ASCII files). Fixed format implies that
the data for each piece of information are recorded in the same
columns of a datafile for each respondent. In a raw datafile the
different pieces of information are represented next to each other
(in columns) and respondents are represented below each other (in
rows).

Most statistical data analysis systems can read and process raw
datafiles. The user of these systems must “tell the system” in
which location and in which format the data have been written. To
simplify this process, many statistical data analysis systems employ
their own system file format in which the data and all the technical
information concerning the file structure, the data format, and the
coding schemes are integrated. However, these system files can
usually only be used with a specific software system and therefore
are often not suitable for data transfer between different software
systems.


Each respondent is represented in the datafile through one or


more records which comprise all of the data associated with the
respondent. A record is usually represented as a single line in a raw
datafile.

Each record in a datafile contains different categories of


information, for example, the student identification codes, the
student answers on the first test item, the student answers on the
second test item, and so on. Each of these categories of information
is represented in the computer by a variable. It is useful to
distinguish between identification variables, data variables, and
control variables. Each variable is referred to by a variable name.

2. Identification, data, and control variables


Each respondent described in a datafile should be uniquely
identified so that it is possible to distinguish respondents in later
analyses. For example, in order to calculate school mean scores of
student achievement, it is necessary to be able to identify which
school a mean score refers to. To accomplish this, a special set
of variables are defined in the codebook which provide a unique
identification for each respondent from whom information has been
collected. These variables are referred to as identification variables.
In cases where data are collected on several hierarchically related
levels (for example, students, classes, and schools), the identification
of each level of aggregation should be defined as a separate variable
so that in later analyses the connection between successive levels
of the hierarchy can easily be established. In a hierarchical system
the Class ID could for example consist of the School ID plus a
sequential number of the class within the school and the Student
ID could consist of the Class ID plus a sequential number of the
student within the class (see also above).

The variables containing the actual responses from the respondents


are referred to as data variables.


Errors often occur during the entry of data into a computer when
a number of variables have possible values within the same range
and, at the same time, they appear in a sequence or are coded in
a continuous string, because such sequences make undetected column shifts more likely.
To guard against this type of error it is often useful to insert, at
certain positions in the datafile, variables for which a certain fixed
value (for example, a blank space) must be specified. Similarly, it is
often useful to introduce variables that indicate the participation
status of the respondent or that indicate reasons for excluding a
respondent from the assessment. Variables that do not represent
data from the respondents but that are introduced for checking
purposes are usually referred to as control variables.

The purpose and use of a codebook


After the data collection instruments have been returned by the
respondents, the responses must be entered into a datafile. In order
to be useable in computer-based data analyses, these responses
need to be transformed into numeric or alphanumeric codes. This
is usually achieved by having each response associated with a
well defined numerical code, that is usually represented by a fixed
number of digits. For each possible response there should therefore
be one, and only one, code. Questions for which more than one
response can be given must be split into several variables, each
corresponding to one response option.
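
A brief Python sketch of this splitting of a multiple-response question into several one-response variables is given below. The option names are hypothetical and echo the multiple-response reading question used in the exercises earlier in this series.

    # Illustrative splitting of a "tick all that apply" question into separate 0/1 variables.
    OPTIONS = ["extra_class_lessons", "extra_individual_tuition",
               "special_remedial_courses", "other"]

    def split_multi_response(ticked):
        # One indicator variable per response option: 1 if ticked, 0 otherwise.
        return {option: int(option in ticked) for option in OPTIONS}

    print(split_multi_response({"extra_class_lessons", "other"}))
    # {'extra_class_lessons': 1, 'extra_individual_tuition': 0,
    #  'special_remedial_courses': 0, 'other': 1}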

Detailed instructions must be specified describing how data must


be coded and how resultant codes are stored in computer readable
form. The document providing these instructions is usually referred
to as the codebook.

This codebook should be prepared in a standardized way, using


defined naming, layout, and structural conventions. Some


important pieces of information which the codebook should contain


have been listed below:

• The codebook should contain an accurate reproduction of each


question, including the identification of the question and its
sequential number and/or position in the instrument.

• Each variable should be identified by a unique variable name.


Multiple or split variables referring to the same question
should be indicated as such, through a common stem in the
variable names. It is advantageous if the variable names contain
classificatory elements which, for example, may allow the
identification of the population, the type of respondent, the
kind of question, and the response type from the variable name.
Note that most software packages for data analysis impose
certain restrictions on the variable names.

There are four restrictions that apply to most standard software


packages: (i) variable names should have a maximum length of
8 characters, (ii) the first character should be a letter but later
characters can be letters, numbers, or underscores, (iii) blanks
should not be included within variable names, and (iv) variable
names should not contain special characters except for the
underscore. (A short sketch showing how these restrictions can be checked automatically follows this list.)

• Since the variable name can only include a limited amount of


information in highly condensed form, each variable name
should be supplemented by a descriptive label which indicates
the content and/or classification categories of the variable.

• For each pre-coded question there must be a list of all possible


answers along with the definition of the corresponding codes
that are assigned to each of these answers. (Free response
questions require scoring rules and classification schemes
which assign data-values to defined categories).


• The location and format of the data representation in the


computer needs to be defined. (Codes are often represented in
the form of fixed-format integers, decimal numbers, or numbers in exponential notation).

• If a computer system is used for data transcription, then the


codebook should describe where on the screen the data have to
be entered and how the data values can be accessed.

• The codebook should contain a description of the validation


criteria and data verification rules that are associated with the
corresponding variables, a list of the codes that are used to
indicate the various instances of missing data, and instructions
on how missing data are coded.

• The codebook may be supplemented with information useful to


the researcher analyzing the data, e.g. information concerning
the scale-type or measurement class of the variables.
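
The four restrictions on variable names mentioned earlier in this list can be checked automatically; the minimal Python sketch below uses a simple regular expression, which is only one of several ways to express them.

    import re

    # Illustrative check of the four common restrictions on variable names:
    # at most 8 characters, first character a letter, no blanks, and no special
    # characters other than the underscore.
    VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,7}$")

    for name in ("IDSCHOOL", "SAGEY", "1STITEM", "STUDENT AGE", "VERYLONGNAME"):
        print(name, "valid" if VALID_NAME.match(name) else "invalid")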

As the proper analytical use of data will depend on appropriate


coding, the coding must be completed according to the information
in the codebook. It is therefore necessary that general rules
concerning the coding and entry of data are clear to the coders
before they start entering data.

It is advisable to implement certain redundancy checks in the


coding of responses which can be used for later data verification
purposes. For example, a variable which indicates whether a
respondent was administered a particular questionnaire or test can
be used to indicate whether missing data for this questionnaire or
test means that the respondent was not administered the test or else
took the test but did not respond to it.
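
A hedged Python sketch of such a redundancy check is given below. It assumes a hypothetical participation indicator named PARTTEST (“1” = test administered, “0” = not administered) and a small set of item variables, and it flags respondents whose indicator contradicts their item responses.

    # Illustrative cross-check of a participation indicator against data variables.
    NOT_ADMIN = "8"   # assumed "not administered" code for one-digit item variables

    def check_participation(record, item_vars, indicator="PARTTEST"):
        # Return a message if the participation indicator contradicts the item data.
        answered = any(record[v] != NOT_ADMIN for v in item_vars)
        if record[indicator] == "0" and answered:
            return "indicator says 'not administered' but item responses are present"
        if record[indicator] == "1" and not answered:
            return "indicator says 'administered' but all items are coded 'not administered'"
        return None

    student = {"PARTTEST": "0", "ITEM01": "3", "ITEM02": "8"}
    print(check_participation(student, ["ITEM01", "ITEM02"]))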


4 The preparation of a codebook
In the following discussion, the preparation of a codebook is illustrated for a short hypothetical questionnaire.

In preparing the codebook it is often useful to start with the


identification variables and then to continue with the data variables
in the same sequence as they appear in the tests or questionnaires
so that the coders can proceed with the coding in the same
sequence in which they read the data collection instruments. In the
following example we will start the specification of the codebook
with the school identification code which is presented in the header
of the questionnaire.

Elements in codebook
1. Codebook information for the school
identification code
• Variable Name: Each variable must be identified by a unique
variable name. In this example the school identification
variable has been given the name IDSCHOOL.
• Variable Type: The type of coding that is used for the variable
must now be defined. Usually a distinction is made between
alphanumeric variables which are treated as categorical
data and open-ended numerical codes which are treated as

numbers. Sometimes also a distinction between different
types of numerical codes is made. Identification variables
always have categorical codes but we can choose between an
alpha or a numeric data representation.
• Variable Length and Recording Positions: The number of
digits (including decimal places) which are required to code
the data values of this variable and the positions in the datafile
must then be specified. Starting the datafile with the school
identification code we will put this into the columns 1-3 of the
raw datafile.
• Number of Decimal Places: Where decimals are used in data
codes it is necessary to specify how many decimal places
are used. For the school identification code there will be no
decimal places.
• Instrument Location: The codebook should also tell the
coders about the location of information in the data collection
instruments. For example, the coders should be informed that
they will find school identification codes in the headers of
assigned questionnaires.
• Variable Label: A brief descriptive label should be assigned
to the variable that can help later users of the programme to
remember what the short variable name stands for.
• Coding Scheme: For categorical variables it is necessary to specify the code for each possible category. In addition, for all types of variables it is necessary to specify the codes associated with frequently occurring special cases (such as missing, not administered, not reached, etc.).
• Range Validation Criteria: It is often useful to specify a valid
range for the variable that determines which data values
the user is allowed to enter into the computer. Such range
validation criteria may take the form of a simple set of allowed
codes or they may have a complex structure, relating the codes
to responses to other questions or the responses of other
respondents.
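
To make these elements concrete, the hedged Python sketch below shows one simplified way in which the codebook entry for the school identification code could be represented and used for range validation. The structure is illustrative only and is not the WinDEM codebook format.

    # An illustrative, simplified codebook entry for the school identification code.
    IDSCHOOL = {
        "name": "IDSCHOOL",
        "label": "School identification code",
        "type": "N",                 # open-ended numeric
        "columns": (1, 3),           # columns 1-3 of the raw datafile
        "decimals": 0,
        "location": "Header of the questionnaire",
        "missing": 999,
        "not_administered": 998,
        "valid_range": (1, 150),
    }

    def is_valid(entry, value):
        # Accept values inside the valid range or one of the special missing codes.
        low, high = entry["valid_range"]
        return value in (entry["missing"], entry["not_administered"]) or low <= value <= high

    print(is_valid(IDSCHOOL, 17))    # True
    print(is_valid(IDSCHOOL, 400))   # False
    print(is_valid(IDSCHOOL, 999))   # True (coded as missing)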


FIGURE 1 Hypothetical questionnaire

School identification code ———


Student identification code —————
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1
Girl .............................................. 2

2. How old are you? (Put in your age in years)


....................................................... years

3. How often do you eat each of the following meals? (Tick one number on each line)
                    Not at all    1 or 2 times a week    3 or 4 times a week    Every day
(a) Morning meal 1 2 3 4
(b) Lunch 1 2 3 4
(c) Evening meal 1 2 3 4

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1
(b) TV 0 1
(c) Table to Write on 0 1
(d) Bicycle 0 1
(e) Electricity 0 1
(f) Running Water 0 1
(g) Daily Newspaper 0 1


2. Codebook information for the student identification code
The definition of the student identification code in the sample
questionnaire is similar to the definition of the school identification
code, except that a five-digit number should be used. With the
school identification code occupying columns 1-3 in the datafile, the
positions 4-8 could be allocated to the student identification code.
We will give this variable the name IDSTUDENT.

3. Codebook information for question 1: student sex
The first question in the sample questionnaire asks for the student’s
sex. This question can be represented by the variable SSEX. This
variable has a fixed set of categorical codes, namely “1” for “boy”,
“2” for “girl”. The code “9” can be used to indicate missing data,
and the code “8” to indicate that the student was not administered
this question in the sample questionnaire. We therefore specify a
categorical variable type. The length of the code is 1 character and
there are no decimals. Since this question is the first question in the
sample questionnaire, “Question 1” can be given for the instrument
location. An appropriate variable label would be “Student sex”. For
each valid data code a brief description is provided that indicates
the meaning of the codes. These descriptions are usually referred to
as value labels. The code “1” may be selected to specify “boy” and
“2” to specify “girl”. The position of the code for the student sex in
the raw datafile would be column 9.

4. Codebook information for question 2: student age
The next variable in the codebook describes age in years. This
variable can be represented by “SAGEY” for the variable name.
Since students are requested to enter their age as an open-ended


number, the code for this variable is open-ended and therefore it is


necessary to specify an open-ended numerical variable type. The
length is two characters in this case and there are no decimals.
For the instrument location “Question 2” may be specified, and
“Student age in years” for the variable label. A value of “99” can
be used for the “missing” code and “98” for the “not administered”
code. Assuming that the age of the students in the sample ranges
between 8 and 16 years, values of “8” and “16” specify the extremes
of the valid range. The position of the code for the student age in the
raw datafile would be columns 10-11.
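
As a brief illustration of how these special codes are kept out of substantive calculations, the Python sketch below computes a mean age while treating the codes 99 (missing) and 98 (not administered) as missing values. This is a sketch of the principle only, not part of the codebook itself.

    # Illustrative treatment of the SAGEY missing (99) and not-administered (98) codes.
    MISSING_CODES = {98, 99}
    ages = [11, 12, 99, 13, 98, 12]

    valid_ages = [a for a in ages if a not in MISSING_CODES]
    mean_age = sum(valid_ages) / len(valid_ages) if valid_ages else None
    print(len(valid_ages), mean_age)   # 4 valid cases, mean age 12.0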

5. Codebook information for question 3: regularity of meals
In the instructions to this question the student is asked to provide
three answers, one concerning the morning meal, the second
concerning lunch, and the third concerning the evening meal. Since
each variable can contain only one data value, it is necessary to
represent this question by three separate variables with the names
of: SMEALA, SMEALB, and SMEALC. These variables have a fixed
set of categorical codes, namely “1” for “not at all”, “2” for “1 or 2
times a week”, “3” for “3 or 4 times a week”, “4” for “every day”, with
“9” being used to indicate “missing” data, and “8” being used to
indicate that the student was not administered this question. The
code for this variable is a categorical type, a length of one character,
and no decimals. “Question 3a”, “Question 3b”, and “Question 3c”
are specified, respectively, for instrument locations, and the variable
labels are “Frequency of meals/morning meals”, “Frequency of
meals/lunch”, and “Frequency of meals/evening meals”, respectively.
The valid data are 1, 2, 3, and 4 and for the corresponding value
labels are “not at all”, “1 or 2 times a week”, “3 or 4 times a week”,
and “every day” according to the instructions in the questionnaires.
The position of the codes for the three variables on the regularity of
meals in the raw datafile would be columns 12, 13, and 14.


6. Codebook information for question 4: availability of books
Question 4 asks about the availability of books and may be
allocated the name of SBOOKAV. Note that the responses that
follow this question will depend on the answer to this question.
Such variables are described as filter variables. The position of the
code for the question on availability of books would be column 15.

7. Codebook information for questions 5 and 6: books at home and reading activities
Questions 5 and 6 are related to the number of books and the
reading activities of the students. The answers to these questions
depend on the answer to question 4. The coding is similar to
the coding of question 1 with variable names of SBOOKRD and
SBOOKS. The position of the code for the questions on the number
of books at home and the reading activities would be columns
16 and 17. However, if the student answers “No” to Question 4,
the coding for Question 5 should be specially assigned (for the special reason of “missing”), and the coding for Question 6 should automatically become “1” (that is, “none”).
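
A hedged Python sketch of this filter rule is shown below. The variable names follow the text above, the code values follow the codebook listing in Figure 2 (where “1” means “No” and “2” means “Yes” for the availability of books), and the special code used for the legitimately skipped question 5 is an assumption made only for this illustration.

    # Illustrative enforcement of the filter rule linking questions 4, 5, and 6.
    SKIPPED = "9"   # assumed special "missing" code for a legitimately skipped question

    def apply_book_filter(record):
        if record["SBOOKAV"] == "1":      # the student answered "No" to question 4
            record["SBOOKRD"] = SKIPPED   # question 5 receives the special missing code
            record["SBOOKS"] = "1"        # question 6 is forced to "1" (that is, "none")
        return record

    print(apply_book_filter({"SBOOKAV": "1", "SBOOKRD": "2", "SBOOKS": "3"}))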

8. Codebook information for question 7: student possessions
Question 7 is again a “split” question which asks the student about
home possessions. The coding is similar to question 3 except
that there are now 7 distinct variables – each of which has valid
data codes of 0 and 1. The variable names are SPOSSA, SPOSSB,
SPOSSC, SPOSSD, SPOSSE, SPOSSF, and SPOSSG. The position of
the code for the questions on student possessions would be columns
18, 19, 20, 21, 22, 23, and 24.
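
Drawing the column positions derived above together, the hedged Python sketch below shows how one complete fixed-format record could be read back from a raw datafile line. The column layout follows the worked example (columns 1-3 school, 4-8 student, 9 sex, 10-11 age, 12-14 meals, 15-17 books, 18-24 possessions); the sample line itself is invented for illustration.

    # Illustrative parsing of one fixed-format record according to the column layout above.
    LAYOUT = [  # (variable name, first column, last column), 1-based and inclusive
        ("IDSCHOOL", 1, 3), ("IDSTUD", 4, 8), ("SSEX", 9, 9), ("SAGEY", 10, 11),
        ("SMEALA", 12, 12), ("SMEALB", 13, 13), ("SMEALC", 14, 14),
        ("SBOOKAV", 15, 15), ("SBOOKRD", 16, 16), ("SBOOKS", 17, 17),
        ("SPOSSA", 18, 18), ("SPOSSB", 19, 19), ("SPOSSC", 20, 20), ("SPOSSD", 21, 21),
        ("SPOSSE", 22, 22), ("SPOSSF", 23, 23), ("SPOSSG", 24, 24),
    ]

    def parse_record(line):
        # Slice one raw datafile line into named fields.
        return {name: line[first - 1:last] for name, first, last in LAYOUT}

    sample_line = "017002311112342231101110"   # an invented 24-column record
    fields = parse_record(sample_line)
    print(fields["IDSCHOOL"], fields["SAGEY"], fields["SBOOKS"])   # 017 11 3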


An example of a codebook listing


In Figure 2, the codebook for the hypothetical questionnaire has
been presented. This codebook was prepared as output from the
DataEntryManager (WinDEM) software when applied to
the questionnaire presented in Figure 1. The different pieces of
information contained in this hypothetical codebook are described
below:

• The first column in the codebook (Var. No.) presents a


sequential number for each variable in the codebook;

• The second column (Quest. No.) presents an identification of


the background question and its location in the instruments;

• The third column (Variable Name) presents the variable name;

• The fourth column (Variable Label) presents the variable


label;

• The fifth column (Code R:Recode) presents the codes for the
responses, and the recodes for variables for which recoding is
necessary and where recoding is not covered by the general
notes on recoding. Whenever actual numerical data are
supplied in the response to the questions, this is indicated
by the keyword “VALUE”. The missing-code presented in the
codebook indicates “missing/non-response” values. The “not
administered” code presented in the codebook indicates “not
administered” values;

• The sixth column (Option) presents the response phrase


(or an abbreviation of it) that corresponds to the code. For
variables that contain actual numeric data, it contains an
explanation and the permitted range of the value to be
entered;


• The seventh column (Location/Format) presents the location


and format of the variable in the raw datafile. A variable's format is the pattern used to write each value of the variable.
It consists of the variable type, the first column in the raw
datafile that is assigned to the variable, the last column the
variable occupies in the raw datafile, and the length and
the number of decimal places. In the seventh column of the
codebook the first two numbers refer to the position of the
first and last digit of the value of a variable within a record.
“C” and “N” indicate the variable type (where N refers to “non-
categorical” or open-ended numeric variables and C refers to
“categorical” alpha-numeric values). The third number refers to the length of each variable, that is, the number of digits in the value and the number of decimal places associated with it.


FIGURE 2 Codebook for the hypothetical questionnaire

Codebook, Date 13.07.94 File: SAMPLE1.SDB

Var. Variable Code


No. Question Name Variable Label R:Recode Option Location/Format
1 SCHOOL ID IDSCHOOL SCHOOL IDENTIFICATION CODE 999 missing 1- 3/N 3.0
998 not admin.
VLD: (IDSCHOOL>=1.AND.IDSCHOOL<=150).OR.IDSCHOOL=999.OR.
Flags: SCR : 1 / CAR :YES / CAT:D / DEF:
2 STUDENT ID IDSTUD STUDENT IDENTIFICATION CODE 99999 missing 4- 8/N 5.0
99998 not admin.
VLD: (IDSTUD>=1.AND.IDSTUD<=50000).OR.IDSTUD=99999.OR.ID
FLAGS: SCR : 2 / CAR :No / CAT:D / DEF:
3 QUEST 1 SSEX STUDENT’S GENDER 1 boy 9 /C 1.0
2 girl
9 missing
8 not admin.
VLD: SSEX$’1298’
Flags: SCR : 3 / CAR :No / CAT:B / DEF: 9
4 QUEST 2 SAGEY STUDENT AGE IN YEARS 99 missing 10- 11/N 2.0
98 not admin.
VLD: (SAGEY>=8.AND.SAGEY<=16).OR.SAGEY=99.OR.SAGEY=98
Flags: SCR : 4 / CAR :No / CAT:B / DEF:
5 QUEST 3A SMEALA FREQUENCY OF MEALS / MORNING MEALS 1 not at all 12 /C 1.0
2 1 or 2 times a week
3 3 or 4 times a week
4 every day
9 missing
8 not admin.
VLD: SMEALA$’123498’
Flags: SCR : 5 / CAR :No / CAT:B / DEF: 9
6 QUEST 3B SMEALB FREQUENCY OF MEALS / LUNCH 1 not at all 13 /C 1.0
2 1 or 2 times a week
3 3 or 4 times a week
4 every day
9 missing
8 not admin.
VLD: SMEALB$’123498’
Flags: SCR : 6 / CAR :No / CAT:B / DEF: 9
7 QUEST 3C SMEALC FREQUENCY OF MEALS / EVENING MEALS 1 not at all 14 /C 1.0
2 1 or 2 times a week
3 3 or 4 times a week
4 every day
9 missing
8 not admin.
VLD: SMEALC$’123498’
Flags: SCR : 7 / CAR :No / CAT:B / DEF: 9
8 QUEST 4 BOOKAV AVAILABILITY OF BOOKS 1 No 15 /C 1.0
2 Yes
9 missing
8 not admin.
VLD: BOOKAV$’1298’
Flags: SCR : 8 / CAR :No / CAT:B / DEF: 9

9 QUEST 5 BOOKRD READING FREQUENCY 1 Never 16 /C 1.0
2 Sometimes
3 Always
9 missing
8 not admin.
VLD: BOOKRD$’12398’
Flags: SCR : 9 / CAR :No / CAT:B / DEF: 9
10 QUEST 6 SBOOKS NUMBER OF BOOKS AT HOME 1 none 17 /C 1.0
2 1 to 10 books
3 11 to 50 books
4 more than 50 books
9 missing
8 not admin.
VLD: SBOOKS$’123498’
Flags: SCR : 10 / CAR :No / CAT:B / DEF: 9
11 QUEST 7A SPOSSA HOME POSSESSIONS / RADIO 0 do not have this 18 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSA$’0198’
Flags: SCR : 11 / CAR :No / CAT:B / DEF: 9
12 QUEST 7B SPOSSB HOME POSSESSIONS / TV 0 do not have this 19 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSB$’0198’
Flags: SCR : 12 / CAR :No / CAT:B / DEF: 9
13 QUEST 7C SPOSSC HOME POSSESSIONS / TABLE TO WRITE ON 0 do not have this 20 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSC$’0198’
Flags: SCR : 13 / CAR :No / CAT:B / DEF: 9
14 QUEST 7D SPOSSD HOME POSSESSIONS / BICYCLE 0 do not have this 21 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSD$’0198’
Flags: SCR : 14 / CAR :No / CAT:B / DEF: 9
15 QUEST 7E SPOSSE HOME POSSESSIONS / ELECTRICITY 0 do not have this 22 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSE$’0198’
Flags: SCR : 15 / CAR :No / CAT:B / DEF: 9
16 QUEST 7F SPOSSF HOME POSSESSIONS / RUNNING WATER 0 do not have this 23 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSF$’0198’
Flags: SCR : 16 / CAR :No / CAT:B / DEF: 9
17 QUEST 7G SPOSSG HOME POSSESSIONS / DAILY NEWSPAPER 0 do not have this 24 /C 1.0
1 have one or more
9 missing
8 not admin.
VLD: SPOSSG$’0198’
Flags: SCR : 17 / CAR :No / CAT:B / DEF: 9


5 The data entry manager software system
There are software systems which allow one to create a codebook in
an interactive way. The following discussion covers this step-by-step
process using the WinDEM programme provided by the IEA (for
more detailed information, refer to the WinDEM programme manual).

For each data file which you create with the WinDEM programme,
the programme maintains an electronic codebook which contains
all technical information required to define the file structure, the
coding scheme, the data verification rules, and quality standards
for the datafile. Whenever variables are modified, the programme
updates the electronic codebook automatically.

To illustrate the operations of the WinDEM software, consider the


preparation of a datafile that can hold the data from the sample
questionnaire in Figure 1. This will require the following three
steps: creating a new datafile, defining the variables to be included
in the datafile, and saving the resulting electronic codebook. Each
of these steps requires the user to provide input to the WinDEM
programme through a series of questions and prompts. In the
following discussion an example of this process has been presented
along with a listing of the “dialogue” that occurs between the user
and the computer.

File construction
1. Specifying a filename
In order to create a new datafile, the programme will first ask you
to give your datafile an alphanumeric name with a length of up to 8
characters, for example, SAMPLE1.

2. Defining the variables


The next step is to define the information to be stored in the
datafile. This can be done in the form of a “dialogue” with
the computer, where the computer will ask you to specify the
characteristics of the variables in the datafile.

A display as shown in Figure 3 will appear where you can fill in the
variable definitions in the codebook fields:

FIGURE 3 The variable definition display (first part of dialogue)


a. Essential information
The following pieces of information are essential for the
definition of a variable.
Unique Variable Name: Each variable must be identified
by a unique variable name. We will start with the school
identification code which is presented in the header of the
questionnaire. We have given it the name “IDSCHOOL”, so
you would enter “IDSCHOOL” into the first blank field.
Variable Type: The next question asks about the type of coding
that is used for the variable. The letter “C” indicates categorical
variables with a fixed set of alphanumeric or numeric
categories. The letter “N” indicates non-categorical variables
with open-ended numerical codes. While there are a fixed
number of schools and therefore only a fixed set of possible
school identification values, the number of possible values is
very large and can be understood as quasi-open-ended, so you
should enter “N” into the second blank field.
Variable Length: Afterwards you need to specify the number
of digits (including decimal places) which are required to code
the data values of this variable. Assuming that, in our example,
there are 150 schools the identification codes of which are the
numbers 1 to 150, we can use a three-digit code to identify the
schools, so you would enter “3” into the codebook field for the
length.
Decimals: Afterwards you can specify the number of decimal
places to be used in the codes. In the school identification code
there are no decimal places, so you would leave the “0” in this
codebook field which is the default value and go to the next
codebook field.
Location in Instrument: The next piece of information will tell
the coders where (in the data collection instruments) they will
find the question used as the source of information. You can


fill in a short description that helps to locate the information


quickly. In our example, you could enter “School ID” into this
codebook field to indicate that the codes for this variable are
found in the identification part of the questionnaire.
The “Hide variable” Indicator: The question “Allow
modification of variable?” asks you to specify whether a
variable will be visible and editable in the WinDEM display
when you enter data or not. “Y” indicates that the value will be
displayed during the data entry stage, “N” indicates that the
value will not be displayed. As the later users need to enter
the school identification code, you should enter “Y” in this
codebook field.
The “Carry on” Indicator: The question “Carry data values on
as default?” asks you to specify whether the value of a variable
is carried as a default value to the next record when you enter
data. This is useful for variables which remain constant for a
number of records. If the “Carry” indicator is set to “Y” for a
particular variable, then every new record will have the data
value from the previous record as the default value. You can
then modify this default value as required. If the “Carry”
indicator is set to “N”, then the default value for this variable
will be the default value which was specified for this variable.
As we may be entering many students for the same school, you
should enter “Y”.
Order (Display): You can specify the sequential position in
which variables will appear in the WinDEM display during
data entry. If you do not specify anything, the programme will
set these sequential positions so that the variables appear on
the display in the sequence in which you define them.
Order (File): Similarly, you can specify the sequential position
in which variables will be recorded in the datafiles. If you do
not specify anything, the programme will set these sequential


positions so that the variables are recorded in the datafile in the
sequence in which you define them.
Field Label: For the descriptive label you could fill in “School
identification code”.

b. Optional coding information


Afterwards the display will expand to the display as shown
in Figure 4. These additional pieces of information should be
filled in to provide further information on the coding of the
variable.
FIGURE 4 The variable definition display (second part of dialogue)

Code for “Missing” Data: Following the above specifications,


in the case of the variable IDSCHOOL you could enter the
code “999” to indicate missing or omitted data.
Code for “Not Administered” Data: Correspondingly you
could specify “998” to indicate “not administered” data for the
variable IDSCHOOL.


“Default” Code: You can provide a code that will be used as


a programme default when you create a new record in the
datafile. In the case of the variable IDSCHOOL, you could
leave this codebook field blank or specify 999 as its default
code.
Valid Range: You can specify a valid range that determines
which data values the user is allowed to enter when entering
data. Assuming that in our example, there are 150 schools
the identification codes of which are the numbers 1 to 150,
you would enter the numbers 1 and 150 in the corresponding
codebook fields.
Variable Class: You can classify variables according to their
use in later data analyses. Since the variable IDSCHOOL is
an identification variable, select the keyword “ID”. Note that
only when the variable class is “ID” can the programme distinguish these
variables as identification variables.
Comment: You can associate a descriptive comment with the
variable which will be printed in the electronic codebook.

c. Adding variables
Having completed the definition of the variable, the
programme will bring you back to the tabular display where
you can review your definitions or add new variables. In the
following discussion you will find two more examples for
the preparation of variables in the electronic codebook. The
definition of the student identification code is similar to the
definition of the school identification code, except that a five-
digit number will be used. You would enter “IDSTUD” for the
variable name, “N” for the variable type, “5” for the length, “0”
for the number of decimals, “Student ID” for the instrument
location, and “Student identification code” for the variable
label. Then you could enter “99999” for the “missing” data code
and “99998” for the “not administered” data code.


The first question in the sample questionnaire asks for the


student’s sex. We have represented this question by the
variable SSEX. You would therefore enter “SSEX” for the
variable name. This variable has a fixed set of categorical
codes, namely “1” for “boy”, “2” for “girl”, “9” to indicate
missing data, and “8” to indicate that the student was not
administered this question in the sample questionnaire. You
should enter “C” for “categorical” into the codebook field for
the variable type. For the length enter “1” and for the number
of decimals enter “0”. Since this question is the first question in
the sample questionnaire, you may enter “Question 1” into the
codebook field instrument location. For the variable label enter
“Student sex”. For the “missing” code and the “default” code
enter “9” and for the “not administered” code enter “8”.
The programme will then ask you to specify the number
of valid data codes for this question. Note that codes for
“missing” and “not administered” data are not counted as valid
data, so your answer should be “2” (for “boy” and “girl”). The
programme will then ask you to define the valid codes:
FIGURE 5 The variable definition display (defining codes and value labels)


For each valid code, you will find one row displayed in a
small window. In the blank fields on the left hand side of this
window you should enter the codes, and in the blank fields
on the right hand side you should enter the meaning of the
codes, which are referred to as the value labels. For the code
“1” you would enter “boy” and for the code “2” you would
enter “girl” (the codes and value labels should be based on the
questionnaire presented in Figure 1).
For the variable class you should select “D” to indicate that
this question refers to the student’s description. Since the
remaining questions in this questionnaire also refer to the
students’ description, you should select “D” for the
variable class for these variables as well.
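
To summarize the definitions made so far, the following short sketch
(written in Python purely for illustration; WinDEM keeps this information
in its own electronic codebook, and the field names used here are
hypothetical) shows how the essential and optional information for
IDSCHOOL, IDSTUD, and SSEX might be represented:

# Illustrative representation of the codebook entries defined above.
# The field names are hypothetical; WinDEM stores these definitions
# in its own electronic codebook format.
CODEBOOK = [
    {"name": "IDSCHOOL", "type": "N", "length": 3, "decimals": 0,
     "location": "School ID", "label": "School identification code",
     "missing": "999", "not_administered": "998", "valid_range": (1, 150),
     "variable_class": "ID"},
    {"name": "IDSTUD", "type": "N", "length": 5, "decimals": 0,
     "location": "Student ID", "label": "Student identification code",
     "missing": "99999", "not_administered": "99998",
     "variable_class": "ID"},
    {"name": "SSEX", "type": "C", "length": 1, "decimals": 0,
     "location": "Question 1", "label": "Student sex",
     "missing": "9", "not_administered": "8",
     "valid_codes": {"1": "boy", "2": "girl"},
     "variable_class": "D"},
]

for variable in CODEBOOK:
    print(variable["name"], "-", variable["label"])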

EXERCISE 1:
Complete defining all the variables in SAMPLE1 based on the
questionnaire (Figure 1) and the printed codebook (Figure 2). You
should have the following screen (Figure 6) when you finish:

FIGURE 6 Completed variable definition display


3. Saving the electronic codebook


Once you have defined all variables in the electronic codebook,
the programme will ask you to confirm that you want to save the
codebook. Afterwards the programme will verify your definitions
for formal correctness. If the programme detects any errors, these
will be indicated on the display and the programme will bring you
back into the tabular display with the definitions where you can
correct these errors.

If no errors are found, the codebook will be saved and the


programme will bring you back into the main menu. You are now
ready to enter data into the new datafile.

Coding of missing data


In preparing a codebook, careful thought has to be given to
how to code different instances of missing data and
how to treat these different categories of missing data in data
analyses.

If you define none or too few categories of missing data, you may
end up with severe problems in the data analyses. For example, to
calculate the percentage of correct answers for an item in a reading
test you may want to assume that the students who omitted an item
could not answer it and will therefore be scored as wrong. However,
it would be unfair to score some items as wrong which were not
administered to the student because they were, for example,
misprinted in the student booklet. If the coders do not assign
different codes for each of these instances then you will not be able
to make that distinction in the data analyses.

On the other hand, if you define too many categories of missing


data for which there is no analytical use, it may be very difficult for


the coders to distinguish between the different instances of missing


data, and the coding may be unnecessarily complicated.

Some distinctions between different instances of missing data must


be made by the coders before the data are entered into the datafile,
whereas there are other distinctions which can be derived later
when the data are being processed.

1. Key requirements
The codes for missing data need to represent the different instances
of missing data exhaustively. This means that each code in the
datafiles should either represent a valid data value or one of the
missing codes. There should never be a situation where a position in
the datafile is just left blank. There should also never be a situation
where there is no data from the respondent but none of the missing
codes applies.

Secondly, the missing codes should be mutually exclusive. This


means that there should be no ambiguity concerning which missing
code to apply in each particular situation, and there need to be clear
definitions and instructions on how to assign the missing codes.

Finally, it should be clear how the missing codes are coded in the
datafile and how the different instances of missing data are treated
in the data analyses.

2. Basic categories of missing data


The minimum distinction which the coders must make when
entering data is between: i) data that are missing because they were
omitted by the respondents or answered in an invalid way; and
ii) data that are missing because a question or test item was not
administered.


a. Missing/omit
“Missing/omit” codes refer to questions/items which a
respondent should have answered but which he/she either did
not answer or which were answered in an invalid way (though
sometimes a finer distinction between these categories may be
required). Some obvious reasons for assigning this code:
No Response: Where there was no response to a question or an
item where there should be one.
Two or More Responses: Where there were two or more
responses when only one answer was allowed.
Response Unreadable: Where the response was unreadable or
uninterpretable. Often the codes “9”, “99”, “999” (depending on
the length of the variable) are assigned to this type of missing
data to distinguish them from the valid and “not applicable”
data.
Sometimes a further distinction between questions that
were omitted by a respondent and questions that have been
answered in an invalid way is required but the analytical
distinctions will then be very complicated.

b. Not administered
“Not administered” codes are assigned when data were not
collected for an observation on a specific variable. There are
some obvious cases when this code should be used:
Respondent Not Present: For example, if a student was not
present in a particular testing session, then all variables
referring to that session should be coded as “not
administered”. However, if the student received the instrument
but did not answer particular questions, then these questions
must be coded as “missing”.


Booklet Not Received: If a student did not receive a particular


test instrument then all variables referring to that test
instrument should be coded as “not administered”.
Item Left Out or Misprinted: If a particular question or item
(or a whole page) was misprinted, left out, or not available to
a student, teacher or school then the corresponding variables
should be coded as “not administered”.
Item Mistranslated: If an item was mistranslated, then all
observations for this item should also be coded as “not
administered”.
The codes “8”, “98”, “998” (depending on the length of the
variable) are often assigned to “not administered” data to
distinguish them from the valid and other missing data.
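
Because the “missing” and “not administered” codes depend only on the
length of the variable (“9”, “99”, “999” and “8”, “98”, “998” respectively),
they can be derived mechanically. The following small Python sketch is
given only as an illustration of this convention:

def missing_code(length):
    # "9", "99", "999", ... depending on the length of the variable
    return "9" * length

def not_administered_code(length):
    # "8", "98", "998", ... depending on the length of the variable
    return "9" * (length - 1) + "8"

# Examples: a one-digit question, the three-digit school ID,
# and the five-digit student ID
for length in (1, 3, 5):
    print(length, missing_code(length), not_administered_code(length))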

c. Examples for derived categories of missing data


In certain situations, there are categories of missing data
which can be derived from existing data.
When a respondent was not meant to answer a variable
because of its logical relationship to other variables, these
variables could be recoded to the missing code “logical not
applicable”. For example, if a respondent gave a negative
answer to a filter question, then the corresponding dependent
questions could be recoded to “logical not applicable” unless
all dependent variables indicate that the filter variable was
incorrectly coded, in which case it might be better to recode the
filter variable.
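
As an illustration of such a derived category, the following Python sketch
recodes the dependent questions of a negative filter question to a
“logical not applicable” code; the variable names, the code “7”, and the
choice of dependent questions are assumptions made only for this example:

LOGICAL_NOT_APPLICABLE = "7"   # hypothetical code chosen for this illustration

def recode_dependents(record, filter_var, negative_code, dependent_vars):
    # If the filter question was answered negatively, recode its dependent
    # questions to the "logical not applicable" missing category.
    if record.get(filter_var) == negative_code:
        for var in dependent_vars:
            record[var] = LOGICAL_NOT_APPLICABLE
    return record

# Example: question 4 ("any books at home?", code "2" = No) filters
# questions 5 and 6 in the sample questionnaire.
student = {"Q4": "2", "Q5": "9", "Q6": "9"}
print(recode_dependents(student, "Q4", "2", ["Q5", "Q6"]))
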
Data recorded in an invalid or inconsistent way have in some
cases been recoded to a special missing code “invalid”. In this
sense, “invalid” means that data were recorded in an invalid
way, i.e. that the coder coded a variable to a data value that
did not conform to the specifications in the codebook; this
does not necessarily mean that the respondent gave an invalid
response.


d. Coding of absentees and excluded students


Certain students within the selected schools may, for different
reasons, be unable to take part in the assessment. Countries
differ widely in the percentage of the population that is
considered to be in this position and this category should be
held to a minimum to avoid biasing international comparisons.
In some educational systems these students are located in
special schools or in special classrooms and the information
available for the construction of the national sampling frame
may allow the identification of schools and students belonging
to the excluded populations prior to the construction of the
within-school sampling frames. However, in other educational
systems this information is often not available. For example,
this can occur in countries where such students are integrated
in some schools of the mainstream schooling system even
though they may be part of the excluded population.
To accommodate this situation, precise standards should be
defined which allow these students to be excluded from the
administration of the tests. For example, it will clearly not
be sufficient for a study to state that “handicapped” students
may be excluded, because the understanding of “handicapped”
may include different kinds of physical, emotional,
and mental disabilities in different countries and therefore
may vary considerably between countries.
Care needs to be taken in finding comparable categorizations
for the within-school exclusion of students and it must be
ensured that these are coded appropriately in the datafiles.
The results of a data collection will be seriously threatened if
excluded respondents are simply ignored.


Data entry
Once the data have been returned from the respondents the data
need to be recorded in computer readable form. This section
provides an overview of different approaches to data entry and then
discusses two approaches to data entry in a more detailed way.

1. Basic approaches to data entry


Data may be collected on free-text notebooks, questionnaires,
optical scanning forms, or micro-computers. All further steps
depend on the quality with which the data entry is completed.
Inaccurate data entry often causes substantial delays in the data
verification and data analysis phases of a survey.

Adequate procedures for data entry depend on instrument design


and on the data collection methods. Sometimes in large scale
surveys, data entry procedures are used wherein data are recorded
directly in computer readable form using optical or magnetic
character readers, optical or magnetic mark readers, or micro-
computers during fieldwork. Examples of this are computer assisted
telephone interviewing (CATI) and computer assisted personal
interviewing (CAPI) systems. Whereas transcription errors can
be minimized with these procedures, the use of such technical
innovations requires careful planning, an expensive technical
environment, and trained respondents.

The more common approaches for data entry in educational surveys


are transcriptive procedures in which respondents write their
answers onto the instruments. The answers are then transcribed
either to machine readable form or directly into the computer.
Transcription is usually costly, sometimes requiring up to half of
the total data processing costs. If the response formats are complex
or the coding requires specially trained coding personnel, then


an additional coding stage may need to be inserted in which the


responses are translated into their codes which are then written
on the instruments or transcribed to special code-sheets. Although
this introduces an additional source of error, separating
coding from data entry allows faster coding of the data and does not
require coding skills of the data entry personnel.

Key verification procedures, or better still, independent


verification techniques where two coders code and enter the data
independently, can help to ensure the correctness of the data
entered. While it may be too costly to verify the whole dataset in this way, at
least a reasonably sized sample of the data should be verified using
these techniques in order to estimate the error introduced and to
decide on further corrective measures to ensure sufficient data
quality. Often it is advantageous to identify the coder who entered
each record so that any errors can be traced back. This can be done
by adding a coder identification code to the datafile.

It is important to trial test data entry procedures at an early stage so


that resources required for timely entry can be planned.

2. Using a text editor for data entry


For each piece of information in the data collection instruments, the
codebook defines in which format and in which positions it should
be entered into the raw datafile. Following the definitions in the
codebook, it is possible to simply enter the data into a text editor or
word processor. An example of how such a text file would look
is provided below, using the codebook of the sample
questionnaire above.

103103042 83941991019110
103103051124232130110110
103104063 92221241000110


As you can see, each record starts with the School ID (103),
followed by the Student ID (10304), the student sex (the 2 indicates
a girl), the student’s age (8 years), and so on until all variables in the
codebook have been coded.
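
To make the column layout concrete, the following Python sketch slices
the first record above according to the positions defined in the codebook
(a three-digit school ID, a five-digit student ID, a one-digit student sex,
and the student age in columns 10-11); the remaining positions are not
broken down here, and the sketch is only an illustration of the
fixed-format principle:

record = "103103042 83941991019110"   # first record of the example datafile

# Column positions follow the codebook; Python slices are zero-based.
idschool = record[0:3]    # columns 1-3:   school identification code
idstud   = record[3:8]    # columns 4-8:   student identification code
ssex     = record[8:9]    # column  9:     student sex
sage     = record[9:11]   # columns 10-11: student age

print(idschool, idstud, ssex, sage.strip())   # prints: 103 10304 2 8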

However, a great deal of caution must be used when following
this approach, and there is usually a great deal of work involved in
resolving the problems that result from it. Four frequently
occurring problems are listed below:

If, by mistake, a coder skips a code or enters a code twice, then all
subsequent codes in the datafile will be shifted and thus change
their implied meaning in the datafile:

Incorrect: 10310304283941991019110
Correct: 103103042 83941991019110

The student age should be coded in columns 10-11. If, as in the


above example for student 10304, the coder puts the code for the
age in the 10th position only and then continues in position 11 with the
remaining variables, then columns 10-11 would contain the value
83 and the computer would interpret this as an age of 83 years in
later analyses. All variables following the student’s age would be
misinterpreted similarly. This can have a dramatic impact on the
statistical results: for example, if we calculate the mean age and
there is an outlier of 83 years in the datafile, then the overall
mean can change substantially, especially if the sample size is small.

The approach also does not allow you to verify during data entry
whether the data values entered actually conform to the
specifications in the codebook:

Example: 103104063 92221241000110


In this example the position for the student sex contains the value
“3” which is outside the set of permitted values (“1” for “boy”, “2”
for “girl”, and “8” and “9” for the missing codes) and is obviously a
coding error. Besides losing the information for this student, such an error also
has, if it remains undetected, an impact on the results of statistical analyses.

Furthermore, such an approach does not allow you to verify the data for
internal consistency while the data are entered:

Example: 103103042 83941341019110

In this example, the questions on the student reading activities and


the number of books have been coded as “3” and “4” respectively,
indicating that the student always reads these books and that there
are more than 50 books in his or her home. Looking at the sample
questionnaire from which the data was entered, we see that both
questions were left blank by the respondent and therefore should
have been coded to “9”. Now it could be argued that a computer-validated
range check would not have found this error either, since “3” and “4” are
valid codes for these questions and in accordance with the codebook.
However, a computerized consistency check could have taken into account
that these questions depend on a filter question (question 4) asking
whether there are any books at home. When the coder enters a code
indicating that there are more than 50 books at home while at the same
time the filter question indicates that the student does not have
access to books outside school, the computer could alert
the coder, requesting that these data values be checked once more
against the data collection instruments. Only if the respondent had
indeed answered inconsistently would the coder then enter these
inconsistencies into the datafile.

Another problem is duplicate identification codes or


inconsistencies in the identification codes:


Example: 103104063 92221241000110

In this example, the school identification code does not match the
first three digits in the student identification code, even though a
hierarchical identification system was used. When entering the data into a
text file, the coder might not notice this mistake. While this problem
would be impossible to resolve once the original data collection
instruments are no longer available, a computer-controlled data
entry programme could verify the student and school identification
codes during data entry for internal consistency, alerting the coder
that either the student or the school identification code contains an
error and asking the coder to immediately check this information
back against the original data collection instruments.
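
Each of the problems described above can be detected with quite simple
automatic checks. The following Python sketch illustrates the kind of
validation that a computer-controlled data entry programme performs; the
expected record length of 24 characters and the column positions are taken
from the example codebook, and the sketch is not a description of the
actual WinDEM checks:

EXPECTED_LENGTH = 24   # total record length implied by the example codebook

def check_record(record):
    problems = []
    # 1. Column shifts: a skipped or doubled code changes the record length.
    if len(record) != EXPECTED_LENGTH:
        problems.append(f"record length {len(record)} instead of {EXPECTED_LENGTH}")
    # 2. Range validation: student sex (column 9) must be 1, 2, 8 or 9.
    if len(record) > 8 and record[8] not in ("1", "2", "8", "9"):
        problems.append(f"invalid code '{record[8]}' for student sex")
    # 3. Hierarchical identification: the school ID (columns 1-3) must match
    #    the first three digits of the student ID (columns 4-8).
    if record[0:3] != record[3:6]:
        problems.append(f"school ID {record[0:3]} does not match student ID {record[3:8]}")
    return problems

for rec in ("10310304283941991019110",       # skipped blank: column shift
            "103104063 92221241000110"):     # invalid sex code and ID mismatch
    print(rec, check_record(rec))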

3. Using a computer-controlled approach for data entry:
the Data Entry Manager programme
Transcriptive data entry can be greatly facilitated through the use
of interactive data entry software that integrates the processes of
the entry, editing, and verification of the data. Such data entry
systems often also come with integrated data and file management
capabilities, including mechanisms for the transfer of the data
to standard statistical analysis systems. Using such systems,
deviations of data values from pre-specified validation criteria or
data verification rules can be detected quickly, thereby letting the
user correct the error while the original documents are still at hand.

An example of such a programme is WinDEM, which is
provided by the IEA and which is briefly described in the following:

This programme has been designed to be used by users with limited
experience in computer use. The programme can handle datafiles
with more than 1000 variables and data for more than 1 000 000 000
respondents. All datafiles created are fully compatible with the
dBASE IV™ standard.

The WinDEM programme operates through a system of menus and


windows. It contains nine menus with which you can accomplish
different tasks.

With the FILE menu you can open, create, delete or sort a datafile.
As you have seen earlier in this module, you can use this menu
also to edit the electronic codebook which is associated with
each datafile and which contains all information about the file
structure and the coding schemes employed. Furthermore, you
can use this menu to print the electronic codebook or to transform
the information in the electronic codebook into SAS™, SPSS™,
or OSIRIS/IDAMS™ control statements which you can use later
in order to convert the datafiles into SAS™, SPSS™, or OSIRIS/
IDAMS™ system files. Finally this menu allows you to exit the
WinDEM programme.

With the EDIT menu you can enter, modify, or delete data in
datafiles. You can look at a datafile in two different ways: (a) in
record view, you can view the data for one record at a time with
detailed information on each of the variables; (b) in table view, you
can view a datafile as a whole in tabular form with records shown
as rows and variables shown as columns. The programme will
control the processing of the data entered, interrupting and alerting
you when data values fail to meet the range validation criteria which
are specified in the electronic codebook.

With the SEARCH menu you can search for specific records using
your own search criteria or locate a record with a known record
number.

With the SUBSET menu, you can define a subset of specific records
using your own criteria. This will then restrict your view of the data
to the records which match these criteria.


You can use the PRINT menu to print pre-selected records on a


printer or to a text file.

You can use the IMPORT/EXPORT menu to generate fixed


form ASCII raw datafiles or free format datafiles from the
WinDEM system files or to import raw datafiles or free format
datafiles created with other software packages into the WinDEM
programme. For example, if you want to verify and clean a datafile
that has been created with a text editor, you can use the Import item
of the IMPORT/EXPORT menu. The Import item of the IMPORT/
EXPORT menu may also be used to combine several datafiles into
a single datafile. The Export item of the IMPORT/EXPORT menu
is helpful if you want to further process a datafile with software
packages like SAS™ or SPSS™.

You can use the VERIFY menu to apply a variety of data


verification checks to your data. With the knowledge of the data
verification rules which are specified in the electronic codebook,
the programme will check the datafiles and report when problems
occur. These problems can then be resolved in record view or table
view.

With the ANALYSIS menu you can calculate simple univariate


statistics. You can select the variables as well as the records for
which statistics are to be calculated by various criteria.

You can use the TOOLS menu to back-up data from the hard disk
onto diskettes or to restore data from the backup diskettes in case
the data on your hard disk has been damaged. You can further use
this menu to configure the programme to your specific hardware
environment.

The following section contains an example of how to interactively


enter data from our sample questionnaires using the WinDEM
programme.


Entering data using WinDEM
This would require the following steps:

• opening the datafile;

• choosing your view on the data;

• entering the identification codes for the respondent;

• entering the response data; and

• saving the data.

After selecting the datafile, the programme will bring you to the
EDIT menu, where you can choose to look at the datafile in two
different ways, in record view or in table view. The difference
between the two displays is that record view will provide you
with a detailed display of one record at a time, whereas table view
will provide you with a tabular overview of several records of the
datafile at the same time.

1. Entering data
The record view: Suppose we choose the Record view item
from the EDIT menu.

When you start editing in Record view, you will see some useful
information on the screen (Figure 7).

• the main menu with the highlighted bar positioned on the


currently selected menu (EDIT);


• the status line with the filename (SAMPLE1.DBF), the number


of the current record, and the total number of records (1 of 1)
in the datafile; and

• fields filled with default codes (“999” and “99999”) for the
identification variables (IDSCHOOL and IDSTUD respectively).

You may enter data in the fields filled with default codes. You can go
to a previous variable with the [↑] key or to a subsequent variable
with the [↓] key, provided that the variable in which the cursor is
currently positioned has a valid code.

You have to complete the identification variables first. Into


the identification fields on the display you would enter the
identification codes shown in the sample questionnaire, that is,
enter 103 for the variable IDSCHOOL and 10304 for the variable
IDSTUD.

FIGURE 7 Entering codes for the data variables


Note that the programme will only allow you to enter those
values which match the criteria which you have specified in the
codebook. For example, if you enter the code “3” for the variable
SSEX (student’s sex), the programme will reject this value. This is
because we have defined only the codes “1” for “boy” and “2” for
“girl”, “9” for “missing”, and “8” for “not administered”.

If, for an open-ended or “non-categorical” variable (variable type


“N”), a data value is entered which is outside the range specified in
the codebook, then the programme will alert you and ask you to re-
enter the value.

2. Reviewing your data


You can review your datafile in a tabular display in which each
student is represented in one row with the different variables
represented as the columns. To enter into this view, select the Table
view item of the EDIT menu. The following display will then appear
(Figure 8):

FIGURE 8 Reviewing data in table view


EXERCISE 2:
Enter data using the questionnaires filled in by 10 students, shown in
Figure 9.


FIGURE 9 10 cases of collected questionnaires


Case 1

School identification code 103


Student identification code 10304
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1
Girl .............................................. 2 v

2. How old are you? (Put in your age in years)


8
....................................................... years

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2 3v 4
(b) Lunch 1 2 3 4
(c) Evening meal 1 2 3 4v

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0v 1
(c) Table to Write on 0 1v
(d) Bicycle 0 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0v 1


Case 2

School identification code 103


Student identification code 10315
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1 v
Girl .............................................. 2

2. How old are you? (Put in your age in years)


10 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2v 3 4
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3 4v

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3 v
Sometimes................................... 2
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2 v
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0 1v
(c) Table to Write on 0 1v
(d) Bicycle 0v 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0 1v


Case 3

School identification code 111


Student identification code 11108
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1 v
Girl .............................................. 2

2. How old are you? (Put in your age in years)


9 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2 3 4v
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3 4v

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2 v
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1 v
1 to 10 books ............................. 2
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0v 1v
(c) Table to Write on 0 1v
(d) Bicycle 0v 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0v 1


Case 4

School identification code 123


Student identification code 52316
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1
Girl .............................................. 2 v

2. How old are you? (Put in your age in years)


9 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2 3v 4
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3 4v

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1
No .............................................. 2 v If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2 v
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1 v
1 to 10 books ............................. 2
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0v 1
(b) TV 0v 1
(c) Table to Write on 0v 1
(d) Bicycle 0v 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0v 1


Case 5

School identification code 112


Student identification code 11209
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1
Girl .............................................. 2 v

2. How old are you? (Put in your age in years)


15 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2v 3 4
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3v 4

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2 v
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2
11 to 50 books ........................... 3
More than 50 books ................. 4 v

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0 1v
(c) Table to Write on 0 1v
(d) Bicycle 0 1v
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0 1v


Case 6

School identification code 56


Student identification code 5617
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1 v
Girl .............................................. 2

2. How old are you? (Put in your age in years)


12 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2v 3 4
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2v 3 4

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2
Never............................................ 1 v

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2 v
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0v 1
(c) Table to Write on 0v 1
(d) Bicycle 0v 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0v 1


Case 7

School identification code 081


Student identification code 08111
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1 v
Girl .............................................. 2

2. How old are you? (Put in your age in years)


12 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2 3v 4
(b) Lunch 1 2 3v 4
(c) Evening meal 1 2 3v 4

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2 v
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2 v
11 to 50 books ........................... 3
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0v 1
(b) TV 0v 1
(c) Table to Write on 0v 1
(d) Bicycle 0v 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0v 1


Case 8

School identification code 102


Student identification code 10218
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1
Girl .............................................. 2 v

2. How old are you? (Put in your age in years)


8
....................................................... years

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2 3v 4
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3v 4

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3 v
Sometimes................................... 2
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2
11 to 50 books ........................... 3
More than 50 books ................. 4 v

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0v 1
(c) Table to Write on 0 1v
(d) Bicycle 0v 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0 1v


Case 9

School identification code 007


Student identification code 00708
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1
Girl .............................................. 2 v

2. How old are you? (Put in your age in years)


9 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2 3 4v
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3 4v

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3 v
Sometimes................................... 2
Never............................................ 1

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2
11 to 50 books ........................... 3 v
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0 1v
(c) Table to Write on 0 1v
(d) Bicycle 0 1
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0 1v


Case 10

School identification code 146


Student identification code 14618
1. Are you a boy or a girl? (Tick one number)
Boy .............................................. 1 v
Girl .............................................. 2

2. How old are you? (Put in your age in years)


10 years
.......................................................

3. How often do you eat each of the following meals? (Tick one number on each line)
Not 1 or 2 times 3 or 4 times Every
at all a week a week day
(a) Morning meal 1 2v 3 4
(b) Lunch 1 2 3 4v
(c) Evening meal 1 2 3 4v

4. Are there any books where you live that you could read which are not your
school books? (Tick one number)
Yes .............................................. 1 v
No .............................................. 2 If “No”, go to question 6.

5. If “Yes”, how often do you read these books? (Tick one number)
Always .......................................... 3
Sometimes................................... 2
Never............................................ 1 v

6. If “Yes”, how many books are there in your home? (Tick one number)
None............................................. 1
1 to 10 books ............................. 2
11 to 50 books ........................... 3 v
More than 50 books ................. 4

7. Do you have the following things in your home? (Tick one number on each line)
Do not have this Have one or more
(a) Radio 0 1v
(b) TV 0 1v
(c) Table to Write on 0 1v
(d) Bicycle 0 1v
(e) Electricity 0 1v
(f) Running Water 0 1v
(g) Daily Newspaper 0 1v


6. Data verification


A critical step in the management of survey data is the verification
of the data. It must be ensured that the data are consistent and
conform to the definitions in the codebook and are ready for
analytical use. This section describes common approaches to
data verification and indicates potential problems that need to be
addressed.

Whenever data are collected, almost inevitably errors are introduced


by various sources. Substantial delays occur when these errors
are not anticipated and safeguards are not built into the procedures. Errors
in the data may be caused: (a) by faulty preparation of the field
monitoring and survey tracking instruments, (b) by the assignment
of incorrect identifications to the respondents, (c) during the field
administration, (d) during the preparation of the instruments
including the coding of the responses, and (e) during transcription
of the data.

Whereas in an ideal situation all questions would be answered


in the way intended in the codebook, there are many sources for
deviations from the codebook. For example, questions may have
been skipped due to technical or organizational imperfections (e.g.
misprints, missing pages) or questions may have been skipped
or answered in a way not intended because of misunderstanding
during translation and/or ambiguities in the question. Such
deviations can result in: (a) variables which have been skipped but
which should have been included; or, variables which have been
included when they should not have been (e.g. when they were
misprinted); (b) incorrectly coded variables; (c) variables which have
a content different from that specified by the codebook.

The amount of work involved in resolving these problems, often
called “data cleaning”, can be greatly reduced by using well-designed
instruments, qualified field administration and coding personnel,
and appropriate transcription mechanisms. The steps that must
be undertaken to verify the data are implied in the quality
standards that have been defined for the corresponding survey.
Procedures must be implemented for checking invalid, incorrect
and inconsistent data, which may range from simple deterministic
univariate range checks to multivariate contingency tests between
different variables and different respondents. The criteria on which
these checks are based depend, on the one hand, on variable type
(i.e. different checks may apply to data variables, control variables,
and identification variables) and, on the other hand, on the manner
and sequence in which questions are asked. For some questions a
certain number of responses are required, or responses must be
given in a special way, due to a dependency or logical relationship
between questions.

Depending on the data collection procedures used, it must


be defined when and at what stage data verification steps are
implemented. For example, some range validation checks may
be applied during data entry whereas more complex checks
for the internal consistency of data or for outliers may be more
appropriately applied after the completion of the data entry.
Problems that have been detected through verification procedures
need to be resolved efficiently and quickly.

Some problems can, using certain assumptions, be resolved


automatically on the basis of cross-checks in the datafiles. Other
problems will require further inspection and manual resolution.
Where problems cannot be resolved, the problematic data-values
must be recoded to special missing codes.

The criteria on which the checking is based depend, on the
one hand, on the type of variables that are used to code the
information (for example, different criteria apply to data variables,
identification variables, control variables, filter variables, and
dependent variables) and, on the other hand, on the way and
sequence in which questions are asked (for example, for some
questions a certain number of responses is required, or responses
must be given in a special way, or there is a dependency or a logical
relationship between certain questions).

A report on the verification of the data should be produced listing


each error that was encountered and the solutions undertaken.
For each problem the following questions ultimately have to be
answered: (a) when to correct a data-value on the basis of other data
values; (b) when to set a data-value to a specific missing code which
indicates that the question was not administered, or had an invalid,
missing, or not applicable answer; (c) when to drop a respondent
because of invalid, missing or not administered data; and (d) when
to drop a question or variable.

Data verification steps


Usually, data verification is undertaken through a series of steps,
which for each survey need to be established and sequenced in
accordance with the quality requirements, the type of data collected,
and the field operations and data entry procedures applied.

Common data verification steps are:

• the verification of returned instruments for completeness and


correctness;

• the verification of identification codes against field monitoring


and survey tracking instruments;

• a check for the file integrity, in particular, the verification of


the structure of the datafiles against the specifications in the
codebook;


• the verification of the identification codes for internal


consistency;

• the verification of the data variables for each student or teacher


against the validation criteria specified in the codebook;

• the verification of the data variables for each student and


teacher against certain control variables in the datafiles;

• the verification of the data obtained for each respondent for


internal consistency (for example, the answers to questions
which were coded through split variables
can be cross-validated);

• the cross-validation of related data between respondents;

• the verification of linkages between related datafiles, especially


in the case of hierarchically structured data; and

• the verification of the handling of missing data.

The most important of these steps are described in more detail in


the following:

1. Verification of file integrity


As a first step it needs to be ensured that the overall structure of
the datafiles conforms to the specifications in the codebook. For
example, if raw datafiles are used, then each record should have the
same length in correspondence with the codebook. Often it is useful
to introduce column-control variables in the codebook at regular
intervals which the coder should code with blanks. When reviewing
the datafile, these positions should appear as blank columns and
therefore provide a useful means of detecting column shifts in the
datafile. If we find a non-blank value in any of these columns, this
can indicate the following problems (a minimal check of this kind is
sketched after the list below):


• a single transcription error for one of these column-control


variables;
• a column shift for the current respondent (for example, a coder
might have entered a value for a certain data-value twice and
thus codes in subsequent columns are wrong). In this case,
a check can be made to see whether subsequent column-control
variables also have invalid values;
• a global column shift for the whole datafile which means that
the variables were not coded in the format specified in the
codebook;
• In more advanced data collection systems such as the
WinDEM programme, data are directly transcribed into a
database which ensures the file integrity automatically.
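
A minimal sketch of such a file-integrity check is given below (in Python,
purely for illustration). It assumes, only as an example, that the codebook
defines blank column-control variables at positions 20 and 40 of an
80-character record; neither the positions nor the record length are taken
from the sample codebook used earlier in this module:

CONTROL_COLUMNS = (20, 40)   # assumed positions of blank column-control variables
RECORD_LENGTH = 80           # assumed record length defined in the codebook

def check_file_integrity(lines):
    for number, line in enumerate(lines, start=1):
        if len(line) != RECORD_LENGTH:
            print(f"record {number}: length {len(line)} instead of {RECORD_LENGTH}")
        for column in CONTROL_COLUMNS:
            if len(line) >= column and line[column - 1] != " ":
                print(f"record {number}: control column {column} is not blank"
                      f" - possible column shift")

# Example call (assuming the raw data are stored in the file RAWDATA.TXT):
# check_file_integrity(open("RAWDATA.TXT").read().splitlines())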

2. Special recodings
Sometimes it is necessary to recode certain variables before they
can be used in the data analyses. Examples of such situations are:

• The sequence of questions or items has been changed for some
reason and is no longer in accordance with the codebook;
• A question or item may have been asked of some respondents
in a different way or format than was intended;
• A question or item has not been asked of some respondents
but the missing information can be derived from other
variables.

3. Value validation
Background questions and test items for which a fixed set of codes
rather than open-ended values applies need to be checked against
the range validation criteria defined in the codebook. Variables with
open-ended values (e.g. “Student age”) need to be checked against
theoretical ranges.
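
The following Python fragment illustrates both kinds of check: a fixed set
of codes for a categorical variable (student sex) and a theoretical range
for an open-ended variable (student age). The variable name SAGE, its
missing codes, and the age range of 5 to 20 years are assumptions made only
for this illustration:

VALID_SSEX = {"1", "2", "8", "9"}   # valid and missing codes from the codebook
AGE_RANGE = (5, 20)                 # assumed theoretical range for student age
AGE_MISSING = {"99", "98"}          # assumed missing codes for the two-digit age

def validate(ssex, sage):
    problems = []
    if ssex not in VALID_SSEX:
        problems.append(f"SSEX code '{ssex}' is not defined in the codebook")
    if sage not in AGE_MISSING and sage.strip().isdigit():
        age = int(sage)
        if not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            problems.append(f"SAGE value {age} is outside the theoretical range")
    return problems

print(validate("3", "83"))   # an invalid sex code and an implausible age of 83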


4. Treatment of duplicate identification codes


Each datafile should be checked for duplicate identification codes. It
is often useful to distinguish between the following two
cases:

• Respondents who have identical identification codes but


different values for a number of key data variables. These
respondents have probably been assigned invalid identification
codes.
• Respondents who have identical identification codes and at the
same time also identical data values for the key data variables.
These respondents are most likely duplicate entries, in which
case the second occurrence of the respondent's data should be
removed from the datafiles.

5. Internal validation of a hierarchical identification system

If a hierarchical identification system is used for identifying
respondents at different levels, then the structure of this system can
be verified for internal consistency. The number of errors in
identification codes, which are often a serious threat to the use of
the data, can thus be dramatically reduced. Often inconsistencies can
then be resolved automatically or even avoided during the entry of
data.
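
A minimal Python sketch of the two preceding checks is given below. It
assumes that each record has been read into a dictionary with the fields
IDSCHOOL, IDSTUD, and a key data variable, and that the first digits of the
student ID contain the school ID, as in the hierarchical identification
system of the sample questionnaire:

from collections import Counter

students = [
    {"IDSCHOOL": "103", "IDSTUD": "10304", "SSEX": "2"},
    {"IDSCHOOL": "103", "IDSTUD": "10304", "SSEX": "2"},   # probable duplicate entry
    {"IDSCHOOL": "103", "IDSTUD": "10406", "SSEX": "1"},   # school/student ID mismatch
]

# Duplicate identification codes
counts = Counter(s["IDSTUD"] for s in students)
for idstud, n in counts.items():
    if n > 1:
        print(f"IDSTUD {idstud} occurs {n} times - duplicate entry or invalid code")

# Internal consistency of the hierarchical identification system
for s in students:
    if not s["IDSTUD"].startswith(s["IDSCHOOL"]):
        print(f"IDSTUD {s['IDSTUD']} is inconsistent with IDSCHOOL {s['IDSCHOOL']}")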

6. Verification of the linkages between datafiles


Often data are collected at different levels; for example, data are
collected for students and for the teachers who teach the classes
in which the students are enrolled. It is then important to verify
the linkages between such levels of data collection. There is a wide
range of potential problems, for example (a minimal linkage check is
sketched after the list below):


• There may be cases where for a teacher in the teacher datafile


there are no students linked to him or her in the student
datafile (that is, for a certain Teacher ID in the teacher datafile
there is no matching Teacher ID in the student datafile);
• There may be cases in the student datafile which do not have a
match in the teacher datafile even though they are associated
with a Teacher ID;
• There may be cases where the Class IDs of the students which
were linked to a teacher are different from the Class ID of this
teacher in the teacher datafile.
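
The following Python sketch illustrates such linkage checks between a
student and a teacher datafile; the variable names IDTEACHER and IDCLASS
and the example values are assumptions made only for this illustration:

teachers = [{"IDTEACHER": "T01", "IDCLASS": "C01"},
            {"IDTEACHER": "T02", "IDCLASS": "C02"}]
students = [{"IDSTUD": "10304", "IDTEACHER": "T01", "IDCLASS": "C01"},
            {"IDSTUD": "10315", "IDTEACHER": "T03", "IDCLASS": "C01"}]

teacher_ids = {t["IDTEACHER"] for t in teachers}
student_teacher_ids = {s["IDTEACHER"] for s in students}

# Teachers without any linked students
for teacher in teacher_ids - student_teacher_ids:
    print(f"teacher {teacher} has no students linked to him or her")

# Students whose Teacher ID has no match in the teacher datafile
for teacher in student_teacher_ids - teacher_ids:
    print(f"students refer to teacher {teacher} who is not in the teacher datafile")

# Class IDs of the students compared with the Class ID of their teacher
teacher_class = {t["IDTEACHER"]: t["IDCLASS"] for t in teachers}
for s in students:
    expected = teacher_class.get(s["IDTEACHER"])
    if expected is not None and s["IDCLASS"] != expected:
        print(f"student {s['IDSTUD']}: Class ID {s['IDCLASS']} differs from "
              f"Class ID {expected} of teacher {s['IDTEACHER']}")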

7. Verification of participation indicator


variables against data variables
In many situations, respondents are asked to respond to multiple
assessment instruments, often in multiple assessment sessions. For
example, students may be given a test in the morning and a second
test in the afternoon. If we find no data for a student for a particular
testing session, it is crucial to know whether there are no
data because the student did not participate in the testing session
or whether there are no data because the student did not respond to
any of the test items in this session. This is so important because,
for example, in the first situation we would base the student score
only on the answers given in the first testing session and exclude
the items in the second testing session from scoring, whereas in the
second situation we would score all items in the second testing
session as wrong.

To allow the verification of this, the datafile should, besides the variables with the questions and test items, also contain information about the participation status of the respondents in the different testing sessions. It is often useful to group the variables in the codebook into blocks. Each of these blocks can then begin with a variable whose code indicates the participation of the student in the respective testing session. It can then be verified whether the participation indicator variables match the data in the corresponding data variables.
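Such a check could be sketched as follows (pandas is assumed; the participation flag PART2, coded 1 = participated and 0 = absent, and the item variables ITEM2_01 to ITEM2_05 are hypothetical):

import pandas as pd

PART_VAR = "PART2"                                         # hypothetical participation flag
SESSION2_ITEMS = [f"ITEM2_{i:02d}" for i in range(1, 6)]   # hypothetical item variables

def check_participation(df: pd.DataFrame) -> pd.DataFrame:
    """Flag students whose participation flag contradicts their item data."""
    has_data = df[SESSION2_ITEMS].notna().any(axis=1)
    participated = df[PART_VAR] == 1
    # Flag students recorded as absent who nevertheless have item data,
    # and students recorded as present who have no item data at all, for review.
    suspicious = (has_data & ~participated) | (participated & ~has_data)
    return df.loc[suspicious, [PART_VAR] + SESSION2_ITEMS]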

8. Verification of exclusions of respondents


Some surveys allow respondents to be excluded from the assessment for certain reasons. A severe mistake results if these respondents are not entered into the datafiles but simply ignored, because then they will not be accounted for in any reports. For example, if test administrators are allowed to exclude certain mentally or physically handicapped students from the assessment, but these students are not accounted for in the presentation of the overall achievement of the sample, then the survey results may be severely biased. It is therefore important that each respondent is entered into the datafiles. If the respondent was excluded, then this should be indicated in a specially designed variable. The codes of this variable should be verified against the information in the participation indicator variables for each respondent.
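This verification could be sketched as follows (pandas is assumed; the exclusion variable EXCLUDED, coded 1 = excluded and 0 = assessed, and the participation flags PART1 and PART2 are hypothetical):

import pandas as pd

def check_exclusions(df: pd.DataFrame, part_vars=("PART1", "PART2")) -> pd.DataFrame:
    """Flag respondents whose exclusion code contradicts their participation flags."""
    excluded = df["EXCLUDED"] == 1
    participated = (df[list(part_vars)] == 1).any(axis=1)
    # An excluded respondent should not be recorded as having participated in any session.
    return df.loc[excluded & participated, ["EXCLUDED", *part_vars]]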

9. Checking for inconsistencies in the data


The criteria on which the consistency checks are based usually depend on the way and sequence in which questions and items were asked (e.g. for some questions a certain number of responses is required, or responses must be given in a special way). Some questions can be answered independently from each other, whereas, in other cases, questions are logically related to other questions.

For the purpose of data verification, inconsistency checks can often be classified as follows:


• Inconsistencies between the answers to particular questions for a given respondent (e.g. inconsistencies between answers to a dependent question and the corresponding filter question).
• Inconsistencies between the responses of different respondents to particular questions (e.g. answers of students of the same class to variables referring to the class).
• Inconsistencies between class-level and school-level aggregates of student questions.

It is always good to establish the data verification rules for inconsistencies while the data collection instruments are being prepared, so that decisions related to the response format of the questions can be made taking into account the complexity of the data verification procedures and the analytical treatment of the data.

Some questions are asked and/or coded in terms of more than one variable. The data verification rules applicable to such variables then depend on whether the variables are related to each other and on whether open-ended codes or a fixed set of codes were used.

Further problems arise when data that are missing are not properly distinguished from “zero” values. For example, suppose that in a questionnaire for school principals there is a question asking for the enrolments of boys and girls. If the school principal of an all-girls school leaves out the question asking for the boys’ enrolment, implying that the omission means “zero”, then the coder might enter a missing code for this question into the datafile, which is misleading. An extra data verification step then needs to be applied (in this case a cross-check of the variables for boys’ and girls’ enrolments) in order to check for these problems and to avoid a distortion of the corresponding sample estimates.
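Such a cross-check could be sketched as follows (pandas is assumed; the variable names IDSCHOOL, ENROLBOY and ENROLGIRL are hypothetical):

import pandas as pd

def check_missing_vs_zero(schools: pd.DataFrame) -> pd.DataFrame:
    """Flag schools where one enrolment figure is missing while the other is present.

    Such cases need manual review, because the missing value may in fact mean "zero"
    (e.g. an all-girls school omitting the boys' enrolment question).
    """
    suspicious = schools["ENROLBOY"].isna() ^ schools["ENROLGIRL"].isna()
    return schools.loc[suspicious, ["IDSCHOOL", "ENROLBOY", "ENROLGIRL"]]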

Sometimes there are questions which provide a checklist in which respondents are asked to either check or omit each of the response options. The coders are, for example, asked to code the checked options as “2” and the unchecked options as “1”. A potential problem is that this type of coding does not allow a distinction between cases where a student did not check any of the response options because none of the options applied and cases where a student omitted the whole question. However, in the analysis it is important to know whether a respondent omitted the whole question or whether he or she did not check a particular response option. It is best to avoid such problems by not using such response formats.

Establishing data verification rules for split variables with open-ended codes is more difficult, in particular when they were only partially answered. If the question requires a composite answer and only one component of the answer has been given, then a decision has to be made whether to interpret the missing component as missing or as zero, or whether some form of imputation should take place.

Relationships between filter and dependent questions need to be verified. Sometimes these relationships are made explicit by a statement like: “If you answered ‘No’ to question 3 then please go to question 6”. In other questions the dependency is not explicitly stated, but the answer to a first question should condition the answers to some following questions. If the patterns of answers are consistent, we would expect that when a filter question was answered “No”, its dependent questions would either have been skipped (in the case of explicit dependence) or answered in a negative way. In both of these cases the corresponding variables could then be recoded to a “logical not applicable” code, and the calculation of statistics would then be based only on the cases with a positive answer to the filter question.
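Such a filter/dependent check could be sketched as follows (pandas is assumed; the filter question Q3, coded 1 = Yes and 2 = No, the dependent questions Q4 and Q5, and the “not applicable” code are hypothetical):

import pandas as pd

FILTER_VAR, NO_CODE = "Q3", 2      # hypothetical filter question and its "No" code
DEPENDENT_VARS = ["Q4", "Q5"]      # hypothetical dependent questions
NOT_APPLICABLE = 8                 # hypothetical "logical not applicable" code

def check_filter_consistency(df: pd.DataFrame) -> pd.DataFrame:
    """Flag respondents who answered "No" to the filter yet answered a dependent question."""
    filtered_no = df[FILTER_VAR] == NO_CODE
    answered_dependent = df[DEPENDENT_VARS].notna().any(axis=1)
    return df.loc[filtered_no & answered_dependent, [FILTER_VAR] + DEPENDENT_VARS]

def recode_not_applicable(df: pd.DataFrame) -> pd.DataFrame:
    """Recode dependent questions to "logical not applicable" where the filter was "No"."""
    out = df.copy()
    out.loc[out[FILTER_VAR] == NO_CODE, DEPENDENT_VARS] = NOT_APPLICABLE
    return out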


Data verification procedures using WinDEM
WinDEM offers some simple data verification procedures. All the procedures are found under the “Verify” menu. The following section describes some of the fundamental ones:

1. Unique ID check
There must be only one record within a file for each unit of analysis surveyed. This verification procedure checks whether each record has been allocated a unique identification code.

2. Column check
When a series of similar variables exists in a file, it is possible that the data enterer skips a variable or enters a variable twice, so that a column shift occurs. This can be avoided if you introduce variables in the datafile at regular positions in the codebook into which the data entry personnel must enter a blank value. In order to be recognized by the automatic checking routines of the WinDEM program, the names of these variables must have the prefix “CHECK”. Column shifts should not occur if the data enterers follow these directions for entering the blank values. You can also see that the data entry proceeded correctly by looking at the “Table entry” from the “View” menu.

3. Validation check
As mentioned before, WinDEM ensures that the values are within the range specified in the structure file unless the data enterer explicitly confirms the out-of-range values entered. This validation check will show all the variables of all the cases that have been “confirmed” to contain out-of-range values. This can be especially useful when many data enterers are involved in the survey.


4. Merge check
WinDEM allows you to check the consistency between variables. This check detects records in a datafile that do not have matches in a related datafile for a higher level of data aggregation. Consider, for example, a survey in which data are collected from students and from the school principals of the schools in which the students are enrolled. In such a case, the student data could be recorded in a student datafile with the name “student.DBF”, and the data from the school principals could be recorded in a school datafile with the name “school.DBF”. To check whether each student in the student datafile has a matching school principal in the school datafile, the school identification code “IDSCHOOL” must exist in both the student datafile and the school datafile.

Using “Merge check” from the “Verify” menu, you can select the variables (or variable combinations) by which the records in the selected datafile are matched against the records in the higher-level aggregated datafile. The software will then ask you, in the “File Open Dialog”, to specify the datafile against which the merge of the current datafile should be checked.

The program will notify you if any errors are found and will ask whether you want to open the data verification report for further details.

5. Double coding check


In order to produce high-quality data, it is sometimes recommended to enter the data twice on two different computers (requiring two different data enterers). This check allows you to examine whether the two files have exactly the same structure and the same values on all records. In order to check this, however, you will have to have the two files under different names, together with the two corresponding codebooks, on one computer.
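Outside WinDEM, the same comparison could be sketched in Python (pandas is assumed; the file names and the identification variable are hypothetical):

import pandas as pd

def compare_double_entry(file_a: str, file_b: str, id_var: str = "IDSTUD") -> pd.DataFrame:
    """List every cell on which two independently entered files disagree."""
    a = pd.read_csv(file_a).set_index(id_var).sort_index()
    b = pd.read_csv(file_b).set_index(id_var).sort_index()
    if not a.columns.equals(b.columns) or not a.index.equals(b.index):
        raise ValueError("The two files do not have the same structure or records.")
    return a.compare(b)   # rows and columns where the two entries differ

# Example use:
# print(compare_double_entry("entry1.csv", "entry2.csv"))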


7 Database construction and database management
It is often desirable to integrate the cleaned data into a database management system which allows for efficient use of the collected information. This section provides a brief overview of the establishment of such a management system.

The format of the data after the data entry and data verification processes have been completed is often not the best format for use in data analyses. In order to manipulate, analyze and report the information collected in a convenient and efficient way, the data need to be organized in a database system. Such a database system is a structured aggregation of data-elements which can be related and linked to each other through specified relationships. The data-elements can then be accessed through specified criteria and through a set of transaction operations, which are usually implemented through a data-retrieval language. In such a database system the links between the physical data stored in the computer, their conceptual representation, and the views of the users on the data are implemented through a database management system. The database system ensures that: (a) information is stored with as little redundancy as possible; (b) data are stored in a way which is independent of the application, and the storage is independent of the users’ view of the data; (c) inconsistencies between different datafiles are avoided; and (d) data can be stored centrally and be shared and controlled by a single security system.

For the purpose of database construction, all data first need to be organized in logical entities such that the data-elements are logically de-coupled, especially with respect to different levels of data aggregation. Different conceptual data-models, which are associated with different functional tasks, are used in the design of database systems.

The researcher communicates with the database through a set of commands consisting of keywords and a syntax for instructions which makes it possible: (a) to derive the required output, such as reports and data analyses; (b) to describe the format of the data-elements in a database; and (c) to maintain and update the data. The researcher can then relate variables and respondents to each other and analyze the generated information.
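Purely as an illustration of what such a set of commands looks like in practice (this is not the system used for SACMEQ data), Python’s built-in sqlite3 module can be used to describe data-elements, maintain data, and derive output; the table and variable names below are hypothetical:

import sqlite3

con = sqlite3.connect("survey.db")   # hypothetical database file
cur = con.cursor()

# (b) describe the format of the data-elements
cur.execute("CREATE TABLE IF NOT EXISTS school (IDSCHOOL INTEGER PRIMARY KEY, REGION TEXT)")
cur.execute("""CREATE TABLE IF NOT EXISTS student (
                   IDSTUD INTEGER PRIMARY KEY,
                   IDSCHOOL INTEGER REFERENCES school(IDSCHOOL),
                   TOTSCORE REAL)""")

# (c) maintain and update the data
cur.execute("INSERT OR REPLACE INTO school VALUES (?, ?)", (101, "North"))
cur.execute("INSERT OR REPLACE INTO student VALUES (?, ?, ?)", (101001, 101, 54.0))
con.commit()

# (a) derive the required output, e.g. the mean student score by region
for row in cur.execute("""SELECT s.REGION, AVG(st.TOTSCORE)
                          FROM student st JOIN school s USING (IDSCHOOL)
                          GROUP BY s.REGION"""):
    print(row)

con.close()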

If the primary focus of the researcher is data analysis, then the use of statistical packages with integrated data management capabilities is often valuable. For this purpose the Statistical Analysis System (SAS) is currently the most promising candidate. It allows the researcher to generate and link data structures and to programme data analysis requests without requiring software development expertise.


8 Conclusion
The careful planning and implementation of data management are
essential to obtain accurate and valid survey results and to avoid
delays in survey administration. Computing staff should therefore
be consulted from the very beginning of a research project.

This module has shown that data management issues are relevant, and must be planned for, during almost all phases of a research project: starting from the design of the data collection instruments and the development of the coding schemes, through the design of the data collection methods and field administration procedures, the setting of quality standards, and the data entry and data verification, and finishing with database construction. To implement each of these steps, various technologies are available, and it is the task of the researcher to decide which procedures are most appropriate for the survey design, given administrative, logistical, and economic constraints.

Since 1992 UNESCO’s International Institute for Educational Planning (IIEP) has been
working with Ministries of Education in Southern and Eastern Africa in order to undertake
integrated research and training activities that will expand opportunities for educational
planners to gain the technical skills required for monitoring and evaluating the quality
of basic education, and to generate information that can be used by decision-makers to
plan and improve the quality of education. These activities have been conducted under
the auspices of the Southern and Eastern Africa Consortium for Monitoring Educational
Quality (SACMEQ).

Fifteen Ministries of Education are members of the SACMEQ Consortium: Botswana, Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, Zambia, and Zimbabwe.

SACMEQ is officially registered as an Intergovernmental Organization and is governed by the SACMEQ Assembly of Ministers of Education.

In 2004 SACMEQ was awarded the prestigious Jan Amos Comenius Medal in recognition
of its “outstanding achievements in the field of educational research, capacity building,
and innovation”.

These modules were prepared by IIEP staff and consultants to be used in training
workshops presented for the National Research Coordinators who are responsible for
SACMEQ’s educational policy research programme. All modules may be found on two
Internet Websites: http://www.sacmeq.org and http://www.unesco.org/iiep.
