One major way in which test
results can be interpreted from different perspectives involves the distinction
between norm- and criterion-referenced testing, two different frames of
reference that we can use to interpret test scores. As Thorndike and Hagen
(1969) point out, a test score, especially just the number of questions
answered correctly, “taken by itself, has no meaning. It gets meaning only by
comparison with some reference” (1969: 241). That comparison may
be with other students, or it might be with some pre-established standard or
criterion, and the difference between norm- and criterion-referenced tests
derives from which of these two types of comparison is used.
Norm-referenced tests
(NRTs) are tests on which an
examinee's results are interpreted by comparing them to how well others did on
the test. NRT scores are often reported in terms of test takers’ percentile
scores, that is, the percentage of other examinees who scored below them.
(Naturally, percentiles are most commonly used in large-scale testing;
otherwise, it does not make much sense to divide test takers into 100 groups!).
Those others may be all the other examinees who took the test, or, in the
context of large-scale testing, they may be the norming sample—a
representative group that took the test before it entered operational use, and
whose scores were used for purposes such as estimating item (i.e. test
question) difficulty and establishing the correspondence between test scores
and percentiles. The norming sample needs to be large enough to ensure that the
results are not due to chance—for example, if we administer a test to only 10
people, that is too few for us to make any kind of trustworthy generalizations
about test difficulty. In practical terms, this means that most norm-referenced
tests have norming samples of several hundred or even several thousand; the
number depends in part on how many people are likely to take the test after it
becomes operational.
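To make the percentile idea concrete, here is a minimal Python sketch (the function name and scores are invented for illustration) that computes a percentile rank exactly as defined above: the percentage of the norming sample scoring below a given raw score.

    # Illustrative sketch with invented data: percentile rank as the
    # percentage of the norming sample that scored below a given score.
    def percentile_rank(score, norming_sample):
        below = sum(1 for s in norming_sample if s < score)
        return 100 * below / len(norming_sample)

    # A toy sample of ten scores, deliberately far smaller than any
    # real norming sample would be.
    sample = [55, 60, 62, 65, 68, 70, 71, 74, 78, 85]
    print(percentile_rank(72, sample))  # 70.0: above 70% of the sample

Note that the raw score of 72 gains its meaning entirely from the sample it is compared against; set against a different norming sample, the same score could land at a very different percentile.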
The major drawback of
norm-referenced tests is that they tell test users how a particular examinee
performed with respect to other examinees, not how well that person did
in absolute terms. In other words, we do not know how much ability or knowledge
they demonstrated, except that it was more or less than a certain percentage of
other test takers. That limitation is why criterion-referenced tests are so
important, because we usually want to know more about students than that.
“About average,” “a little below average,” and “better than most of the others”
by themselves do not tell teachers much about a learner's ability per se. On
the other hand, criterion-referenced tests (CRTs) assess language
ability in terms of how much learners know in “absolute” terms, that is, in
relation to one or more standards, objectives, or other criteria, and not with
respect to how much other learners know. When students take a CRT, we are
interested in how much ability or knowledge they are demonstrating with
reference to an external standard of performance, rather than with reference to
how anyone else performed. CRT scores are generally reported in terms of the
percentage correct, not percentile. Thus, it is possible for all of the
examinees taking a CRT to pass it; in fact, this is generally
desirable in criterion-referenced achievement tests, since most teachers hope
that all their students have mastered the course content.
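By contrast, a CRT score can be computed from an examinee's own responses alone, with no reference to anyone else's performance. A minimal sketch, assuming dichotomously scored items and an invented 80% mastery cut-off (an example, not a recommended standard):

    # Illustrative sketch: percentage correct compared with a pre-set
    # criterion. The 80% cut score is invented for this example.
    def percent_correct(responses):
        return 100 * sum(responses) / len(responses)

    CUT_SCORE = 80.0  # hypothetical mastery criterion

    answers = [True, True, False, True, True, True, True, False, True, True]
    score = percent_correct(answers)
    # No other examinee's score enters the calculation.
    print(score, "mastered" if score >= CUT_SCORE else "not yet")  # 80.0 mastered

Because the criterion is fixed in advance, every examinee who reaches it passes, which is exactly the outcome a teacher hopes for on an achievement test.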
Note that besides being reported in terms of percentage correct, scores may also be reported in
terms of a scoring rubric or a rating scale, particularly in the
case of speaking or writing tests. When this is done with a CRT, however, the
score bands are not defined in terms of below or above “average” or “most
students,” but rather in terms of how well the student performed—that is, how
much ability he or she demonstrated. A rubric that defined score bands in terms
of the “average,” “usual,” or “most students,” for example, would be
norm-referenced.
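The contrast can be made concrete with two invented rubric fragments: the first defines each band by what the learner can do, and so is criterion-referenced; the second defines each band relative to other students, and so is norm-referenced. All descriptors here are hypothetical.

    # Invented rubric fragments for illustration only.
    CRITERION_REFERENCED_BANDS = {
        3: "Narrates past events clearly; minor errors do not block meaning",
        2: "Handles routine topics with frequent but local errors",
        1: "Produces isolated words and memorized phrases only",
    }
    NORM_REFERENCED_BANDS = {
        3: "Better than most students in the group",
        2: "About average for the group",
        1: "Below average for the group",
    }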