Presentación Daniel Koretz


Transcript of Presentación Daniel Koretz

  • Slide 1

    Using tests for monitoring and accountability

    Prof. Daniel Koretz

    Harvard Graduate School of Education

    Agencia de Calidad de la Educación, Santiago, Chile

    November 3, 2014

  • Slide 2

    Two basic questions posed by the Agencia de Calidad de la Educación

    Value and limitations of standardized tests as indicators of the quality of education

    The use of tests to improve learning

    What can go wrong when tests are used for accountability?

  • Slide 3

    Two basic questions posed by the Agencia de Calidad de la Educación, revised

    Value and limitations of standardized tests as indicators not of the quality of education but of the performance of students

    Test scores describe what students can do; they do not explain why they can do it

    The use of tests to improve learning

    What can go wrong when tests are used for accountability?

  • Slide 4

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 5

    Part I: The value and limitations of standardized tests

  • Slide 6

    What is a test?

    A test is a very small sample of behavior from the student

    It is valuable only to the extent that it lets us estimate mastery of the domain: the knowledge and skills it represents

    For example, 40 or 50 test items are often used to estimate something like cumulative mastery of mathematics through grade 8
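    To make the sampling principle concrete, here is a minimal sketch (not part of the original slides; the domain size, test length, and mastery level are illustrative assumptions) of how a 45-item test estimates mastery of a much larger domain, and how that estimate fluctuates from one test form to another:

```python
# Minimal sketch (not from the presentation): treating a test as a small
# sample from a large domain. All numbers here are illustrative assumptions.
import random

random.seed(1)

DOMAIN_SIZE = 2000      # hypothetical pool of tasks defining the domain
TEST_LENGTH = 45        # an operational test samples only ~40-50 of them
TRUE_MASTERY = 0.70     # assumed share of the domain the student has mastered

# The student "knows" a fixed 70% of the domain's tasks.
domain = [1] * int(DOMAIN_SIZE * TRUE_MASTERY) + [0] * int(DOMAIN_SIZE * (1 - TRUE_MASTERY))

# Draw several independent test forms and see how the estimate fluctuates.
for form in range(5):
    items = random.sample(domain, TEST_LENGTH)
    estimate = sum(items) / TEST_LENGTH
    print(f"form {form + 1}: estimated mastery = {estimate:.2f} (true = {TRUE_MASTERY})")
```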

  • Slide 7

    Sampling to obtain a test

    1. Goals of education

    2. Student achievement | Other

    3. Domains selected for testing | Untested domains

    4. Tested parts of selected domains | Untested portions of domains

    5. Tested sample | Untested sample

  • Slide 8

    The sampling principle of testing: analogy of a political poll

    In September, a Connecta poll of 601 people predicted second-round results: 57,6% for Bachelet, 23,1% for Matthei, 19,3% other, don't know

    Actual second-round vote: 62,2% for Bachelet, 37,8% for Matthei

    Would you have cared how those particular 601 people actually voted?

    Why is information from those 601 people valuable?
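    As a worked illustration of why a sample of 601 respondents is informative (this calculation is not in the slides and assumes simple random sampling; the poll's actual design and weighting may differ):

```latex
% Margin of error for a simple random sample of n = 601 (illustrative).
\[
  SE(\hat{p}) \;=\; \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
             \;=\; \sqrt{\frac{0.576 \times 0.424}{601}} \;\approx\; 0.020
\]
\[
  \text{95\% interval:}\quad \hat{p} \pm 1.96\,SE \;\approx\; 57.6\% \pm 4\ \text{percentage points}
\]
```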

  • Slide 9

    How is a test not like a poll?

    Similar: in polling, we sample people; in testing, we sample content

    Not similar: in testing, we have other decisions after sampling content, for example:

    How is content represented on the test? Graphically? Verbally? What item format?

    What are the task demands, for example, scoring rubrics?

  • Slide 10

    How similar are tested representations?

    Calculate the area of basic polygons drawn on a coordinate plane

  • Slide 11

    What are the consequences of incomplete sampling?

    All cases:

    Systematically incomplete evaluation of education

    Low-pressure testing: modest effects on scores

    Measurement error (uncertainty): fluctuations in scores

    Differences in results among tests: sometimes small, but occasionally large, for example, TIMSS vs. PISA

    High-pressure (accountability) testing: very large effects

    Incentives to focus preparation on the tested sample, not the domain

    Narrowed instruction, bad test preparation

    Score inflation

  • Slide 12

    Why use standardized tests

    Standardized tests were originally designed to provide supplementary, specialized information that teachers do not already have

    Tests provide information that is consistent across classrooms, schools, and years (unlike teachers' grades)

    Tests are very efficient: they provide substantial information in a short amount of time

    Tests can be designed to support a number of different uses

  • Slide 13

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 14

    Some common uses of standardized tests

    Monitor performance of a national or regional system

    Evaluate performance relative to standards or normatively (relative to the performance of others)

    Provide pedagogically useful information to educators, for example, formative assessment results

    Hold educators accountable for student performance

  • Slide 15

    Design trade-offs

    Different purposes for tests suggest different designs

    Designing a test to be better for one function may make it worse for other functions

    For example, a test designed for summative evaluation is often poorly designed to provide instructional feedback

    Sometimes, using a test for one purpose will make it less valuable for others

    Session 2: score inflation from accountability

  • Slide 17

    Difficulty

    Items that are too hard or too easy for a student will produce an unreliable score

    Tests that are too hard or too easy may have floor or ceiling effects

    Cannot accurately show differences or changes in performance

    Will distort trends

    So tests best suited for low-performing students may be poorly suited for high-performing students
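    A minimal sketch (not from the presentation; the score distribution and the ceiling are assumed) of how a ceiling effect can understate real growth and so distort a trend:

```python
# Minimal sketch (illustrative assumptions throughout): how a score ceiling
# can distort an apparent trend even when real growth occurs.
import random
import statistics

random.seed(2)

CEILING = 80  # maximum raw score the test can award

def observed_scores(true_mean, n=1000):
    """True achievement is normal; observed scores are capped at the ceiling."""
    return [min(random.gauss(true_mean, 10), CEILING) for _ in range(n)]

year1 = observed_scores(true_mean=70)
year2 = observed_scores(true_mean=78)  # real growth of 8 points

gain = statistics.mean(year2) - statistics.mean(year1)
print("true growth:     8.0 points")
print(f"observed growth: {gain:.1f} points (understated by the ceiling)")
```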

  • Slide 18

    Raw scores from a test that has become too easy

  • Slide 19

    Sampling of students and schools

    To hold schools accountable, tests must be frequent, and many or all students must be tested

    To hold teachers accountable, all students must be tested with the same items

    To monitor the system, you can test less often and use sparse matrix sampling:

    Only a sample of students is tested in each school

    Students are given different items, so the content can be broader
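    A minimal sketch of sparse matrix sampling under assumed numbers (the item-pool size, booklet length, and number of students sampled per school are all illustrative, not taken from any real program):

```python
# Minimal sketch (illustrative assumptions throughout): sparse matrix sampling.
# A monitoring assessment can split a broad item pool into booklets and give
# each sampled student only one booklet, so the system covers far more content
# than any individual student answers.
import random

random.seed(3)

ITEM_POOL = [f"item_{i:03d}" for i in range(120)]   # broad pool for the domain
BLOCK_SIZE = 20                                     # items per student booklet
SAMPLE_PER_SCHOOL = 30                              # students sampled per school

# Partition the pool into 6 booklets of 20 items each.
blocks = [ITEM_POOL[i:i + BLOCK_SIZE] for i in range(0, len(ITEM_POOL), BLOCK_SIZE)]

def assign_booklets(school_id):
    """Rotate booklets across the sampled students in one school."""
    students = [f"{school_id}-s{j:02d}" for j in range(SAMPLE_PER_SCHOOL)]
    return {student: blocks[j % len(blocks)] for j, student in enumerate(students)}

assignment = assign_booklets("school_A")
print(f"items answered per student: {BLOCK_SIZE}")
print(f"items covered across the school sample: "
      f"{len(set().union(*assignment.values()))}")
```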

  • Slide 20

    The design of test items for different purposes

    For summative purposes, it is enough to know whether students can do something

    For formative purposes, you want to know why some students are unable to perform the task

    Knowing why students fail allows teachers to modify instruction

    May show which incorrect ideas cause students to make mistakes

    May break a complex skill into its constituent parts

  • Slide 21

    Ideal items for a formative test

    Should provide information for improving instruction that teachers may not have without the test

    Should reveal the sources of errors

    Should not look like the items on the summative test

    If the items are similar to the summative test, this will encourage narrowing of instruction and score inflation

  • Slide 22

    A diagnostic item for elementary fractions

    In which of the following diagrams is one-quarter of the area shaded?

    Source: National Council of Teachers of Mathematics, http://www.nctm.org/news/content.aspx?id=11474

    Tells you why a student answers incorrectly, not just whether she answers incorrectly

  • Slide 23

    Reporting for different purposes

    For summative purposes, a single score may tell you which groups are performing better than others

    For pedagogical purposes, educators need more detail about different aspects of performance, to know where improvements are needed

  • Slide 24

    An old-fashioned option for pedagogically useful information: norm-referenced reporting

    How do you know that:

    2,5 minutes/km is a fast time for a runner?

    4,8 l/100 km is good gasoline economy?

    We compare to the distribution of speed or economy

    Norm-referenced reporting compares each score to a relevant distribution of scores

    Norm-referenced reporting offers teachers:

    A basis for evaluating their own expectations

    A way to compare performance across different areas (are my students better at computation than at problem-solving?)
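    A minimal sketch of norm-referenced reporting (the norm sample below is hypothetical; operational norms come from a large norming study):

```python
# Minimal sketch (hypothetical norm data): norm-referenced reporting locates a
# student's raw score within a reference distribution, e.g., as a percentile rank.
from bisect import bisect_right

# Hypothetical national norm sample of raw scores (a real norming study is far larger).
norm_sample = sorted([23, 25, 28, 30, 31, 33, 34, 36, 37, 39,
                      40, 41, 43, 44, 46, 47, 49, 51, 53, 56])

def percentile_rank(score, norms):
    """Percent of the norm group scoring at or below this score."""
    return 100.0 * bisect_right(norms, score) / len(norms)

for raw in (30, 44, 55):
    print(f"raw score {raw}: percentile rank {percentile_rank(raw, norm_sample):.0f}")
```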

  • Slide 25

    An example of a report from a norm-referenced test

  • Slide 27

    Compare level of detail to the previous slides

  • Slide 28

    Summary: what tests can do

    Standardized tests can provide important information to policymakers and educators with small demands on time

    No one test can serve every goal; design must be matched to purpose

    To some degree, the various purposes compete, creating conflicting demands for design

    To resolve this, we need either a clear choice among uses or multiple tests

  • Slide 29

    What tests cannot do

    Scores cannot provide a complete evaluation of a program or school

    Some important goals are omitted

    Scores taken alone do not isolate the contributions of teachers or schools

    Test scores describe; they do not explain

    Many factors other than schooling influence scores

    Efforts to separate the effects of schooling are complex and controversial

  • Slide 33

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 34

    What we learned from the US experience

    Effects on educational practice are mixed

    Some improvements

    Many undesirable effects: bad test preparation, other gaming

    Scores can become severely inflated (increase much more than actual learning)

    Overall improvement is exaggerated, often severely

    Relative effectiveness is estimated incorrectly

    Teachers, schools, and systems ranked incorrectly

    Can create an illusion of greater equity

  • Slide 35

    What we don't know

    What is the net effect on student achievement?

    Weak research designs, weaker data

    Some evidence of inconsistent, modest effects in elementary math, none in reading

    Effects are likely to vary across contexts

    Which types of test-based accountability systems are best?

    Which programs maximize real improvements

    Which programs minimize gaming, bad test preparation, & score inflation

    Reason: grossly inadequate research and evaluation

  • Slide 36

    Campbell's Law (1975)

    "The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

    Donald T. Campbell (1975). Assessing the impact of planned social change. In G. M. Lyons (Ed.), Social Research and Public Policies: The Dartmouth/OECD Conference.

  • Slide 38

    Campbell's Law in testing

    Raising scores becomes the primary goal

    Educators find ways to raise scores on the specific test used for accountability

    Scores are inflated: they increase more than learning

    Overall improvement is exaggerated

    Relative improvement is estimated incorrectly, and schools are ranked incorrectly

  • Slide 39

    Logic of studies of score inflation

    Scores are meaningful only if they generalize to the domain

    A poll is useful only if its results generalize to the entire electorate

    If gains generalize to the domain, they must generalize to other tests of the same domain

    Gains on a high-stakes test should generalize to a lower-stakes audit test

    If a poll is accurate, other good polls will show similar results

  • Slide 40

    Good versus bad preparation for a test

    Good: gives students knowledge and skills that they can apply elsewhere

    In later education

    In later employment

    Therefore, on other tests

    Bad: generates score inflation, test-specific gains that do not generalize beyond that test

  • Slide 41

    Performance on coached and uncoached tests

    [Figure: scores in grade equivalents (vertical axis, roughly 3.0 to 4.4) by year, 1985-1991, for the district's tests (Test B and Test C) and the test administered by Koretz et al.]

    SOURCE: Adapted from Koretz, Linn, Dunbar, and Shepard (1991)

  • Slide 43

    Reading change, grade 4, KIRIS and NAEP, 1992-1994

                           KIRIS     NAEP
    Gain in scale scores    18.8       -1
    Standardized gain       0.76    -0.03
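    For reference, the standardized gains above express the scale-score gains in standard-deviation units, which puts KIRIS and NAEP on a comparable metric; the exact choice of divisor (for example, the baseline-year SD) is an assumption here rather than something stated on the slide:

```latex
% Standardized gain: the scale-score gain divided by the standard deviation of
% scores (the baseline-year SD is assumed here as the divisor).
\[
  g \;=\; \frac{\bar{x}_{1994}-\bar{x}_{1992}}{s_{1992}}
\]
% Applied to the KIRIS column above, a gain of 18.8 points with g = 0.76
% implies a score SD of roughly 18.8 / 0.76 \approx 25 scale-score points.
```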

  • Slide 44

    Trends by Race on New York State vs. NAEP

    [Figure: standardized mean scale scores by race, 8th-grade math]

  • Slide 45

    Inconsistency between school ratings calculated from high-stakes and lower-stakes tests: best and worst cases from 48 correlations across grade, year, subject, and model

    r = .27, reading, 2000

    r = .63, math, 2000

  • Slide 46

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 47

    Why inflation occurs

    Tests show predictable emphases, omissions, and forms of presentation over time

    Some are intentional, for technical reasons

    Some are accidental, or to save time and money

    Test preparation can focus on these patterns:

    Focusing instruction on emphasized content, at the cost of other content relevant to the inference

    Focusing on incidental characteristics of the test, such as item format

  • Slide 48

    Ways to raise scores

    Teaching more

    Working harder

    Working more effectively

    Reallocation

    Coaching

    Cheating

    Changing who is tested

  • Slide 49

    More detail on sampling in constructing a test

    1. Domain selected for testing (math, ELA, etc.)

    2. Elements from domain included in standards | Elements from domain omitted from standards

    3. Tested subset of standards | Untested subset of standards

    4. Tested material from within tested standards | Untested material from within tested standards

    5. Tested representations | Untested representations

  • Slide 50

    Reallocation

    Shifting instructional resources to fit the testing program

    Within a subject

    Between subjects

    Within a subject, can lead to either meaningful change or inflation

    Inflates if material getting decreased emphasis is also important for the inference

    Narrowed instruction is a type of reallocation

  • Slide 52

    Coaching

    Focusing preparation on substantively unimportant details of the test

    Minor, unimportant details of content

    Details of the presentation of material

    Includes test-taking tricks (e.g., process of elimination, plug-in)

    Can inflate scores or simply waste time

  • Slide 53

    How similar are tested representations?

    2008 item, New York grade 7 math test

    Which tool is most appropriate for measuring the mass of a serving of cheese?

    a. ruler
    b. thermometer
    c. measuring cup
    d. weighing scale

  • Slide 54

    2009 item, New York grade 7 math test

    Which tool would be the most appropriate for Natasha to use when finding the mass of a watermelon?

    a. scale
    b. inch ruler
    c. meter stick
    d. measuring cup

  • Slide 55

    How similar are tested representations?

    NY 7.N.7: Compare numbers written in scientific notation.

  • Slide 56

    An example of coaching (cheating?)

    The question on the review sheet for [the] exam reads in part:

    "The average amount that each band member must raise is a function of the number of band members, b, with the rule f(b) = 12000/b."

    The question on the actual test reads in part:

    "The average amount each cheerleader must pay is a function of the number of cheerleaders, n, with the rule f(n) = 420/n."

    Strauss, V., The Washington Post, July 10, 2001, p. A09

  • Slide 59

    Recommendation 1: make the evaluation and accountability system broad

    Do not rely only or excessively on standardized tests

    Evaluate other outcomes

    Evaluate practices as well as outcomes

    May need to use subjective as well as objective measures

  • Slide 60

    Recommendation 2: couple evaluation and accountability with training and support

    Many teachers need help, not just incentives, to improve instruction

    Provide support, for example, training for better teaching

  • Slide 61

    Recommendation 3: Use summative tests appropriately

    Report in detail

    Show teachers what needs improving, not just how high or low they score

    Add formative tests for difficult topics, if possible

    Set realistic performance targets that teachers can reach by appropriate methods

    Creates less incentive to use bad test preparation

  • Slide 62

    Supplementary slides

  • Slide 63

    Math trends, KIRIS and ACT

    [Figure: gains in standard deviations (vertical axis, roughly -0.1 to 0.7) by year, 1992-1995, for KIRIS and ACT]

  • Slide 64

    Standardized mathematics gains in Kentucky, 1992-1996

             KIRIS    NAEP
    Grade 4   0.61    0.17
    Grade 8   0.52    0.13

  • Slide 65

    One look at the TX miracle (Klein et al., 2000)

  • Slide 66

    Item from G8 MCAS

    Eva has four sets of straws. The measurements of the straws are given below. Which set of straws could not be used to form a triangle?

    A. Set 1: 4 cm, 4 cm, 7 cm
    B. Set 2: 2 cm, 3 cm, 8 cm
    C. Set 3: 3 cm, 4 cm, 5 cm
    D. Set 4: 5 cm, 12 cm, 13 cm
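    As a quick check of the item above (not part of the slide), the triangle inequality identifies the one set that cannot form a triangle:

```python
# Quick check of the MCAS item above: a triangle exists only if the two shorter
# sides together are longer than the longest side (the triangle inequality).
sets = {
    "Set 1": (4, 4, 7),
    "Set 2": (2, 3, 8),
    "Set 3": (3, 4, 5),
    "Set 4": (5, 12, 13),
}

for name, (a, b, c) in sets.items():
    sides = sorted((a, b, c))
    can_form = sides[0] + sides[1] > sides[2]
    print(f"{name}: {'can' if can_form else 'cannot'} form a triangle")
```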

  • Slide 67

    School Classification

    DIMENSIONS                  INDICATORS                                    WEIGHT
    Learning standards          Learning standards                            67,0 %
    outcomes                    Simce scores                                   3,3 %
                                Simce trend                                    3,3 %
    Other quality indicators    School motivation and academic self-esteem     3,3 %
                                School climate                                 3,3 %
                                Civic participation and citizenship            3,3 %
                                Healthy habits                                 3,3 %
                                Attendance                                     3,3 %
                                Gender equity                                  3,3 %
                                School retention                               3,3 %
                                Technical-professional certification           3,3 %
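    A minimal sketch of how the listed weights could combine into a single school index (assumptions throughout: the indicator key names, the common 0-100 scale, and the simple weighted-average aggregation are illustrative, not the Agencia's actual classification algorithm):

```python
# Minimal sketch (assumptions throughout): combining the dimension weights from
# the slide into a single school index as a weighted average. The real
# classification procedure is more involved; indicator scores here are
# hypothetical values on a common 0-100 scale.
WEIGHTS = {
    "learning_standards": 0.670,
    "simce_scores": 0.033,
    "simce_trend": 0.033,
    "motivation_self_esteem": 0.033,
    "school_climate": 0.033,
    "civic_participation": 0.033,
    "healthy_habits": 0.033,
    "attendance": 0.033,
    "gender_equity": 0.033,
    "school_retention": 0.033,
    "technical_professional": 0.033,
}

def school_index(indicator_scores: dict) -> float:
    """Weighted average of indicator scores (each assumed to be on 0-100)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * indicator_scores[k] for k in WEIGHTS) / total_weight

# Hypothetical school: stronger on climate and attendance, average elsewhere.
example = {k: 60.0 for k in WEIGHTS}
example["school_climate"] = 75.0
example["attendance"] = 80.0

print(f"composite index: {school_index(example):.1f}")
```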