Presentación Daniel Koretz


Transcript of Presentación Daniel Koretz

  • Slide 1

    Using tests for monitoring and accountability

    Prof. Daniel Koretz

    Harvard Graduate School of Education

    Agencia de Calidad de la Educación, Santiago, Chile

    November 3, 2014

  • Slide 2

    Two basic questions posed by the Agencia de Calidad de la Educación

    Value and limitations of standardized tests as indicators of the quality of education

    The use of tests to improve learning

    What can go wrong when tests are used for accountability?

  • Slide 3

    Two basic questions posed by the Agencia de Calidad de la Educación, revised

    Value and limitations of standardized tests as indicators not of the quality of education but of the performance of students

    Test scores describe what students can do; they do not explain why they can do it

    The use of tests to improve learning

    What can go wrong when tests are used for accountability?

  • Slide 4

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 5

    Part I: The value and limitations of standardized tests

  • Slide 6

    What is a test?

    A test is a very small sample of behavior from the student

    It is valuable only to the extent that it lets us estimate mastery of the domain: the knowledge and skills it represents

    For example, 40 or 50 test items are often used to estimate something like cumulative mastery of mathematics through grade 8
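    To make the sampling principle concrete, here is a minimal sketch (not part of the original slides; the domain size, test length, and mastery level are illustrative assumptions) of how a 45-item test estimates mastery of a much larger domain, and how that estimate fluctuates from one test form to another:

```python
# Minimal sketch (not from the presentation): treating a test as a small
# sample from a large domain. All numbers here are illustrative assumptions.
import random

random.seed(1)

DOMAIN_SIZE = 2000      # hypothetical pool of tasks defining the domain
TEST_LENGTH = 45        # an operational test samples only ~40-50 of them
TRUE_MASTERY = 0.70     # assumed share of the domain the student has mastered

# The student "knows" a fixed 70% of the domain's tasks.
domain = [1] * int(DOMAIN_SIZE * TRUE_MASTERY) + [0] * int(DOMAIN_SIZE * (1 - TRUE_MASTERY))

# Draw several independent test forms and see how the estimate fluctuates.
for form in range(5):
    items = random.sample(domain, TEST_LENGTH)
    estimate = sum(items) / TEST_LENGTH
    print(f"form {form + 1}: estimated mastery = {estimate:.2f} (true = {TRUE_MASTERY})")
```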

  • Slide 7

    Sampling to obtain a test

    1. Goals of education

    2. Student achievement | Other

    3. Domains selected for testing | Untested domains

    4. Tested parts of selected domains | Untested portions of domains

    5. Tested sample | Untested sample

  • Slide 8

    The sampling principle of testing: analogy of a political poll

    In September, a Connecta poll of 601 people predicted second-round results: 57,6% for Bachelet, 23,1% for Matthei, 19,3% other, don't know

    Actual second-round vote: 62,2% for Bachelet, 37,8% for Matthei

    Would you have cared how those particular 601 people actually voted?

    Why is information from those 601 people valuable?
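    As a worked illustration of why a sample of 601 respondents is informative (this calculation is not in the slides and assumes simple random sampling; the poll's actual design and weighting may differ):

```latex
% Margin of error for a simple random sample of n = 601 (illustrative).
\[
  SE(\hat{p}) \;=\; \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
             \;=\; \sqrt{\frac{0.576 \times 0.424}{601}} \;\approx\; 0.020
\]
\[
  \text{95\% interval:}\quad \hat{p} \pm 1.96\,SE \;\approx\; 57.6\% \pm 4\ \text{percentage points}
\]
```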

  • Slide 9

    How is a test not like a poll?

    Similar: in polling, we sample people; in testing, we sample content

    Not similar: in testing, we have other decisions after sampling content, for example:

    How is content represented on the test? Graphically? Verbally? What item format?

    What are the task demands, for example, scoring rubrics?

  • Slide 10

    How similar are tested representations?

    Calculate the area of basic polygons drawn on a coordinate plane

  • Slide 11

    What are the consequences of incomplete sampling?

    All cases:

    Systematically incomplete evaluation of education

    Low-pressure testing: modest effects on scores

    Measurement error (uncertainty): fluctuations in scores

    Differences in results among tests: sometimes small, but occasionally large, for example, TIMSS vs. PISA

    High-pressure (accountability) testing: very large effects

    Incentives to focus preparation on the tested sample, not the domain

    Narrowed instruction, bad test preparation

    Score inflation

  • Slide 12

    Why use standardized tests

    Standardized tests were originally designed to provide supplementary, specialized information that teachers do not already have

    Tests provide information that is consistent across classrooms, schools, and years (unlike teachers' grades)

    Tests are very efficient: they provide substantial information in a short amount of time

    Tests can be designed to support a number of different uses

  • Slide 13

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 14

    Some common uses of standardized tests

    Monitor performance of a national or regional system

    Evaluate performance relative to standards or normatively (relative to the performance of others)

    Provide pedagogically useful information to educators, for example, formative assessment results

    Hold educators accountable for student performance

  • Slide 15

    Design trade-offs

    Different purposes for tests suggest different designs

    Designing a test to be better for one function may make it worse for other functions

    For example, a test designed for summative evaluation is often poorly designed to provide instructional feedback

    Sometimes, using a test for one purpose will make it less valuable for others

    Session 2: score inflation from accountability

  • Slide 17

    Difficulty

    Items that are too hard or too easy for a student will produce an unreliable score

    Tests that are too hard or too easy may have floor or ceiling effects

    Cannot accurately show differences or changes in performance

    Will distort trends

    So tests best suited for low-performing students may be poorly suited for high-performing students
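    A minimal sketch (not from the presentation; the score distribution and the ceiling are assumed) of how a ceiling effect can understate real growth and so distort a trend:

```python
# Minimal sketch (illustrative assumptions throughout): how a score ceiling
# can distort an apparent trend even when real growth occurs.
import random
import statistics

random.seed(2)

CEILING = 80  # maximum raw score the test can award

def observed_scores(true_mean, n=1000):
    """True achievement is normal; observed scores are capped at the ceiling."""
    return [min(random.gauss(true_mean, 10), CEILING) for _ in range(n)]

year1 = observed_scores(true_mean=70)
year2 = observed_scores(true_mean=78)  # real growth of 8 points

gain = statistics.mean(year2) - statistics.mean(year1)
print("true growth:     8.0 points")
print(f"observed growth: {gain:.1f} points (understated by the ceiling)")
```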

  • Slide 18

    Raw scores from a test that has become too easy

  • Slide 19

    Sampling of students and schools

    To hold schools accountable, tests must be frequent, and many or all students must be tested

    To hold teachers accountable, all students must be tested with the same items

    To monitor the system, you can test less often and use sparse matrix sampling:

    Only a sample of students is tested in each school

    Students are given different items, so the content can be broader
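    A minimal sketch of sparse matrix sampling under assumed numbers (the item-pool size, booklet length, and number of students sampled per school are all illustrative, not taken from any real program):

```python
# Minimal sketch (illustrative assumptions throughout): sparse matrix sampling.
# A monitoring assessment can split a broad item pool into booklets and give
# each sampled student only one booklet, so the system covers far more content
# than any individual student answers.
import random

random.seed(3)

ITEM_POOL = [f"item_{i:03d}" for i in range(120)]   # broad pool for the domain
BLOCK_SIZE = 20                                     # items per student booklet
SAMPLE_PER_SCHOOL = 30                              # students sampled per school

# Partition the pool into 6 booklets of 20 items each.
blocks = [ITEM_POOL[i:i + BLOCK_SIZE] for i in range(0, len(ITEM_POOL), BLOCK_SIZE)]

def assign_booklets(school_id):
    """Rotate booklets across the sampled students in one school."""
    students = [f"{school_id}-s{j:02d}" for j in range(SAMPLE_PER_SCHOOL)]
    return {student: blocks[j % len(blocks)] for j, student in enumerate(students)}

assignment = assign_booklets("school_A")
print(f"items answered per student: {BLOCK_SIZE}")
print(f"items covered across the school sample: "
      f"{len(set().union(*assignment.values()))}")
```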

  • Slide 20

    The design of test items for different purposes

    For summative purposes, it is enough to know whether students can do something

    For formative purposes, you want to know why some students are unable to perform the task

    Knowing why students fail allows teachers to modify instruction

    May show which incorrect ideas cause students to make mistakes

    May break a complex skill into its constituent parts

  • Slide 21

    Ideal items for a formative test

    Should provide information for improving instruction that teachers may not have without the test

    Should reveal the sources of errors

    Should not look like the items on the summative test

    If the items are similar to the summative test, this will encourage narrowing of instruction and score inflation

  • Slide 22

    A diagnostic item for elementary fractions

    In which of the following diagrams is one-quarter of the area shaded?

    Source: National Council of Teachers of Mathematics, http://www.nctm.org/news/content.aspx?id=11474

    Tells you why a student answers incorrectly, not just whether she answers incorrectly

  • Slide 23

    Reporting for different purposes

    For summative purposes, a single score may tell you which groups are performing better than others

    For pedagogical purposes, educators need more detail about different aspects of performance, to know where improvements are needed

  • Slide 24

    An old-fashioned option for pedagogically useful information: norm-referenced reporting

    How do you know that:

    2,5 minutes/km is a fast time for a runner?

    4,8 l/100 km is good gasoline economy?

    We compare to the distribution of speed or economy

    Norm-referenced reporting compares each score to a relevant distribution of scores

    Norm-referenced reporting offers teachers:

    A basis for evaluating their own expectations

    A way to compare performance across different areas (are my students better at computation than at problem-solving?)
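    A minimal sketch of norm-referenced reporting (the norm sample below is hypothetical; operational norms come from a large norming study):

```python
# Minimal sketch (hypothetical norm data): norm-referenced reporting locates a
# student's raw score within a reference distribution, e.g., as a percentile rank.
from bisect import bisect_right

# Hypothetical national norm sample of raw scores (a real norming study is far larger).
norm_sample = sorted([23, 25, 28, 30, 31, 33, 34, 36, 37, 39,
                      40, 41, 43, 44, 46, 47, 49, 51, 53, 56])

def percentile_rank(score, norms):
    """Percent of the norm group scoring at or below this score."""
    return 100.0 * bisect_right(norms, score) / len(norms)

for raw in (30, 44, 55):
    print(f"raw score {raw}: percentile rank {percentile_rank(raw, norm_sample):.0f}")
```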

  • Slide 25

    An example of a report from a norm-referenced test

  • Slide 27

    Compare level of detail to the previous slides

  • Slide 28

    Summary: what tests can do

    Standardized tests can provide important information to policymakers and educators with small demands on time

    No one test can serve every goal; design must be matched to purpose

    To some degree, the various purposes compete, creating conflicting demands for design

    To resolve this, we need either a clear choice among uses or multiple tests

  • Slide 29

    What tests cannot do

    Scores cannot provide a complete evaluation of a program or school

    Some important goals are omitted

    Scores taken alone do not isolate the contributions of teachers or schools

    Test scores describe; they do not explain

    Many factors other than schooling influence scores

    Efforts to separate the effects of schooling are complex and controversial

  • Slide 33

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 34

    What we learned from the US experience

    Effects on educational practice are mixed

    Some improvements

    Many undesirable effects: bad test preparation, other gaming

    Scores can become severely inflated (increase much more than actual learning)

    Overall improvement is exaggerated, often severely

    Relative effectiveness is estimated incorrectly

    Teachers, schools, and systems ranked incorrectly

    Can create an illusion of greater equity

  • Slide 35

    What we don't know

    What is the net effect on student achievement?

    Weak research designs, weaker data

    Some evidence of inconsistent, modest effects in elementary math, none in reading

    Effects are likely to vary across contexts

    Which types of test-based accountability systems are best?

    Which programs maximize real improvements

    Which programs minimize gaming, bad test preparation, & score inflation

    Reason: grossly inadequate research and evaluation

  • Slide 36

    Campbell's Law (1975)

    "The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

    Donald T. Campbell (1975). Assessing the impact of planned social change. In G. M. Lyons (Ed.), Social Research and Public Policies: The Dartmouth/OECD Conference.

  • Slide 38

    Campbell's Law in testing

    Raising scores becomes the primary goal

    Educators find ways to raise scores on the specific test used for accountability

    Scores are inflated: they increase more than learning

    Overall improvement is exaggerated

    Relative improvement is estimated incorrectly, and schools are ranked incorrectly

  • Slide 39

    Logic of studies of score inflation

    Scores are meaningful only if they generalize to the domain

    A poll is useful only if its results generalize to the entire electorate

    If gains generalize to the domain, they must generalize to other tests of the same domain

    Gains on a high-stakes test should generalize to a lower-stakes audit test

    If a poll is accurate, other good polls will show similar results

  • Slide 40

    Good versus bad preparation for a test

    Good: gives students knowledge and skills that they can apply elsewhere

    In later education

    In later employment

    Therefore, on other tests

    Bad: generates score inflation, test-specific gains that do not generalize beyond that test

  • Slide 41

    Performance on coached and uncoached tests

    [Figure: scores in grade equivalents (vertical axis, roughly 3.0 to 4.4) by year, 1985-1991, for the district's tests (Test B and Test C) and the test administered by Koretz et al.]

    SOURCE: Adapted from Koretz, Linn, Dunbar, and Shepard (1991)

  • Slide 43

    Reading change, grade 4, KIRIS and NAEP, 1992-1994

                           KIRIS     NAEP
    Gain in scale scores    18.8       -1
    Standardized gain       0.76    -0.03
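    For reference, the standardized gains above express the scale-score gains in standard-deviation units, which puts KIRIS and NAEP on a comparable metric; the exact choice of divisor (for example, the baseline-year SD) is an assumption here rather than something stated on the slide:

```latex
% Standardized gain: the scale-score gain divided by the standard deviation of
% scores (the baseline-year SD is assumed here as the divisor).
\[
  g \;=\; \frac{\bar{x}_{1994}-\bar{x}_{1992}}{s_{1992}}
\]
% Applied to the KIRIS column above, a gain of 18.8 points with g = 0.76
% implies a score SD of roughly 18.8 / 0.76 \approx 25 scale-score points.
```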

  • Slide 44

    Trends by Race on New York State vs. NAEP

    [Figure: standardized mean scale scores by race, 8th-grade math]

  • Slide 45

    Inconsistency between school ratings calculated from high-stakes and lower-stakes tests: best and worst cases from 48 correlations across grade, year, subject, and model

    r = .27, reading, 2000

    r = .63, math, 2000

  • Slide 46

    Topics

    Session 1:

    What is a test? The sampling principle of testing

    Examples of design choices for different purposes

    Session 2:

    What are the risks of test-based accountability?

    Undesirable changes in instruction

    Score inflation

    Why do these occur?


  • Slide 47

    Why inflation occurs

    Tests show predictable emphases, omissions, and forms of presentation over time

    Some are intentional, for technical reasons

    Some are accidental, or to save time and money

    Test preparation can focus on these patterns:

    Focusing instruction on emphasized content, at the cost of other content relevant to the inference

    Focusing on incidental characteristics of the test, such as item format

  • Slide 48

    Ways to raise scores

    Teaching more

    Working harder

    Working more effectively

    Reallocation

    Coaching

    Cheating

    Changing who is tested

  • Slide 49

    More detail on sampling in constructing a test

    1. Domain selected for testing (math, ELA, etc.)

    2. Elements from domain included in standards | Elements from domain omitted from standards

    3. Tested subset of standards | Untested subset of standards

    4. Tested material from within tested standards | Untested material from within tested standards

    5. Tested representations | Untested representations

  • Slide 50

    Reallocation

    Shifting instructional resources to fit the testing program

    Within a subject

    Between subjects

    Within a subject, can lead to either meaningful change or inflation

    Inflates if material getting decreased emphasis is also important for the inference

    Narrowed instruction is a type of reallocation

  • Slide 52

    Coaching

    Focusing preparation on substantively unimportant details of the test

    Minor, unimportant details of content

    Details of the presentation of material

    Includes test-taking tricks (e.g., process of elimination, plug-in)

    Can inflate scores or simply waste time

  • Slide 53

    How similar are tested representations?

    2008 item, New York grade 7 math test

    Which tool is most appropriate for measuring the mass of a serving of cheese?

    a. ruler
    b. thermometer
    c. measuring cup
    d. weighing scale

  • Slide 54

    2009 item, New York grade 7 math test

    Which tool would be the most appropriate for Natasha to use when finding the mass of a watermelon?

    a. scale
    b. inch ruler
    c. meter stick
    d. measuring cup

  • Slide 55

    How similar are tested representations?

    NY 7.N.7: Compare numbers written in scientific notation.

  • Slide 56

    An example of coaching (cheating?)

    The question on the review sheet for [the] exam reads in part:

    "The average amount that each band member must raise is a function of the number of band members, b, with the rule f(b) = 12000/b."

    The question on the actual test reads in part:

    "The average amount each cheerleader must pay is a function of the number of cheerleaders, n, with the rule f(n) = 420/n."

    Strauss, V., The Washington Post, July 10, 2001, p. A09

  • Slide 59

    Recommendation 1: make the evaluation and accountability system broad

    Do not rely only or excessively on standardized tests

    Evaluate other outcomes

    Evaluate practices as well as outcomes

    May need to use subjective as well as objective measures

  • Slide 60

    Recommendation 2: couple evaluation and accountability with training and support

    Many teachers need help, not just incentives, to improve instruction

    Provide support, for example, training for better teaching

  • Slide 61

    Recommendation 3: Use summative tests appropriately

    Report in detail

    Show teachers what needs improving, not just how high or low they score

    Add formative tests for difficult topics, if possible

    Set realistic performance targets that teachers can reach by appropriate methods

    Creates less incentive to use bad test preparation

  • Slide 62

    Supplementary slides

  • Slide 63

    Math trends, KIRIS and ACT

    [Figure: gains in standard deviations (vertical axis, roughly -0.1 to 0.7) by year, 1992-1995, for KIRIS and ACT]

  • Slide 64

    Standardized mathematics gains in Kentucky, 1992-1996

             KIRIS    NAEP
    Grade 4   0.61    0.17
    Grade 8   0.52    0.13

  • Slide 65

    One look at the TX miracle (Klein et al., 2000)

  • Slide 66

    Item from G8 MCAS

    Eva has four sets of straws. The measurements of the straws are given below. Which set of straws could not be used to form a triangle?

    A. Set 1: 4 cm, 4 cm, 7 cm
    B. Set 2: 2 cm, 3 cm, 8 cm
    C. Set 3: 3 cm, 4 cm, 5 cm
    D. Set 4: 5 cm, 12 cm, 13 cm
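    As a quick check of the item above (not part of the slide), the triangle inequality identifies the one set that cannot form a triangle:

```python
# Quick check of the MCAS item above: a triangle exists only if the two shorter
# sides together are longer than the longest side (the triangle inequality).
sets = {
    "Set 1": (4, 4, 7),
    "Set 2": (2, 3, 8),
    "Set 3": (3, 4, 5),
    "Set 4": (5, 12, 13),
}

for name, (a, b, c) in sets.items():
    sides = sorted((a, b, c))
    can_form = sides[0] + sides[1] > sides[2]
    print(f"{name}: {'can' if can_form else 'cannot'} form a triangle")
```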

  • Slide 67

    School Classification

    DIMENSIONS                  INDICATORS                                    WEIGHT
    Learning standards          Learning standards                            67,0 %
    outcomes                    Simce scores                                   3,3 %
                                Simce trend                                    3,3 %
    Other quality indicators    School motivation and academic self-esteem     3,3 %
                                School climate                                 3,3 %
                                Civic participation and citizenship            3,3 %
                                Healthy habits                                 3,3 %
                                Attendance                                     3,3 %
                                Gender equity                                  3,3 %
                                School retention                               3,3 %
                                Technical-professional certification           3,3 %
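    A minimal sketch of how the listed weights could combine into a single school index (assumptions throughout: the indicator key names, the common 0-100 scale, and the simple weighted-average aggregation are illustrative, not the Agencia's actual classification algorithm):

```python
# Minimal sketch (assumptions throughout): combining the dimension weights from
# the slide into a single school index as a weighted average. The real
# classification procedure is more involved; indicator scores here are
# hypothetical values on a common 0-100 scale.
WEIGHTS = {
    "learning_standards": 0.670,
    "simce_scores": 0.033,
    "simce_trend": 0.033,
    "motivation_self_esteem": 0.033,
    "school_climate": 0.033,
    "civic_participation": 0.033,
    "healthy_habits": 0.033,
    "attendance": 0.033,
    "gender_equity": 0.033,
    "school_retention": 0.033,
    "technical_professional": 0.033,
}

def school_index(indicator_scores: dict) -> float:
    """Weighted average of indicator scores (each assumed to be on 0-100)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * indicator_scores[k] for k in WEIGHTS) / total_weight

# Hypothetical school: stronger on climate and attendance, average elsewhere.
example = {k: 60.0 for k in WEIGHTS}
example["school_climate"] = 75.0
example["attendance"] = 80.0

print(f"composite index: {school_index(example):.1f}")
```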