Ensuring and Evaluating Assessment Quality | 9

Methods for Establishing a Strong Evidence Base to Support Claims of Comparability

The integration of comparability supports and audits throughout the design of the pilot is a sign of strength in any innovative assessment and accountability system.

The pilot must be designed to support the validity and comparability of the annual determinations.

For a system that relies on local flexibility in the assessments administered to support annual determinations, comparability will rest on evidence regarding the local scoring within districts, the performance standards for student achievement among pilot districts, and finally, the annual determinations across pilot districts and non-pilot districts. Gathering evidence at each of these levels will be essential for supporting the claims of comparability, and ultimately supporting the validity of the system as a whole.

Examples of the activities and audits that could occur at the three levels are summarized in Figure 3 and described in detail below.

Within-District Comparability in Expectations for Student Performance

States must plan for efforts to improve and monitor within-district comparability.

Promoting and evaluating consistency in educator scoring of student work within districts should be accomplished using multiple methods, and may include one or more of the following three example methodologies:

1) Within-district calibration sessions resulting in annotated anchor papers.

Providing training and resources for participating districts to hold grade-level calibration sessions for the scoring of common or local assessments is the first step for within-district calibration. Teachers would bring samples of their student work from one or more assessments that represent the range of achievement in their classrooms and would then come to a common understanding about how to use the rubrics to score papers and identify prototypical examples of student work for each score point on each rubric dimension. The educators annotate each of the anchor papers, documenting the group's rationale for the given score-point decision. These annotated anchor papers are then distributed throughout the district to help improve within-district consistency in scoring. Additionally, if this work is done using an assessment that is common across districts, the anchor papers could be vetted and shared across districts to simultaneously improve cross-district calibration in scoring.

2) Within-district estimates of inter-rater reliability.

External audits of the consistency in scoring could be achieved by asking each district to submit a sample of papers from each assessment (or a sample of assessments) that have been double-blind scored by teachers. The collection of double scores could then be analyzed using a variety of traditional inter-rater reliability techniques for estimating rater scoring consistency within districts (e.g., percent exact and adjacent agreement, Cohen’s Kappa,6 intraclass correlations7).
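As a rough illustration of the agreement statistics named above, the following sketch computes exact agreement, exact-or-adjacent agreement, and Cohen's kappa for one set of double-scored papers. The function names and the rubric scores are hypothetical, and "adjacent" is assumed here to mean scores within one rubric point of each other; an actual audit would likely use an established statistics package and a much larger sample.

```python
from collections import Counter


def agreement_rates(scores_a, scores_b):
    """Return (exact, within-one-point) agreement proportions for two raters."""
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    # Chance agreement from each rater's marginal score distribution.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores (1-4) from two teachers on the same eight papers.
teacher_1 = [1, 2, 2, 3, 3, 4, 4, 2]
teacher_2 = [1, 2, 3, 3, 3, 4, 3, 2]

exact, within_one = agreement_rates(teacher_1, teacher_2)
kappa = cohens_kappa(teacher_1, teacher_2)
print(f"exact={exact:.2f} within_one={within_one:.2f} kappa={kappa:.2f}")
```

Kappa is undefined when chance agreement equals 1 (for example, when both raters give every paper the same score), so a production version would need to guard against that case.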

3) Testing the generalizability of the local assessment systems.

If the design of the innovative pilot involves at least some common tasks and some local tasks for generating annual summative scores, much of the work for gathering evidence of comparability will rely on the use of the common tasks as calibration tools. However, the utility of the common items or tasks for judging the degree of comparability across districts rests heavily on the assumption that within-district or local scoring on the common tasks is representative of local scoring on the local tasks. This assumption requires that findings associated with the common tasks are generalizable across the entire assessment system within each participating district. Therefore, it will be necessary to test this assumption by running generalizability analyses using all of the assessment scores (local and common) within a given district’s assessment system. Conducting these analyses has the added benefit of providing an index of score reliability that can be used to support technical quality of the assessment results.
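To make the idea of a generalizability analysis more concrete, the sketch below estimates variance components for the simplest possible case: a fully crossed students-by-tasks design with one score per cell and no missing data. This design, the function name `g_study`, and the example scores are all assumptions made for illustration; real local assessment systems are usually unbalanced or nested and would call for dedicated generalizability software.

```python
def g_study(scores):
    """One-facet G-study. scores[p][t] is the score of student p on task t."""
    n_p, n_t = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_t)
    student_means = [sum(row) / n_t for row in scores]
    task_means = [sum(row[t] for row in scores) / n_p for t in range(n_t)]

    # Mean squares from a two-way ANOVA without replication.
    ms_student = n_t * sum((m - grand) ** 2 for m in student_means) / (n_p - 1)
    ms_task = n_p * sum((m - grand) ** 2 for m in task_means) / (n_t - 1)
    ss_resid = sum(
        (scores[p][t] - student_means[p] - task_means[t] + grand) ** 2
        for p in range(n_p)
        for t in range(n_t)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_t - 1))

    # Expected-mean-square solutions; negative estimates are truncated at zero.
    var_student = max((ms_student - ms_resid) / n_t, 0.0)
    var_task = max((ms_task - ms_resid) / n_p, 0.0)
    var_resid = ms_resid

    # Relative G coefficient for a score averaged over n_t tasks.
    # (Undefined if every student has the same mean and residual variance is zero.)
    g_coefficient = var_student / (var_student + var_resid / n_t)
    return {
        "var_student": var_student,
        "var_task": var_task,
        "var_resid": var_resid,
        "g_coefficient": g_coefficient,
    }


# Hypothetical scores: five students, two local tasks plus one common task.
district_scores = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 2],
    [3, 3, 4],
]
result = g_study(district_scores)
print({name: round(value, 3) for name, value in result.items()})
```

A large student variance component relative to the residual suggests that the local and common tasks rank students similarly, which is the condition the common-task calibration approach depends on.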

Cross-District Comparability in Evaluating Student Work

The primary goal of a cross-district comparability audit is quality control: to gather evidence of the degree to which there are systematic differences in the stringency or leniency of scoring across participating districts. Depending on the design of the pilot, there are methods for evaluating the degree of comparability in scoring across districts for common assessments and for local assessments. Additionally, the comparability of the results of the assessment system can be evaluated by critically examining bodies of evidence (student work) generated by a cross-district sample of students participating in the innovative assessment system. An example of each of these types of methods is provided below:

1) Social moderation audit with common tasks.

The design of social moderation audits can be modeled after a number of international examples; one that may be particularly useful is Queensland, Australia, where externally moderated school-based assessments replaced external standardized assessments.8 If all students in the participating pilot districts are taking at least one common performance task, then student scores on these tasks can be used to determine the degree of comparability of teacher judgments about the quality of student work across districts. A consensus scoring social moderation method could involve pairing teachers, each representing a different district, to score student work samples from a third district. After training and practice, both judges within each pair are asked to individually score their assigned samples of student work and record their scores. Working through the work samples one at a time, the judges discuss their individual scores and then come to an agreement on a “consensus score.” The purpose of collecting consensus score data is to estimate what might be considered analogous to a “true score,” which is used as a calibration weight.

These consensus scores are then used in follow-up analyses to detect any systematic, cross-district differences in the stringency of standards used for local scoring. If systematic differences are detected, the project leaders can make defensible decisions about calibrating (or making adjustments to) the district-specific performance standards.

6. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

7. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

8. Queensland Studies Authority. (2014). School-based assessment: The Queensland assessment. Queensland, Australia: The State of Queensland (Queensland Studies Authority).
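The follow-up analysis of consensus scores in the common-task audit is not specified in detail in this section. One simple possibility, sketched below under that caveat, is to compare each district's local scores with the consensus scores for the same work samples and look at the average gap. The record layout, the function name, and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_gap_by_district(records):
    """records: (district, local_score, consensus_score) for each work sample.

    Returns the mean of (local - consensus) per district. A positive value
    suggests local scoring is more lenient than the consensus judgment;
    a negative value suggests it is more stringent.
    """
    gaps = defaultdict(list)
    for district, local_score, consensus_score in records:
        gaps[district].append(local_score - consensus_score)
    return {district: mean(values) for district, values in gaps.items()}


# Hypothetical audit records for two districts on a common performance task.
audit_records = [
    ("District A", 3, 3),
    ("District A", 4, 3),
    ("District A", 2, 2),
    ("District B", 2, 3),
    ("District B", 3, 3),
    ("District B", 1, 2),
]

for district, gap in mean_score_gap_by_district(audit_records).items():
    print(f"{district}: mean local-minus-consensus gap = {gap:+.2f}")
```

A mean gap near zero in every district would be consistent with comparable scoring standards. With real audit samples, the gaps would also need standard errors or a significance test before any district-specific adjustment is considered.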

2) Social moderation audit with local tasks.

The comparability of local tasks measuring the same or similar knowledge and skills can be evaluated using a rank-ordering social moderation technique. In the United Kingdom (UK), the results of written exams are used to inform decisions about post-secondary job and university placements. However, across the UK, different awarding bodies (or examination boards) are responsible for creating their own written examinations. Therefore, social moderation audits are used to ensure the standard for post-secondary placements is comparable across awarding bodies. One approach for ensuring comparability is a rank-ordering social moderation method.9 The rank-ordering method involves asking trained judges to rank-order samples of student work within a number of pre-designed packets. Each packet groups work samples with similar overall scores, and those scores are hidden from the judges. The work within the packets is arranged and distributed across judges in a way that allows each sample of work to be compared with all other student work receiving similar scores and to be ranked by more than one judge. The rank-order data resulting from the judges can then be transformed into paired-comparison data that can be used to estimate a Thurstone scale. An indicator of relative district stringency and leniency in scoring can be derived from comparing the Thurstone scale scores with the local scores of each sample of student work.
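The transformation from rank orders to a Thurstone scale can be sketched as follows. This is a minimal illustration that assumes Thurstone's Case V model and clips win proportions of 0 or 1 to avoid infinite scale values; the section does not say which estimation procedure is used in practice, and the function names, clipping value, and rankings are hypothetical.

```python
from itertools import combinations
from statistics import NormalDist, mean


def rankings_to_pair_counts(rankings):
    """Each ranking lists work-sample ids from best to worst.

    Returns wins[(a, b)] = number of times a was ranked above b.
    """
    wins = {}
    for ranking in rankings:
        # combinations() preserves list order, so `better` precedes `worse`.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] = wins.get((better, worse), 0) + 1
    return wins


def thurstone_case_v(rankings, clip=0.01):
    """Estimate a scale value per work sample from judges' rank orders."""
    wins = rankings_to_pair_counts(rankings)
    items = sorted({item for ranking in rankings for item in ranking})
    normal = NormalDist()
    scale = {}
    for a in items:
        z_values = []
        for b in items:
            if a == b:
                continue
            n_ab, n_ba = wins.get((a, b), 0), wins.get((b, a), 0)
            if n_ab + n_ba == 0:
                continue  # this pair never appeared in the same packet
            p = min(max(n_ab / (n_ab + n_ba), clip), 1 - clip)
            z_values.append(normal.inv_cdf(p))
        scale[a] = mean(z_values) if z_values else 0.0
    return scale


# Hypothetical rank orders of three work samples from three judges.
judge_rankings = [
    ["w1", "w2", "w3"],
    ["w1", "w3", "w2"],
    ["w2", "w1", "w3"],
]
scale_values = thurstone_case_v(judge_rankings)
for sample, value in sorted(scale_values.items(), key=lambda kv: -kv[1]):
    print(f"{sample}: {value:+.2f}")
```

Comparing these scale values with the locally assigned scores for the same work samples, district by district, would then indicate whether a given local score tends to correspond to stronger or weaker work in one district than in another.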

3) Validating the performance standards with a body-of-work method.

Cross-district comparability rests on the notion that the results of the assessment system in one district carry the same meaning and can be interpreted and used in the same way as results of the assessment system in another district. Since the innovative pilot will likely involve a degree of local variability across the assessment systems in the pilot districts, the assumption of comparable results must be verified. One way to validate the district standards is to engage in a student work-based standard setting method such as the Body of Work method or some variation thereof.10 The body of work method requires teachers or other judges to review the portfolios of student work and make judgments about student achievement relative to the achievement level descriptors. These teacher judgments are then reconciled with the reported achievement levels as an additional source of validity evidence to support the comparability of the annual determinations across pilot districts.

9. Bramley, T. (2005). A rank-ordering method for equating tests by expert judgment. Journal of Applied Measurement, 6(2), 202–223.

10. Kahl, S.R., Crockett, T.J., DePascale, C.A., & Rindfleisch, S.L. (1995, June). Setting standards for performance levels using the student-based constructed-response method. Paper presented at the annual meeting of the American Research Association, San Francisco, CA.

Comparability of Annual Determinations Across Pilot Districts and Non-Pilot Districts

The accountability uses for the assessment system results rest on the comparability of annual determinations. Therefore, the comparability claims for the innovative pilot will apply to the reported performance levels (as opposed to scale scores for more traditional assessment models). The comparability processes and audits that occur at both the local, within-district level and the cross-district level are all in an effort to support the claim of comparability in the annual determinations. However, if the pilot is not statewide, a major ESSA comparability requirement is that the pilot system results are comparable with the non-pilot district results. The following are examples of procedures that could be used to formally promote and evaluate the comparability of the annual determinations across both pilot and non-pilot districts:

1) Setting standards using common achievement level descriptors (ALDs).

Achievement level descriptors are exhaustive, content-based descriptions that illustrate and define student achievement at each of the reported performance levels. Detailed ALDs are typically developed by teams of content experts and educators to be used for the purposes of setting criterion-referenced performance standards (i.e., cutscores) for an assessment program. The use of common ALDs across the pilot and non-pilot assessment systems will support shared interpretations of performance relative to the content standards and, ultimately, through the chosen standard setting procedures, provide evidence for the comparability of the performance standards across the two assessment systems. If the selected standard setting methods across the two programs rest heavily on common ALDs, then having common ALDs will serve as a foundation for the inference that the resulting achievement levels carry the same meaning and can be used to support the same purposes (i.e., accountability and reporting).

2) Administering a common standardized assessment to a sample of students in both pilot and non-pilot districts.

Importantly, the degree of comparability of the annual determinations across the two assessment systems within the state can be directly evaluated by administering an assessment that is common across the two programs to a sample of students. For example, a state could administer the statewide standardized assessment to students in select grade levels and subjects within the pilot districts. The comparability of the annual determinations between pilot and non-pilot districts could then be evaluated by directly comparing annual determinations for the students who participated in both assessment systems.

By calculating two sets of annual determinations for these students, the state will have both traditional and innovative data points for some of the students in each pilot district. The degree of agreement between the two sets of annual determinations could then be analyzed to provide further evidence regarding the comparability of the interpretations of the reported achievement levels, or if systematic differences are detected, inform decisions about calibrating results to provide for comparability when appropriate.
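One way the degree of agreement between the two sets of annual determinations might be summarized is sketched below. It assumes achievement levels coded as integers 1 to 4 and treats Levels 3 and 4 as the upper classification purely for illustration; the function name, the coding, and the student data are hypothetical.

```python
from collections import Counter


def compare_determinations(pairs, upper_cut=3):
    """pairs: (state_level, pilot_level) for each double-tested student."""
    n = len(pairs)
    return {
        # Full cross-tabulation of state level by pilot level.
        "crosstab": Counter(pairs),
        "exact_agreement": sum(s == p for s, p in pairs) / n,
        # Direction of disagreement shows whether differences are systematic.
        "pilot_higher": sum(p > s for s, p in pairs) / n,
        "pilot_lower": sum(p < s for s, p in pairs) / n,
        # Agreement on the upper/lower classification only.
        "upper_level_agreement": sum(
            (s >= upper_cut) == (p >= upper_cut) for s, p in pairs
        ) / n,
    }


# Hypothetical determinations for eight students tested under both systems.
student_levels = [
    (3, 3), (2, 3), (4, 4), (1, 1),
    (3, 4), (2, 2), (3, 2), (4, 4),
]
summary = compare_determinations(student_levels)
for name, value in summary.items():
    if name != "crosstab":
        print(f"{name}: {value:.3f}")
```

If disagreements fall mostly in one direction, that pattern would point to a systematic difference between the two systems rather than random classification error.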

We note, however, that having two sets of data to evaluate the performance of students across different settings does not mean that the results should be equivalent. For example, if approximately 55% of the students were scoring in Levels 3 and 4 on the state standardized assessment, that does not mean we should expect exactly 55% of the students to be classified in Levels 3 and 4 on the innovative pilot assessments. There could be very good reasons why the results would differ in either direction. For example, if a state is using an innovative performance assessment model in the pilot districts, these assessments may be capturing additional information relative to real-world application and knowledge transfer that provides for more valid representations of the construct than possible with traditional standardized assessments. That said, states should be able to explain these discrepancies in terms of their theories of action. Further, it would be hard to explain significant variations between the two sets of results, especially if such variability was found in only a subset of the pilot districts.

Figure 3. Establishing an Evidence-Base for Comparable Annual Determinations

