Methods for Establishing a Strong Evidence Base to Support Claims of Comparability

The integration of comparability supports and audits throughout the design of the pilot is a sign of strength in any innovative assessment and accountability system.

The pilot must be designed to support the validity and comparability of the annual determinations.

For a system that relies on local flexibility in the assessments administered to support annual determinations, comparability will rest on evidence regarding the local scoring within districts, the performance standards for student achievement among pilot districts, and finally, the annual determinations across pilot districts and non-pilot districts. Gathering evidence at each of these levels will be essential for supporting the claims of comparability, and ultimately supporting the validity of the system as a whole.

Examples of the activities and audits that could occur at the three levels are summarized in Figure 3 and described in detail below.

Within-District Comparability in Expectations for Student Performance

States must plan for efforts to improve and monitor within-district comparability.

Promoting and evaluating consistency in educator scoring of student work within districts should be accomplished using multiple methods, and may include one or more of the following three example methodologies:

1) Within-district calibration sessions resulting in annotated anchor papers.

Providing training and resources for participating districts to hold grade-level calibration sessions for the scoring of common or local assessments is the first step for within-district calibration. Teachers would bring samples of student work from one or more assessments that represent the range of achievement in their classrooms, come to a common understanding about how to use the rubrics to score papers, and identify prototypical examples of student work for each score point on each rubric dimension. The educators annotate each of the anchor papers, documenting the group's rationale for the given score-point decision. These annotated anchor papers are then distributed throughout the district to help improve within-district consistency in scoring. Additionally, if this work is done using an assessment that is common across districts, the anchor papers could be vetted and shared across districts to simultaneously improve cross-district calibration in scoring.


2) Within-district estimates of inter-rater reliability.

External audits of the consistency in scoring could be achieved by asking each district to submit a sample of papers from each assessment (or a sample of assessments) that have been double-blind scored by teachers. The collection of double scores could then be analyzed using a variety of traditional inter-rater reliability techniques for estimating rater scoring consistency within districts (e.g., percent exact and adjacent agreement, Cohen's kappa,⁶ intraclass correlations⁷).
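
To illustrate how such a collection of double scores might be analyzed, the following sketch (in Python, with hypothetical rater data) computes percent exact agreement, percent adjacent agreement, and Cohen's kappa for one district's double-scored papers; intraclass correlations would typically be computed with a dedicated statistical package.

```python
# Illustrative sketch: within-district inter-rater agreement on double-scored papers.
# Assumes integer rubric scores; the scores below are hypothetical.
from collections import Counter

rater_a = [3, 2, 4, 1, 3, 2, 4, 3, 1, 2]   # first blind score for each paper
rater_b = [3, 3, 4, 1, 2, 2, 4, 3, 2, 2]   # second blind score for each paper
n = len(rater_a)

# Percent exact and adjacent (within one score point) agreement.
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n

# Cohen's kappa: agreement corrected for chance (Cohen, 1960).
categories = set(rater_a) | set(rater_b)
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
kappa = (exact - expected) / (1 - expected)

print(f"exact agreement:    {exact:.2f}")
print(f"adjacent agreement: {adjacent:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")
```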

3) Testing the generalizability of the local assessment systems.

If the design of the innovative pilot involves at least some common tasks and some local tasks for generating annual summative scores, much of the work for gathering evidence of comparability will rely on the use of the common tasks as calibration tools. However, the utility of the common items or tasks for judging the degree of comparability across districts rests heavily on the assumption that within-district or local scoring on the common tasks is representative of local scoring on the local tasks. This assumption requires that findings associated with the common tasks are generalizable across the entire assessment system within each participating district. Therefore, it will be necessary to test this assumption by running generalizability analyses using all of the assessment scores (local and common) within a given district's assessment system. Conducting these analyses has the added benefit of providing an index of score reliability that can be used to support the technical quality of the assessment results.
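
As an illustration of the kind of generalizability analysis described here, the sketch below estimates variance components and a generalizability (G) coefficient for a fully crossed persons-by-tasks design; the score matrix is hypothetical, and an operational pilot would likely use dedicated G-theory software and a richer design (e.g., persons × tasks × raters).

```python
# Illustrative G-study sketch for a fully crossed persons x tasks design
# (one score per cell). The score matrix below is hypothetical.
import numpy as np

scores = np.array([            # rows = students, columns = tasks (local and common)
    [3, 4, 3, 2],
    [2, 2, 3, 2],
    [4, 4, 4, 3],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
], dtype=float)
n_p, n_t = scores.shape
grand = scores.mean()

# Sums of squares for persons, tasks, and the residual (person x task interaction plus error).
ss_p = n_t * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_t = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_t

ms_p = ss_p / (n_p - 1)
ms_t = ss_t / (n_t - 1)
ms_res = ss_res / ((n_p - 1) * (n_t - 1))

# Estimated variance components.
var_pt = ms_res                         # person x task interaction, confounded with error
var_p = max((ms_p - ms_res) / n_t, 0)   # true person (student) variance
var_t = max((ms_t - ms_res) / n_p, 0)   # task difficulty variance

# Relative G coefficient: reliability of the ordering of students over this set of tasks.
g_relative = var_p / (var_p + var_pt / n_t)
print(f"G coefficient (relative): {g_relative:.2f}")
```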

Cross-District Comparability in Evaluating Student Work

The primary goal of a cross-district comparability audit is quality control: to gather evidence of the degree to which there are systematic differences in the stringency or leniency of scoring across participating districts. Depending on the design of the pilot, there are methods for evaluating the degree of comparability in scoring across districts for common assessments and for local assessments. Additionally, the comparability of the results of the assessment system can be evaluated by critically examining bodies of evidence (student work) generated by a cross-district sample of students participating in the innovative assessment system. An example of each of these types of methods is provided below:

1) Social moderation audit with common tasks.

The design of social moderation audits can be modeled after a number of international examples; one that may be particularly useful is Queensland, Australia, where externally moderated school-based assessments replaced external standardized assessments.⁸ If all students in the participating pilot districts are taking at least one common performance task, then student scores on these tasks can be used to determine the degree of comparability of teacher judgments about the quality of student work across districts. A consensus scoring social moderation method could involve pairing teachers, each representing a different district, to score student work samples from yet a third district. After training and practice, both judges within a pair are asked to individually score their assigned samples of student work and record their scores. Working through the work samples one at a time, the judges discuss their individual scores and then come to an agreement on a "consensus score." The purpose of collecting consensus score data is to estimate what might be considered analogous to a "true score," which is used as a calibration weight.

These consensus scores are then used in follow-up analyses to detect any systematic, cross-district differences in the stringency of standards used for local scoring. If systematic differences are detected, the project leaders can make defensible decisions about calibrating (or making adjustments to) the district-specific performance standards.
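
To make that follow-up analysis concrete, the short sketch below (Python) summarizes, for each district, the average difference between the original local score and the pair's consensus score and flags districts that appear systematically lenient or stringent. The records, district names, and flagging threshold are hypothetical; an operational audit would likely rely on a statistical model rather than a simple mean difference.

```python
# Illustrative follow-up analysis: local scores vs. consensus scores from the
# social moderation audit. All records and the 0.25-point tolerance are hypothetical.
from collections import defaultdict

# (district, local score, consensus score) for each moderated work sample
records = [
    ("District A", 3, 3), ("District A", 4, 3), ("District A", 3, 2),
    ("District B", 2, 3), ("District B", 3, 3), ("District B", 2, 3),
    ("District C", 3, 3), ("District C", 4, 4), ("District C", 2, 2),
]

diffs = defaultdict(list)
for district, local, consensus in records:
    diffs[district].append(local - consensus)   # positive = local scoring more lenient

THRESHOLD = 0.25  # hypothetical tolerance before flagging for calibration review
for district, d in sorted(diffs.items()):
    bias = sum(d) / len(d)
    if bias > THRESHOLD:
        flag = "possible leniency"
    elif bias < -THRESHOLD:
        flag = "possible stringency"
    else:
        flag = "within tolerance"
    print(f"{district}: mean local - consensus = {bias:+.2f} ({flag})")
```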

2) Social moderation audit with local tasks.

The comparability of local tasks measuring the same or similar knowledge and skills can be evaluated using a rank-ordering social moderation technique. In the United Kingdom (UK), the results of written exams are used to inform decisions about post-secondary job and university placements. However, across the UK, different awarding bodies (or examination boards) are responsible for creating their own written examinations. Therefore, social moderation audits are used to ensure the standard for post-secondary placements is comparable across awarding bodies. One approach for ensuring comparability is a rank-ordering social moderation method.⁹ The rank-ordering method involves asking trained judges to rank-order samples of student work within a number of pre-designed packets. Each packet groups work samples that received similar overall scores, and those scores are withheld from the reviewers. The work within the packets is arranged and distributed across judges in a way that allows each sample of work to be compared with all other student work receiving similar scores and to be ranked by more than one judge. The rank-order data resulting from the judges can then be transformed into paired-comparison data that can be used to estimate a Thurstone scale. An indicator of relative district stringency or leniency in scoring can be derived by comparing the Thurstone scale scores with the local scores of each sample of student work.
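
As a rough illustration of how rank-order judgments might be turned into a scale, the sketch below builds paired comparisons from hypothetical packets and computes simple Thurstone Case V scale values from the win proportions. The packet contents and script labels are invented, and an operational study such as the rank-ordering method cited above would use a more formal model and would link the resulting scale back to the local scores of each work sample.

```python
# Illustrative sketch: rank-ordered packets -> paired comparisons -> Thurstone Case V scale.
# Packet contents and script labels are hypothetical.
import itertools
import numpy as np
from scipy.stats import norm

# Each packet is one judge's rank ordering of work samples, best first.
packets = [
    ["s1", "s3", "s2", "s4"],
    ["s3", "s1", "s4", "s2"],
    ["s1", "s2", "s3", "s5"],
    ["s3", "s5", "s2", "s4"],
]
scripts = sorted({s for p in packets for s in p})
idx = {s: i for i, s in enumerate(scripts)}
n = len(scripts)

# wins[i, j] = number of times script i was ranked above script j.
wins = np.zeros((n, n))
for packet in packets:
    for better, worse in itertools.combinations(packet, 2):
        wins[idx[better], idx[worse]] += 1

# Proportion of comparisons won; pairs never compared default to 0.5, and
# proportions are clipped to avoid infinite normal deviates.
comparisons = wins + wins.T
denom = np.where(comparisons > 0, comparisons, 1)
p = np.where(comparisons > 0, wins / denom, 0.5)
p = np.clip(p, 0.05, 0.95)
np.fill_diagonal(p, 0.5)

# Thurstone Case V: scale value = mean of the normal deviates of the win proportions.
scale = norm.ppf(p).mean(axis=1)
for s in scripts:
    print(f"{s}: scale value {scale[idx[s]]:+.2f}")
```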

3) Validating the performance standards with a body-of-work method.

Cross-district comparability rests on the notion that the results of the assessment system in one district carry the same meaning and can be interpreted and used in the same way as the results of the assessment system in another district. Since the innovative pilot will likely involve a degree of local variability across the assessment systems in the pilot districts, the assumption of comparable results must be verified. One way to validate the district standards is to engage in a student work-based standard setting method such as the Body of Work method or some variation thereof.¹⁰ The Body of Work method requires teachers or other judges to review portfolios of student work and make judgments about student achievement relative to the achievement level descriptors. These teacher judgments are then reconciled with the reported achievement levels as an additional source of validity evidence to support the comparability of the annual determinations across pilot districts.

⁶ Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

⁷ Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

⁸ Queensland Studies Authority. (2014). School-based assessment: The Queensland assessment. Queensland, Australia: The State of Queensland (Queensland Studies Authority).

⁹ Bramley, T. (2005). A rank-ordering method for equating tests by expert judgment. Journal of Applied Measurement, 6(2), 202–223.

¹⁰ Kahl, S. R., Crockett, T. J., DePascale, C. A., & Rindfleisch, S. L. (1995, June). Setting standards for performance levels using the student-based constructed-response method. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Comparability of Annual Determinations Across Pilot Districts and Non-Pilot Districts

The accountability uses of the assessment system results rest on the comparability of annual determinations. Therefore, the comparability claims for the innovative pilot will apply to the reported performance levels (as opposed to the scale scores used in more traditional assessment models). The comparability processes and audits that occur at both the local, within-district level and the cross-district level are all in an effort to support the claim of comparability in the annual determinations. However, if the pilot is not statewide, a major ESSA comparability requirement is that the pilot system results are comparable with the non-pilot district results. The following are examples of procedures that could be used to formally promote and evaluate the comparability of the annual determinations across both pilot and non-pilot districts:

1) Setting standards using common achievement level descriptors (ALDs).

Achievement level descriptors are exhaustive, content-based descriptions that illustrate and define student achievement at each of the reported performance levels. Detailed ALDs are typically developed by teams of content experts and educators to be used for the purposes of setting criterion-referenced performance standards (i.e., cut scores) for an assessment program. The use of common ALDs across the pilot and non-pilot assessment systems will support shared interpretations of performance relative to the content standards and, ultimately, through the chosen standard setting procedures, provide evidence for the comparability of the performance standards across the two assessment systems. If the selected standard setting methods across the two programs rest heavily on common ALDs, then having common ALDs will serve as a foundation for the inference that the resulting achievement levels carry the same meaning and can be used to support the same purposes (i.e., accountability and reporting).

2) Administering a common standardized assessment to a sample of students in both pilot and non-pilot districts.

Importantly, the degree of comparability of the annual determinations across the two assessment systems within the state can be directly evaluated by administering an assessment that is common to the two programs to a sample of students. For example, a state could administer the statewide standardized assessment to students in select grade levels and subjects within the pilot districts. The comparability of the annual determinations between pilot and non-pilot districts could then be evaluated by directly comparing annual determinations for the students who participated in both assessment systems.

By calculating two sets of annual determinations for these students, the state will have both traditional and innovative data points for some of the students in each pilot district. The degree of agreement between the two sets of annual determinations could then be analyzed to provide further evidence regarding the comparability of the interpretations of the reported achievement levels, or if systematic differences are detected, inform decisions about calibrating results to provide for comparability when appropriate.
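
For example, the agreement analysis might resemble the sketch below, which cross-tabulates the two sets of annual determinations for the same sample of students and reports exact classification agreement along with the share of students at Levels 3 and 4 under each system; the student-level achievement levels and the 1–4 level scale shown here are hypothetical.

```python
# Illustrative sketch: agreement between annual determinations from the pilot system
# and the statewide assessment for the same (hypothetical) sample of students.
import numpy as np

levels = [1, 2, 3, 4]
pilot =     [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]   # pilot-system achievement levels
statewide = [3, 2, 3, 3, 2, 2, 3, 4, 2, 4]   # statewide-assessment achievement levels
n = len(pilot)

# Cross-tabulation of the two classifications.
table = np.zeros((len(levels), len(levels)), dtype=int)
for p, s in zip(pilot, statewide):
    table[p - 1, s - 1] += 1
print("pilot (rows) x statewide (cols):")
print(table)

exact = sum(p == s for p, s in zip(pilot, statewide)) / n
print(f"exact classification agreement: {exact:.2f}")

# Distribution check: share of students at Levels 3-4 under each system. These
# percentages need not match exactly; large or district-specific gaps are what
# warrant follow-up.
prof_pilot = sum(p >= 3 for p in pilot) / n
prof_state = sum(s >= 3 for s in statewide) / n
print(f"Levels 3-4: pilot {prof_pilot:.0%} vs. statewide {prof_state:.0%}")
```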

We note, however, that having two sets of data with which to evaluate the performance of students across different settings does not mean that the results should be equivalent. For example, if approximately 55% of the students were scoring in Levels 3 and 4 on the state standardized assessment, that does not mean we should expect exactly 55% of the students to be classified in Levels 3 and 4 on the innovative pilot assessments. There could be very good reasons why the results would differ in either direction. For example, if a state is using an innovative performance assessment model in the pilot districts, these assessments may be capturing additional information about real-world application and knowledge transfer that provides a more valid representation of the construct than is possible with traditional standardized assessments. That said, states should be able to explain these discrepancies in terms of their theories of action. Further, it would be hard to explain significant variations between the two sets of results, especially if such variability were found in only a subset of the pilot districts.

Figure 3. Establishing an Evidence Base for Comparable Annual Determinations. [Figure not reproduced here; it summarizes the within-district, cross-district, and pilot/non-pilot comparability activities described above.]


