Using observation checklists to validate speaking-test tasks
Barry O’Sullivan, The University of Reading
Cyril J. Weir, University of Surrey, Roehampton
Nick Saville, University of Cambridge Local Examinations Syndicate
Test-task validation has been an important strand in recent revision projects for
University of Cambridge Local Examinations Syndicate (UCLES) examinations.
This article addresses the relatively neglected area of validating the match between
intended and actual test-taker language with respect to a blueprint of language functions representing the construct of spoken language ability. An observation checklist designed for both a priori and a posteriori analysis of speaking task output has been developed. This checklist enables language samples elicited by the task to be scanned for these functions in real time, without resorting to the laborious and somewhat limited analysis of transcripts. The process and results of its development, implications and further applications are discussed.
I Background to the study
This article reports on the development and use of observation checklists in the validation of the Speaking Tests within the University of Cambridge Local Examinations Syndicate (UCLES) ‘Main Suite’ examination system (see Figure 1). These checklists are intended to provide an effective and efficient tool for investigating variation in language produced by different task types, different tasks within task types, and different interview organization at the proficiency levels in Figure 1. As such, they represent a unique attempt to validate the match between intended and actual test-taker language with respect to a blueprint of language functions representing the construct of spoken language ability in the UCLES tests of general language proficiency, from PET to CPE level (for further information related to the different tests in the ‘Main Suite’ battery, see the individual handbooks produced by UCLES). Beyond this study, the application of such checklists has clear relevance for any test of spoken interaction.
Figure 1 The Cambridge/ALTE five-level system
ALTE Level 1 (Waystage User): Cambridge Level 1, Key English Test (KET)
ALTE Level 2 (Threshold User): Cambridge Level 2, Preliminary English Test (PET)
ALTE Level 3 (Independent User): Cambridge Level 3, First Certificate in English (FCE)
ALTE Level 4 (Competent User): Cambridge Level 4, Certificate in Advanced English (CAE)
ALTE Level 5 (Good User): Cambridge Level 5, Certificate of Proficiency in English (CPE)
Bands shown across the levels: BASIC, INTERMEDIATE, ADVANCED
Table 1 Format of the Main Suite Speaking Test
Part   Participants             Task format
1      Interviewer–candidate    Interview: Verbal questions
2      Candidate–candidate      Collaborative task: Visual stimulus;
Address for correspondence: Barry O’Sullivan, Testing and Evaluation Unit, School of Linguistics and Applied Language Studies, The University of Reading, PO Box 241, Whiteknights, Reading RG6 6WB, UK; email: b.e.osullivan@reading.ac.uk
Language Testing 2002 19 (1) 33–56; 10.1191/0265532202lt219oa; © 2002 Arnold
The standard Cambridge approach in testing speaking is based on a paired format involving an interlocutor, an additional examiner and two candidates. Careful attention has been given to the tasks through which the spoken language performance is elicited in each different part. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1.
II Issues in validating tests of oral performance
In considering the issue of the validity of a performance test¹ of speaking, we need a framework that describes the relationship between the construct being measured, the tasks used to operationalize that construct and the assessment of the performances that are used to make inferences to that underlying ability.
¹ By performance tests we are referring to direct tests where a test-taker’s ability is evaluated from their performance on a set task or tasks.
There have been a number of models that have attempted to portray the relationship between a test-taker’s knowledge of, and ability to use, a language and the score they receive in a test designed to evaluate that knowledge (e.g., Milanovic and Saville, 1996; McNamara, 1996; Skehan, 1998; Upshur and Turner, 1999).
Milanovic and Saville (1996) provide a useful overview of the variables that interact in performance testing and suggest a conceptual framework for setting out different avenues of research. The framework was influential in the revisions of the Cambridge examinations during the 1990s, including the development of KET and CAE exams and revisions to PET, FCE and, most recently, CPE (for a summary of the UCLES approach, see Saville and Hargreaves, 1999).
The Milanovic and Saville framework is one of the earliest, and most comprehensive, of these models (reproduced here as Figure 2).
This framework highlights the many factors (or facets) that must be considered when designing a test from which particular inferences are to be drawn about performances; all of the factors represented in the model pose potential threats to the reliability and validity of these inferences. From this model, a framework can be derived, through which a validation strategy can be devised for Speaking Tests such as those produced by UCLES.
The essential elements of this framework are:
· the test-taker;
· the interlocutor/examiner;
· the assessment criteria (scales);
· the task;
· the interactions between these elements.
Figure 2 A conceptual framework for performance testing
Source: adapted from Milanovic and Saville, 1996: 6
The subject of this study, the task, has been explored from a number of perspectives. Briefly, these have been:
· Task/method comparison (quantitative): involving studies in which comparisons are made between performances on different tasks or methods (Clark, 1979; 1988; Henning, 1983; Shohamy, 1983; Shohamy et al., 1986; Clark and Hooshmand, 1992; Stansfield and Kenyon, 1992; Wigglesworth and O’Loughlin, 1993; Chalhoub-Deville, 1995a; O’Loughlin, 1995; Fulcher, 1996; Lumley and O’Sullivan, 2000; O’Sullivan, 2000).
· Task/method comparison (qualitative): as above but where qualitative methods are employed (Shohamy, 1994; Young, 1995; Luoma, 1997; O’Loughlin, 1997; Bygate, 1999; Kormos, 1999).
· Task performance (method effect): where aspects of the task are systematically manipulated; e.g., planning time, pre- or post-task operations, etc. (Foster and Skehan, 1996; 1999; Wigglesworth, 1997; Mehnert, 1998; Ortega, 1999; Upshur and Turner, 1999).
· Native speaker/Nonnative speaker comparison: where native speaker performance on specific tasks is compared to nonnative speaker performance on the same tasks (Weir, 1983; Ballman, 1991).
· Task difficulty/classification: where an attempt has been made to classify tasks in terms of their difficulty (Weir, 1993; Fulcher, 1994; Kenyon, 1995; Robinson, 1995; Skehan, 1996; 1998; Norris et al., 1998).
The central importance of the test task has been clearly recognized; however, in terms of test validation, there is one question that has, to date, remained largely unexplored. Although there has been a great deal of debate over the validation of performance tests through analysis of the language generated in the performance of language elicitation tasks (LETs) (e.g., van Lier, 1989; Lazaraton, 1992; 1996), attention has not been drawn to the one aspect of task performance that would appear to be of most interest to the test designer. That is, when tasks are performed in a test event, how does that performance relate to the test designer’s predictions or expectations based on their definition or interpretation of the construct? After all, no matter how reliably the performance is scored, if it does not match the expectations of the test designer (in other words represent the constructs which are to be tested), then the inferences that the test designer hopes to draw from the evaluated performance will not be valid.
Cronbach went to the heart of the matter (1971: 443): ‘Construction of a test itself starts from a theory about behaviour or mental organization derived from prior research that suggests the ground plan for the test.’ Davies (1977: 63) argued in similar vein: ‘it is, after all, the theory on which all else rests; it is from there that the construct is set up and it is on the construct that validity, of the content and predictive kinds, is based.’ Kelly (1978: 8) supported this view, commenting that: ‘the systematic development of tests requires some theory, even an informal, inexplicit one, to guide the initial selection of item content and the division of the domain of interest into appropriate sub-areas.’ Because we lack an adequate theory of language in use, a priori attempts to determine the construct validity of proficiency tests involve us in matters that relate more evidently to content validity.
We need to talk of the communicative construct in descriptive terms and, as a result, we become involved in questions of content relevance and content coverage. Thus, for Kelly (1978: 8) content validity seemed ‘an almost completely overlapping concept’ with construct validity, and for Moller (1982: 68): ‘the distinction between construct and content validity in language testing is not always very marked, particularly for tests of general language proficiency.’
Content validity is considered important as it is principally concerned with the extent to which the selection of test tasks is representative of the larger universe of tasks of which the test is assumed to be a sample (see Bachman and Palmer, 1981; Henning, 1987: 94; Messick, 1989: 16; Bachman, 1990: 244). Similarly, Anastasi (1988: 131) defined content validity as involving: ‘essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured.’ She outlined (Anastasi, 1988: 132) the following guidelines for establishing content validity:
1) ‘the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions’;
2) ‘the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared’;
3) ‘content validity depends on the relevance of the individual’s test responses to the behaviour area under consideration, rather than on the apparent relevance of item content.’
The directness of fit and adequacy of the test sample is thus dependent on the quality of the description of the target language behaviour being tested. In addition, if the responses to the item are invoked, Messick (1975: 961) suggests ‘the concern with processes underlying test responses places this approach to content validity squarely in the realm of construct validity’. Davies (1990: 23) similarly notes: ‘content validity slides into construct validity’.
Content validation is, of course, extremely problematic given the difficulty we have in characterizing language proficiency with sufficient precision to ensure the validity of the representative sample we include in our tests, and the further threats to validity arising out of any attempts to operationalize real life behaviours in a test. Specifying operations, let alone the conditions under which these are performed, is challenging and at best relatively unsophisticated (see Cronbach, 1990). Weir (1993) provides an introductory attempt to specify the operations and conditions that might form a framework for test task description (see also Bachman, 1990; Bachman and Palmer, 1996).
The difficulties involved do not, however, absolve us from attempting to make our tests as relevant as possible in terms of content. Generating content-related evidence is seen as a necessary, although not sufficient, part of the validation process of a speaking test. To this end we sought to establish in this study an effective and efficient procedure for establishing the content validity of speaking tests. As well as being useful in helping specify the domain to be tested, we would argue that the checklist discussed below would enable the researcher to address how predicted vs. actual task performance can be compared.
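The comparison of predicted and actual task performance can be illustrated with a minimal sketch. It assumes, purely for illustration, that a checklist can be reduced to a set of function labels ticked by an observer; the labels, the `compare_checklist` helper and the `coverage` ratio are hypothetical and are not taken from the UCLES checklists themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistComparison:
    matched: frozenset     # functions both predicted and observed
    missing: frozenset     # functions predicted but not observed
    unexpected: frozenset  # functions observed but not predicted

    @property
    def coverage(self) -> float:
        """Share of the predicted functions that were actually observed."""
        predicted_total = len(self.matched) + len(self.missing)
        # With nothing predicted there is nothing to miss, so report full coverage.
        return len(self.matched) / predicted_total if predicted_total else 1.0


def compare_checklist(predicted, observed) -> ChecklistComparison:
    """Compare the functions a task was designed to elicit with those ticked by an observer."""
    predicted, observed = frozenset(predicted), frozenset(observed)
    return ChecklistComparison(
        matched=predicted & observed,
        missing=predicted - observed,
        unexpected=observed - predicted,
    )


# Hypothetical function labels, for illustration only.
predicted = {"describing", "comparing", "speculating", "agreeing"}
observed = {"describing", "comparing", "agreeing", "suggesting"}

result = compare_checklist(predicted, observed)
print(sorted(result.missing))     # ['speculating']
print(sorted(result.unexpected))  # ['suggesting']
print(result.coverage)            # 0.75
```

A set-based comparison records only whether a function occurred at all, which mirrors a tick-box checklist; it says nothing about how often or how well a function was realized.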
III Methodological issues
While it is relatively easy to rationalize the need to establish that the LETs used in performance tests are working as predicted (i.e., in terms of language generated), the difficulty lies in how this might best be done.
UCLES EFL (English as a foreign language) routinely collects audio recordings and carries out transcriptions of its Speaking Tests.
These transcripts are used for a range of validation purposes, and in particular they contribute to revision projects for the Speaking Tests, for example, FCE which was revised in 1996, and currently the revision of the International English Language Testing System (IELTS) Speaking Test, in addition to the CPE revision project.
In a series of UCLES studies focusing on the language of the Speaking Tests, Lazaraton has applied conversational analysis (CA) techniques to contribute to our understanding of the language used in pair-format Speaking Tests, including the language of the candidates and the interlocutor. Her approach requires a very careful, fine-tuned transcription of the tests in order to provide the data for analysis (see Lazaraton, 2000). Similar qualitative methodologies have been applied by Young and Milanovic (1992) – also to UCLES data – by Brown (1998) and by Ross and Berwick (1992), amongst others.
While there is clearly a great deal of potential for this detailed analysis of transcribed performances, there are also a number of drawbacks, the most serious of which involves the complexity of the transcription process. In practice, this means that a great deal of time and expertise is required in order to gain the kind of data that will answer the basic question concerning validity. Even where this is done, it is impractical to attempt to deal with more than a small number of test events;