Test-task validation has been an important strand in recent revision projects for the University of Cambridge Local Examinations Syndicate (UCLES) …
therefore, the generalizability of the results may be questioned.
Clearly, then, a more efficient methodology is required: one that allows the test designer to evaluate the procedures and, especially, the tasks in terms of the language produced by a larger number of candidates.
Ideally this should be possible in ‘real’ time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population.
The primary objective of this project, therefore, was to create an instrument built on a framework that describes the language of performance in a way that can be readily accessed by evaluators who are familiar with the tests being observed. This work is designed to be complementary to the use of transcriptions and to provide an additional source of validation evidence.
The FCE was chosen as the focus of this study for a number of reasons:
· It is ‘stable’, in that it is neither under review nor due to be reviewed.
· It represents the middle of the ALTE (and UCLES Main Suite) range, and is the most widely subscribed test in the battery.
· It offers the greatest likelihood of a wide range of performance of any Main Suite examination: as it is often used as an ‘entry-point’ into the suite, candidates tend to range from below to above this level in terms of ability.
· As with all of the other Main Suite examinations, a database of recordings (audio and video) already existed.
IV The development of the observation checklists

Weir (1993), building on the earlier work of Bygate (1988), suggests that the language of a speaking test can be described in terms of the informational and interactional functions, and those of interaction management, generated by the participants involved. With this as a starting point, a group of researchers at the University of Reading was commissioned by UCLES EFL to examine the spoken language, second language acquisition and language testing literatures and to come up with an initial set of such functions (see Schegloff et al., 1977; Schwartz, 1980; van Ek and Trim, 1984; Bygate, 1988; Shohamy, 1988; 1994; Walker, 1990; Weir, 1994; Stenström, 1994; Chalhoub-Deville, 1995b; Hayashi, 1995; Ellerton, 1997; Suhua, 1998; Kormos, 1999; O’Sullivan, 2000; O’Loughlin, 2001).
These were then presented as a draft set of three checklists (Appendix 1), representing each of the elements of Weir’s categorization. The three phases of the development process described below (Section VI) represent an attempt to customize the checklists to reflect more closely the intended outcomes of spoken language test tasks in the UCLES Main Suite. The checklists were designed to help establish which of these functions were produced, and which were absent.
The next concern was the development of a procedure for devising a ‘working’ version of the checklists, to be followed by an evaluation of using this type of instrument in ‘real’ time (using tapes or perhaps live speaking tests).
V The development model

Figure 3  The development model

The process through which the checklists were developed is shown in Figure 3. The concept that drives this model is the evaluation at each level by different stakeholders. At this stage of the project these stakeholders were identified as:
· the consulting ‘expert’ testers (the University of Reading group);
· the test development and validation staff at UCLES;
· UCLES Senior Team Leaders (i.e., key staff in the oral examiner training system).
All these individuals participated in the application of each draft. It should also be noted that a number of drafts were anticipated.
VI The development process

In order to arrive at a working version of the checklists, a number of developmental phases were anticipated. At each phase, the latest version (or draft) of the instruments was applied and this application evaluated.
Phase 1

The first attempt to examine how the draft checklists would be viewed, and applied, by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.
In their introduction to the application of the Observation Checklists (OCs), the participants were given a series of activities that focused on the nature and use of those functions of language seen by task designers at UCLES to be particularly applicable to their EFL Main Suite Speaking Tests (principally FCE, CAE and CPE). Once familiar with the nature of the functions (and where they might occur in a test), the participants applied the OCs in ‘real’ time to an FCE Speaking Test from the 1998 Standardization Video. This video featured a pair of French speakers who were judged by a panel of ‘expert’ raters (within UCLES) to be slightly above the criterion (‘pass’) level.
Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group, there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious given the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find a striking degree of agreement on all but a small number of functions (Appendix 2). Note that, in order to make these patterns of behaviour clear, the data have been sorted both horizontally and vertically by the total number of observations made by each participant and of each item.
From this perspective, this aspect of the developmental process was considered to be quite successful. However, it was apparent that there were a number of elements within the checklists that were causing some difficulty. These are highlighted in the table by the tram-lines.
Items above the lines were identified by only some participants, in one case by a single person, while those below were observed by a majority of participants (in two cases by all of them). For both of these groups, we might infer a high degree of agreement. However, the middle range of items appears to have caused a degree of confusion, and so these are highlighted here, i.e., marked for further investigation.
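The two-way sort used to expose these agreement patterns can be illustrated with a short sketch. The participant labels, function names and counts below are invented for illustration; the principle is simply to reorder rows (participants) and columns (functions) by their totals so that blocks of high and low agreement become visible.

```python
# Sketch of the two-way sort described above: rows are participants,
# columns are checklist functions, cells are observation counts.
# All labels, function names and counts here are invented.

observations = {
    "P1": {"informing": 4, "agreeing": 2, "justifying": 0},
    "P2": {"informing": 3, "agreeing": 0, "justifying": 0},
    "P3": {"informing": 5, "agreeing": 1, "justifying": 1},
}

functions = list(next(iter(observations.values())))

# Sort participants (vertically) by their total number of observations,
# and functions (horizontally) by how often they were recorded overall.
row_totals = {p: sum(counts.values()) for p, counts in observations.items()}
col_totals = {f: sum(obs[f] for obs in observations.values()) for f in functions}

sorted_participants = sorted(observations, key=row_totals.get, reverse=True)
sorted_functions = sorted(functions, key=col_totals.get, reverse=True)

# The reordered matrix places the most-observed items and the most
# prolific observers first, making agreement patterns easier to read.
for p in sorted_participants:
    print(p, [(f, observations[p][f]) for f in sorted_functions])
```

With real data, rarely observed functions cluster at one edge of the reordered table and near-universally observed ones at the other, leaving the ambiguous middle band visible between the tram-lines.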
Phase 2

In this phase, a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting, all participants were asked to study the existing checklists and to exemplify each function with examples drawn from their experience of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.
During this session many questions were asked of all aspects of the checklists, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about, and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents it is in itself a valuable source of data, in that it provides a significant record of the developmental process.
Among the many interesting outcomes of this phase were the decisions to rethink, reorganize or omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not, in fact, present, while 6 appeared to be confused, with very mixed reported observations), 7 were recommended by the panel for omission or inclusion in other items, while the remaining 6 were identified as being of value. Although no examples of the latter had appeared in the earlier data, the panel agreed that they represented language functions that the UCLES Main Suite examinations were intended to elicit.
It was also decided that each item in this latter group was in need of further clarification and/or exemplification. Of the remaining 17 items:
· two were changed: the item ‘analysing’ was recoded as ‘staging’ in order to clarify its intended meaning, while it was decided to separate the item ‘(dis)agreeing’ into its two separate components;
· three were omitted: it was argued that the item ‘providing non-personal information’ referred to what was happening with the other items in the informational function category, while the items ‘explaining’ and ‘justifying/supporting’ were not functions usually associated with the UCLES Main Suite tasks, and no occurrences of them had been noted.
We would emphasize that, as reported in Section IV above, the initial list was developed to cover the language functions that various spoken language test tasks might elicit. The development of the checklists described here reflects an attempt to customize the lists, in line with the intended functional outcomes of a specific set of tests.
We are, of course, aware that closed instruments of this type may be open to the criticism that valuable information could be lost. However, for reasons of practicality, we felt it necessary to limit the list to what the examinations were intended to elicit, rather than attempt to operationalize a full inventory. In addition, any functions that appeared in the data but were not covered by the reduced list would have been noted; there appeared to be no cases of this.
The data from these two phases were combined to result in a working version of the checklists (Appendix 3), which was then applied to a pair of FCE Speaking Tests in Phase 3.
Phase 3

In the third phase, the revised checklists were given to a group of 15 MA TEFL students who were asked to apply them to two FCE tests.
Both of these tests involved a mixed-sex pair of learners, one pair of approximately average ability and the other pair above average.
Before using the OCs, the students were first asked to attempt to predict which functions they might expect to find. To help with this pre-session task, the students were given details of the FCE format and tasks.
Unfortunately, a small number of students did not manage to complete the observation task, as they were somewhat overwhelmed by the real-time application of the checklists. As a result, only 12 sets of completed checklists were included in the final analysis.
Prior to the session, the group was given an opportunity to have a practice run using a third FCE examination. While this ‘training’ period, coupled with the pre-session task, was intended to provide the students with the background they needed to apply the checklists consistently, there was a problem during the session itself. This problem was caused by the failure of a number of students to note the change from Task 3 to Task 4 in the first test observed. This was possibly caused by a lack of awareness of the test itself, and was not helped by the seamless way in which the examiner on the video moved from a two-way discussion involving the test-takers to a three-way discussion. This meant that a full set of data exists only for the first two tasks of this test. As the problem was noticed in time, it did not affect the second test.

Unlike the earlier seminar, on this occasion the participants were asked only to record each function when it was first observed. This was done because the earlier seminar had shown that, without extensive training, it would be far too difficult to apply the OCs fully in ‘real’ time in order to generate comprehensive frequency counts. We are aware that a full tally would enable us to draw more precise conclusions about the relative frequency of occurrence of these functions and the degree of consensus (reliability) among observers.
Against this we must emphasize that the checklists, in their current stage of development, are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement, in this case, is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.
The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.
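Because observers recorded only the presence or absence of each function, agreement at this stage reduces to the share of observers who ticked each item. A minimal sketch of that coding follows; the observer records and function labels are invented, and the 75%/25% thresholds are illustrative rather than values used in the study.

```python
# Sketch of presence/absence agreement across observers.
# Observer records, function labels and thresholds are invented.

# Each observer's record: the set of functions they marked as present.
observer_marks = [
    {"comparing", "describing"},
    {"comparing", "describing", "speculating"},
    {"comparing", "speculating"},
    {"comparing", "describing"},
]

functions = ["comparing", "describing", "speculating", "hypothesizing"]
n = len(observer_marks)

results = {}
for f in functions:
    present = sum(f in marks for marks in observer_marks)
    share = present / n
    # Items ticked by nearly all observers, or by nearly none, show
    # high agreement; mid-range shares correspond to the confused
    # "tram-line" items flagged for further investigation.
    status = "agreed" if share >= 0.75 or share <= 0.25 else "mixed"
    results[f] = (present, status)
    print(f"{f}: {present}/{n} ({status})")
```

Note that, as stated in the text, this kind of tally shows only whether observers agreed that a function occurred somewhere in the task; it cannot show whether they were all reacting to the same utterance.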
Phase 4

In this phase a transcription was made of the second of the two interviews used in Phase 3, since a full set of data was available for this interview. The OCs were then ‘mapped’ on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers, who initially worked independently of each other, but then discussed their finished work in order to arrive at a consensus.
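The independent-mapping-then-consensus step can be sketched as follows. The turn numbers and function labels are invented; the point is simply that each researcher tags every transcript turn with checklist functions, and only the turns where the two mappings diverge need to be discussed.

```python
# Sketch of the two-coder mapping step: each researcher independently
# tags transcript turns with checklist functions, then disagreements
# are identified for the consensus discussion.
# Turn numbers and function labels are invented for illustration.

coder_a = {1: {"informing"}, 2: {"agreeing"}, 3: {"comparing", "informing"}}
coder_b = {1: {"informing"}, 2: {"disagreeing"}, 3: {"comparing"}}

# Turns where the two mappings differ are flagged for discussion.
to_discuss = {
    turn: (coder_a[turn], coder_b[turn])
    for turn in coder_a
    if coder_a[turn] != coder_b[turn]
}
print(sorted(to_discuss))  # turns needing a consensus decision
```

Working independently first, and reconciling only the flagged turns afterwards, keeps the eventual consensus map from simply reproducing one coder's initial reading.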
Finally, the results of Phases 2 and 3 were compared (Appendix 5). The degree of correspondence between them clearly indicates that the checklists are now working well.