«Test-task validation has been an important strand in recent revision projects for University of Cambridge Local Examinations Syndicate (UCLES) ...»
There are still some problems with items such as ‘staging’ and ‘describing’, and feedback from participants suggests that these may be due to misunderstandings or misinterpretations of the gloss and examples used. Similar difficulties arise with the initial three items in the interactional functions checklist, which is where the greatest difficulties in applying the checklists appear to lie.
VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible, and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.
1 Validities We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for, as Messick argues (1989: 16): ‘the varieties of evidence supporting validity are not alternatives but rather supplements to one
another.’ We recognize the necessity for a broad view of ‘the evidential basis for test interpretation’ (Messick, 1989: 20). Bachman (1990:
237) similarly concludes: ‘it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores’ (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he
quotes Messick (1989: 41):
the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements... selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.
Validating speaking-test tasks

Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand.
Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e. the true content of tasks.
2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice similar to that given to raters if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists, this process has already begun with the initial versions piloted during Phase 3 of the project.
The checklists have great potential as an evaluative tool. It is hoped that, amongst other issues, they will provide insights into the following:
· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;
· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;
· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.
In addition to these issues, the way in which the checklists can be applied may allow other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks; i.e., the evaluators will have the time they need to make frequency counts of the functions.
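As a minimal illustration of such frequency counts, the sketch below tallies how often each language function is observed within each task. The task labels, function labels and observation data are hypothetical, invented purely for illustration; only the counting logic reflects the procedure described above.

```python
from collections import Counter

# Hypothetical observation log an evaluator might build while stopping
# and starting a recording: one (task, function) pair per observed
# occurrence. Labels are illustrative, not taken from the checklists.
observations = [
    ("Task 1", "informational: describing"),
    ("Task 1", "informational: describing"),
    ("Task 1", "interactional: agreeing"),
    ("Task 4", "interactional: agreeing"),
    ("Task 4", "interaction management: reciprocity"),
]

def function_frequencies(obs):
    """Return, for each task, a Counter of observed language functions."""
    counts = {}
    for task, function in obs:
        counts.setdefault(task, Counter())[function] += 1
    return counts

freqs = function_frequencies(observations)
print(freqs["Task 1"]["informational: describing"])  # 2
print(freqs["Task 4"]["interactional: agreeing"])    # 1
```

Per-task counts of this kind would make it possible to compare how far the functions intended by a task's designers actually materialize in candidate performance.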
While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions) it will be possible to explore predicted and actual test task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage.
Through this approach it should be possible to predict linguistic response more accurately (in terms of the elements of the checklists) and to apply this to the design of test tasks – and of course to evaluate the success of the prediction later on. In the longer term this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management – a notion supported by Bygate (1999).
The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.
Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.
We would like to thank Don Porter and Rita Green for their early input into the rst version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini, from a group of UCLES Senior Team Leaders and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.
References
Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners’ language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: ‘Screen to Screen’ testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics, Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece. October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation, and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
——1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M.
and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3.
Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests.
Studies in Language Testing, Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O’Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking.
Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing, Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O’Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD Thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.