Foundations and Trends® in Information Retrieval
Vol. 3, Nos. 1–2 (2009) 1–224
© 2009 D. Kelly
Methods for Evaluating Interactive Information Retrieval Systems with Users
By Diane Kelly
Contents

1 Introduction
1.1 Purpose and Scope
1.2 Sources and Recommended Readings
1.3 Outline of Paper
2 What is Interactive Information Retrieval?
3 Background
3.1 Cognitive Viewpoint in IR
3.2 Text Retrieval Conference
4 Approaches
4.1 Exploratory, Descriptive and Explanatory Studies
4.2 Evaluations and Experiments
4.3 Laboratory and Naturalistic Studies
4.4 Longitudinal Studies
4.5 Case Studies
4.6 Wizard of Oz Studies and Simulations
5 Research Basics
5.1 Problems and Questions
5.2 Theory
5.3 Hypotheses
5.4 Variables and Measurement
5.5 Measurement Considerations
5.6 Levels of Measurement
6 Experimental Design
6.1 Traditional Designs and the IIR Design
6.2 Factorial Designs
6.3 Between- and Within-Subjects Designs
6.4 Rotation and Counterbalancing
6.5 Randomization and User Choice
6.6 Study Mode
6.7 Protocols
6.8 Tutorials
6.9 Timing and Fatigue
6.10 Pilot Testing
7 Sampling
7.1 Probability Sampling
7.2 Non-Probability Sampling Techniques
7.3 Subject Recruitment
7.4 Users, Subjects, Participants and Assessors
8 Collections
8.1 Documents, Topics, and Tasks
8.2 Information Needs: Tasks and Topics
9 Data Collection Techniques
9.1 Think-Aloud
9.2 Stimulated Recall
9.3 Spontaneous and Prompted Self-Report
School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA, firstname.lastname@example.org

Abstract

This paper provides an overview of, and instruction in, the evaluation of interactive information retrieval systems with users. The primary goal of this article is to catalog and compile material related to this topic into a single source. This article (1) provides historical background on the development of user-centered approaches to the evaluation of interactive information retrieval systems; (2) describes the major components of interactive information retrieval system evaluation; (3) describes different experimental designs and sampling strategies; (4) presents core instruments, data collection techniques and measures; (5) explains basic data analysis techniques; and (6) reviews and discusses previous studies. This article also discusses validity and reliability issues with respect to both measures and methods, presents background information on research ethics and discusses some ethical issues which are specific to studies of interactive information retrieval (IIR). Finally, this article concludes with a discussion of outstanding challenges and future research directions.
1 Introduction

Information retrieval (IR) has experienced huge growth in the past decade as increasing numbers and types of information systems are being developed for end-users. The incorporation of users into IR system evaluation and the study of users' information search behaviors and interactions have been identified as important concerns for IR researchers. While the study of IR systems has a prescribed and dominant evaluation method that can be traced back to the Cranfield studies, studies of users and their interactions with information systems do not have well-established methods. For those interested in evaluating interactive information retrieval systems with users, it can be difficult to determine how to proceed from a scan of the literature, since guidelines for designing and conducting such studies are for the most part missing.
In interactive information retrieval (IIR), users are typically studied along with their interactions with systems and information. While classic IR studies leave humans out of the evaluation model, IIR focuses on users' behaviors and experiences (physical, cognitive and affective) and on the interactions that occur between users and systems, and between users and information. In simple terms, classic IR evaluation asks the question: does this system retrieve relevant documents? IIR evaluation asks the question: can people use this system to retrieve relevant documents? IIR studies include both system evaluations and more focused studies of users' information search behaviors and their interactions with systems and information. IIR is informed by many fields, including traditional IR, information and library science, psychology, and human–computer interaction (HCI). IIR has often been presented more generally as a combination of IR and HCI, or as a sub-area of HCI, but Ruthven argues convincingly that IIR is a distinct research area. Recently, there has been interest in HCIR, or human–computer information retrieval, but this looks similar to IIR and papers about this area have not established its uniqueness.
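To make the contrast between the two questions concrete, consider how compactly the classic question can be operationalized. The sketch below is an illustrative example, not taken from this paper; the function name, the cutoff k, and the toy data are assumptions. It scores a single fixed ranking against a set of relevance judgments, Cranfield-style, with no user in the loop. No equally compact sketch exists for the IIR question, because what gets retrieved depends on what each user does.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k=10):
    """Cranfield-style batch evaluation: score one fixed ranked list
    against pooled relevance judgments; no user is involved."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical ranking and judgment set for one topic.
p, r = precision_recall_at_k(["d3", "d7", "d1", "d9"], {"d3", "d1", "d5"}, k=4)
print(f"P@4 = {p:.2f}, R@4 = {r:.2f}")  # P@4 = 0.50, R@4 = 0.67
```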
The proposition that IR systems are fundamentally interactive and should be evaluated from the perspective of users is not new. A review of the IR literature reveals that many leaders in the field were writing about and studying interactive IR systems during the early years of IR research. For instance, Salton wrote a paper entitled "Evaluation problems in interactive information retrieval," which was published in 1970.
In this paper, Salton identified user effort measures, including the attitudes and perceptions of users, as important components of IR evaluation. Cleverdon et al. identified presentation issues and user effort as important evaluation measures for IR systems, along with recall and precision. Tague and Schultz discussed the notion of user friendliness.
Some of the first types of IR interactions were associated with relevance feedback. Looking closely at this seemingly simple type of interaction, we see the difficulties inherent in IIR studies. Assuming that users are provided with information needs, each user is likely to enter a different query, which will lead to different search results and different opportunities for relevance feedback. Each user, in turn, will provide different amounts of feedback, which will create new lists of search results. Furthermore, the causes and consequences of these interactions cannot be observed easily, since much of this exists in the user's head. The actions that are available for observation (querying, saving a document, providing relevance feedback) are surrogates of cognitive activities. From such observable behaviors we must infer cognitive activity; for instance, users who save a document may do so because it changes or adds to their understanding of their information needs.
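As a concrete illustration of why this interaction cascades differently for each person, the sketch below implements the classic Rocchio feedback formula (a standard technique used here for illustration only; this paper does not prescribe it, and the parameter values shown are conventional defaults, not recommendations). Two users who judge different documents relevant steer the same initial query toward different reformulations, and hence toward different next result lists.

```python
import numpy as np

def rocchio_update(query_vec, rel_vecs, nonrel_vecs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector toward the
    centroid of documents judged relevant and away from the centroid
    of documents judged non-relevant."""
    updated = alpha * np.asarray(query_vec, dtype=float)
    if len(rel_vecs) > 0:
        updated += beta * np.mean(rel_vecs, axis=0)
    if len(nonrel_vecs) > 0:
        updated -= gamma * np.mean(nonrel_vecs, axis=0)
    return np.clip(updated, 0.0, None)  # negative weights are conventionally dropped

# Two users issue the same query but judge different documents relevant,
# so the reformulated queries (and the next result lists) diverge.
q = [1.0, 0.0, 1.0]
user_a = rocchio_update(q, rel_vecs=[[1.0, 1.0, 0.0]], nonrel_vecs=[])
user_b = rocchio_update(q, rel_vecs=[[0.0, 0.0, 1.0]], nonrel_vecs=[])
print(user_a)  # [1.75 0.75 1.  ]
print(user_b)  # [1.   0.   1.75]
```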
User–system interactions are influenced by a number of other factors that are neither easily observable nor measurable. Each individual user has a different cognitive composition and behavioral disposition. Users vary according to all sorts of factors, including how much they know about particular topics, how motivated they are to search, how much they know about searching, how much they know about the particular work or search task they need to complete, and even their expectations and perceptions of the IIR study [139, 194]. Individual variation in these factors means that it is difficult to create an experimental situation that all people will experience in the same way, which, in turn, makes it difficult to establish causal relationships. Moreover, measuring these factors is not always practical, since there are likely a large number of factors and no established measurement practices.
The inclusion of users in any study necessarily makes IIR, in part, a behavioral science. As a result, appropriate methods for studying interactive IR systems must unite research traditions from two sciences, which can be challenging. It is also the case that different systems, interfaces and use scenarios call for different methods and metrics, and studies of behavior and interaction suggest research designs that go beyond evaluation. For these reasons, there is no strong evaluation or experimental framework for IIR evaluations as there is for IR studies.
IIR researchers are able to make many choices about how to design and conduct their evaluations, but there is little guidance about how to do this.
1.1 Purpose and Scope

There is a small body of research on evaluation models, methods, and metrics for IIR, but such studies are the exception rather than the rule (e.g., [34, 149]). In contrast to other disciplines where studies of methods and experimental design comprise an important portion of the literature, there are few, if any, research programs in IIR that investigate these issues, and there is little formal guidance about how to conduct such studies, despite a long-standing call for such work. Tague's [260, 262] work and select chapters of the edited volume by Spärck Jones provide good starting points, but these writings are 15 to 20 years old. While it might be argued that Spärck Jones' book still describes the basic methodology behind traditional IR evaluations, Tague's work, which focuses on user-centered methods, needs updating given changes in search environments, tasks, users, and measures. It is also the case that Tague's work does not discuss data analysis. One might consult a statistics textbook for this type of information, but it can sometimes be difficult to develop a solid understanding of these topics unless they are discussed within the context of one's own area of study.
The purpose of this paper is to provide a foundation on which those new to IIR can make more informed choices about how to design and conduct IIR evaluations with human subjects.1 The primary goal is to catalog and compile material related to the IIR evaluation method into a single source. This paper proposes some guidelines for conducting one basic type of IIR study: laboratory evaluations of experimental IIR systems. This is a particular kind of IIR study, but not the only kind. This paper is also focused more on quantitative methods than on qualitative ones. This is not a statement of value or importance, but a choice necessary to maintain a reasonable scope for this paper.
This article does not prescribe a step-by-step recipe for conducting IIR evaluations. The design of IIR studies is not a linear process and it would be imprudent to present the design process in this way. Typically, method design occurs iteratively, over time. Design decisions are interdependent; each choice impacts other choices. Understanding the possibilities and limitations of different design choices helps one make better decisions, but there is no single method that is appropriate for all study situations. Part of the intellectual work of IIR is the method design itself. Prescriptive methods imply that research can only be done in one way and often prevent researchers from discovering better ways of doing things.

1 The terms user and subject are often used interchangeably in published IIR studies. A distinction between these terms will be made in Section 7. Since this paper focuses primarily on laboratory evaluations, the term subject will be used when discussing issues related to laboratory evaluations and user will be used when discussing general issues related to all IIR studies. Subject is used to indicate a person who has been sampled from the user population to be included in a study.
The focus of this paper is on text retrieval systems. The basic methodological issues presented in this paper are relevant to other types of IIR systems, but each type of IIR system will likely introduce its own special considerations and issues. Additional attention is given to the study of different types of IIR systems in the final section of this paper.
Digital libraries, a specific setting where IIR occurs, are also not discussed explicitly, but again, much of the material in this paper will be relevant to those working in this area.
Finally, this paper surveys some of the work that has been conducted in IIR. The survey is not intended to be comprehensive. Many of the studies cited are used to illustrate particular evaluation issues rather than to reflect the state of the art in IIR. For a current survey of research in IIR, see Ruthven. For a more historical perspective, see Belkin and Vickery.
1.2 Sources and Recommended Readings

A number of papers about evaluation have been consulted in the creation of this paper and have otherwise greatly influenced its content. As mentioned earlier, the works of Tague [260, 262, 263, 264] and Tague and Schultz are seminal pieces. The edited volume by Spärck Jones also formed a foundation for this paper.
Other research devoted to the study and development of individual components or models for IIR evaluation has also influenced this paper. Borlund [32, 34] has contributed much to IIR evaluation with her studies of simulated information needs and evaluation measures.
Haas and Kraft reviewed traditional experimental designs and related these to information science research. Ingwersen and Järvelin present a general discussion of methods used in information seeking and retrieval research. Finally, the TREC Interactive Track and all of the participants in this Track over the years have made significant contributions to the development of an IIR evaluation framework.
Review articles have been written about many topics discussed in this paper. These articles include Sugar's review of user-centered perspectives in IR and Turtle et al.'s review of interactive IR research, as well as Ruthven's more recent version. The Annual Review of Information Science and Technology (ARIST) has also published many chapters on evaluation over its 40-year history, including King's article on the design and evaluation of information systems, Kantor's review of feedback and its evaluation in IR, Rorvig's review of psychometric measurement in IR, Harter and Hert's review of IR system evaluation, and Wang's review of methodologies and methods for user behavior research.