MTR 04B0000017
MITRE TECHNICAL REPORT

Confirmation Bias in Complex Analyses

October 2004

Brant A. Cheikes†
Mark J. Brown
Paul E. Lehner

Approved for Public Release; Distribution Unlimited
Sponsor: MITRE Sponsored Research
Dept. No.: G062
Project No.: 51MSR114-A4
The views, opinions and/or findings contained in this report are those of The MITRE Corporation and should not be construed as an official Government position, policy, or decision, unless designated by other documentation.
©2004 The MITRE Corporation. All Rights Reserved.
Center for Integrated Intelligence Systems
Bedford, Massachusetts

† E-mail: firstname.lastname@example.org
‡ George Mason University

Abstract

Most research investigating the confirmation bias involves
experimental tasks where subjects draw inferences from just a few items of evidence. These tasks are not representative of complex analysis tasks characteristic of law enforcement investigations, financial analysis and intelligence analysis. This study examines the confirmation bias in a more complex analysis task and evaluates a recommended procedure, called Analysis of Competing Hypotheses (ACH), designed to mitigate confirmation bias. Results indicate that participant assessment of new evidence was significantly impacted by beliefs they held at the time evidence was received.
Evidence confirming current beliefs was given more “weight” than disconfirming evidence.
However, current beliefs did not influence the assessment of whether an evidence item was confirming or disconfirming. ACH did reduce confirmation bias, but the effect was limited to participants without professional analysis experience.
Table of Contents

1 Introduction
2 Method
  2.1 Participants
  2.2 Procedures
    2.2.1 Procedures for ACH Group
    2.2.2 Procedures for the Non-ACH Group
  2.3 Information Manipulation
3 Results
  3.1 Anchoring Effect
  3.2 Confirmation Bias
1 Introduction

Wickens and Hollands (2000, p. 312) define the confirmation bias as a tendency “for people to seek information and cues that confirm the tentatively held hypothesis or belief, and not seek (or discount) those that support an opposite conclusion or belief.” Klayman and Ha (1987) have correctly pointed out that “positive testing” may be the only strategy to obtain critical falsifications, for example, in cases where the hypothesis is not yet defined specifically enough to be falsified by one instance. However, the concern is that in cases where the hypotheses are well defined, the tendency for people to seek confirming information might result in “cognitive tunnel vision … in which operators fail to encode or process information that is contradictory to or inconsistent with the initially formulated hypothesis” (Wickens & Hollands, 2000, p. 312). This “… may be dangerous because potential risks and warning signals may be overlooked and, thus, decision fiascos may be the consequence” (Jonas, Schulz-Hardt, Frey, & Thelen, 2001, p. 557). Indeed, anecdotal evidence (e.g., the Senate Intelligence Committee Report, 2004) suggests that recent intelligence analysis failures may be due in part to confirmation bias.1

The concept of a confirmation bias was introduced by Wason (1960), who used a “rule
identification task” such as the following (from Bazerman, 2002, p. 34):
Imagine that the sequence of three numbers (e.g., 2-4-6) follows a rule. Your task is to diagnose that rule by writing down another sequence of 3 numbers. Your instructor will tell you whether or not your sequence follows the correct rule.
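To make the structure of this task concrete, the following is a minimal sketch (our illustration, not Wason's materials) of why purely positive testing never falsifies a wrong hypothesis; the true rule, “any three ascending numbers,” is taken from the study as described below:

```python
# Sketch of Wason's (1960) 2-4-6 rule identification task.

# True rule (unknown to the subject): any three strictly ascending numbers.
def true_rule(seq):
    a, b, c = seq
    return a < b < c

# The subject's tentative hypothesis: numbers that go up by two.
def hypothesized_rule(seq):
    a, b, c = seq
    return b - a == 2 and c - b == 2

# Positive testing: propose only sequences the hypothesis says should pass.
positive_tests = [(1, 3, 5), (10, 12, 14), (0, 2, 4)]
# Every positive test is answered "yes" by the true rule, so the (wrong)
# hypothesis is never falsified.
assert all(true_rule(s) for s in positive_tests)

# Negative testing: propose a sequence the hypothesis says should FAIL.
# (2, 3, 4) violates the "+2" hypothesis but still follows the true rule,
# so the experimenter's "yes" would falsify the hypothesis.
negative_test = (2, 3, 4)
assert not hypothesized_rule(negative_test)
assert true_rule(negative_test)
```
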
The typical result in such tasks is that people tend to generate number sequences that are consistent with (or confirm) the rule that they think is the correct rule, such as 1-3-5 if one believes the rule is “numbers that go up by two.” People seldom generate sequences that try to disconfirm the rule, which in this study was “any three ascending numbers.”

In addition to rule identification tasks, three other types of conceptual tasks have routinely been used to study the confirmation bias. One has been the “trait hypothesis-testing paradigm” (Galinsky & Moskowitz, 2000, p. 398). In this paradigm, participants are given a narrative describing a person and then asked to decide whether the person described in the narrative possesses one or more traits (e.g., self control) by either (a) selecting information that confirms or disconfirms the focal hypothesis, or (b) recalling previously presented information, some of which confirms or disconfirms the focal hypothesis. Another conceptual task is the “pseudo-diagnosticity” task (Evans et al., 2002, p. 32), originally proposed by Doherty, Mynatt, Tweney, and Schiavo (1979). In this task, participants are typically asked to indicate which of three pieces of data about two hypotheses they would select to answer a particular question, with the typical finding being that they request data about the focal hypothesis (confirmation bias) because it seems (incorrectly) most diagnostic in answering the question. The last type of conceptual task is the “scientific inquiry” task (Koslowski & Maqueda, 1993, p. 105), proposed by Mynatt, Doherty, and Tweney (1977). In this task, participants select a small number of science tests designed to generate data confirming or disconfirming their hypothesis.

1 Although we note that such “anecdotal evidence” may itself be an instance of confirmation bias.
Research with all four types of tasks has used minimal data (e.g., fewer than 10 data items) that did not vary in interpretability or reliability, and sometimes not even in diagnosticity. This level of artificiality raises concerns as to whether confirmation bias is characteristic of more complex analysis tasks where there is substantial evidence and the evidence items vary greatly in interpretability, reliability and diagnosticity.
One effort to experimentally investigate confirmation bias in a more representative setting was Tolcott, Marvin, and Lehner (1989), who worked with Army tactical intelligence analysts.
Working in teams of two, analysts were given an initial battlefield scenario and then asked to estimate the most likely avenue of approach (of three possible) for the enemy’s attack and their degree of confidence for it on a 0 to 100 scale. They were then given three rounds of incoming intelligence data. Each round contained 15 pieces of data, three supporting each of the two most likely avenues of approach, and nine being neutral. The analysts provided a new estimate and confidence level after each round. After the third round, they rated the degree to which each of the 45 pieces of intelligence data (presented during the three rounds) supported or contradicted the avenue of approach (hypothesis) they considered most likely, on a -2 to +2 scale. Tolcott et al.
(1989, p. 606) found that “Regardless of initial hypothesis, confidence was generally high and tended to increase as the situation evolved. Confirming evidence was sought, and weighted significantly higher than disconfirming evidence. Contradictory evidence was usually recognized as disconfirming, but was weighted lower than supportive evidence, was often regarded as neutral, and sometimes as deliberatively deceptive.” Consistent with Wickens & Hollands (2000, p. 311), we refer to the Tolcott et al. findings that participants did not change their confidence in the initial hypothesis (even given evidence inconsistent with it) as representing an anchoring effect (or heuristic), and to the greater weighting of confirming evidence as representing the confirmation bias.
This paper describes an experiment (1) to replicate the Tolcott et al. result that confirmation bias is manifest in complex analysis tasks, and if so, (2) to determine whether a procedure recommended for use in the intelligence analysis community (Heuer, 1999; Jones, 1998) successfully mitigates it.
The first goal of the experiment was to see if we could replicate the Tolcott et al. findings.
Although we were confident, we were not certain that we would do so, since (1) most confirmation-bias studies have used tasks that were conceptually different from the Tolcott et al. task, and (2) there are a number of studies that failed to obtain (or that mitigated) the supposedly ubiquitous “confirmation bias” implied in introductory texts (e.g., Bazerman, 2002). For example, Ayton (1992) reviewed studies mitigating the confirmation bias for a rule identification task; Galinsky & Moskowitz (2000) did so for a “trait hypothesis testing” task; and Evans et al. (2002) did so for a “pseudo-diagnosticity” task.
Moreover, it was not clear if the Tolcott et al. differential-weight findings were at odds with recent research on “predecision distortions” in jury decision making by Carlson and Russo (2001). The latter used the same approach as Tolcott et al. of (1) presenting initial background information and obtaining a confidence rating for the most likely hypothesis (in their case, between plaintiff and defendant), then (2) presenting rounds of new evidence (three in favor of the plaintiff and three in favor of the defendant), and (3) obtaining participants’ rating of the degree to which each evidence item supported the hypothesis (plaintiff or defendant). Carlson and Russo also found a significant relationship between participants’ initial confidence rating (“predecision”) and subsequent coding of new evidence (“distortion”). However, Carlson and Russo only measured distortion as the difference between a participant’s rating of evidence and an unbiased, mean rating of the evidence. Consequently, one cannot tell from their paper if participants’ initial confidence rating caused them to (a) completely reinterpret subsequent, disconfirming evidence (e.g., participants with a confidence rating favoring the plaintiff rated subsequent evidence favoring the defendant as actually favoring the plaintiff) or (b) simply give the evidence a lower rating (e.g., one still favoring the defendant), thereby giving it less weight as Tolcott et al. found, before making their final decision. The current study distinguishes between evidence reinterpretation and weighting.
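The distinction between reinterpretation and weighting can be made concrete with a small sketch. The classification rules and example ratings below are our illustration (using the -2 to +2 scale from Tolcott et al.), not the coding scheme of either study:

```python
# Illustrative classifier distinguishing evidence reinterpretation from
# differential weighting. Ratings are on a -2..+2 scale: negative means
# the item contradicts the participant's favored hypothesis, positive
# means it supports it.
def classify_distortion(unbiased_rating, participant_rating):
    """Classify how a biased rating departs from an unbiased one."""
    if unbiased_rating == participant_rating:
        return "none"
    if unbiased_rating * participant_rating < 0:
        # Sign flipped: disconfirming evidence read as confirming
        # (or vice versa).
        return "reinterpretation"
    if abs(participant_rating) < abs(unbiased_rating):
        # Same direction, but the evidence is given less weight.
        return "down-weighting"
    return "up-weighting"

# An item that objectively contradicts the favored hypothesis (-2):
print(classify_distortion(-2, +1))  # reinterpretation
print(classify_distortion(-2, -1))  # down-weighting
```

A difference-from-the-mean measure like Carlson and Russo's collapses the last three categories into one number, which is why it cannot separate case (a) from case (b).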
The second goal of the study was to test the effectiveness of a procedure, called Analysis of Competing Hypotheses (ACH), proposed by Heuer (1999) and Jones (1998) to minimize or eliminate the confirmation bias. Although ACH has eight steps, the approach revolves around developing a “hypothesis testing matrix,” where the rows represent the evidence, the columns the hypotheses under consideration, and the cells the extent to which each piece of evidence is consistent or inconsistent with each hypothesis. The goals of the ACH matrix are to overcome the memory limitations affecting one’s ability to keep multiple data and hypotheses in mind, and to break the tendency to focus on developing a single coherent story for explaining the evidence—a tendency which Carlson & Russo (2001) hypothesized creates predecision distortions (and presumably the confirmation bias). ACH is hypothesized to offset confirmation bias by ensuring that analysts actively rate evidence against multiple hypotheses and by reminding analysts to focus on disconfirming evidence. However, the only experiment testing the effectiveness of ACH found mixed results (Folker, 1999): it helped intelligence analysts identify the correct answer to one problem, but not another. (We note that both problems had fewer than 20 evidence items.) No experiment has directly tested ACH’s ability to mitigate the confirmation bias.
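The matrix structure described above can be sketched as a simple data structure. The hypotheses, evidence items, and scores below are invented for illustration (this is not Heuer's or Jones's implementation), but the key step, ranking hypotheses by how little evidence is inconsistent with them, follows Heuer's emphasis on disconfirming evidence:

```python
# Minimal sketch of an ACH hypothesis-testing matrix: rows are evidence
# items, columns are hypotheses, cells record whether an item is
# consistent (+1), inconsistent (-1), or neutral (0) with each hypothesis.
hypotheses = ["H1: attack via north route", "H2: attack via south route"]
evidence = {
    "E1: bridging equipment moved north": [+1, -1],
    "E2: fuel convoys observed on northern roads": [+1, -1],
    "E3: supply depots reinforced on both routes": [0, 0],
}

# Per Heuer's guidance, focus on disconfirming evidence: for each
# hypothesis, count the items inconsistent with it.
inconsistency = [
    sum(1 for scores in evidence.values() if scores[j] < 0)
    for j in range(len(hypotheses))
]

# The hypothesis with the LEAST inconsistent evidence is the strongest
# candidate (rather than the one with the most confirming evidence).
best = hypotheses[inconsistency.index(min(inconsistency))]
```

Ranking by fewest inconsistencies, rather than by most confirmations, is exactly the reversal of attention that ACH is hypothesized to enforce.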
2 Method

This section describes the participants, procedures, and information manipulation used to conduct the experiment.
2.1 Participants

Twenty-four (24) employees of a large research and development corporation volunteered to participate in an experiment evaluating structured argumentation methods. Twenty were male, four female. All participants were interested in intelligence analysis, with 12 of the participants having intelligence analysis experience (ranging from 1 to 18 years, with a median of 9.5 years).
Participants’ ages ranged from 27 to 63, with a median of 47.5. Of the 23 participants who indicated their education, 22 had completed college, with 12 having a master’s degree, three a Ph.D., and one an M.D. Sixteen of the participants majored in math, physics, computer science or engineering.
2.2 Procedures

The entire experiment was conducted via email, and all data were collected within a two-month period. Participants were randomly assigned so that there were 12 in the ACH condition and 12 in the non-ACH condition. All participants started with an email providing a general description of what they would do and a request to complete all materials within one sitting, which was estimated to be (and was) two hours or less. The specific procedures for the ACH and non-ACH groups are described next.