Replication Crisis in Psychology: Part Four

By Katie Surrence

On May 23, 2016

At 8:39 am

In General


Parts one, two, and three.

On: How should scientists respond to failed replications? What kinds of responses contribute to a progressive science?

Another major news story surrounding replication in the last few months was the failure to replicate the ego depletion effect. It’s an interesting case because Roy Baumeister, the psychologist famous for discovering and naming the ego depletion effect, has both demonstrated some exemplary behavior in response to a replication failure and perhaps also engaged in the kind of motivated reasoning that might be dangerous (at least in the absence of the exemplary behavior).

Ego depletion is the theory that willpower is a depletable resource, and that having to exercise self-control in one domain will make you less able to exercise it in that domain, or any other, right afterward. One of the original ego depletion studies asked people to resist eating cookies to cause depletion, and then the dependent measure was how long they persisted at an unsolvable anagram test. Since then ego depletion has been replicated in a number of different types of experiments (and a good thing, too, because there are a lot of other explanations for why you might be more inclined to persist on an anagram test after having eaten cookies rather than radishes, including just goodwill towards the people who gave you cookies).

The ego depletion phenomenon was then selected for a registered replication because a couple of meta-analyses (studies that combine other studies) cast doubt on the reality of the effect. The replicators felt that the cookie task involved too much experimenter involvement for a large-scale replication among many labs, and that anagram solving was too culture-specific, so they chose another ego depletion manipulation from the literature. In the first part of the study, in the depletion condition:

participants were presented with a series of words on a video screen and required to press a button when a word with the letter ‘e’ was displayed and withhold the response if the ‘e’ was next to or one letter away from a vowel. The no depletion version was matched in all respects with the exception that participants were required to press a button whenever a word with the letter ‘e’ was displayed, with no stipulation to ever withhold their response to an ‘e’.

In the second part of the study, to test whether willpower had been depleted, they showed people sets of numbers like 013, and participants had to say the position of one of the numbers. Sometimes the position matched the value, like ‘1’ in 133, and sometimes it didn’t, like ‘1’ in 013. Then they measured the variability in reaction time, which, they say, is a good measure of your ability to pay attention and suppress off-task thoughts. When 24 labs ran this procedure, the results didn’t provide any evidence for an effect greater than zero (for either the variability in reaction time or the average reaction time).

Baumeister and Vohs respond that this wasn’t a good test of the effect (although they didn’t voice that concern publicly beforehand). They argue that the original task required forming a habit, crossing out every ‘e’ on a piece of paper, and then later breaking that habit when new rules were introduced. Without properly forming the habit, breaking the habit can’t be depleting. They also argue that because participants reported that the ‘e’ task was frustrating, but not fatiguing, it didn’t really cause ego depletion.

Michael Inzlicht, one of the participants in the replication project and an ego depletion researcher, finds this explanation unsatisfactory:

The problem is that the depletion literature is littered with initial effortful tasks that are as effortful—and often, much less effortful—that the ones used in the replication. First, the registered replication was, well, a direct replication of a published depletion paper using this precise task; this version of the crossing out ‘e’s task has also been used in at least one other published paper. If this initial task is not sufficiently effortful to evoke downstream consequences on control, how did these two publications find what they did? Second, even a casual glance at the depletion literature reveals initial effortful tasks that appear no more effortful than the initial task used in the replication. These include letting your mind wander for a few moments on any topic save for white bears, having a structured conversation with a Black confederate, writing a short paragraph without using the letters ‘A’ or ‘N’, recalling a time when one was a victim of prejudice, or taking the perspective of a hungry waiter unable to eat food.
Remarkably, I know of two papers where the initial effortful task involved performing twenty incongruent Stroop trials. Having conducted many many Stroop studies, I can assure you that twenty Stroop trials, which might take less than one minute, requires noticeably less effort than the initial task used in the registered replication; yet, these studies had significant downstream consequences.
So, while I agree that the original crossing-out ‘e’s task would have been preferable, there is no principled reason to think that the task is not effortful enough to produce depletion. It is for this reason that all us experts signed off on this replication and thought it was a fair test. It is for this reason that 23 out of 24 labs predicted a significant result.
It is also for this reason that 23 out of 24 labs, including my own, need to update our beliefs.

So, here is the exemplary part. Baumeister and Vohs conclude:

Clearly, though, this debacle shifts the burden of proof onto those of us who believe ego depletion effects are genuine. We will organize a pre-registered, multi-site replication project next year, using well-tested procedures (ones that actually involve self-regulation). We herewith preregister the hypothesis the depleted participants will perform worse on subsequent, ostensibly unrelated self-regulation tests than nondepleted participants, as a great many other studies have found.

That sounds like what psychologists who believe in the reality of their effect would do. What remains frustrating is the number of post-hoc rationalizations psychologists can bring to bear for why a replication failed. Post-hoc reasoning is fine if you go on to test it, but it can make claims unfalsifiable if the theory can be endlessly modified to fit the results. Baumeister and Vohs believed the replication would work beforehand; they say so. If it had worked, they certainly would have counted it as confirmation of ego depletion.

If you are strongly motivated not to update your beliefs, there’s really no end to the number of post-hoc explanations you can come up with for why some experiment didn’t work. That’s what’s frustrating about Lisa Feldman Barrett arguing that context explains the failures to replicate in the Reproducibility Project: Psychology (RPP). It’s not that she’s necessarily wrong. It’s that you can’t prove her wrong. When couldn’t you hypothesize that some contextual variable you haven’t yet explicitly examined explains a replication failure?

Recently, Andrew Gelman quoted Paul Meehl, a critic of psychology research famous in the field:

It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

Baumeister and Vohs’ theory of why the replication failed seems like a possible instance of Meehl’s endless generation of auxiliary hypotheses. But at least they are doing exactly what they should: preregistering their own replication attempt to test their assertion that their favored procedures will produce data that support their theory.