Wednesday, June 1, 2016

Digging Out Root Cause

There’s been a push in the area where we live to require certain organizations to do “root cause analysis” for major incidents. Most of this push is for high-hazard industries, such as refining and chemical manufacturing, which makes sense, since if something goes wrong in these areas it can have very serious consequences. So state and local regulators are proposing and enacting regulations to force organizations to conduct a root cause analysis for major events (and perhaps some other events). You can view an example of these regulations here and here.

The underlying assumptions for this push are basically that (a) the current processes used by these chemical plants are flawed, and (b) doing a root cause analysis is a superior method for doing investigations. Now, we haven’t worked with all of the plants covered by the regulations, so we can’t really speak to the first assumption (a).

However, is root cause analysis really a good way to investigate accidents? At first glance the answer seems to be ‘yes’. The name “root cause analysis” implies a process that goes deeper into the organizational processes. And that is something we all want – a process that goes deeper. This should help us better identify and correct problems, right?

The question is not how deep the investigation goes. Rather, the question is whether the investigation gives us a clear enough picture of our operations to help us learn and improve future performance. Does root cause analysis do this?

Unfortunately, the answer is no. Now, this isn’t to say that there aren’t people who use root cause analysis and improve operations after an accident. It’s just that we aren’t convinced that the root cause analysis is the reason for the improvement. In fact, we would argue that if organizations moved beyond traditional root cause analysis methodologies they might see even more improvement than they otherwise would.

The biggest issue with the search for root cause(s) is that it is deceptively subjective and arbitrary. To understand what we mean, let’s look at the basic idea behind root cause analysis – to identify the root causes of a fault or accident. Root causes are defined as those factors that, if removed, would have prevented the accident. On the surface this seems to be very logical and objective. Just identify those things that, if removed, would have prevented the event from occurring.

In reality though, this is so subjective and prone to bias that it’s scary. Let’s look deeper at the logic here. Every effect has a cause. Therefore, every accident (which is an effect) has a cause. Every cause is also an effect though. So you just keep treating each cause as an effect and determining what caused it until you get to the so-called “root cause”. But doesn’t this mean that so-called “root causes” are also effects that have causes? Why wouldn’t the cause of the root cause be the actual root cause? How do we pick one or the other? Wouldn’t, in reality, the root cause of all accidents be Creation or the Big Bang (depending on your worldview)? As Jens Rasmussen points out, what we end up calling “root cause(s)” are often merely the points where we decided to stop investigating. This means that a “root cause” is not an objective element that we find, but rather something we create when the investigation stops. That makes root cause analysis entirely subject to the whims and biases of the investigators.
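For readers who think in code, the stopping-point problem can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the chain of causes, its wording, and the function name `find_root_cause` are all hypothetical, and real events do not reduce to a single tidy chain like this one.

```python
# Hypothetical chain: each entry is offered as the "cause" of the entry before it.
CAUSE_CHAIN = [
    "worker injured by pressure release",
    "valve opened while line was pressurized",
    "lockout step skipped",
    "job was behind schedule",
    "staffing was cut last quarter",
    "budget pressure from a market downturn",
]


def find_root_cause(chain, max_whys):
    """Ask "why?" max_whys times and report wherever we happen to stop."""
    # Clamp so we never walk past the end of what the investigators wrote down.
    depth = min(max_whys, len(chain) - 1)
    return chain[depth]


# Same event, same evidence; only the investigators' stopping point changes.
for whys in (1, 3, 5):
    print(whys, "->", find_root_cause(CAUSE_CHAIN, whys))
```

Nothing in the chain marks any entry as the “root”. The answer is set entirely by `max_whys`, which is a choice made by the investigators rather than a property of the event.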

As an example, let’s apply the root cause analysis thinking to an instance of someone getting hurt at work. What are the factors that, if removed, would have prevented the accident?
  • Getting up that morning.
  • Driving to work.
  • Not getting into a serious accident on the way to work.
  • The business being profitable the previous year, allowing the business to remain open.
  • The employee being hired.
  • The employee being born.
  • The inventor of the technology that the employee was working on being born.

Obviously we’re being a little absurd, but the main reason the above are not typically considered “root causes” in a root cause analysis is that the organization chooses to do nothing about them (in one or two cases there’s nothing the organization could even do). But doesn’t that mean that root cause analysis is muddying the waters between learning and corrective action? If we can only call something a root cause when it’s something that we have the means and desire to fix, doesn’t that introduce huge potential for bias into the process?
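The same point can be put as a small Python sketch. Everything here is an assumption made for illustration: the `Factor` class, the two flags, and the “guard missing from machine” entry are hypothetical, and the flags themselves are judgments an investigator assigns, not facts the event hands over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    description: str
    necessary: bool   # judged: would removing it have prevented the accident?
    actionable: bool  # judged: is the organization able and willing to change it?


# Hypothetical factors, loosely following the list above plus one typical finding.
FACTORS = [
    Factor("employee drove to work", necessary=True, actionable=False),
    Factor("business stayed open", necessary=True, actionable=False),
    Factor("employee was hired", necessary=True, actionable=False),
    Factor("guard missing from machine", necessary=True, actionable=True),
]


def passes_definition(factors):
    """Everything that meets the stated definition of a root cause."""
    return [f.description for f in factors if f.necessary]


def reported_root_causes(factors):
    """What tends to get reported: only what we can and want to fix."""
    return [f.description for f in factors if f.necessary and f.actionable]


print(len(passes_definition(FACTORS)), "factors meet the definition")
print("reported:", reported_root_causes(FACTORS))
```

All four factors pass the “if removed” test, yet only one survives the second filter. That second filter is a decision about corrective action, not a finding about what happened.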

The bottom line is that root cause analysis is distortion and subjectivity masked as clarity and objectivity. The world does not unfold in the way that root cause analysis has us looking at it. Our world is made up of a constant stream of events that are highly interconnected. Breaking them into bits is an extremely arbitrary process that will inevitably lead to an incomplete understanding of what we are looking at.

We recommend that the safety profession move beyond root cause analysis and begin to look at other methods for investigating failures. What that looks like may vary from place to place, situation to situation, but some general ideas to get you started include:
  1. Learn, then improve. Don’t identify fixes in your investigation until after you’re satisfied you’ve learned what you can from the event. The goal is to figure out how your system is working and why that led to a failure this time. If you go in trying to fix before you understand, you’ll blind yourself to innovative opportunities for improvement.
  2. Start back in time and move forward in your investigation. As much as possible we want to see the world the way those involved saw it. Processes that go backwards in time do the opposite, which will increase the potential for hindsight bias. Instead, go back a day, a week, a month or a year in time (or more), start there and move forward.
  3. Tell the story of the event. Rather than breaking the event up into parts and classifying them, put them all together into a coherent story. This will help you see how things worked together, which will help you not only understand each element more, but also the behavior of the whole system you’re looking at. Go up and out to look at the big picture, rather than down and in.
  4. Look for how things succeed to understand why they sometimes fail. People don’t break like machines. We do things that we believe will help us be successful. Therefore, don’t just look at failure in an investigation. Try to figure out why people’s behavior made sense to them, why it helped them to be more successful. Then you’ll understand the behavior more, which will then allow you to identify better fixes overall, if necessary.
  5. Get different perspectives involved. The best way to reduce bias, paradoxically, is to get more bias and diverse perspectives involved. If you get enough diverse bias involved the biases will begin to cancel each other out, leading you closer to what’s really going on. This means getting employees, supervisors, engineers, etc. involved. Note that we said “involved”, not “interviewed as part of the investigation”. They should be involved both in the investigation as investigators and in identifying solutions to the problems found.


Now, there are some who can do the above while still using some methodology that they call “root cause analysis”. That’s fine. You can call it whatever you want to call it, as long as you recognize that reality is not reducible to “root causes” and that the process is more about understanding and managing bias than about eliminating it. Accident investigation is a social process requiring empathy, collaboration and communication first. Everything else is secondary.
