Tuesday, July 29, 2014

Gambling and the Dynamic Nature of Risk

The other day we were teaching a class at a client’s site where the primary hazards tend to be health hazards from chemicals and fire hazards are minimal (but still possible). We asked the group the following question:

Does the fact that fires are so unlikely at this site make them more likely to happen?

One student was emphatic that the answer to this question is “no.” His argument made sense – if we estimate the risk to be, just choosing a random number, 1 in 10,000, then that is the number. Simply knowing the risk shouldn’t change it, because risk is static – knowing a probability should not change the probability of an event.

This line of thinking is very rational and in line with the common approach to risk management. An analogy often used to understand traditional risk thinking is gambling. Take, for example, the game of roulette (if you’re unfamiliar with the game, here’s a short explanation - click here). If you decide to place all your money on the number 20, knowing that the likelihood of winning is 1 in 38 (on an American double-zero wheel; the exact odds depend on the specific setup the casino uses) does not make you more or less likely to win. The probability is the same whether you realize it or not, because your knowledge has no effect on the speed of the wheel, the speed with which the dealer spins the ball, the rotation of the Earth, or any of the other myriad factors that may affect where the ball lands.
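The static view can be made concrete with a few lines of arithmetic. This sketch assumes the standard American double-zero wheel and the standard 35-to-1 payout for a straight-up bet; the probability, and therefore the expected value, is fixed regardless of what the bettor knows:

```python
from fractions import Fraction

# American double-zero wheel: 38 pockets; a straight-up bet pays 35 to 1.
p_win = Fraction(1, 38)

# Expected value per $1 wagered: +35 on a win, -1 on a loss.
ev = p_win * 35 + (1 - p_win) * (-1)

print(p_win)  # 1/38
print(ev)     # -1/19, i.e. the house keeps about 5.3 cents per dollar, on average
```

No amount of knowledge about `p_win` moves either number; that is exactly what “static risk” means here.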

So, why should the likelihood of having a fire in a chemical plant be any different?

The thing is, it is very different. As we said above, the probability, or risk, you could call it, in roulette is static. It doesn’t change from one spin to the next. But in real life that’s not what happens.

To illustrate this, consider a different game found in the casino – poker (if you’re unfamiliar with the game, here’s a short explanation - click here). Poker, as it is traditionally played, is a very different game than most other games found in the casino, because you are no longer playing against the casino, but instead are playing against other players. In some ways, the math underlying poker is no different from the math of other casino games, such as roulette – the odds of getting the card you need do not change whether you know them or not. However, poker is an interesting game because it requires players to make more decisions than almost any other game in the casino, and those decisions have a very important effect on the outcome of the game. By introducing those decisions you introduce interactions between players that are complex and dynamic.

As an example, imagine you play a hand against an opponent where you bet with your strong poker hand and you win. The decisions made during that hand have consequences for your opponent. They have formed new impressions of you: how you play, their perception of how lucky you are, how unlucky they are, etc. All of this will cause them to make adjustments, consciously and/or unconsciously, changing the probability of a given outcome on the next hand. The probability of the cards coming a certain way doesn’t change, but your chance of winning or losing does.

And this is the case with risk in the safety world. We would like to believe that things are static and easy, but they are not. The minute we tell someone about a risk they face, we have changed the risk, because that person will make an adjustment based on what we told them (and not always the adjustment we want them to make).

So, for example, if someone at a chemical plant knows that the likelihood of a fire is low, what effect will that have on them? Chances are they will let their guard down (understandably). They will be vigilant with other hazards, but not as much with fire hazards. They will be less likely to support measures to prevent fires, because they perceive the risk to be low enough already. Without the added vigilance and prevention measures that may have been there before, the probability of the event changes simply because the probability became known.
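This feedback loop can be sketched as a toy simulation. All of the numbers below are hypothetical, not real plant data; the assumption is simply that learning “the risk is low” erodes the vigilance that was holding the effective probability at its estimated level:

```python
import random

def count_fires(trials, base_p, vigilance):
    """Count simulated fires; lower vigilance raises the effective probability.
    vigilance=1.0 means the attentiveness assumed by the original estimate.
    All numbers are illustrative only."""
    effective_p = base_p / vigilance
    return sum(random.random() < effective_p for _ in range(trials))

random.seed(0)
attentive = count_fires(1_000_000, base_p=1e-4, vigilance=1.0)   # ~100 expected
complacent = count_fires(1_000_000, base_p=1e-4, vigilance=0.5)  # ~200 expected
print(attentive, complacent)
```

The point of the sketch is only that the “static” 1-in-10,000 estimate stops describing the system the moment behavior adjusts to it.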

This isn’t to say that we should keep information about risks from people, or that the people who are less vigilant are bad or stupid (you could argue the opposite is true). Rather, it’s to say that we have to understand that when you introduce people into a system you introduce complexity. The system becomes dynamic because people are noted for their innate ability to adapt to their environment.

Our job is not to stop this from happening (because that’s impossible), but to provide people with the tools they need to make more informed adaptations. This includes providing a clear picture of the risks people face, understanding that the risks we faced yesterday may be different today and may change again tomorrow, as well as making systems more tolerant of the performance adjustments people make. Furthermore, our tools for assessing risk must not be based on static measures of risk. This means that many traditional risk assessment tools may no longer be applicable to the dynamic risks we face. New tools and approaches are needed, ones based on sociotechnical systems thinking and complexity.

How does your organization cope with dynamic risks?

Wednesday, July 23, 2014

Drift to Danger

As we’ve discussed before (for example, here), many times what we think of as “human error,” or whatever other label you want to use, is often just a product of normal work – i.e. people doing what they feel is best, given the competing goals and tradeoffs they have in front of them.

As much as we’d like to take credit for this idea, we must admit that we’re standing on the shoulders of giants, forerunners in the field of safety science and human performance. One of the most important of those forerunners is Jens Rasmussen, who is responsible for many ideas and concepts in safety science that are foundational to our understanding. One concept in particular that we find very interesting and useful is Rasmussen’s drift to danger model (you can view the entire article where the concept is described here), seen below in its original form.

As Box and Draper said, all models are wrong, but some are useful. We think this model is particularly useful and worth a closer look. Obviously there’s a lot going on in the above picture (we encourage you to read the entire article in the link above for an explanation), so we have a more pared-down version below, created by Johan Bergström (which can be seen in this presentation that is well worth your time).

Basically, what the model illustrates is that at any given time there are numerous pressures competing for attention within an organization. On the upper right side you have the pressure to not go out of business. Every organization has a line (although we may not know where that line is): on one side of it the business can continue to function, and on the other side the business isn’t financially stable enough to continue to exist and must shut down. From a “risk management” perspective, the organization must get as far away from that line as possible. So management is incentivized to push the organization toward greater efficiency and productivity.

On the bottom right side you have the pressure not to work too hard. Certainly an organization could be extremely productive if its workers could do an infinite amount of work without having to rest. But that’s not reality. Everyone has a limit. Furthermore, people are inherently motivated to do the most work for the least amount of effort (however they define those terms individually). You can call this laziness if you like, but it’s something we all do, even when it’s not required. So again, we have a line: on one side an acceptable amount of work, on the other an unacceptable amount, and we constantly try to get away from the unacceptable amount of work.

So we have pressure from one side toward efficiency and pressure from another side toward least effort. Note that while these two pressures push against each other to an extent, they also push in the same direction, reflecting that the goals of efficiency and least effort overlap.

On the far left we have the “boundary of functionally acceptable performance,” or you could say the safety boundary (although one could argue that this is not technically accurate). Note that, left unchecked, the pressures toward efficiency and least effort would blast right past the safety boundary. Fortunately, most people who come to work don’t want to die or get anyone killed, so people are motivated to not get too close to the safety boundary. The problem though is that (a) the safety boundary sometimes changes as our systems change over time, (b) the safety boundary is not always clearly defined, and (c) the pressures for efficiency and least effort can unexpectedly intensify (such as during economic downturns), pushing us closer to the safety boundary.
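The dynamic above can be sketched as a toy random walk. Every number here is invented for illustration: position 0.0 stands for the safety boundary, 1.0 for the economic-failure boundary, each step is biased away from the economic and workload boundaries, and `counter_pressure` stands in for whatever safety feedback pushes back:

```python
import random

def drift_to_danger(steps, counter_pressure):
    """Toy model: 0.0 is the safety boundary, 1.0 the economic-failure boundary.
    Efficiency and least-effort pressures bias each step toward 0.0;
    counter_pressure (e.g. visible safety feedback) biases it back."""
    x = 0.5  # start mid-envelope
    for _ in range(steps):
        x += random.gauss(-0.01 + counter_pressure, 0.02)  # bias plus noise
        x = min(x, 1.0)  # the economic boundary clips the walk
        if x <= 0.0:
            return "crossed the safety boundary"
    return f"still operating at {x:.2f}"

random.seed(1)
print(drift_to_danger(1000, counter_pressure=0.0))    # unchecked drift
print(drift_to_danger(1000, counter_pressure=0.012))  # feedback pushing back
```

With no counter-pressure the walk crosses the boundary almost surely; the sketch’s only claim is that the drift is a property of normal pressures, not of anyone’s bad intent.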

The important thing to remember is that these pushes toward the safety boundary are not “dumb,” “evil,” or any other adjective we typically put on such things. Rather, they are a function of normal behavior. This is what Scott Snook called “practical drift.” And that’s what makes them so tricky to deal with, because when something is normal it is hard to identify it as a problem. As Michael Roberto pointed out, what at first gets excepted quickly becomes accepted, and soon after becomes expected.

So what can we do about this? The most important lesson from Rasmussen’s drift to danger model is that identifying how close one is to the safety boundary requires a new way of thinking about safety. We can’t define safety by the absence of negatives (i.e. incidents, risk, etc.), because you cannot identify how close you are to danger using those metrics. Instead, we have to come up with new metrics based on positive capacities to adapt to the changing environments that the organization and its people find themselves in. Further, traditional tools such as hazard hunts, behavior observations, and risk assessments typically will not find such drift, because they are often based on static environments (i.e. the job never changes) rather than the complex, dynamic work systems our employees operate in. Instead, we must understand how work works. Traditional safety practice must be expanded to include elements of the organization not traditionally covered under the umbrella of safety management, because all of these things have an effect on safety.

Is your organization drifting to danger?

Tuesday, July 15, 2014

The Tumultuous Relationship Between Procedures and Safety

During a recent discussion with an operations manager at a chemical plant where we do work, we got to talking about procedures and their effect on safety. He told us how they have detailed procedures for their most dangerous tasks, for example, the loading and unloading of rail cars containing sulfuric acid. These procedures are designed so that engineers, managers, and, to the credit of this organization, some employees who actually do the job identify the one best method to do that job. Once it is identified, the employees are monitored semi-regularly to ensure that they are following the procedure, with any variation from the procedure assumed to be unsafe.

The obvious underlying assumption to all of this is that procedures create safety. Or, to put it another way, if only people did work the way we planned work to be done there would not be any problems.

But is this true?

Intuitively, yes. We can all cite examples where someone violated a work rule or a procedure, they got injured, and if they had not violated the rule or procedure they would not have been injured. Therefore, the violation led to the injury. And if a violation leads to an injury (i.e. a lack of “safety”), then following the procedure must lead to safety. Therefore, if we want to increase safety we need more procedures and more people following those procedures without deviation.

Seems like sound logic. But there is one problem – this isn’t the whole picture.

As safety professionals we often get a skewed perspective on the world. Why? Because we traditionally only focus on failure. And when you focus on failure you clearly see all of the things that led to that failure (because of hindsight bias). So, as we said above, we see that someone was injured (a failure) and we see that one of the proximate causes was a violation, and we infer the potential effectiveness of procedures as a result.

If we take a step back though, and stop looking at only failure, and look at success (as recommended by Safety-II), we see something interesting – violations of procedures most of the time lead to no injuries. Take driving for example – certainly obeying the traffic laws should lead to safety on the roads right? But if you go out and just watch people driving you’ll notice two things – a lot of violations of the traffic laws, and very few accidents.

Now we’re not saying that we should throw all rules and procedures out the window and start driving like maniacs. What we are saying is that our belief that the violation of procedures, or the committing of “unsafe acts,” causes incidents may be misguided. The link between following procedures and being safe is unproven and, as a result, suspect. Just because procedural violations are correlated with incidents does not mean that one causes the other. Correlation does not equal causation.
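A quick worked example shows how hindsight skews the picture. The counts below are invented purely for illustration: violations are common, incidents are rare, and an incident review finds a violation most of the time:

```python
# Hypothetical counts, for illustration only.
violations = 30_000             # tasks observed with some procedural deviation
incidents = 50                  # total incidents over the same period
incidents_with_violation = 40   # incident reviews that found a violation

# Looking backward from failures, violations seem to dominate:
p_violation_given_incident = incidents_with_violation / incidents   # 0.80

# Looking forward from everyday work, almost no violation ends in an incident:
p_incident_given_violation = incidents_with_violation / violations  # about 0.0013

print(p_violation_given_incident, p_incident_given_violation)
```

The backward-looking number is what an investigation sees; the forward-looking number is what daily work actually looks like. Confusing the two is precisely how correlation gets read as causation.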

This raises the question – what does lead to safety? Well, that’s a very complex question, but the short answer is – people adapting to their environment. To give an example, in talking with the operations manager from above we explored the procedure for the railcar unloaders and identified that the single most important and potentially hazardous step, the proper way to unbolt the dome of the railcar, is the one the procedure spends the least time on. What did the unloaders do? They adapted. They created informal methods for unbolting the domes based on the type of railcar they were working with. Essentially, when we looked closer, it wasn’t the procedure that was making the job safe; it was the people, who inherently wanted to be safe, adapting to an unsafe environment and finding the best way based on the situation.

So, what can we do to help our people make more informed adaptations to their environment? Well, some basic steps to get you started include:
  • Stop being surprised by procedural violations. There will always be a difference between how we imagine work gets done and how it actually gets done. Our job is to find those gaps and understand why they are there, rather than just blaming workers for not doing things “safely.”
  • Stop only looking at why things fail. It gives you a very skewed perspective on the organization that could lead to false assumptions and conclusions. Start looking at why things succeed in your organization and try to enhance that, in addition to looking at and trying to prevent failure.
  • If you must write a procedure, have the people who will actually be doing the work write the procedure. You can certainly be involved to ensure regulatory compliance and what not. But don’t assume you know how to do other peoples’ jobs.
  • Write procedures in such a way as to maximize their effectiveness, as outlined in this blog here.

Tuesday, July 8, 2014

Playing the Blame Game Higher Up the Corporate Ladder

Obviously in this blog we’ve talked a lot about the need to stop playing the blame game after events (here, here, here and here, for example), and usually in those cases we’re referring to the need to stop blaming workers. The good news is that when we have these discussions with managers, safety professionals, and workers, most seem to get it: they appear to understand how futile it is to blame workers for events and see the need to start learning from these events by looking deeper into the work system that employees are operating in. Although there are exceptions (some people still don’t get it), this trend is great and very encouraging.

However, as the focus is taken off of the individuals at the sharp end of the wedge, there is another disturbing trend developing – blaming the organization or, more specifically, blaming the managers. We saw this in particular in a recent talk we gave where we discussed the famous case of the BP Texas City Refinery explosion (we wrote a blog about part of the incident here). One of the attendees brought up the point that the incident was caused by human failure – not on the part of the operators, but on the part of the managers all the way up the corporate ladder.

Although we understand how the attendee came to this conclusion, the problem with this line of thinking is that it is not that different from blaming the workers. If we blame management for an event, all we’re really doing is playing the blame game higher up the corporate ladder, not really understanding what went wrong.

Think of it this way – one of the reasons we take the emphasis off of human error is because workers aren’t intending to hurt themselves or others and they are trying to do a good job (at least what they perceive to be a good job). Essentially, the actions of the workers made sense to them at the time. They thought that what they did would help them achieve their goals and wouldn’t make anything significantly bad happen. So, rather than blaming them for a mistake, we should try to understand why it made sense for them to do what they did in that moment.

Makes sense, right? So why wouldn’t the same line of thinking apply to those higher up the corporate ladder?

CEOs usually don’t want to be responsible for getting people killed, and they usually want to do good for their company; when their workers get hurt or killed, that is, almost by definition, bad for the company. CEOs act in ways they believe will help them achieve their goals and won’t make anything significantly bad happen. So, rather than blaming them, shouldn’t we also try to understand why it made sense for them to do what they did?

Now, we aren’t saying that managers, supervisors, leaders, etc. don’t have more responsibility and more ability to do good and bad. We aren’t even saying that those on the high end of the corporate hierarchy shouldn’t have a bit more to answer for after an event. What we are saying is that if we really want to make progress in safety we need to understand the conditions that cause humans to do things that contribute to events. Simply saying that these managers, supervisors, and leaders are immoral, evil, or greedy, or even using seemingly more sterile terms such as “poor leadership” or “inadequate supervision,” is not useful, because such labels presuppose that all we need to do is replace the poor leader or inadequate supervisor and then we’ll be safe. Essentially we’re saying that our system is unsafe because we have unsafe humans, and if we only get rid of them everything will be right in the world. This is simply untrue and profoundly unhelpful.

Instead, just as with workers, we need to try to understand the systems that those high on the corporate ladder work in. We need to get into the messy details of normal work at all levels within the organization and see the world from their perspective. Only then can we be in a place to make effective interventions that facilitate safe and successful performance.