
Human Error

There’s a book called Human Error, by James Reason, that has been very influential in industries like nuclear power and aviation. It’s full of research findings and citations, which make it a great read for people who want to know all the details. For those who just want an overview, a good recommendation is The Design of Everyday Things by Don Norman. The following paragraphs are the result of applying some of those ideas to software design.

Having an accurate mental model of a system is a great tool for preventing errors: if you know how a system works —if you are familiar with its internal design—, you will be less likely to use it in the wrong way. But having an accurate mental model is also important for fixing errors within the system: when something goes wrong, you will have an idea of what part could be at fault. Closely related is the concept of situational awareness: it refers to the capture and processing of external conditions that may affect the system. Our mental model must take those variables into consideration.

As an example of the risks carried by a defective mental model, here’s an exchange between a pilot in his small plane and the ground controller. The pilot is alone and in distress, probably because of bad weather conditions.

Pilot: Mayday Mayday Mayday! N9815L, I’m in trouble! Mayday Mayday Mayday!

Controller: N9815L, Fort Dodge, go ahead.

Pilot: I have no idea where I’m going, I’m going to crash! Mayday Mayday Mayday!

Controller: 9815L, say last known position.

Pilot: I have no idea, I have no idea! Mayday Mayday Mayday! I’m gonna crash!

Controller: 9815L, say altitude.

Pilot: I’m rolling! I’m rolling! I’m rolling! I’m rolling! Oh my god! Help! Help!

Controller: Calling Fort Dodge, release the stick, go forward on the stick then slowly back again.

Pilot: I’m level! I’m straight and level! I don’t know where I am!

The conversation continues and the pilot lands safely; you can listen to it here. Having a good mental model of the situation allowed the ground controller to provide life-saving help with just a few directives.

That kind of intuition comes from experience, but there are two shortcuts for attaining it: good habits and simplicity. By having a simpler system, we have better chances of forming a good mental model. By having good habits, we avoid many error scenarios and have better chances of recalling the proper set of rules for coping with a crisis. A simpler system is easier to understand, as complexity is the main barrier to creating an accurate mental model.

In programming, there are several tools for measuring complexity. The term Software Complexity refers to the relationship between code and programmer: it has to do with how difficult it is for a person to understand what the code does. It’s unrelated to computational complexity, and it has nothing to do with how well designed the code is. A novice programmer can create a program that is simple, elegant and wrong. An advanced programmer can create a program that is correct, but complex and hard to understand. For the purpose of creating an accurate mental model, even the program’s correctness is of secondary importance: code that is understandable can be fixed. Of course, we shouldn’t allow incorrect code, so moving forward we’ll assume we are discussing code that works, and we will focus on its software complexity: how clear and readable it is.
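
To make that contrast concrete, here’s a small sketch in Python; the examples and names are mine, purely illustrative, and not taken from any particular codebase:

```python
# Hypothetical examples: all three functions try to answer "is n prime?".

def is_prime_simple_but_wrong(n):
    # Short and readable, but wrong: it reports 0 and 1 as primes.
    for divisor in range(2, n):
        if n % divisor == 0:
            return False
    return True

def is_prime_correct_but_convoluted(n):
    # Correct, but the mutable flag, the break and the cryptic names
    # force the reader to simulate the loop in their head.
    if n < 2:
        return False
    d, r = 2, True
    while d * d <= n:
        if not n % d:
            r = False
            break
        d += 1 if d == 2 else 2
    return r

def is_prime(n):
    # Correct and simple: the version worth aiming for.
    if n < 2:
        return False
    return all(n % divisor for divisor in range(2, int(n ** 0.5) + 1))
```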

There’s no perfect metric for software complexity. That said, some metrics have a good empirical correlation with what we experience when trying to understand a program. Two metrics worth mentioning are Cyclomatic Complexity (McCabe, 1976) and Halstead Code Volume (Halstead, 1977). There are tools in almost every programming language for measuring those values. A paper about correlations between internal software metrics and software dependability (van der Meulen & Revilla, 2007) shows that those complexity metrics are strongly correlated with the number of lines of code in a program. The correlation is linear, positive and almost perfect. Counting the lines of code provides a good estimate of how difficult it will be to understand some code. Again, there’s no perfect metric, and the experience of the programmer plays a key role. All things being equal, the less code you have to read, the easier it will be for you to understand what it does.
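
As a rough illustration of what those metrics capture, here’s a simplified sketch in Python. It’s my own toy approximation, not how production tools are implemented: it estimates cyclomatic complexity by counting decision points, and counts lines of code for comparison.

```python
import ast

# Node types that introduce an extra execution path (simplified list).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)

def cyclomatic_complexity(source: str) -> int:
    """Estimate McCabe's cyclomatic complexity: decision points plus one."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))
    # Each `and`/`or` adds an extra path as well.
    decisions += sum(len(node.values) - 1
                     for node in ast.walk(tree)
                     if isinstance(node, ast.BoolOp))
    return decisions + 1

def lines_of_code(source: str) -> int:
    """Count non-blank, non-comment lines: the crude metric that
    correlates surprisingly well with the fancier ones."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

snippet = """
def classify(n):
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    return "positive"
"""

print(cyclomatic_complexity(snippet))  # 3: two branches plus one
print(lines_of_code(snippet))          # 6
```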

We write code in order to solve problems, but the code we write is itself a new problem: it will use memory and CPU cycles, and its complexity will get in the way when somebody —our future selves included— needs to understand it. By solving problems with code, we are trading an original problem for an artificial one. This new problem should be easier to deal with; if that’s not the case, we are not making any progress.

How complex can a program be? How complex should it be? There’s no definitive answer, but we can test the complexity boundaries. For any correct program —a program that yields the right result—, we can create a more complex version that’s still correct. In fact, there’s no limit to the amount of complexity we can add to a program, and it will be correct as long as it keeps producing the right result.
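
To illustrate, here’s a contrived Python sketch of my own: both functions compute the same sum, and the second one could be inflated indefinitely without ever becoming incorrect.

```python
def total(numbers):
    # The straightforward version.
    return sum(numbers)

def total_inflated(numbers):
    # A needlessly indirect version: an accumulator class, a loop and
    # a redundant type check, all to compute the same sum. Nothing
    # stops us from adding yet another layer and staying correct.
    class Accumulator:
        def __init__(self):
            self.value = 0

        def add(self, n):
            self.value += n
            return self

    acc = Accumulator()
    for n in numbers:
        if isinstance(n, (int, float)):
            acc.add(n)
        else:
            raise TypeError("expected a number")
    return acc.value

assert total([1, 2, 3]) == total_inflated([1, 2, 3]) == 6
```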

Given two programs that yield the same result, the simpler one is to be preferred because it helps us build a good mental model. Of secondary importance is the frugal use of resources: the program with better performance and lower memory usage is to be preferred. This means that in many situations, we should be ready to sacrifice some performance if the result is a simpler program. Luckily, simpler programs tend to be more efficient in their use of resources.
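
A hypothetical example of that trade-off, sketched in Python: both functions answer the same question about a sorted list, and unless the search turns out to be a bottleneck, the slower but simpler one is a perfectly reasonable choice.

```python
def contains_linear(sorted_values, target):
    # O(n), but trivially correct and easy to hold in your head.
    for value in sorted_values:
        if value == target:
            return True
        if value > target:
            return False
    return False

def contains_binary(sorted_values, target):
    # O(log n), but with more moving parts and more room for
    # off-by-one mistakes.
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return True
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False
```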

Why is our industry producing tools with such a high level of complexity? Could it be the case that we are overconfident in our mental models in the face of complex programs? The pilot in that story, I suppose, was confident that his mental model was good enough. In the same way, every other month we read the post-mortem report of a company that suffered half a day of downtime, all the while thinking, I assume, that their mental model was adequate. In retrospect, when analyzing accidents, we can verify that mental models of complex systems like distributed programs, airplanes and power plants are usually wrong.

A while ago, I tried my best to explain the benefits of simple tools in a presentation about minimalism in programming. While I couldn’t tell for sure, my impression was that the people already aligned with those ideals were the ones who enjoyed the presentation, and I didn’t manage to convince anyone else. The next speaker in line couldn’t restrain himself and said: I also love minimalism, and that’s why I love Rails! Rails is a project that makes a lot of money for a lot of people. It has fans everywhere, and it’s handy for starting a project if you already know how to use it. But I would never call it minimalist. Why did my presentation provoke such a reaction? Could it be that our lack of scientific rigor promotes that kind of relativistic approach? Maybe there is some kind of emotional attachment that blocks any criticism. In any case, I wish I knew better.

In 2011, Rich Hickey delivered a great presentation called Simple Made Easy where he explained the difference between simple and complex, but, most importantly, explored why we usually conflate the concepts of simple and easy: something can be objectively simpler, but only subjectively easier. He did a reprise at RailsConf 2012, also worth watching. Understanding the distinction between simple and easy is key to suppressing any knee-jerk reaction when someone denounces a tool as too complex.

What happens if we don’t strive for simplicity? An article by Leslie Lamport discusses the future of computing, and wonders if it will be like logic or like biology. He uses logic to label programs that are simple and can be analyzed and proved correct. On the other hand, biology refers to systems so complex that we no longer understand how they work, and we can only focus on how they behave. Dealing with something we don’t fully understand can lead to irrational behavior, he argues. A person who doesn’t understand how the human body works may use homeopathy. But that same person would never use something homeopathy-like on a simpler system: nobody uses homeopathy to fix a broken car. With software, something similar happens: some people may select a tool because it gets constantly updated, a defect often regarded as a virtue. Others may reject a tool because it didn’t gather enough stars on GitHub, while the rational behavior would be to read the code, understand what it does, and reject it only if it doesn’t work for their use case.

We are good at making mistakes, and extremely bad at being precise. If we give up trying to understand, we are setting up the perfect trap for ourselves.