Programming Challenges

Challenge #6: Don't just fix the bug


(First posted May 26, 2000)

In the Jargon File, in the entry for "firewall code", there's a sentence which I think I suggested:

    Wise programmers often change code to fix a bug twice: once to fix the bug, and once to insert a firewall which would have arrested the bug before it did quite as much damage.

In retrospect, this observation doesn't go far enough. Two fixes for one bug aren't always enough: a single bug might reveal arbitrarily many things that need fixing, changing, or improving.

Before exploring this idea further, it's worth explicitly mentioning a certain sober admission which underlies the notion of "firewall code", namely the fact that most code does contain bugs. If we could assume that essentially all code were bug-free, we wouldn't have to worry so much about firewalls and other robustness strategies; it would be a waste of time and effort to think about and fine-tune our code's behavior in the face of bugs. But the sad truth is that we can't be so confident: in any program of realistic size there are latent bugs which haven't been revealed yet, and there will be new bugs introduced next week or next month when someone adds a feature or attempts to make some other change. Therefore, it is worthwhile to devote effort in an attempt to lessen the impact of bugs. (At any rate, the title of this Challenge is "Don't just fix the bug", so we're presupposing that a bug has been found, and it's probably not the last bug.)

One way to lessen the impact of bugs is to program defensively (for example, by terminating an i-goes-from-1-to-10 loop when i is greater than or equal to 10, not just when it's exactly equal to 10). Another is to insert the aforementioned firewalls, to arrest damage before it gets too far; these firewalls are often implemented using the assert() macro. Still another is to write code a degree or two more generally than you might at first be tempted to assume you can get away with. But those are all topics for another day.
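
Here is a minimal sketch of the first two of those ideas in C; the loop bound and the do_something_with() call are invented purely for illustration:

    #include <assert.h>

    extern void do_something_with(int i);   /* hypothetical stand-in for real work */

    void example(void)
    {
        int i;

        /* Defensive version of the i-goes-from-1-to-10 loop: the test is
         * i < 10 (that is, terminate when i >= 10), not i != 10, so the
         * loop still stops even if i somehow jumps past 10. */
        for (i = 1; i < 10; i++)
            do_something_with(i);

        /* Firewall: if i ended up anywhere but exactly 10, something has
         * already gone wrong; arrest it here, close to the cause, rather
         * than letting the damage spread. */
        assert(i == 10);
    }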

Today's challenge has to do with the aftermath of a really nasty bug, one which makes you wish you'd programmed even more defensively, so that the bug didn't cause quite so much damage, or so that it didn't cause such a mysterious failure that it was almost impossible even to track down, let alone fix. The challenge is to go beyond just fixing the nasty bug, and to use it as a tool to help discover further bugs, or bugs waiting to happen. Rather than just being an incentive to code more carefully next time, the bug can help you discover multiple parts of your code which could stand to be written more carefully, now.

Suppose you have one of these nasty bugs. (It's probably not that much of a supposition: you probably have had at least one of them; I know I have.) Suppose it's a bug that doesn't cause an immediate crash, but instead leaves various data structures in a corrupt or inconsistent state, such that far-flung pieces of the rest of the program gradually choke and die as the pollution spreads. Bugs like these are indeed nasty, but they provide a rare opportunity to do some destructive stress testing on the rest of the program, to find out what other weaknesses or unerupted bugs it has. Suppose that the root cause of the bug is eventually determined to be in module X, where a certain key data structure is sometimes left in a corrupt state. We can legitimately ask, however, if module Y's behavior when faced with this bad data was reasonable, or whether module Y could somehow have coped with it more gracefully, or at least diagnosed the problem explicitly (perhaps via failed assertions) rather than crashing mysteriously, or propagating the corruption to module Z. And what about module Z? How did it fare?
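
To make that concrete, here is one hedged sketch of the kind of firewall module Y might place at its entry points, so that corruption created in module X is caught there, with an explicit assertion failure, instead of spreading onward. The list structure, its cached count, and the function names are all invented for illustration:

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical shared structure: a list whose header caches its length. */
    struct node { struct node *next; /* ... payload ... */ };
    struct list { struct node *head; int count; };

    /* Consistency check: does the cached count match the actual chain? */
    int list_is_consistent(const struct list *lp)
    {
        const struct node *np;
        int n = 0;

        if (lp == NULL)
            return 0;
        for (np = lp->head; np != NULL; np = np->next)
            n++;
        return n == lp->count;
    }

    /* One of module Y's entry points: the assertion is the firewall,
     * stopping bad data created elsewhere at Y's boundary rather than
     * quietly passing it along to module Z. */
    void module_Y_operation(struct list *lp)
    {
        assert(list_is_consistent(lp));
        /* ... the real work ... */
    }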

If module Y quietly propagated the corruption, and if module Z eventually crashed in a mysterious way, there may end up being three separate bugs to fix. Module Z should have dealt with the problem gracefully (if only by failing definitively, with an explicit message explaining why) rather than crashing mysteriously. Module Y should have dealt with the problem gracefully (if only by failing definitively) rather than propagating bad data. Finally, module X should obviously not have created the bad data in the first place.
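
And here, reusing the hypothetical list and consistency check from the previous sketch, is roughly what "failing definitively" might look like in module Y: a clear report of what was detected, and an error return that cannot be mistaken for success, rather than either a mysterious crash or quiet propagation:

    #include <stdio.h>

    struct list;                                   /* the shared structure sketched above */
    extern int list_is_consistent(const struct list *lp);

    /* Module Y, failing definitively: it refuses to hand bad data on to
     * module Z, says exactly why, and returns an unmistakable error. */
    int module_Y_transform(struct list *lp)
    {
        if (!list_is_consistent(lp)) {
            fprintf(stderr,
                    "module Y: input list is corrupt; refusing to proceed\n");
            return -1;
        }
        /* ... normal processing, handing sound data on to module Z ... */
        return 0;
    }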

So, we come at last to the meat of this Challenge: when a cascading bug like this crops up, try to discover all three (or all N) of the problems, and take the time to fix all of them. Don't just fix the one bug that triggered the rest.

Fixing all the bugs typically requires tackling them in the correct order, which is the reverse order of their propagation. Ideally, you want to fix the bug in Z first, then Y, then X. If you were to find and fix the bug in module X first, the symptoms in modules Y and Z would go away, so you couldn't even find the bugs or weaknesses in those modules any more, let alone fix them. But if those other modules remain weak, they might fail again next week, when a new bug in module W pulls the rug out from under them in a similar way.

Though it may have been extremely frustrating for that bug in module X to have done so much damage, there's one respect in which it actually did you a favor, by helping you discover the hidden weaknesses in modules Y and Z. Taking the time to strengthen modules Y and Z against these kinds of damage emanating from modules X or W, and in the general case, strengthening all of the code against damage emanating from other parts of the code, results in programs that are more robust. Robust programs deal more gracefully with bugs when they arise, limiting the damage caused by the bugs and making the bugs easier to find and fix.

It's fair to ask whether we're going overboard by essentially planning for bugs in this way, trying to make various parts of a program strong and robust even in the face of bugs in other parts of the program. Is this a waste of time? Rather than spending time making sure that modules Y and Z behave properly even when modules W or X do not, wouldn't it be just as effective simply to make sure that modules W and X always do perform properly? By spending time proactively working around further bugs that haven't even come up yet, and by justifying our actions by claiming that it's inevitable that those bugs will come up eventually, are we somehow condoning those bugs, or weakening in our resolve to avoid having so many bugs in the first place?

I believe that we are not justifying or condoning bugs, or weakening in our resolve. Writing code so as to be robust even in the face of (its own) bugs is not a concession of defeat. When we work at making our code resilient in the face of bugs, we are not implying that the bugs are acceptable, or that we're too lazy or inept to find and fix the bugs. There are analogies in the real world: when we wear seatbelts (or install air bags) in a car, we're not saying that crashes are okay; when we install fire sprinklers in a building, we're not saying that fires are okay. All we're saying is that when accidents happen, no matter how unacceptable those accidents may be, it's even more unacceptable for the consequences of the accident to be worse than they need to have been, when the possibility of the accident, and its consequences, could have been anticipated.

In the case of software under development (and, if we consider maintenance, most software is "under development" forever), it's not only "bugs" we have to worry about being robust in the face of. Often, the problematic situation (that is, the situation that it would be nice for some piece of code to be robust in the face of) is not a bug, but rather a change in some other aspect of the program, a change which, in the process of adding some feature or fixing some other bug, inadvertently demolishes an assumption or exposes a weakness in some other part of the code. (That is, if module A, which has been working apparently perfectly for years, suddenly stops working because of a change to module B, a change which was required in order to fix some recently-discovered bug in module B, but a change which module A was not robust in the face of, then is the "bug" that caused module A to fail the change to module B, or was it in module A itself all along?)

I should close, however, with one caution. There is one way in which it is possible to go too far in one's attempts to make code robust, and that is when it's somehow possible (or even desirable, or a requirement) for a piece of code to actually tolerate failure, to correct and proceed in the face of bad data (that is, neither to fail mysteriously nor definitively, but to succeed in spite of prior errors). The danger here is that if a piece of code does too good a job of cleaning up prior errors, it may actually cover them up! It's one thing to be robust in the face of an error, to proceed in spite of it, but you don't generally want to completely camouflage the fact that an error occurred at all; you don't want to prevent people from noticing and fixing the error. Therefore, when you find that you're able to continue in spite of an error you've detected, it's generally a good idea to report or log the error in some way, so that it can't be overlooked or ignored.
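
For instance (again in terms of the invented list structure from the earlier sketches), a routine that repairs a bad cached count and carries on might look something like this; the essential part is the log message, which keeps the repair from silently hiding the underlying bug:

    #include <stdio.h>

    /* Hypothetical list whose header caches its length, as in the earlier sketches. */
    struct node { struct node *next; };
    struct list { struct node *head; int count; };

    /* Tolerate and repair: if the cached count disagrees with the actual
     * chain, correct it and proceed -- but report the discrepancy, so
     * that whatever corrupted it can still be noticed and fixed. */
    void list_repair(struct list *lp, FILE *log)
    {
        struct node *np;
        int n = 0;

        for (np = lp->head; np != NULL; np = np->next)
            n++;

        if (n != lp->count) {
            fprintf(log, "warning: list count was %d, actual length %d; repaired\n",
                    lp->count, n);
            lp->count = n;
        }
    }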


This page by Steve Summit