[This article was originally posted on March 21, 1995, in the midst of a thread on undefined behavior. I have edited the text slightly for this web page.]
Newsgroups: comp.lang.c
From: scs@eskimo.com (Steve Summit)
Subject: Re: Quick C test - whats the correct answer???
Message-ID: <D5swGz.4Cv@eskimo.com>
Summary: why is undefined behavior undefined?
References: <fjm.58.000B63C1@ti.com> <D5JLGy.A63@tigadmin.ml.com>
Date: Tue, 21 Mar 1995 17:26:59 GMT
In article <D5JLGy.A63@tigadmin.ml.com>, Jim Frohnhofer (jimf@nottingham.ml.com) writes:
> In article 000B63C1@ti.com, fjm@ti.com (Fred J. McCall) writes:
>> That's right, and that's why you should listen to him (and to me, and to
>> everyone else telling you this).  The REASON it is undefined is BECAUSE
>> THE STANDARD SAYS SO.
>
> I may regret this, but I'll jump in anyway.  I accept that the behaviour
> of such a construct is undefined simply because the Standard says so.  But
> isn't it legitimate for me to ask why the Standard leaves it undefined?
It is, but you should probably be careful how you ask it. If you're not careful (you were careful, but most people aren't), it ends up sounding like you don't believe that something is undefined, or that you believe -- and your compiler happens to agree -- that it has a sensible meaning. But as long as we're very clear in our heads that we're asking the abstract, intellectual question "Why are these things undefined?", and not anything at all practical like "How will these things behave in practice?", we may proceed.
The first few answers I give won't be the ones you're looking for, but towards the end I may present one which you'll find more satisfying.
First of all, you might want to take a look at who's saying what. I'll grant that statements such as Fred's above can be annoying. I agree completely that it's usually very good to know the reasons behind things. But if the people who have been posting to comp.lang.c the longest, and who have been programming in C for the longest and with the most success, keep saying "it's undefined, it's undefined, quit worrying about why, just don't do it, it's undefined", they might have a good reason, even if they don't -- or can't -- say what it is, and that might be a good enough reason for you, too.
I've been programming in C for, oh, 15 years now. For at least
10 of those years, I've been regarded as somewhat of an expert.
For going on 5 of those years, I've been maintaining the
comp.lang.c FAQ list. I am someone who is usually incapable of
learning abstract facts unless I understand the reasons behind
them. When I used to study for math tests, I wouldn't memorize
formulas, I'd just make sure that I could rederive them if I
needed them. Yet for most of the 15 years I've been programming
in C, I simply could not have told you why i=i++
is undefined.
It's an ugly expression; it's meaningless to me; I don't know
what it does; I don't want to know what it does; I'd never write
it; I don't understand why anyone would write it; it's a mystery
to me why anyone cares what it does; if I ever encountered it in
code I was maintaining, I'd rewrite it. When I was learning C
from K&R1 in 1980 or whenever it was, one of their nice little
sentences, which they only say once, and which if you miss you're
sunk, leaped up off the page and wrapped itself around my brain
and has never let go:
	Naturally, it is necessary to know what things to avoid, but
	if you don't know how they are done on various machines, that
	innocence may help to protect you.

As it happens, it's possible to read K&R too carefully here: the
discussion at the end of chapter 2 about

	a[i] = i++

suggests that it's implementation-defined, and the word "unspecified"
appears in K&R2, while ANSI says that the behavior is undefined.
The sentence I've quoted above suggests not knowing how things
are done on various machines, while in fact what we really want
to know is that maybe they can't be done on various machines.
Nevertheless, the message -- that a bit of innocence may do you
good -- is in this case a good one.
That's my first answer. I realize that it's wholly unsatisfying to anyone who's still interested in this question. On to answer number two.
> As far as I know, it was created by a committee not brought down by Moses
> from the mountain top.  If I want to become a better C programmer, won't
> it help me to know why the committee acted as it did.
Perhaps, but again, only if you're very careful.
Let's say you're wondering why i++ * i++
is undefined. Someone
explains that it's because no one felt like defining which order
the side effects happen in. That's a nice answer: it's a bit
incomplete and perhaps a bit incorrect, but it's certainly easy
to understand, and since you insisted on an answer you could
understand (as opposed to something inscrutable like "it's just
undefined; don't do it"), it's the kind of answer you're likely
to get.
So next week, you find yourself writing something like
	i = 7;
	printf("%d\n", i++ * i++);

and you say to yourself, "Whoah, that might be undefined. How did
that explanation go again? It's undefined... because nobody felt
like saying... which order the side effects happened in. So either
the first i gets incremented first, and it's 7 times 8 is 56, or
the second i gets incremented first, and it's 8 times 7 is 56.
Hey! It may be undefined, but I get a predictable answer, so I can
use the code after all!"
So in this case, knowing a reason why has not made you a better programmer, it has made you a worse programmer, because some day when you're not around to defend it (or when you are around, but you don't have time to debug it), that code is going to print 49, or maybe 42, or maybe "Floating point exception -- core dumped".
If, on the other hand, you didn't know why it was undefined, just that it was undefined, you would have instead written
	printf("%d\n", i * (i+1));
	i += 2;

(or whatever it was that you meant), and your code would have been portable, robust, and well-defined.
Now, some of you may be thinking that
	printf("%d\n", i++ * i++);

is a ridiculous example which no one would ever write. That may be true, but it's the example I use in the FAQ list (and it's the oldest of the undefined-evaluation-order questions in the FAQ list) because it illustrates, I hope better than a "real" example would, the kind of contortions that people do get into when they start thinking too hard (but not too carefully) about undefined expressions. There was an actual question posted to comp.lang.c years ago (which I could probably find in my archives if I looked hard enough) in which the poster used essentially the same argument: "the evaluation order may be undefined, but no matter which order the compiler picks, I'll get the answer I want", and that was the inspiration for the
i++ * i++
question
(4.2
on the current list). (Remember, it's not the evaluation
order that's undefined, it's the entire expression that's
undefined.)
Two paragraphs back, I suggested that the programmer who did not know why something was undefined might have been better off than the programmer who did. I am not stating (not in the current climate, anyway) that you should not know why things are undefined. But if you insist on knowing, you are going to have to get the full story, not just some simplistic justifications. And you are going to have to be excruciatingly careful that you don't use your knowledge of why something is undefined to try to second-guess how a certain undefined construct (which you've decided you simply have to use in your program) is going to behave. Undefined behavior is slippery stuff: it really, truly is undefined; it really, truly can do anything at all; yet some people (particularly those who are always clamoring to know why things are undefined) are always trying to salvage some remnants of definedness out of an undefined situation, and are always getting themselves in trouble, and end up convincing the people who have to come along and pick up the mess that yes, it really was a mistake trying to explain why it was undefined, and we probably should have just said it was undefined because we said it was, after all.
This has been answer number two, and I realize it's still not satisfying, because I'm still suggesting that maybe you don't want to know why.
For answer number three, I'll quit beating around the bush and try to explain why, although I have to admit that, for me, the answers from here on are going to get less satisfying, and less nicely worded, because we get down into some realms that I don't usually think about (because I really have taken to heart K&R's advice about maintaining some innocence).
An international Standard like X3.159/ISO-9899 is a tremendously difficult document to write, even for a relatively simple programming language like C. When a Standard says something, it must say it definitively and unambiguously. (When it is inaccurate, it must be definitively inaccurate, and when it contains any areas of doubt or uncertainty, they must be rigidly defined areas of doubt or uncertainty.) The Standard must withstand intense scrutiny, for many years, from all sorts of observers, including language lawyers, professional nitpickers, grudging ex-users of other languages, semicompetent implementors of spiffola new compilers for scruffola old computers, and xenophobic adherents of other sociopolitical systems. (If you think that abstract constructions like programming languages are inherently removed from sociopolitical concerns, you haven't followed the crafting of any international Standards.)
Since it's so hard to specify things precisely enough for a Standard like this, the wise Standard-writer (especially for a Standard that hews to existing practice) won't specify anything more than is necessary. Features that everyone will use (or that someone might be reasonably expected to use) must be specified precisely, but features that no sane person could be expected to use in 6,000 years might be relegated to the dustbin, instead. (Naturally, since we're being precise, we'll have to precisely define the boundaries of the parts we've decided to be imprecise about; the paraphrase above of Vroomfondel's demand from The Hitchhiker's Guide to the Galaxy is not facetious.)
In practice, you can only tell what a computer program has done when it causes a side effect. In a nutshell, then, what a Standard for a programming language does is to define the mapping between the programs you write and the side effects they produce. Consequently, we are very concerned with side effects: how they're expressed in our programs, and how the Standard defines them to behave.
The fragment
	int i;
	i = 7;
	printf("%d\n", i);

contains two obvious side effects, and it is blatantly obvious what their effect should be, and what the Standard says agrees with what you think they should do.
The fragment
	int a[10] = {0}, i = 1;
	i = i++;
	a[i] = i++;
	printf("%d %d %d %d\n", i, a[1], a[2], a[3]);

contains a few more side effects, but (I assert) it is not obvious how they should behave. What does the second line do? In the third line, do we decide which cell of
a[]
we assign to
before or after we increment i
again? Plenty of people can
probably come up with plenty of opinions of how this fragment
ought to work, but they probably won't all agree with each other,
as the situation is not nearly so clear-cut.
Furthermore, a Standard obviously cannot talk about individual
program fragments like i = i++
and a[i] = i++
, because there are
an infinite number of those. It must make general prescriptions
about how the various primitive elements of the language behave,
which we then use in combination to determine how real programs
behave. If you think you know what a[i] = i++
should do, you
can't just say that; instead you have to come up with a general
statement which says how all expressions which modify and then
use the same object should act, and you have to convince yourself
that this rule is appropriate not only for the a[i] = i++
case
you thought about but also for all the other expressions
(infinitely many of them, remember) that you did not think
about, and you have to convince a bunch of other people of your
conviction.
The people writing the C Standard decided that they could not do
that. Instead, they decided that expressions such as a[i] = i++
and i++ * i++
and i = i++
would be undefined. They decided this
because these expressions are too hard to define and too stupid
to be worth defining. They came up with some definitions (no
small feat in itself) of which expressions this undefinedness
applies to. You've seen these definitions; they're the ones
that say
	Between the previous and next sequence point an object shall
	have its stored value modified at most once by the evaluation
	of an expression.  Furthermore, the prior value shall be
	accessed only to determine the value to be stored.
Now, I'll grant that these definitions are at least as hard to
understand as expressions such as a[i] = i++
. But there are
plenty of more complicated expressions which we can't begin to
make sense of (or even decide if they make any sense) without
these definitions. Perhaps I'll say more about them later, but
for the moment we're still trying to answer the question of why
some things (such as the expressions not defined by the two
definitions quoted above) are undefined, and my opinion is, as I
stated above: because they're too hard to define, and too stupid
to be worth defining. (If you're still counting, call this
answer number three.)
Now, you may think that I'm being overly pessimistic. You may
think that it's easy to define what a[i] = i++
or i++ * i++
or
i = i++
should do. Perhaps you think they should be evaluated
in exactly the order suggested by precedence and associativity.
Perhaps you think that i++
should increment i
immediately after
giving up i
's value and before evaluating any of the rest of the
expression. (These rules would say that the incremented value of
i
is used to decide which cell of a[]
to assign to, and that the
left-hand i++
in i++ * i++
happens first, and that i = i++
ends
up -- here's a surprise -- being exactly equivalent to i = i
,
unless perhaps if i
is volatile.) But -- and you're going to
have to take my word for it here, because I'm taking other
people's word for it -- these hypothetical "well-defined"
expression rules, though they're easy enough to state and
probably precise enough and probably comprehensive for the
cases we're interested in, would result in significantly poorer
performance than pre-ANSI C traditionally had and that we're
used to. Optimizing compilers would have to generate code which
evaluated expressions in little itsy bitsy steps, in lockstep
with the precedence- and associativity-based parse. They would
not be able to rearrange parts of the expression to make the best
use of the target machine's instruction repertoire or available
registers or pipelining or parallelization or whatnot.
When people talk about why it's good that an expression like
i++ * i++
is undefined, they usually speak of modern, parallel
machines which might get really confused if they try to do, not
the left-hand i++
first or the right-hand one first, but instead
both at once somehow. Instead of that example, let's imagine a
very simple CPU, analogous to a four-function calculator with
some memory registers. Let's play compiler, and imagine what
buttons on the calculator we'll push, operating under the
"well-defined" rules of the previous paragraph.
Since the (hypothetical) "well-defined" rules tell us exactly
which order to evaluate the expression in, our task is
straightforward. First, the left-hand side of the multiplication
operator: the value we want to multiply is i
's previous value.
Whoops, before we can think about doing the multiplication, the
rules say we have to do the increment. Whoops, once we do the
increment, we'll have lost the previous value. So we recall
from i
, store the value in a temporary register, add one to
the value, and store it back in i
. Now we can recall from the
temporary, and do the multiplication... no. Because first we
have to do the same thing on the right-hand side: recall from i
,
store it in a second temporary, add one to it, and store it back
in i
. Now, finally, we can recall the two values from the two
temporaries and do the multiplication.
If we remove the (still hypothetical, just for the purposes of
this article, not part of Standard C) "well-defined" rules, and
revert to ANSI's rules, under which we only have to make sure
that the increments to i
happen sometime before the next sequence
point, look how much easier things become: Recall from i
. Make a
note to increment its value later. Multiply it by: recall from i
again, make a note to increment it later. Now we've got the
product, do whatever we have to with it. We've still got these
two notes to increment i
, so handle them, and we're done.
I'd never actually gone through the analysis in this level of detail until just now, but this is exactly how the example
	int i = 7;
	printf("%d\n", i++ * i++);

from question 4.2 of the FAQ list could print 49 instead of 56. (No, I'm not going to come up with a scenario under which it could print 42.)
So, if you're still with me (and after 300 lines, I can see how
you might not be, but if you're one of the people who's been
insisting on a reason why, you better be :-)
), here is my
fourth, last answer: these things are undefined because if you
made them defined, compilers would have to be paranoid and
generate lockstep code which would be bulkier or slower or both.
Having come this far, I have to repeat that this last answer isn't one I'm particularly pleased with, partly because I'm not a code generation expert, and partly because I don't usually worry about efficiency very much. (On the other hand, it doesn't matter whether I'm pleased with it, or whether you're pleased with it either; it is one of the reasons.) You're probably also still harboring some doubts; you may be thinking that a compiler would only have to use the slow, bulky, lockstep code for expressions with multiple side effects, which many people (even some of the Doubting Thomases) agree that we shouldn't be writing anyway. Perhaps you're right. Perhaps we could have our cake and eat it too; perhaps we could have blazingly efficient code generated for polite little expressions and still have defined behavior for rogue ones. Perhaps we could, but not under the current Standard: it still does say that the rogue expressions are undefined. But the C Standard is under revision: perhaps, if this is important enough to you, you can convince the committee to pronounce defined behavior upon the rogue expressions, too. Good luck.
Steve Summit