Demons May Fly Out Of Your Nose

Language standards give guarantees about valid program behaviour. Olve Maudel discovers what happens if you break your end of the bargain.

C is one of the most widely used programming languages of all time. It is still the preferred language in many domains. With C you get direct access to the hardware and you only get exactly what you ask for. Often this is what you need. But, as we all know, with great power comes great responsibility. One of the responsibilities is to learn and understand the contract between yourself and the compiler. If you break the contract, then anything can, and will, happen.

Consider the C snippet in Listing 1. You might say that this program prints "347", but are you sure about that? Could the code print "437" instead? Do you understand C well enough to defend your answer? Most programmers assume that expressions are evaluated in a certain order, usually from left to right. This is a valid assumption for most modern programming languages, and if you rewrite the code above into, say Java, C#, Python or Ruby, then you are guaranteed to get "347". C is unlike modern programming languages in many ways. One of them being that in C the evaluation order of expressions is mostly unspecified. For the code snippet above, both "347" and "437" are valid answers.

#include <stdio.h>

int a() { printf("3"); return 3; }
int b() { printf("4"); return 4; }

int main(void)
{
  int c = a() + b();
  printf("%d\n", c);
}

Listing 1

Is this a bug in the language specification? No, it is a feature, an important feature actually. With loose evaluation, the compiler is able to create very efficient code on a wide range of hardware platforms. This is exactly one of the design goals for C – portability and speed.

Sequence points

Let’s take a look at another C snippet:

  int  v[]  = {2,4,6,8,10,12};
  int i = 1;
  int n  = ++i  + v[++i];
  // what is the value of n?

A similar code snippet in, say Java, is perfectly OK, well defined, and the value of n will be 2+ v [3]=2+8=10. Most C compilers will happily create an executable from this code snippet, you will typically not get any warnings, and when you execute the code you might get 9, 10, 11, 12, 42 or whatever. This is a classic example of undefined behaviour in C. Not only can you end up with any random value, but when you have undefined behaviour anywhere in your code, the whole program is invalid – anything can happen! Really! One of the basic design principles in C is that the compiler is allowed to assume that you as a programmer only write correct code and therefore the compiler can, and will, make all kinds of ‘shortcuts’ when compiling your code. Or, in other words, it will certainly not waste any CPU-cycles to create a safety net around your code. If you break the contract, then the whole execution state of the program will be corrupt. In a one million line C program, a snippet like the one above invalidates the entire program. In theory, the program might end up formatting your hard drive or shut down the cooling system of the nuclear reactor, or, as they say on comp.std.c [ JargonFile ] – “ When the compiler encounters [a given undefined construct] it is legal for it to make demons fly out of your nose ”.

You may argue that you never write code like this (of course you don’t, I believe you). However, it is not enough to just say that you never write code like this, it is not enough to know that this is invalid code, you need to have an understanding of your programming language that is deep enough to also explain why this is invalid C code. To program correct C you need a deep understanding of the language [ Maudal11 ].

If you reflect on the fact that the evaluation order of expressions in C is mostly unspecified, you might come up with a reasonable argument for why

  int n = ++i + v[++i];

gives unpredictable result. As you cannot assume a certain evaluation order, then you do not know if the side effect of the first ++i will happen before or after evaluating v[++i] . There is a contract between the programmer and the compiler that has now been broken. The contract, as defined in the C standard, says that if you update a variable twice between two sequence points, then you get undefined behaviour. Sequence points are like heart beats of a C program. There will be a sequence point at the end of a full expression, before a function is entered in a function call, and a few other places, but sequence points are surprisingly sparse in C. There is often a lot of code that needs to be evaluated between two sequence points. At a sequence point in the code, all side effects of previous evaluations will be completed, but between sequence points you are not allowed to make any assumptions about the state of involved variables. This also means that in C, unlike most other languages, the following expression leads to undefined behaviour

  v[i] = i++;

because the assignment operator does not represent a sequence point in C.

Having few sequence points gives the compiler much freedom when optimizing the code. The compiler is allowed, and expected, to just evaluate the expression in a way that is optimal for the current hardware without even considering the possibility that a variable might be updated twice between sequence points. A complex expression for example, can result in the compiler building up a long pipeline of powerful machine instructions to be churned through the CPU in the most efficient way happily ignoring all potential conflicting side effects.

Signed integer overflow

In C, a signed integer overflow also gives undefined behaviour. Hardware architectures handle integer overflow in different ways and C responds by saying that if that happens, then the compiler is not to blame. So the following function can give undefined behaviour and invalidate the whole program:

  int the_answer(int seed)
  {
    int answer = seed + 42;
    return answer - seed;
  }

It is not the function itself that is the issue here, suppose this function is called with

  the_answer(INT_MAX)

then the contract is broken and the compiler is not responsible for the consequences. Of course, in this particular case, the compiler will probably not be able to see what is going on. Benign code will be created and if you know the underlying hardware well enough, you might get a result you expected. The whole program might work just fine – until you try to compile it for another hardware platform, or you change the optimization level.

C has many similar cases of undefined behaviour. Is this a good thing as well? Is it a feature? Let’s turn it around instead. Can you imagine how many extra machine code instructions would be needed to make edge cases like this portable across many hardware platforms? Perhaps you would need a virtual machine that the code could run on? Perhaps you would need fancy exception handling mechanisms that could flag run time errors like this? Well, there are languages that do these kinds of things, but C is not one of them. If you appreciate execution speed and direct access to native hardware, then declaring certain corner cases as undefined behaviour is often a good solution.

Surprising results

Believe it or not, there are examples of compilers that do pull pranks on you when encountering undefined behaviour [ Wikipedia ], but no reports so far confirms that a compiler actually try to make nasal demons fly out of your nose. The compilers usually try to be friendly and do the best they can. However, sometimes the compiler makes assumptions that can give very surprising results.

Here is a dramatic illustration of what might happen when breaking the rules of the language. As you know, in C all variables with non-static storage duration must be given an explicit value before they can be used, otherwise you break the contract and the compiler is no longer to blame for the consequences.

  bool b;
  if (b)
    printf("b is true\n");
  if (!b)
    printf("b is false\n");

Suppose you found this somewhere in your code. You immediately see that this is undefined behaviour since b is not properly initialized before use, and from a theoretical point of view anything can happen now (for example, nasal demons). You might, however, from a practical point of view, expect that the code will always either print b is true or b is false , but even that is not guaranteed. A compiler might use a byte in memory to represent a bool, and at the same time assume that this byte in memory, since it is a bool, can only have the bit pattern 00000000 or 00000001. With that assumption the compiler might generate machine instructions similar to the pseudo-code in Listing 2, which assumes that $b is either 0 or 1 .

load_a       $b                ; load value of b into register A
compare_a    0                 ; compare register A to 0
jump_equal   label1            ; skip next statement if A == 0
call         print_b_is_true   ; print "b is true"
label1:
  load_a     $b                ; load value of b into register A
  xor_a      1                 ; xor register A with 1
  compare_a  0                 ; compare register A to 0
  jump_equal label2            ; skip next statement if A == 0
  call       print_b_is_false  ; print "b is false"
label2:

Listing 1

Try to follow the code: If the value of b is 1, then the code will print "b is true" , if b is 0 then it will print "b is false" , but if b is a random value, say 42, then this code snippet will print:

  b is true
  b is false

Indeed, what I just described, is exactly what happens with a very popular C compiler if the bit pattern of the byte used to represent bool b happens to become anything but 0 or 1. (see [ Shroyer12 ] for more info)

This never happens in real code? Does it? Yes it happens! All the time! You typically observe the effects of undefined behaviour when you increase the optimization level, change the compiler or just update to a new version of the same compiler. If you have code that works just fine in one optimization level, but not when you change the optimization level – then you might want to start looking for undefined behaviour in your code. The compiler is not only allowed, but also expected, to generate code that is as efficient as legally possible. This means for example, that it will not, and should not, create any machine code instructions to check for or compensate for the possibility of invalid data.

ConclusIon

To program correct C you need to have a deep understanding of the programming language. You need to understand the contract between the programmer and the compiler, because if you break the contract then the compiler can, and probably will, create code that give unexpected results.

References

[JargonFile] ‘nasal demons’, http://www.catb.org/jargon/html/N/nasal-demons.html

[Maudal11] ‘Deep C (and C++)’ http://www.slideshare.net/olvemaudal/deep-c

[Shroyer12] Mark Shroyer ‘Both true and false: a Zen moment with C’ http://http://markshroyer.com/2012/06/c-both-true-and-false/

[Wikipedia] ‘Undefined behavior’, http://en.wikipedia.org/wiki/Undefined_behavior