I’ve been getting a lot of requests from people to talk about the recent Excel bug. For those of you who haven’t heard about this, in Excel 2007, floating point calculations that should result in a number very, *very* close to either 65,535 or 65,536 are displaying their result as 100,000. It’s only in the display though – the underlying number is actually represented correctly, so if you subtract 2 from 65,536, you’ll get the correct answer of 65,534 – not 99,998.

I can’t give you the specifics – because without seeing the Excel code, I can’t tell exactly what they got wrong. But I’ve got some pretty good suspicions, so I’ll do my best to explain the background that leads to the problem, and what I think is probably going on. (Another excellent explanation if this is in the Wolfram blog post that I mentioned in my fast arithmetic fractals post this weekend.)

When you’re working with numbers on in a program like excel, you’re using something called floating point numbers. Floating point is annoying, seriously annoying. It’s an *approximation* of real numbers using a finite precision in a way that allows arithmetic operations to be done reasonably quickly. (It’s still a whole lot slower than working with integers, but it’s usually pretty good.)

Without going into the gory details, what floating point means is that the number is represented by a fixed-precision number in a strange form of scientific notation. So, for example, in 4-digit base-10 floating point, the number 600 is actually represented as 0.6000×10^{3}; 31.4159 is represented as 0.3142×10^{3}. Notice what happened in that second case – we lost two digits, because the 4-digit float representation didn’t have enough precision to represent all of the digits.

To make matters worse, floating point doesn’t using base-10. It’s binary: so numbers are really represented by a base-2 fraction and a base-2 exponent. So, for example, 1/4 is really represented as

0.1×2^{-1}.

There are a couple of unfortunate side-effects of this:

- The one which most people have seen in some form, is that almost every floating point computation accumulates errors: because of the finite precision,

every step performs a round-off, which can add a bit of error. The more you do, the larger the accumulated effects of roundoff errors can become. The typical example of this is on many calculators, if you do 2 – (sqrt(2)^2), you’ll get something odd like 0.00000000012. - The order in which you perform a computation can have a huge impact on the precision of the result.

For example, in base-10 4 digit floating point, with 1-digit exponents, if you have (0.1E-9)×(0.3E0)×(0.4E8), if you do the first pair of numbers first, it evaluates to

0E0×0.4E8=0. If you do it in the second order, it evaluates to 0.1E-9×0.12E8=0.12E-2. - In order to display a value to a user, you need to convert it from base-2 scientific notation

to a standard base-10 representation. There are a bunch of problems here: doing the conversion is

potentially quite slow, involving multiple divisions (with the attendant round-off errors); and

you want to present the number to the user correctly; while the base-10 conversion might result in 1.9999999999999999999999999999999, you probably want to output that as 2.

The last is the problem in Excel. Converting to print format is very slow – so people devote enormous amounts of effort to finding algorithms that make it faster. Every corner that can reasonably be cut in order to save a bit of time is cut. At the same time, you want to get the number out correctly – so along with doing your clever optimization, you’re always keeping your eyes open for one of the cases where the round-off errors should be corrected – like rounding 65,534.999999 to 65,535.

The problem with that is that floating point representation is very complicated, and there are an incredible number of different numbers that can be represented.

It’s very easy to write code which *seems* correct, and which works perfectly on nearly every floating point value, but which contains an error which will manifest on a dozen bit patterns. You can test that code on billions of different values, and still miss the crucial couple that reveal the problem.

It looks like the Excel bug is one of those cases. My suspicion is that there’s an unfortunate interaction between the code that tries to prevent generating numbers like 1.99999999999999999 instead of 2, and the code that does the optimized conversion. What it looks like is that the rounding-error correction is probably over-doing its job; it’s doing the roundoff and presenting its result back The output code in what looks like it’s already base-10 format.

In general, I like ragging on Microsoft as much (if not more than) you average Mac user. But I can’t say that I really blame them too much for this one. Floating point arithmetic and conversion is a nightmare – it’s enormously complex, and the demand for speed is extraordinary. Slowing things down a tiny bit – taking an extra microsecond per conversion – can have a huge impact on the performance of the system. That kind of code is under constant pressure to squeeze out every last drop of performance. And errors like this are *so* easy to miss, while catching them

in testing is almost impossible. You can only reliably catch this kind of problem by doing a detailed analysis of the logic of the code, and all you need to do is miss one out of hundreds of different corner cases, and you’re hosed. It’s just so hard to get right that the only surprise is that they’ve made so few mistakes like this.