Page 36 - Developer
P. 36
INNER PRODUCT // AUTHORDAwsON
INNER PR ODUCT // b R UCE
Floating-Point Misses
Dealing with the Dangers of floating-point math
floating-point numbers are ubiquitous in games because they are a convenient way to handle the necessary math. Floats
gracefully handle overflow and underflow, and they have enough range and precision that we can sometimes forget how limited and
dangerous they are. Developers naturally want floats to be the same as the real numbers we studied in school, and floats maintain
the illusion quite well. But sometimes the illusion comes crashing down—and brings your game with it. In this article, we’ll take a
good long look at how floating-point math can make your life miserable—and some strategies for minimizing that misery.
32 bits, reinterPreted is accurate to about one part
» If you’re reading this, you should in a thousand. If you measure
already have a solid understanding someone’s height directly, then
of the float format, so we’ll keep your answer will be accurate to
the overview brief: A standard IEEE within about two millimeters. Now
float consists of a sign bit, 8-bit imagine that instead you decided
exponent, and 24-bit mantissa (see to stand the person on top of the
Figure 1). Yes, this adds up to 33 Empire State Building, measure
bits, but all that magically fits into a their height from the ground,
32-bit package, because the leading measure the height of the building,
one of the mantissa is, for numbers and then subtract the two numbers.
above FLT_MIN, implied instead of Instead of measuring 1.80 m
being explicitly stored. directly, you are now measuring
A handy feature of the 382.8 m and then subtracting
floating-point format is that 381.0 m, with both measurements
if you increment the 32-bit having an error of about .38 m. Your
representation of a float, then answer will be the person’s height,
you move to the next float away plus or minus about .76 m. The
from zero. Adjacent floats (of the loss of most of the top digits when
same sign) have adjacent integer subtracting similar numbers is
representations. Incrementing the known as catastrophic cancellation.
integer representation normally just
increments the mantissa, but if the addition oF dissiMilar
mantissa is all ones, incrementing » Now imagine that you have two
the 32-bit integer instead overflows numbers with dissimilar ranges—
the mantissa to zero and increments perhaps the height of the Empire
the exponent field. Due to the magic State Building and the height
of the implied leading one, this of a person—and both of these
still gives you the next float. This numbers are accurate to about one
technique works all the way from part in a thousand. If you add them
zero to infinity, and from negative a float is between one part in 8 Specifically, subtraction of numbers and store the result in a number
zero to negative infinity, and we’ll million and one part in 16 million. with similar magnitude (or, that is accurate to one part in a
discuss its application later. equivalently, addition of opposite thousand, then most of the digits of
The range of a float is generally don’t torMent signed numbers with similar the small number will be lost. If you
plenty large enough, but the the Math gods magnitude) loses a lot of precision, do this repeatedly, the cumulative
precision is a bit weak. A 24-bit » 24 bits of precision is enough and so does adding or subtracting a loss can grow arbitrarily big.
mantissa means (for instance) for a lot of purposes, but if you’re small number to a large number.
that for numbers above about 16 not careful you may inadvertently it’s the sig-Figs, baby
million, a float actually has less throw away most of that precision. subtraction oF siMilar » It surprises a lot of people
precision than a 32-bit integer. The It turns out that subtraction and » Imagine that you want to that multiplication and division
rule of thumb is that, over virtually addition are the most dangerous measure someone’s height with aren’t the source of more math
the entire range, the precision of operations for losing precision. a very long tape measure that errors—probably because those
Figure 1: the anatomy of a standard ieee float.
34 gAmE DEvElOPER | OCTObER 2012