
I was watching http://www.joelonsoftware.com/items/2011/06/27.html and laughed at Jon Skeet's joke about 0.3 not being 0.3. I've personally never had problems with floats/decimals/doubles, but then I remembered that I learned 6502 very early and never needed floats in most of my programs. The only time I used them was for graphics and math, where inaccurate numbers were OK and the output went to the screen rather than being stored (in a DB or file) or depended on.

My question is: where are the places where you typically use floats/decimals/doubles? That way I know to watch out for these gotchas. With money I use longs and store values by the cent; for the speed of an object in a game I add ints and divide (or bit-shift) the value to know whether I need to move a pixel or not. (I made objects move in the 6502 days; we had no divide and no floats, but we had shifts.)

So I was mostly curious.

13 Answers


You use them when you're describing a continuous value rather than a discrete one. It's not any more complicated to describe than that. Just don't make the mistake of assuming any value with a decimal point is continuous. If it changes all at once in chunks, like adding a penny, it's discrete.

Karl Bielefeldt

You really have two questions here.

Why does anyone need floating point math, anyway?

As Karl Bielefeldt points out, floating point numbers let you model continuous quantities - and you find those all over the place - not just in the physical world, but even places like business and finance.

I've used floating point math in many, many areas in my programming career: chemistry, working on AutoCAD, and even writing a Monte Carlo simulator to do financial predictions. In fact, there's a guy named David E. Shaw who's applied floating-point-based scientific modeling techniques to Wall Street to make billions.

And, of course, there's computer graphics. I consult on developing eye candy for user interfaces, and trying to do that nowadays without a solid understanding of floating point, trigonometry, calculus, and linear algebra, would be like showing up to a gun fight with a pocketknife.

Why would anyone need a float versus a double?

With IEEE 754 standard representations, a 32-bit float gives you about 7 decimal digits of accuracy, and exponents in the range 10^−38 to 10^38. A 64-bit double gives you about 15 decimal digits of accuracy, and exponents in the range 10^−307 to 10^307.
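
That difference in precision is easy to see directly (my illustration, not part of the original answer): compute 1/3 at both widths and compare.

```java
public class PrecisionDemo {
    public static void main(String[] args) {
        float  f = 1.0f / 3.0f;   // ~7 significant decimal digits
        double d = 1.0  / 3.0;    // ~15 significant decimal digits
        System.out.println(f);    // prints 0.33333334
        System.out.println(d);    // prints 0.3333333333333333
        // The float differs from the double result by roughly 1e-8:
        System.out.println(Math.abs(f - d));
    }
}
```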

It might seem like a float would be enough for anything anybody would reasonably need, but it isn't. For instance, many real-world quantities are measured to more than 7 significant digits.

But more subtly, there's a problem colloquially called "roundoff error". Binary floating point representations are only exact for values whose fractional parts have a denominator that's a power of 2, like 1/2, 1/4, 3/4, etc. To represent other fractions, like 1/10, you "round" the value to the closest binary fraction, but it's a little wrong; that's the "roundoff error". Then when you do math on those inaccurate numbers, the inaccuracies in the results can be far worse than what you started with; sometimes the error percentages multiply, or even pile up exponentially.
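
The classic demonstration of this roundoff error (my example) is that 0.1 and 0.2 have no exact binary representation, so their sum is not exactly 0.3:

```java
public class RoundoffDemo {
    public static void main(String[] args) {
        // 0.1 and 0.2 are each rounded to the nearest binary fraction,
        // so adding them does not yield exactly 0.3:
        System.out.println(0.1 + 0.2);        // prints 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3); // prints false
    }
}
```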

Anyway, the more binary digits you have to work with, the closer your rounded-off binary representation will be to the number you're trying to represent, so its roundoff error will be smaller. Then when you do math on it, if you have a lot of digits to work with, you can do a lot more operations before the cumulative roundoff error piles up to where it's a problem.

Actually, 64-bit doubles with their 15 decimal digits aren't good enough for many applications. I was using 80-bit floating point numbers in 1985, and IEEE now defines a 128-bit (16-byte) floating point type, for which I can imagine uses.

Bob Murphy

Because they are, for most purposes, more accurate than integers.

Now how is that? "For the speed of an object in a game..." is a good example of such a case. Say you need to have some very fast objects, like bullets. To describe their motion with integer speed variables, you need to make sure the speeds fit in the range of the integer variables, which means you cannot have an arbitrarily fine raster.

But then, you might also want to describe some very slow objects, like the hour hand of a clock. Since that's about 6 orders of magnitude slower than the bullet objects, the first ld(10⁶) ≈ 20 bits are zero, which rules out short int types from the start. OK, today we have longs everywhere, which leaves us with a still-comfortable 12 bits. But even then, the clock's speed will be exact only to four decimal places. That's not a very good clock... but it's certainly OK for a game. Still, you would not want to make the raster much coarser than it already is.

...which leads to problems if some day you should like to introduce a new, even faster type of object. There is no "headroom" left.

What happens if we choose a float type instead? Same size of 32 bits, but now you have a full 24 bits of precision, for all objects. That means the clock has enough precision to stay in sync to the second for years. The bullets get no higher precision, but they only "live" for fractions of a second anyway, so it would be utterly useless if they did. And you don't get into any kind of trouble if you want to describe even much faster objects (why not the speed of light? No problem) or much slower ones. You certainly won't need such things in a game, but you sometimes do in physics simulations.

And with floating-point numbers you get this precision always, without first having to cleverly choose some non-obvious raster. That is perhaps the most important point, as such choices are very error-prone.


It's a common misconception that everywhere you're dealing with money, you should store its value as an integer (cents). While in some simple cases, like an on-line store, that's true, if you have something more advanced it doesn't help much.

Let's have an example: a developer makes $100,000 a year. What is his exact monthly salary? Using integers you get the result $8,333.33 (833,333¢), which multiplied by 12 is $99,999.96. Did keeping it as an integer help? No, it didn't.
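
A quick sketch of that calculation in integer cents (my code, not the answerer's) shows where the four cents go:

```java
public class SalaryDemo {
    public static void main(String[] args) {
        long annualCents = 100_000L * 100;     // $100,000.00 stored in cents
        long monthlyCents = annualCents / 12;  // integer division drops the remainder
        System.out.println(monthlyCents);      // 833333  -> $8,333.33
        System.out.println(monthlyCents * 12); // 9999996 -> $99,999.96, four cents short
    }
}
```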

Do banks always use decimal/integer values? Well, they do for the transactional part. But, for example, as soon as you start talking about investment banking, with the exception of keeping track of actual transactions, everything else is floats. Since it's all in-house code you won't see it, but you can take a peek at QuantLib, which is essentially the same (except much cleaner ;-).

Why use floats? Because decimal doesn't help at all when you're using functions like square root, logarithm, powers with non-integer exponents, etc. And of course floats are way faster than Decimal types.
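
For instance (my illustration), annualizing a return requires a fractional exponent, which integer cents simply cannot express but a double handles directly:

```java
public class GrowthDemo {
    public static void main(String[] args) {
        // Annualize a 0.5% weekly return via a non-integer exponent:
        double weeklyGrowth = 1.005;
        double annual = Math.pow(weeklyGrowth, 365.0 / 7.0);
        System.out.println(annual); // roughly 1.297, i.e. about 29.7% a year
    }
}
```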

vartec

What you have described are perfectly good workarounds for situations where you control all the inputs and outputs.

In the real world that isn't the case. You'll need to cope with systems that supply you their data as some real value to some degree of precision, and that expect you to return the data in the same format. In such cases you will encounter these problems.

In fact you will encounter these problems even if you use the tricks you list. When calculating 17.5% tax on a price, you're going to get fractional cents whether you store the value as dollars or as cents. You have to get the rounding correct, because the tax man gets very upset if you don't pay him enough. Using the correct money types (whatever they are in the language you're using) will save you from a world of pain.
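
A sketch of that fractional-cent problem in pure integer math (my example; the round-half-up rule is hypothetical, as real rounding rules vary by jurisdiction):

```java
public class TaxDemo {
    public static void main(String[] args) {
        long priceCents = 999;                  // $9.99 stored in cents
        // 17.5% = 175/1000. The exact tax is 999 * 175 = 174825
        // thousandths of a cent: fractional even when working in cents.
        long taxThousandths = priceCents * 175;
        System.out.println(taxThousandths);     // 174825
        // Round half up to whole cents (hypothetical rule):
        long taxCents = (taxThousandths + 500) / 1000;
        System.out.println(taxCents);           // 175, i.e. $1.75
    }
}
```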

ChrisF

"God created the whole numbers, everything else is Man’s work." – Leopold Kronecker (1886).

By definition, you don't need any other kinds of numbers. Turing completeness for a programming language rests on simple relationships among the various kinds of numbers. If you can work with whole numbers (a.k.a. natural numbers), you can do anything.

The question is kind of specious because you don't need them. Perhaps you want places where it's convenient or optimal or cheaper or something?

S.Lott

The primary advantage of floating-point types is that, from a run-time perspective, two or three formats (I wish more languages supported 80-bit formats) will suffice for the vast majority of computational purposes. If programming languages could easily support a family of fixed-point types, the hardware complexity required for a given level of performance would often be lower with fixed-point types than with floating-point. Unfortunately, providing such support is far from "easy".

For a programming language to satisfy 98% of applications' numerical needs efficiently, it would have to include dozens of fixed-point types and define operations for what may be hundreds of combinations; further, even if a programming language had wonderful fixed-point support, some applications would still need to maintain roughly constant relative precision over a range large enough to require floating point. Given that floating-point math is going to be necessary on some occasions in any event, having hardware vendors focus on math performance with two or three floating-point formats, and having code use those formats whenever they work reasonably well, will generally achieve better "bang for the buck" than trying to optimize the behavior of fixed-point math.

Incidentally, fixed-point math was more advantageous with 8-bit and 16-bit processors than with 32-bit ones. On an 8-bit processor, in a situation where 32 bits wouldn't quite suffice, a 40-bit type would only cost 25% more space and 25-50% more time than the 32-bit type, and would require 37.5% less space and 37.5-60% less time than a 64-bit type. On a 32-bit platform, if a 32-bit type won't suffice for something, there's often little reason to use anything less than 64 bits. If a 48-bit fixed-point type would be adequate, a 64-bit "double" will work just as well as would the fixed-point type.
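
To make the fixed-point idea concrete (my sketch, not the answerer's), here is a minimal Q16.16 format, where every value is an int holding the real value times 2^16:

```java
public class Fixed1616 {
    static final int FRAC_BITS = 16;
    static final int ONE = 1 << FRAC_BITS;   // 1.0 in Q16.16

    static int fromDouble(double d) { return (int) Math.round(d * ONE); }
    static double toDouble(int q)   { return (double) q / ONE; }

    // Multiply through a 64-bit intermediate, then shift the scale back out.
    static int mul(int a, int b) { return (int) (((long) a * b) >> FRAC_BITS); }

    public static void main(String[] args) {
        int threeHalves = fromDouble(1.5);   // 98304
        int quarter     = fromDouble(0.25);  // 16384
        System.out.println(toDouble(mul(threeHalves, quarter))); // 0.375
    }
}
```

Note how the programmer, not the hardware, has chosen the scale factor of 2^16; that choice is exactly the "micromanagement" the other answers mention.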

supercat

In a sentence: floating-point types encapsulate the conversion to and from integer values (which is all the computer knows how to deal with at the binary level; there is no decimal point in binary), providing a logical, generally easy-to-understand interface for calculations with decimal numbers.

Frankly, saying that you don't need floats because you know how to do decimal math using integers is like saying you know how to do arithmetic longhand, so why use a calculator? You know the concept; bravo. That doesn't mean you have to exercise that knowledge all the time. It is often faster, cheaper, and more understandable to a non-binary-whiz to simply say 3.5 + 4.6 = 8.1 rather than converting the significant figures to an integer quantity.

KeithS

Floats make a lot of sense if you look at them from a measurement/experimental science/engineering background. They are essentially the computerised version of the "standard notation" that kids are taught about in science classes. Generally in measurement, the magnitude of your errors is proportional to the magnitude of the value being measured.

Fixed point generally means you need to micromanage your scale factors. If your numbers are too small, your results become very inaccurate. If your numbers are too large, they overflow your integer type.

This gets worse if what you are implementing is not a concrete application where you know the physical meaning of the values, but some generic bit of maths functionality that will be used in multiple applications, with the application author assigning meaning to the numbers. What should the argument and result types of a fixed-point square or square-root function be? What about an arbitrary power function?

Floating point, in most cases, frees you from needing to micromanage your scale factors. A single precision floating point number gives you about 7 digits of precision, which most of the time is far better than your experimental accuracy. Double precision is good enough for nearly any real-world measurement.

That said, there are a number of gotchas with floats.

  1. Many numbers that are nice and clean in decimal don't have a perfect representation as binary fractions. This is most acute with money, but it can come up in other situations too.
  2. Coordinates can be problematic, because they get less accurate the further you get from the centre of the coordinate system.
  3. Repeated addition of a small value to a large value won't always produce the results you expect. This often comes up with things like movement where a "speed" value is repeatedly added to a "distance" or "position" value.
  4. You can get different results based on compiler optimisations, target architecture and global FPU state. This can be problematic if you need your results to be reproducible.
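
Gotcha 3 can be made concrete (my example): at 2^24 the spacing between adjacent floats exceeds 1, so adding 1.0f to a position silently does nothing.

```java
public class AbsorptionDemo {
    public static void main(String[] args) {
        float position = 16_777_216f;   // 2^24: beyond here floats are spaced 2 apart
        float moved = position + 1.0f;  // 16777217 is not representable; rounds back down
        System.out.println(moved == position);  // prints true: the update is lost
        System.out.println(Math.ulp(position)); // prints 2.0: the spacing at this magnitude
    }
}
```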
Peter Green

Generally, you should be very careful when using them. Understanding the loss of precision that can arise from even simple calculations is a challenge. For example, averaging a list of numbers like this is a very bad idea:

double average(List<Double> data) {
  double ans = 0;
  for(Double d : data) {
    ans += d; // once ans grows large, small values of d are partially or wholly lost
  }
  return ans / data.size();
}

The reason is that, for sufficiently large lists, you effectively lose the contribution of individual data points once ans gets large enough. The problem with this code is that for small lists it'll probably just work; it's only at scale that it breaks.
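
One standard mitigation (not mentioned in the original answer) is Kahan compensated summation, which carries the lost low-order bits along in a correction term:

```java
import java.util.List;

public class KahanAverage {
    // Compensated summation: c accumulates the low-order error lost at each step.
    static double average(List<Double> data) {
        double sum = 0.0, c = 0.0;
        for (double d : data) {
            double y = d - c;   // re-inject the previously lost error
            double t = sum + y; // big + small: low bits of y may be lost here...
            c = (t - sum) - y;  // ...but this recovers exactly what was lost
            sum = t;
        }
        return sum / data.size();
    }

    public static void main(String[] args) {
        // The naive loop above returns 0.0 for this list; Kahan returns 0.5:
        System.out.println(average(List.of(1e16, 1.0, 1.0, -1e16)));
    }
}
```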

Personally, I think you should only use them when: a) the calculation really must be fast; and b) you don't care that the result may be way off (unless you really know what you're doing).

redjamjar

One thought is that you would use float or double representations when you need to deal with values outside the integer range.

Today's architectures (roughly) have a signed integer range of ±2,147,483,647 (32-bit) or ±9,223,372,036,854,775,807 (64-bit). Unsigned extends that by a factor of 2.

IEEE 754 floats (roughly) go from ±1.4×10^−45 to ±3.4×10^38 in magnitude. Doubles extend that range: the smallest positive values are about 5×10^−324 (subnormal) and 2.225×10^−308 (normal), with lots of conditions and specifics omitted here.

Of course, the most stunningly obvious reason is that you might need to represent -0 ;-)

Stephen

The usual reason is that they are fast, as the JVM typically uses the underlying hardware's floating-point support (unless you use strictfp).

See https://stackoverflow.com/questions/517915/when-to-use-strictfp-keyword-in-java for what strictfp implies.


That's why we need 256-bit operating systems.

The Planck length (the smallest distance you can measure) is about 10^−35 m. The observable universe is 14 billion parsecs across, on the order of 10^25 m. So you could measure anything in units of the Planck length, as integers, with only about 200 bits of precision.
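
The arithmetic checks out; here is a quick sketch using Java's BigInteger, which already handles arbitrary-width integers without any 256-bit OS:

```java
import java.math.BigInteger;

public class PlanckDemo {
    public static void main(String[] args) {
        // Universe size (~10^25 m) divided by the Planck length (~10^-35 m)
        // spans about 10^60 Planck lengths:
        BigInteger planckUnits = BigInteger.TEN.pow(60);
        System.out.println(planckUnits.bitLength()); // prints 200: 200 bits suffice
    }
}
```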