26

In Dijkstra's Notes on Structured Programming he talks a lot about the provability of computer programs as abstract entities. As a corollary, he remarks on how testing isn't enough: for example, he points out that it would be impossible to test a multiplication function f(x,y) = x*y for all values of x and y across their entire ranges.

My question concerns his miscellaneous remarks on "lousy hardware". I know the essay was written in the 1970s when computer hardware was less reliable, but computers still aren't perfect, so they must make calculation mistakes sometimes. Does anybody know how often this happens, or whether there are any statistics on this?
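For a sense of scale, here is a rough back-of-the-envelope sketch of the multiplication example (the rate of 10^9 tests per second is just an assumed figure for illustration):

    #include <stdio.h>
    #include <math.h>

    /* Back-of-the-envelope: how long would exhaustive testing of f(x,y) = x*y
       take at an assumed 10^9 tests per second? */
    int main(void)
    {
        const double tests_per_second = 1e9;               /* assumed rate */
        const double seconds_per_year = 365.25 * 24 * 3600;

        double cases32 = pow(2.0, 64);    /* all pairs of 32-bit operands */
        double cases64 = pow(2.0, 128);   /* all pairs of 64-bit operands */

        printf("32-bit operands: %.3g cases, %.3g years\n",
               cases32, cases32 / tests_per_second / seconds_per_year);
        printf("64-bit operands: %.3g cases, %.3g years\n",
               cases64, cases64 / tests_per_second / seconds_per_year);
        return 0;
    }

Even the 32-bit case already works out to roughly 600 years; the 64-bit case is hopeless.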

6 Answers

14

Real/actual errors in a CPU's design aside, I think you are looking for this SO question: "Cosmic rays: what is the probability they will affect a program?" I can't get quotes from it because SO is blocked again at work here (sigh).

Ignoring the above, I seem to recall there were some FPU calculation bugs in early Pentiums, so they certainly are not infallible.
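For reference, that was the 1994 Pentium FDIV bug. A commonly quoted quick check is the division below; this is just a sketch assuming an IEEE-754 double FPU, on which the residual comes out as (essentially) zero, whereas affected Pentiums famously produced 256:

    #include <stdio.h>

    /* Quick check for the 1994 Pentium FDIV bug mentioned above. On a correct
       IEEE-754 double FPU the residual is (essentially) zero; affected Pentiums
       famously returned about 1.333739 instead of 1.333820 for the quotient,
       giving a residual of 256. */
    int main(void)
    {
        double x = 4195835.0, y = 3145727.0;
        printf("x / y    = %.15f\n", x / y);
        printf("residual = %.15f\n", x - (x / y) * y);
        return 0;
    }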

I have no hard evidence at hand, but my gut tells me you should probably be more concerned about bits in cache/RAM/disk being corrupted than about a calculation being incorrect.

Dan McGrath
7

A big issue in answering this question these days is that CPU manufacturers wrap the errata for a chip in an NDA (non-disclosure agreement). Intel does this, IIRC.

Many less secretive manufacturers issue corrections to the data sheet, but don't tell you what changed, so unless you fancy comparing all 300 pages, you'll have a hard time telling.

There have been plenty of bad instructions in CPUs; watching the Linux kernel report which ones it finds at boot is moderately interesting.
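If you want to see what your own machine admits to, reasonably recent x86 Linux kernels expose the list in /proc/cpuinfo. A minimal sketch (note the "bugs" field only exists on newer kernels):

    #include <stdio.h>
    #include <string.h>

    /* Print the "bugs" line of /proc/cpuinfo, where recent x86 Linux kernels
       list the hardware errata they know about and work around. The boot log
       (dmesg) contains similar reports. */
    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("/proc/cpuinfo"); return 1; }

        char line[4096];
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "bugs", 4) == 0) {  /* e.g. "bugs : cpu_meltdown spectre_v1 ..." */
                fputs(line, stdout);
                break;                            /* the line repeats for each logical CPU */
            }
        }
        fclose(f);
        return 0;
    }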

Very relevant is Google's paper on memory errors; they are more common than you think: "DRAM Errors in the Wild: A Large-Scale Field Study" by Schroeder, Pinheiro and Weber, originally published in ACM SIGMETRICS in 2009 and republished in Communications of the ACM, Feb 2011.

What all these memory errors mean for your question is that, without ECC memory, you will get wrong calculations anyway.

5

Back when I worked for a hardware vendor, it was claimed that no CPU ever built was bug-free, and that is just the logic bugs. Usually the manufacturer finds most of them and either respins the chip or finds BIOS settings that work around them. But in addition to the fact that things like cosmic rays occasionally flip a bit in memory (and memory usually has parity bits or SECDED circuitry to save your bacon), there is always a finite chance that a bit will be read incorrectly. Note that bits aren't real logical zeroes and ones, but noisy things like voltages and currents, and given finite noise in the system there is always the chance that a wrong bit will be read.

In the old days (as an app programmer), I found a few HW bugs myself, both of the bad-logic kind and of the "unit X in CPU Y is occasionally giving me a bad result, time to get the HW guys to replace a chip" variety. Actual circuits do drift with time and use, and if yours is getting ready to fail, you could start picking up bit errors, especially if you are overclocking or otherwise exceeding the recommended operating range.
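To make the parity/SECDED remark concrete, here is a toy Hamming(7,4) sketch, purely illustrative and not what any particular memory controller implements: a single flipped bit produces a non-zero syndrome that points at the bad position, so it can be corrected.

    #include <stdio.h>

    /* Toy Hamming(7,4) demo: encode 4 data bits, flip one codeword bit to
       simulate a transient fault, and show that the syndrome locates the bad
       bit so it can be corrected (the "single error correction" part of SECDED). */

    static int bit(unsigned x, int i) { return (x >> i) & 1; }

    /* Codeword positions 1..7 (stored in bits 0..6); positions 1, 2, 4 hold parity. */
    static unsigned encode(unsigned data)
    {
        int d0 = bit(data, 0), d1 = bit(data, 1), d2 = bit(data, 2), d3 = bit(data, 3);
        int p1 = d0 ^ d1 ^ d3;   /* parity over data at positions 3, 5, 7 */
        int p2 = d0 ^ d2 ^ d3;   /* parity over data at positions 3, 6, 7 */
        int p4 = d1 ^ d2 ^ d3;   /* parity over data at positions 5, 6, 7 */
        return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
    }

    /* Non-zero syndrome = 1-based position of a single flipped bit. */
    static unsigned syndrome(unsigned cw)
    {
        int s1 = bit(cw, 0) ^ bit(cw, 2) ^ bit(cw, 4) ^ bit(cw, 6); /* positions 1,3,5,7 */
        int s2 = bit(cw, 1) ^ bit(cw, 2) ^ bit(cw, 5) ^ bit(cw, 6); /* positions 2,3,6,7 */
        int s4 = bit(cw, 3) ^ bit(cw, 4) ^ bit(cw, 5) ^ bit(cw, 6); /* positions 4,5,6,7 */
        return s1 | (s2 << 1) | (s4 << 2);
    }

    int main(void)
    {
        unsigned data = 0xB;               /* data bits 1011 */
        unsigned cw   = encode(data);
        unsigned bad  = cw ^ (1u << 4);    /* simulate a flipped bit at position 5 */

        unsigned s = syndrome(bad);
        printf("codeword=0x%02X corrupted=0x%02X syndrome=%u\n", cw, bad, s);
        if (s) {
            unsigned fixed = bad ^ (1u << (s - 1));  /* flip the bit the syndrome points at */
            printf("corrected matches original: %s\n", fixed == cw ? "yes" : "no");
        }
        return 0;
    }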

It is a real issue for supercomputing, where computations involving 1e18 or more floating-point operations are contemplated.

4

This answer is about calculation errors in GPUs rather than CPUs, but it may still be relevant.

Given enough time, an Intel i7-3610QM and an Nvidia GeForce GTX 660 will disagree with one another when given the same instructions (CUDA 5.5, compute_20, sm_20).

So, one is left to conclude that one of the two makes an error.

During a particle-simulation feasibility-study benchmark, I noticed that after a thousand or so double-precision transformations (involving sin, cos, multiplication, division, addition and subtraction), errors started creeping in.

I'll give you a small excerpt of the numbers to compare (the first number is always the CPU, the second the GPU):

-1.4906010142701069
-1.4906010142701074

-161011564.55005690
-161011564.55005693

-0.13829959396003652
-0.13829959396003658

-16925804.720949132
-16925804.720949136

-36.506235247679221
-36.506235247679228

-3.3870884719850887
-3.3870884719850896

(note that not every transformation sequence yields an error)

While the maximum error is almost negligible (0.0000000000000401%) it still exists, and does contribute to cumulative error.

Now this error might be due to a difference in the implementation of one of the intrinsic libraries. Indeed, it looks like the GPU prefers to round down or truncate where the CPU rounds up. Curiously, this only seems to happen on negative numbers.

But the point is that identical instructions are not necessarily guaranteed to return identical results, even on digital machines.
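You don't even need a GPU to see differences of this size. The following CPU-only sketch (not my original benchmark) sums the exact same terms in two different orders and typically already disagrees in the last bits, simply because floating-point addition is not associative:

    #include <stdio.h>
    #include <math.h>

    /* CPU-only illustration (not the CPU-vs-GPU benchmark above): summing the
       exact same terms forwards and backwards typically differs in the last
       bits, because floating-point addition is not associative. The terms
       themselves are arbitrary, just chosen to mix magnitudes and signs. */
    int main(void)
    {
        enum { N = 100000 };
        static double terms[N];
        for (int i = 0; i < N; ++i)
            terms[i] = sin(i * 0.001) / (i + 1.0);

        double forward = 0.0, backward = 0.0;
        for (int i = 0; i < N; ++i)      forward  += terms[i];
        for (int i = N - 1; i >= 0; --i) backward += terms[i];

        printf("forward:  %.17g\n", forward);
        printf("backward: %.17g\n", backward);
        printf("delta:    %.3g\n",  forward - backward);
        return 0;
    }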

I hope this contributed.

EDIT: As a side note, in the case of GPU arithmetic errors, this (Ctrl+F "First GPU with ECC Memory Support") could also be of interest, although not necessarily relevant to the errors above.

guest
1

In terms of what you consider the actual "CPU" (execution units, pipeline, etc.), it pretty much never happens. There was a known issue with one of the Pentium flavors a while back, but that is the only one I have ever heard of. Now, if you consider the chipsets that are built into the processor, or at least into the same package, such as USB controllers, TSEC, DMA controllers or memory controllers, then there are plenty of errata out there. I doubt there is any sort of statistical data about that, though.

Pemdas
0

Another "lousy hardware" issue to consider in this context is that floating point hardware is inherently "lossy": it has limited precision, and with sufficiently large numbers (refer back to the original Dijkstra quote) you will not be able to distinguish between x and x + 1, or even x + 1000000. You can get "infinite" precision floating point libraries, but they are slow and ultimately still limited by available memory.

In short, Dijkstra was working in the realm of theory, and real hardware/software doesn't match up to theoretical ideals very well. (Remember, the original "Turing machine" specified an infinite paper tape.)

geekosaur