DRAM chips are very tightly packed. Research has shown that repeatedly accessing one row can flip bits in neighboring rows (the Row Hammer bug).

  • What is the probability of the bug triggering at random in a server-grade DRAM chip with ECC (the CMU-Intel paper cites e.g. the number 9.4x10^-14 for an unknown chip for one failure in a year's time)?
  • How do I know whether the bug is fixed before buying memory?
  • What should I do to counter malicious attempts to do privilege escalation by e.g. tenants or unprivileged users on e.g. CentOS 7?

Deer Hunter

3 Answers

The CMU-Intel paper you cited shows (on page 5) that the error rate depends heavily on the part number / manufacturing date of the DRAM module and varies by a factor of 10-1000. There are also some indications that the problem is much less pronounced in recently (2014) manufactured chips.

The number '9.4x10^-14' that you cited was used in the context of a proposed theoretical mitigation mechanism called "PARA" (which might be similar to the existing mitigation mechanism pTRR, pseudo Target Row Refresh). It is irrelevant to your question, because PARA has nothing to do with ECC.
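To see why a probabilistic mitigation produces such tiny numbers, here is a rough sketch of the idea. It rests on my own simplified model, not on the paper's exact derivation: every activation of an aggressor row refreshes one particular neighbor with probability p/2, independently. The values p = 0.001 and 139,000 activations are illustrative assumptions, and the result is a per-hammering-run probability, so it is not expected to reproduce the yearly 9.4x10^-14 figure.

```python
import math


def survival_probability(p, n_activations):
    """Chance that one victim row is never refreshed during n activations.

    Assumed model: each activation refreshes this victim with
    probability p / 2, independently of all other activations.
    """
    # log1p keeps precision when p / 2 is very small.
    return math.exp(n_activations * math.log1p(-p / 2))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not measured data).
    print(survival_probability(0.001, 139_000))
```

The probability shrinks exponentially with the number of activations an attacker needs, which is why even a small p drives it far below typical hardware failure rates.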

A second CMU-Intel paper (page 10) mentions the effects of different ECC algorithms on error reduction (factor 10^2 to 10^5, possibly much more with sophisticated memory tests and "guardbanding").

ECC effectively turns the Row Hammer exploit into a DoS attack. 1-bit errors will be corrected by ECC, and as soon as a non-correctable 2-bit error is detected the system will halt (assuming SECDED ECC).
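To illustrate what SECDED means in practice, here is a toy sketch using an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits and 1 overall parity bit. Real ECC DIMMs typically protect 64 data bits with 8 check bits and vendor-specific codes, so the function names and bit layout below are purely illustrative assumptions.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1
    if parity_odd:
        # Odd number of flips: the decoder assumes exactly one and repairs it.
        # Syndrome 0 here means the overall parity bit itself was hit.
        code = flip(code, [syndrome])
        status = "corrected"
    elif syndrome != 0:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return "uncorrectable", None
    else:
        status = "ok"
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                   # no flips
    print(decode(flip(word, [5])))        # one flip: repaired
    print(decode(flip(word, [3, 6])))     # two flips: detected only
    print(decode(flip(word, [1, 2, 4])))  # three flips: see note below
```

Note the last case: with three flips the decoder reports a successful correction but returns the wrong data. SECDED only promises correct behavior for up to two flipped bits per codeword.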

A solution is to buy hardware that supports pTRR or TRR. See the current blog post from Cisco about Row Hammer. At least some manufacturers seem to have one of these mitigation mechanisms built into their DRAM modules, but keep it deeply hidden in their specs. To answer your question: ask the vendor.

Faster refresh rates (32ms instead of 64ms) and aggressive Patrol Scrub intervals help, too, but would have a performance impact. But I don't know of any server hardware that actually allows fine-tuning these parameters.
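The effect of the refresh interval can be made concrete with simple arithmetic: the shorter the refresh window, the fewer row activations an attacker can squeeze in before the victim row is refreshed. The sketch below assumes a row cycle time of 50 ns, which is a round illustrative number and not a value for any specific module, and the flip threshold in the usage lines is hypothetical.

```python
def max_activations(refresh_window_ms, row_cycle_ns):
    """Upper bound on row activations that fit into one refresh window."""
    return (refresh_window_ms * 1_000_000) // row_cycle_ns


def refresh_is_sufficient(flip_threshold, refresh_window_ms, row_cycle_ns):
    """True if the window is too short to reach the module's flip threshold."""
    return max_activations(refresh_window_ms, row_cycle_ns) < flip_threshold


if __name__ == "__main__":
    ROW_CYCLE_NS = 50  # assumed, illustrative
    print(max_activations(64, ROW_CYCLE_NS))  # 1280000
    print(max_activations(32, ROW_CYCLE_NS))  # 640000
    # Hypothetical module that needs 1,000,000 activations to flip a bit:
    print(refresh_is_sufficient(1_000_000, 64, ROW_CYCLE_NS))  # False
    print(refresh_is_sufficient(1_000_000, 32, ROW_CYCLE_NS))  # True
```

Halving the window only halves the attacker's budget. It helps when the module's flip threshold happens to lie between the two budgets, and does nothing for modules that flip after fewer activations than the shorter window allows.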

I guess there's not much you can do on the operating system side except terminating suspicious processes with constant high CPU usage and high cache misses.
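As a rough sketch of that last idea: assume some monitoring agent already collects per-process CPU usage and cache-miss rates (for example from hardware performance counters). The `Sample` type, the thresholds and the `min_hits` parameter below are all hypothetical and would need tuning against real workloads; the code only shows the flagging logic, not the data collection.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring interval for one process (hypothetical format)."""
    pid: int
    cpu_percent: float
    cache_misses_per_sec: float


def suspicious_pids(samples, cpu_threshold=90.0,
                    miss_threshold=1_000_000.0, min_hits=3):
    """Return PIDs that exceeded both thresholds in at least min_hits samples."""
    hits = {}
    for s in samples:
        if (s.cpu_percent >= cpu_threshold
                and s.cache_misses_per_sec >= miss_threshold):
            hits[s.pid] = hits.get(s.pid, 0) + 1
    return sorted(pid for pid, count in hits.items() if count >= min_hits)


if __name__ == "__main__":
    samples = [Sample(101, 99.0, 5e6)] * 3 + [Sample(202, 99.0, 1e3)] * 3
    print(suspicious_pids(samples))  # [101]
```

Requiring several consecutive hits avoids flagging short bursts, but legitimate memory-bound workloads can look similar, so expect false positives.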

GregL
Daniel
The situation still seems quite unclear so I don't think your questions can be answered directly, but here is some relatively recent information as a partial answer. For news, follow the rowhammer-discuss mailing list.

I'm not sure it is possible at present with public information to avoid buying vulnerable RAM, nor to easily predict failure rates in existing hardware. Manufacturers have not been open with information about how their products are affected. It is possible to test memory already purchased using software tools, but you should be aware that running those tools for significant periods (hours) can permanently degrade RAM and cause faults in running software.

"Unnamed memory companies" have reportedly attempted to pay a bribe in return for Passmark Software not releasing a rowhammer test in their Memtest86 tool.

Intel Skylake hardware has been reported to be more vulnerable, not less, to rowhammer because of the addition of a new clflushopt instruction. This has already been exploited in rowhammer.js.

Daniel Gruss (coauthor of the rowhammer.js paper) answers some questions about mitigation, as of December 2015, in this talk:

  1. While some ECC RAM is less vulnerable than non-ECC RAM to rowhammer, other ECC RAM is more vulnerable than non-ECC RAM (link to question in video)
  2. Switching to a faster refresh rate is sufficient to prevent rowhammer with most but not all hardware - but not all BIOSes allow changing the refresh rate (link to question in video).

As a countermeasure, it may be possible to detect rowhammer attacks in progress, but I don't know that that has been done.

The accepted answer, unfortunately, has outdated information. TLDR: SECDED ECC will not save you and neither will pTRR or TRR. And no, faster refresh rate will not help either. :)

SECDED ECC (Single Error Correction and Double Error Detection Error Correcting Codes)

The accepted answer says:

ECC effectively turns the Row Hammer exploit into a DoS attack. 1-bit errors will be corrected by ECC, and as soon as a non-correctable 2-bit error is detected the system will halt (assuming SECDED ECC).

This information is outdated. More recent research has repeatedly demonstrated that Rowhammer attacks on ECC memory are possible. Most notably, in 2019 VUSec published ECCploit, "the new attack to reliably flip bits that completely bypass ECC protection" (emphasis mine).

TRR and pTRR (Target Row Refresh and Pseudo Target Row Refresh)

The accepted answer says:

A solution is to buy hardware that supports pTRR or TRR

This recommendation is unfortunately outdated too.

Notable papers:

In theory, the fact that past implementations of TRR were proven vulnerable does not rule out the possibility that a secure implementation could exist in the future. I would, however, advise that any new implementation of TRR be treated with scepticism - we can only trust it if it has been proven secure not just against previously published hammering patterns, but for the general case of an arbitrary attacker-controlled hammering pattern.

Faster refresh rates

The accepted answer says:

Faster refresh rates (32ms instead of 64ms) ... help, too

No, they don't. The TRRespass paper demonstrated bit flips even when both TRR and double refresh are used at the same time.

To answer the original question

Assuming the sources I have linked above can be trusted and are not mad ravings of paranoid tinfoil-wearers, here is my interpretation of the current state of the art:

  • What is the probability of the bug triggering at random in a server-grade DRAM chip with ECC?
      • I don't know - most sources I found focus on intentional exploitation rather than accidental bit flips.
  • How do I know whether the bug is fixed before buying memory?
      • Easy: the bug has probably not been fixed, and even if running previously published proofs of concept on your system produces no bit flips, someone will probably eventually reverse-engineer the magic sauce and find a hammering pattern that bypasses it. :)
  • What should I do to counter malicious attempts to do privilege escalation by e.g. tenants or unprivileged users on e.g. CentOS 7?
      • You can't (until new mitigations are implemented and are proven to be secure).