What makes CPU cache memory so much faster than main memory? I can see some benefit in a tiered cache system. It makes sense that a smaller cache is faster to search. But there must be more to it.
6 Answers
In the case of a CPU cache, it is faster because it's on the same die as the processor. In other words, the requested data doesn't have to be bussed over to the processor; it's already there.
In the case of the cache on a hard drive, it's faster because it's in solid state memory, and not still on the rotating platters.
In the case of the cache on a web site, it's faster because the data has already been retrieved from the database (which, in some cases, could be located anywhere in the world).
So it's about locality, mostly. Cache eliminates the data transfer step.
Locality is a fancy way of saying that data is "close together," either in time or space. Caching with a smaller, faster (but generally more expensive) memory works because typically a relatively small portion of the overall data accounts for most of the accesses.
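As a rough illustration of spatial locality (not part of the original answer; the array size is arbitrary), the two loops below do the same arithmetic, but the first walks memory sequentially and reuses every cache line it fetches, while the second strides across rows and wastes most of each line it pulls in:

```c
/* Sketch only: same work, different spatial locality. */
#include <stdio.h>
#include <stdlib.h>

#define N 2048

double sum_row_major(const double *a)    /* cache-friendly */
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i * N + j];           /* consecutive addresses */
    return s;
}

double sum_col_major(const double *a)    /* cache-hostile */
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i * N + j];           /* jumps N doubles per step */
    return s;
}

int main(void)
{
    double *a = calloc((size_t)N * N, sizeof *a);
    if (!a) return 1;
    printf("%f %f\n", sum_row_major(a), sum_col_major(a));
    free(a);
    return 0;
}
```

On most machines the column-major version runs noticeably slower, even though it performs exactly the same number of additions.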
Further Reading
Cache (Computing) on Wikipedia
It is faster both because it is closer and because it is SRAM, not DRAM.
SRAM can be considerably faster than DRAM because the values are kept statically (the S in SRAM), so they don't have to be refreshed, which would otherwise steal cycles. DRAM is dynamic: its cells are like tiny rechargeable batteries, and you have to regularly recharge the ones holding a one so they don't drain away and become zeros. This costs cycle time, in addition to complicating how you have to access the bits.
Being on the same die as, or nearer to, the processor also reduces the round trip; both L1 and L2 are faster than DRAM from an access perspective.
So, compared apples to apples, SRAM is faster to access than DRAM, and the caches are usually on-chip, closer to the processor, or on faster buses than the DRAM, which makes the access time shorter as well.
One thing that should be mentioned explicitly is the impact of the speed of light. In this video Grace Hopper shows a piece of wire about a foot long, which is how far an electrical signal can travel in one nanosecond*. If a CPU is operating at 3 GHz, that implies a distance of about 4" per clock cycle. This is a hard physical limit on memory access speed, and it is a large part of why being close to the CPU (as the L1 cache is) allows memory to be faster.
EDIT: *actually how far light can travel in a vacuum; the distance through copper/silicon is less.
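A quick back-of-the-envelope check of those numbers (a sketch, using the vacuum figure of roughly 30 cm per nanosecond; real signals in copper or silicon are slower):

```c
/* Distance covered by light in one clock cycle at a few frequencies. */
#include <stdio.h>

int main(void)
{
    const double c_cm_per_ns = 29.98;          /* ~30 cm per nanosecond */
    const double freqs_ghz[] = { 1.0, 3.0, 5.0 };

    for (int i = 0; i < 3; i++) {
        double cycle_ns = 1.0 / freqs_ghz[i];  /* one clock period in ns */
        double dist_cm  = c_cm_per_ns * cycle_ns;
        printf("%.1f GHz: %.2f ns per cycle, light covers ~%.1f cm (~%.1f in)\n",
               freqs_ghz[i], cycle_ns, dist_cm, dist_cm / 2.54);
    }
    return 0;
}
```

At 3 GHz this works out to roughly 10 cm, i.e. the ~4" per cycle quoted above.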
Other answers already covered all the relevant bits: locality (and the associated data transfer cost, bus width, clock, and so on); the speed of light (again associated with transfer cost, bus width, and throughput); and the different memory technologies (SRAM vs. DRAM). All of this is seen in light of the cost/performance balance.
One bit that was left out, and is mentioned only in Darkhogg's comment: larger caches have better hit rates but longer latency. Multiple levels of cache were introduced in part to address this tradeoff.
There is an excellent question and answer on this point on Electronics SE.
From the answers, it seems to me that a point to be highlighted is: the logic that performs all the required operations for a cache read is not that simple (especially if the cache is set-associative, like most caches today). It requires gates and logic. So, even if we rule out cost and die space:
If someone were to try to implement a ridiculously large L1 cache, the logic that performs all the required operations for a cache read would also become large. At some point, the propagation delay through all this logic would be too long, and operations that had previously taken just a single clock cycle would have to be split across several clock cycles. This raises the latency.
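One way to see this size/latency tradeoff from software (a measurement sketch, not from the quoted answer; the sizes and the resulting numbers are entirely machine-dependent) is to chase a randomly ordered pointer chain of increasing size and time the average access. When the working set spills out of a given cache level, the time per access jumps:

```c
/* Sketch: nanoseconds per dependent load vs. working-set size. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STEPS 10000000UL

static double ns_per_access(size_t n)
{
    size_t *chain = malloc(n * sizeof *chain);
    if (!chain) return -1.0;

    /* Sattolo's algorithm: a random single-cycle permutation, so the walk
     * visits every slot and the prefetcher can't predict the next address. */
    for (size_t i = 0; i < n; i++) chain[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    size_t p = 0;
    clock_t t0 = clock();
    for (unsigned long s = 0; s < STEPS; s++)
        p = chain[p];                      /* each load depends on the last */
    clock_t t1 = clock();

    free(chain);
    double ns = (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9;
    return p > n ? -1.0 : ns / STEPS;      /* use p so the loop isn't removed */
}

int main(void)
{
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4)
        printf("%6zu KiB working set: %.2f ns per access\n",
               kib, ns_per_access(kib * 1024 / sizeof(size_t)));
    return 0;
}
```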
There are a lot of good points raised in the other answers, but one factor appears to be missing: address decoding latency.
The following is a vast oversimplification of how memory address decoding works, but it gives a good idea of why large DRAM chips are generally quite slow.
When the processor needs to access memory, it sends a command to the memory chip to select the specific word it wants to use. This command is called a Column Address Select (we'll ignore row addresses for now). The memory chip now has to activate the requested column, which it does by sending the address down a cascade of logic gates to enable the single wire that connects to all the cells in the column. Depending on how it's implemented, there will be a certain amount of delay for each bit of the address until the result comes out the other end. This is called the CAS latency of the memory. Because those bits have to be examined sequentially, this process takes a lot longer than a processor cycle (which usually has only a few transistors in sequence to wait for). It also takes a lot longer than a bus cycle (which is usually a few times slower than a processor cycle). A CAS command on a typical memory chip is likely to take on the order of 5 ns (IIRC - it's been a while since I looked at timings), which is more than an order of magnitude slower than a processor cycle.
Fortunately, addresses are broken into three parts (column, row, and bank), which allows each part to be smaller, and those parts can be processed concurrently; otherwise the latency would be even longer.
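A toy illustration of that split (the field widths below are invented for the example, not taken from any real DRAM part): the flat address is simply carved into bank, row, and column fields, each of which needs a far smaller decoder than the whole address would:

```c
/* Sketch: carving a flat address into bank/row/column fields. */
#include <stdint.h>
#include <stdio.h>

#define COL_BITS  10   /* hypothetical widths */
#define ROW_BITS  15
#define BANK_BITS  3

int main(void)
{
    uint32_t addr = 0x01AB37C4u;

    uint32_t col  =  addr                           & ((1u << COL_BITS)  - 1);
    uint32_t row  = (addr >> COL_BITS)              & ((1u << ROW_BITS)  - 1);
    uint32_t bank = (addr >> (COL_BITS + ROW_BITS)) & ((1u << BANK_BITS) - 1);

    printf("bank %u, row %u, column %u\n", bank, row, col);
    return 0;
}
```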
Processor cache, however, does not have this problem. Not only is it much smaller, so address decoding is an easier job, it actually doesn't need to decode more than a small fragment of the address (in some variants, none of it at all) because it is associative. That means that alongside each cached line of memory there are extra memory cells that store part (or all) of the address. Obviously this makes the cache even more expensive, but it means that all of the cells can be queried simultaneously to see whether they hold the particular line of memory we want, and then the only one (hopefully) that has the right data will dump it onto a bus that connects the entire memory to the main processor core. This happens in less than a cycle, because it is much simpler.
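A toy model of that lookup (not hardware-accurate; the sizes are made up): the address is split into tag, index, and offset, the index selects one set, and every way's stored tag is compared against the incoming tag. In silicon those comparisons happen in parallel; the loop below merely stands in for them:

```c
/* Sketch of a set-associative cache lookup. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6                  /* 64-byte lines          */
#define INDEX_BITS  7                  /* 128 sets               */
#define NUM_SETS    (1u << INDEX_BITS)
#define WAYS        8                  /* 8-way set-associative  */

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

static bool lookup(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);

    for (int way = 0; way < WAYS; way++)            /* parallel in silicon */
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return true;                            /* hit  */
    return false;                                   /* miss */
}

int main(void)
{
    uint32_t addr  = 0xDEADBEEFu;
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);

    cache[index][3].valid = true;                   /* plant a matching line */
    cache[index][3].tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("lookup(0xDEADBEEF): %s\n", lookup(addr) ? "hit" : "miss");
    return 0;
}
```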
One of the philosophies I studied was "obtain maximum throughput with minimum hardware movement," which applies to any cache-based memory, be it a CPU cache, buffer cache, or memory cache. The basic goal is achieved when there is little or no hardware movement needed to retrieve/read/write data, so the operation completes faster.
Data transfers from disk -> main memory (RAM, temporary storage) -> CPU cache (smaller temporary storage near the CPU for frequently accessed data) -> CPU (processing).
The CPU cache is a smaller, faster memory space which stores copies of the data from the most recently used main memory locations.
The buffer cache is a main memory area which stores copies of the data from the most recently used disk locations.
The browser cache is a directory or similar space which stores copies of the data from the websites most recently visited by the user.
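A minimal sketch of the "most recently used" idea shared by all three caches (purely illustrative, with a tiny fixed capacity): keep entries ordered by recency and evict the least recently used one on a miss.

```c
/* Sketch of a tiny LRU cache: front = most recent, back = least recent. */
#include <stdio.h>

#define CAP 4

static int keys[CAP];
static int used;                      /* number of valid entries */

static void access(int key)
{
    int i;
    for (i = 0; i < used; i++)
        if (keys[i] == key) break;    /* hit: found at position i */

    if (i == used) {                  /* miss */
        printf("miss %d\n", key);
        if (used < CAP) used++;
        i = used - 1;                 /* overwrite the LRU slot */
    } else {
        printf("hit  %d\n", key);
    }

    for (; i > 0; i--)                /* shift down, then put key in front */
        keys[i] = keys[i - 1];
    keys[0] = key;
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 4, 5, 2 };
    for (int i = 0; i < 7; i++)
        access(trace[i]);
    return 0;
}
```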
Reference: How Computer Memory Works