
On Intel CPUs with the Nehalem architecture and later, what is the interaction between the L1 cache, the L2 cache, the L1 DTLB and the L2 DTLB? None of the diagrams I have found make it clear whether the CPU consults the L1 DTLB first and then goes to the L1D cache, or whether it only consults the DTLB after a cache miss in the L1D or the L2 cache.

user997112

2 Answers


Let's walk through what actually happens when an instruction like mov eax,[ds:ebx*2+4] is executed.

First the CPU has to calculate the virtual address; in this case that is ebx*2+4.

Next the CPU adds the segment register's base address to the virtual address to get a linear address. The base address for each segment register is stored in the CPU, and for most segment registers (under most OSs) it's zero, so this is fast. Note: there are also some protection checks involved here, but ignore that.
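These two steps can be sketched as plain arithmetic (the register value and the flat segment base below are hypothetical, chosen only for illustration):

```python
# Hypothetical values for illustration.
ebx = 0x1000
ds_base = 0                                  # flat segmentation: base is zero on most OSs

virtual = ebx * 2 + 4                        # the address the instruction computes
linear = (ds_base + virtual) & 0xFFFFFFFF    # add the segment base (32-bit wrap)

print(hex(linear))                           # 0x2004
```

With a zero segment base the virtual and linear addresses are identical, which is why this step is essentially free on modern OSs.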

Once the CPU has a linear address, that address needs to be converted into a physical address (i.e. an actual RAM location). To do this, memory is broken up into (4 KiB, or possibly larger) pages, and there are several layers of paging structures. For example (in long mode), the CPU would use the PML4 (Page Map Level 4) to find the PDPT (Page Directory Pointer Table), then use a PDPT entry to find the PD (Page Directory), then use a PD entry to find the PT (Page Table). The page table entry contains the physical address of the page. Adding the "offset within page" from the virtual address to the physical address of the page gives the final physical address. Note: there are also some protection checks involved at each step, but ignore that.
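The four-level walk can be sketched in Python. Here `page_tables` is a hypothetical dict-of-dicts standing in for tables the hardware would actually fetch from RAM, and the table addresses are made up:

```python
def page_walk(page_tables, cr3, linear):
    """Long-mode walk with 4 KiB pages; protection checks omitted."""
    pml4_i = (linear >> 39) & 0x1FF   # bits 47..39 index the PML4
    pdpt_i = (linear >> 30) & 0x1FF   # bits 38..30 index the PDPT
    pd_i   = (linear >> 21) & 0x1FF   # bits 29..21 index the PD
    pt_i   = (linear >> 12) & 0x1FF   # bits 20..12 index the PT
    offset = linear & 0xFFF           # bits 11..0: offset within the page

    pdpt = page_tables[cr3][pml4_i]   # PML4 entry points at the PDPT
    pd   = page_tables[pdpt][pdpt_i]  # PDPT entry points at the PD
    pt   = page_tables[pd][pd_i]      # PD entry points at the PT
    page = page_tables[pt][pt_i]      # PT entry holds the page's physical address
    return page + offset              # physical page base + offset within page

# Tiny hypothetical table set mapping linear 0x2004 to physical 0x9004:
tables = {0x1000: {0: 0x2000},        # PML4 (CR3 = 0x1000)
          0x2000: {0: 0x3000},        # PDPT
          0x3000: {0: 0x4000},        # PD
          0x4000: {2: 0x9000}}        # PT: page 2 maps to physical 0x9000
print(hex(page_walk(tables, 0x1000, 0x2004)))   # 0x9004
```

Note that a single translation touches four separate tables, i.e. four extra memory accesses before the data itself can even be fetched.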

All these table lookups take time. To speed it up the CPU caches the "virtual address to physical address" translations (including the final/resulting permissions for protection checks) in the TLB (Translation Look-aside Buffer).

Once the CPU knows the physical address, it can read the data from that physical address. This can be slow too, so the CPU uses other caches to cache the actual data.

Now; the problem with caches is that (due to "technical stuff") the larger they are the slower they are. To get both good latency and good capacity, CPUs use multiple levels of caches - e.g. a very small, very fast L1, a larger but slower L2, etc. For the caches closest to the CPU it also helps to have separate caches for instructions and data (different working sets, different access patterns, etc.). This is why there's an L1I and an L1D, then a (unified) L2 and maybe a (unified) L3.
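The lookup order through such a hierarchy can be sketched with dicts standing in for each level (a toy model; real caches track whole lines, associativity and eviction, not single addresses):

```python
def read(addr, l1, l2, memory):
    if addr in l1:                 # fastest: L1 hit
        return l1[addr]
    if addr in l2:                 # L2 hit: fill L1 on the way back
        l1[addr] = l2[addr]
        return l1[addr]
    value = memory[addr]           # slowest: fetch from RAM
    l2[addr] = value               # fill both cache levels
    l1[addr] = value
    return value

l1, l2, memory = {}, {}, {0x9004: 42}
print(read(0x9004, l1, l2, memory))   # 42 (came from RAM)
print(read(0x9004, l1, l2, memory))   # 42 (now an L1 hit)
```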

Of course the same applies to TLBs - there can be multiple levels of TLBs and different TLBs for instructions and data (plus some more caches specifically for the high level paging structures to reduce TLB miss overhead).

Getting back to those "virtual address to physical address" translations - if the translation is not cached in the TLB, then the CPU needs to fetch (potentially) lots of tables from RAM to do the translation. A TLB miss causes several fetches, and those fetches might (or might not) be satisfied by data still in the L2 or L3 cache.

Brendan

A common method is to access the TLB and the cache in parallel. This method is described in the Wikipedia article: http://en.wikipedia.org/wiki/Translation_lookaside_buffer.

The whole point of a lookaside buffer is that the CPU looks there as well as somewhere else, typically using different parts of the same address. Either can miss, and what happens next depends on the architecture.
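One way to see why the parallel access works: with a virtually-indexed, physically-tagged (VIPT) L1, the set index comes from address bits that lie inside the 4 KiB page offset, and those bits are identical in the virtual and physical address. The cache geometry and names below are illustrative, not any particular CPU's:

```python
LINE = 64    # bytes per cache line (hypothetical)
SETS = 64    # 64 sets x 64 B = 4 KiB per way: index bits fit in the page offset

def l1_lookup(virtual, translate, cache):
    set_index = (virtual >> 6) % SETS   # from page-offset bits: needs no TLB
    physical = translate(virtual)       # TLB lookup runs in parallel with this
    tag = physical // (LINE * SETS)     # tag compare uses the physical address
    return cache.get((set_index, tag))  # a cached line on hit, None on miss

# Illustration with an identity translation and one pre-filled line:
cache = {(0, 2): b"cached line"}        # set 0, physical tag 2
print(l1_lookup(0x2004, lambda v: v, cache))   # b'cached line'
```

Because set selection needs no translated bits, the cache can start reading out the candidate lines while the TLB resolves the physical tag, which is exactly the parallelism the question asks about.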

david.pfx