Question
In a recent CACM article [1], the authors present a way to improve the scalability of shared, coherent caches. The core ingredient is that the caches are inclusive, that is, higher-level caches (e.g. L3, one cache shared globally) contain all blocks that are stored in their descendant lower-level caches (e.g. L1, one cache per core).
Typically, higher-level caches are larger than all of their descendant caches combined. For instance, some models of the Intel Core i7 series with four cores have an 8 MB shared cache (L3) and 256 KB private caches (L2), so the shared cache can hold eight times as many blocks as the four private caches put together.
This seems to suggest that whenever the shared cache has to evict a block (in order to load a new one), it can always find a block that is held by none of the private caches.
Explanation / Answer
Perhaps I can shed some light on associativity. These caches aren't just open blocks of memory, so don't think of them as generic containers. Each block of memory has a real address (whether it's physical or virtual doesn't matter; just assume it's fixed). At each level of the cache hierarchy, that memory can only be cached at a few very specific locations.
For example, memory address X can be stored in the L1 cache at location Y0. That is a direct mapping. But suppose it can also be stored at location Y1; then the cache is 2-way associative (which is fairly common). Being storable in N locations makes it N-way associative. When the chip chooses a cache location it checks only these N places -- it does not scan the entire cache (which would be too slow and complicated).
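To make the placement rule concrete, here is a minimal sketch of the set-index calculation. The 4 KB capacity and 64-byte block size are illustrative assumptions, not figures from the article:

```c
#include <stdio.h>
#include <stdint.h>

/* A sketch of the placement rule, assuming (purely for illustration)
   a 4 KB cache with 64-byte blocks; real parameters vary by chip. */
#define CACHE_BYTES 4096u
#define BLOCK_BYTES 64u

/* With N ways there are (capacity / block size) / N sets; an address
   can only live in its set, i.e. in one of N candidate locations. */
static unsigned set_index(uintptr_t addr, unsigned ways) {
    unsigned num_sets = (CACHE_BYTES / BLOCK_BYTES) / ways;
    return (unsigned)(addr / BLOCK_BYTES) % num_sets;
}

int main(void) {
    uintptr_t x = 0x12340;  /* some fixed address X */
    printf("direct-mapped: 1 candidate, set %u\n", set_index(x, 1));
    printf("2-way:         2 candidates, set %u\n", set_index(x, 2));
    return 0;
}
```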
This should clarify what they mean by saying the shared cache must have a higher associativity than the aggregate of the lower caches. If L1 is 2-way, L2 is 4-way, and there are 4 processors, then a given memory address has 4 × (2 + 4) = 24 possible private locations, so the shared cache must be more than 24-way associative (I don't have the article, but this sounds like what they mean). It follows, then, that at least one of those shared locations must hold a block that is not cached in any of the private caches.
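The pigeonhole arithmetic is small enough to spell out; the core and way counts below are the hypothetical ones from the previous paragraph, not numbers taken from the article:

```c
#include <stdio.h>

int main(void) {
    /* Figures from the paragraph above (hypothetical, not from the
       article): 4 cores, each with a private 2-way L1 and 4-way L2. */
    int cores = 4, l1_ways = 2, l2_ways = 4;
    int aggregate = cores * (l1_ways + l2_ways);
    printf("aggregate private ways: %d\n", aggregate);  /* 4 * (2 + 4) = 24 */
    /* Pigeonhole: if the shared cache offers more than 24 locations for
       the same address, at least one of them holds a block that cannot
       be present in any private cache. */
    return 0;
}
```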
I can't say for sure what their "pathological" case is, but I can explain a common one, one that can affect things like binary search. Say the L1 cache is 1-way (direct-mapped) and 4 KB in size. The location where a block can be stored is likely a simple mod of its address (X mod 4K). Now say you have a large block of data whose size is a power of 2 (which is common). In a binary search you always halve the size, so a search over a 256K block has you check offsets 128K, 64K, 32K, 16K, 8K, 4K, and lower. But you should see the problem here: those first 6 addresses all map to the exact same cache location, so each access requires an eviction and a load, even though you're touching only a tiny amount of memory.
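Here is a small demo of that collision pattern, using the direct-mapped 4 KB cache from the example (the 64-byte block size is an assumption for illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* Demo of the collision pattern, assuming the direct-mapped 4 KB cache
   from the example; the 64-byte block size is an illustrative guess. */
#define CACHE_BYTES 4096u
#define BLOCK_BYTES 64u
#define NUM_SETS (CACHE_BYTES / BLOCK_BYTES)

int main(void) {
    /* Offsets probed by a binary search over a 256K block of data. */
    uintptr_t probes[] = { 128*1024, 64*1024, 32*1024, 16*1024,
                           8*1024, 4*1024, 2*1024 };
    for (unsigned i = 0; i < sizeof probes / sizeof probes[0]; i++) {
        unsigned set = (unsigned)(probes[i] / BLOCK_BYTES) % NUM_SETS;
        printf("offset %4luK -> cache set %u\n",
               (unsigned long)(probes[i] / 1024), set);
    }
    /* Everything from 128K down to 4K lands in set 0, so each probe
       evicts the previous one; only below 4K do the sets start to
       differ. */
    return 0;
}
```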