Wherein the OS and user get more control over the L1, L2 and L3 caches, mostly for performance.

Traditionally, the L1, L2 and L3 SRAM caches are meant to be managed solely by the CPU. After all, from the perspective of the user’s program, the abstraction available for non-persistent data is that of memory; caches are there purely for performance. They could be managed by the user or even the OS, but from ‘What Every Programmer Should Know About Memory’ by Ulrich Drepper:

… the gains of having fast memory would be eaten up completely by the overhead of administering the resources. So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors.

Recently, I used the cpuid command to retrieve detailed cache and memory information for my system. Contrary to the above description, I did come across what seemed to be support for OS- and user-level cache-control instructions (some present, most meant for future systems, and one that’s already been deprecated). Here’s the cpuid output (slightly formatted, with non-relevant lines filtered out):

---------------------------------------------------------
CLFLUSH line size                               = 0x8 (8)
CLFLUSH instruction                             = true
---------------------------------------------------------
CLFLUSHOPT instruction                          = true
---------------------------------------------------------
CLDEMOTE supports cache line demote             = false
---------------------------------------------------------
CLWB instruction                                = false
PCOMMIT instruction                             = false
---------------------------------------------------------
CLZERO instruction                              = false
---------------------------------------------------------
WBNOINVD instruction                            = false

WBINVD/INVD acts on lower caches (L1d cache)    = false
WBINVD/INVD acts on lower caches (L1i cache)    = false
WBINVD/INVD acts on lower caches (L2 cache)     = false
WBINVD/INVD acts on lower caches (L3)           = false
---------------------------------------------------------
PREFETCHWT1                                     = false
---------------------------------------------------------
L1D_FLUSH: IA32_FLUSH_CMD MSR                   = true
---------------------------------------------------------
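
As an aside, the feature bits that the cpuid tool reports can also be queried programmatically. Here’s a minimal C sketch (assuming a reasonably recent GCC or Clang, which ship the <cpuid.h> helper header) that reads the same flags: leaf 1 for CLFLUSH, and leaf 7 / sub-leaf 0 for CLFLUSHOPT, CLWB and CLDEMOTE.

#include <cpuid.h>   /* __get_cpuid, __get_cpuid_count */
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1: the CLFLUSH flag is EDX bit 19. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("CLFLUSH    = %s\n", (edx >> 19) & 1 ? "true" : "false");

    /* Leaf 7, sub-leaf 0: CLFLUSHOPT is EBX bit 23, CLWB is EBX bit 24,
       CLDEMOTE is ECX bit 25. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("CLFLUSHOPT = %s\n", (ebx >> 23) & 1 ? "true" : "false");
        printf("CLWB       = %s\n", (ebx >> 24) & 1 ? "true" : "false");
        printf("CLDEMOTE   = %s\n", (ecx >> 25) & 1 ? "true" : "false");
    }
    return 0;
}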

Before going further, let’s define a linear address, since this term will come up when describing some of these instructions. On x86, a program generates a logical (virtual) address; the CPU applies segmentation to it to produce a linear address, and paging then translates that linear address into a physical address. Without getting into too much technical minutiae, a loaded program consists of a code segment, a data segment, a stack region and more; segmentation entails figuring out which segment an address belongs to. For more details, check the Wikipedia page on x86 memory segmentation.

Back to the output above. Most of the descriptions are sourced from Felix Cloutier’s x86 reference, which in turn is based on Intel’s x86 reference:

  • clflush <address>: Flushes a cache line: “Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the linear address specified with the memory operand. If that cache line contains modified data at any level of the cache hierarchy, that data is written back to memory” [1].
  • clflushopt <address>: similar to clflush but with less overhead; it achieves this by providing weaker ordering guarantees, which allows flushes of different cache lines to proceed more concurrently (a usage sketch follows this list).
  • cldemote <address>: A performance hint that demotes a cache line from the cache level closest to a CPU core to a level further away (e.g. from L1/L2 into the shared last-level cache). On usage, it “may accelerate subsequent accesses to the line by other cores in the same coherence domain, especially if the line was written by the core that demotes the line” [1]. Furthermore, “Unlike CLFLUSH, CLFLUSHOPT, and CLWB instructions, CLDEMOTE is not guaranteed to write back modified data to memory” [1].
  • clwb <address>: Similar to clflush, but after writing a modified cache line back to memory it may leave the line in the caches rather than invalidating it, so as to speed up subsequent reads. Once written back, the line’s state changes to ‘non-modified’ as per the cache coherence protocol.
  • clzero: AMD-specific instruction that clears out a cache line by setting every byte to zero (as per this WikiChip entry). It’s meant for handling certain memory errors.
  • invd: privileged instruction that “Invalidates (flushes) the processor’s internal caches and issues a special-function bus cycle that directs external caches to also flush themselves. Data held in internal caches is not written back to main memory” [1]. Unlike most of the previous instructions listed, this one’s probably not meant for improving performance.
  • wbinvd & wbnoinvd: both are privileged instructions that write back all modified cache lines to memory. wbinvd additionally invalidates the cache lines, meaning future reads/writes will have to fetch the data from main memory; wbnoinvd leaves the lines valid in the cache.
  • prefetchw <address> & prefetchh <address>: performance hints for prefetching a cache line from main memory before it’s needed; prefetchw additionally signals an intent to write, while prefetchh (the prefetcht0/t1/t2/nta family) hints at which cache level the line should be placed in (a prefetch sketch also follows this list).
  • pcommit: For ensuring writes to persistent memory have been committed; it has since been deprecated. For more details, check out this Stack Overflow answer and this Dan Luu post, where he discusses both the clwb and pcommit instructions.
  • L1D flushing (via writing to the IA32_FLUSH_CMD MSR): this feature seems to be meant for security mitigations, such as flushing the L1 data cache to keep data from leaking across contexts (see this discussion).
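
To make the user-level flush instructions a bit more concrete, here’s a minimal sketch in C using the compiler intrinsics from immintrin.h. It assumes a 64-byte cache line and a CPU that reports CLFLUSHOPT support (executing an unsupported instruction raises an invalid-opcode fault); the flush_range helper and the constants are illustrative, not taken from any particular codebase.

#include <immintrin.h>   /* _mm_clflushopt, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64    /* assumed line size; CPUID reports the real value */

/* Write back and invalidate every cache line covering buf[0..len).
   Compile with e.g. `gcc -mclflushopt`, and only call this after
   confirming CLFLUSHOPT support via cpuid. */
static void flush_range(void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clflushopt((void *)p);   /* weaker ordering than _mm_clflush */

    _mm_sfence();   /* order the flushes before any later stores */
}

int main(void)
{
    static char data[4096];
    memset(data, 0xAB, sizeof data);   /* dirty the cache lines */
    flush_range(data, sizeof data);    /* subsequent reads will miss in cache */
    return 0;
}

Swapping _mm_clflushopt for _mm_clwb (and compiling with -mclwb) gives the write-back-without-invalidate behaviour described above, which is the pattern persistent-memory code tends to rely on.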

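The prefetch hints, by contrast, have been usable from user space for a long time. Here’s a rough sketch; the prefetch distance of 16 elements and the _MM_HINT_T0 hint are made-up tuning values that would need measuring on a real workload.

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

/* Sum an array while prefetching a few cache lines ahead of the current
   position, so that the later loads (hopefully) hit in the cache. */
double sum_with_prefetch(const double *a, size_t n)
{
    enum { DIST = 16 };   /* prefetch distance in elements (illustrative) */
    double s = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

GCC and Clang also provide __builtin_prefetch(addr, rw, locality), which compilers can lower to prefetchw when rw is 1 and the target CPU supports it.
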
As you can see from my cpuid output, most of these cache-control instructions and features aren’t supported on my machine, and they’re not universally available yet (at least on consumer-grade systems). Their presence, though, shows that there’s a need to give the OS and user more control over the L1, L2 and L3 caches, be it for performance or for security reasons. It remains to be seen how commonplace they become. For the time being, I’ll end with this excerpt from John McCalpin’s post titled “Intel’s future ‘CLDEMOTE’ instruction”:

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power. More of the same is not an answer — new thinking is required…

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.

References

  1. Felix Cloutier’s x86 reference