[HN Gopher] Beating the L1 cache with value speculation
       ___________________________________________________________________
        
       Beating the L1 cache with value speculation
        
       Author : rostayob
       Score  : 116 points
       Date   : 2021-07-23 11:44 UTC (11 hours ago)
        
 (HTM) web link (mazzo.li)
 (TXT) w3m dump (mazzo.li)
        
       | asoylu wrote:
        | This works when the list nodes are allocated one after another,
        | with no unrelated allocations in between. If your case doesn't
        | fit this constraint, value speculation doesn't seem to work. Any
        | comments?
        
       | drewg123 wrote:
       | How realistic is it to have a linked list in contiguous memory
       | that stays nicely ordered? In my experience, lists are normally
       | cobbled together with nodes that are allocated widely separated
       | in time and hence I have no knowledge that I can use to predict
       | the address of the next pointer.
       | 
       | If you're in a scenario where this optimization works, it seems
       | like it would be better to just use an array.
        
         | dragontamer wrote:
         | Are you familiar with Donald Knuth's "Dancing Links" ??
         | 
          | It's a covering-problem solver (NP-complete) that uses linked
          | lists highly efficiently. All nodes are "statically allocated"
          | (at least, with respect to the algorithm), because the
          | covering problem takes up a constant amount of space.
         | 
         | The linked-list "Dances", pointing at other nodes to represent
         | which combination of covers are attempted. So any link could
         | potentially point to any other link later in its list. However,
         | the number of nodes themselves is constant and set before the
         | algorithm runs.
         | 
         | As such, the linked-list is fully X1->X2->X3->X4... at the
         | start of the algorithm. (where X1, X2, X3 are contiguously
         | allocated in memory). If X3 is proven to "not be a good guess"
         | for our covering problem, then it is cut out of the linked list
         | (X1->X2->X4).
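          | 
          | A minimal sketch of that splice-and-restore "dance" on a
          | doubly linked node (hypothetical struct, not Knuth's full
          | algorithm):
          | 
          |     struct node { struct node *prev, *next; };
          | 
          |     /* cut x out of the list; x keeps its own pointers */
          |     void cover(struct node *x) {
          |         x->prev->next = x->next;
          |         x->next->prev = x->prev;
          |     }
          | 
          |     /* because x still points at its old neighbours, the
          |        cut is undone in O(1) when backtracking */
          |     void uncover(struct node *x) {
          |         x->prev->next = x;
          |         x->next->prev = x;
          |     }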
         | 
          | All links remain in order, and are ordered in two dimensions
          | (so not only X1->X2->X3... but also X1->Y1->Z1...), where X, Y
          | and Z are the compound elements trying to cover locations 1, 2,
          | 3, ...
         | 
         | ------------
         | 
          | Most simple malloc implementations also keep the free list as a
          | simple linked list that is, in fact, ordered in memory. (The
          | first node in the free list is at the lowest address, while the
          | final node in the free list is at the highest address.)
         | 
         | The FAT32 linked-list across a hard drive is also nicely
         | ordered (and when it isn't, the user will perform
         | "defragmentation" to reorder the list and optimize the hard
         | drive).
         | 
         | -----------
         | 
         | IMO, the "nicely ordered Linked List" is uncommon, but...
          | common enough that it's worth discussing. And sure, maybe we
         | don't use FAT32 filesystems today very often, but that's the
         | technology I used growing up lol. I know the methodology
         | works!!
        
           | virtue3 wrote:
            | I think you need to be aware of why such an implementation
            | (FAT32) had to exist. I believe (I also grew up with FAT32)
            | that it basically vanished when HDDs with caches in the
            | megabytes started to show up. Journaling became a thing;
            | actual caches became a thing, so the HDD could actually lay
            | things out well by utilizing said cache.
            | 
            | NTFS still needed to be defragmented occasionally, but it
            | allocated more of a buffer to make that rarely necessary.
           | 
            | I think, in general, you're very spot on about utilizing
            | contiguous arrays of memory for linked lists, as they will
            | almost always result in better performance characteristics
            | (unless you really need insertion/removal in order and it's
            | very write-heavy).
        
         | ot wrote:
         | You can get help from the allocator. For example jemalloc has
         | an experimental API called "batch allocation" where it returns
         | a list of allocations of the same size, and it makes its best
         | effort to make them contiguous (not guaranteed though):
         | https://github.com/jemalloc/jemalloc/blob/12cd13cd418512d9e7...
         | 
         | The idea is that you may have a data structure that you want to
         | be able to modify (in which case you need to free individual
         | nodes) but in most cases you do not.
         | 
         | Then you could use batch allocation, and pair it with this
         | optimization.
        
         | Filligree wrote:
         | It's fairly standard for a compacting GC to rearrange lists
         | this way, because memory latency was a concern even back in the
         | 80s.
         | 
         | If you're using linked lists in a language without a compacting
         | GC, then you _almost certainly_ want something else instead.
         | 
         | If you're using them in a language with one, then you still
         | usually want something else, but at least the cost isn't as
         | high. Also you might be using Haskell or something, where other
         | constraints make data-structures like arrays painful to use.
        
         | adrianN wrote:
         | That depends on the allocator you use. If you allocate your
          | nodes from an arena that belongs to the list, the resulting
         | layout is usually pretty good.
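          | 
          | A minimal sketch of what a list-owned arena might look like
          | (hypothetical names; no growth and no per-node free):
          | 
          |     #include <stddef.h>
          |     #include <stdlib.h>
          | 
          |     typedef struct Node {
          |         struct Node *next;
          |         long value;
          |     } Node;
          | 
          |     typedef struct {
          |         Node  *slots;      /* one contiguous block */
          |         size_t used, cap;
          |         Node  *head;
          |     } List;
          | 
          |     static int list_init(List *l, size_t cap) {
          |         l->slots = malloc(cap * sizeof *l->slots);
          |         l->used = 0; l->cap = cap; l->head = NULL;
          |         return l->slots != NULL;
          |     }
          | 
          |     /* Nodes are handed out in address order, so a list
          |        built by appending stays sequential in memory. */
          |     static Node *list_alloc_node(List *l) {
          |         return l->used < l->cap ? &l->slots[l->used++]
          |                                 : NULL;
          |     }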
        
       | thereare5lights wrote:
        | Would this introduce Spectre-like vulnerabilities?
        
       | kardos wrote:
        | This works because we've arranged for the nodes to be sequential
        | in memory; if we have the luxury of doing that, then we could use
       | an array instead of a linked list. Is there a scenario where this
       | value speculation trick would work, where we /don't/ have this
       | luxury?
        
         | twoodfin wrote:
         | Given the nature of branch prediction, I'd expect it would
         | still help dramatically in the case where the list was _almost_
         | entirely sequential, which seems a bit more plausible.
        
           | kardos wrote:
            | Agreed; if a mispredict costs 15-20 cycles and a correct
            | prediction saves 4 cycles, does that mean we break even
            | around 1/4-1/3 sequential?
        
         | munchbunny wrote:
         | I could imagine behaviors like this happening with object
         | pools, though it does still feel a bit contrived.
        
       | codetrotter wrote:
       | How do they get the cycles/elem and instrs/cycle counts? Is it
        | via the perf utility? Or something they calculate? Or do they
        | get it some other way?
        
         | mhh__ wrote:
         | Some tools you can do this with:
         | 
         | The perf command line utility, the perf_event_open system call,
         | the libpapi high level API, pmc.h on FreeBSD (IIRC).
         | 
          | Now, for some more clever analysis, tools that can tell you the
          | why rather than the what: they collect the same information but
          | can present it graphically and compute, for example, TMAM
          | statistics (top-down microarchitecture analysis method).
          | 
          | VTune, Intel Advisor, AMD uProf, and a few others. VTune is the
          | best for x86, but only fully supports Intel.
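          | 
          | As a sketch, reading a cycle count on Linux with the raw
          | perf_event_open syscall looks roughly like this (error
          | handling omitted; not necessarily how the article did it):
          | 
          |     #include <linux/perf_event.h>
          |     #include <sys/ioctl.h>
          |     #include <sys/syscall.h>
          |     #include <unistd.h>
          |     #include <string.h>
          |     #include <stdint.h>
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         struct perf_event_attr attr;
          |         memset(&attr, 0, sizeof attr);
          |         attr.type = PERF_TYPE_HARDWARE;
          |         attr.size = sizeof attr;
          |         attr.config = PERF_COUNT_HW_CPU_CYCLES;
          |         attr.disabled = 1;
          |         attr.exclude_kernel = 1;
          |         /* this process, any CPU, no group */
          |         int fd = syscall(SYS_perf_event_open,
          |                          &attr, 0, -1, -1, 0);
          | 
          |         ioctl(fd, PERF_EVENT_IOC_RESET, 0);
          |         ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
          |         /* ... code under test ... */
          |         ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
          | 
          |         uint64_t cycles;
          |         read(fd, &cycles, sizeof cycles);
          |         printf("%llu cycles\n",
          |                (unsigned long long)cycles);
          |         return 0;
          |     }
          | 
          | A second counter opened with PERF_COUNT_HW_INSTRUCTIONS
          | gives the other half of the instrs/cycle figure.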
        
         | dragontamer wrote:
         | Personally, I prefer to just use the RDTSC assembly instruction
         | (Read TimeStamp Counter), which provides the 64-bit number of
         | clocks your core has ticked.
         | 
         | Both Windows and Linux provide more robust timers (in
         | particular: if your thread takes longer than 10ms, there's a
         | chance your thread will sleep to give other threads a shot at
         | the CPU). So if you're timing something longer than 10ms, you
         | probably want to use OS timers instead.
         | 
         | -------------
         | 
         | There were a few programs where I couldn't add rdtsc easily to
         | the code (in particular: I was trying to test something so fast
         | that rdtsc took up the bulk of the time). In these cases, I
         | went into the BIOS, disabled "turbo" on my CPU, locking my
         | computer to 3.4GHz.
         | 
         | From there, I took the Windows timer and measured 1-billion
         | events, and then divided by 3.4-Billion (3.4GHz == 3.4-billion
         | clocks per second).
         | 
         | ---------
         | 
         | I don't know the specific methodology that the blogpost used.
          | But there are many easy ways to do this task.
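          | 
          | The basic usage is just two reads of the counter around the
          | code under test (a sketch using the GCC/clang intrinsic;
          | serialization caveats are discussed downthread):
          | 
          |     #include <stdint.h>
          |     #include <stdio.h>
          |     #include <x86intrin.h>   /* __rdtsc() */
          | 
          |     int main(void) {
          |         uint64_t start = __rdtsc();
          |         /* ... code under test ... */
          |         uint64_t stop = __rdtsc();
          |         printf("%llu ticks\n",
          |                (unsigned long long)(stop - start));
          |         return 0;
          |     }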
        
           | secondcoming wrote:
           | There's a bit more to it than turning a flag off in your
           | BIOS.
           | 
           | Intel have a document about it [0]
           | 
           | [0] https://www.intel.com/content/dam/www/public/us/en/docume
           | nts...
        
           | mhh__ wrote:
           | > , which provides the 64-bit number of clocks your core has
           | ticked.
           | 
           | Not quite
        
             | dragontamer wrote:
             | I mean... computers are complex these days. I could type up
              | like 3 paragraphs that more specifically describe what is
             | going on but is that really helpful?
             | 
              | Yeah, pipelines and out-of-order execution make the
             | definition a bit difficult. If you want to ensure that all
             | previous instructions are done executing, you need lfence,
             | and if you want to prevent future instructions from filling
             | in the pipelines you'll need an mfence.
             | 
             | There are many clocks (even within a core). The turbo-clock
             | is different from the standard clock. I forget exactly
             | which clock rdtsc uses, but I do know that under some
             | processors under certain conditions, you'll get weird
             | results.
             | 
             | Different processors may have different interpretations of
             | "clock" (mostly due to turbo and/or sleeping behavior).
             | Etc. etc. I don't recall the details, but these different
              | clock states could vary from 2.2GHz to 4GHz on my
             | processor (P1? Turbo? I forget the exact name...)
             | 
             | ---------------
             | 
             | But all in all, you get a 64-bit number that describes the
             | number of clock-ticks --- for some "definition" of clock
             | tick that differs between processors... and for some
             | definition of "now" (in the case of out-of-order execution
             | and/or pipelined execution, the "now" is a bit ambiguous,
             | as previous instructions may have not finished executing
             | yet and future instructions may already be executing).
             | 
             | If you really want to know, read the processor manual
             | specific to the microarchitecture (since different
             | microarchitectures could change these definitions)
        
               | titzer wrote:
               | > If you want to ensure that all previous instructions
               | are done executing, you need lfence, and if you want to
               | prevent future instructions from filling in the pipelines
               | you'll need an mfence.
               | 
                | LFENCE does not serialize, nor does MFENCE. CPUID,
                | however, is documented as a serializing instruction and
                | is the recommended way to serialize, particularly with
                | RDTSC.
               | 
               | > I don't recall the details, but these different clock
               | states could vary as much as 2.2GHz to 4GHz on my
               | processor (P1? Turbo? I forget the exact name...)
               | 
               | Oh heck, it's way more than that. I've measured ~5x
               | difference in clock cycle count for short loops using
               | RDTSC. Supposedly RDTSC returns "nominal" cycles that
               | advance at the same rate relative to the wall clock, but
               | TBH that doesn't smell right. OSes also try to
               | synchronize the absolute values of the various
               | processors, so jumping between CPUs isn't that bad.
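                | 
                | The widely cited pattern (roughly what Intel's
                | benchmarking whitepaper suggests; a sketch in
                | GCC-style inline asm, not taken from the article)
                | brackets the measured code with CPUID/RDTSC at the
                | start and RDTSCP/CPUID at the end:
                | 
                |   #include <stdint.h>
                | 
                |   static inline uint64_t tsc_begin(void) {
                |       uint32_t lo, hi;
                |       /* CPUID drains earlier work, then RDTSC reads */
                |       asm volatile("cpuid\n\trdtsc"
                |                    : "=a"(lo), "=d"(hi)
                |                    : "a"(0) : "rbx", "rcx");
                |       return ((uint64_t)hi << 32) | lo;
                |   }
                | 
                |   static inline uint64_t tsc_end(void) {
                |       uint32_t lo, hi;
                |       /* RDTSCP waits for earlier instructions; the
                |          trailing CPUID keeps later ones from moving
                |          up past the read */
                |       asm volatile("rdtscp\n\t"
                |                    "mov %%eax, %0\n\t"
                |                    "mov %%edx, %1\n\t"
                |                    "cpuid"
                |                    : "=r"(lo), "=r"(hi)
                |                    :: "rax", "rbx", "rcx", "rdx");
                |       return ((uint64_t)hi << 32) | lo;
                |   }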
        
         | jeffbee wrote:
         | The program is reading perf events.
         | 
         | The code is at
         | https://gist.github.com/bitonic/78887f5d3238bab5e31f3c5a41d4...
        
       | solidangle wrote:
        |     if (node != next) { node = node->next; }
        | 
        | How does this work? Shouldn't this be the following?
        | 
        |     if (node != next) { node = next; }
        
         | [deleted]
        
         | xfer wrote:
          |     while (node) {
          |         value += node->value;
          |         next = node->next;
         | 
         | So it is the same thing.
        
         | tinus_hn wrote:
         | That looks weird, why would you need to test if two values are
         | equal if you're going to assign them to be equal anyway?
        
           | xfer wrote:
           | That is the trick.
           | 
            | In the happy path you are not assigning (`node = next`).
            | 
            | It is taken care of by `node++`, which removes the loop
            | dependency, so the processor can exploit the full
            | instruction-level parallelism.
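            | 
            | Putting it together, the loop being discussed is roughly
            | (a sketch of the idea, not the article's verbatim code):
            | 
            |     #include <stdint.h>
            | 
            |     typedef struct Node {
            |         struct Node *next;
            |         uint64_t value;
            |     } Node;
            | 
            |     uint64_t sum(Node *node) {
            |         uint64_t value = 0;
            |         while (node) {
            |             value += node->value;
            |             Node *next = node->next;
            |             node++;            /* guess: adjacent node */
            |             if (node != next)  /* almost always false */
            |                 node = next;   /* fix up a wrong guess */
            |         }
            |         return value;
            |     }
            | 
            | As noted elsewhere in the thread, GCC/clang will happily
            | fold this back into a plain `node = next`, which is why
            | the article ends up reaching for inline asm.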
        
             | ectopod wrote:
             | It looks like a bug. The unhappy path contains both
             | `node++` and `node=node->next`. Note that this is in the
             | code following "Let's go back to the code we showed for
             | value speculation in C:", which is actually different from
             | the preceding code it's supposed to be a copy of. I guess
             | it's a typo.
        
               | ectopod wrote:
               | The author has quietly fixed this discrepancy. You're
               | welcome, no thank you required!
        
             | tinus_hn wrote:
             | It looks as if the difference is in delaying the check for
             | next == NULL.
        
       | andy_ppp wrote:
        | I like it as an exercise for understanding the CPU, but is this
        | stuff really useful in the real world? The example given is
        | "optimised" away by GCC/clang without hacks. My guess is that if
        | this were a good way to iterate over linked lists, compilers
        | would have optimised for it by now, either through cleverness or
        | hints added to your code.
        
         | 10000truths wrote:
         | For a compiler to properly do the optimization detailed here
         | would require it to have knowledge of runtime execution/memory
         | access patterns. This might be done through profile-guided
         | optimization, or by providing hints like __builtin_expect.
         | 
         | That said, compilers are not magic boxes. Relying on your
         | compiler for optimal codegen has diminishing returns as you
         | approach the point where microarchitectural details start
         | making a difference. Most people don't get to that point, so
         | it's rarely an issue.
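          | 
          | For instance, a hint on the speculation branch might look
          | like this (GCC/clang extension; a sketch, and no guarantee
          | the compiler keeps the speculative form):
          | 
          |     Node *next = node->next;
          |     node++;                      /* the guess */
          |     if (__builtin_expect(node != next, 0))
          |         node = next;             /* rare fix-up */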
        
         | dragontamer wrote:
          | Compilers 'just' apply graph transformations to code,
          | transformations which probably lead to improved performance.
         | 
         | A compiler can therefore optimize away unnecessary
         | loads/stores, or recognize implicit parallelism and transform a
         | loop into a SIMD loop.
         | 
         | But a compiler will not change the logic of your code or add
         | extra steps.
        
           | matheusmoreira wrote:
           | > But a compiler will not change the logic of your code or
           | add extra steps.
           | 
           | Compilers can and do delete checks for null pointers, integer
           | overflow... The stuff they do when they're faced with simple
           | aliasing is honestly pretty crazy.
        
           | andy_ppp wrote:
            | I mean, this is quite easy to disprove: what do you think
           | -funroll-loops is doing under the hood if not adding extra
           | steps?
        
             | dragontamer wrote:
             | A "loop" is just a cycle in a graph.
             | 
              | A -> B -> C -> A, hey look, it's a cycle. Eventually A->D
             | (where D exits the loop). So really node A has an edge
             | going to B and D.
             | 
             | Knowing that A has an edge to B and D, an equivalent
             | transformation is A1->B1->C1->A2->B2->C2->A1, but also
             | A1->D and A2->D.
             | 
             | Both A1 and A2 have to still point at D. But A1 points to
             | B1, and A2 points to B2. Etc. etc. That's all your compiler
              | is doing: it's reasoning about the graph and transforming
              | it with actually... very simple rules. (Unfortunately,
              | applying those rules optimally is NP-complete, so heuristics
              | are used to keep compile times reasonable. NP-completeness
              | seems to keep coming up in transformations associated with
              | "simple" rules...)
             | 
             | Then, we might see that some logic could make A2 not
             | necessarily point to D (cutting out one possible jump
             | statement). And that's where we get our optimization from:
             | a simple reasoning about the graph which removes one
              | instruction per two iterations. Unroll to 8 and we get
              | 7 instructions (cmp/jump instructions) saved over every 8
              | iterations.
             | 
             | An "optimizing" compiler just reasons about this graph and
             | all possible equivalent transformations (or at least, uses
             | a heuristic to reason over a portion of equivalent
             | transformations), and chooses such a graph transformation
             | that minimizes various cost estimates.
             | 
             | -------
             | 
             | That's the thing. Compilers can only transform the code-
             | graph in ways that are "equivalent". Now yes, things get a
             | bit weird with undefined behavior (and pointer-aliasing:
             | assuming a lack of aliasing is needed for C/C++ compilers
             | to make even the simplest of transformations. So that's a
             | bit of a corner case). But that's the gist of what they're
             | doing.
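              | 
              | Concretely, unrolling by two turns a toy loop like
              | 
              |     for (int i = 0; i < n; i++)
              |         sum += a[i];
              | 
              | into
              | 
              |     int i = 0;
              |     for (; i + 1 < n; i += 2) {  /* A1,B1,C1,A2,B2,C2 */
              |         sum += a[i];
              |         sum += a[i + 1];
              |     }
              |     for (; i < n; i++)           /* leftover exit edge */
              |         sum += a[i];
              | 
              | which is the "one exit test per two iterations" saving
              | described above.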
        
         | sly010 wrote:
         | This is not a 'static' optimization. It relies on knowledge
        | that the compiler cannot determine by looking at the code
        | (whether the linked list is _mostly_ laid out contiguously).
         | 
         | It might still be a very useful optimization in a JIT runtime.
         | If your loop variable keeps increasing the same amount after
         | every lookup, maybe rewrite the loop to add speculation...
        
       | bob1029 wrote:
       | A quick googling brought me to the following paper:
       | 
       | "Data value speculation in superscalar processors"
       | 
       | https://www.sciencedirect.com/science/article/abs/pii/S01419...
       | 
       | It would appear there is an entire section dedicated to "Value
       | prediction in a realistic scenarios", but I am not curious enough
       | yet to pay money to access this document.
       | 
       | Edit: Found another that is accessible.
       | 
       | "Practical Data Value Speculation for Future High-end Processors"
       | 
       | http://class.ece.iastate.edu/tyagi/cpre581/papers/HPCA14Data...
        
       | swiley wrote:
       | This seems like it would cause a lot of those "I just wish the
        | damn machine did what I told it to" situations.
        
         | st_goliath wrote:
         | Not necessarily. After putting a lot of brain grease into
         | writing the code, it will probably work just fine.
         | 
         | But it will lead to a bunch of serious "WTF!?" situations when
         | other people look at the code and/or are supposed to make
         | changes at a later point.
         | 
         | And possibly stuff breaking in spectacular ways after
         | introducing "minor cleanup" patches. Or some weird performance
         | regressions, after touching completely unrelated code that
         | might affect the temporal order of the memory allocations in
          | question, and so on...
         | 
         | Besides improving performance, doing this kind of trickery in
         | production code will probably also improve your job security.
        
       | pranith wrote:
        | Very well explained... it's a really interesting idea.
        
       | mhh__ wrote:
       | I was under the impression that high end CPUs already perform
       | value prediction while speculating?
        
         | gpderetta wrote:
          | There are some restricted forms of value prediction in the
          | wild: indirect branch prediction is fairly common; so is alias
          | prediction, which speculates that two data addresses (at least
          | one of which is a store) do not alias; AMD memory renaming
          | also makes assumptions around stack addresses. Outside of
          | these special cases I'm not aware of any CPU that will
          | actually speculate on general values or addresses.
         | 
         | On relaxed memory model machines naive value prediction would
         | break dependency ordering (i.e. memory order consume), so
         | things get very complicated.
        
           | titzer wrote:
           | What you posted is right, but I wouldn't classify indirect
           | branch prediction as value prediction, because it doesn't
           | supply a value to any other part of the processor (e.g.
           | substitute a predicted value into a register). It just steers
           | the frontend to where instructions should be fetched from, so
           | it is just branch prediction.
        
       | xfer wrote:
       | So the solution is to write assembly for an extremely common
       | micro-architecture design. Maybe the compilers should provide
       | more tools to exploit such parallelism. That asm block looks like
       | a weird hack.
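        | 
        | For reference, the flavor of hack involved is typically an
        | empty asm statement used as an optimization barrier, so the
        | compiler has to treat a value as opaque (a sketch of the
        | general idiom in GCC/clang extended asm, not necessarily the
        | article's exact block):
        | 
        |     /* "+r"(p) claims the asm may rewrite p, so the
        |        optimizer can no longer reason about its value */
        |     #define OPAQUE(p) asm volatile("" : "+r"(p))
        | 
        | Exactly where such a barrier has to go to stop the compiler
        | from folding the speculation back into a plain pointer chase
        | is the fiddly part.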
        
       ___________________________________________________________________
       (page generated 2021-07-23 23:00 UTC)