[HN Gopher] Parsing Protobuf at 2+GB/S: How I Learned to Love Ta...
       ___________________________________________________________________
        
       Parsing Protobuf at 2+GB/S: How I Learned to Love Tail Calls in C
        
       Author : signa11
       Score  : 369 points
       Date   : 2021-04-25 10:04 UTC (12 hours ago)
        
 (HTM) web link (blog.reverberate.org)
 (TXT) w3m dump (blog.reverberate.org)
        
       | z77dj3kl wrote:
       | Very cool! OT but is there a list of protobuf benchmarks
       | somewhere across languages/implementations? How does it compare
       | to JSON or just dumping raw bytes?
        
       | slver wrote:
        | You'd get equivalent performance with "goto", without depending
        | on an implicit compiler optimization to save you from stack
        | overflow.
        | 
        | I like the idea of tail calls, but I don't like it being an
        | implicit "maybe" optimization of the compiler. Make it part of
        | the standard, make the syntax explicit, and then I'm all in.
        
         | haberman wrote:
         | You really wouldn't. We tried many approaches, including
         | computed goto, but none of them were able to generate
         | satisfactory code when compared with tail calls:
         | https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
         | 
         | Also, it is not an implicit optimization when "musttail"
         | support is available in the compiler.
        
           | slver wrote:
            | Tail-call optimization compiles to a JMP, and so does goto,
            | so I'm curious where the difference would come from.
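A minimal, hypothetical sketch of the pattern being discussed (not the article's upb code; all names here are illustrative): each opcode handler is a tiny function that tail-calls back into the dispatcher, with clang's `musttail` attribute enforcing that the call is emitted as a jump, and a plain-return fallback on other compilers.

```c
#include <assert.h>
#include <stdint.h>

/* clang >= 13 can enforce the tail call; elsewhere we fall back to a
 * plain return and rely on the optimizer (or on small inputs). */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#if defined(__clang__) && __has_attribute(musttail)
#define MUSTTAIL __attribute__((musttail))
#else
#define MUSTTAIL
#endif

typedef int64_t (*op_fn)(const uint8_t *pc, int64_t acc);

static int64_t op_halt(const uint8_t *pc, int64_t acc);
static int64_t op_inc(const uint8_t *pc, int64_t acc);
static int64_t op_dbl(const uint8_t *pc, int64_t acc);

/* Dispatch table indexed by opcode: 0 = halt, 1 = inc, 2 = double. */
static op_fn const table[] = { op_halt, op_inc, op_dbl };

static int64_t dispatch(const uint8_t *pc, int64_t acc) {
  /* Indirect tail call: becomes "jmp *table(...)", not "call". */
  MUSTTAIL return table[*pc](pc, acc);
}

static int64_t op_halt(const uint8_t *pc, int64_t acc) {
  (void)pc;
  return acc;
}

static int64_t op_inc(const uint8_t *pc, int64_t acc) {
  MUSTTAIL return dispatch(pc + 1, acc + 1);
}

static int64_t op_dbl(const uint8_t *pc, int64_t acc) {
  MUSTTAIL return dispatch(pc + 1, acc * 2);
}

int64_t run(const uint8_t *program) { return dispatch(program, 0); }
```

Because every handler and the dispatcher share one signature, the arguments stay pinned in the same registers across the whole chain, which is the register-allocation benefit a plain computed goto does not guarantee.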
        
       | PostThisTooFast wrote:
       | One thing that appealed to us about protobufs is their alleged
       | change-tolerance (versioning).
       | 
       | Years later, this remains only alleged... not realized.
        
       | a-dub wrote:
       | even if you set aside the claimed gains over handwritten
       | imperative loops, being able to tell the compiler to fail unless
       | tail call conversion succeeds will be huge in terms of preventing
       | performance or stack depth bugs from surfacing when someone
       | accidentally makes a change that makes a tail call no longer
       | convertible.
       | 
       | if you ask me, it should be part of the language. the whole "just
       | make sure you shouldn't need a stack and the compiler will
       | probably take care of it" approach has bothered me for years.
        
         | alfiedotwtf wrote:
         | > the whole "just make sure you shouldn't need a stack and the
         | compiler will probably take care of it" approach has bothered
         | me for years.
         | 
         | Good point. I love Rust's safety guarantees, but you're right -
          | it doesn't take the stack into account, like everyone else.
        
       | coolreader18 wrote:
       | It's funny, wasm3 was just on the front page with a tail-call
       | based interpreter model and now this is. Now I wanna do some
       | stuff with tail calls :)
        
       | khiner wrote:
       | This was a great read!
        
       | tux1968 wrote:
        | This is a really great result, but I'm curious why this isn't a
       | very common and standard compiler optimization, at least as an
       | option you can enable? It seems like the conditions where it can
       | be applied are pretty easy to identify for a compiler.
        
         | haberman wrote:
         | Tail calls are a very common optimization, both Clang and GCC
         | have been performing this optimization successfully for a
         | while. What is new is getting a _guarantee_ that applies to all
         | build modes, including non-optimized builds.
        
           | tux1968 wrote:
           | If you're interested in this optimization for performance
           | reasons, why would you want an otherwise non-optimized build?
           | It seems that the only important case is the optimized
           | build... where for some reason you're not getting this
           | optimization without explicitly asking for it.
           | 
           | So the question remains... why is the compiler optimization
           | missing this chance to optimize this tail call without it
           | being explicitly marked for optimization?
        
             | wffurr wrote:
             | If you write this style of code and don't get the
             | optimization, then your test code and debug sessions are
             | dog slow. Much much slower than the old giant switch style
             | parser.
             | 
             | Similarly, if you make a mistake that would cause the
             | optimization not to be applied, you'd rather get a compiler
             | error than suddenly have a 10X performance regression.
        
             | haberman wrote:
             | If the optimization is not performed, you blow the stack
             | because your stack usage is now O(n) instead of O(1).
             | 
             | That's why it's important to get the optimization in debug
             | mode.
        
               | vitus wrote:
               | What's the argument for building in debug mode vs
               | specifying -g to the compiler to build with debug
               | symbols?
               | 
               | I've previously encountered bugs that stubbornly refused
               | to reproduce in a non-optimized build.
               | 
               | One thing that comes to mind is `#ifdef DEBUG`-style
               | macros, although I'm curious if there's anything else I'm
               | missing. Omitted stack frames, e.g. from inlining,
               | perhaps?
        
             | joppy wrote:
             | One reason that this is not done by default is that it can
             | make debugging a surprise: since each function does not
             | leave a stack frame, stack traces are much less useful.
             | This is not so bad in the case of a tail-recursive function
             | which calls itself, but if proper tail calls are done
             | between multiple functions, you could have A call B tail
             | call C tail call D, and if D crashes the stack trace is
             | just (D called from A).
             | 
             | I'm sure there are some good ways around this problem, and
             | I would love to have proper tail calls available in most
             | languages.
        
       | tyingq wrote:
       | _" Applying this technique to protobuf parsing has yielded
       | amazing results: we have managed to demonstrate protobuf parsing
       | at over 2GB/s, more than double the previous state of the art."_
       | 
       | I was somewhat surprised to see that was state of the art for
       | protobufs. Simdjson boasts faster throughput, without the benefit
       | of the length encoded headers that are in protobufs. I looked for
       | examples of protobufs using SIMD, but could only find examples of
       | speeding up varint encode/decode.
        
         | haberman wrote:
         | Article author here. I dumped my test protobuf payload (7506
         | bytes) to JSON and got 19283 bytes.
         | 
         | So parsing this particular payload at 2GB/s would be equivalent
         | to parsing the JSON version at 5.1GB/s.
         | 
         | SIMD doesn't end up helping protobuf parsing too much. Due to
         | the varint-heavy nature of the format, it's hard to do very
         | much SIMD parallelism. Instead our approach focuses on going
         | for instruction-level parallelism, by trying to remove
         | instruction dependencies as much as possible. With this design
         | we get ~3.5 instructions per cycle in microbenchmarks (which
         | represents a best case scenario).
        
         | jeffbee wrote:
         | Making a flagrantly wasteful data format and then using its
         | bloated extent as the numerator in your benchmark will not
         | exactly be a fair comparison. If a protobuf has a packed,
         | repeated field that looks like \x0a\x02\x7f\x7f and json has
         | instead { "myFieldNameIsBob": [ 127, 127 ] } the JSON
         | interpreter has to be 20x faster just to stay even.
        
           | tyingq wrote:
           | That's true, would be interesting to see an "encoded entities
           | per second" comparison. Or maybe a comparison with mostly
           | stringy data where the size is probably comparable.
        
             | haberman wrote:
             | Article author here. I agree that would be a very
              | enlightening benchmark. Protobuf can dump to JSON, so it
             | shouldn't be too much work to dump my benchmark data to
             | JSON and benchmark the parsing with simdjson. Maybe I'll
             | see if I can get this done while this article is still on
             | the front page. :)
        
               | tyingq wrote:
               | Ah, wow, that's great. Apples-to-apples since you can
               | dump the same data to JSON. And shows why HN remains a
               | unique place. Wonder out loud, and maybe get an answer
               | within an update to the article you just read :)
        
               | haberman wrote:
               | So I was not able to get simdjson going in time, but I
               | did dump my test payload to JSON. In protobuf the payload
               | was 7506 bytes, in JSON 19283 bytes.
               | 
               | So parsing this particular payload at 2GB/s would be
               | equivalent to parsing it at 5.1GB/s in JSON.
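The conversion above is just a size-ratio scaling; a one-line helper (illustrative only, `json_equiv_gbps` is a made-up name) reproduces the figure: 2.0 * 19283 / 7506 is about 5.14.

```c
#include <assert.h>

/* Same payload in two encodings: scale the protobuf throughput by the
 * JSON/protobuf size ratio to get the JSON-equivalent parse rate. */
static double json_equiv_gbps(double pb_gbps, double pb_bytes,
                              double json_bytes) {
  return pb_gbps * (json_bytes / pb_bytes);
}
```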
        
         | kelnos wrote:
         | I don't know if this is a fair comparison, as 2GB/s of protobuf
        | will parse a lot more _information_ than 2GB/s of JSON will,
         | since protobuf is a much more space-efficient way to encode
         | your data.
        
           | tyingq wrote:
           | See https://news.ycombinator.com/item?id=26934063 , we may
           | get a fairly sound comparison. Though I imagine the
           | comparison varies a lot depending on the data. As another
           | comment mentioned, a message with a lot of large strings
           | would lean heavily towards protobufs. And in that case, the
           | size wouldn't be much different.
        
             | beached_whale wrote:
             | I would just measure the size of the resulting data
             | structures. In a language like C/C++ it could be as simple
              | as memory usage before your parsing function and after.
        
         | secondcoming wrote:
         | What benefit would length encoded headers provide other than to
         | reduce payload size? With JSON you just have to scan for
         | whitespace, whereas with protobuf you actually have to decode
         | the length field.
        
           | tyingq wrote:
           | Not having to look at every byte to split out the contents.
        
             | jeffbee wrote:
             | Right, a protobuf message that consists of a single 100MB
             | string will be "decoded" dramatically faster than 2GB/s,
             | because you only need visit the tag and the length and
             | store the address and length, aliasing the input.
             | 
             | It's really quite impossible to discuss data encoding
             | performance outside of a given schema.
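A sketch of that point (hypothetical code, assuming for brevity that the tag and length each fit in a single byte, i.e. are below 128): "decoding" a length-delimited field is just recording a pointer and a length into the input buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A length-delimited protobuf field (wire type 2) can be "parsed" by
 * aliasing the input: store a pointer and length, never touching the
 * field's bytes.  Simplification: single-byte tag and length. */
typedef struct {
  const uint8_t *data;
  size_t len;
} str_view;

static const uint8_t *parse_bytes_field(const uint8_t *p, const uint8_t *end,
                                        uint32_t want_field, str_view *out) {
  if (end - p < 2) return NULL;
  uint8_t tag = *p++;
  if ((tag >> 3) != want_field || (tag & 7) != 2) return NULL;
  uint8_t len = *p++;            /* single-byte length, i.e. < 128 */
  if (len > end - p) return NULL;
  out->data = p;                 /* alias the input buffer: no copy */
  out->len = len;
  return p + len;                /* skip the contents without reading them */
}
```

A 100MB string field costs the same few instructions as a 5-byte one, which is why throughput numbers depend so heavily on the schema.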
        
         | jsnell wrote:
         | Isn't there also a significant difference in what the input is
         | being parsed to? My expectation is for a protobuf library to
         | parse messages to structs, with names resolved and giving
         | constant time field access. Simdjson parses a json object to an
         | iterator, with field access being linear time and requiring
         | string comparisons rather than just indexing to a known memory
         | offset.
         | 
         | I.e. it seems like simdjson trades off performance at access
         | time for making the parsing faster. Whether that tradeoff is
         | good depends on the access pattern.
        
           | touisteur wrote:
           | But the same could be true for protobuf. Decode fields only
           | when you need them, and 'parse' just to find the field
           | boundaries and cardinality. Did stuff like that for internal
           | protobuf-like tool and with precomputed message profiles you
           | can get amazing perf. Just get the last or first bit of most
           | bytes (vgather if not on AMD) and you can do some magic.
        
         | beached_whale wrote:
         | You cannot compare them. Decoding JSON is compressing it
         | essentially, and in order to compare you would need to look at
         | the resulting data structures and how long it takes. strings
         | are comparable, but an integer is bigger exp2( log10(
         | digit_count ) ) I think.
         | 
         | But yeah, I had the same first thought too.
        
       | rapidlua wrote:
       | I've opened an issue in LLVM bugzilla concerning jump not being
       | folded with address computation on x86 with a proposed fix. Would
       | love if it gets some attention.
       | https://bugs.llvm.org/show_bug.cgi?id=50042
       | 
       | Also been working on a C language extension to enable guaranteed
       | tail calls along with explicit control over registers used for
       | argument passing. Provided that callee-save registers are used
       | for arguments, calling fallback functions incurs no overhead.
       | 
       | https://github.com/rapidlua/barebone-c
        
         | haberman wrote:
         | Barebone C looks really interesting! I wonder if the same gains
         | could be achieved in a more architecture-independent way by
         | simply introducing a new calling convention where parameters
         | are passed in the registers that are normally callee-save.
         | 
         | This would let the fast path functions use all registers
         | without needing to preserve any of them. It would also let them
         | call fallback functions that use a normal calling convention
         | without needing to spill the parameters to the stack.
        
           | rapidlua wrote:
           | Thank you for the feedback! A new calling convention could
           | probably nail it for many use cases. Sometimes you want
           | something very special though. E.g. LuaJIT pre-decodes
           | instruction prior to dispatch to squeeze some work into
           | branch misprediction delay. There are limited callee save
           | registers available hence it is better to use volatile
           | registers for instruction arguments. It should be possible to
           | reuse the implementation on different architectures and only
           | tweak the register assignment.
        
       | neoteric wrote:
        | One thing I've been thinking about in C++ land is just how much
        | the idiomatic usage of RAII actually prevents the compiler from
        | doing its own tail call optimization. Any object instantiated in
        | automatic storage with a non-trivial destructor basically
        | guarantees the compiler _can't_ emit a tail call. It's rather
        | unfortunate, but perhaps worth it if the tradeoff is well
        | understood. Example: https://godbolt.org/z/9WcYnc8YT
        
         | ufo wrote:
         | You can still use this trick in C++ if you ensure that the
         | return statements are outside the scopes that have the RAII
         | objects. It's awkward but it's better than nothing.
          | https://godbolt.org/z/cnbjaK85T
          | 
          |     void foo(int x) {
          |       {
          |         MyObj obj;
          |         // ...
          |       }
          |       return bar(x); // tail call
          |     }
        
         | gumby wrote:
         | I am unable to understand this comment. You're saying that you
         | can't generate a tail call by returning from the middle of a
          | function which needs to clean things up. RAII is merely
         | syntactic sugar to write this control flow and make it
         | mandatory.
         | 
         | Perhaps it's easier to think of tail-call semantics as simply
         | implementing iteration: it's another way of expressing a
         | for(...) loop. And if you used RAII in the body of your for
         | loop you would expect the same thing.
        
           | neoteric wrote:
           | Absolutely, RAII is an abstraction (and a useful one), but it
           | has a cost in that it prevents a form of useful optimization
           | because cleanup is required at the destruction of the stack
           | frame. You'd expect the same in C if you explicitly had to
           | call a cleanup function on return from a call.
           | 
           | What C++ does with RAII is make this tradeoff not obvious.
           | std::unique_ptr is a great example to show this: colloquially
           | a std::unique_ptr is "just a pointer", but it isn't in this
            | case because its non-trivial destructor prevents TCO.
        
           | hvdijk wrote:
           | I can understand the comment: even if the cleanup can be done
           | before the call, allowing the call to be done as a tail call,
           | the compiler will not move up the cleanup code for you, and
           | having the cleanup code last is the default state. With local
           | variables defined inside for loops, they are always destroyed
           | before the next iteration.
        
         | skybrian wrote:
         | These tail-call functions are part of a program's inner loop.
         | It seems like we shouldn't expect allocation within an inner
         | loop to be fast? Other than local variables, that is.
         | 
         | In an interpreter, it seems like either you're allocating
         | outside the interpreter loop (the lifetime is that of the
         | interpreter) or it's within a particular (slow) instruction, or
         | it's part of the language being interpreted and the lifetime
         | can't be handled by RAII. There will be fast instructions that
         | don't allocate and slower ones that do.
         | 
         | Interpreters are a bit weird in that there are lots of
         | instructions that do almost nothing so the overhead of the loop
         | is significant, combined with having lots of complication
         | inside the loop and little ability to predict what instruction
         | comes next. This is unlike many loops that can be unrolled.
        
         | sesuximo wrote:
         | I suppose the compiler could reorder function calls if it can
         | prove there is no change in behavior? If so, then it could
         | hoist dtors above the call and emit a jump. I doubt any
         | compilers do this.
        
       | _3u10 wrote:
       | Aren't protobufs supposed to be faster than JSON?
       | 
       | I mean congrats but it doesn't seem that impressive given JSON
       | has been parsing at that speed for years.
        
       | alfiedotwtf wrote:
       | I've never thought of tail calls in depth before, so quick
       | question about them - the point here I'm getting is to minimise
        | the pushing/popping of registers to the stack from `call` by
        | using a `jmp` instead. That's all good... but what stack frame
       | does the jumped procedure use? I'm assuming a new stack frame is
       | still created but just without caller's pushed registers?
        
         | dreamcompiler wrote:
         | There's always some stack frame sitting there; typically the
         | one that called the main loop routine in the first place. At
         | some point a RET instruction will be executed, and control will
         | return to that stack frame's return point. The idea of tail
         | calls is to avoid creating a new stack frame when it doesn't
         | add any value, but stack frames still exist--they have to exist
         | when the last thing that happens in a routine is ordinary
         | instruction processing rather than a CALL.
        
       | pyler wrote:
        | About the jne issue: LLVM does it intentionally.
       | 
       | Check line 1521
       | 
       | https://llvm.org/doxygen/BranchFolding_8cpp_source.html
        
         | ndesaulniers wrote:
         | Nice find! I wonder what they mean by changing branch
         | direction?
        
         | haberman wrote:
         | Wow thanks for the pointer! That seems unfortunate, I wonder if
         | there is a way to evaluate whether the extra jump is actually
         | worth it, and whether this optimization could be allowed.
        
       | neilv wrote:
       | Scheme has had tail calls as a basic idiom (some call it Proper
       | Implementation of Tail Calls; others call it Tail Call
       | Optimization) forever, and I keep them in mind any time I'm
       | implementing anything nontrivial in Scheme.
       | 
       | There's no special syntax -- there's just the notion of _tail
        | positions_, from which tail calls can occur. For example, both
       | arms of an `if` form are tail position, if the `if` itself is.
       | And if you introduce a sequencing block in one of those arms, the
       | last position in the block is tail position. (A specialized IDE
       | like DrRacket can indicate tail positions, and DrRacket will even
       | hover arrows, tracing the tail positions back to the top of the
       | code.)
       | 
       | When implementing a state machine, for example, a state
       | transition is simply a function application (call) to the new
        | state (optionally with arguments), where the called function
       | represents a state. It's satisfying when you realize how elegant
       | something that used to be messy is, and you can imagine it being
       | very fast (even if your Scheme implementation isn't quite up to
       | getting as fast as a C or Rust compiler that can recognize and
       | special-case tail calls.)
       | 
       | (For the state machine example, compare to more conventional
        | approaches in C, of "state ID" variables, `switch` statements on
       | those, and record-keeping for additional state. Or doing it in
       | data, with state ID being index into arrays, again with any
       | additional recordkeeping. Or lots of function calls when you
       | didn't really need the time overhead and stack usage of function
       | calls with full returns.)
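The state-machine-as-tail-calls idiom translates directly to C with guaranteed tail calls. A toy sketch (hypothetical code, using clang's `musttail` where available with a plain-return fallback): a recognizer for strings matching `(ab)*`, with one function per state and a tail call per transition.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#if defined(__clang__) && __has_attribute(musttail)
#define MUSTTAIL __attribute__((musttail))
#else
#define MUSTTAIL
#endif

/* Each state is a function; a state transition is a tail call to the
 * next state, so the machine runs in constant stack space. */
static bool state_b(const char *s);

static bool state_a(const char *s) {
  if (*s == '\0') return true;   /* accept: we're between "ab" pairs */
  if (*s != 'a') return false;
  MUSTTAIL return state_b(s + 1);
}

static bool state_b(const char *s) {
  if (*s != 'b') return false;
  MUSTTAIL return state_a(s + 1);
}
```

Compare this with the conventional C encoding: a `state` variable, a `while` loop, and a `switch` on the state ID; the tail-call version keeps each state's logic in its own function without paying for a full call/return per transition.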
        
         | CodeArtisan wrote:
         | Any functional programming language shall have TCO since
         | statements are forbidden (everything is a function).
        
       | [deleted]
        
       | bla3 wrote:
       | It does make code harder to debug when something goes wrong.
       | Since tail calls create no stack frames, a call sequence f->g->h
       | where g tail calls h will show up as a stack of (f, h) in the
       | debugger. The fix is easy, just make MUSTTAIL expand to nothing
       | in debug builds. But it's something to keep in mind, and it means
       | your debug mode code will have different memory use
       | characteristics.
        
         | lupire wrote:
          | Programmers have traditionally relied on the stack as an
          | (expensive yet incomplete) execution logging system, but the
          | easy fix for that is to use an actual log structure instead
          | of abusing the stack.
        
         | jhgb wrote:
         | Seems like not much of a fix if your code depends on it. In
         | Scheme such behavior would break standard compliance (and lots
         | of programs). And since presumably you'd use this attribute
         | _precisely_ when your code depended on it, disabling it may not
         | be feasible.
         | 
         | Fortunately the patterns generated by such code are no more
         | difficult to debug than simple loops in other languages, so the
         | lack of "history" of your computation seems no more concerning
         | with tail calls than it is with loops (or at least that's how I
         | always looked at it). And people don't seem to be avoiding
         | loops in programming merely for the reason of loops not keeping
         | track of their history for debugging purposes.
        
         | cryptonector wrote:
         | A while loop with a switch also creates no stack frames for
         | each operation interpreted.
        
         | ndesaulniers wrote:
         | If using DWARF for unwinding there is a tag IIRC that
         | represents that inlining has occurred.
        
         | haberman wrote:
         | That is true, but there is always https://rr-project.org for
         | easy reverse debugging if you're having trouble figuring out
         | where you came from.
         | 
         | If the alternative is to drop to assembly, a C-based approach
         | seems quite easy to debug. You can just add printf()
         | statements! Previously when I had been using assembly language
         | or a JIT, I had to resort to techniques like this:
         | https://blog.reverberate.org/2013/06/printf-debugging-in-ass...
        
           | jhgb wrote:
           | > but there is always https://rr-project.org for easy reverse
           | debugging if you're having trouble figuring out where you
           | came from
           | 
           | How did I not know about this? Thanks!
        
             | jpgvm wrote:
             | rr is really a hidden treasure. C/C++ wasn't the same for
             | me after discovering it.
             | 
             | It's sort of like the first time you learn about
             | interactive debugging. Everything you did before that
             | suddenly feels like "caveman coding" and you never want to
             | go back.
        
           | elcritch wrote:
           | Perhaps a "ring logger" approach could be useful. Append the
           | more useful bits of what would normally go into the stack
           | frame but without the allocation overhead. Seems like a few
           | memcpy's and modulo operations wouldn't hurt the code gen too
           | much.
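A sketch of that ring-logger idea (all names hypothetical): a fixed-capacity circular buffer that keeps only the most recent records, costing one store and one modulo per append, with no allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny ring logger: the last RING_CAP records are kept, older ones are
 * overwritten in place -- constant memory, no per-record allocation.
 * uint32_t records are illustrative (e.g. an opcode or return address
 * hash). */
#define RING_CAP 8

typedef struct {
  uint32_t entries[RING_CAP];
  size_t head;  /* total number of records ever appended */
} ring_log;

static void ring_append(ring_log *r, uint32_t record) {
  r->entries[r->head % RING_CAP] = record;
  r->head++;
}

/* Fetch the i-th most recent record (0 = newest); 0 if unavailable. */
static uint32_t ring_recent(const ring_log *r, size_t i) {
  if (i >= r->head || i >= RING_CAP) return 0;
  return r->entries[(r->head - 1 - i) % RING_CAP];
}
```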
        
           | kevincox wrote:
           | Of course RR is too expensive to be deployed in production
           | builds. So if you are getting core dumps from "production"
           | you won't have this information. So while RR helps it doesn't
           | completely mitigate the tradeoff.
        
             | [deleted]
        
             | touisteur wrote:
             | You may be interested in Intel PT snapshots in Linux core
             | dumps, that gdb knows how to interpret and will give you a
             | detailed branch trace. Less cool than rr but still very
             | interesting!
        
         | ufo wrote:
         | Only if the alternative is non-tail function calls. Often the
         | alternative to tail calls is a while loop, which also doesn't
         | leave a stack trace.
        
           | leoc wrote:
           | As so often, the moral is that pushing your problems in-band
           | doesn't make them go away.
        
         | OskarS wrote:
         | > The fix is easy, just make MUSTTAIL expand to nothing in
         | debug builds.
         | 
         | That wouldn't work. The final call is a recursive call to
         | dispatch, if that is not a tail call it will blow the stack
         | instantly.
        
         | pjmlp wrote:
         | Lisp and Scheme graphical debuggers do just fine tracking them.
        
       | [deleted]
        
       | CyberRabbi wrote:
       | Interesting but it seems like the essential function of the tail
       | call in this example is to create a strong hint to the compiler
       | which values should stay in registers. Isn't there a better way
       | to do that than relying on a seemingly unrelated optimization?
        
       | AndyKelley wrote:
       | Here's a related Zig proposal:
       | https://github.com/ziglang/zig/issues/8220
       | 
       | Relevant section pasted here:
       | 
       | > Other Possible Solution: Tail Calls
       | 
       | > Tail calls solve this problem. Each switch prong would return
       | foo() (tail call) and foo() at the end of its business would
       | inline call a function which would do the switch and then tail
       | call the next prong.
       | 
       | > This is reasonable in the sense that it is doable right now;
        | however there are some problems:
        | 
        |   * As far as I understand, tail calls don't work on some
        |     architectures. (what are these? does anybody know?)
        |   * I'm also concerned about trying to debug when doing dispatch
        |     with tail calls.
        |   * It forces you to organize your logic into functions. That's
        |     another jump that maybe you did not want in your hot path.
       | See also https://dino-lang.github.io/learn/dino.pdf, section 3.1
       | "Byte Code Dispatch".
        
         | haberman wrote:
         | It sounds like the main question in the proposal is: how can we
         | reliably get the compiler to generate threaded/replicated
         | dispatch?
         | 
         | While that is important, it is only one small part of what we
         | were trying to achieve with tail calls. Ultimately we found
         | that our tail call design significantly improved the register
         | allocation and overall code generation compared with computed
         | goto, see:
         | https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
        
           | seg_lol wrote:
           | I really want a timeline where we have tail calls everywhere
           | and this drama can go away. The non-tail call folks feel like
           | folks arguing for not using spaces in text.
           | 
           | https://en.wikipedia.org/wiki/Scriptio_continua
        
             | slaymaker1907 wrote:
             | I do think tail call optimization can make debugging/stack
             | traces confusing if you do general TCO. In the plain single
             | function case, TCO probably makes it easier to debug, but
             | it can get confusing if function A calls B and then B calls
             | C in tail position so the stack trace is just C|A. It's
             | probably not too bad if it is just one function getting
              | elided, but this gets very difficult if multiple important
             | calls are being removed from the stack.
             | 
             | This could also be a case of insufficient tooling for
             | languages with TCO. Better support for reverse debugging
             | would make detailed stacks less necessary. Most of the
             | stack trace isn't that useful anyway when printing out
             | exceptions compared to good application logging.
        
               | spockz wrote:
               | Don't we have debug symbols or something like it to be
               | able to translate the optimised version into the thing
               | you saw in code? If you need to dive into the assembly
               | you are in for a hell of a ride anyway due to all kinds
               | of optimisations happening.
        
               | seg_lol wrote:
               | And tail recursive functions encode their state in the
               | args and the body. There isn't any information in the
               | stack anyway, by definition.
               | 
                | If one does need to debug mutually tail recursive
                | functions and step through them, this isn't a problem.
                | The _point_ is that they don't consume stack space. If
                | you _do_ want a stack of invocations, then don't write
                | tail recursive code?
               | 
               | This style of argumentation that the GP showed should
               | have a name.
               | 
               | BTW my original comment is at -1, hate but only in the
               | tail position.
        
             | haberman wrote:
             | I have to admit I had little interest in tail calls myself
             | until I saw what they could achieve performance-wise. Now
             | that I have seen that, I am sold. :)
        
               | seg_lol wrote:
               | Performance is only one small aspect, the things you can
               | express in tail recursive functions are beautiful and
                | compact. It feels like being in an inductive proof
               | playground.
        
       | SavantIdiot wrote:
        | Just here to say that the use of those #defines for huge
        | parameter lists makes me sad. I realize that's a common
        | pattern, but if your call list is that big, how about a struct
        | pointer?
        
         | haberman wrote:
         | That would defeat the entire purpose: they need to be six
         | parameters to ensure that they are all in registers.
         | 
         | However I was considering whether they could be a struct with
         | six members passed by value. If the ABI would pass that in six
         | registers we could get rid of the #defines, which I agree would
         | be nicer.
        
           | ufo wrote:
           | Unfortunately, I think that only structs with at most 2 words
           | are passed via registers. Larger structs are passed via the
           | stack.
           | 
           | https://godbolt.org/z/frof3xjhW
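For reference, this follows from the SysV x86-64 ABI's classification rules: an aggregate larger than two eightbytes (16 bytes) is classified MEMORY and passed on the stack. A toy illustration (hypothetical names; a runtime test can only check values, so verifying the actual codegen still requires something like the godbolt link above):

```c
#include <stdint.h>
#include <stddef.h>

/* A six-word struct (48 bytes) exceeds the SysV 16-byte limit for
 * register-passed aggregates, so it would be passed in memory... */
typedef struct {
    const uint8_t *ptr, *limit;
    uint64_t data, hasbits, tag, depth;
} parse_ctx;

static uint64_t by_struct(parse_ctx c) {
    return c.data + c.tag;
}

/* ...whereas six separate scalar arguments all arrive in registers
 * (rdi, rsi, rdx, rcx, r8, r9 on x86-64). */
static uint64_t by_args(const uint8_t *ptr, const uint8_t *limit,
                        uint64_t data, uint64_t hasbits,
                        uint64_t tag, uint64_t depth) {
    (void)ptr; (void)limit; (void)hasbits; (void)depth;
    return data + tag;
}
```

Both shapes compute the same result; only the calling convention differs, which is why the #define approach sticks with separate parameters.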
        
             | haberman wrote:
             | Bummer, thanks for the info.
        
       | londons_explore wrote:
       | This article suggests a big reason to use this approach is to
        | separate hot and cold code.
       | 
       | I assume that's for good use of the CPU's instruction and
       | microcode caches.
       | 
       | Yet in a protobuf parser, I'm surprised there is enough code to
       | fill said caches, even if you put the hot and cold code together.
       | Protobuf just isn't that complicated!
       | 
       | Am I wrong?
        
         | tooltower wrote:
         | In particular, the author is talking about CPU registers being
         | spilled to memory, and the need for setting up or tearing down
         | stack frames. Those things can only be eliminated by the
         | compiler for extremely simple functions. The error-handling
         | often isn't.
        
         | haberman wrote:
         | > I assume that's for good use of the CPU's instruction and
         | microcode caches.
         | 
         | I don't think that is the reason. These are microbenchmark
         | results, where realistically all the code will be hot in caches
         | anyway.
         | 
         | The problem is that a compiler optimizes an entire function as
         | a whole. If you have slow paths in the same function as fast
         | paths, it can cause the fast paths to get worse code, even if
         | the slow paths are never executed!
         | 
         | You might hope that using __builtin_expect(), aka
         | LIKELY()/UNLIKELY() macros on the if statements would help.
         | They do help somewhat, but not as much as just putting the slow
         | paths in separate functions entirely.
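As a rough illustration of the slow-path-outlining idea (not upb's actual code; the names and the simplified varint handling are just for the sketch), the slow path gets its own noinline function so the compiler can allocate registers for the fast path alone:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical slow path, outlined with noinline so its register
 * pressure and spills cannot leak into the fast path's codegen. */
__attribute__((noinline))
static size_t parse_varint_slow(const uint8_t *p, size_t len, uint64_t *out) {
    uint64_t val = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        val |= (uint64_t)(p[i] & 0x7F) << (7 * i);
        if (!(p[i] & 0x80)) {  /* continuation bit clear: last byte */
            *out = val;
            return i + 1;
        }
    }
    return 0;  /* malformed or truncated input */
}

/* Fast path: the overwhelmingly common one-byte varint. The compiler
 * optimizes this tiny function in isolation. */
static inline size_t parse_varint(const uint8_t *p, size_t len, uint64_t *out) {
    if (len > 0 && !(p[0] & 0x80)) {
        *out = p[0];
        return 1;
    }
    return parse_varint_slow(p, len, out);
}
```

The outlined function call also acts as a stronger hint than __builtin_expect: the slow path's code is physically elsewhere, not merely predicted cold.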
        
           | touisteur wrote:
           | Similar if using the 'hot' attribute: separate functions or
           | separate labels... https://stackoverflow.com/a/15034687
        
       | zozbot234 wrote:
       | Is this really higher performance than the existing protobuf
       | support in Rust+serde? That uses macro programming to generate
       | code at compile time based on a high-level description of your
       | data format, so it can be quite fast and will certainly be a lot
       | safer than cowboy-coding raw C.
        
         | linkdd wrote:
         | Every single time there is a link about C/C++, I do a quick
         | "Ctrl+F Rust", and every single time there is some rant about
         | "but Rust is safer than C".
         | 
         | Every. Single. Time.
         | 
         | This article is about a new Clang extension:
         | 
         | > An exciting feature just landed in the main branch of the
         | Clang compiler. Using the [[clang::musttail]] or
         | __attribute__((musttail)) statement attributes, you can now get
         | guaranteed tail calls in C, C++, and Objective-C.
         | 
         | (literally the first line of the article)
         | 
         | What does Rust have to do with this? Nothing.
         | 
         | Does the author suggest that you should only use C for high
         | performance? No.
        
           | littlestymaar wrote:
           | > Every single time there is a link about C/C++, I do a quick
           | "Ctrl+F Rust", and every single time there is some rant about
           | "but Rust is safer than C".
           | 
            | If you had the same reflex with other languages when
            | reading other HN comment threads, you'd realize that every
            | time there's a thread about _language A_ there's a ton of
            | comments about _language B or C_. Rust threads are full of
            | people talking about C++ or C, so why is that supposed to
            | be more OK? Go threads are full of comments about Java,
            | Python threads are full of Go or Julia (depending on what
            | kind of Python this is about). This isn't specific to Rust
            | in any way.
           | 
           | Yes, the GP isn't the most useful comment ever, Rust is hype
           | and there are overly enthusiastic people, but there are also
           | people who seem to be overly defensive about it (doing a text
           | search to count Rust occurrences, every time, really?) and
           | I'm not sure the latter is less childish than the former.
        
             | linkdd wrote:
             | > If you had the same reflex with other languages
             | 
             | I do, always, just out of curiosity because seeing peoples
             | view on different languages is always insightful.
             | 
             | > This isn't specific to Rust in any ways.
             | 
             | From what I've seen, Rust comments are the most
             | condescending.
             | 
             | > I'm not sure the latter is less childish than the former.
             | 
              | Is it childish to look for specific information (in this
              | case: how C and Rust are compared by others) in threads
              | that grow to hundreds of messages?
             | 
             | FYI, I've also looked for C++ and Go comments in this
             | thread, and none of them were a condescending comment like
             | "C is bad, you should use Rust, C is evil and should not be
             | sent to production".
        
           | zozbot234 wrote:
           | > This article is about a new Clang extension:
           | 
           | > An exciting feature just landed in the main branch of the
           | Clang compiler. Using the [[clang::musttail]] or
           | __attribute__((musttail)) statement attributes, you can now
           | get guaranteed tail calls in C, C++, and Objective-C.
           | 
           | > What does Rust have to do with this? Nothing.
           | 
           | Rust has a planned keyword for this exact feature, namely
           | 'become'. So the author would be able to use it in Rust just
           | as well, as soon as support for it lands in an upstream LLVM
           | release.
           | 
           | Regardless, writing raw C as the article suggests in order to
           | parse a quasi-standard high-level format is cowboy coding.
           | It's a nice hack to be sure, but it's not the kind of code
           | that should be anywhere close to a real production workload.
           | Instead, this feature should be implemented as part of some
           | safe parser generator. Not necessarily written in Rust but
           | something that's at least as safe.
        
             | Cloudef wrote:
             | I think you might need a reality check and see how many
             | systems run C code in production.
        
               | linkdd wrote:
               | No cowboy would treat their servers like cattle.
               | 
               | Joke aside, the Linux kernel is mainly developed in C and
               | runs in production everywhere.
        
             | linkdd wrote:
             | > Rust has a planned keyword for this exact feature, namely
             | 'become'. So the author would be able to use it in Rust
             | just as well, as soon as support for it lands in an
             | upstream LLVM release.
             | 
             | The author is demonstrating a new Clang feature. Then,
             | again, what is the point of mentioning Rust?
             | 
             | > writing raw C as the article suggests in order to parse a
             | quasi-standard high-level format is cowboy coding.
             | 
             | So what? Cowboy coding lets you appreciate even more
             | languages where you don't need to do it, AND gives you a
             | better understanding of how they might work.
             | 
             | > it's not the kind of code that should be anywhere close
             | to a real production workload.
             | 
             | Again, not a single time in the article the author suggests
             | that.
        
       | ma2rten wrote:
       | Protobufs are very important for Google. A significant percentage
       | of all compute cycles is used on parsing protobufs. I am
        | surprised that the parsing is not done using handwritten
       | assembly if it's possible to improve performance so much.
        
         | mkoubaa wrote:
         | Given that fact I'm wondering if google ever researched custom
         | chips or instruction sets for marshalling pbs, like the TPUs
         | they worked on for ML.
        
           | lupire wrote:
           | Problem is once you parse the protobuf, you have to
           | immediately do other computations on it in the same process.
            | No one needs to parse protobufs all day long the way an ML
            | model runs inference or a miner computes hashes.
        
             | jeffbee wrote:
             | That doesn't seem to preclude hardware assistance. For
             | example they have also explored hardware acceleration for
             | the tcmalloc fast path allocation by adding circuits to
             | general purpose CPUs. Arguably, Intel BMI descends from a
             | request that Google made years ago to speed up protobuf
             | decoding (and other workloads) on x86.
        
         | xxpor wrote:
         | Handwritten ASM for perf is almost never worth it in modern
         | times. C compiled with GCC/Clang will almost always be just as
         | fast or faster. You might use some inline ASM to use a specific
         | instruction if the compiler doesn't support generating it yet
         | (like AVX512 or AES), but even for that there's probably an
         | intrinsic available. You can still inspect the output to make
         | sure it's not doing anything stupid.
         | 
         | Plus it's C so it's infinitely more maintainable and way more
         | portable.
        
         | pjmlp wrote:
         | Yet Microsoft was able to make a .NET implementation faster
         | than the current Google's C++ one.
         | 
         | A proof that they don't care enough about protobuf parsing
         | performance.
         | 
         | https://devblogs.microsoft.com/aspnet/grpc-performance-impro...
        
           | haberman wrote:
           | From your link:
           | 
           | > Support for Protobuf buffer serialization was a multi-year
           | effort between Microsoft and Google engineers. Changes were
           | spread across multiple repositories.
           | 
           | The implementation being discussed there is the main C#
            | implementation from
            | https://github.com/protocolbuffers/protobuf/tree/master/csha...
        
             | pjmlp wrote:
             | I know, the point was that they don't care to improve the
             | C++ version.
        
               | boulos wrote:
               | haberman is part of the "they" here (he's on the core
               | protobuf team, does things like this).
        
               | haberman wrote:
               | The C++ parser has been completely rewritten in the last
               | few years to push the current code generation design to
               | its limits. The biggest obstacle in the way of optimizing
               | it further is legacy APIs, especially std::string for
               | string fields, which forces tons of small heap
               | allocations and frees and defeats the benefits of arena
               | allocation.
               | 
               | Changing very widely used APIs is not easy, these things
               | take time.
        
               | pjmlp wrote:
                | Interesting, thanks for the clarification. However,
                | why do libraries depend on what should be
                | implementation details of generated code to start
                | with?
               | 
               | Something that by good CI/CD practices shouldn't even be
               | part of checked in code.
        
               | haberman wrote:
               | It's exposed in the API. When you access a string field
               | in protobuf, say foo.my_str_field(), the return type is
               | const std::string&. We have to have a real std::string
               | internally to return a reference to it.
        
               | pjmlp wrote:
               | Thanks for the explanation.
        
         | barbazoo wrote:
         | In what kind of scenarios do they use Protobufs? I can think of
         | messaging systems, streaming, RPC, that sort of thing?
        
           | ch33zer wrote:
           | Everything at google is built on RPCs between systems using
           | protobufs
        
             | ma2rten wrote:
             | Protobuf is also used as a format to write data to disk.
        
               | alephnan wrote:
               | Why is this being down voted? It's accurate.
        
         | CoolGuySteve wrote:
         | Protobuf's abysmal performance, questionable integration into
         | the C++ type system, append-only expandability, and annoying
         | naming conventions and default values are why I usually try and
         | steer away from it.
         | 
         | As a lingua franca between interpreted languages it's about par
         | for the course but you'd think the fast language should be the
         | fast path (ie: zero parsing/marshalling overhead in Rust/C/C++,
         | no allocations) as you're usually not writing in these
         | languages for fun but because you need the thing to be fast.
         | 
         | It's also the kind of choice that comes back to bite you years
         | into a project if you started with something like Python and
         | then need to rewrite a component in a systems language to make
         | it faster. Now you not only have to rewrite your component but
         | change the serialization format too.
         | 
         | Unfortunately Protobuf gets a ton of mindshare because nobody
         | ever got fired for using a Google library. IMO it's just not
         | that good and you're inheriting a good chunk of Google's
         | technical debt when adopting it.
        
           | haberman wrote:
           | Zero parse wire formats definitely have benefits, but they
           | also have downsides such as significantly larger payloads,
           | more constrained APIs, and typically more constraints on how
           | the schema can evolve. They also have a wire size
           | proportional to the size of the schema (declared fields)
           | rather than proportional to the size of the data (present
           | fields), which makes them unsuitable for some of the cases
           | where protobuf is used.
           | 
           | With the techniques described in this article, protobuf
           | parsing speed is reasonably competitive, though if your
           | yardstick is zero-parse, it will never match up.
        
             | CoolGuySteve wrote:
             | Situations where wire/disk bandwidth are constrained are
             | usually better served by compressing the entire stream
             | rather than trying to integrate some run encoding into the
             | message format itself.
             | 
             | You only need to pay for decompression once to load the
             | message into ram rather than being forced to either make a
             | copy or pay for decoding all throughout the program
             | whenever fields are accessed. And if the link is bandwidth
             | constrained then the added latency of decompression is
             | probably negligible.
             | 
             | The separation of concerns between compression format and
             | encoding also allows specifically tuned compression
             | algorithms to be used, for example like when switching
             | zstd's many compression levels. Separating the compression
             | from encoding also lets you compress/decompress on another
             | processor core for higher throughput.
             | 
             | Meanwhile you can also do a one shot decompression or skip
             | compression of a stream for replay/analysis; when talking
             | over a low latency high bandwidth link/IPC; or when
             | serializing to/from an already compressed filesystem like
             | btrfs+zstd/lzo.
             | 
             | It's just more flexible this way with negligible drawbacks.
        
           | gorset wrote:
           | It's obviously possible to do protobuf with zero
           | parsing/marshalling if you stick to fixed length messages and
           | 4/8 byte fields. Not saying that's a good idea, since there
           | are simpler binary encodings out there when you need that
           | type of performance.
        
           | fnord123 wrote:
           | FWIW, the python protobuf library defaults to using the C++
           | implementation with bindings. So even if this is a blog post
           | about implementing protobuf in C, it can also help
           | implementations in other languages.
           | 
           | But yes, once you want real high performance, protobuf will
           | disappoint you when you benchmark and find it responsible for
           | all the CPU use. What are the options to reduce parsing
           | overhead? flatbuffers? xdr?
        
             | TheGuyWhoCodes wrote:
             | Flatbuffers, Cap'n'Proto and Apache Arrow comes to mind.
        
         | PostThisTooFast wrote:
         | So important that they haven't bothered to create a protobuf
         | generator for Kotlin, the primary development language for
         | their own mobile operating system.
        
       | ufo wrote:
       | A very interesting trick! In my opinion, the big takeaway here is
       | that if you are willing to write your C code in tail call style,
       | you can get a lot of control over which variables are stored in
       | machine registers. In the x86-64 ABI, the first 6 function
       | parameters are guaranteed to be passed via registers.
       | 
       | Obviously, this is architecture-specific. Other architectures may
       | use a different calling convention that ruins this kind of manual
       | register allocation. I'd imagine that it would be bad for
       | performance in 32-bit systems, where function arguments are
       | passed via the stack.
       | 
       | --------
       | 
       | > I think it's likely that all of the major language interpreters
       | written in C (Python, Ruby, PHP, Lua, etc.) could get significant
       | performance benefits by adopting this technique.
       | 
       | I know that at least Lua and Python use "computed gotos" in their
       | inner loops, which also helps the compiler generate better code.
       | The architecture-dependent nature of the tail-call trick could be
       | a problem here. Some of these interpreters also need to work well
       | on 32-bit systems.
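A toy sketch of the tail-call dispatch style under discussion (hypothetical names, not the article's real parser; the MUSTTAIL macro falls back to a plain, non-guaranteed return on compilers without musttail support):

```c
#include <stdint.h>

#if defined(__clang__) && __clang_major__ >= 13
#define MUSTTAIL __attribute__((musttail))
#else
#define MUSTTAIL /* plain return; tail call hoped for, not guaranteed */
#endif

/* Every handler shares one signature, so the interpreter state lives
 * in the first six integer argument registers on x86-64 / ARM64. */
typedef uint64_t op_fn(const uint8_t *p, const uint8_t *end, uint64_t acc);

static uint64_t dispatch(const uint8_t *p, const uint8_t *end, uint64_t acc);

/* Opcode 1: add the following byte to the accumulator, then jump
 * (not call) back into dispatch via the guaranteed tail call. */
static uint64_t op_add(const uint8_t *p, const uint8_t *end, uint64_t acc) {
    acc += p[1];
    MUSTTAIL return dispatch(p + 2, end, acc);
}

/* Opcode 0: stop and yield the accumulator. */
static uint64_t op_halt(const uint8_t *p, const uint8_t *end, uint64_t acc) {
    (void)p; (void)end;
    return acc;
}

static uint64_t dispatch(const uint8_t *p, const uint8_t *end, uint64_t acc) {
    static op_fn *const table[2] = { op_halt, op_add };
    if (p >= end) return acc;
    MUSTTAIL return table[p[0] & 1](p, end, acc);
}
```

Under clang -O2 each MUSTTAIL return should lower to a jmp, so the whole loop runs in a single stack frame regardless of input length.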
        
         | haberman wrote:
         | Yep, that's exactly the right takeaway. :) I have verified this
         | on ARM64 also, but architectures that pass parameters on the
         | stack will make this technique not really worth it.
         | 
         | Re: PHP, Ruby, etc. yes, computed gotos help some, but I think
         | tail calls would help much more. That was our experience at
         | least. I wrote more about this here:
         | https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
         | 
         | Yes, there are portability issues with the tail call approach,
          | so there would need to be a fallback on non-x64/ARM64 platforms.
         | This would add complexity. But it's exciting to think that it
         | could unlock significant performance improvements.
        
           | spockz wrote:
           | If you like tail calls, look into CPS. Many forms of (pure)
           | code can be rewritten in such a way.
           | 
           | Everyone who writes Haskell quickly learns to write their
           | recursive functions so that they are (mutually) tail
           | recursive.
        
           | ufo wrote:
           | On the matter of portability, I wonder if it would be
           | possible to use some macro magic and/or a code generator to
           | convert the tail-call version back into a more traditional
           | while/switch loop, for the stack-based architectures.
           | 
           | While the tail call version is more architecture dependent,
           | it's nevertheless more portable than assembly language. It's
           | still C.
        
             | haberman wrote:
             | I haven't implemented the fallback yet, but my thought was
             | to keep most of the structure the same (dispatch function
             | tail calls to the function for an individual operation) but
             | then have the individual operation just return instead of
             | tail calling back to dispatch.
             | 
             | I think this would be portable while still avoiding some of
             | the performance problems associated with a traditional
             | switch (like the bounds check).
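One possible shape for such a fallback (a hedged sketch with hypothetical names, not upb's code): each operation returns its updated state instead of tail-calling back into dispatch, and a plain loop re-enters the handler table.

```c
#include <stdint.h>

/* Portable fallback: no guaranteed-tail-call support needed. Dispatch
 * remains an indirect call through a table, still avoiding a switch
 * statement and its bounds check. */
typedef struct {
    const uint8_t *p;
    uint64_t acc;
    int done;
} vm_state;

typedef vm_state op_fn(vm_state s);

static vm_state op_add(vm_state s) {   /* opcode 1: add next byte */
    s.acc += s.p[1];
    s.p += 2;
    return s;
}

static vm_state op_halt(vm_state s) {  /* opcode 0: stop */
    s.done = 1;
    return s;
}

static uint64_t run(const uint8_t *p, const uint8_t *end) {
    static op_fn *const table[2] = { op_halt, op_add };
    vm_state s = { p, 0, 0 };
    while (!s.done && s.p < end)
        s = table[s.p[0] & 1](s);  /* handler returns, loop re-dispatches */
    return s.acc;
}
```

The cost relative to the tail-call version is the extra return/re-dispatch round trip per operation, but the structure of the handlers stays the same.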
        
               | elcritch wrote:
                | I'm curious how this approach would fare for
                | microcontrollers, or rather Cortex M0-M4s. They're
                | 32-bit ARM and may not have enough usable argument
                | registers. Still, parsing protobufs or interpreting
                | WASM faster would translate directly into better
                | battery life, etc.
        
         | tomp wrote:
         | Modern compilers can compile tail calls of unknown functions
         | (i.e. function pointers) to jumps; so instead of using
          | "computed gotos" (which IIRC are a GCC extension), one can
          | use ANSI C _and_ get more efficient code (because of control
          | of registers)
        
         | mappu wrote:
         | Regarding PHP specifically, the interpreter's opcodes are all
         | described in a meta language, and the actual PHP interpreter
         | can be code-generated in 4 different styles depending on what
         | is fastest for the target C compiler.
         | 
         | See ZEND_VM_KIND_CALL / SWITCH / GOTO / HYBRID in php-
         | src/Zend/zend_vm_gen.php
        
         | kevincox wrote:
         | > the first 6 function parameters are guaranteed to be passed
         | via registers.
         | 
         | This assumes that your function is being called via the regular
         | calling convention. By the as-if rule there is nothing
         | guaranteeing that an internal call is using the regular calling
         | convention or that it hasn't been inlined.
        
           | haberman wrote:
           | We use the "noinline" attribute to get control over inlining,
           | which gives a reasonable assurance that we'll get a normal
           | standard ABI call with six parameters in registers on
           | x64/ARM64.
        
           | anarazel wrote:
            | Are there cases where a compiler would reasonably use a
            | different internal calling convention that passes fewer
            | arguments via registers?
           | 
           | Yours is still a valid point (no reason doesn't imply a
           | guarantee) even if there's none, to be clear. Just curious
           | because I can't really see any reason for a compiler to do
           | so.
           | 
           | I can imagine some theoretical cases where compiler
           | optimizations lead to additional arguments to be passed to a
           | function [version].
        
             | kevincox wrote:
              | I agree it is unlikely. I can imagine something like the
              | compiler realizing that some computation is pure, so it
              | caches the value across invocations of the function,
              | essentially adding additional arguments that push the
              | actual arguments onto the stack. Of course this is an
              | unlikely edge case, but a reasonable set of possible
              | optimizations could work together in a surprising way.
        
         | iamevn wrote:
          | Wouldn't you still get a lesser improvement from the tail
          | call overwriting the stack frame and jumping instead of
          | calling?
        
       ___________________________________________________________________
       (page generated 2021-04-25 23:00 UTC)