[HN Gopher] Parsing Protobuf at 2+GB/S: How I Learned to Love Ta... ___________________________________________________________________ Parsing Protobuf at 2+GB/S: How I Learned to Love Tail Calls in C Author : signa11 Score : 369 points Date : 2021-04-25 10:04 UTC (12 hours ago) (HTM) web link (blog.reverberate.org) (TXT) w3m dump (blog.reverberate.org) | z77dj3kl wrote: | Very cool! OT but is there a list of protobuf benchmarks | somewhere across languages/implementations? How does it compare | to JSON or just dumping raw bytes? | slver wrote: | You'd get equivalent performance with "goto", but not depend on | implicit compiler optimizations to save you from stack overflow. | | I like the idea of tail calls, but I don't like it being an | implicit "maybe" optimization of the compiler. Make it part of | the standard, make the syntax explicit, and then go all in. | haberman wrote: | You really wouldn't. We tried many approaches, including | computed goto, but none of them were able to generate | satisfactory code when compared with tail calls: | https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html | | Also, it is not an implicit optimization when "musttail" | support is available in the compiler. | slver wrote: | A tail call is compiled as a JMP, and so is a goto, so | I'm curious where the difference would come from. | PostThisTooFast wrote: | One thing that appealed to us about protobufs is their alleged | change-tolerance (versioning). | | Years later, this remains only alleged... not realized. | a-dub wrote: | even if you set aside the claimed gains over handwritten | imperative loops, being able to tell the compiler to fail unless | tail call conversion succeeds will be huge in terms of preventing | performance or stack depth bugs from surfacing when someone | accidentally makes a change that makes a tail call no longer | convertible. | | if you ask me, it should be part of the language. 
the whole "just | make sure you shouldn't need a stack and the compiler will | probably take care of it" approach has bothered me for years. | alfiedotwtf wrote: | > the whole "just make sure you shouldn't need a stack and the | compiler will probably take care of it" approach has bothered | me for years. | | Good point. I love Rust's safety guarantees, but you're right - | it doesn't take the stack into account like everything else. | coolreader18 wrote: | It's funny, wasm3 was just on the front page with a tail-call | based interpreter model, and now this is too. Now I wanna do some | stuff with tail calls :) | khiner wrote: | This was a great read! | tux1968 wrote: | This is a really great result, but I'm curious why this isn't a | very common and standard compiler optimization, at least as an | option you can enable? It seems like the conditions where it can | be applied are pretty easy to identify for a compiler. | haberman wrote: | Tail calls are a very common optimization; both Clang and GCC | have been performing this optimization successfully for a | while. What is new is getting a _guarantee_ that applies to all | build modes, including non-optimized builds. | tux1968 wrote: | If you're interested in this optimization for performance | reasons, why would you want an otherwise non-optimized build? | It seems that the only important case is the optimized | build... where for some reason you're not getting this | optimization without explicitly asking for it. | | So the question remains... why is the compiler | missing this chance to optimize this tail call without it | being explicitly marked for optimization? | wffurr wrote: | If you write this style of code and don't get the | optimization, then your test code and debug sessions are | dog slow. Much, much slower than the old giant switch style | parser. 
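The guarantee wffurr and haberman describe can be sketched with a toy tail-call dispatch loop (a hypothetical sketch, not upb's actual parser; the opcode set and all names here are made up). Each handler ends by tail-calling the next handler, so with `musttail` the stack stays O(1) in every build mode:

```c
#include <stdint.h>

/* MUSTTAIL expands to Clang's new statement attribute when available;
 * otherwise it expands to nothing and we rely on ordinary TCO. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

typedef int (*op_fn)(const uint8_t *ip, int acc);

static int op_halt(const uint8_t *ip, int acc);
static int op_add(const uint8_t *ip, int acc);

/* Dispatch table indexed by opcode byte: 0 = halt, 1 = add. */
static const op_fn table[] = {op_halt, op_add};

static int dispatch(const uint8_t *ip, int acc) {
    /* Guaranteed tail call: compiles to an indirect jmp, no new frame. */
    MUSTTAIL return table[*ip](ip, acc);
}

static int op_halt(const uint8_t *ip, int acc) {
    (void)ip;
    return acc;  /* the single real return of the whole loop */
}

static int op_add(const uint8_t *ip, int acc) {
    acc += ip[1];  /* operand is the byte after the opcode */
    MUSTTAIL return dispatch(ip + 2, acc);
}

int run(const uint8_t *program) { return dispatch(program, 0); }
```

With the attribute in place, a change that makes one of these calls non-tail-callable becomes a compile error instead of a silent O(n)-stack surprise, which is exactly the debug-build concern raised above.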
| | Similarly, if you make a mistake that would cause the | optimization not to be applied, you'd rather get a compiler | error than suddenly have a 10X performance regression. | haberman wrote: | If the optimization is not performed, you blow the stack | because your stack usage is now O(n) instead of O(1). | | That's why it's important to get the optimization in debug | mode. | vitus wrote: | What's the argument for building in debug mode vs | specifying -g to the compiler to build with debug | symbols? | | I've previously encountered bugs that stubbornly refused | to reproduce in a non-optimized build. | | One thing that comes to mind is `#ifdef DEBUG`-style | macros, although I'm curious if there's anything else I'm | missing. Omitted stack frames, e.g. from inlining, | perhaps? | joppy wrote: | One reason that this is not done by default is that it can | make debugging a surprise: since each function does not | leave a stack frame, stack traces are much less useful. | This is not so bad in the case of a tail-recursive function | which calls itself, but if proper tail calls are done | between multiple functions, you could have A call B tail | call C tail call D, and if D crashes the stack trace is | just (D called from A). | | I'm sure there are some good ways around this problem, and | I would love to have proper tail calls available in most | languages. | tyingq wrote: | _" Applying this technique to protobuf parsing has yielded | amazing results: we have managed to demonstrate protobuf parsing | at over 2GB/s, more than double the previous state of the art."_ | | I was somewhat surprised to see that was state of the art for | protobufs. Simdjson boasts faster throughput, without the benefit | of the length encoded headers that are in protobufs. I looked for | examples of protobufs using SIMD, but could only find examples of | speeding up varint encode/decode. | haberman wrote: | Article author here. 
I dumped my test protobuf payload (7506 | bytes) to JSON and got 19283 bytes. | | So parsing this particular payload at 2GB/s would be equivalent | to parsing the JSON version at 5.1GB/s. | | SIMD doesn't end up helping protobuf parsing too much. Due to | the varint-heavy nature of the format, it's hard to do very | much SIMD parallelism. Instead our approach focuses on going | for instruction-level parallelism, by trying to remove | instruction dependencies as much as possible. With this design | we get ~3.5 instructions per cycle in microbenchmarks (which | represents a best case scenario). | jeffbee wrote: | Making a flagrantly wasteful data format and then using its | bloated extent as the numerator in your benchmark will not | exactly be a fair comparison. If a protobuf has a packed, | repeated field that looks like \x0a\x02\x7f\x7f and JSON has | instead { "myFieldNameIsBob": [ 127, 127 ] } the JSON | interpreter has to be 20x faster just to stay even. | tyingq wrote: | That's true; it would be interesting to see an "encoded entities | per second" comparison. Or maybe a comparison with mostly | stringy data where the size is probably comparable. | haberman wrote: | Article author here. I agree that would be a very | enlightening benchmark. Protobuf can dump to JSON, so it | shouldn't be too much work to dump my benchmark data to | JSON and benchmark the parsing with simdjson. Maybe I'll | see if I can get this done while this article is still on | the front page. :) | tyingq wrote: | Ah, wow, that's great. Apples-to-apples since you can | dump the same data to JSON. And it shows why HN remains a | unique place. Wonder out loud, and maybe get an answer | within an update to the article you just read :) | haberman wrote: | So I was not able to get simdjson going in time, but I | did dump my test payload to JSON. In protobuf the payload | was 7506 bytes, in JSON 19283 bytes. 
| | So parsing this particular payload at 2GB/s would be | equivalent to parsing it at 5.1GB/s in JSON. | kelnos wrote: | I don't know if this is a fair comparison, as 2GB/s of protobuf | will parse a lot more _information_ than 2GB/s of JSON will, | since protobuf is a much more space-efficient way to encode | your data. | tyingq wrote: | See https://news.ycombinator.com/item?id=26934063 , we may | get a fairly sound comparison. Though I imagine the | comparison varies a lot depending on the data. As another | comment mentioned, a message with a lot of large strings | would lean heavily towards protobufs. And in that case, the | size wouldn't be much different. | beached_whale wrote: | I would just measure the size of the resulting data | structures. In a language like C/C++ it could be as simple | as comparing memory usage before and after your parsing | function. | secondcoming wrote: | What benefit would length encoded headers provide other than to | reduce payload size? With JSON you just have to scan for | whitespace, whereas with protobuf you actually have to decode | the length field. | tyingq wrote: | Not having to look at every byte to split out the contents. | jeffbee wrote: | Right, a protobuf message that consists of a single 100MB | string will be "decoded" dramatically faster than 2GB/s, | because you only need to visit the tag and the length and | store the address and length, aliasing the input. | | It's really quite impossible to discuss data encoding | performance outside of a given schema. | jsnell wrote: | Isn't there also a significant difference in what the input is | being parsed to? My expectation is for a protobuf library to | parse messages to structs, with names resolved and giving | constant time field access. Simdjson parses a json object to an | iterator, with field access being linear time and requiring | string comparisons rather than just indexing to a known memory | offset. | | I.e. 
it seems like simdjson trades off performance at access | time for making the parsing faster. Whether that tradeoff is | good depends on the access pattern. | touisteur wrote: | But the same could be true for protobuf. Decode fields only | when you need them, and 'parse' just to find the field | boundaries and cardinality. Did stuff like that for internal | protobuf-like tool and with precomputed message profiles you | can get amazing perf. Just get the last or first bit of most | bytes (vgather if not on AMD) and you can do some magic. | beached_whale wrote: | You cannot compare them. Decoding JSON is compressing it | essentially, and in order to compare you would need to look at | the resulting data structures and how long it takes. strings | are comparable, but an integer is bigger exp2( log10( | digit_count ) ) I think. | | But yeah, I had the same first thought too. | rapidlua wrote: | I've opened an issue in LLVM bugzilla concerning jump not being | folded with address computation on x86 with a proposed fix. Would | love if it gets some attention. | https://bugs.llvm.org/show_bug.cgi?id=50042 | | Also been working on a C language extension to enable guaranteed | tail calls along with explicit control over registers used for | argument passing. Provided that callee-save registers are used | for arguments, calling fallback functions incurs no overhead. | | https://github.com/rapidlua/barebone-c | haberman wrote: | Barebone C looks really interesting! I wonder if the same gains | could be achieved in a more architecture-independent way by | simply introducing a new calling convention where parameters | are passed in the registers that are normally callee-save. | | This would let the fast path functions use all registers | without needing to preserve any of them. It would also let them | call fallback functions that use a normal calling convention | without needing to spill the parameters to the stack. | rapidlua wrote: | Thank you for the feedback! 
A new calling convention could | probably nail it for many use cases. Sometimes you want | something very special though. E.g. LuaJIT pre-decodes | instruction prior to dispatch to squeeze some work into | branch misprediction delay. There are limited callee save | registers available hence it is better to use volatile | registers for instruction arguments. It should be possible to | reuse the implementation on different architectures and only | tweak the register assignment. | neoteric wrote: | One thing I've been thinking about in C++ land is that just how | much the idiomatic usage of RAII actually prevents the compiler | from doing it's own tail call optimization. Any object | instantiated in automatic storage with a non-trivial destructor | is basically guaranteeing the compiler _can't_ emit a tailcall. | It's rather unfortunate, but perhaps worth it if the traideoff is | well understood. Example: https://godbolt.org/z/9WcYnc8YT | ufo wrote: | You can still use this trick in C++ if you ensure that the | return statements are outside the scopes that have the RAII | objects. It's awkward but it's better than nothing. | https://godbolt.org/z/cnbjaK85T void foo(int | x) { { MyObj obj; // ... | } return bar(x); // tail call } | gumby wrote: | I am unable to understand this comment. You're saying that you | can't generate a tail call by returning from the middle of a | function which needs to clean things up. RAII* is merely | syntactic sugar to write this control flow and make it | mandatory. | | Perhaps it's easier to think of tail-call semantics as simply | implementing iteration: it's another way of expressing a | for(...) loop. And if you used RAII in the body of your for | loop you would expect the same thing. | neoteric wrote: | Absolutely, RAII is an abstraction (and a useful one), but it | has a cost in that it prevents a form of useful optimization | because cleanup is required at the destruction of the stack | frame. 
You'd expect the same in C if you explicitly had to | call a cleanup function on return from a call. | | What C++ does with RAII is make this tradeoff not obvious. | std::unique_ptr is a great example to show this: colloquially | a std::unique_ptr is "just a pointer", but it isn't in this | case because its non-trivial destructor prevents TCO. | hvdijk wrote: | I can understand the comment: even if the cleanup can be done | before the call, allowing the call to be done as a tail call, | the compiler will not move up the cleanup code for you, and | having the cleanup code last is the default state. With local | variables defined inside for loops, they are always destroyed | before the next iteration. | skybrian wrote: | These tail-call functions are part of a program's inner loop. | It seems like we shouldn't expect allocation within an inner | loop to be fast? Other than local variables, that is. | | In an interpreter, it seems like either you're allocating | outside the interpreter loop (the lifetime is that of the | interpreter) or it's within a particular (slow) instruction, or | it's part of the language being interpreted and the lifetime | can't be handled by RAII. There will be fast instructions that | don't allocate and slower ones that do. | | Interpreters are a bit weird in that there are lots of | instructions that do almost nothing, so the overhead of the loop | is significant, combined with having lots of complication | inside the loop and little ability to predict what instruction | comes next. This is unlike many loops that can be unrolled. | sesuximo wrote: | I suppose the compiler could reorder function calls if it can | prove there is no change in behavior? If so, then it could | hoist dtors above the call and emit a jump. I doubt any | compilers do this. | _3u10 wrote: | Aren't protobufs supposed to be faster than JSON? | | I mean congrats, but it doesn't seem that impressive given JSON | has been parsed at that speed for years. 
| alfiedotwtf wrote: | I've never thought of tail calls in depth before, so quick | question about them - the point here I'm getting is to minimise | the pushing/popping of registers to the stack from `call` by | using a `jmp` instead. That's all good... but what stack frame | does the jumped procedure use? I'm assuming a new stack frame is | still created but just without the caller's pushed registers? | dreamcompiler wrote: | There's always some stack frame sitting there; typically the | one that called the main loop routine in the first place. At | some point a RET instruction will be executed, and control will | return to that stack frame's return point. The idea of tail | calls is to avoid creating a new stack frame when it doesn't | add any value, but stack frames still exist--they have to exist | when the last thing that happens in a routine is ordinary | instruction processing rather than a CALL. | pyler wrote: | About the jne issue: LLVM does it intentionally. | | Check line 1521 | | https://llvm.org/doxygen/BranchFolding_8cpp_source.html | ndesaulniers wrote: | Nice find! I wonder what they mean by changing branch | direction? | haberman wrote: | Wow, thanks for the pointer! That seems unfortunate. I wonder if | there is a way to evaluate whether the extra jump is actually | worth it, and whether this optimization could be allowed. | neilv wrote: | Scheme has had tail calls as a basic idiom (some call it Proper | Implementation of Tail Calls; others call it Tail Call | Optimization) forever, and I keep them in mind any time I'm | implementing anything nontrivial in Scheme. | | There's no special syntax -- there's just the notion of _tail | positions_, from which tail calls can occur. For example, both | arms of an `if` form are tail position, if the `if` itself is. | And if you introduce a sequencing block in one of those arms, the | last position in the block is tail position. 
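dreamcompiler's answer in miniature (a hypothetical example, not code from the article): the self-call below is the last thing the function does, so it can be compiled as a `jmp` that reuses the current frame, and the one real `RET` happens at the base case.

```c
/* Sum 1..n with an accumulator. With the tail call optimized, stack
 * usage is O(1); without it (e.g. at -O0 with no guarantee), each
 * call pushes a fresh frame and stack usage is O(n). */
static long sum_to(long n, long acc) {
    if (n == 0) return acc;         /* base case: the actual RET */
    return sum_to(n - 1, acc + n);  /* tail position: no work afterward */
}
```

This is the shape haberman refers to when he says that losing the optimization turns O(1) stack usage into O(n) and blows the stack.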
(A specialized IDE | like DrRacket can indicate tail positions, and DrRacket will even | hover arrows, tracing the tail positions back to the top of the | code.) | | When implementing a state machine, for example, a state | transition is simply a function application (call) to the new | state (optionally with arguments), where the called function | represents a state. It's satisfying when you realize how elegant | something that used to be messy is, and you can imagine it being | very fast (even if your Scheme implementation isn't quite up to | getting as fast as a C or Rust compiler that can recognize and | special-case tail calls.) | | (For the state machine example, compare to more conventional | approaches in C, of "state ID" variables, `switch` statements on | those, and record-keeping for additional state. Or doing it in | data, with the state ID being an index into arrays, again with any | additional recordkeeping. Or lots of function calls when you | didn't really need the time overhead and stack usage of function | calls with full returns.) | CodeArtisan wrote: | Any functional programming language shall have TCO since | statements are forbidden (everything is a function). | [deleted] | bla3 wrote: | It does make code harder to debug when something goes wrong. | Since tail calls create no stack frames, a call sequence f->g->h | where g tail calls h will show up as a stack of (f, h) in the | debugger. The fix is easy: just make MUSTTAIL expand to nothing | in debug builds. But it's something to keep in mind, and it means | your debug mode code will have different memory use | characteristics. | lupire wrote: | Programmers have traditionally relied on the stack as an | (expensive yet incomplete) execution logging system, but the | easy fix for that is to use an actual log structure instead | of abusing the stack. | jhgb wrote: | Seems like not much of a fix if your code depends on it. 
In | Scheme such behavior would break standard compliance (and lots | of programs). And since presumably you'd use this attribute | _precisely_ when your code depended on it, disabling it may not | be feasible. | | Fortunately the patterns generated by such code are no more | difficult to debug than simple loops in other languages, so the | lack of "history" of your computation seems no more concerning | with tail calls than it is with loops (or at least that's how I | always looked at it). And people don't seem to be avoiding | loops in programming merely for the reason of loops not keeping | track of their history for debugging purposes. | cryptonector wrote: | A while loop with a switch also creates no stack frames for | each operation interpreted. | ndesaulniers wrote: | If using DWARF for unwinding there is a tag IIRC that | represents that inlining has occurred. | haberman wrote: | That is true, but there is always https://rr-project.org for | easy reverse debugging if you're having trouble figuring out | where you came from. | | If the alternative is to drop to assembly, a C-based approach | seems quite easy to debug. You can just add printf() | statements! Previously when I had been using assembly language | or a JIT, I had to resort to techniques like this: | https://blog.reverberate.org/2013/06/printf-debugging-in-ass... | jhgb wrote: | > but there is always https://rr-project.org for easy reverse | debugging if you're having trouble figuring out where you | came from | | How did I not know about this? Thanks! | jpgvm wrote: | rr is really a hidden treasure. C/C++ wasn't the same for | me after discovering it. | | It's sort of like the first time you learn about | interactive debugging. Everything you did before that | suddenly feels like "caveman coding" and you never want to | go back. | elcritch wrote: | Perhaps a "ring logger" approach could be useful. 
Append the | more useful bits of what would normally go into the stack | frame but without the allocation overhead. Seems like a few | memcpy's and modulo operations wouldn't hurt the code gen too | much. | kevincox wrote: | Of course RR is too expensive to be deployed in production | builds. So if you are getting core dumps from "production" | you won't have this information. So while RR helps it doesn't | completely mitigate the tradeoff. | [deleted] | touisteur wrote: | You may be interested in Intel PT snapshots in Linux core | dumps, that gdb knows how to interpret and will give you a | detailed branch trace. Less cool than rr but still very | interesting! | ufo wrote: | Only if the alternative is non-tail function calls. Often the | alternative to tail calls is a while loop, which also doesn't | leave a stack trace. | leoc wrote: | As so often, the moral is that pushing your problems in-band | doesn't make them go away. | OskarS wrote: | > The fix is easy, just make MUSTTAIL expand to nothing in | debug builds. | | That wouldn't work. The final call is a recursive call to | dispatch, if that is not a tail call it will blow the stack | instantly. | pjmlp wrote: | Lisp and Scheme graphical debuggers do just fine tracking them. | [deleted] | CyberRabbi wrote: | Interesting but it seems like the essential function of the tail | call in this example is to create a strong hint to the compiler | which values should stay in registers. Isn't there a better way | to do that than relying on a seemingly unrelated optimization? | AndyKelley wrote: | Here's a related Zig proposal: | https://github.com/ziglang/zig/issues/8220 | | Relevant section pasted here: | | > Other Possible Solution: Tail Calls | | > Tail calls solve this problem. Each switch prong would return | foo() (tail call) and foo() at the end of its business would | inline call a function which would do the switch and then tail | call the next prong. 
| | > This is reasonable in the sense that it is doable right now; | however there are some problems:
| > * As far as I understand, tail calls don't work on some
| architectures.
| > * (what are these? does anybody know?)
| > * I'm also concerned about trying to debug when doing
| dispatch with tail calls.
| > * It forces you to organize your logic into functions.
| That's another jump that maybe you did not want in your hot
| path.
| | See also https://dino-lang.github.io/learn/dino.pdf, section 3.1 | "Byte Code Dispatch". | haberman wrote: | It sounds like the main question in the proposal is: how can we | reliably get the compiler to generate threaded/replicated | dispatch? | | While that is important, it is only one small part of what we | were trying to achieve with tail calls. Ultimately we found | that our tail call design significantly improved the register | allocation and overall code generation compared with computed | goto, see: | https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html | seg_lol wrote: | I really want a timeline where we have tail calls everywhere | and this drama can go away. The non-tail call folks feel like | folks arguing for not using spaces in text. | | https://en.wikipedia.org/wiki/Scriptio_continua | slaymaker1907 wrote: | I do think tail call optimization can make debugging/stack | traces confusing if you do general TCO. In the plain | single-function case, TCO probably makes it easier to debug, but | it can get confusing if function A calls B and then B calls | C in tail position so the stack trace is just C|A. It's | probably not too bad if it is just one function getting | elided, but this gets very difficult if multiple important | calls are being removed from the stack. | | This could also be a case of insufficient tooling for | languages with TCO. Better support for reverse debugging | would make detailed stacks less necessary. 
Most of the | stack trace isn't that useful anyway when printing out | exceptions compared to good application logging. | spockz wrote: | Don't we have debug symbols or something like them to be | able to translate the optimised version into the thing | you saw in code? If you need to dive into the assembly | you are in for a hell of a ride anyway due to all kinds | of optimisations happening. | seg_lol wrote: | And tail recursive functions encode their state in the | args and the body. There isn't any information in the | stack anyway, by definition. | | If one does need to debug mutually tail recursive | functions, step through them; this isn't a problem. The | _point_ is that they don't consume stack space. If you | _do_ want a stack of invocations, then don't write tail | recursive code? | | This style of argumentation that the GP showed should | have a name. | | BTW my original comment is at -1, hate but only in the | tail position. | haberman wrote: | I have to admit I had little interest in tail calls myself | until I saw what they could achieve performance-wise. Now | that I have seen that, I am sold. :) | seg_lol wrote: | Performance is only one small aspect; the things you can | express in tail recursive functions are beautiful and | compact. It feels like being in an inductive proof | playground. | SavantIdiot wrote: | Just here to say that the use of those #defines for huge | parameter lists makes me sad. I realize that's a common pattern, | but if your call list is that big, how about a struct pointer? | haberman wrote: | That would defeat the entire purpose: they need to be six | parameters to ensure that they are all in registers. | | However, I was considering whether they could be a struct with | six members passed by value. If the ABI would pass that in six | registers we could get rid of the #defines, which I agree would | be nicer. | ufo wrote: | Unfortunately, I think that only structs with at most 2 words | are passed via registers. 
Larger structs are passed via the | stack. | | https://godbolt.org/z/frof3xjhW | haberman wrote: | Bummer, thanks for the info. | londons_explore wrote: | This article suggests a big reason to use this approach is to | separate hot and cold code. | | I assume that's for good use of the CPU's instruction and | microcode caches. | | Yet in a protobuf parser, I'm surprised there is enough code to | fill said caches, even if you put the hot and cold code together. | Protobuf just isn't that complicated! | | Am I wrong? | tooltower wrote: | In particular, the author is talking about CPU registers being | spilled to memory, and the need for setting up or tearing down | stack frames. Those things can only be eliminated by the | compiler for extremely simple functions. The error-handling | often isn't. | haberman wrote: | > I assume that's for good use of the CPU's instruction and | microcode caches. | | I don't think that is the reason. These are microbenchmark | results, where realistically all the code will be hot in caches | anyway. | | The problem is that a compiler optimizes an entire function as | a whole. If you have slow paths in the same function as fast | paths, it can cause the fast paths to get worse code, even if | the slow paths are never executed! | | You might hope that using __builtin_expect(), aka | LIKELY()/UNLIKELY() macros on the if statements would help. | They do help somewhat, but not as much as just putting the slow | paths in separate functions entirely. | touisteur wrote: | Similar if using the 'hot' attribute: separate functions or | separate labels... https://stackoverflow.com/a/15034687 | zozbot234 wrote: | Is this really higher performance than the existing protobuf | support in Rust+serde? That uses macro programming to generate | code at compile time based on a high-level description of your | data format, so it can be quite fast and will certainly be a lot | safer than cowboy-coding raw C. 
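The hot/cold split haberman describes a few comments up can be sketched like this (all names hypothetical; the real parser splits per field-type handler): the rare path lives in its own `noinline`, `cold` function so it cannot degrade the hot function's register allocation.

```c
#include <stddef.h>
#include <stdint.h>

#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Cold fallback in a separate function: its spills and frame setup
 * stay out of the hot path entirely. */
__attribute__((noinline, cold))
static int parse_slow(const uint8_t *p, size_t len) {
    int sum = 0;  /* stand-in for the rare, complicated case */
    for (size_t i = 0; i < len; i++) sum += p[i];
    return sum;
}

static int parse_fast(const uint8_t *p, size_t len) {
    if (UNLIKELY(len > 16)) return parse_slow(p, len);  /* punt */
    return len ? p[0] : 0;  /* trivial hot path */
}
```

As haberman notes, `__builtin_expect` alone helps somewhat, but physically moving the slow path into another function is what keeps its cost out of the fast path's code generation.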
| linkdd wrote: | Every single time there is a link about C/C++, I do a quick | "Ctrl+F Rust", and every single time there is some rant about | "but Rust is safer than C". | | Every. Single. Time. | | This article is about a new Clang extension: | | > An exciting feature just landed in the main branch of the | Clang compiler. Using the [[clang::musttail]] or | __attribute__((musttail)) statement attributes, you can now get | guaranteed tail calls in C, C++, and Objective-C. | | (literally the first line of the article) | | What does Rust have to do with this? Nothing. | | Does the author suggest that you should only use C for high | performance? No. | littlestymaar wrote: | > Every single time there is a link about C/C++, I do a quick | "Ctrl+F Rust", and every single time there is some rant about | "but Rust is safer than C". | | If you had the same reflex with other languages when reading | other HN comment threads you'd realize that every time there's | a thread about _language A_ there's a ton of comments about | _language B or C_. Rust threads are full of people talking | about C++ or C, so why is it supposed to be more OK? Go threads | are full of comments about Java, Python threads are full of Go | or Julia (depending on what kind of Python this is about). | This isn't specific to Rust in any way. | | Yes, the GP isn't the most useful comment ever, Rust is hyped | and there are overly enthusiastic people, but there are also | people who seem to be overly defensive about it (doing a text | search to count Rust occurrences, every time, really?) and | I'm not sure the latter is less childish than the former. 
| | It is childish to look for specific information (in this | specific case: how C and Rust are compared by others) in | threads that grow to hundreds of messages? | | FYI, I've also looked for C++ and Go comments in this | thread, and none of them were a condescending comment like | "C is bad, you should use Rust, C is evil and should not be | sent to production". | zozbot234 wrote: | > This article is about a new Clang extension: | | > An exciting feature just landed in the main branch of the | Clang compiler. Using the [[clang::musttail]] or | __attribute__((musttail)) statement attributes, you can now | get guaranteed tail calls in C, C++, and Objective-C. | | > What does Rust have to do with this? Nothing. | | Rust has a planned keyword for this exact feature, namely | 'become'. So the author would be able to use it in Rust just | as well, as soon as support for it lands in an upstream LLVM | release. | | Regardless, writing raw C as the article suggests in order to | parse a quasi-standard high-level format is cowboy coding. | It's a nice hack to be sure, but it's not the kind of code | that should be anywhere close to a real production workload. | Instead, this feature should be implemented as part of some | safe parser generator. Not necessarily written in Rust but | something that's at least as safe. | Cloudef wrote: | I think you might need a reality check and see how many | systems run C code in production. | linkdd wrote: | No cowboy would treat their servers like cattle. | | Joke aside, the Linux kernel is mainly developed in C and | runs in production everywhere. | linkdd wrote: | > Rust has a planned keyword for this exact feature, namely | 'become'. So the author would be able to use it in Rust | just as well, as soon as support for it lands in an | upstream LLVM release. | | The author is demonstrating a new Clang feature. Then, | again, what is the point of mentioning Rust? 
|
| > writing raw C as the article suggests in order to parse a
| quasi-standard high-level format is cowboy coding.
|
| So what? Cowboy coding lets you appreciate even more
| languages where you don't need to do it, AND gives you a
| better understanding of how they might work.
|
| > it's not the kind of code that should be anywhere close
| to a real production workload.
|
| Again, not once does the article suggest that.
| ma2rten wrote:
| Protobufs are very important for Google. A significant percentage
| of all compute cycles is used on parsing protobufs. I am
| surprised that the parsing is not done using handwritten
| assembly if it's possible to improve performance so much.
| mkoubaa wrote:
| Given that fact I'm wondering if Google ever researched custom
| chips or instruction sets for marshalling pbs, like the TPUs
| they worked on for ML.
| lupire wrote:
| Problem is, once you parse the protobuf, you have to
| immediately do other computations on it in the same process.
| No one needs to parse protobufs all day long the way you run
| an ML model or do hashes for crypto.
| jeffbee wrote:
| That doesn't seem to preclude hardware assistance. For
| example, they have also explored hardware acceleration for
| the tcmalloc fast path allocation by adding circuits to
| general purpose CPUs. Arguably, Intel BMI descends from a
| request that Google made years ago to speed up protobuf
| decoding (and other workloads) on x86.
| xxpor wrote:
| Handwritten ASM for perf is almost never worth it in modern
| times. C compiled with GCC/Clang will almost always be just as
| fast or faster. You might use some inline ASM to use a specific
| instruction if the compiler doesn't support generating it yet
| (like AVX512 or AES), but even for that there's probably an
| intrinsic available. You can still inspect the output to make
| sure it's not doing anything stupid.
|
| Plus it's C, so it's infinitely more maintainable and way more
| portable.
| pjmlp wrote:
| Yet Microsoft was able to make a .NET implementation faster
| than Google's current C++ one.
|
| Proof that they don't care enough about protobuf parsing
| performance.
|
| https://devblogs.microsoft.com/aspnet/grpc-performance-impro...
| haberman wrote:
| From your link:
|
| > Support for Protobuf buffer serialization was a multi-year
| effort between Microsoft and Google engineers. Changes were
| spread across multiple repositories.
|
| The implementation being discussed there is the main C#
| implementation from
| https://github.com/protocolbuffers/protobuf/tree/master/csha...
| pjmlp wrote:
| I know, the point was that they don't care to improve the
| C++ version.
| boulos wrote:
| haberman is part of the "they" here (he's on the core
| protobuf team, does things like this).
| haberman wrote:
| The C++ parser has been completely rewritten in the last
| few years to push the current code generation design to
| its limits. The biggest obstacle in the way of optimizing
| it further is legacy APIs, especially std::string for
| string fields, which forces tons of small heap
| allocations and frees and defeats the benefits of arena
| allocation.
|
| Changing very widely used APIs is not easy; these things
| take time.
| pjmlp wrote:
| Interesting, thanks for the clarification. However, why do
| libraries depend on what should be implementation details
| of generated code to start with?
|
| Something that by good CI/CD practices shouldn't even be
| part of checked-in code.
| haberman wrote:
| It's exposed in the API. When you access a string field
| in protobuf, say foo.my_str_field(), the return type is
| const std::string&. We have to have a real std::string
| internally to return a reference to it.
| pjmlp wrote:
| Thanks for the explanation.
| barbazoo wrote:
| In what kind of scenarios do they use Protobufs? I can think of
| messaging systems, streaming, RPC, that sort of thing?
| ch33zer wrote: | Everything at google is built on RPCs between systems using | protobufs | ma2rten wrote: | Protobuf is also used as a format to write data to disk. | alephnan wrote: | Why is this being down voted? It's accurate. | CoolGuySteve wrote: | Protobuf's abysmal performance, questionable integration into | the C++ type system, append-only expandability, and annoying | naming conventions and default values are why I usually try and | steer away from it. | | As a lingua franca between interpreted languages it's about par | for the course but you'd think the fast language should be the | fast path (ie: zero parsing/marshalling overhead in Rust/C/C++, | no allocations) as you're usually not writing in these | languages for fun but because you need the thing to be fast. | | It's also the kind of choice that comes back to bite you years | into a project if you started with something like Python and | then need to rewrite a component in a systems language to make | it faster. Now you not only have to rewrite your component but | change the serialization format too. | | Unfortunately Protobuf gets a ton of mindshare because nobody | ever got fired for using a Google library. IMO it's just not | that good and you're inheriting a good chunk of Google's | technical debt when adopting it. | haberman wrote: | Zero parse wire formats definitely have benefits, but they | also have downsides such as significantly larger payloads, | more constrained APIs, and typically more constraints on how | the schema can evolve. They also have a wire size | proportional to the size of the schema (declared fields) | rather than proportional to the size of the data (present | fields), which makes them unsuitable for some of the cases | where protobuf is used. | | With the techniques described in this article, protobuf | parsing speed is reasonably competitive, though if your | yardstick is zero-parse, it will never match up. 
| CoolGuySteve wrote: | Situations where wire/disk bandwidth are constrained are | usually better served by compressing the entire stream | rather than trying to integrate some run encoding into the | message format itself. | | You only need to pay for decompression once to load the | message into ram rather than being forced to either make a | copy or pay for decoding all throughout the program | whenever fields are accessed. And if the link is bandwidth | constrained then the added latency of decompression is | probably negligible. | | The separation of concerns between compression format and | encoding also allows specifically tuned compression | algorithms to be used, for example like when switching | zstd's many compression levels. Separating the compression | from encoding also lets you compress/decompress on another | processor core for higher throughput. | | Meanwhile you can also do a one shot decompression or skip | compression of a stream for replay/analysis; when talking | over a low latency high bandwidth link/IPC; or when | serializing to/from an already compressed filesystem like | btrfs+zstd/lzo. | | It's just more flexible this way with negligible drawbacks. | gorset wrote: | It's obviously possible to do protobuf with zero | parsing/marshalling if you stick to fixed length messages and | 4/8 byte fields. Not saying that's a good idea, since there | are simpler binary encodings out there when you need that | type of performance. | fnord123 wrote: | FWIW, the python protobuf library defaults to using the C++ | implementation with bindings. So even if this is a blog post | about implementing protobuf in C, it can also help | implementations in other languages. | | But yes, once you want real high performance, protobuf will | disappoint you when you benchmark and find it responsible for | all the CPU use. What are the options to reduce parsing | overhead? flatbuffers? xdr? 
| TheGuyWhoCodes wrote:
| Flatbuffers, Cap'n Proto and Apache Arrow come to mind.
| PostThisTooFast wrote:
| So important that they haven't bothered to create a protobuf
| generator for Kotlin, the primary development language for
| their own mobile operating system.
| ufo wrote:
| A very interesting trick! In my opinion, the big takeaway here is
| that if you are willing to write your C code in tail call style,
| you can get a lot of control over which variables are stored in
| machine registers. In the x86-64 ABI, the first 6 function
| parameters are guaranteed to be passed via registers.
|
| Obviously, this is architecture-specific. Other architectures may
| use a different calling convention that ruins this kind of manual
| register allocation. I'd imagine that it would be bad for
| performance on 32-bit systems, where function arguments are
| passed via the stack.
|
| --------
|
| > I think it's likely that all of the major language interpreters
| written in C (Python, Ruby, PHP, Lua, etc.) could get significant
| performance benefits by adopting this technique.
|
| I know that at least Lua and Python use "computed gotos" in their
| inner loops, which also helps the compiler generate better code.
| The architecture-dependent nature of the tail-call trick could be
| a problem here. Some of these interpreters also need to work well
| on 32-bit systems.
| haberman wrote:
| Yep, that's exactly the right takeaway. :) I have verified this
| on ARM64 also, but architectures that pass parameters on the
| stack will make this technique not really worth it.
|
| Re: PHP, Ruby, etc.: yes, computed gotos help some, but I think
| tail calls would help much more. That was our experience at
| least. I wrote more about this here:
| https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
|
| Yes, there are portability issues with the tail call approach,
| so there would need to be a fallback on non-x64/ARM64 platforms.
| This would add complexity.
But it's exciting to think that it
| could unlock significant performance improvements.
| spockz wrote:
| If you like tail calls, look into CPS. Many forms of (pure)
| code can be rewritten in such a way.
|
| Everyone who writes Haskell quickly learns to write their
| recursive functions so that they are (mutually) tail
| recursive.
| ufo wrote:
| On the matter of portability, I wonder if it would be
| possible to use some macro magic and/or a code generator to
| convert the tail-call version back into a more traditional
| while/switch loop, for the stack-based architectures.
|
| While the tail call version is more architecture dependent,
| it's nevertheless more portable than assembly language. It's
| still C.
| haberman wrote:
| I haven't implemented the fallback yet, but my thought was
| to keep most of the structure the same (dispatch function
| tail calls to the function for an individual operation) but
| then have the individual operation just return instead of
| tail calling back to dispatch.
|
| I think this would be portable while still avoiding some of
| the performance problems associated with a traditional
| switch (like the bounds check).
| elcritch wrote:
| I'm curious how this approach would fare for
| microcontrollers, well, Cortex-M0 through M4 actually.
| They're 32-bit ARM and may not have enough usable argument
| registers. Still, parsing protobufs or interpreting WASM
| faster would translate directly into better battery
| performance, etc.
| tomp wrote:
| Modern compilers can compile tail calls to unknown functions
| (i.e. function pointers) into jumps; so instead of using
| "computed gotos" (which IIRC is a GCC extension), one can use
| ANSI C _and_ get more efficient code (because of the control
| over registers).
| mappu wrote:
| Regarding PHP specifically, the interpreter's opcodes are all
| described in a meta language, and the actual PHP interpreter
| can be code-generated in 4 different styles depending on what
| is fastest for the target C compiler.
|
| See ZEND_VM_KIND_CALL / SWITCH / GOTO / HYBRID in
| php-src/Zend/zend_vm_gen.php
| kevincox wrote:
| > the first 6 function parameters are guaranteed to be passed
| via registers.
|
| This assumes that your function is being called via the regular
| calling convention. By the as-if rule there is nothing
| guaranteeing that an internal call is using the regular calling
| convention or that it hasn't been inlined.
| haberman wrote:
| We use the "noinline" attribute to get control over inlining,
| which gives a reasonable assurance that we'll get a normal
| standard ABI call with six parameters in registers on
| x64/ARM64.
| anarazel wrote:
| Are there cases where a compiler reasonably would internally
| use a different calling convention that supports fewer
| arguments passed in registers?
|
| Yours is still a valid point (no reason doesn't imply a
| guarantee) even if there's none, to be clear. Just curious
| because I can't really see any reason for a compiler to do
| so.
|
| I can imagine some theoretical cases where compiler
| optimizations lead to additional arguments being passed to a
| function [version].
| kevincox wrote:
| I agree it is unlikely. I can imagine something like the
| compiler realizing that some computation is pure, so it
| caches the value across invocations of the function,
| essentially adding additional arguments that push the actual
| arguments onto the stack. Of course this is an unlikely
| edge case, but a reasonable set of possible optimizations
| that would work together in a surprising way.
| iamevn wrote:
| Wouldn't you still be able to get a lesser improvement from the
| tail call being able to overwrite the stack frame and jumping
| instead of calling?
___________________________________________________________________
(page generated 2021-04-25 23:00 UTC)