[HN Gopher] The Protobuf Language Specification
___________________________________________________________________
The Protobuf Language Specification

Author : akshayshah
Score  : 125 points
Date   : 2022-09-12 16:40 UTC (6 hours ago)

(HTM) web link (buf.build)
(TXT) w3m dump (buf.build)

| rigelbm wrote:
| Echoing some of the sentiment here: although this was certainly a
| great effort and the result is awesome, that is NOT The Protobuf
| Language Specification, for as long as the maintainers of the
| Protobuf (protoc) project don't agree to follow it.
|
| This is certainly The Buf Language Specification, which is useful
| in itself. Specs are like contracts. If I were to build a tool to
| be compatible with Buf, I would definitely aim it to work with
| this spec.
|
| The problem is that the Protobuf project simply didn't sign this
| contract. Whatever it says is, sorry for the choice of word, a
| bit pointless if I'm trying to build a tool compatible with
| Protobuf, especially around forward compatibility.
|
| The industry does need better tooling around protobuf/efficient
| RPC, and being dependent on a single company (i.e. Google) is
| definitely not healthy. I hope you guys succeed in what you are
| trying to do.
| mook wrote:
| That's basically the equivalent of RubySpec -- reverse
| engineered from MRI (the original Ruby implementation) for use
| by Rubinius. It was adopted by the other alternative Ruby
| implementations too.
|
| It looks like the original is now gone, but a fork has taken
| over. Looking at some comments, fighting to get MRI to adopt it
| may have burnt out the people behind it.
| rigelbm wrote:
| Actually, I thought about it twice and I retract what I said
| about this specification being pointless for building tools
| compatible with protobuf. Reasons:
|
| * The language itself is unlikely to change much given it's
| been public for so long.
| A non-official spec that captures the
| current implementation is probably going to survive for some
| time.
|
| * There's no official spec (which I would prefer) for me to
| base my tool on. This spec is about my only choice. The more
| tools targeting this spec, the harder it would be for Google to
| break compatibility with it, reinforcing my previous point.
|
| I will keep the parent comment for context, and I don't retract
| the fact that I think the title is misleading. Otherwise, great
| work!!
| staticassertion wrote:
| Agreed. If IDEs and alternative compilers are all building
| off of the spec because it's the path of least resistance,
| and there are no bugs _for a while_, the de facto standard
| impl is going to face serious scrutiny for parting from it.
|
| And, as you said, proto isn't in a great position to be
| making crazy changes anyway.
| jhumphries131 wrote:
| Our aim is to make the spec accurately match Google's reference
| compiler -- for as long as that is the source of truth, which
| is hopefully not forever :)
|
| Even for those not using Buf, we expect this documentation to
| be of interest to the community as it describes a large number
| of facets of protoc that were previously undocumented (and
| required examining the source for protoc or playing around with
| test source code to see what it expects and what descriptors it
| generates).
|
| If issues are found with this spec, it is true that we'll most
| likely have to revise the spec to match the compiler, not the
| other way around. But no software is perfect: some variations
| will be due to bugs in protoc, which can be fixed in the
| compiler to properly match the spec. Over time, we'd love to
| see an outcome where a formal specification is the source of
| truth.
|
| For now, our commitment is to make (and keep) this spec as
| accurate as possible to describe the Protobuf language, not
| some Buf dialect.
| marsven_422 wrote:
| cpurdy wrote:
| oh .. cool ..
| pricing for protobuf
| habitue wrote:
| > we are standing on the shoulders of giants, those who have
| built and battle-tested it, and brought it to its current mature
| state
|
| I would rewrite this maybe to:
|
| > we are making Google's internal problems into everyone's
| problems
|
| There are benefits to an IDL in the abstract, but an IDL for
| everyone should be built with the benefit of hindsight looking at
| the lessons of protobuf, ion, thrift, etc. Not just baking
| Google's internal backwards compatibility obligations into a
| formal spec everyone should follow.
|
| I think any time google takes an internal tool and flips the
| "open source" bit on it, it turns out to be a bad match for the
| rest of the world. When they instead take the time to build a new
| system that learns from the internal tool, like Kubernetes
| learned from Borg, I think the end result is significantly more
| valuable.
| orf wrote:
| I quite like Protobuf definitions. I find them very easy to
| read and I love the fact I can distribute them to a bunch of
| different languages via a library.
|
| Are these Google's internal problems? Or, what google-internal
| problems do protobufs solve that nobody else needs to care
| about?
|
| Edit: to your edit, I find it hard to see a different way to do
| things.
| advisedwang wrote:
| Is there a license on this spec?
| akshayshah wrote:
| Apache 2.0:
| https://github.com/bufbuild/protobuf.com/blob/main/LICENSE
| mmastrac wrote:
| (removing my unfair characterization)
| Master_Odin wrote:
| A large corporate sponsor that has done a terrible job of
| shepherding the protocol, maintaining docs, etc. I'm all for a
| community push to divorce the protobuf
| specification/implementation from Google and to have it be much
| more community maintained, as it's clear that Google doesn't
| seem to care to.
| jhumphries131 wrote:
| Google actually requested that the community contribute to a
| real specification.
| https://github.com/protocolbuffers/protobuf/issues/6188#issu...
|
| So we've taken the initiative. No rent-seeking.
| peteradio wrote:
| Never having had the opportunity to put protobuf into action but
| having some interest I've had these questions:
|
| 1) What would you say is the best use case
|
| 2) and what unfortunate misuse cases have you come across
| jhumphries131 wrote:
| The key "killer" use case is for describing RPC schemas. By
| describing the schema in an IDL, you can then generate client
| stubs and server interfaces in a variety of implementation
| languages, allowing interop between heterogeneous clients and
| servers.
|
| They are also useful for describing domain models. This isn't
| surprising since domain models usually find their way into RPC
| schemas (since RPCs will often query or define model data). But
| they can also be used in other cases, such as for persistence
| and structured logging.
|
| Some misuse I have seen involves trying to make a protobuf
| model the _only_ representation of a domain model: it is almost
| inevitable that a physical model (a representation of your data
| in a SQL database, for example) will need to vary from a
| logical model, and trying to make a single protobuf
| representation serve double duty can be a source of problems.
| Making protobuf schemas conform to physical storage constraints
| often makes for worse abstraction. The model becomes
| de-normalized (which can make constraints and relationships harder
| to model/enforce), and can even leak details that consumers
| shouldn't know about or care about.
|
| Another misuse is using it for data that never leaves a process
| -- a program's private, internal state. If a data structure
| never needs to be serialized (to persist or send to another
| process), then you're better off using native data structures
| in the implementation language (which generally have far
| greater flexibility/expressibility as well as better
| performance).
| shaftway wrote:
| I do most of my work with protos in Java, and it's nice to have
| a schema like this that will build a bunch of immutable POJOs
| with builder classes and enough infrastructure to be able to do
| some interesting reflection-style stuff on top of the
| serialization / deserialization.
|
| The wide variety of client languages is really nice. I'm fairly
| certain that I can parse a proto in any language I'm ever going
| to use.
|
| The binary wire format is fairly straightforward, and is pretty
| tight without using compression. Fields are byte-aligned, and
| if you wanted to generate a binary proto message by
| concatenating a few things together it isn't very hard. And
| then you can use your proto definitions in whatever language
| you want to parse it. You can even parse your proto definition
| into a proto (Google provides the proto proto definitions, I
| think I got that grammar right) and write tools that generate
| whatever code you want easily.
|
| I think the text format is under-utilized. It's my go-to for
| configuration files. You create the proto definitions with
| whatever structures you want to structure your config settings,
| and then use the official parser to parse a text file. It
| supports comments (I'm throwing shade at you, JSON), and is
| simpler than YAML, while adding that structure. You can also
| use command line tools to validate the file as a pre-commit, or
| translate it into a binary format if you don't want to rely on
| the text format.
| IshKebab wrote:
| A good reference. I don't think it was really needed in the same
| way that e.g. a JSON or C++ spec was, since the language is so
| simple there's not much room for ambiguity. Definitely nice to
| have anyway.
| jhumphries131 wrote:
| It is a simple language, being just an IDL (no expressions, no
| logic, no state, no memory model, etc).
|
| However, you might be surprised about the room for ambiguity.
| For example, there are several mistakes in the grammars on the
| official developer site. And there is no place on the developer
| site that, for example, clearly explains how option names are
| formulated and interpreted. Even the way that relative
| references are resolved is not thoroughly described; it is
| probably the most complicated part of the spec because
| coherence/consistency wasn't keenly considered when the
| reference implementation in protoc was devised
| (https://www.protobuf.com/docs/language-spec#relative-referen...).
|
| So if you wanted to write a tool for the language, one that
| could correctly parse and understand a source file, without the
| details in this spec that tool will almost certainly be
| incorrect and disagree with how protoc parses and understands
| the same source.
| IshKebab wrote:
| It does make me hope that some protobuf libraries will
| integrate their own compilers. The official one can be a bit
| of a pain to install. This will definitely make that easier!
| jupp0r wrote:
| > Protobuf is the most stable and widely adopted IDL today
|
| I've run into this misconception so many times over the last
| decade. Protobuf is much less than an IDL (intentionally so).
| It's used to describe the data of an interface but is completely
| unopinionated about all other aspects of an IDL. GRPC is a great
| example of how to use ProtoBuf in an IDL, but it could be used
| for other categories of interfaces (object oriented, etc). People
| treating ProtoBuf as an IDL make the mistake of concentrating too
| much on the data format and not about (imho) more important
| aspects like pre and postconditions etc, that make an interface
| an interface.
| CobrastanJorji wrote:
| This is interesting. So one company invents and maintains a
| compiler, then a different company found that documentation to be
| insufficient (big surprise), so wrote out a lengthy standard that
| conformed to what the first company's compiler happened to do?
| Seems very useful, but also seems risky and hard to maintain.
| What happens when Google tightens a constraint or adds a new
| feature next week?
| Master_Odin wrote:
| I think the hope is that this could be a situation like
| CommonMark and Markdown, where google's implementation will
| continue to exist, but that the community just totally moves
| over to this new specification / tooling, and that anytime
| someone says "protobuf", they don't even necessarily realize
| that Google has a thing, they just know this specification.
| numbsafari wrote:
| ... except for whenever you've got to communicate or exist
| inside the Google ecosystem. Now you've got "protobuf as per
| Google" and "protobuf as per internet randos" with twice the
| dependency graph.
|
| It's unfortunate, and not surprising, that Google hasn't made
| a protobuf spec. I realize there are a ton of protobuf fans
| out there but, personally, it just feels like a massive,
| awkwardly maintained mess that I'm forced to live with for
| (mostly) their benefit.
| smcl wrote:
| Trouble is that I don't think everyone's moved on with
| Markdown, I still encounter different dialects that are
| subtly different. Even within one suite of tools from a
| single vendor ... [angry stares in the direction of
| Atlassian]
| eklitzke wrote:
| Generally new features in protobuf are just new features, so if
| a new feature is added then the worst case is the documentation
| will be out of date/incomplete for a short period of time.
|
| For the most part the "constraint tightening" thing doesn't
| affect the language specification, at least in my experience.
| For example, there have been some changes in protobuf that
| affect things like serialization order. A change like that can
| break brittle tests that do things like checking that a
| function produces some exact serialized string/byte sequence,
| but they don't affect the semantics of the language.
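The point about serialization order can be made concrete with a minimal sketch. It assumes a message containing only varint fields and uses hand-written helpers (`encode_fields` and `decode_fields` are hypothetical names for illustration, not part of any Protobuf runtime). The same two fields written in a different order produce different bytes but decode to the same values, which is why a test that compares exact serialized bytes is brittle.

```python
# Hand-rolled sketch of the varint part of the Protobuf wire format.
# Assumption: only wire type 0 (varint) fields with non-negative values.
# encode_fields / decode_fields are illustrative names, not a real library API.

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_fields(fields: list[tuple[int, int]]) -> bytes:
    """Concatenate (field_number, value) pairs in the order given."""
    out = bytearray()
    for number, value in fields:
        out += encode_varint((number << 3) | 0)  # tag: field number + wire type 0
        out += encode_varint(value)
    return bytes(out)


def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(data: bytes) -> dict[int, int]:
    """Decode varint fields into {field_number: value}, in any order."""
    fields: dict[int, int] = {}
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        fields[number], pos = decode_varint(data, pos)
    return fields


in_order = encode_fields([(1, 150), (2, 7)])
reordered = encode_fields([(2, 7), (1, 150)])
print(in_order.hex(), reordered.hex())
print(decode_fields(in_order) == decode_fields(reordered))
```

A test written against `decode_fields(...)` survives a change in field order; a test written against `in_order.hex()` does not.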
| jhumphries131 wrote:
| We keep an eye on the protobuf repo, so if a change is made we
| can both incorporate it into our products (https://buf.build)
| and into this spec.
|
| A great outcome for the ecosystem would be that Google chooses
| to engage with the community before making language changes.
| And the best possible outcome would be that the authority is
| eventually inverted: a specification doc becomes the definitive
| source of truth on the language and `protoc` is updated to
| conform to it, instead of vice versa.
| mmastrac wrote:
| (deleted)
| CobrastanJorji wrote:
| Ahhh, is that the play? That makes a lot of sense.
| jhumphries131 wrote:
| Matt, that is not our intention. But we are trying to build a
| business around making schema-driven APIs easy, and protobuf
| is at the core of our current products. So we are trying to
| improve the ecosystem around protobuf, and a critical aspect
| of that in our esteem is having a spec.
|
| While `protoc` remains the source of truth, this spec
| captures the syntax and rules accepted and enforced by
| `protoc` in a far more detailed way than the official
| developer site.
| mmastrac wrote:
| Ah, hey Josh. I didn't realize you were involved in this
| effort. FWIW, seeing your name associated with this
| definitely gives it a bit more authority than I had
| originally assumed.
| kentonv wrote:
| If it helps, I (original author of proto2) have been
| advising Buf and like what they're doing. (Disclosure: I
| also invested a small amount.)
|
| Buf is founded by engineers who spent a LOT of time
| working with Protobufs outside of Google. I was always
| the one saying "please don't write your own .proto
| parser" but I am convinced Buf actually knows what they
| are doing here and probably have all the details right.
|
| Our industry has a whole lot of tooling and
| infrastructure built around JSON, and almost every piece
| of it could be way better if it were operating with
| well-defined types instead, in the same way that TypeScript
| tooling benefits vs. JavaScript. Google has had an all-
| protobuf ecosystem internally for a long time, but much
| of it will never be released publicly. So that leaves
| someone like Buf to really build it out. I'm pretty
| interested to see where they're able to take it.
| wrs wrote:
| This is a similar situation to Ruby and Rubinius (an
| alternative implementation of Ruby). Because there was no Ruby
| specification other than the original MRI implementation, the
| Rubinius project (a new alternative implementation) created
| their own test suite to codify expected Ruby behavior. However,
| the MRI developers didn't use it, and the behavior diverged.
|
| The original creator gave up on the idea [0] but it was
| immediately taken over by others and is still maintained [1].
| In case of conflict, though, Matz (lead developer on MRI), not
| the specification, is the source of truth [2].
|
| [0] https://github.com/rubinius/rubinius-website-archive/blob/87...
|
| [1] https://github.com/ruby/spec
|
| [2] http://ruby.github.io/rubyspec.github.io/bugs_found/
| kyrra wrote:
| Googler, opinions are my own. I don't work on protobuf at all,
| just use it all the time (like most Googlers)
|
| I haven't dug into this in great detail yet, but the hard thing
| about the proto "spec" is that there isn't one, and protoc lets
| you do all kinds of crazy things that are really hard to model in
| languages like Antlr. There were some poor choices in protoc
| dating back to when proto1 was first created that have been
| carried forward. Having 20 years of proto definitions lets people
| come up with some crazy use cases.
|
| Definitely interesting for this company to create an EBNF
| definition for protobuf.
| jjtheblunt wrote:
| if comfortable answering, why protobuf instead of gob?
| jhumphries131 wrote:
| Gob is Go-specific. Our mission is to make schema-driven APIs
| easy, regardless of what language you use. Protobuf already
| has official support for nearly a dozen languages, and
| unofficial support for many more. Protobuf also has a
| compiler with a plugin model, which facilitates supporting
| even more in the future.
|
| Furthermore, Protobuf is an IDL, not a full-blown programming
| language. This makes it ideal for this use case, for
| describing APIs and data structures.
|
| Gob-encoded data structures are described with Go. While Go
| is great for writing server-side business logic, it is not as
| well-suited as a description language for data that you need
| to share with non-Go systems.
| whacker wrote:
| proto predates gob by quite a bit. gob was introduced with
| golang, and it's not really used anywhere else.
| erik_seaberg wrote:
| My takeaway from Java serialization was that a schema-
| driven encoding that's supported in many languages is a lot
| more useful.
| jhumphries131 wrote:
| It's possible that the internal version of protoc is very
| different from the open-source version. (I know there are
| numerous differences, but not sure how pervasive they are in
| the parser.)
|
| The open-source version has a hand-written tokenizer and
| recursive descent parser that is not too difficult to translate
| to EBNF. You'll notice that the section on numeric literals is
| a little wonky, because the tokenizer does a check that is hard
| to describe in EBNF. But it isn't too bad.
|
| Also, some of the constraints of the language are in prose in
| this spec because they are easier to enforce using a semantic
| validation pass, instead of trying to model purely with a CFG.
| (Optionality of the colon in the text format, used in message
| literals, comes to mind.)
|
| There are some things that technically _could_ be handled in
| the grammar, but they would make the grammar much more
| cumbersome to read and understand. So those things are also
| extracted into prose.
|
| > Definitely interesting for this company to create an EBNF
| definition for protobuf.
|
| For what it's worth, Google has also published an EBNF
| definition (the subject blog post contains links to those
| specs). But they are incomplete and not entirely accurate,
| which is a non-trivial part of what led us to writing and
| publishing this spec.
| kyrra wrote:
| One place protoc doesn't align well is the descriptor object.
| https://developers.google.com/protocol-buffers/docs/referenc...
|
| Comment placement is basically allowed anywhere by protoc,
| but how to get those comments within a Descriptor object for
| a proto is not well defined (there are places where you can
| put comments that are not available within Descriptor). It
| provides leading/trailing comments, but there are many other
| cases that are missed today (like comments embedded within a
| list of items in an array). Maybe this is a mismatch between
| what protoc allows and what Descriptor presents, but it's
| definitely annoying.
| jeffparsons wrote:
| > As of today, Protobuf is now a fully-defined language:
|
| (Etc.)
|
| I'm not sure what this is meant to achieve in reality. There is
| still only one implementation that defines what the language
| actually is, and that is Google's protoc.
|
| In my experience working with and writing alternative parsers
| for the '.proto' language, I've found that time and again
| Google's documentation for the format is either woefully vague,
| or directly contradicts the actual implementation. I don't see
| the value in a third party "spec" if what I have to do in
| practice will always be "whatever Google did in protoc".
| returningfory2 wrote:
| > There is still only one implementation that defines what the
| language actually is, and that is Google's protoc.
|
| From the article, it seems this is not true anymore?
|
| > We've built the [new Buf proto] compiler within the buf CLI
| to accurately match protoc.
| jhumphries131 wrote:
| Our intent with the compiler in Buf is to match protoc as
| perfectly as we can. We want to instill maximum confidence in
| our users that Buf is a trustworthy tool and a suitable
| replacement for protoc. And to do that, we need the behavior
| to match.
|
| But we do hope that eventually the _official_ definition of
| the language will be a proper specification, not a particular
| implementation. (And maybe this document could be the start
| of that shift.)
|
| So while there are multiple implementations (Buf,
| Square/Wire, probably others), the protoc implementation is
| canon.
| overboard2 wrote:
| Have you thought of creating your own version of the
| protobuf language, sort of like GNU C? You could have an
| optional flag to enable it, which would allow you to create
| your own official specification.
| jhumphries131 wrote:
| The intent of this spec is to actually put "whatever Google did
| in protoc" into a readable format, so you don't have to read
| the C++ code. The official docs fall short on providing much of
| the details that are included.
| morelisp wrote:
| Protobuf isn't too complicated, I've found the wire format
| docs to be some of the best among the avro/msgpack/thrift/etc
| competitors.
|
| Maybe you mean something besides the wire format. In that
| case, good luck, because that shit ain't protobuf.
| akshayshah wrote:
| The wire format is fairly straightforward if you've seen a
| few binary encodings. The language used to write the
| schemas isn't quite as simple and regular as you might
| hope, though.
|
| > Maybe you mean something besides the wire format. In that
| case, good luck, because that shit ain't protobuf.
|
| Naming's hard :) Being really pedantic, I think even Google
| calls the schema description language "Protocol Buffers"
| and uses phrases like "the Protobuf binary format" or "the
| Protocol Buffer wire format" to refer to the wire format.
| Colloquially, it's never confused me to just use "Protobuf"
| for both.
| morelisp wrote:
| Except I've written thousands of lines of protobuf format
| handling that never, or only extremely distantly, touch a
| schema file. But there's no reason you'd ever do the
| inverse, pushing protobuf schemas around with no intent
| to handle the wire format. As an abstract data definition
| format it's exceptionally poor, it only makes sense if
| you also want to use the wire format (which is... better
| than poor, especially as the commodity ones go).
| akshayshah wrote:
| > But there's no reason you'd ever do the inverse,
| pushing protobuf schemas around with no intent to handle
| the wire format.
|
| You could be writing a linter, a formatter, an
| implementation of the Language Server Protocol, a
| compiler that's not protoc, a way to apply semantic
| patches to large numbers of Protobuf schemas, or any
| number of other useful tools. There's clearly at least
| some demand for tools like this - partial implementations
| of most of these exist, often with some corporate
| backing.
|
| Unless you're implementing a Protobuf runtime
| (google.golang.org/protobuf in Go, upb for Python, etc.),
| your experience seems unusual to me - most developers
| I've encountered read and write the wire format using one
| of the existing runtimes.
|
| That said, it does sound like a lot of fun - especially
| if it's in lisp!
| lhorie wrote:
| That sounds great, but what's the governance story? Are the
| authors of the spec document committing to keeping the
| document up to date here henceforth? Are the protoc
| maintainers committing to have these folks involved in
| project direction decisions?
| jhumphries131 wrote:
| As of right now, the former. That is currently required for
| our tools to remain functioning (https://buf.build).
| If/when changes are made to the language, we update our
| tools (and this spec) to continue to be accurate.
| jeffbee wrote:
| Out of curiosity: why write proto language implementations
| rather than protoc plugins?
| jhumphries131 wrote:
| A great question: We do plan to add content about the plugin
| protocol to this site. While documentation for plugins is
| light, it is easier to find and to get a working plugin than
| it is to find the information needed, for example, to write a
| tool that performs static analysis on protobuf source
| files.
|
| The biggest omission in the existing docs was the
| specification of the language.
|
| Plugins generally require a library for a particular
| implementation language, so content we write would likely
| have to focus on a single language and library (at least to
| start). Whereas a spec is more broadly useful, regardless of
| what implementation language one is using with Protobuf.
| geraldcombs wrote:
| So that you can analyze and troubleshoot protobuf network
| traffic? Wireshark has a protobuf parser that integrates with
| our dissection API:
|
| https://gitlab.com/wireshark/wireshark/-/blob/master/epan/pr...
| jeffbee wrote:
| Hrmm, I don't see why you would need to think about proto
| files to do this. You can dissect protocol messages on the
| wire using the descriptor. In fact, I would say that would
| be a good improvement to the code you just showed me.
| jen20 wrote:
| (I don't work at Buf, but happen to be able to answer this) -
| the post at [1] describes the rationale for wanting something
| different than the Google compiler.
|
| To my mind I'd rather have something written in Go that I can
| pull in and version using `go.mod` instead of having to
| special case a single tool, as well.
|
| [1]: https://docs.buf.build/reference/internal-compiler
| season2episode3 wrote:
| It appears Google has released a spec as of 11 days ago:
| https://github.com/protocolbuffers/protobuf/issues/6188#issu...
| bufbuild wrote:
| That is for the text format, which is a serialized
| representation of Protobuf data. As they specify in the linked
| doc, it is not the format for the actual language:
|
| > This format is distinct from the format of text within a
| .proto schema.
| silasdavis wrote:
| Golang is an example of a language defined by a spec not an
| implementation? Discuss.
| wrs wrote:
| I see your point, but the implementers of Go do pay a lot more
| attention than most "defined by an implementation" languages to
| specifying what they're doing before they do it. And if you
| find a difference between the specification and the
| implementation, generally the specification will prevail.
| jvolkman wrote:
| This seems like a great resource. Kudos.
|
| > But most of them are based on the incomplete specs from
| Google's developer site. None of them can correctly predict what
| source files protoc will actually accept or reject 100% of the
| time.
|
| I'd like to think I got pretty close with the plugin that now
| ships with IntelliJ. It even supports the 65-bit integer literal
| [1] that protoc happens to accept for proto2-style float and
| double default values.
|
| With this as a starting point, it'd be nice to fix some of the
| peculiarities that arise from "implementation as spec", such as
| that literal value, and the fact that colon optionality in text
| format is based on value type, not syntax.
|
| 1: https://github.com/jvolkman/intellij-protobuf-editor/blob/6e...
| miohtama wrote:
| Out of curiosity, why does Protobuf allow a negative 64-bit
| value in the first place? No CPU architecture supports such as
| far as I know.
| kentonv wrote:
| It's been 15 years since I wrote this so I don't remember
| exactly, but I think it's just an implementation quirk. See:
|
| https://github.com/protocolbuffers/protobuf/blob/main/src/go...
|
| The parser consumes a "-", then consumes a number. The number
| can be any 64-bit unsigned integer. It is then converted to a
| double. Finally, if a "-" was seen earlier, it is negated. So
| by accident, it ends up allowing the range of a 65-bit signed
| integer.
| jvolkman wrote:
| It's just a long-standing quirk in the parser. It parses the
| numeric part as an unsigned 64-bit number, then applies the
| sign afterwards. And the result can be approximately stuffed
| into a floating point value.
|
| The behavior for integer fields is different; compilation
| will fail with an out of range error.
___________________________________________________________________
(page generated 2022-09-12 23:00 UTC)
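The steps kentonv and jvolkman describe for float and double default values can be sketched as follows. This is a minimal illustration of the described sequence (consume "-", read an unsigned 64-bit integer, convert to a double, negate), not protoc's actual code; `parse_double_default` and `parse_int64_default` are hypothetical names, and the integer range check is an assumption about what an "out of range" rejection might look like.

```python
# Sketch of the parsing quirk described in the thread, not protoc's real code.
# Assumption: the literal is a decimal integer with an optional leading "-".

UINT64_MAX = 2**64 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def _split_sign(literal: str) -> tuple[bool, int]:
    """Return (is_negative, magnitude) for a decimal integer literal."""
    negative = literal.startswith("-")
    digits = literal[1:] if negative else literal
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a decimal integer literal: {literal!r}")
    return negative, int(digits)


def parse_double_default(literal: str) -> float:
    """Unsigned 64-bit magnitude first, then double, then the sign."""
    negative, magnitude = _split_sign(literal)
    if magnitude > UINT64_MAX:
        raise ValueError("magnitude does not fit in an unsigned 64-bit integer")
    value = float(magnitude)  # may round: doubles carry a 53-bit significand
    return -value if negative else value


def parse_int64_default(literal: str) -> int:
    """Hypothetical strict check for an integer field: signed 64-bit only."""
    negative, magnitude = _split_sign(literal)
    value = -magnitude if negative else magnitude
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError("integer out of range")
    return value


# -(2**64 - 1) needs 65 bits as a signed integer, yet the double path takes it.
print(parse_double_default("-18446744073709551615"))
```

Because the sign is applied only after the unsigned range check, the accepted literals run from -(2**64 - 1) to 2**64 - 1, which is the range of a 65-bit signed integer. The strict integer path rejects the same literal.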