[HN Gopher] Kaitai Struct: A new way to develop parsers for bina...
___________________________________________________________________
Kaitai Struct: A new way to develop parsers for binary structures
Author : marcodiego
Score  : 74 points
Date   : 2022-03-17 20:13 UTC (2 hours ago)
(HTM) web link (kaitai.io)
(TXT) w3m dump (kaitai.io)
| kangalioo wrote:
| There's also Wuffs, a safe and fast programming language made by
| Google specifically for decoding and encoding file formats
| https://github.com/google/wuffs
|
| Paired with C FFI available in most languages, this seems like
| the nicer solution. It's simpler than generating code for a bunch
| of high level languages, and more performant
  | layer8 wrote:
  | Not for managed environments like client-side JS, JVM, .NET,
  | ...
    | jmgao wrote:
    | This appears to just allow you to parse binary formats to the
    | represented fields. (Not that that's not extremely useful,
    | doing this in managed languages is generally a giant pain in
    | the ass!)
    |
    | wuffs is much more powerful: it's essentially a safe C
    | dialect that compiles to C, that lets you write an entire
    | codec and know that there aren't any overflows.
  | eesmith wrote:
  | How much of a future should I expect for Wuffs?
  |
  | The linked-to page says: "Version 0.2. The API and ABI aren't
  | stabilized yet. The compiler undoubtedly has bugs."
  |
  | There are not many recent commits, and mostly by one developer.
| secondcoming wrote:
| Interesting, but now you have to add in the possibility of having
| bugs in your YAML file. The YAML is probably less readable than
| the spec for the binary format itself.
|
| Looking at the code-gen for utf8_string [0] and it's a case of
| 'thanks, but no thanks'
|
| > std::unique_ptr<std::vector<std::unique_ptr<utf8_codepoint_t>>> m_codepoints;
|
| This is a solution looking for a problem, but I bet it was fun to
| write.
|
| [0] https://formats.kaitai.io/utf8_string/cpp_stl_11.html
| asadawadia wrote:
| Great library - too bad it only allows reading
  | ctoth wrote:
  | If you're working in Python and need to write as well as read
  | check out Construct[0], which is also a declarative parser
  | builder.
  |
  | [0]: https://construct.readthedocs.io/en/latest/intro.html
| CGamesPlay wrote:
| As a code generator, I guess this may be nice. It seems like a
| DSL like the Nom [0] API is more natural and expressive, though.
| I imagine you can hit limits to expressiveness in Yaml pretty
| quickly.
|
| [0] https://github.com/Geal/nom
| mturk wrote:
| Kaitai is a really great system, with an awesome WebIDE. At work
| we have just started a project to use it for astrophysics
| simulations and data from dark matter detectors, and one of my
| hobby projects is to use it to explore retro game data formats.
| jll29 wrote:
| Kudos - this is neat - I especially love the library of
| pre-existing descriptions, which helps me to learn about the tool
| as well as about an abundance of file formats without
| re-engineering time wasted.
|
| This is somewhat akin to ASN.1.
|
| My personal feature wish list:
|
| - support writing as well as reading;
|
| - support generating Rust, Julia and Swift code.
|
| - upload button to let users add to a contrib/ folder of existing
| format descriptions
| dhx wrote:
| I contributed a number of file formats a few years ago (and
| attempted numerous others) but ran into a number of problems with
| certain file formats:
|
| 1. It's not possible to read from the file until a multiple byte
| termination sequence is detected. [1]
|
| 2. You can't read sections of a file where the termination
| condition is the presence of a sequence of bytes denoting the
| next unrelated section of the file (and you don't want to
| consume/read these bytes) [2]
|
| 3. The WebIDE at the time couldn't handle very large file format
| specifications such as Photoshop (PSD) [3]
|
| 4. Files containing compressed or encrypted sections require a
| compression/encryption algorithm to be hardcoded into Kaitai
| struct libraries for each programming language it can output to.
|
| The WebIDE I particularly liked as it makes it easy to get
| started and share results. I also liked how Kaitai Struct allows
| easy definition of constraints (simple ones at least) into the
| file format specification so that you can say "this section of
| the file shall have a size not exceeding header.length * 2
| bytes".
|
| Some alternative binary file format specification attempts for
| those interested in seeing alternatives, each with their own set
| of problems/pros/cons:
|
| 1. 010 Editor [4]
|
| 2. Synalysis [5]
|
| 3. hachoir [6]
|
| 4. DFDL [7]
|
| [1] https://github.com/kaitai-io/kaitai_struct/issues/158
|
| [2] https://github.com/kaitai-io/kaitai_struct/issues/156
|
| [3] https://raw.githubusercontent.com/davidhicks/kaitai_struct_f...
|
| [4] https://www.sweetscape.com/010editor/repository/templates/
|
| [5] https://github.com/synalysis/Grammars
|
| [6] https://github.com/vstinner/hachoir/tree/main/hachoir/parser
|
| [7] https://github.com/DFDLSchemas/
| gigel82 wrote:
| Ugh, wish I'd found this a couple of years ago; after
| hand-writing a Unity asset parser in node.js for a hobby project
| (big/little-endian mixes, byte alignment, versioned header
| format, different compression algos, etc.).
| sidpatil wrote:
| This looks really cool! This would have been really useful to me
| a couple years ago.
  | lpapez wrote:
  | It was available a few years ago, and I found it very useful.
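
Editor's sketch, relating to items 1 and 2 in dhx's comment above: the two cases described there (reading up to a multi-byte terminator, and stopping at a marker without consuming it) can be illustrated in plain Python. This is a hand-written example using only the standard library, not Kaitai Struct's generated code or API; the function name `read_until` and its `consume` flag are hypothetical and chosen only for illustration.

```python
import io


def read_until(stream, terminator, consume=True):
    """Read from a seekable binary stream up to a multi-byte terminator.

    Returns the bytes that precede the terminator. With consume=False the
    stream is left positioned at the start of the terminator, so the next
    section can still read its own marker. Raises EOFError (and restores
    the original position) if the terminator never appears.
    """
    if not terminator:
        raise ValueError("terminator must be non-empty")
    start = stream.tell()
    buf = bytearray()
    while True:
        chunk = stream.read(4096)
        if not chunk:
            stream.seek(start)
            raise EOFError("terminator not found")
        # Only re-scan the tail that could hold a terminator split
        # across two chunks, plus the newly read bytes.
        search_from = max(0, len(buf) - len(terminator) + 1)
        buf += chunk
        idx = buf.find(terminator, search_from)
        if idx != -1:
            end = idx + (len(terminator) if consume else 0)
            stream.seek(start + end)
            return bytes(buf[:idx])


# Case 1: multi-byte terminator, consumed along with the payload.
headers = io.BytesIO(b"hello\r\n\r\nworld")
body = read_until(headers, b"\r\n\r\n")

# Case 2: stop at the next section's marker but leave it unread.
sections = io.BytesIO(b"abcNEXTdef")
first = read_until(sections, b"NEXT", consume=False)
marker = sections.read(4)
```

The sketch relies on a seekable stream so it can rewind to just before (or just after) the terminator; a non-seekable source would need its own pushback buffer instead.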
| neonsunset wrote:
| As far as .NET implementation goes, it is _really bad_:
|
| - Very old and currently obsolete project target
|
| - As a result, does not use modern data types such as Span<T>
|
| - No utilisation of ArrayPool<T> which is important for things
| like serialisers where you expect to deal with buffers a lot
|
| - Appears to be a blind Java port given provided code style
|
| This is not acceptable when working with low-level and binary
| structures which this standard is focused on. Yes, I know, this
| is an OSS project and therefore instead of complaining here I
| should have been working on contributing a PR to fix those
| issues. However, my main concern is that this standard and set of
| libraries in the current form work against the
| performance-sensitive nature of working with binary data.
| imglorp wrote:
| Erlang got this right: for the narrow case of packets
| in/mangle/out, described like an RFC bit-field diagram, it was
| very clean and simple.
| renewiltord wrote:
| Seems rather well designed actually. Appears that you can even
| use length-delimited lists and stuff. I like it. I have a project
| where we have a compact binary encoding and I have to write
| documentation _and_ serde for it. This works for docs and
| deserialization so that's good. I understand why serialization
| isn't supported but I feel like there's probably a clever API
| that allows inserting your own ser in. We'll see. I might switch
| our internal thing this weekend to it.
|
| Would be cool if you could generate a protocol diagram from this.
___________________________________________________________________
(page generated 2022-03-17 23:00 UTC)