[HN Gopher] Kaitai Struct: A new way to develop parsers for bina...
       ___________________________________________________________________
        
       Kaitai Struct: A new way to develop parsers for binary structures
        
       Author : marcodiego
       Score  : 74 points
       Date   : 2022-03-17 20:13 UTC (2 hours ago)
        
 (HTM) web link (kaitai.io)
 (TXT) w3m dump (kaitai.io)
        
       | kangalioo wrote:
       | There's also Wuffs, a safe and fast programming language made by
       | Google specifically for decoding and encoding file formats:
       | https://github.com/google/wuffs
       | 
       | Paired with the C FFI available in most languages, this seems
       | like the nicer solution. It's simpler than generating code for a
       | bunch of high-level languages, and more performant.
        
         | layer8 wrote:
         | Not for managed environments like client-side JS, JVM, .NET,
         | ...
        
           | jmgao wrote:
           | This appears to just allow you to parse binary formats to the
           | represented fields. (Not that that's not extremely useful,
           | doing this in managed languages is generally a giant pain in
           | the ass!)
           | 
           | Wuffs is much more powerful: it's essentially a safe C
           | dialect that compiles to C and lets you write an entire
           | codec and know that there aren't any overflows.
        
         | eesmith wrote:
         | How much of a future should I expect for Wuffs?
         | 
         | The linked-to page says: "Version 0.2. The API and ABI aren't
         | stabilized yet. The compiler undoubtedly has bugs."
         | 
         | There are not many recent commits, and mostly by one developer.
        
       | secondcoming wrote:
       | Interesting, but now you have to add in the possibility of having
       | bugs in your YAML file. The YAML is probably less readable than
       | the spec for the binary format itself.
       | 
       | Looking at the code-gen for utf8_string [0], it's a case of
       | 'thanks, but no thanks':
       | 
       | > std::unique_ptr<std::vector<std::unique_ptr<utf8_codepoint_t>>> m_codepoints;
       | 
       | This is a solution looking for a problem, but I bet it was fun to
       | write.
       | 
       | [0] https://formats.kaitai.io/utf8_string/cpp_stl_11.html
        
       | asadawadia wrote:
       | Great library - too bad it only allows reading
        
         | ctoth wrote:
         | If you're working in Python and need to write as well as read,
         | check out Construct[0], which is also a declarative parser
         | builder.
         | 
         | [0]: https://construct.readthedocs.io/en/latest/intro.html
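To illustrate what "declarative, and able to write as well as read" can mean, here is a minimal stdlib-only sketch. It does not use Construct's API; the format, the `HEADER_FIELDS` table, and the `parse`/`build` helpers are all hypothetical names made up for this example.

```python
import struct

# Hypothetical format: one declaration drives both parsing and building.
# Each entry is (field name, struct format code).
HEADER_FIELDS = [("magic", "4s"), ("version", "H"), ("length", "I")]

# "<" selects little-endian byte order with no padding between fields.
_LAYOUT = struct.Struct("<" + "".join(code for _, code in HEADER_FIELDS))


def parse(data: bytes) -> dict:
    """Decode the fixed-size header at the start of `data`."""
    values = _LAYOUT.unpack_from(data)
    return {name: value for (name, _), value in zip(HEADER_FIELDS, values)}


def build(fields: dict) -> bytes:
    """Encode a dict with the same keys back into bytes."""
    return _LAYOUT.pack(*(fields[name] for name, _ in HEADER_FIELDS))
```

Because both directions are derived from the same table, `build(parse(raw))` gives back the original header bytes, which is the property a read-only code generator cannot offer.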
        
       | CGamesPlay wrote:
       | As a code generator, I guess this may be nice. It seems like a
       | DSL like the Nom [0] API is more natural and expressive, though.
       | I imagine you can hit limits to expressiveness in Yaml pretty
       | quickly.
       | 
       | [0] https://github.com/Geal/nom
        
       | mturk wrote:
       | Kaitai is a really great system, with an awesome WebIDE. At work
       | we have just started a project to use it for astrophysics
       | simulations and data from dark matter detectors, and one of my
       | hobby projects is to use it to explore retro game data formats.
        
       | jll29 wrote:
       | Kudos - this is neat - I especially love the library of
       | pre-existing descriptions, which helps me learn about the tool
       | as well as about an abundance of file formats without wasting
       | time on re-engineering them.
       | 
       | This is somewhat akin to ASN.1.
       | 
       | My personal feature wish list:
       | 
       | - support writing as well as reading;
       | 
       | - support generating Rust, Julia and Swift code;
       | 
       | - an upload button to let users add to a contrib/ folder of
       | existing format descriptions.
        
       | dhx wrote:
       | I contributed a number of file formats a few years ago (and
       | attempted numerous others) but ran into a number of problems with
       | certain file formats:
       | 
       | 1. It's not possible to read from the file until a multi-byte
       | terminator sequence is detected. [1]
       | 
       | 2. You can't read sections of a file where the termination
       | condition is the presence of a sequence of bytes denoting the
       | next unrelated section of the file (and you don't want to
       | consume/read these bytes) [2]
       | 
       | 3. The WebIDE at the time couldn't handle very large file format
       | specifications such as Photoshop (PSD) [3]
       | 
       | 4. Files containing compressed or encrypted sections require a
       | compression/encryption algorithm to be hardcoded into Kaitai
       | struct libraries for each programming language it can output to.
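Outside of Kaitai Struct, the two terminator cases in points 1 and 2 can be sketched in plain Python. This is only an illustration of the desired behaviour, not Kaitai's implementation; `read_until` and its `consume` flag are hypothetical names, and the byte-at-a-time loop favours clarity over speed.

```python
import io


def read_until(stream: io.BufferedIOBase, terminator: bytes,
               consume: bool = True) -> bytes:
    """Read bytes up to a multi-byte terminator.

    consume=True  -> the terminator is read and discarded (point 1).
    consume=False -> the stream is rewound so the terminator, e.g. the
                     marker of the next section, is still unread (point 2).
                     This needs a seekable stream.
    """
    if not terminator:
        raise ValueError("terminator must not be empty")
    buf = bytearray()
    while not buf.endswith(terminator):
        byte = stream.read(1)
        if not byte:
            raise EOFError("terminator not found before end of stream")
        buf += byte
    if not consume:
        # Step back so the next reader sees the terminator bytes.
        stream.seek(-len(terminator), io.SEEK_CUR)
    return bytes(buf[:-len(terminator)])
```

With `io.BytesIO(b"hello\r\n\r\nBODYnext")`, the first call with terminator `b"\r\n\r\n"` consumes the separator, while a second call with `b"next"` and `consume=False` leaves `b"next"` in the stream for whatever parses the following section.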
       | 
       | I particularly liked the WebIDE, as it makes it easy to get
       | started and share results. I also liked how Kaitai Struct allows
       | easy definition of constraints (simple ones at least) in the
       | file format specification, so that you can say "this section of
       | the file shall have a size not exceeding header.length * 2
       | bytes".
       | 
       | For those interested, here are some alternative attempts at
       | binary file format specification, each with its own set of
       | pros and cons:
       | 
       | 1. 010 Editor [4]
       | 
       | 2. Synalysis [5]
       | 
       | 3. hachoir [6]
       | 
       | 4. DFDL [7]
       | 
       | [1] https://github.com/kaitai-io/kaitai_struct/issues/158
       | 
       | [2] https://github.com/kaitai-io/kaitai_struct/issues/156
       | 
       | [3] https://raw.githubusercontent.com/davidhicks/kaitai_struct_f...
       | 
       | [4] https://www.sweetscape.com/010editor/repository/templates/
       | 
       | [5] https://github.com/synalysis/Grammars
       | 
       | [6] https://github.com/vstinner/hachoir/tree/main/hachoir/parser
       | 
       | [7] https://github.com/DFDLSchemas/
        
       | gigel82 wrote:
       | Ugh, I wish I'd found this a couple of years ago, when I
       | hand-wrote a Unity asset parser in Node.js for a hobby project
       | (big/little-endian mixes, byte alignment, versioned header
       | format, different compression algos, etc.).
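For a sense of what hand-writing such a parser involves, here is a small Python sketch that mixes byte orders, aligns offsets, and branches on a header version. The layout is invented for this example and is not the Unity asset format; `align` and `parse_header` are hypothetical helpers.

```python
import struct


def align(offset: int, boundary: int = 4) -> int:
    """Round `offset` up to the next multiple of `boundary`."""
    return (offset + boundary - 1) // boundary * boundary


def parse_header(data: bytes) -> dict:
    """Parse a made-up header: big-endian version, little-endian size."""
    (version,) = struct.unpack_from(">I", data, 0)  # big-endian u4
    offset = 4
    header = {"version": version}
    if version >= 2:
        # Assumption for this sketch: version 2 adds a one-byte flags
        # field, followed by padding up to a 4-byte boundary.
        header["flags"] = data[offset]
        offset = align(offset + 1)
    (header["size"],) = struct.unpack_from("<I", data, offset)  # little-endian u4
    header["end"] = offset + 4  # offset of the first byte after the header
    return header
```

Every such rule (byte order per field, padding, version checks) is a line of imperative code here, which is the bookkeeping a declarative format description is meant to take over.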
        
       | sidpatil wrote:
       | This looks really cool! This would have been really useful to me
       | a couple years ago.
        
         | lpapez wrote:
         | It was available a few years ago, and I found it very useful.
        
       | neonsunset wrote:
       | As far as the .NET implementation goes, it is _really bad_:
       | 
       | - Very old and currently obsolete project target
       | 
       | - As a result, does not use modern data types such as Span<T>
       | 
       | - No utilisation of ArrayPool<T> which is important for things
       | like serialisers where you expect to deal with buffers a lot
       | 
       | - Appears to be a blind Java port, judging by the code style
       | 
       | This is not acceptable when working with low-level and binary
       | structures which this standard is focused on. Yes, I know, this
       | is an OSS project and therefore instead of complaining here I
       | should have been working on contributing a PR to fix those
       | issues. However, my main concern is that this standard and set of
       | libraries in the current form work against the performance-
       | sensitive nature of working with binary data.
        
       | imglorp wrote:
       | Erlang got this right: for the narrow case of packets
       | in/mangle/out, described like an RFC bit-field diagram, it was
       | very clean and simple.
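Erlang's bit syntax lets a binary pattern mirror an RFC bit-field diagram directly. As a rough point of comparison, the same "packet in, fields out, packet back" round trip in Python needs explicit shifts and masks. The sketch below covers only the first four bytes of an IPv4 header (version, IHL, DSCP, ECN, total length); the function names are made up for this example.

```python
import struct


def parse_ipv4_prefix(packet: bytes) -> dict:
    """Split the first 4 bytes of an IPv4 header into its bit fields."""
    # "!" is network (big-endian) byte order: two u1 values and one u2.
    b0, b1, total_length = struct.unpack_from("!BBH", packet)
    return {
        "version": b0 >> 4,      # high 4 bits of byte 0
        "ihl": b0 & 0x0F,        # low 4 bits of byte 0
        "dscp": b1 >> 2,         # high 6 bits of byte 1
        "ecn": b1 & 0x03,        # low 2 bits of byte 1
        "total_length": total_length,
    }


def build_ipv4_prefix(fields: dict) -> bytes:
    """Pack the same fields back into 4 bytes."""
    b0 = (fields["version"] << 4) | fields["ihl"]
    b1 = (fields["dscp"] << 2) | fields["ecn"]
    return struct.pack("!BBH", b0, b1, fields["total_length"])
```

The field widths appear twice here, once in the parser and once in the builder, whereas a single bit-level pattern can serve for both matching and construction.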
        
       | renewiltord wrote:
       | Seems rather well designed actually. Appears that you can even
       | use length-delimited lists and stuff. I like it. I have a project
       | where we have a compact binary encoding and I have to write
       | documentation _and_ serde for it. This works for docs and
       | deserialization, so that's good. I understand why serialization
       | isn't supported, but I feel like there's probably a clever API
       | that lets you plug in your own serializer. We'll see. I might switch
       | our internal thing this weekend to it.
       | 
       | Would be cool if you could generate a protocol diagram from this.
        
       ___________________________________________________________________
       (page generated 2022-03-17 23:00 UTC)