_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
 (HTM) Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
 (HTM)   Meta String: A more space-efficient string encoding than UTF-8 in Fury
       
       
        hgs3 wrote 7 hours 42 min ago:
         How does this compare to the official Unicode compression schemes? [1]
         [2]
        
 (HTM)  [1]: https://en.wikipedia.org/wiki/Standard_Compression_Scheme_for_...
 (HTM)  [2]: https://en.wikipedia.org/wiki/Binary_Ordered_Compression_for_U...
       
        lifthrasiir wrote 17 hours 28 min ago:
        In addition to the existing dictionary compression algorithms, it is
        very important to know the characteristics of string contents. The
         proposed encoding has a very crude mechanism to describe them, but
         that's all (and the encoding flag is included with every string
         encoded, even though the schema could have carried that information
         instead).
        Namespaces would start with only a small number of fixed prefixes like
        `java.`, `com.` and `org.`. Namespaces, paths and file names will have
        a lot of overlapping substrings as well. Class and field names are more
        chaotic, but some heuristics can be easily extracted as most of them
        will obey the common naming convention.
        
         By the way, the specification is, to say the least, highly confusing
         to interpret. For example, is the "ref(erence) meta" a separate
        section in the serialization format or not? The corresponding section
        never mentions which bytes are to be written, and reference flags
        apparently describe the state of each object, so it can't be in that
        section anyway. Providing at least a single worked example would have
        significantly reduced confusion.
        
        (Due to the inability to fully comprehend the specification, I can't
        give a general criticism and/or advice for now. But it does seem to
         miss several lessons from existing serialization formats. For
         instance, a varint encoding based on a continuation bit is actually
         the worst encoding for a given distribution, even though it is all
         too common! Consider an alternative like `1^n 0 bit^(7n+7)`, which
         can determine the continuation length from the first byte alone.)
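         
         A rough Java sketch of that alternative (my own illustration, not
         anything from the Fury spec; it assumes at most 7 continuation bytes
         so the whole length prefix fits in the first byte):
         
             static long readPrefixVarint(java.nio.ByteBuffer buf) {
                 int first = buf.get() & 0xFF;
                 // Leading 1-bits of byte 0 = number of continuation bytes,
                 // so the total length is known after a single read.
                 int extra = Integer.numberOfLeadingZeros(~first << 24);
                 long value = first & (0xFF >>> (extra + 1)); // payload bits of byte 0
                 for (int i = 0; i < extra; i++) {
                     value = (value << 8) | (buf.get() & 0xFF);
                 }
                 return value;
             }
         
         Unlike LEB128-style varints, the decoder branches once on the first
         byte instead of once per byte.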
       
        chubot wrote 1 day ago:
         Apple's NSString apparently uses an in-memory encoding (not a wire
         encoding) that can go as low as 5 bits per character, as of OS X
         10.10 in 2014: [1] If the length is between 0 and 7, store the string
         as raw eight-bit characters.
        
         If the length is 8 or 9, store the string in a six-bit encoding,
         using the alphabet
         "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX".
        
        If the length is 10 or 11, store the string in a five-bit encoding,
        using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013"
        
        ---
        
        How about 5 bits? This isn't totally ludicrous. There are probably a
        lot of strings which are just lowercase, for example. 5 bits gives 32
        possible values. If you include the whole lowercase alphabet, there are
        6 extra values, which you could allot to the more common uppercase
        letters, or some symbols, or digits, or some mix.
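         
         For instance (a toy sketch of the idea, not Apple's actual code; the
         32-char alphabet is the one quoted above):
         
             static final String ALPHABET5 = "eilotrm.apdnsIc ufkMShjTRxgC4013";
         
             // Pack up to 12 chars into the low 60 bits of a long at 5 bits
             // per character; the real scheme also stores the length in the
             // tag bits.  Returns -1 if the string isn't encodable, i.e. it
             // must spill to a normal heap string.
             static long pack5(String s) {
                 if (s.length() > 12) return -1;
                 long bits = 0;
                 for (int i = 0; i < s.length(); i++) {
                     int idx = ALPHABET5.indexOf(s.charAt(i));
                     if (idx < 0) return -1;
                     bits |= (long) idx << (5 * i);
                 }
                 return bits;
             }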
        
        Related thread - [2] It's interesting how network formats and in-memory
        formats kinda converge in design, because retrieving data from memory
        to CPU is so expensive now.
        
 (HTM)  [1]: https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer...
 (HTM)  [2]: https://lobste.rs/s/5417dx/storing_data_pointers#c_l4zfrv
       
          arcticbull wrote 1 day ago:
          ... only in tagged pointers meaning the total size can't exceed like
          ~60 bits. After that, the canonical representation of NSString is
          UTF-16 and in Swift strings is UTF-8.
          
          So not "in memory" (although obviously it is in memory) it's when the
          goal is to fit the totality of the string plus a tag into a register.
          
          If they can't completely encode the 10-11 character value using the
          5-bit charset it spills to UTF-16 on the heap.
          
          > It's interesting how network formats and in-memory formats kinda
          converge in design, because retrieving data from memory to CPU is so
          expensive now.
          
           Honestly, strings are probably one of the worst examples of this.
          Strings have like 50 different representations depending on what
          you're trying to do with them.
          
          1. Wire format for storage and retrieval is generally UTF-8.
          
          2. If you're trying to parse a string you generally want to expand it
          into code points.
          
          3. If you're trying to sort or compare strings you generally want to
          expand that into a normalized form like NFC or NFD.
          
          4. If you're trying to edit text, you generally want grapheme
          clusters.
          
          5. If you're trying to present text, you generally want pixels.
          
           The way to think about it is: "there is no such thing as the length
           of a string without context." Strings are basically a bytecode format
          for representing concepts.
          
 (HTM)    [1]: https://developer.apple.com/documentation/foundation/nsstrin...
       
            chubot wrote 19 hours 55 min ago:
            My point was something like
            
            - if you want to send a bunch of small strings over the network,
            you can pick a variable length encoding even more compact than
            UTF-8
            
             - if you want to pass and return a lot of strings as values, or
             store them in data structures like a List, then you can pick a
             variable-length encoding even more compact than UTF-8.  So you
             can compute on the entire List without indirections for the
             common case of small strings, with good cache locality
            
            ---
            
            That doesn't really contradict anything you said -- the tagged
            pointer is in some sense a "wire format", and then you have to do a
            conversion to actually do something useful with the string.
            
            Although sometimes you don't need conversions either.  For len() in
            bytes or code points and graphemes, which are ALL identical for
            ASCII, you can calculate it directly on the tagged pointer.
            
            And you also have random access to bytes, code points, and
            graphemes in that special case.  A lot of the "work" happened when
            deciding that this encoding is valid for a particular string, which
            is nice.
       
              arcticbull wrote 2 hours 32 min ago:
              > For len() in bytes or code points and graphemes, which are ALL
              identical for ASCII, you can calculate it directly on the tagged
              pointer.
              
               7-bit ASCII yeah, not 8-bit, which it appears is supported in
               tagged pointers per your original comment.
              
               8-bit ASCII has fun characters like é (130, 0x82 in extended
               ASCII), which may be 1 grapheme cluster -- but there are
               several ways to represent it in Unicode. You can do it as
               U+00E9 (NFC) or you can do it as U+0065 U+0301 (with the
               combining acute accent, NFD).
              
               This means it's a byte length of 1 in extended ASCII -- but
               several in UTF-8 (since UTF-8 is only directly compatible with
               the 7-bit range). And either 2 or 4 bytes in UTF-16.
              
              Also, it means this string has a Unicode code point length of 1
              or 2 depending on context and normalization form.
              
              Again, asking the context-free len() of a string is a meaningless
              question.
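               
               A quick Java sketch of that (same user-perceived "é", several
               different answers):
               
                   String nfc = "\u00E9";   // precomposed é (NFC): 1 code point
                   String nfd = "e\u0301";  // e + combining acute (NFD): 2 code points
                   int utf8Nfc = nfc.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;  // 2
                   int utf8Nfd = nfd.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;  // 3
                   int utf16Nfc = nfc.length();  // 1 UTF-16 code unit (2 bytes)
                   int utf16Nfd = nfd.length();  // 2 UTF-16 code units (4 bytes)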
              
              > And you also have random access to bytes, code points, and
              graphemes in that special case.
              
              You should never make this assumption. You should treat strings
              as opaque data blobs.
              
              Unicode is just a billion edge cases in a trench coat ;)
       
              chaokunyang wrote 7 hours 42 min ago:
               Yes, we only use this encoding when it brings benefits. `wire
               format` is a good term for such cases, where the context is
               constrained to the current message only. So the data available
               for compression is small, and many statistical methods won't
               work since it's a `wire format`: the stats themselves must be
               sent to peers, which has a bigger space and CPU cost.
       
        amiga386 wrote 1 day ago:
        Everything old is new again. This reminds me of ZSCII, which was how
        Infocom games encoded string data to fit ~100kb of text data into a
        Commodore 64, packing 3 characters into every 2 bytes.
        
            --first byte-------   --second byte---
            7     6 5 4 3 2  1 0   7 6 5  4 3 2 1 0
            bit  --first--  --second---  --third--
        
        The 5-bit character encoding was one of three character sets:
        
               Z-char 6789abcdef0123456789abcdef
            current   --------------------------
              A0      abcdefghijklmnopqrstuvwxyz
              A1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
              A2       ^0123456789.,!?_#'"/\-:()
                      --------------------------
        
        Z-char 0 was space in all character sets, z-char 1 was newline, z-chars
        2 and 3 switched to one of the other character sets depending on which
        one you were already in for the next character only, and z-chars 4 and
        5 switched permanently, the "shift-lock".
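         
         A sketch of the word packing in Java (bit 15 is the end-of-string
         marker described in the spec):
         
             // Pack three 5-bit z-chars into one 16-bit word.
             static int zsciiWord(int c1, int c2, int c3, boolean last) {
                 return (last ? 0x8000 : 0)
                         | ((c1 & 0x1F) << 10) | ((c2 & 0x1F) << 5) | (c3 & 0x1F);
             }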
        
 (HTM)  [1]: https://inform-fiction.org/zmachine/standards/z1point0/sect03....
       
          chaokunyang wrote 1 day ago:
          This is interesting, thanks for sharing this. I will take a deep look
          at it.
       
        NohatCoder wrote 1 day ago:
        I can't help but feel like something has gone fundamentally wrong when
        there are so many arbitrary strings in the system that this
        optimisation makes a tangible difference.
       
          smsm42 wrote 1 day ago:
          Why is it wrong? It is a widely known fact that texts are redundant
          and compress well. Many systems work with texts and there's an
          advantage when those texts are human-comprehensible. You can make a
          system which would auto-compress and decompress any text of
          sufficient length - and some do - but there's absolutely no surprise
          that texts are compressible and it doesn't indicate there's something
          wrong. Texts are not optimized for space, they are optimized for
          human convenience.
       
            IsTom wrote 10 hours 5 min ago:
             If you establish a mapping first, you could probably use 4- or
             8-byte integers as ids instead of using custom encodings to gain
             30 vs 19 bytes. With 8 bytes you could even use a hash.
       
            zokier wrote 1 day ago:
             But these are not general text, they are identifiers. If you are
             building a space-efficient RPC system then I do think it is
             reasonable to question why these identifiers are passed as
             strings in such quantity that it makes a difference.
            
             Off the top of my head, a better approach could be to have some
             mechanism to establish a mapping from these human-readable
             identifiers to numeric identifiers (protocol handshake or some
             other schema exchange), and then use those numeric identifiers in
             the actual high-volume messages.
             
             edit: umm... it seems like fury is doing something like that
             already [1] so I am a bit puzzled whether this saving really
             makes a meaningful difference?!
            
 (HTM)      [1]: https://fury.apache.org/docs/guide/java_object_graph_guide...
       
              smsm42 wrote 1 day ago:
              Identifiers are textual also (though shorter). You don't have
               arbitrary byte strings as identifiers. And yes, if you build
               RPC specifically, then you have to be ready for the fact that
               you will have to deal with symbolic identifiers. You can work
               around that
              by transcoding them somehow into an efficient format - like
              protobuf does by forcing the schema to provide translation, or by
              additional communication as you mentioned, but it's unavoidable
              that the symbolic identifiers will pop up somewhere and they will
              be human-readable and thus redundant.
              
              > I am bit puzzled if this saving really makes meaningful
              difference?!
              
               They make a somewhat misleading claim - "37.5% space efficient
               against UTF-8" - but that doesn't tell us how much gain it is
               on a typical protocol interaction. It could be a 37.5%
               improvement on
              0.1% of all data or on 50% of all data - depending on that, the
              overall effect could vary drastically. I guess if they are doing
              it, it makes some sense for them, but it's hard to figure out how
              much it saves in the big picture.
       
          chaokunyang wrote 1 day ago:
           Meta string is not designed for encoding arbitrary strings; it is
           used to encode a limited, enumerated set of strings only, such as
           classname/fieldname/packagename/namespace/path/modulename. This is
           why we named it meta string: it is used in cases where lossless
           statistical compression can't provide a gain.
       
        vbezhenar wrote 1 day ago:
        What about using gzip on messages? I'd expect for it to yield similar
        savings while keeping message format simpler.
       
          chaokunyang wrote 1 day ago:
           Applying gzip to the whole stream will introduce extra CPU cost,
           and we try to make our serialization implementation as fast as
           possible. If we apply gzip only to such a small meta string, it
           emits a bigger size; I believe some stats are written into the
           data.
       
          masklinn wrote 1 day ago:
          You'd never use gzip for that, it has two dozen bytes of overhead
          before you can even start building a deflate stream. Even a raw
          deflate stream might not recoup the overhead on values which are only
          a few dozen bytes at the most, and if you're not using a static
          huffman tree the overhead for the tree is going to end all
          possibility of a gain.
          
          Furthermore LZ constructions work off of redundancy, which is
          difficult to find in large amounts in such short strings. The
          overhead of framing will just increase the size of the payload after
          construction.
          
          You could define just a static huffman tree and make that part of
          your spec.
       
            chaokunyang wrote 1 day ago:
             Yes, totally agree. I tried gzip first, but it introduces about
             10~20 bytes of overhead up front, so I proposed this meta
             encoding here. For a much bigger string gzip would be better, but
             all the strings we want to compress here are just
             classname/fieldname/packagename/namespace/path/modulename. We
             don't have enough repeated patterns for gzip to work with.
       
              Dylan16807 wrote 1 day ago:
              If you rip off the gzip headers and footers, DEFLATE applied to a
              small string only has 10 bits of overhead, 2 bytes with padding.
              
              The default Huffman table is not suited for this use case, and a
              Huffman table in general is a waste of space these days, but the
              overhead is very small and can be reduced to nothing if you know
              the original string length.
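               
               In Java terms (a sketch), java.util.zip.Deflater can emit
               exactly that raw stream: constructing it with nowrap=true drops
               the zlib/gzip framing, leaving only the block overhead.
               
                   static byte[] rawDeflate(byte[] input) {
                       // true = raw DEFLATE, no header or trailing checksum
                       java.util.zip.Deflater d = new java.util.zip.Deflater(
                               java.util.zip.Deflater.BEST_COMPRESSION, true);
                       d.setInput(input);
                       d.finish();
                       byte[] buf = new byte[input.length + 64]; // slack for incompressible input
                       int n = d.deflate(buf);
                       d.end();
                       return java.util.Arrays.copyOf(buf, n);
                   }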
       
        eps wrote 1 day ago:
        It's a compression via domain-specific frequency encoding.
        
        Good, but that's a pretty common technique for cases where every bit
        counts.
       
          chaokunyang wrote 1 day ago:
           Yes, in many rpc/serialization systems, we need to serialize class
           names, field names, and package names so we can support dynamic
           deserialization and type forward/backward compatibility. In such
           cases, this meta may take more bytes than the actual data; this is
           where meta string brings gains.
       
        vlovich123 wrote 1 day ago:
        When doing ad-hoc space optimized encodings, it’s usually a good idea
        to compare the compress+decompress speed in addition to the resultant
        size against just compressing the data. This may win on one-off
        encodings of strings within an RPC, but the compression mechanism
        probably wins on the overall RPC message unless the message is very
        small & you can’t build a more optimized baseline dictionary for
        small messages.
        
        It’s really hard to actually beat something like zstd with these
        ad-hoc compression since both the compressed size will be bigger than
        what zstd produces with a dictionary or on larger messages without a
        dictionary AND the speed of a round trip through zstd is likely faster.
       
          chaokunyang wrote 1 day ago:
           We use meta string mainly for object serialization. Users pass an
           object to Fury, and we return a binary to users.
           In such cases, the serialized binary is mostly 200~1000 bytes --
           not big enough for zstd to work.
           
           For dictionary encoding, Fury uses a similar technique too. When a
           string first comes up, we encode it as binary; later writes in the
           same object graph will write that string as a varint id.
           
           Zstd can be used with Fury too, for multiple serializations across
           different objects. Fury doesn't have a context for compression
           unless users enable Fury's meta share mode. If meta share mode is
           not enabled, zstd can be used to compress data across multiple
           object graph serializations.
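           
           The dictionary encoding works roughly like this (a simplified
           sketch, not Fury's exact wire format):
           
               private final java.util.Map<String, Integer> written =
                       new java.util.HashMap<>();
           
               void writeMetaString(java.io.DataOutputStream out, String s)
                       throws java.io.IOException {
                   Integer id = written.get(s);
                   if (id != null) {
                       out.writeByte(1);   // flag: back-reference to earlier string
                       out.writeInt(id);   // a varint in the real format
                   } else {
                       written.put(s, written.size());
                       byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
                       out.writeByte(0);   // flag: literal bytes follow
                       out.writeInt(bytes.length);
                       out.write(bytes);
                   }
               }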
       
            vlovich123 wrote 1 day ago:
            > In such cases, the serialized binary are mostly in 200~1000
            bytes. Not big enough for zstd to work
            
            You're not referring to the same dictionary that I am. Look at
            --train in [1].
            
            If you have a training corpus of representative data, you can
            generate a dictionary that you preshare on both sides which will
            perform much better for very small binaries (including 200-1k
            bytes). It's the same kind of technique but zstd's mechanism is
            absolutely general whereas purpose-built non-entropy based
            dictionaries are more limited & less likely to outperform.
            
            If you want maximum flexibility (i.e. you don't know the universe
            of representative messages ahead of time or you want maximum
            compression performance), you can gather this corpus transparently
            as messages are generated & then generate a dictionary & attach it
            as sideband metadata to a message. You'll probably need to defer
            the decoding if it references a dictionary not yet received (i.e.
             the sender delivers messages out of order relative to
             generation). There are
            other techniques you can apply, but the general rule is that your
            custom encoding scheme is unlikely to outperform zstd + a
            representative training corpus. If it does, you'd need to actually
            show this rather than try to argue from first principles.
            
 (HTM)      [1]: https://github.com/facebook/zstd/blob/dev/programs/zstd.1....
       
              chaokunyang wrote 8 hours 5 min ago:
               Sadly, we don't have such a training corpus of representative
               data. Fury is just a serialization framework; we can't assume
               any string distribution. I thought about scanning the code of
               Apache OFBiz and using the domain objects in that project as
               the default corpus for a static huffman/zstd dictionary, but
               OFBiz may still not be representative.
               On your second suggestion, I can't agree more. Fury has
               actually already implemented this; we call it meta share mode.
               Fury will write such meta only once on a channel, and resend it
               if the channel gets disconnected. But this is not easy to use
               and imposes overhead on users, so we proposed meta encoding
               here. Anyway, your suggestions are very insightful. Thanks very
               much.
       
              irq-1 wrote 1 day ago:
              Wow. Didn't know about Train and custom dictionaries. Very cool.
       
          chaokunyang wrote 1 day ago:
           It's not designed to beat zstd; they are used for different
           scenarios. zstd is used for data compression; the meta string
           encoding here is used for metadata encoding. The metadata here is
           mostly classname/fieldname/packagename/namespace/path/modulename.
           For such small strings, no statistical compression will produce a
           smaller size.
       
            Dylan16807 wrote 1 day ago:
            > For such small string, any statistical compression won't have a
            smaller size.
            
            You can hardcode expected frequencies and throw arithmetic encoding
            at it and the average size will probably drop a meaningful amount.
            
            And I can't easily find an example corpus, but the description of
            these strings sounds like they'd often have repetition, so another
            symbol to encode repetition and make this into a normal compression
            algorithm is probably worth it.
            
             I wonder how many of these strings start with org.apache
       
            taeric wrote 1 day ago:
             I'm curious about this.  Why wouldn't statistical compression go
             smaller?  Assuming you restrict the statistics in question to be
             tailored to the metadata that is being compressed, it feels like
             it should be comparable?  Yes, you may need/want to store a
             precomputed dictionary at the receiving end so that you aren't
             sending it every time, but that is really no different from
             having custom code that you need at the receiving end.  Right?
       
              chaokunyang wrote 7 hours 50 min ago:
               The thing is that we can't store a precomputed dictionary. Fury
               is just a serialization framework; we don't know what data will
               be serialized.
       
                taeric wrote 7 hours 43 min ago:
                If you can store new code, you can get a more optimized
                dictionary for what you are doing, though?  Why not?
       
            vlovich123 wrote 1 day ago:
            > For such small string, any statistical compression won't have a
            smaller size
            
             That is a statement you can make, but you have to actually
             demonstrate it is true against zstd's --train mechanism [1],
             which generates a more optimized "small string" dictionary based
             on representative training data precisely for this use-case.
            
            > those are used for different scenarios. zstd is used for data
            compression, the meta string encoding here is used for meta meta
            encoding
            
            Not sure what this means. The motivating use-case described for the
            alternate string encoding is precisely compression.
            
 (HTM)      [1]: https://github.com/facebook/zstd/blob/dev/programs/zstd.1....
       
              chaokunyang wrote 16 hours 5 min ago:
               Dictionary encoding is already used in Fury: we encode the same
               string as a varint when it's seen again later. The thing here
               is that in many cases the string doesn't repeat. Imagine you
               send `record Point(int x, int y)` in an RPC. You have one
               'Point' to send, so no dictionary encoding can be used in such
               cases.
       
                vlovich123 wrote 9 hours 23 min ago:
                You seem to be fundamentally misunderstanding how the zstd
                dictionary works. It’s a dictionary at the statistical level.
                Your `record Point(int x, int y)` RPC would statistically at
                the bit level have to look like no other message for it to have
                no help. That seems unlikely since there’s all sorts of
                framing. And I question the mindset of ad-hoc compression
                schemes that are trying to shrink the long tail of “sent only
                once” messages.
                
                zstd’s approach is fundamentally very different from a basic
                interning dictionary. Seriously, there’s been tons of
                research on this. It’s also interesting that Fury claims to
                 be 0-copy and then does all this ad-hoc field compression,
                 which is the literal definition of not 0-copy (& yes, varint
                encoding is not 0-copy and neither is this custom string
                compression).
                
                Anyway, unless you actually do a proper comparison against zstd
                with a trained dictionary against a representative corpus,
                we’re just going in circles with me saying “you need to
                evaluate your ad-hoc compression against zstd” and you trying
                to argue from first principles that the custom compression
                scheme wins.
       
                  chaokunyang wrote 7 hours 51 min ago:
                   Hi, it's not “sent only once”; it's millions of
                   “sent only once” messages.
                   
                   The thing here is that all those RPCs are stateless, so the
                   context for compression is the one message itself, i.e.
                   compressing an object like `Point(1,2)` alone, without
                   prior knowledge. Users may train a static zstd dictionary,
                   but Fury can't; Fury doesn't know what data will be
                   serialized.
                   
                   And meta string encoding is not statistical; it won't beat
                   zstd. It's just for cases where zstd is not suitable.
       
        bawolff wrote 1 day ago:
        Seems like they are just using a custom 5 bit charset for strings with
        a limited alphabet.
        
        That's not really a rethink of string encoding, just using a different
        fixed encoding.
        
         Given the use case, I'm not sure I understand why they don't just
         use static huffman.
       
          smsm42 wrote 1 day ago:
          It looks like it's the same idea as static Huffman without the
           variable length part. Comparing it to UTF-8 is pointless, of
           course - UTF-8 is a universal text representation (whose major
           selling point is that it's also ASCII-compatible and
           human-readable), and this is pretty much a compression algorithm.
           It is obvious that natural-language text is redundant, so no wonder
           it's compressible, and I am sure a lot of literature exists on
           which compression works best on which kinds of texts.
       
          chaokunyang wrote 1 day ago:
           We first considered using some lossless compression algorithms,
           but since we are just encoding meta strings such as
           classname/fieldname/packagename/path, there isn't much duplication
           to detect and the compression won't have a very good effect. If we
           were encoding generic strings, compression algorithms such as
           zstd/deflate would be better, but that's not the case here. This
           is why we named it meta string.
       
          chaokunyang wrote 1 day ago:
           We tried huffman first, but we can't know which strings will be
           serialized. To use huffman, we must collect most strings and
           compute a static huffman tree. If not, we must send the huffman
           stats to the peer, and that cost will be bigger.
       
            topaz0 wrote 1 day ago:
            The fact that the example is "org.apache.fury.benchmark.data" makes
            me think that there is probably a lot of room to do better than 5
            bits per char, if the goal is simply to reduce space. E.g. you
            probably know ahead of time that the string "org.apache.fury" is
             going to be very common in this context, and giving that a 2-bit
             or even an 8-bit code would be a huge space savings across the
             project. That said, it's not super surprising that you'd rather
            have a suboptimal encoding than have the complexity of a decoder
            that depends on the project. But of course you've already chosen to
            sacrifice simplicity for the sake of space to some degree by
            rolling your own encoding when you could use readily agreed-upon
            ones, which people assume is because space is at a premium. It's
             not surprising that a reader wonders why this space savings is
             worth the complexity while that other space savings is not. I
             think what's missing is a justification of the tradeoffs, at
             least in the blog post: both the space vs. time tradeoffs, which
             I can imagine are important here, and the space vs. complexity
             tradeoffs, which are more subjective.
       
              chaokunyang wrote 1 day ago:
               We have such options in Fury too. We support registering
               classes; in such cases, the class name string will be encoded
               with a varint id. It's a kind of dict encoding. But not all
               users like to register classes. Meta string tries to reduce
               the space cost when such dict encoding is not available.
               
               Your suggestions are great. I should clarify why dict encoding
               can't always be used to reduce the cost.
               
               As for performance, it won't be an issue. The strings for meta
               encoding are limited, so we can cache the encoded results;
               it's not on the critical path.
       
            masklinn wrote 1 day ago:
            > If using huffman, we must collect most strings and compute a
            static huffman tree.
            
            Isn't that essentially what you've done to define your bespoke
            alphabets / partial encodings?
       
              chaokunyang wrote 1 day ago:
               The thing is that we don't know the frequency with which every
               char occurs.
       
                ryan-c wrote 1 day ago:
                You don't need to. You compute a static huffman tree over fixed
                probabilities you've estimated or generated from a corpus.
       
                  chaokunyang wrote 1 day ago:
                   Good suggestion; I was planning to offer it as an option.
                   We can crawl some GitHub repos, extract most types along
                   with their class names, paths, and field names, and use
                   such data as the corpus.
       
                    ryan-c wrote 1 day ago:
                    Crude static Huffman example, that could definitely be
                    improved:
                    
                          5 bits 0 0000-1111      acde fghi lmno rstu
                          7 bits 10 00000-11111   bjkp qvwx yzAB CDEF GHIK LMNO PRST UVWY
                          7 bits 110 0000-1111    JQXZ 0123 4567 89/.
                          8 bits 111 00000-11110  !@#$ $%^& *()_ {}[] `~+= |\"' ;:<> ,?  (space at the end)
                         16 bits 11111111 00000000-11111111 plus UTF-8   1 to 256 unicode characters encoded as UTF-8
                    
                    You could even include some bigrams in this sort of table.
                    
                    There's some code here that could maybe be used for that
                    sort of static huffman tree: [1] Alternatively, have
                    something like this:
                    
                         00 000000-111111  1-64 characters, 5 bits each
                         01 000000-111111  1-64 characters, 6 bits each
                         10 000000-111111  1-64 characters, ~6.4 bits each (character set of 84 characters, every 5 packed into 32 bits)
                         11 000000-111111  1-64 characters, UTF-8
                    
                    This is vaguely similar to how data is encoded for QR
                    codes.
                    
                    As pointed out elsewhere, this will not outperform zstd
                    with a dictionary trained on a corpus, but zstd would
                    require pulling in a lot of code.
                    
 (HTM)              [1]: https://github.com/ryancdotorg/lztp/blob/main/gene...
       
                      masklinn wrote 1 day ago:
                      > This is vaguely similar to how data is encoded for QR
                      codes.
                      
                       Doesn't QR use variable static encodings (alphabets)?
                       e.g. mode 1 is numeric, mode 2 is uppercase
                       alphanumeric, mode 3 is binary, and mode 4 is JIS
                       (Japanese).
       
                        ryan-c wrote 1 day ago:
                        QR codes use a series of (mode, length, characters)
                        segments - a given code can switch between modes. I
                        think for QR codes the length encoding is
                        mode-specific.
       
        agilob wrote 1 day ago:
         If you're interested in this topic, there are more string compression
         algorithms that save data transfer or memory in different ways. Look
         into Thrift Compact Protocol, MessagePack, Protocol Buffers and Avro.
       
          chaokunyang wrote 1 day ago:
           And Fury meta string is not used to compress the data plane; it's
           used to compress the meta inside the data plane.
       
          chaokunyang wrote 1 day ago:
           Protobuf only supports varint encoding, and Fury supports that
           too. But AFAIK, protobuf uses utf-8 for string encoding. Could you
           share more details on how protobuf compresses strings?
       
        chaokunyang wrote 1 day ago:
         For implementation details, [1] can be taken as an example
        
 (HTM)  [1]: https://github.com/apache/incubator-fury/blob/main/java/fury-c...
       
        chaokunyang wrote 1 day ago:
         For spec details, please see the Fury meta string spec:
        
 (HTM)  [1]: https://fury.apache.org/docs/specification/fury_xlang_serializ...
       
        chaokunyang wrote 1 day ago:
         In rpc/serialization systems, we often need to send
         namespace/path/filename/fieldName/packageName/moduleName/className/
         enumValue strings between processes.
         
         Those strings are mostly ASCII strings. In order to transfer them
         between processes, we encode them using utf-8, which takes one byte
         for every char and is not actually space efficient.
         
         If we take a deeper look, we will find that most chars are lowercase
         letters, `.`, `$` and `_`, which can be expressed within a much
         smaller range of 32 values. But one byte can represent the range
         0~255, so the significant bits are wasted, and this cost is not
         negligible. In a dynamic serialization framework, such meta can take
         considerable space compared to the actual data.
         
         So we proposed a new string encoding, which we call meta string
         encoding, in Fury. It encodes most chars using 5 bits instead of the
         8 bits used by utf-8, which brings a 37.5% space saving compared to
         utf-8.
         
         For strings that can't be represented in 5 bits, we also proposed a
         6-bit encoding, which brings a 25% space saving.
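         
         As a rough illustration (the alphabet below is an assumption for the
         sketch; the real spec defines the exact charsets plus a header):
         
             // Illustrative 5-bit alphabet: 26 lowercase letters plus a few
             // separators.  Values 0..29 fit in 5 bits.
             static final String LOWER5 = "abcdefghijklmnopqrstuvwxyz._$|";
         
             static byte[] encodeMeta5(String s) {
                 byte[] out = new byte[(s.length() * 5 + 7) / 8];
                 int bit = 0;
                 for (int i = 0; i < s.length(); i++) {
                     int v = LOWER5.indexOf(s.charAt(i));
                     if (v < 0) throw new IllegalArgumentException("fall back to utf-8");
                     for (int b = 4; b >= 0; b--, bit++) {  // write 5 bits, MSB first
                         if (((v >>> b) & 1) != 0) {
                             out[bit >> 3] |= 1 << (7 - (bit & 7));
                         }
                     }
                 }
                 return out;
             }
         
         For example, "org.apache.fury.benchmark.data" is 30 chars, so 150
         bits, which rounds up to 19 bytes instead of 30 bytes in utf-8.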
       
          masfuerte wrote 1 day ago:
          You don't include '/' in the list of special characters so how are
          you using this to encode paths?  As an array of strings?
       
            chaokunyang wrote 1 day ago:
             We can use `|` instead; `|` is not used in paths. Currently we
             have only 2 chars that can be extended. I haven't decided what
             to include.
       
       
 (DIR) <- back to front page