[HN Gopher] Elfshaker: Version control system fine-tuned for bin...
___________________________________________________________________

Elfshaker: Version control system fine-tuned for binaries

Author : jim90
Score  : 466 points
Date   : 2021-11-19 12:41 UTC (10 hours ago)

| mrich wrote:
| I'm guessing this does not yield that high compression for release builds, where code can be optimized across translation units? Likewise when a commit changes a header that is included in many cpps?
| peterwaller-arm wrote:
| Author here. The executables shipped in manyclangs are release builds! The catch is that manyclangs stores object files pre-link. Executables are materialized by relinking after they are extracted with elfshaker.
|
| The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).
|
| As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.
| zeotroph wrote:
| Thank you for the explanation; so the pre-link storage is one of the magical ingredients. Maybe mention this as well in the README?
|
| Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?
| peterwaller-arm wrote:
| > maybe mention this as well in the README?
|
| We've tweaked the readme, I hope it's clearer.
|
| It would be great to provide this for gcc too. The project is new and we've just started out. I know less about gcc's build system and how hard it will be to apply these techniques there. It seems as though it should be possible, though, and I'd love to see it happen.
|
| To infer the object->executable dependencies we currently read the compilation database and produce a stand-alone link.sh shell script, which gets packaged into each manyclangs snapshot.
| zeotroph wrote:
| Ah, the compilation database is where more magic originates from :)
| peterwaller-arm wrote:
| Yes, this is less great than I would like! :( :)
| mrich wrote:
| Thanks. I had a use case in mind where LTO is enabled. Unfortunately the LTO step is quite expensive, so relinking does not seem like a viable option. If I find some time I'll give it a try though.
| peterwaller-arm wrote:
| ThinLTO can be pretty quick if you have enough cores, so it might work. I'm not sure how well the LTO objects compress against each other when you have small changes to them, but it might work reasonably.
|
| manyclangs is optimized to provide you with a binary quickly. The binary is not necessarily itself optimized to be fast, because it's expected that a developer might want to access any version of it for the purposes of testing whether some input manifests a bug or has a particular codegen output. In that scenario, it's likely that the developer is able to reduce the size of the input such that the speed of the compiler itself is not terribly significant in the overall runtime. Therefore, I don't see LTO for manyclangs as such a significant win. But it is still hoped that the overall end-to-end runtime is good, and the binaries are optimized, just not with LTO.
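
The compile-and-relink scheme described above can be sketched with stock clang; the file names here are illustrative, and this is not manyclangs' actual link.sh, which is generated from the compilation database:

    # Per-function/per-data sections keep edits local to a section
    # instead of shifting addresses across the whole binary:
    clang++ -c -O2 -ffunction-sections -fdata-sections foo.cpp -o foo.o
    clang++ -c -O2 -ffunction-sections -fdata-sections bar.cpp -o bar.o

    # After extracting the stored objects, materialize the executable;
    # --gc-sections drops unreferenced sections at link time:
    clang++ foo.o bar.o -Wl,--gc-sections -o prog
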
| nh2 wrote:
| I experimented with something similar with a Linux distribution's package binary cache.
|
| Using `bup` (a deduplicating backup tool using the git packfile format) I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.
|
| Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.
|
| Details: https://github.com/NixOS/nixpkgs/issues/89380
| peterwaller-arm wrote:
| Author here. I've used bup, and elfshaker was partially inspired by it! It's great. However, during initial experiments on this project I found bup to be slow, taking quite a long time to snapshot and extract. I think this could in principle be fixed in bup one day.
| Siira wrote:
| Is elfshaker any good for backing up non-text data?
| ybkshaw wrote:
| Thank you for having such a good description of the project! Sometimes the links from HN lead to a page that takes a few minutes of puzzling to figure out what is going on, but not yours.
| nh2 wrote:
| I have also used bup for a long time, but found that for very large server backups I'm hitting performance problems (both in time and memory usage).
|
| I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but is younger and thus lacks some features.
|
| Here is somebody's benchmark of bupstash (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/
|
| The `bupstash` author is super responsive on Gitter/Matrix; it may make sense to join there to discuss approaches/findings together.
|
| I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or also other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that works with huge files and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating snapshot-based saving.
| pdimitar wrote:
| I used `bupstash` and evaluated it for a while. I am looking to do 5+ offsite backups of a small personal directory to services that offer 5GB of cloud space for free.
|
| `bupstash` lacked good compression. I settled with `borg` because I could use `zstd` compression with it. Currently at 60 snapshots of the directory, the `borg` repo directory is at ~1.52GB out of the 5GB quota. The source directory is ~12.19GB uncompressed. Very happy with `borg` + `zstd` and how they handle my scenario.
|
| I liked `bupstash` a lot, and the author is responsive and friendly. But I won't be giving it another try until it implements much more aggressive compression compared to what it can do now. It's a shame, I _really_ wanted to use it.
|
| I do recognize that for many other scenarios `bupstash` is very solid though.
| veselink1 wrote:
| An author here; we've opened a Q&A discussion on GitHub: https://github.com/elfshaker/elfshaker/discussions/58
| thristian wrote:
| This seems very much like the Git repository format, with loose objects being collected into compressed pack files - except I think Git has smarter heuristics about which files are likely to compress well together. It would be interesting to see a comparison between this tool and Git used to store the same collection of similar files.
| peterwaller-arm wrote:
| An author here, I agree! The packfile format is heavily inspired by git, and git may also do quite well at this.
|
| We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster, and with smaller output, than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.
|
| On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.
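
The measurement peterwaller-arm suggests could look roughly like this; the snapshot names are illustrative, and the elfshaker subcommands follow the project's usage guide, so treat the exact syntax as a sketch:

    mkdir compare && cd compare && git init -q
    # For each snapshot: materialize the files, then commit them to git.
    elfshaker extract snapshot-1 && git add -A && git commit -qm snapshot-1
    elfshaker extract snapshot-2 && git add -A && git commit -qm snapshot-2
    # Let git delta-compress everything, then compare sizes and timings:
    git gc --aggressive
    du -sh .git/objects/pack
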
| johnyzee wrote:
| As far as I was able to learn (don't remember the details, sorry), git does not do well with large binary files. I believe it ends up with a lot of duplication. It is the major thing I am missing from git; currently we store assets (like big PSDs that change often) outside of version control, and it is suboptimal.
| peterwaller-arm wrote:
| Performing poorly with non-textual data happens for a number of reasons. Binary data, when changed, often has a lot of 'non-local' changes in it. For example, a PSD file might well have a compression algorithm already applied to it. An insertion/deletion is going to result in a very different compressed representation for which there is no good way to have an efficient delta. elfshaker will suffer the same problem here.
| JoshTriplett wrote:
| Can you talk a bit more about what ELF-specific heuristics elfshaker uses? What kind of preprocessing do you do before zstd? Do you handle offsets changing in instructions, like the BCJ/BCJ2 filter? Do you do anything to detect insertions/deletions?
| peterwaller-arm wrote:
| We've just added an applicability section, which explains a bit more about what we do. We don't have any ELF-specific heuristics [0].
|
| https://github.com/elfshaker/elfshaker#applicability
|
| In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section-relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit-identical object files, even though the build system often wants to rebuild them because some input has changed.
|
| elfshaker packs use a heuristic of sorting all unique objects by size before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression.
|
| [0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)
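
A rough shell approximation of that heuristic, assuming GNU find/xargs, paths without whitespace, and ignoring the index the real pack format keeps so files can be located again: sort the unique objects by size so similar files sit near each other, then compress the concatenation as a single stream.

    find objs/ -type f -printf '%s\t%p\n' \
      | sort -n | cut -f2- \
      | xargs cat | zstd -19 --long -o pack.zst
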
| JoshTriplett wrote:
| Ah, I see! Makes sense that you can do much better if you get to compile the programs with your choice of options.
| derefr wrote:
| One could, in theory, write a _git-clean_ filter (like the one used for git-lfs) that teaches git various heuristic approaches to "take apart" well-known binary container formats into trees of binary object leaf-nodes.
|
| Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree -- sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.
|
| On checkout, this process would happen in reverse: a matching _git-smudge_ filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.
|
| Since most of the _generated leaf-nodes_ from this process wouldn't change on each commit, this would eliminate most of the _storage_ overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash].)
|
| I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
| ChrisMarshallNY wrote:
| _> we store assets (like big PSDs that change often) outside of version control and it is suboptimal._
|
| Perforce is still used by game developers and other creatives, because it handles large binaries quite well.
|
| In fact, I'm not sure if they still do it, but one of the game engines (I think, maybe, Unreal) used to have a free tier that also included a free Perforce install.
| mdaniel wrote:
| It was my recollection, and I confirmed it, that they've almost always had a "the first hit is free" model for small teams, and they also explicitly call out indie game studios as getting free stuff too: https://www.perforce.com/how-buy
| 3np wrote:
| Do you think it would be feasible to do a git-lfs replacement based on elfshaker?
|
| Down the line maybe it would even be possible to have binaries as "first-class" (save for diff, I guess).
| londons_explore wrote:
| I'd like to see a version of this built into things like IPFS.
|
| It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.
| hcs wrote:
| It should be possible to do this in IPFS already if you use the go-ipfs --chunker option with a content-sensitive chunking algorithm like rabin or buzhash [1]. With this there's a good chance that a file with small changes from something already on IPFS will have some chunks that hash identically, so they'll be shared.
|
| [1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
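
For reference, go-ipfs exposes this directly; with a content-defined chunker, a lightly edited file shares most of its chunks with the copy already stored:

    ipfs add --chunker=buzhash big-binary
    ipfs add --chunker=rabin-262144 big-binary   # rabin, ~256 KiB average chunks
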
| londons_explore wrote:
| But that isn't quite as good as something like this that can 'understand' diffs in files, rather than simply relying on the fact that a bunch of bytes in a row might be the same.
| hcs wrote:
| I don't think elfshaker actually does any binary diffing (e.g. xdelta or bsdiff). It works well because it uses pre-link objects which are built to change as little as possible between versions. Then, when it compresses similar files together in a pack, Zstandard can recognize the trivial repeats.
| peterwaller-arm wrote:
| Author here. This is correct. We set out to do binary diffing, but we soon discovered that if you put similar enough object files together in a stream, and then compress the stream, zstandard does a fantastic job of compressing and decompressing quickly with a high compression ratio. The existing binary diffing tools can produce small patches, but they are relatively expensive both to compute the delta and to apply the patches.
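
This effect is easy to observe with stock zstd: compressing two similar object files concatenated costs barely more than compressing one alone (paths illustrative; --long widens the match window for large inputs):

    zstd -19 -o one.zst build-a/foo.o
    cat build-a/foo.o build-b/foo.o | zstd -19 --long > both.zst
    ls -l one.zst both.zst   # both.zst should be only slightly larger
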
| mal10c wrote:
| This project reminded me of something I've been looking for for a while - although it's not exactly what I'm looking for...
|
| I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and, worst case, the engineers can't move forward.
|
| In that light, I'd love to switch to another option. Preferably something decentralized, just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."
|
| All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?
|
| I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770
| kvnhn wrote:
| Maybe not precisely what you want, but I built a CLI tool[1] that's like a simplified and decoupled Git-LFS. It tracks large files in a content-addressed directory, and then you track the references to that store in source control. Data compression isn't a top priority for my tool; it uses immutable symlinks, not archives.
|
| [1]: https://github.com/kevin-hanselman/dud
| erichocean wrote:
| Seems like the Nix people would be interested in enabling this kind of thing for Nix packages...
| lxpz wrote:
| This should be integrated with Cargo to reduce the size of the target directories, which are becoming ridiculously large.
| peterwaller-arm wrote:
| Author here. I'm unsure whether this would apply very well to cargo or not. If it has lots of pre-link object files, then maybe.
| lxe wrote:
| > There are many files,
| > Most of them don't change very often so there are a lot of duplicate files,
| > When they do change, the deltas of the [binaries] are not huge.
|
| We need this but for node_modules
| ithkuil wrote:
| The novel trick here is splitting up huge binary files and treating them as if they were many small files.
|
| node_modules is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like ELF sections/symbols, etc.)
| i_like_waiting wrote:
| Thanks, seems like this could be a good solution for storing daily backups of a DB. I didn't know I needed it, but it seems like I do.
| phil294 wrote:
| Have a look at Borg; it handles incremental backups very well.
| peterwaller-arm wrote:
| Author here. This software is young; please don't use it for backups!
|
| But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.
| wpietri wrote:
| In this age of rampant puffery, it's so... soothing to see somebody be positive and frank about the limits of their creation. Thanks for this and all your comments here!
| peterwaller-arm wrote:
| <3
| the_duke wrote:
| Borg, bup, and restic are relatively popular incremental backup tools that deduplicate with chunking.
| goodpoint wrote:
| I'm surprised nobody mentioned git-annex. It does the same using git for metadata. It's extremely efficient.
| kristjansson wrote:
| AFAIK, git-annex doesn't address sub-file deduplication/compression at all; it just stores a new copy for each new hash it sees? I suppose that content-addressed storage, combined with the pre-link strategy discussed elsewhere for the related manyclangs project, would produce similar, if less spectacular, results?
| jankotek wrote:
| Does it make sense to turn it into a FUSE FS, with transparent deduplication?
| peterwaller-arm wrote:
| Author here. Maybe; it's a fun idea. I have toyed with providing a FUSE filesystem for access to a pack, but my time for completing this is limited at the moment.
| nh2 wrote:
| Many packfile-deduplicating backup tools (bup, kopia, borg, restic) can mount the deduplicated storage as FUSE.
|
| It might make sense to check how they do it.
|
| I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.
|
| Did you know of their existence when making elfshaker?
|
| Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...
| peterwaller-arm wrote:
| (Copying from the Q&A) Before starting out some time ago, I did some experiments with bup. I had a good experience with bup and high expectations for it. However, I found that quite a lot of performance was left on the table, so I was motivated to start elfshaker. Unfortunately that time has passed, so I don't have scientific numbers measured against other software at this time.
|
| As an idea of how elfshaker performs, we see ~300ms to create a snapshot for clang, and seconds-to-a-minute to create a binary pack containing thousands of revisions. Extraction takes less than a second. One difference of elfshaker compared with some other software I tested is that we do the compression and decompression in parallel, which can make a very big difference on today's many-core machines.
| mhx77 wrote:
| Somewhat related (and definitely born out of a very similar use case): https://github.com/mhx/dwarfs
|
| I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.
| pdimitar wrote:
| That's super impressive, I will definitely give it a go. Thanks for sharing!
| peterwaller-arm wrote:
| Nice, I bet dwarfs would do well at our use case too. Thanks for sharing.
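
The dwarfs workflow mentioned above is a two-step build-and-mount; commands as documented by the dwarfs project, with illustrative paths:

    mkdwarfs -i perl-installs/ -o perl.dwarfs   # build the compressed image
    dwarfs perl.dwarfs /mnt/perl                # FUSE-mount it read-only
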
| tttsxhub wrote:
| Why does it depend on the CPU architecture?
| peterwaller-arm wrote:
| (Disclosure: I work for Arm; opinions are my own.)
|
| Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.
|
| manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.
|
| You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.
|
| [0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
| henvic wrote:
| Interesting. I wonder if this can also be [ab]used to, say, deliver deltas of programs, so that you can have faster updates, but maybe it doesn't make sense.
|
| https://en.wikipedia.org/wiki/Binary_delta_compression
| peterwaller-arm wrote:
| Author here, I don't think it would apply well to that scenario. elfshaker is good for manyclangs, where we ship 2,000 revisions in one file (pack), so the cost of an individual revision is amortized. If one build of llvm+clang costs you some ~400 MiB, a single elfshaker pack containing 2,000 builds has an amortized cost of around 40kiB/build. But this amazing win is only happening because you are shipping 2,000 builds at once. If you wanted to ship a single delta, you couldn't compress against all the other builds.
| necovek wrote:
| How fast would it be to get a delta between any two of the 2,000 builds in a single elfshaker pack?
|
| If that's reasonably fast, perhaps an approach like that could work: the server stores the entire pack, but upon user request extracts a delta between the user's version and the target binary.
|
| Still, the devil is in the details of building all revisions of all software a single distribution has.
| peterwaller-arm wrote:
| Yes, you could do that. On the other hand, all revisions for a month is 100MiB, and all revisions we've built spanning 2019-now are a total of 2.8GiB, so we opted to forego implementing any object negotiation and just say 'you have to download the 100MiB for the month to access it'. I think a push/pull protocol could be implemented, but at that point git might do a reasonable job :)
| henvic wrote:
| Thank you for the insight!
| wlll wrote:
| Related, and impressive: https://github.com/elfshaker/manyclangs
|
| > manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.
|
| > It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.
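
What "running any build in ~4s" looks like in practice, as a sketch: the pack and snapshot names are placeholders, and the link.sh invocation follows the description earlier in this thread rather than authoritative documentation:

    elfshaker extract <pack>:<snapshot>   # unpack one build's object files
    ./link.sh clang                       # relink just the clang binary
    ./bin/clang --version                 # a runnable compiler, seconds later
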
| Tobu wrote:
| The clever idea that makes manyclangs compress well is to store object files before they are linked, with each function and each variable in its own ELF section, so that changes are mostly local; addresses indirect through sections, and a change to one item won't cascade into moving every address.
|
| I'm not sure the linking step they provide is deterministic/hermetic; if it is, that would prove a decent way to compress the final binaries while shaving most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries if so?
|
| I'm not seeing any particular tricks done in elfshaker itself to enable this; the packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics
|
| I'd like to see a comparison with Git and the others listed here (same unlinked clang artifacts; compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...
| peterwaller-arm wrote:
| Author here, I'd like to see such a comparison too, actually, but I'm not in a position to do the work at the moment. We did some preliminary experiments at the beginning, but a lot changed over the course of the project, and I don't know how well elfshaker ultimately fares against all the options out there. Some basic tests against git found that git is quite a bit slower (10s vs 100ms) during 'git add' and 'git checkout'. Maybe that can be fixed with some tuning or finding appropriate options.
| perth wrote:
| Reminds me of how Microsoft packages the Windows installer, actually. If you've ever unpacked Microsoft's install.esd, it's kind of insane how heavily it's compressed. I assume it's full of a lot of stuff that provides semi-redundant binaries for compatibility with a lot of different systems, because the unpacked esd container goes from a few GiB to, I think, around 40-50 GiB, iirc.
| derefr wrote:
| The emulation community also has "ROMsets" -- collections of game ROM images, where the ROM images _for a given game title_ are all grouped together into an archive. So you'd have one archive for e.g. "every release, dump, and ROMhack of Super Mario Bros 1."
|
| These ROM-set archives -- especially when using more modern compression algorithms, like LZMA/7zip -- end up about 1.1x the size of a _single one_ of the contained game ROM images, despite sometimes containing literally hundreds of variant images.
| Daishiman wrote:
| How does this work? Do all the game series use the same engine code and assets?
| bena wrote:
| Sort of. ROMhacks are modified ROM images of a certain game.
|
| If you knew where in the ROM image the level data was contained, you could modify it. As long as you didn't violate any constraints, the game would run fine.
|
| You could also potentially influence game behavior as well. The Game Genie and GameShark were kind of based on this concept. Except, being further along the chain, they could write values coming into and out of memory, so other effects were possible.
|
| So, in the case of Super Mario Bros. ROMhacks, they all use Super Mario Bros. as a base ROM. Then, from there, all you need to do is store the diff from the base.
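
bena's "store the diff from the base" is what generic binary-delta tools do; the ROM-hacking community usually distributes IPS/BPS patches, but bsdiff works as a stand-in (file names illustrative):

    bsdiff base.sfc hack.sfc hack.patch       # the delta is tiny vs. the ROM
    bspatch base.sfc rebuilt.sfc hack.patch   # base + patch => the hack again
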
| notafraudster wrote:
| I think you're slightly misinterpreting what the parent said. Take the game Super Mario World for the console Super Nintendo. It was released in Japan. It was released in the US. It was released in Europe. It was released in Korea. It was released in Australia. It was probably released in various minor regions and given unique translations. There are almost certainly re-releases of the game on Super Nintendo that issued new ROM files to correct minor bugs. Maybe there's a Greatest Hits version, which might be the same game but with an updated copyright date to reflect the re-release. This might amount to 10-12 versions of the same game, but 99.99% of what's in the ROM file is the same across all of them, so they compress together very well.
|
| A copy of Super Mario Advance 2 for Game Boy Advance, which is also a re-release of Super Mario World, almost surely uses its own engine and would not be part of the same ROM set. Likewise, other Mario games (like Mario 64, Super Mario Bros, etc.) would not be part of the same ROM set. So it's nothing about the series using the same engine code or assets.
|
| We're talking bugfixes and different regions for the same game on the same console. But this still has the effect of dropping the size of complete console collections by 50% or more, because most consoles have 2-3 regions per game for most games.
| derefr wrote:
| You're generally correct. But there are interesting exceptions!
|
| Sometimes, ROM-image-based game titles _were_ based on the same "engine" (i.e. the same core set of assembler source-files with fixed address-space target locations, and so fixed locations in a generated ROM image), but with a few engine modifications and entirely different assets.
|
| In a sense, this makes these different games effectively into mutual "full conversion ROMhacks" of one another.
|
| You'll usually find these different game titles compressed together into the _same_ ROMset (with one game title -- usually the one with the oldest official release -- being considered the prototype for the others, and so naming the ROMset), because they _do_ compress together very well -- not near-totally, the way bugfix patches do, but adding only the total amount to the archive size that you'd expect for the additional new assets.
|
| Well-known examples of this are _Doki Doki Panic_ vs. _Super Mario Bros 2_; _Panel de Pon_ vs. _Tetris Attack_; _Gradius III_ vs. _Parodius_; and any game with editions, e.g. _Pokemon_ or _Megaman Battle Network_.
|
| But there are more "complete" examples as well, where you'd never even suspect the two titles are related, with the games perhaps existing in entirely different genres. (I don't have a ROMset library on hand to dig out examples, but if you dig through one, you'll find some amazing examples of engine reuse.)
| wpietri wrote:
| Ooh, neat. I was wondering why anybody would make a binary-specific VCS. And why "elf" was in the name. This answers both questions. Thanks!
| [deleted]
| yincrash wrote:
| Could this be useful for packing Xcode's DerivedData folder for caching in CI builds?
| svilen_dobrev wrote:
| Will some of these work for (compressed) variants of audio? They're never the same...
| peterwaller-arm wrote:
| Author here. Compressed data is unlikely to work well in general, unless it never changes.
| cyounkins wrote:
| Cool! I wonder how this would compare to ZFS deduplication.
| veselink1 wrote:
| An author here. elfshaker uses per-file deduplication. When building manyclangs packs, we observed that the deduplicated content is about 10 GiB in size. After compression with `elfshaker pack`, that comes down to ~100 MiB.
|
| There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs is ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.
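
For the ZFS comparison: ZFS deduplicates at the block level, online, per dataset, whereas elfshaker deduplicates whole files at snapshot time. A minimal ZFS setup, with an illustrative dataset name (note that ZFS dedup is famously RAM-hungry):

    zfs set dedup=on tank/builds   # block-level dedup for new writes
    zpool list tank                # the DEDUP column reports the ratio
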
| bogwog wrote:
| Does this work well with image files? (PNG, JPEG, etc.)
| peterwaller-arm wrote:
| Author here, it works particularly well for our presented use case because it has these properties:
|
| * There are many files,
|
| * Most of them don't change very often,
|
| * When they do change, the deltas of the binaries are not huge.
|
| So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio somewhat equivalent to if you'd concatenated the two revisions of the file and compressed them using Zstandard.
| shp0ngle wrote:
| Ahhh, that's the key insight I have been missing, and that should be higher somewhere.
|
| Thanks
| IceWreck wrote:
| Please add these points under a use-case heading in your README.
| peterwaller-arm wrote:
| Done; hopefully this is clearer. Please let us know if you see a way to improve it further: https://github.com/elfshaker/elfshaker/pull/60
| ghoul2 wrote:
| If I already have, let's say, a 100MB pack file containing (say) 200 builds of clang, and then I import the 201st build into that pack file - is it possible to send across a small delta of this new, updated pack file to someone else who already had the older pack file (with 200 builds), such that they can apply the delta to the old pack and get the new pack containing 201 builds?
| carlmr wrote:
| I find the description a bit confusing; is there an example where we can see the usage?
| mxuribe wrote:
| Same here. There is a usage guide, which helped a tiny bit: https://github.com/elfshaker/elfshaker/blob/main/docs/users/...
|
| Honestly, I sort of looked at it for a conventional backup strategy... as in, I wonder if it could work as a replacement for tar-zipping up a directory, etc. But I'm not sure if the use case is appropriate.
| xdfgh1112 wrote:
| For backup you probably want something like Borg to handle deduplication of identical content between backups.
| peterwaller-arm wrote:
| Author here, I agree with xdfgh1112; please take care before using brand-new software to store your backups!
| mxuribe wrote:
| Yes, any time that I use something new or different (or both) for something as essential as backups, I take great and deliberate care... and test, test, test... well before standardizing on it. ;-)
| peterwaller-arm wrote:
| Author here. We'd love this to be a thing, but this is young software, so we don't recommend relying on it as your single way of doing a backup for now. Bear in mind that our main use case is for things that you can reproduce in principle (builds of a commit history; see manyclangs).
| mxuribe wrote:
| > our main use case is for things that you can reproduce in principle (builds of a commit history, see manyclangs)
|
| I appreciate your response, and thanks very much for the clarification of the use case; very helpful! Thanks also, of course, for building this!
| w0m wrote:
| My top-level takeaway being that it's a VCS (like Git) specialized for binaries, with commands baked in to prevent the slowdown that often comes with large git repositories.
| throw_away wrote:
| Specifically, it's for ELF binaries built in such a way that adding a new function or new data does not break the deduplication of existing functions/data.
|
| I wonder if this concept could be extended to other binary types that git has problems with, if you were able to know/control more about the underlying binary format.
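
For those asking about usage above, a minimal session; the subcommands follow the project's usage guide, but treat the exact names and syntax as illustrative, and heed the authors' warning not to trust young software with sole copies of data:

    elfshaker store build-1234              # snapshot the working directory
    elfshaker pack november                 # compress loose snapshots into a pack
    elfshaker extract november:build-1234   # materialize a snapshot again
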
| wyldfire wrote:
| There is an associated presentation on manyclangs at the LLVM dev meeting. I think they presented yesterday?
|
| Unfortunately it won't be uploaded until later, but it will show up on the LLVM YouTube channel:
|
| https://www.youtube.com/c/LLVMPROJ
| ot wrote:
| I would guess it's a way to quickly bisect on compiler versions.
| peterwaller-arm wrote:
| One of the authors here, thanks for the feedback. We've tried to improve it here: https://github.com/elfshaker/elfshaker/pull/59
| xpe wrote:
| Never shake a baby elf!
| 0942v8653 wrote:
| Does it do any architecture-specific processing, e.g. a BCJ filter? Or is there a generic version of this? The performance seems quite good.
| peterwaller-arm wrote:
| Author here. No architecture-specific processing currently. Most of the magic happens in zstandard (hat tip to this amazing project).
|
| Please see our new applicability section, which explains the result in a bit more detail:
|
| https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...
|
| In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!
| dilap wrote:
| Huh, interesting, could you maybe use this as an in-repo alternative to something like git-lfs?
| peterwaller-arm wrote:
| Author here, I don't currently know how this compares to git-lfs. It is possible git-lfs would perform quite well on the same inputs elfshaker works on. If git-lfs does already work well for your use case, I'd recommend using that rather than elfshaker, as it is more established.
| dilap wrote:
| Thanks for the response! I was more just curious about future possibilities vs immediate practical use.
|
| git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.
|
| If you have a lot of binary assets like artwork or huge Excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and GitHub will get angry at you for having too large a repo.
|
| But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once with the initial clone, and also, of course, just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)
|
| So you could imagine something like having elfshaker embedded in the repo and integrated as a checkout filter being a potentially useful alternative. Basically, an efficient way to store binaries directly in the repo.
|
| (Maybe it would be too narrow a band of use cases to be practical though? Obviously, if you have lots of distinct art assets, that's just going to be big, no matter what...)
| axismundi wrote:
| Does it work on Intel Macs?
___________________________________________________________________
(page generated 2021-11-19 23:00 UTC)