[HN Gopher] Elfshaker: Version control system fine-tuned for bin...
___________________________________________________________________

Elfshaker: Version control system fine-tuned for binaries

Author : jim90
Score  : 466 points
Date   : 2021-11-19 12:41 UTC (10 hours ago)

| mrich wrote:
| I'm guessing this does not yield that high compression for release builds, where code can be optimized across translation units? Likewise when a commit changes a header that is included in many cpps?
| peterwaller-arm wrote:
| Author here. The executables shipped in manyclangs are release builds! The catch is that manyclangs stores object files pre-link. Executables are materialized by relinking after they are extracted with elfshaker.
|
| The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).
|
| As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.
| zeotroph wrote:
| Thank you for the explanation; so the pre-link storage is one of the magical ingredients. Maybe mention this as well in the README?
|
| Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?
| peterwaller-arm wrote:
| > maybe mention this as well in the README?
|
| We've tweaked the readme, I hope it's clearer.
|
| It would be great to provide this for gcc too. The project is new and we've just started out. I know less about gcc's build system and how hard it will be to apply these techniques there. It seems as though it should be possible, though, and I'd love to see it happen.
|
| To infer the object->executable dependencies we currently read the compilation database and produce a stand-alone link.sh shell script, which gets packaged into each manyclangs snapshot.
| zeotroph wrote:
| Ah, the compilation database is where more magic originates from :)
| peterwaller-arm wrote:
| Yes, this is less great than I would like! :( :)
| mrich wrote:
| Thanks. I had a use case in mind where LTO is enabled. Unfortunately the LTO step is quite expensive, so relinking does not seem like a viable option. If I find some time I'll give it a try though.
| peterwaller-arm wrote:
| ThinLTO can be pretty quick if you have enough cores, so it might work. I'm not sure how well the LTO objects compress against each other when you have small changes to them, but it might work reasonably.
|
| manyclangs is optimized to provide you with a binary quickly. The binary is not necessarily itself optimized to be fast, because it's expected that a developer might want to access any version of it for the purposes of testing whether some input manifests a bug or has a particular codegen output. In that scenario, it's likely that the developer is able to reduce the size of the input such that the speed of the compiler itself is not terribly significant in the overall runtime. Therefore, I don't see LTO for manyclangs as such a significant win. But it is still hoped that the overall end-to-end runtime is good, and the binaries are optimized, just not with LTO.
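
The compile-and-relink scheme described above can be sketched with stock clang; the file names here are illustrative, and this is not manyclangs' actual link.sh, which is generated from the compilation database:

    # Per-function/per-data sections keep edits local to a section
    # instead of shifting addresses across the whole binary:
    clang++ -c -O2 -ffunction-sections -fdata-sections foo.cpp -o foo.o
    clang++ -c -O2 -ffunction-sections -fdata-sections bar.cpp -o bar.o

    # After extracting the stored objects, materialize the executable;
    # --gc-sections drops unreferenced sections at link time:
    clang++ foo.o bar.o -Wl,--gc-sections -o prog
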
| nh2 wrote:
| I experimented with something similar with a Linux distribution's package binary cache.
|
| Using `bup` (a deduplicating backup tool using the git packfile format) I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.
|
| Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.
|
| Details: https://github.com/NixOS/nixpkgs/issues/89380
| peterwaller-arm wrote:
| Author here. I've used bup, and elfshaker was partially inspired by it! It's great. However, during initial experiments on this project I found bup to be slow, taking quite a long time to snapshot and extract. I think this could in principle be fixed in bup one day.
| Siira wrote:
| Is elfshaker any good for backing up non-text data?
| ybkshaw wrote:
| Thank you for having such a good description of the project! Sometimes the links from HN lead to a page that takes a few minutes of puzzling to figure out what is going on, but not yours.
| nh2 wrote:
| I have also used bup for a long time, but found that for very large server backups I'm hitting performance problems (both in time and memory usage).
|
| I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but is younger and thus lacks some features.
|
| Here is somebody's benchmark of bupstash (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/
|
| The `bupstash` author is super responsive on Gitter/Matrix; it may make sense to join there to discuss approaches/findings together.
|
| I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or also other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that works with huge files and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating snapshot-based saving.
| pdimitar wrote:
| I used `bupstash` and evaluated it for a while. I am looking to do 5+ offsite backups of a small personal directory to services that offer 5GB of cloud space for free.
|
| `bupstash` lacked good compression. I settled with `borg` because I could use `zstd` compression with it. Currently at 60 snapshots of the directory, the `borg` repo directory is at ~1.52GB out of the 5GB quota. The source directory is ~12.19GB uncompressed. Very happy with `borg` + `zstd` and how they handle my scenario.
|
| I liked `bupstash` a lot, and the author is responsive and friendly. But I won't be giving it another try until it implements much more aggressive compression compared to what it can do now. It's a shame, I _really_ wanted to use it.
|
| I do recognize that for many other scenarios `bupstash` is very solid though.
| veselink1 wrote:
| An author here; we've opened a Q&A discussion on GitHub: https://github.com/elfshaker/elfshaker/discussions/58
| thristian wrote:
| This seems very much like the Git repository format, with loose objects being collected into compressed pack files - except I think Git has smarter heuristics about which files are likely to compress well together. It would be interesting to see a comparison between this tool and Git used to store the same collection of similar files.
| peterwaller-arm wrote:
| An author here, I agree! The packfile format is heavily inspired by git, and git may also do quite well at this.
|
| We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster, and with smaller output, than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.
|
| On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.
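
The measurement peterwaller-arm suggests could look roughly like this; the snapshot names are illustrative, and the elfshaker subcommands follow the project's usage guide, so treat the exact syntax as a sketch:

    mkdir compare && cd compare && git init -q
    # For each snapshot: materialize the files, then commit them to git.
    elfshaker extract snapshot-1 && git add -A && git commit -qm snapshot-1
    elfshaker extract snapshot-2 && git add -A && git commit -qm snapshot-2
    # Let git delta-compress everything, then compare sizes and timings:
    git gc --aggressive
    du -sh .git/objects/pack
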
| johnyzee wrote:
| As far as I was able to learn (don't remember the details, sorry), git does not do well with large binary files. I believe it ends up with a lot of duplication. It is the major thing I am missing from git; currently we store assets (like big PSDs that change often) outside of version control, and it is suboptimal.
| peterwaller-arm wrote:
| Performing poorly with non-textual data happens for a number of reasons. Binary data, when changed, often has a lot of 'non-local' changes in it. For example, a PSD file might well have a compression algorithm already applied to it. An insertion/deletion is going to result in a very different compressed representation for which there is no good way to have an efficient delta. elfshaker will suffer the same problem here.
| JoshTriplett wrote:
| Can you talk a bit more about what ELF-specific heuristics elfshaker uses? What kind of preprocessing do you do before zstd? Do you handle offsets changing in instructions, like the BCJ/BCJ2 filter? Do you do anything to detect insertions/deletions?
| peterwaller-arm wrote:
| We've just added an applicability section, which explains a bit more about what we do. We don't have any ELF-specific heuristics [0].
|
| https://github.com/elfshaker/elfshaker#applicability
|
| In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section-relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit-identical object files, even though the build system often wants to rebuild them because some input has changed.
|
| elfshaker packs use a heuristic of sorting all unique objects by size before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression.
|
| [0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)
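
A rough shell approximation of that heuristic, assuming GNU find/xargs, paths without whitespace, and ignoring the index the real pack format keeps so files can be located again: sort the unique objects by size so similar files sit near each other, then compress the concatenation as a single stream.

    find objs/ -type f -printf '%s\t%p\n' \
      | sort -n | cut -f2- \
      | xargs cat | zstd -19 --long -o pack.zst
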
| JoshTriplett wrote:
| Ah, I see! Makes sense that you can do much better if you get to compile the programs with your choice of options.
| derefr wrote:
| One could, in theory, write a _git-clean_ filter (like the one used for git-lfs) that teaches git various heuristic approaches to "take apart" well-known binary container formats into trees of binary object leaf-nodes.
|
| Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree -- sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.
|
| On checkout, this process would happen in reverse: a matching _git-smudge_ filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.
|
| Since most of the _generated leaf-nodes_ from this process wouldn't change on each commit, this would eliminate most of the _storage_ overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash].)
|
| I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
| ChrisMarshallNY wrote:
| _> we store assets (like big PSDs that change often) outside of version control and it is suboptimal._
|
| Perforce is still used by game developers and other creatives, because it handles large binaries quite well.
|
| In fact, I'm not sure if they still do it, but one of the game engines (I think, maybe, Unreal) used to have a free tier that also included a free Perforce install.
| mdaniel wrote:
| It was my recollection, and I confirmed it, that they've almost always had a "the first hit is free" model for small teams, and they also explicitly call out indie game studios as getting free stuff too: https://www.perforce.com/how-buy
| 3np wrote:
| Do you think it would be feasible to do a git-lfs replacement based on elfshaker?
|
| Down the line maybe it would even be possible to have binaries as "first-class" (save for diff, I guess).
| londons_explore wrote:
| I'd like to see a version of this built into things like IPFS.
|
| It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.
| hcs wrote:
| It should be possible to do this in IPFS already if you use the go-ipfs --chunker option with a content-sensitive chunking algorithm like rabin or buzhash [1]. With this there's a good chance that a file with small changes from something already on IPFS will have some chunks that hash identically, so they'll be shared.
|
| [1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
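
For reference, go-ipfs exposes this directly; with a content-defined chunker, a lightly edited file shares most of its chunks with the copy already stored:

    ipfs add --chunker=buzhash big-binary
    ipfs add --chunker=rabin-262144 big-binary   # rabin, ~256 KiB average chunks
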
| londons_explore wrote:
| But that isn't quite as good as something like this that can 'understand' diffs in files, rather than simply relying on the fact that a bunch of bytes in a row might be the same.
| hcs wrote:
| I don't think elfshaker actually does any binary diffing (e.g. xdelta or bsdiff). It works well because it uses pre-link objects which are built to change as little as possible between versions. Then, when it compresses similar files together in a pack, Zstandard can recognize the trivial repeats.
| peterwaller-arm wrote:
| Author here. This is correct. We set out to do binary diffing, but we soon discovered that if you put similar enough object files together in a stream, and then compress the stream, zstandard does a fantastic job of compressing and decompressing quickly with a high compression ratio. The existing binary diffing tools can produce small patches, but they are relatively expensive both to compute the delta and to apply the patches.
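
This effect is easy to observe with stock zstd: compressing two similar object files concatenated costs barely more than compressing one alone (paths illustrative; --long widens the match window for large inputs):

    zstd -19 -o one.zst build-a/foo.o
    cat build-a/foo.o build-b/foo.o | zstd -19 --long > both.zst
    ls -l one.zst both.zst   # both.zst should be only slightly larger
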
| mal10c wrote:
| This project reminded me of something I've been looking for for a while - although it's not exactly what I'm looking for...
|
| I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and, worst case, the engineers can't move forward.
|
| In that light, I'd love to switch to another option. Preferably something decentralized, just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."
|
| All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?
|
| I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770
| kvnhn wrote:
| Maybe not precisely what you want, but I built a CLI tool[1] that's like a simplified and decoupled Git-LFS. It tracks large files in a content-addressed directory, and then you track the references to that store in source control. Data compression isn't a top priority for my tool; it uses immutable symlinks, not archives.
|
| [1]: https://github.com/kevin-hanselman/dud
| erichocean wrote:
| Seems like the Nix people would be interested in enabling this kind of thing for Nix packages...
| lxpz wrote:
| This should be integrated with Cargo to reduce the size of the target directories, which are becoming ridiculously large.
| peterwaller-arm wrote:
| Author here. I'm unsure whether this would apply very well to cargo or not. If it has lots of pre-link object files, then maybe.
| lxe wrote:
| > There are many files,
| > Most of them don't change very often so there are a lot of duplicate files,
| > When they do change, the deltas of the [binaries] are not huge.
|
| We need this but for node_modules
| ithkuil wrote:
| The novel trick here is splitting up huge binary files and treating them as if they were many small files.
|
| node_modules is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like ELF sections/symbols, etc.)
| i_like_waiting wrote:
| Thanks, seems like this could be a good solution for storing daily backups of a DB. I didn't know I needed it, but it seems like I do.
| phil294 wrote:
| Have a look at Borg; it handles incremental backups very well.
| peterwaller-arm wrote:
| Author here. This software is young; please don't use it for backups!
|
| But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.
| wpietri wrote:
| In this age of rampant puffery, it's so... soothing to see somebody be positive and frank about the limits of their creation. Thanks for this and all your comments here!
| peterwaller-arm wrote:
| <3
| the_duke wrote:
| Borg, bup, and restic are relatively popular incremental backup tools that deduplicate with chunking.
| goodpoint wrote:
| I'm surprised nobody mentioned git-annex. It does the same using git for metadata. It's extremely efficient.
| kristjansson wrote:
| AFAIK, git-annex doesn't address sub-file deduplication/compression at all; it just stores a new copy for each new hash it sees? I suppose that content-addressed storage, combined with the pre-link strategy discussed elsewhere for the related manyclangs project, would produce similar, if less spectacular, results?
| jankotek wrote:
| Does it make sense to turn it into a FUSE FS, with transparent deduplication?
| peterwaller-arm wrote:
| Author here. Maybe; it's a fun idea. I have toyed with providing a FUSE filesystem for access to a pack, but my time for completing this is limited at the moment.
| nh2 wrote:
| Many packfile-deduplicating backup tools (bup, kopia, borg, restic) can mount the deduplicated storage as FUSE.
|
| It might make sense to check how they do it.
|
| I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.
|
| Did you know of their existence when making elfshaker?
|
| Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...
| peterwaller-arm wrote:
| (Copying from the Q&A) Before starting out some time ago, I did some experiments with bup. I had a good experience with bup and high expectations for it. However, I found that quite a lot of performance was left on the table, so I was motivated to start elfshaker. Unfortunately that time has passed, so I don't have scientific numbers measured against other software at this time.
|
| As an idea of how elfshaker performs, we see ~300ms to create a snapshot for clang, and seconds-to-a-minute to create a binary pack containing thousands of revisions. Extraction takes less than a second. One difference of elfshaker compared with some other software I tested is that we do the compression and decompression in parallel, which can make a very big difference on today's many-core machines.
| mhx77 wrote:
| Somewhat related (and definitely born out of a very similar use case): https://github.com/mhx/dwarfs
|
| I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.
| pdimitar wrote:
| That's super impressive, I will definitely give it a go. Thanks for sharing!
| peterwaller-arm wrote:
| Nice, I bet dwarfs would do well at our use case too. Thanks for sharing.
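
The dwarfs workflow mentioned above is a two-step build-and-mount; commands as documented by the dwarfs project, with illustrative paths:

    mkdwarfs -i perl-installs/ -o perl.dwarfs   # build the compressed image
    dwarfs perl.dwarfs /mnt/perl                # FUSE-mount it read-only
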
| tttsxhub wrote:
| Why does it depend on the CPU architecture?
| peterwaller-arm wrote:
| (Disclosure: I work for Arm; opinions are my own.)
|
| Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.
|
| manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.
|
| You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.
|
| [0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
| henvic wrote:
| Interesting. I wonder if this can also be [ab]used to, say, deliver deltas of programs, so that you can have faster updates, but maybe it doesn't make sense.
|
| https://en.wikipedia.org/wiki/Binary_delta_compression
| peterwaller-arm wrote:
| Author here, I don't think it would apply well to that scenario. elfshaker is good for manyclangs, where we ship 2,000 revisions in one file (pack), so the cost of an individual revision is amortized. If one build of llvm+clang costs you some ~400 MiB, a single elfshaker pack containing 2,000 builds has an amortized cost of around 40kiB/build. But this amazing win is only happening because you are shipping 2,000 builds at once. If you wanted to ship a single delta, you couldn't compress against all the other builds.
| necovek wrote:
| How fast would it be to get a delta between any two of the 2,000 builds in a single elfshaker pack?
|
| If that's reasonably fast, perhaps an approach like that could work: the server stores the entire pack, but upon user request extracts a delta between the user's version and the target binary.
|
| Still, the devil is in the details of building all revisions of all software a single distribution has.
| peterwaller-arm wrote:
| Yes, you could do that. On the other hand, all revisions for a month is 100MiB, and all revisions we've built spanning 2019-now are a total of 2.8GiB, so we opted to forego implementing any object negotiation and just say 'you have to download the 100MiB for the month to access it'. I think a push/pull protocol could be implemented, but at that point git might do a reasonable job :)
| henvic wrote:
| Thank you for the insight!
| wlll wrote:
| Related, and impressive: https://github.com/elfshaker/manyclangs
|
| > manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.
|
| > It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.
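
What "running any build in ~4s" looks like in practice, as a sketch: the pack and snapshot names are placeholders, and the link.sh invocation follows the description earlier in this thread rather than authoritative documentation:

    elfshaker extract <pack>:<snapshot>   # unpack one build's object files
    ./link.sh clang                       # relink just the clang binary
    ./bin/clang --version                 # a runnable compiler, seconds later
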
| Tobu wrote:
| The clever idea that makes manyclangs compress well is to store object files before they are linked, with each function and each variable in its own ELF section, so that changes are mostly local; addresses indirect through sections, and a change to one item won't cascade into moving every address.
|
| I'm not sure the linking step they provide is deterministic/hermetic; if it is, that would prove a decent way to compress the final binaries while shaving most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries if so?
|
| I'm not seeing any particular tricks done in elfshaker itself to enable this; the packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics
|
| I'd like to see a comparison with Git and the others listed here (same unlinked clang artifacts; compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...
| peterwaller-arm wrote:
| Author here, I'd like to see such a comparison too, actually, but I'm not in a position to do the work at the moment. We did some preliminary experiments at the beginning, but a lot changed over the course of the project, and I don't know how well elfshaker ultimately fares against all the options out there. Some basic tests against git found that git is quite a bit slower (10s vs 100ms) during 'git add' and 'git checkout'. Maybe that can be fixed with some tuning or finding appropriate options.
| perth wrote:
| Reminds me of how Microsoft packages the Windows installer, actually. If you've ever unpacked Microsoft's install.esd, it's kind of insane how heavily it's compressed. I assume it's full of a lot of stuff that provides semi-redundant binaries for compatibility with a lot of different systems, because the unpacked esd container goes from a few GiB to, I think, around 40-50 GiB, iirc.
| derefr wrote:
| The emulation community also has "ROMsets" -- collections of game ROM images, where the ROM images _for a given game title_ are all grouped together into an archive. So you'd have one archive for e.g. "every release, dump, and ROMhack of Super Mario Bros 1."
|
| These ROM-set archives -- especially when using more modern compression algorithms, like LZMA/7zip -- end up about 1.1x the size of a _single one_ of the contained game ROM images, despite sometimes containing literally hundreds of variant images.
| Daishiman wrote:
| How does this work? Do all the game series use the same engine code and assets?
| bena wrote:
| Sort of. ROMhacks are modified ROM images of a certain game.
|
| If you knew where in the ROM image the level data was contained, you could modify it. As long as you didn't violate any constraints, the game would run fine.
|
| You could also potentially influence game behavior as well. The Game Genie and GameShark were kind of based on this concept. Except, being further along the chain, they could write values coming into and out of memory, so other effects were possible.
|
| So, in the case of Super Mario Bros. ROMhacks, they all use Super Mario Bros. as a base ROM. Then, from there, all you need to do is store the diff from the base.
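
bena's "store the diff from the base" is what generic binary-delta tools do; the ROM-hacking community usually distributes IPS/BPS patches, but bsdiff works as a stand-in (file names illustrative):

    bsdiff base.sfc hack.sfc hack.patch       # the delta is tiny vs. the ROM
    bspatch base.sfc rebuilt.sfc hack.patch   # base + patch => the hack again
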
| notafraudster wrote:
| I think you're slightly misinterpreting what the parent said. Take the game Super Mario World for the console Super Nintendo. It was released in Japan. It was released in the US. It was released in Europe. It was released in Korea. It was released in Australia. It was probably released in various minor regions and given unique translations. There are almost certainly re-releases of the game on Super Nintendo that issued new ROM files to correct minor bugs. Maybe there's a Greatest Hits version, which might be the same game but with an updated copyright date to reflect the re-release. This might amount to 10-12 versions of the same game, but 99.99% of what's in the ROM file is the same across all of them, so they compress together very well.
|
| A copy of Super Mario Advance 2 for Game Boy Advance, which is also a re-release of Super Mario World, almost surely uses its own engine and would not be part of the same ROM set. Likewise, other Mario games (like Mario 64, Super Mario Bros, etc.) would not be part of the same ROM set. So it's nothing about the series using the same engine code or assets.
|
| We're talking bugfixes and different regions for the same game on the same console. But this still has the effect of dropping the size of complete console collections by 50% or more, because most consoles have 2-3 regions per game for most games.
| derefr wrote:
| You're generally correct. But there are interesting exceptions!
|
| Sometimes, ROM-image-based game titles _were_ based on the same "engine" (i.e. the same core set of assembler source-files with fixed address-space target locations, and so fixed locations in a generated ROM image), but with a few engine modifications and entirely different assets.
|
| In a sense, this makes these different games effectively into mutual "full conversion ROMhacks" of one another.
|
| You'll usually find these different game titles compressed together into the _same_ ROMset (with one game title -- usually the one with the oldest official release -- being considered the prototype for the others, and so naming the ROMset), because they _do_ compress together very well -- not near-totally, the way bugfix patches do, but adding only the total amount to the archive size that you'd expect for the additional new assets.
|
| Well-known examples of this are _Doki Doki Panic_ vs. _Super Mario Bros 2_; _Panel de Pon_ vs. _Tetris Attack_; _Gradius III_ vs. _Parodius_; and any game with editions, e.g. _Pokemon_ or _Megaman Battle Network_.
|
| But there are more "complete" examples as well, where you'd never even suspect the two titles are related, with the games perhaps existing in entirely different genres. (I don't have a ROMset library on hand to dig out examples, but if you dig through one, you'll find some amazing examples of engine reuse.)
| wpietri wrote:
| Ooh, neat. I was wondering why anybody would make a binary-specific VCS. And why "elf" was in the name. This answers both questions. Thanks!
| [deleted]
| yincrash wrote:
| Could this be useful for packing Xcode's DerivedData folder for caching in CI builds?
| svilen_dobrev wrote:
| Will some of these work for (compressed) variants of audio? They're never the same...
| peterwaller-arm wrote:
| Author here. Compressed data is unlikely to work well in general, unless it never changes.
| cyounkins wrote:
| Cool! I wonder how this would compare to ZFS deduplication.
| veselink1 wrote:
| An author here. elfshaker uses per-file deduplication. When building manyclangs packs, we observed that the deduplicated content is about 10 GiB in size. After compression with `elfshaker pack`, that comes down to ~100 MiB.
|
| There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs is ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.
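
For the ZFS comparison: ZFS deduplicates at the block level, online, per dataset, whereas elfshaker deduplicates whole files at snapshot time. A minimal ZFS setup, with an illustrative dataset name (note that ZFS dedup is famously RAM-hungry):

    zfs set dedup=on tank/builds   # block-level dedup for new writes
    zpool list tank                # the DEDUP column reports the ratio
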
| bogwog wrote:
| Does this work well with image files? (PNG, JPEG, etc.)
| peterwaller-arm wrote:
| Author here, it works particularly well for our presented use case because it has these properties:
|
| * There are many files,
|
| * Most of them don't change very often,
|
| * When they do change, the deltas of the binaries are not huge.
|
| So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio somewhat equivalent to if you'd concatenated the two revisions of the file and compressed them using Zstandard.
| shp0ngle wrote:
| Ahhh, that's the key insight I have been missing, and that should be higher somewhere.
|
| Thanks
| IceWreck wrote:
| Please add these points under a use-case heading in your README.
| peterwaller-arm wrote:
| Done; hopefully this is clearer. Please let us know if you see a way to improve it further: https://github.com/elfshaker/elfshaker/pull/60
| ghoul2 wrote:
| If I already have, let's say, a 100MB pack file containing (say) 200 builds of clang, and then I import the 201st build into that pack file - is it possible to send across a small delta of this new, updated pack file to someone else who already had the older pack file (with 200 builds), such that they can apply the delta to the old pack and get the new pack containing 201 builds?
| carlmr wrote:
| I find the description a bit confusing; is there an example where we can see the usage?
| mxuribe wrote:
| Same here. There is a usage guide, which helped a tiny bit: https://github.com/elfshaker/elfshaker/blob/main/docs/users/...
|
| Honestly, I sort of looked at it for a conventional backup strategy... as in, I wonder if it could work as a replacement for tar-zipping up a directory, etc. But I'm not sure if the use case is appropriate.
| xdfgh1112 wrote:
| For backup you probably want something like Borg to handle deduplication of identical content between backups.
| peterwaller-arm wrote:
| Author here, I agree with xdfgh1112; please take care before using brand-new software to store your backups!
| mxuribe wrote:
| Yes, any time that I use something new or different (or both) for something as essential as backups, I take great and deliberate care... and test, test, test... well before standardizing on it. ;-)
| peterwaller-arm wrote:
| Author here. We'd love this to be a thing, but this is young software, so we don't recommend relying on it as your single way of doing a backup for now. Bear in mind that our main use case is for things that you can reproduce in principle (builds of a commit history; see manyclangs).
| mxuribe wrote:
| > our main use case is for things that you can reproduce in principle (builds of a commit history, see manyclangs)
|
| I appreciate your response, and thanks very much for the clarification of the use case; very helpful! Thanks also, of course, for building this!
| w0m wrote:
| My top-level takeaway being that it's a VCS (like Git) specialized for binaries, with commands baked in to prevent the slowdown that often comes with large git repositories.
| throw_away wrote:
| Specifically, it's for ELF binaries built in such a way that adding a new function or new data does not break the deduplication of existing functions/data.
|
| I wonder if this concept could be extended to other binary types that git has problems with, if you were able to know/control more about the underlying binary format.
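
For those asking about usage above, a minimal session; the subcommands follow the project's usage guide, but treat the exact names and syntax as illustrative, and heed the authors' warning not to trust young software with sole copies of data:

    elfshaker store build-1234              # snapshot the working directory
    elfshaker pack november                 # compress loose snapshots into a pack
    elfshaker extract november:build-1234   # materialize a snapshot again
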
| wyldfire wrote:
| There is an associated presentation on manyclangs at the LLVM dev meeting. I think they presented yesterday?
|
| Unfortunately it won't be uploaded until later, but it will show up on the LLVM YouTube channel:
|
| https://www.youtube.com/c/LLVMPROJ
| ot wrote:
| I would guess it's a way to quickly bisect on compiler versions.
| peterwaller-arm wrote:
| One of the authors here, thanks for the feedback. We've tried to improve it here: https://github.com/elfshaker/elfshaker/pull/59
| xpe wrote:
| Never shake a baby elf!
| 0942v8653 wrote:
| Does it do any architecture-specific processing, e.g. a BCJ filter? Or is there a generic version of this? The performance seems quite good.
| peterwaller-arm wrote:
| Author here. No architecture-specific processing currently. Most of the magic happens in zstandard (hat tip to this amazing project).
|
| Please see our new applicability section, which explains the result in a bit more detail:
|
| https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...
|
| In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!
| dilap wrote:
| Huh, interesting, could you maybe use this as an in-repo alternative to something like git-lfs?
| peterwaller-arm wrote:
| Author here, I don't currently know how this compares to git-lfs. It is possible git-lfs would perform quite well on the same inputs elfshaker works on. If git-lfs does already work well for your use case, I'd recommend using that rather than elfshaker, as it is more established.
| dilap wrote:
| Thanks for the response! I was more just curious about future possibilities vs immediate practical use.
|
| git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.
|
| If you have a lot of binary assets like artwork or huge Excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and GitHub will get angry at you for having too large a repo.
|
| But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once with the initial clone, and also, of course, just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)
|
| So you could imagine something like having elfshaker embedded in the repo and integrated as a checkout filter being a potentially useful alternative. Basically, an efficient way to store binaries directly in the repo.
|
| (Maybe it would be too narrow a band of use cases to be practical though? Obviously, if you have lots of distinct art assets, that's just going to be big, no matter what...)
| axismundi wrote:
| Does it work on Intel Macs?
___________________________________________________________________
(page generated 2021-11-19 23:00 UTC)