[HN Gopher] Show HN: We scaled Git to support 1 TB repos
       ___________________________________________________________________
        
       Show HN: We scaled Git to support 1 TB repos
        
        I've been in the MLOps space for ~10 years, and data is still the
        hardest unsolved problem. Code is versioned using Git, data is
        stored somewhere else, and context often lives in a third location
        like Slack or GDocs. This is why we built XetHub, a platform that
       enables teams to treat data like code, using Git.  Unlike Git LFS,
       we don't just store the files. We use content-defined chunking and
       Merkle Trees to dedupe against everything in history. This allows
       small changes in large files to be stored compactly. Read more
       here: https://xethub.com/assets/docs/how-xet-deduplication-works
       Today, XetHub works for 1 TB repositories, and we plan to scale to
       100 TB in the next year. Our implementation is in Rust (client &
       cache + storage) and our web application is written in Go. XetHub
       includes a GitHub-like web interface that provides automatic CSV
       summaries and allows custom visualizations using Vega. Even at 1
       TB, we know downloading an entire repository is painful, so we
        built git-xet mount, which provides a user-mode filesystem view
        over the repo in seconds. XetHub is available today (Linux and Mac
        now, Windows coming soon) and we would love your feedback!
       Read more here:  - https://xetdata.com/blog/2022/10/15/why-xetdata
       - https://xetdata.com/blog/2022/12/13/introducing-xethub
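
        To make that concrete, here is a rough sketch of content-defined
        chunking plus chunk-level dedupe (a toy boundary hash and toy sizes,
        not the production chunker):

            // Toy content-defined chunking + dedupe (std-only Rust sketch).
            use std::collections::hash_map::DefaultHasher;
            use std::collections::HashSet;
            use std::hash::{Hash, Hasher};

            const MIN_CHUNK: usize = 2 * 1024; // never cut a chunk smaller than this
            const MASK: u64 = (1 << 13) - 1;   // ~8 KiB average chunk size

            // Boundaries depend only on nearby bytes, so an edit only
            // disturbs the chunks around it; everything else dedupes.
            fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
                let (mut cuts, mut h, mut start) = (Vec::new(), 0u64, 0usize);
                for (i, &b) in data.iter().enumerate() {
                    h = h.wrapping_mul(31).wrapping_add(b as u64);
                    if i - start >= MIN_CHUNK && (h & MASK) == 0 {
                        cuts.push(i + 1);
                        start = i + 1;
                        h = 0;
                    }
                }
                cuts.push(data.len());
                cuts
            }

            // Bytes not seen before, i.e. what a push would actually upload.
            fn new_bytes(data: &[u8], seen: &mut HashSet<u64>) -> usize {
                let (mut total, mut prev) = (0usize, 0usize);
                for cut in chunk_boundaries(data) {
                    let chunk = &data[prev..cut];
                    let mut hasher = DefaultHasher::new();
                    chunk.hash(&mut hasher);
                    if seen.insert(hasher.finish()) {
                        total += chunk.len();
                    }
                    prev = cut;
                }
                total
            }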
        
       Author : reverius42
       Score  : 194 points
       Date   : 2022-12-13 15:14 UTC (7 hours ago)
        
 (HTM) web link (xethub.com)
 (TXT) w3m dump (xethub.com)
        
       | timsehn wrote:
       | Founder of DoltHub here. One of my team pointed me at this
       | thread. Congrats on the launch. Great to see more folks tackling
       | the data versioning problem.
       | 
       | Dolt hasn't come up here yet, probably because we're focused on
       | OLTP use cases, not MLOps, but we do have some customers using
       | Dolt as the backing store for their training data.
       | 
       | https://github.com/dolthub/dolt
       | 
       | Dolt also scales to the 1TB range and offers you full SQL query
       | capabilities on your data and diffs.
        
         | ylow wrote:
         | CEO/Cofounder here. Thanks! Agreed, we think data versioning is
          | an important problem and we are in related, but opposite, parts
         | of the space. (BTW we really wanted gitfordata.com. Or perhaps
         | we can split the domain? OLTP goes here, Unstructured data goes
         | there :-) Shall we chat? )
        
       | chubot wrote:
       | Can it be used to store container images (Docker)? As far as I
       | remember they are just compressed tar files. Does the compression
       | defeat Xet's own chunking?
       | 
        | Can you sync to another machine without XetHub?
       | 
       | How about cleaning up old files?
        
         | ylow wrote:
          | Yeah... the compression does defeat the chunking (your mileage
          | may vary; we do get a small amount of dedupe in some experiments
          | but never investigated it in detail). That said, we have
          | experimental preprocessors / chunkers that are file-type
          | specific, so we could potentially do something about tar.gz.
          | Not something we have explored much yet.
        
       | the_arun wrote:
        | Signed up & browsed the "Flickr30k" repo (auto generated) & it
        | was really slow for me. Like CSV, does it also support other
        | data formats like json, yml, etc.?
        
         | ylow wrote:
         | We are file format agnostic and you should be able to put
         | anything in the repo. We have special support for CSV files for
         | visualizations. Sorry for the UI perf... there are a lot of
         | optimizations we need to work on.
        
       | amadvance wrote:
        | How is data split into chunks? Just curious.
        
         | [deleted]
        
         | sesm wrote:
          | They mention 'content-defined chunking', but as far as I
          | understand it, that requires different chunking algorithms for
         | different content types. Does it support plugins for chunking
         | different file formats?
        
           | ylow wrote:
           | Today we just have a variation of FastCDC in production, but
           | we have alternate experimental chunkers for some file formats
           | (ex: a heuristic chunker for CSV files that will enable
           | almost free subsampling). Hope to have them enter production
           | in the next 6 months.
        
             | sesm wrote:
             | That's interesting. Can a CSV chunker make adding a column
             | not affect all of the chunks?
        
               | ylow wrote:
                | The simplest approach really is to chunk row-wise, so
                | adding columns will unfortunately rewrite all the chunks.
                | If you have a parquet file, adding columns will be cheap.
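                | 
                | A toy illustration (hypothetical row-wise chunker, two rows
                | per chunk): appending a row leaves the earlier chunk hashes
                | alone, while adding a column touches every row and therefore
                | every chunk.
                | 
                |     use std::collections::hash_map::DefaultHasher;
                |     use std::hash::{Hash, Hasher};
                | 
                |     // Hash each fixed-size group of rows as one "chunk".
                |     fn row_chunk_hashes(csv: &str, per_chunk: usize) -> Vec<u64> {
                |         let lines: Vec<&str> = csv.lines().collect();
                |         lines
                |             .chunks(per_chunk)
                |             .map(|c| {
                |                 let mut h = DefaultHasher::new();
                |                 c.join("\n").hash(&mut h);
                |                 h.finish()
                |             })
                |             .collect()
                |     }
                | 
                |     fn main() {
                |         let v1 = "a,b\n1,2\n3,4\n5,6";
                |         let plus_row = format!("{v1}\n7,8");
                |         let plus_col = "a,b,c\n1,2,0\n3,4,0\n5,6,0";
                |         println!("{:?}", row_chunk_hashes(v1, 2));        // two hashes
                |         println!("{:?}", row_chunk_hashes(&plus_row, 2)); // same two + one new
                |         println!("{:?}", row_chunk_hashes(plus_col, 2));  // both changed
                |     }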
        
         | ylow wrote:
         | CEO/Cofounder here! Content defined chunking. Specifically a
         | variation of FastCDC. We have a paper coming out soon with a
         | lot more technical details.
        
       | subnetwork wrote:
       | This feels like something that is prime for abuse. I agree with
        | @bastardoperator, treating git as file storage is going to go
       | nowhere good.
        
       | ledauphin wrote:
       | The link takes me to a login page. It would be nice to see that
       | fixed to somehow match the title.
        
         | reverius42 wrote:
         | Visit https://xetdata.com for more info! (Sorry, can't edit the
         | post link now.)
        
       | jrockway wrote:
       | There are a couple of other contenders in this space. DVC
       | (https://dvc.org/) seems most similar.
       | 
       | If you're interested in something you can self-host... I work on
       | Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't
       | have a Git-like interface, but also implements data versioning.
       | Our approach de-duplicates between files (even very small files),
        | and our storage algorithm doesn't create a number of objects
        | proportional to directory nesting depth, as Xet appears to. (Xet
        | is very much like Git in that respect.)
       | 
       | The data versioning system enables us to run pipelines based on
       | changes to your data; the pipelines declare what files they read,
       | and that allows us to schedule processing jobs that only
       | reprocess new or changed data, while still giving you a full view
       | of what "would" have happened if all the data had been
       | reprocessed. This, to me, is the key advantage of data
       | versioning; you can save hundreds of thousands of dollars on
       | compute. Being able to undo an oopsie is just icing on the cake.
       | 
       | Xet's system for mounting a remote repo as a filesystem is a good
       | idea. We do that too :)
        
         | ylow wrote:
         | By the way, our mount mechanism has one very interesting
         | novelty. It does not depend on a FUSE driver on Mac :-)
        
           | jrockway wrote:
           | That's smart! I think users have to install a kext still?
        
             | ylow wrote:
              | Nope. No kernel driver needed :-) We wrote a localhost NFS
             | server.
        
               | catiopatio wrote:
               | Based on unfsd or entirely in-house?
        
               | ylow wrote:
               | Entirely in house. In Rust!
        
               | catiopatio wrote:
               | Fancy! That's awesome.
        
         | chubot wrote:
         | Is DVC useful/efficient at storing container images (Docker)?
         | As far as I remember they are just compressed tar files. Does
         | the compression defeat its chunking / differential compression?
         | 
         | How about cleaning up old versions?
        
         | ylow wrote:
         | We have found pointer files to be _surprisingly_ efficient as
          | long as you don't have to actually materialize those files.
          | (Git's internals are actually very well done.) Our mount mechanism
          | does avoid materializing pointer files, which makes it pretty
          | fast even for repos with a very large number of files.
        
           | unqueued wrote:
           | For bigger annex repos with lots of pointer files, I just
           | disable the git-annex smudge filters. Consider whether smudge
            | filters are a requirement or a convenience. The smudge filter
           | interface does not scale that well at all.
        
       | Izmaki wrote:
       | If I had to "version control" a 1 TB large repo - and assuming I
       | wouldn't quit in anger - I would use a tool which is built for
       | this kind of need and has been used in the industry for decades:
       | Perforce.
        
         | ryneandal wrote:
         | This was my thought as well. Perforce has its own issues, but
         | is an industry standard in game dev for a reason: it can handle
         | immense amounts of data.
        
           | Phelinofist wrote:
           | What does immense mean in the context of game dev?
        
             | llanowarelves wrote:
             | On "real" (AA/AAA) games? Easily hundreds of gigabytes or
             | several terabytes of raw assets + project files.
             | 
             | Sometimes even individual art project files can be many
             | gigabytes each. I saw a .psd that was 30gb because of the
             | embedded hi-res reference images.
             | 
              | You can throw pretty much anything in there, in one place,
              | and it has things like locking, partial checkout, etc., which
              | gets artists to use it.
        
               | hinkley wrote:
                | Perforce also has support for proxies, right? It's not
               | just the TB of data, it's all of your coworkers in a
               | branch office having to pull all the updates first thing
               | in the morning. If each person has to pull from origin,
               | that's a lot of bandwidth, and wasted mornings. If the
               | first person in pays and everyone else gets it off the
               | LAN, then you have a better situation.
        
         | mentos wrote:
         | I work in gamedev and think perforce is good but far from
         | great. Would love to see someone bring some competition to the
          | space; maybe XetHub can.
        
         | tinco wrote:
         | So, you wouldn't consider using a new tool that someone
         | developed to solve the same problem despite an older solution
         | already existing? Your advice to that someone is to just use
         | the old solution?
        
           | TylerE wrote:
           | When the new solution involves voluntary use of git? Not just
           | yea, but hell yes. I hate git.
        
             | xur17 wrote:
             | Why do you hate git? I've been pretty happy with it for
             | code, and wouldn't mind being able to use it for data
             | repositories as well.
        
               | TylerE wrote:
               | Is it really worth re-hashing at this point? Reams have
               | been written about the UX
        
               | xur17 wrote:
               | It's used by the vast majority of software engineers, so
               | apparently it's "good enough".
        
               | hinkley wrote:
               | Don't ascribe positive feelings to popularity. I'm only
               | using git until the moment there's a viable alternative
               | written by someone who knows what DX is.
        
       | JZL003 wrote:
       | I also have a lot of issues with versioning data. But look at git
        | annex - it's free, self-hosted, and has a very simple underlying
        | data structure [1]. I don't even use the magic commands it has
        | for remote data mounting/multi-device coordination; I just back up
        | using basic S3 commands and can use rclone mounting. Very robust,
        | open source, and useful.
       | 
       | [1] When you run `git annex add` it hashes the file and moves the
       | original file to a `.git/annex/data` folder under the
       | hash/content addressable file system, like git. Then it replaces
       | the original file with a symlink to this hashed file path. The
       | file is marked as read only, so any command in any language which
       | tries to write to it will error (you can always `git annex
       | unlock` so you can write to it). If you have duplicated files,
       | they easily point to the same hashed location. As long as you git
       | push normally and back up the `.git/annex/data` you're totally
       | version controlled, and you can share the subset of files as
       | needed
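        | 
        | A toy sketch of that layout (Unix-only, with a stand-in hash and
        | store path rather than git-annex's real key format or directory
        | scheme):
        | 
        |     use std::collections::hash_map::DefaultHasher;
        |     use std::fs;
        |     use std::hash::{Hash, Hasher};
        |     use std::os::unix::fs::symlink;
        |     use std::path::Path;
        | 
        |     // Move a file into a content-addressed store and leave a
        |     // symlink behind; duplicate content is stored exactly once.
        |     fn annex_add(file: &Path, store: &Path) -> std::io::Result<()> {
        |         let data = fs::read(file)?;
        |         let mut h = DefaultHasher::new();
        |         data.hash(&mut h);
        |         let key = format!("{:016x}", h.finish()); // stand-in for SHA-256
        | 
        |         fs::create_dir_all(store)?;
        |         let dest = store.join(&key);
        |         if dest.exists() {
        |             fs::remove_file(file)?;   // duplicate content: drop this copy
        |         } else {
        |             fs::rename(file, &dest)?; // first copy: move the content in
        |             let mut perms = fs::metadata(&dest)?.permissions();
        |             perms.set_readonly(true); // accidental writes now fail
        |             fs::set_permissions(&dest, perms)?;
        |         }
        |         symlink(&dest, file)          // working-tree file -> stored object
        |     }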
        
         | kspacewalk2 wrote:
         | Sounds like `git annex` is file-level deduplication, whereas
         | this tool is block-level, but with some intelligent, context-
         | specific way of defining how to split up the data (i.e.
         | Content-Defined Chunking). For data management/versioning,
         | that's usually a big difference.
        
           | cma wrote:
           | If git annex stores large files uncompressed you could use
            | filesystem block-level deduplication in combination with it.
        
             | synergy20 wrote:
              | Can you be more specific here? Very interested.
        
               | dark-star wrote:
               | There are filesystems that support inline or post-process
               | deduplication. btrfs[1] and zfs[2] come to mind as free
               | ones, but there are also commercial ones like WAFL etc.
               | 
               | It's always a tradeoff. Deduplication is a CPU-heavy
               | process, and if it's done inline, it is also memory-
               | heavy, so you're basically trading CPU and memory for
               | storage space. It heavily depends on the use-case (and
               | the particular FS / deduplication implementation) whether
               | it's worth it or not
               | 
               | [1]:
               | https://btrfs.wiki.kernel.org/index.php/Deduplication
               | 
               | [2]: https://docs.oracle.com/cd/E36784_01/html/E39134/fsd
               | edup-1.h...
        
               | cma wrote:
               | One problem is if you need to support Windows clients.
               | Microsoft charges $1600 for deduplication support or
               | something like that: https://learn.microsoft.com/en-
               | us/windows-server/storage/dat...
        
               | mattpallissard wrote:
               | Yeah, which is great for storage but doesn't help over
               | the wire.
        
               | xmodem wrote:
               | ZFS at least supports sending a deduplicated stream.
        
               | mattpallissard wrote:
               | Right, and btrfs can send a compressed stream as well,
               | but we aren't sending raw filesystem data via VCS.
        
               | alchemist1e9 wrote:
               | zbackup is a great block level deduplication trick.
        
           | rsync wrote:
           | "Sounds like `git annex` is file-level deduplication, whereas
           | this tool is block-level ..."
           | 
           | I am not a user of git annex but I do know that it works
           | perfectly with an rsync.net account as a target:
           | 
           | https://git-
           | annex.branchable.com/forum/making_good_use_of_my...
           | 
           | ... _which means_ that you could do a _dumb mirror_ of your
           | repo(s) - perhaps just using rsync - and then let the ZFS
            | snapshots handle the versioning/rotation, which would give
           | you the benefits of _block level diffs_.
           | 
           | One additional benefit, beyond more efficient block level
           | diffs, is that the ZFS snapshots are immutable/readonly as
           | opposed to your 'git' or 'git annex' produced versions which
           | could be destroyed by Mallory ...
        
             | darau1 wrote:
             | > let the ZFS snapshots handle the versioning/rotation
             | which would give you the benefits of block level diffs
             | 
             | Can you explain this a bit? I don't know anything about
             | ZFS, but it sounds as though it creates snapshots based on
             | block level differences? Maybe a git-annex backend could be
             | written to take advantage of that -- I don't know.
        
           | unqueued wrote:
           | No, that is not correct, git-annex uses a variety of special
           | remotes[2], some of which support deduplication. Mentioned in
           | another comment[1]
           | 
           | When you have checked something out and fetched it, then it
           | consumes space on disk, but that is true with git-lfs, and
           | most other tools like it. It does NOT consume any space in
           | any git object files.
           | 
           | I regularly use a git-annex repo that contains about 60G of
           | files, which I can use with github or any git host, and uses
           | about 6G in its annex, and 1M in the actual git repo itself.
           | I chain git-annex to an internal .bup repo, so I can keep
           | track of the location, and benefit from dedup.
           | 
           | I honestly have not found anything that comes close to the
           | versatility of git-annex.
           | 
           | [1]: https://news.ycombinator.com/item?id=33976418
           | 
           | [2]: https://git-annex.branchable.com/special_remotes/
        
           | rajatarya wrote:
           | XetHub Co-founder here. Yes, one illustrative example of the
           | difference is:
           | 
           | Imagine you have a 500MB file (lastmonth.csv) where every day
           | 1MB is changed.
           | 
           | With file-based deduplication every day 500MB will be
           | uploaded, and all clones of the repo will need to download
           | 500MB.
           | 
           | With block-based deduplication, only around the 1MB that
           | changed is uploaded and downloaded.
        
             | unqueued wrote:
             | I combine git-annex with the bup special remote[1], which
             | lets me still externalize big files, while benefiting from
             | block level deduplication. Or depending on your needs, you
             | can just use a tool like bup[2] or borg directly. Bup
             | actually uses the git pack file format and git metadata.
             | 
             | I actually wrote a script which I'm happy to share, that
             | makes this much easier, and even lets you mount your bup
             | repo over .git/annex/objects for direct access.
             | 
             | [1]: https://git-
             | annex.branchable.com/walkthrough/using_bup/
             | 
             | [2]: https://github.com/bup/bup
        
             | civilized wrote:
             | Does that work equally well whether the changes are
             | primarily row-based or primarily column-based?
        
               | rajatarya wrote:
                | Yes, see this for more details on how XetHub
                | deduplication works: https://xethub.com/assets/docs/xet-
               | specifics/how-xet-dedupli...
        
               | prirun wrote:
               | HashBackup author here. Your question is (I think) about
               | how well block-based dedup functions on a database -
               | whether rows are changed or columns are changed. This
               | answer is how most block-based dedup software, including
               | HashBackup work.
               | 
               | Block-based dedup can be done either with fixed block
               | sizes or variable block sizes. For a database with fixed
               | page sizes, a fixed block size matching the page size is
               | most efficient. For a database with variable page sizes,
                | a variable block size will work better, assuming
               | the dedup "chunking" algorithm is fine-grained enough to
               | detect the database page size. For example, if the db
               | used a 4-6K variable page size and the dedup algo used a
               | 1M variable block size, it could not save a single
               | modified db page but would save more like 20 db pages
               | surrounding the modified page.
               | 
               | Your column vs row question depends on how the db stores
               | data, whether key fields are changed, etc. The main dedup
               | efficiency criteria are whether the changes are
               | physically clustered together in the file or whether they
               | are dispersed throughout the file, and how fine-grained
               | the dedup block detection algorithm is.
        
             | AustinDev wrote:
             | Have you tested this out with Unreal Engine blueprint
             | files? If you all can do block-based diffing on those, and
              | other binary assets used in game development, it'd be huge
             | for game development.
             | 
             | I have a couple ~1TB repositories I've had the misfortune
             | of working with using perforce in the past.
        
               | rajatarya wrote:
               | Not yet. Would be happy to try - can you point me to a
               | project to use?
               | 
               | Do you have a repo you could try us out with?
               | 
               | We have tried a couple Unity projects (41% smaller due to
                | deduplication) but not much from Unreal projects yet.
        
               | AustinDev wrote:
                | Most of my examples of that size are AAA game source that
                | I can't share. However, I think this is a project based on
                | Unreal that uses similar files; it should show whether there
                | is any benefit: https://github.com/CesiumGS/cesium-
                | unreal-samples (where the .umap binaries have been updated),
                | and in this example the .uasset blueprints have been
                | updated:
               | https://github.com/renhaiyizhigou/Unreal-Blueprint-
               | Project
        
         | timbotron wrote:
         | If you like git annex check out
         | [datalad](http://handbook.datalad.org/en/latest/), it provides
         | some useful wrappers around git annex oriented towards
         | scientific computing.
        
       | blobbers wrote:
       | ... why do you have 1TB of source code (you don't! mandatory
        | hacker snark). Is git really supposed to be used for data? Or is
       | this just a git-like interface for source control on data?
        
         | IshKebab wrote:
         | Git is only not "supposed" to be used for data because it
         | doesn't work very well with data by default. Not because that's
         | not a useful and sensible thing to want from a VCS.
        
           | TillE wrote:
           | It's a fundamentally bad idea because of how any DVCS works.
           | You really don't want to be dragging around gigabytes of
           | obsolete data forever.
           | 
           | Something like git-lfs is the appropriate solution. You need
           | a little bit of centralization.
        
             | IshKebab wrote:
              | Because of how _Git's current implementation of DVCS_
             | works. There's nothing fundamental about it. Git already
             | supports partial clones and on-demand checkouts in some
             | ways, it's just not very ergonomic.
             | 
             | All that's really needed is a way to mark individual files
             | as lazily fetched from a remote only when needed. LFS is a
             | hacky substandard way to emulate that behaviour. It should
             | be built in to Git.
        
         | stevelacy wrote:
         | Game development, especially Unreal engine, can produce repos
         | in excess of 1TB. Git LFS is used extensively for binary file
         | support.
        
       | Eleison23 wrote:
        
       | Aperocky wrote:
       | I see a lot of reasons to version code.
       | 
        | I see far fewer reasons to version data; in fact, I find reasons
        | _against_ versioning data and storing it in diffs.
        
         | treeman79 wrote:
          | Anything that might be audited. Being able to look at how
          | things were, how they changed, how they got to where they
          | currently are, and who did what is amazing for many
          | applications: finance, healthcare, elections, etc.
         | 
         | Well unless fraud is the goal.
        
           | bfm wrote:
           | Shameless plug for https://snapdir.org which focuses on this
           | particular use case using regular git and auditable plain
           | text manifests
        
         | zachmu wrote:
         | You're suffering from a failure of imagination, maybe because
         | you've never been able to version data usefully before. There
         | are already lots of interesting applications, and it's still
         | quite new.
         | 
         | https://www.dolthub.com/blog/2022-07-11-dolt-case-studies/
        
         | WorldMaker wrote:
         | Something that I've experienced from many years in enterprise
         | software: 90% of enterprise software is about versioning data
         | in some way. SharePoint is half as complicated as it is because
          | it has to be a massive document and data version manager. (Same
         | with Confluence and other competitors.) "Everyone" needs deep
         | audit logs for some likely overlap of SOX compliance, PCI
         | compliance, HIPAA compliance, and/or other industry specific
         | standards and practices. Most business analysts want accurate
         | "point in time" reporting tools to revisit data as it looked at
         | almost any point in the past, and if you don't build it for
         | them they often build it as ad hoc file cabinets full of Excel
         | export dumps for themselves.
         | 
         | The wheels of data versioning just get reinvented over and over
         | and over again, with all sorts of slightly different tools.
         | Most of the job of "boring CRUD app development" is data
         | version management and some of the "joy" is how every database
         | you ever encounter is often its own little snowflake with
         | respect to how it versions its data.
         | 
         | There have been times I've pined for being able to just store
         | it all in git and reduce things to a single paradigm. That
         | said, I'd never actually want to teach business analysts or
         | accountants how to _use_ git (and would probably spend nearly
         | as much time building custom CRUD apps against git as against
         | any other sort of database). There are times though where I
         | have thought for backend work  "if I could just checkout the
          | database at the right git tag instead of needing to write this
         | five table join SQL statement with these eighteen differently
         | named timestamp fields that need to be sorted in four different
         | ways...".
         | 
         | Reasons to version data are plenty and most of the data
         | versioning in the world is ad hoc and/or operationally
         | incompatible/inconsistent across systems. (Ever had to ETL
         | SharePoint lists and its CVC-based versioning with a timestamp
         | based data table? Such "fun".) I don't think git is necessarily
         | the savior here, though there remains some appeal in "I can use
         | the same systems I use for code" two birds with one stone.
         | Relatedly, content-addressed storage and/or merkle trees are a
         | growing tool for Enterprise and do look a lot like a git
         | repository and sometimes you also have the feeling like if you
         | are already using git why build your own merkle tree store when
         | git gives you a swiss army knife tool kit on top of that merkle
         | tree store.
        
         | ch71r22 wrote:
         | What are the reasons against?
        
           | ltbarcly3 wrote:
           | The lack of reasons for doing it IS the reason against. GIT
           | isn't a magic 'good way' to store arbitrary data, it's a good
           | way to collaborate on projects implemented using most
           | programming languages which store code as plain text broken
           | into short lines, where edits to non-sequential lines can
           | generally be applied concurrently without careful human
           | verification. That is an extremely specific use case, and
            | anything outside of that very specific use case leaves git
            | terrible and inefficient, giving almost no benefit despite
            | huge problems.
           | 
           | People in ML ops use git because they aren't very
           | sophisticated with programming professionally and they have
           | git available to them and they haven't run into the
           | consequences of using it to store large binary blobs, namely
           | that it becomes impossible to live with eventually and wastes
           | a huge amount of time and space.
           | 
           | ML didn't invent the need for large artifacts that can't be
           | versioned in source control but must be versioned with it,
           | but they don't know that because they are new to professional
           | programming and aren't familiar with how it's done.
        
             | ylow wrote:
             | Indeed, there is a lot of pain if you actually try to store
             | large binary data in git. But we managed to make that work!
             | So a question worth asking is how might things change IF
             | you can store large binary data in git??
        
               | ltbarcly3 wrote:
               | I think this is a foot-gun, it's a bad idea even if it
               | works great, and I doubt it works very well. You should
               | manage your build artifacts explicitly, not just jam them
               | in git along with the code that generates them because
               | you are already using it and you haven't thought it
               | through.
        
               | wpietri wrote:
               | I don't think you've made your case here. The practices
               | you describe are partly an artifact of computation,
               | bandwidth, and storage costs. But not the current ones,
               | the ones when git was invented more than 15 years ago. In
               | the short term, we have to conform to the computer's
               | needs. But in the long term, it has to be the other way
               | around.
        
               | ltbarcly3 wrote:
               | You're right! It makes way more sense, in the long run,
               | to abuse a tool like git in a way that it isn't designed
                | for and which it can't actually support, and then, instead
                | of actually using git, use a proprietary service that may or
               | may not be around in a week. Here I was thinking short
               | term.
        
               | Game_Ender wrote:
               | Xet's initial focus appears to be on data files used to
               | drive machine learning pipelines, not on any resulting
               | binaries.
        
               | sk0g wrote:
               | That is exactly what git-lfs is, a way to "version
               | control" binary files, by storing revisions - possibly
               | separately, while the actual repo contains text files +
               | "pointer" files that references a binary file.
               | 
               | It's not perfect, and still feels like a bit of a hack
                | compared to something like p4 for the context I use LFS
               | in (game dev), but it works, and doesn't require
               | expensive custom licenses when teams grow beyond an
               | arbitrary number like 3 or 5.
        
               | rajatarya wrote:
               | XetHub Co-founder here. Yes, we use the same Git
               | extension mechanism as Git LFS (clean/smudge filters) and
               | we store pointer files in the git repository. Unlike Git
               | LFS we do block-level deduplication (Git LFS does file-
               | level deduplication) and this can result in a significant
               | savings in storage and bandwidth.
               | 
               | As an example, a Unity game repo reduced in size by 41%
               | using our block-level deduplication vs Git LFS. Raw repo
               | was 48.9GB, Git LFS was 48.2GB, and with XetHub was
               | 28.7GB.
               | 
               | Why do you think using a Git-based solution is a hack
               | compared to p4? What part of the p4 workflow feels more
               | natural to you?
        
             | mardifoufs wrote:
             | I literally don't know anyone or any team in ML using git
             | as a data versioning tool. It doesn't even make sense to
             | me, and most mlops people I have talked to would agree. Is
             | that really the point of this tool? To be a general purpose
             | data store for mlops? I thought it is for very specialized
             | ML use cases. Because even 1TB isn't much for ML data
             | versioning
             | 
             | Mlops people are very aware of tools that are more suited
             | for the job... even too aware in fact. The entire field is
              | full of tools, databases, etc. to the point where it's hard
              | to make sense of it. So your comment is a bit weird to me.
        
               | ltbarcly3 wrote:
               | I think you'll find varying levels of maturity in ML ops.
               | Anyway I think we basically agree, if you use something
               | like this you aren't that mature, and if you are mature
               | you would avoid this thing.
        
         | oftenwrong wrote:
         | One use-case would be for including dependencies in a repo. For
         | example, it is common for companies to operate their own
         | artifact caches/mirrors to protect their access to artifacts
         | from npm, pypi, dockerhub, maven central, pkg.go.dev, etc. With
         | the ability to efficiently work with a big repo, it would be
         | possible to store the artifacts in git, saving the trouble of
         | having to operate artifact mirrors. Additionally, it guarantees
         | that the artifacts for a given, known-buildable revision are
         | available offline.
        
         | guardian5x wrote:
         | As always it depends on the application. It can definitely be
         | useful in some applications.
        
         | substation13 wrote:
         | Versioning data is great, but storing as diffs is inefficient
         | when 99% of the file changes each version.
        
           | reverius42 wrote:
           | We don't store as diffs, we store as snapshots -- and it's
           | efficient thanks to the way we do dedupe. See
           | https://xethub.com/assets/docs/how-xet-deduplication-works/
        
         | ylow wrote:
          | Cofounder/CEO here! I think it is less about "versioning" and
          | more about the ability to modify with confidence, knowing that
          | you can go back in time anytime. (Minor clarification: we are
          | not quite storing diffs; we hold snapshots just like Git, plus
          | a bunch of data dedupe.)
        
           | krageon wrote:
           | > the ability to modify with confidence knowing that you can
           | go back in time anytime
           | 
           | This is versioning
        
             | rafael09ed wrote:
             | Versioning is a technique. Backups, copy+paste+rename also
              | do it.
        
       | angrais wrote:
       | How's this differ from using git LFS?
        
         | ylow wrote:
         | We are _significantly_ faster? :-) Also, block-level dedupe,
         | scalability, perf, visualization, mounting, etc.
        
       | polemic wrote:
       | There seem to be a lot of data version control systems built
       | around ML pipelines or software development needs, but not so
       | much on the sort of data editing that happens outside of software
       | development & analysis.
       | 
       | Kart (https://kartproject.org) is built on git to provide data
       | version control for geospatial vector & tabular data. Per-row
       | (feature & attribute) version control and the ability to
       | collaborate with a team of people is sorely missing from those
       | workflows. It's focused on geographic use-cases, but you can work
       | with 'plain old tables' too, with MySQL, PostgreSQL and MSSQL
       | working copies (you don't have to pick - you can push and pull
       | between them).
        
       | dandigangi wrote:
        | One monorepo to rule them all, and in the darkness pull them.
       | - Gandalf, probably
        
         | irrational wrote:
         | And in the darkness merge conflicts.
        
       | amelius wrote:
       | Does this fix the problem that Git becomes unreasonably slow when
       | you have large binary files in the repo?
       | 
       | Also, why can't Git show me an accurate progress-bar while
       | fetching?
        
         | reverius42 wrote:
         | Mostly! (At the moment, it doesn't fully fix the slowdown
         | associated with storing large binary files, but reduces it by
          | 90-99%. We're working on getting that closer to 100% by
         | moving even the Merkle Tree storage outside the git repo
         | contents.)
         | 
         | As for why git can't show you an accurate progress bar while
         | fetching (specifically when using an extension like git-lfs or
         | git-xet), this has to do with the way git extensions work --
         | each file gets "cleaned" by the extension through a Unix pipe,
         | and the protocol for that is too simple to reflect progress
         | information back to the user. In git-xet, we do write a
         | percent-complete to stdout so you get some more info (but a
         | real progress bar would be nice).
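          | 
          | For context, the clean side of that pipe is just a process git
          | runs per file (wired up via the filter.<driver>.clean setting):
          | it reads the file's bytes from stdin and prints a small pointer
          | to stdout, which is what git stores. Toy pointer format below,
          | not git-xet's or git-lfs's actual one:
          | 
          |     use std::collections::hash_map::DefaultHasher;
          |     use std::hash::{Hash, Hasher};
          |     use std::io::{self, Read, Write};
          | 
          |     fn main() -> io::Result<()> {
          |         let mut data = Vec::new();
          |         io::stdin().read_to_end(&mut data)?; // whole file from git
          | 
          |         // A real filter would chunk, dedupe, and upload here.
          |         let mut h = DefaultHasher::new();
          |         data.hash(&mut h);
          | 
          |         // Only this pointer ends up in the git object database.
          |         let mut out = io::stdout();
          |         writeln!(out, "# toy pointer file")?;
          |         writeln!(out, "oid {:016x}", h.finish())?;
          |         writeln!(out, "size {}", data.len())?;
          |         Ok(())
          |     }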
        
       | Game_Ender wrote:
       | The tl;dr is that "xet" is like GitLFS (it stores pointers in
       | Git, with the data in a remote server and uses smudge filters to
       | make this transparent) with some additional features:
       | 
       | - Automatically includes all files >256KB in size
       | 
        | - By default data is de-duplicated in 16KB chunks instead of whole
       | files (with the ability to customize this per file type).
       | 
       | - Has a "mount" command to allow read-only browse without
       | downloading
       | 
       | When launching on HN it would be better if the team was a bit
       | more transparent with the internals. I get that "we made a better
       | GitLFS" doesn't market as well. But you can couple that with a
        | credible vision and story about how you are better and where
        | you are headed next. Instead this reads mostly like marketing
        | speak: "trust our magic solution to solve your problem".
        
         | nightpool wrote:
         | These details seemed.... really clear to me from the post the
         | OP made? Did you just not read it, or have they updated it
         | since you commented?
         | 
         | (excerpt from the OP post:
         | 
         | > Unlike Git LFS, we don't just store the files. We use
         | content-defined chunking and Merkle Trees to dedupe against
         | everything in history. This allows small changes in large files
         | to be stored compactly. Read more here:
         | https://xethub.com/assets/docs/how-xet-deduplication-works)
        
       | culanuchachamim wrote:
       | Maybe a silly question:
       | 
        | Why do you need 1TB for repos? What do you store inside, besides
       | code and some images?
        
         | layer8 wrote:
         | Some docker images? ;)
        
         | lazide wrote:
         | A whole lot of images?
         | 
         | I personally would love to be able to store datasets next to
         | code for regression testing, easier deployment, easier dev
         | workstation spin up, etc.
        
           | culanuchachamim wrote:
           | Still, 1TB?
           | 
            | Once you get to that amount of images, it would be much easier
            | to manage it with some file storage solution.
            | 
            | Or am I missing something important?
        
             | lazide wrote:
             | All of them require having some sort of parallel
             | authentication, synchronization, permissions management,
             | change tracking, etc.
             | 
             | Which is a huge hassle, and a lot of work I'd rather not
             | do.
             | 
             | My current photogrammetry dataset is well over 1TB, and it
             | isn't a lot for the industry by any stretch of the
             | imagination.
             | 
             | In fact, the only thing that considers it 'a lot' and is
             | hard to work with is git.
        
         | dafelst wrote:
         | Repositories for games are often larger than 1TB, and with
         | things like UE5's Nanite becoming more viable, they're only
         | going to get bigger.
        
       | Wojtkie wrote:
       | Can I upload a full .pbix file to this and use it for versioning?
       | If so, I'd use it in a heartbeat.
        
         | ylow wrote:
         | CEO/Cofounder here. We are file format agnostic and will
         | happily take everything. Not too familiar with the needs around
         | pbix, but please do try it out and let us know what you think!
        
       | COMMENT___ wrote:
       | What about SVN?
       | 
       | Besides other features, Subversion supports representation
       | sharing. So adding new textual or binary files with identical
       | data won't increase the size of your repository.
       | 
       | I'm not familiar with ML data sets, but it seems that SVN may
       | work great with them. It already works great for huge and small
       | game dev projects.
        
       | iFire wrote:
       | https://github.com/facebook/sapling is doing good work and they
       | are suggesting their git server for large repositories exists.
        
       | wnzl wrote:
        | Just in case you are wondering about alternatives: there is
       | Unity's Plastic https://unity.com/products/plastic-scm which
       | happens to use bidirectional sync with git. I'm curious how this
       | solution compares to it! I'll definitely give it a try over the
       | weekend!
        
         | ziml77 wrote:
         | I was already upset about Codice Software pulling Semantic
         | Merge and only making it available as an integrated part of
         | Plastic SCM. Now that I see the reason such a useful tool was
         | taken away was to stuff the pockets of a large company, I'm
         | fuming.
         | 
         | I know that they're well within their rights to do this as they
         | only ever offered subscription licensing for Semantic Merge,
         | but that doesn't make it suck less to lose access.
        
       | web007 wrote:
       | Please consider https://sso.tax/ before making that an
       | "enterprise" feature.
        
         | IshKebab wrote:
         | I mean yeah, that's working as intended surely? Some of those
         | price differences are pretty egregious but in general companies
         | have to actually make money, and charging more for features
         | that are mainly needed by richer customers is a very obvious
         | thing to do.
        
           | mdaniel wrote:
           | I believe the counter-argument is that they should charge for
           | _features_ but that security should be available to anyone.
           | Imagine if  "passwords longer than 6 chars: now only $8/mo!"
           | 
           | That goes double for products where paying for "enterprise"
           | is _only_ to get SAML, which at least in my experience causes
           | me to go shopping for an entirely different product because I
           | view it as extortion
        
             | IshKebab wrote:
          | Security _is_ available for everyone. It's centralised
             | security that can be easily managed by IT that isn't.
             | 
          | I don't see an issue with charging more for SSO, though as I
          | said some of the prices are egregious.
        
         | Alifatisk wrote:
         | Very sad to see Bitwarden in this list
        
       | unqueued wrote:
       | I have a 1.96 TB git repo:
       | https://github.com/unqueued/repo.macintoshgarden.org-fileset (It
        | is a mirror of a Macintosh abandonware site.)
        | 
        |     git annex info .
       | 
       | Of course, it uses pointer files for the binary blobs that are
       | not going to change much anyway.
       | 
       | And the datalad project has neuro imaging repos that are tens of
       | TB in size.
       | 
       | Consider whether you actually need to track differences in all of
       | your files. Honestly git-annex is one of the most powerful tools
       | I have ever used. You can use git for tracking changes in text,
       | but use a different system for tracking binaries.
       | 
       | I love how satisfying it is to be able to store the index for
       | hundreds of gigs of files on a floppy disk if I wanted.
        
       | bastardoperator wrote:
       | I actually encountered a 4TB git repo. After pulling all the
       | binary shit out of it the repo was actually 200MB. Anything that
       | promotes treating git like a filesystem is a bad idea in my
       | opinion.
        
         | frognumber wrote:
         | Yes... and no. The git userspace is horrible for this. The git
         | data model is wonderful.
         | 
         | The git userspace would need to be able to easily:
         | 
         | 1. Not grab all files
         | 
          | 2. Not grab the whole version history
         | 
         | ... and that's more-or-less it. At that point, it'd do great
         | with large files.
        
           | ylow wrote:
            | Exactly. For the giant repo use case, we have a mount feature
           | that will let you get a filesystem mount of any repo at any
           | commit very very quickly.
        
       | TacticalCoder wrote:
       | What does a Merkle Tree bring here? (honest question) I mean: for
       | content-based addressing of chunks (and hence deduplication of
       | these chunks), a regular tree works too if I'm not mistaken (I
       | may be wrong but I literally wrote a "deduper" splitting files
       | into chunks and using content-based addressing to dedupe the
       | chunks: but I just used a dumb tree).
       | 
        | Is the Merkle tree used because it brings something other than
       | deduplication, like chunks integrity verification or something
       | like that?
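        | 
        | (For reference, my rough mental model of what the tree adds over a
        | flat chunk index: every interior node hashes its children, so two
        | versions can be compared by skipping identical subtrees, and any
        | chunk can be verified against a single root. A sketch, not
        | necessarily how XetHub uses it:)
        | 
        |     use std::collections::hash_map::DefaultHasher;
        |     use std::hash::{Hash, Hasher};
        | 
        |     fn h64<T: Hash>(x: &T) -> u64 {
        |         let mut h = DefaultHasher::new();
        |         x.hash(&mut h);
        |         h.finish()
        |     }
        | 
        |     // Hash pairs of child hashes upward until one root remains.
        |     fn merkle_root(mut level: Vec<u64>) -> u64 {
        |         while level.len() > 1 {
        |             level = level.chunks(2).map(|p| h64(&p.to_vec())).collect();
        |         }
        |         level[0]
        |     }
        | 
        |     fn main() {
        |         let v1 = vec![h64(&"c0"), h64(&"c1"), h64(&"c2"), h64(&"c3")];
        |         let mut v2 = v1.clone();
        |         v2[3] = h64(&"c3-changed"); // one chunk differs
        |         // Roots differ, but the c0/c1 subtree hashes identically in
        |         // both versions, so a tree diff can stop descending there.
        |         assert_ne!(merkle_root(v1), merkle_root(v2));
        |     }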
        
       | V1ndaar wrote:
       | You say you support up to 1TB repositories, but from your pricing
       | page all I see is the free tier for up to 20GB and one for teams.
        | The latter doesn't have a price, only a contact option, and I
        | assume it will likely be too expensive for an individual.
       | 
       | As someone who'd love to put their data into a git like system,
       | this sounds pretty interesting. Aside from not offering a tier
       | for someone like me who would maybe have a couple of repositories
        | of size O(250GB), it's unclear how e.g. bandwidth would work &
       | whether other people could simply mount and clone the full repo
       | if desired for free etc.
        
         | rajatarya wrote:
         | XetHub Co-founder here. We are still trying to figure out
         | pricing and would love to understand what sort of pricing tier
         | would work for you.
         | 
         | In general, we are thinking about usage-based pricing (which
         | would include bandwidth and storage) - what are your thoughts
         | for that?
         | 
         | Also, where would you be mounting your repos from? We have
         | local caching options that can greatly reduce the overall
         | bandwidth needed to support data center workloads.
        
           | V1ndaar wrote:
           | Thanks for the reply!
           | 
           | Generally usage based pricing sounds fair. In the end for
           | cases like mine where it's "read rarely, but should be
            | available publicly long term" it would need to compete with
           | pricing offered by the big cloud providers.
           | 
           | I'm about to leave my academic career and I'm thinking about
           | how to make sure all my detector data will be available to
           | other researchers in my field in the future. Aside from the
           | obvious candidate https://zenodo.org it's an annoying problem
           | as usually most universities I'm familiar with only archive
           | data internally, which is hard to access for researchers from
           | different institutions. As I don't want to rely on a single
           | place to have that data available I'm looking for an
           | additional alternative (that I'm willing to pay for out of my
           | own pocket, it just shouldn't be a financial burden).
           | 
            | In particular, while still taking data a couple of years ago, I
            | would have loved being able to commit each day's data taking
           | in the same way as I commit code. That way having things
           | timestamped, backed up and all possible notes that came up
           | that day associated straight in the commit message would have
           | been very nice.
           | 
           | Regarding mounting I don't have any specific needs there
           | anymore. Just thinking about how other researchers would be
           | able to clone the repo to access the data.
        
           | blagie wrote:
           | My preferences on pricing.
           | 
           | First, it's all open-source, so I can take it and run it.
           | Second, you provide a hosted service, and by virtue of being
           | the author, you're the default SaaS host. You charge a
           | premium over AWS fees for self-hosting, which works out to:
           | 
           | 1. Enough to sustain you.
           | 
           | 2. Less than the cost of doing dev-ops myself (AWS fees +
           | engineer).
           | 
           | 3. A small premium over potential cut-rate competitors.
           | 
           | You offer value-added premium services too. Whether that's
           | economically viable, I don't know.
        
       ___________________________________________________________________
       (page generated 2022-12-13 23:01 UTC)