[HN Gopher] Show HN: We scaled Git to support 1 TB repos ___________________________________________________________________ Show HN: We scaled Git to support 1 TB repos I've been in the MLOps space for ~10 years, and data is still the hardest unsolved open problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a 3rd location like Slack or GDocs. This is why we built XetHub, a platform that enables teams to treat data like code, using Git. Unlike Git LFS, we don't just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client & cache + storage) and our web application is written in Go. XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. Even at 1 TB, we know downloading an entire repository is painful, so we built git-xet mount - which, in seconds, provides a user-mode filesystem view over the repo. XetHub is available today (Linux & Mac today, Windows coming soon) and we would love your feedback! Read more here: - https://xetdata.com/blog/2022/10/15/why-xetdata - https://xetdata.com/blog/2022/12/13/introducing-xethub Author : reverius42 Score : 194 points Date : 2022-12-13 15:14 UTC (7 hours ago) (HTM) web link (xethub.com) (TXT) w3m dump (xethub.com) | timsehn wrote: | Founder of DoltHub here. One of my team members pointed me at this | thread. Congrats on the launch. Great to see more folks tackling | the data versioning problem. | | Dolt hasn't come up here yet, probably because we're focused on | OLTP use cases, not MLOps, but we do have some customers using | Dolt as the backing store for their training data. | | https://github.com/dolthub/dolt | | Dolt also scales to the 1TB range and offers you full SQL query | capabilities on your data and diffs. | ylow wrote: | CEO/Cofounder here. Thanks! Agreed, we think data versioning is | an important problem and we are at related, but opposite parts | of the space. (BTW we really wanted gitfordata.com. Or perhaps | we can split the domain? OLTP goes here, unstructured data goes | there :-) Shall we chat?) | chubot wrote: | Can it be used to store container images (Docker)? As far as I | remember they are just compressed tar files. Does the compression | defeat Xet's own chunking? | | Can you sync to another machine without XetHub? | | How about cleaning up old files? | ylow wrote: | Yeah... The compression does defeat the chunking (your mileage | may vary. We do a small amount of dedupe in some experiments | but never quite investigated it in detail.). That said, we have | experimental file-type-specific preprocessors / chunkers, so we | could potentially do something about tar.gz. Not something we | have explored much yet. | the_arun wrote: | Signed up & browsed the "Flickr30k" repo (auto-generated) & it | was really slow for me. Like CSV, does it also support other | data formats like JSON, YAML, etc.? | ylow wrote: | We are file format agnostic and you should be able to put | anything in the repo. We have special support for CSV files for | visualizations. Sorry for the UI perf... there are a lot of | optimizations we need to work on. | amadvance wrote: | How is data split into chunks? Just curious.
| [deleted] | sesm wrote: | They mention 'content-defined chunking', but as far as I | understand it, it requires different chunking algorithms for | different content types. Does it support plugins for chunking | different file formats? | ylow wrote: | Today we just have a variation of FastCDC in production, but | we have alternate experimental chunkers for some file formats | (ex: a heuristic chunker for CSV files that will enable | almost free subsampling). Hope to have them enter production | in the next 6 months. | sesm wrote: | That's interesting. Can a CSV chunker make adding a column | not affect all of the chunks? | ylow wrote: | The simplest approach really is to chunk row-wise, so adding | columns will unfortunately rewrite all the chunks. If you | have a parquet file, adding columns will be cheap. | ylow wrote: | CEO/Cofounder here! Content-defined chunking. Specifically a | variation of FastCDC. We have a paper coming out soon with a | lot more technical details. | subnetwork wrote: | This feels like something that is prime for abuse. I agree with | @bastardoperator, treating git as file storage is going to go | nowhere good. | ledauphin wrote: | The link takes me to a login page. It would be nice to see that | fixed to somehow match the title. | reverius42 wrote: | Visit https://xetdata.com for more info! (Sorry, can't edit the | post link now.) | jrockway wrote: | There are a couple of other contenders in this space. DVC | (https://dvc.org/) seems most similar. | | If you're interested in something you can self-host... I work on | Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't | have a Git-like interface, but also implements data versioning. | Our approach de-duplicates between files (even very small files), | and our storage algorithm doesn't create objects proportional to | O(n) directory nesting depth as Xet appears to. (Xet is very much | like Git in that respect.) | | The data versioning system enables us to run pipelines based on | changes to your data; the pipelines declare what files they read, | and that allows us to schedule processing jobs that only | reprocess new or changed data, while still giving you a full view | of what "would" have happened if all the data had been | reprocessed. This, to me, is the key advantage of data | versioning; you can save hundreds of thousands of dollars on | compute. Being able to undo an oopsie is just icing on the cake. | | Xet's system for mounting a remote repo as a filesystem is a good | idea. We do that too :) | ylow wrote: | By the way, our mount mechanism has one very interesting | novelty. It does not depend on a FUSE driver on Mac :-) | jrockway wrote: | That's smart! I think users have to install a kext still? | ylow wrote: | Nope. No kernel driver needed :-) We wrote a localhost NFS | server. | catiopatio wrote: | Based on unfsd or entirely in-house? | ylow wrote: | Entirely in-house. In Rust! | catiopatio wrote: | Fancy! That's awesome. | chubot wrote: | Is DVC useful/efficient at storing container images (Docker)? | As far as I remember they are just compressed tar files. Does | the compression defeat its chunking / differential compression? | | How about cleaning up old versions? | ylow wrote: | We have found pointer files to be _surprisingly_ efficient as | long as you don't have to actually materialize those files. | (Git's internals are actually very well done.) Our mount | mechanism does avoid materializing pointer files, which makes it | pretty fast even for repos with a very large number of files.
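To make the chunking discussion above concrete, here is a minimal sketch of content-defined chunking in the spirit of FastCDC. It is illustrative only, not XetHub's production chunker; the Gear table, the 4 KB / 64 KB size bounds, and the ~16 KB average target are all assumptions.

    # Content-defined chunking sketch (Gear-style rolling hash, in the
    # spirit of FastCDC). Illustrative assumptions, not XetHub's real code.
    import hashlib
    import random

    random.seed(0)                               # fixed table, repeatable runs
    GEAR = [random.getrandbits(64) for _ in range(256)]
    MIN_SIZE, MAX_SIZE = 4 * 1024, 64 * 1024     # chunk size bounds (assumed)
    AVG_MASK = (1 << 14) - 1                     # ~16 KB average chunk size

    def chunks(data: bytes):
        h, start = 0, 0
        for i, b in enumerate(data):
            # Old bytes shift out of the 64-bit hash, so a boundary depends
            # only on a small window of recent content, not on file offset.
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            if (size >= MIN_SIZE and (h & AVG_MASK) == 0) or size >= MAX_SIZE:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def store(data: bytes, cas: dict) -> list:
        # Content-addressed store: a file becomes a list of chunk hashes,
        # and only chunks not already present consume new space.
        keys = []
        for c in chunks(data):
            k = hashlib.sha256(c).hexdigest()
            cas.setdefault(k, c)
            keys.append(k)
        return keys

Because boundaries are chosen by content rather than by file offset, an edit in the middle of a large file only rewrites the chunk or two around it; the data after the edit re-aligns to the same boundaries, hashes to the same keys, and dedupes against everything already in history.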
| unqueued wrote: | For bigger annex repos with lots of pointer files, I just | disable the git-annex smudge filters. Consider whether smudge | filters are a requirement or a convenience. The smudge filter | interface does not scale that well at all. | Izmaki wrote: | If I had to "version control" a 1 TB large repo - and assuming I | wouldn't quit in anger - I would use a tool which is built for | this kind of need and has been used in the industry for decades: | Perforce. | ryneandal wrote: | This was my thought as well. Perforce has its own issues, but | is an industry standard in game dev for a reason: it can handle | immense amounts of data. | Phelinofist wrote: | What does immense mean in the context of game dev? | llanowarelves wrote: | On "real" (AA/AAA) games? Easily hundreds of gigabytes or | several terabytes of raw assets + project files. | | Sometimes even individual art project files can be many | gigabytes each. I saw a .psd that was 30GB because of the | embedded hi-res reference images. | | You can throw pretty much anything in there, in one place, | with things like locking, partial checkout, etc., which gets | artists to use it. | hinkley wrote: | Perforce also has support for proxies, right? It's not | just the TB of data, it's all of your coworkers in a | branch office having to pull all the updates first thing | in the morning. If each person has to pull from origin, | that's a lot of bandwidth, and wasted mornings. If the | first person in pays and everyone else gets it off the | LAN, then you have a better situation. | mentos wrote: | I work in gamedev and think Perforce is good but far from | great. Would love to see someone bring some competition to the | space; maybe XetHub can. | tinco wrote: | So, you wouldn't consider using a new tool that someone | developed to solve the same problem despite an older solution | already existing? Your advice to that someone is to just use | the old solution? | TylerE wrote: | When the new solution involves voluntary use of git? Not just | yea, but hell yes. I hate git. | xur17 wrote: | Why do you hate git? I've been pretty happy with it for | code, and wouldn't mind being able to use it for data | repositories as well. | TylerE wrote: | Is it really worth re-hashing at this point? Reams have | been written about the UX. | xur17 wrote: | It's used by the vast majority of software engineers, so | apparently it's "good enough". | hinkley wrote: | Don't ascribe positive feelings to popularity. I'm only | using git until the moment there's a viable alternative | written by someone who knows what DX is. | JZL003 wrote: | I also have a lot of issues with versioning data. But look at git | annex - it's free, self-hosted and has a very simple underlying | data structure [1]. So I don't even use the magic commands it has | for remote data mounting/multi-device coordination; I just back up | using basic S3 commands and can use rclone mounting. Very robust, | open source, and useful. | | [1] When you run `git annex add` it hashes the file and moves the | original file to a `.git/annex/data` folder under the | hash/content addressable file system, like git. Then it replaces | the original file with a symlink to this hashed file path. The | file is marked as read-only, so any command in any language which | tries to write to it will error (you can always `git annex | unlock` so you can write to it). If you have duplicated files, | they easily point to the same hashed location.
As long as you git | push normally and back up the `.git/annex/data` you're totally | version controlled, and you can share the subset of files as | needed. | kspacewalk2 wrote: | Sounds like `git annex` is file-level deduplication, whereas | this tool is block-level, but with some intelligent, context- | specific way of defining how to split up the data (i.e. | Content-Defined Chunking). For data management/versioning, | that's usually a big difference. | cma wrote: | If git annex stores large files uncompressed you could use | filesystem block level deduplication in combination with it. | synergy20 wrote: | can you be more specific here? Very interested. | dark-star wrote: | There are filesystems that support inline or post-process | deduplication. btrfs[1] and zfs[2] come to mind as free | ones, but there are also commercial ones like WAFL etc. | | It's always a tradeoff. Deduplication is a CPU-heavy | process, and if it's done inline, it is also memory- | heavy, so you're basically trading CPU and memory for | storage space. It heavily depends on the use-case (and | the particular FS / deduplication implementation) whether | it's worth it or not. | | [1]: | https://btrfs.wiki.kernel.org/index.php/Deduplication | | [2]: https://docs.oracle.com/cd/E36784_01/html/E39134/fsdedup-1.h... | cma wrote: | One problem is if you need to support Windows clients. | Microsoft charges $1600 for deduplication support or | something like that: https://learn.microsoft.com/en-us/windows-server/storage/dat... | mattpallissard wrote: | Yeah, which is great for storage but doesn't help over | the wire. | xmodem wrote: | ZFS at least supports sending a deduplicated stream. | mattpallissard wrote: | Right, and btrfs can send a compressed stream as well, | but we aren't sending raw filesystem data via VCS. | alchemist1e9 wrote: | zbackup is a great block level deduplication trick. | rsync wrote: | "Sounds like `git annex` is file-level deduplication, whereas | this tool is block-level ..." | | I am not a user of git annex but I do know that it works | perfectly with an rsync.net account as a target: | | https://git-annex.branchable.com/forum/making_good_use_of_my... | | ... _which means_ that you could do a _dumb mirror_ of your | repo(s) - perhaps just using rsync - and then let the ZFS | snapshots handle the versioning/rotation which would give | you the benefits of _block level diffs_. | | One additional benefit, beyond more efficient block level | diffs, is that the ZFS snapshots are immutable/readonly as | opposed to your 'git' or 'git annex' produced versions which | could be destroyed by Mallory ... | darau1 wrote: | > let the ZFS snapshots handle the versioning/rotation | which would give you the benefits of block level diffs | | Can you explain this a bit? I don't know anything about | ZFS, but it sounds as though it creates snapshots based on | block level differences? Maybe a git-annex backend could be | written to take advantage of that -- I don't know. | unqueued wrote: | No, that is not correct; git-annex uses a variety of special | remotes[2], some of which support deduplication. Mentioned in | another comment[1] | | When you have checked something out and fetched it, then it | consumes space on disk, but that is true with git-lfs, and | most other tools like it. It does NOT consume any space in | any git object files.
| | I regularly use a git-annex repo that contains about 60G of | files, which I can use with GitHub or any git host, and uses | about 6G in its annex, and 1M in the actual git repo itself. | I chain git-annex to an internal .bup repo, so I can keep | track of the location, and benefit from dedup. | | I honestly have not found anything that comes close to the | versatility of git-annex. | | [1]: https://news.ycombinator.com/item?id=33976418 | | [2]: https://git-annex.branchable.com/special_remotes/ | rajatarya wrote: | XetHub Co-founder here. Yes, one illustrative example of the | difference is: | | Imagine you have a 500MB file (lastmonth.csv) where every day | 1MB is changed. | | With file-based deduplication, every day 500MB will be | uploaded, and all clones of the repo will need to download | 500MB. | | With block-based deduplication, only around the 1MB that | changed is uploaded and downloaded. | unqueued wrote: | I combine git-annex with the bup special remote[1], which | lets me still externalize big files while benefiting from | block level deduplication. Or depending on your needs, you | can just use a tool like bup[2] or borg directly. Bup | actually uses the git pack file format and git metadata. | | I actually wrote a script which I'm happy to share, that | makes this much easier, and even lets you mount your bup | repo over .git/annex/objects for direct access. | | [1]: https://git-annex.branchable.com/walkthrough/using_bup/ | | [2]: https://github.com/bup/bup | civilized wrote: | Does that work equally well whether the changes are | primarily row-based or primarily column-based? | rajatarya wrote: | Yes, see this for more details of how XetHub deduplication | works: https://xethub.com/assets/docs/xet-specifics/how-xet-dedupli... | prirun wrote: | HashBackup author here. Your question is (I think) about | how well block-based dedup functions on a database, whether | rows are changed or columns are changed. This answer describes | how most block-based dedup software, including HashBackup, | works. | | Block-based dedup can be done either with fixed block | sizes or variable block sizes. For a database with fixed | page sizes, a fixed block size matching the page size is | most efficient. For a database with variable page sizes, | a variable block size will work better, assuming the | dedup "chunking" algorithm is fine-grained enough to | detect the database page size. For example, if the db | used a 4-6K variable page size and the dedup algo used a | 1M variable block size, it could not save a single | modified db page but would save more like 20 db pages | surrounding the modified page. | | Your column vs row question depends on how the db stores | data, whether key fields are changed, etc. The main dedup | efficiency criteria are whether the changes are | physically clustered together in the file or whether they | are dispersed throughout the file, and how fine-grained | the dedup block detection algorithm is. | AustinDev wrote: | Have you tested this out with Unreal Engine blueprint | files? If you all can do block-based diffing on those and | other binary assets used in game development, it'd be huge. | | I have a couple ~1TB repositories I've had the misfortune | of working with using Perforce in the past. | rajatarya wrote: | Not yet. Would be happy to try - can you point me to a | project to use? | | Do you have a repo you could try us out with?
| | We have tried a couple Unity projects (41% smaller due to | deduplication) but not much from Unreal projects yet. | AustinDev wrote: | Most of my examples of that size are AAA game source that | I can't share. However, I think this is a project based on | Unreal using similar files. It should show if there is any | benefit: https://github.com/CesiumGS/cesium-unreal-samples, | where the .umap binaries have been updated, and in this | example, where the .uasset blueprints have been updated: | https://github.com/renhaiyizhigou/Unreal-Blueprint-Project | timbotron wrote: | If you like git annex check out | [datalad](http://handbook.datalad.org/en/latest/), it provides | some useful wrappers around git annex oriented towards | scientific computing. | blobbers wrote: | ... why do you have 1TB of source code (you don't! mandatory | hacker snark) Is git really supposed to be used for data? Or is | this just a git-like interface for source control on data? | IshKebab wrote: | Git is only not "supposed" to be used for data because it | doesn't work very well with data by default. Not because that's | not a useful and sensible thing to want from a VCS. | TillE wrote: | It's a fundamentally bad idea because of how any DVCS works. | You really don't want to be dragging around gigabytes of | obsolete data forever. | | Something like git-lfs is the appropriate solution. You need | a little bit of centralization. | IshKebab wrote: | Because of how _Git's current implementation of DVCS_ | works. There's nothing fundamental about it. Git already | supports partial clones and on-demand checkouts in some | ways, it's just not very ergonomic. | | All that's really needed is a way to mark individual files | as lazily fetched from a remote only when needed. LFS is a | hacky substandard way to emulate that behaviour. It should | be built in to Git. | stevelacy wrote: | Game development, especially Unreal Engine, can produce repos | in excess of 1TB. Git LFS is used extensively for binary file | support. | Eleison23 wrote: | Aperocky wrote: | I see a lot of reasons to version code. | | I see far fewer reasons to version data; in fact, I find reasons | _against_ versioning data and storing it in diffs. | treeman79 wrote: | Anything that might be audited. Being able to look at how | things were, how they changed, how they got to where they | currently are, and who did what is amazing for many | applications. Finance, healthcare, elections, etc. | | Well, unless fraud is the goal. | bfm wrote: | Shameless plug for https://snapdir.org which focuses on this | particular use case using regular git and auditable plain | text manifests. | zachmu wrote: | You're suffering from a failure of imagination, maybe because | you've never been able to version data usefully before. There | are already lots of interesting applications, and it's still | quite new. | | https://www.dolthub.com/blog/2022-07-11-dolt-case-studies/ | WorldMaker wrote: | Something that I've experienced from many years in enterprise | software: 90% of enterprise software is about versioning data | in some way. SharePoint is half as complicated as it is because | it has to be a massive document and data version manager. (Same | with Confluence and other competitors.) "Everyone" needs deep | audit logs for some likely overlap of SOX compliance, PCI | compliance, HIPAA compliance, and/or other industry-specific | standards and practices.
Most business analysts want accurate | "point in time" reporting tools to revisit data as it looked at | almost any point in the past, and if you don't build that for | them they often build it as ad hoc file cabinets full of Excel | export dumps for themselves. | | The wheels of data versioning just get reinvented over and over | and over again, with all sorts of slightly different tools. | Most of the job of "boring CRUD app development" is data | version management, and some of the "joy" is how every database | you ever encounter is often its own little snowflake with | respect to how it versions its data. | | There have been times I've pined for being able to just store | it all in git and reduce things to a single paradigm. That | said, I'd never actually want to teach business analysts or | accountants how to _use_ git (and would probably spend nearly | as much time building custom CRUD apps against git as against | any other sort of database). There are times though where I | have thought for backend work "if I could just check out the | database at the right git tag instead of needing to write this | five-table join SQL statement with these eighteen differently | named timestamp fields that need to be sorted in four different | ways...". | | Reasons to version data are plenty, and most of the data | versioning in the world is ad hoc and/or operationally | incompatible/inconsistent across systems. (Ever had to ETL | SharePoint lists and their CVC-based versioning with a | timestamp-based data table? Such "fun".) I don't think git is | necessarily the savior here, though there remains some appeal | in "I can use the same systems I use for code": two birds with | one stone. Relatedly, content-addressed storage and/or merkle | trees are a growing tool for enterprise and do look a lot like | a git repository, and sometimes you have the feeling that if | you are already using git, why build your own merkle tree store | when git gives you a swiss army knife tool kit on top of one? | ch71r22 wrote: | What are the reasons against? | ltbarcly3 wrote: | The lack of reasons for doing it IS the reason against. Git | isn't a magic 'good way' to store arbitrary data; it's a good | way to collaborate on projects implemented using most | programming languages which store code as plain text broken | into short lines, where edits to non-sequential lines can | generally be applied concurrently without careful human | verification. That is an extremely specific use case, and | anything outside of that very specific use case leaves git | terrible and inefficient, giving almost no benefit despite | huge problems. | | People in ML ops use git because they aren't very | sophisticated with programming professionally and they have | git available to them and they haven't run into the | consequences of using it to store large binary blobs, namely | that it becomes impossible to live with eventually and wastes | a huge amount of time and space. | | ML didn't invent the need for large artifacts that can't be | versioned in source control but must be versioned with it, | but they don't know that because they are new to professional | programming and aren't familiar with how it's done. | ylow wrote: | Indeed, there is a lot of pain if you actually try to store | large binary data in git. But we managed to make that work! | So a question worth asking is: how might things change IF | you can store large binary data in git?
| ltbarcly3 wrote: | I think this is a foot-gun; it's a bad idea even if it | works great, and I doubt it works very well. You should | manage your build artifacts explicitly, not just jam them | in git along with the code that generates them because | you are already using it and you haven't thought it | through. | wpietri wrote: | I don't think you've made your case here. The practices | you describe are partly an artifact of computation, | bandwidth, and storage costs. But not the current ones; | the ones from when git was invented more than 15 years | ago. In the short term, we have to conform to the | computer's needs. But in the long term, it has to be the | other way around. | ltbarcly3 wrote: | You're right! It makes way more sense, in the long run, | to abuse a tool like git in a way that it isn't designed | for and which it can't actually support, and then, | instead of actually using git, use a proprietary service | that may or may not be around in a week. Here I was | thinking short term. | Game_Ender wrote: | Xet's initial focus appears to be on data files used to | drive machine learning pipelines, not on any resulting | binaries. | sk0g wrote: | That is exactly what git-lfs is: a way to "version | control" binary files by storing revisions, possibly | separately, while the actual repo contains text files + | "pointer" files that reference a binary file. | | It's not perfect, and still feels like a bit of a hack | compared to something like p4 for the context I use LFS | in (game dev), but it works, and doesn't require | expensive custom licenses when teams grow beyond an | arbitrary number like 3 or 5. | rajatarya wrote: | XetHub Co-founder here. Yes, we use the same Git | extension mechanism as Git LFS (clean/smudge filters) and | we store pointer files in the git repository. Unlike Git | LFS we do block-level deduplication (Git LFS does file- | level deduplication) and this can result in significant | savings in storage and bandwidth. | | As an example, a Unity game repo reduced in size by 41% | using our block-level deduplication vs Git LFS. Raw repo | was 48.9GB, Git LFS was 48.2GB, and with XetHub was | 28.7GB. | | Why do you think using a Git-based solution is a hack | compared to p4? What part of the p4 workflow feels more | natural to you? | mardifoufs wrote: | I literally don't know anyone or any team in ML using git | as a data versioning tool. It doesn't even make sense to | me, and most mlops people I have talked to would agree. Is | that really the point of this tool? To be a general-purpose | data store for mlops? I thought it was for very specialized | ML use cases. Because even 1TB isn't much for ML data | versioning. | | Mlops people are very aware of tools that are more suited | for the job... even too aware, in fact. The entire field is | full of tools, databases, etc. to the point where it's hard | to make sense of it. So your comment is a bit weird to me. | ltbarcly3 wrote: | I think you'll find varying levels of maturity in ML ops. | Anyway I think we basically agree: if you use something | like this you aren't that mature, and if you are mature | you would avoid this thing. | oftenwrong wrote: | One use-case would be for including dependencies in a repo. For | example, it is common for companies to operate their own | artifact caches/mirrors to protect their access to artifacts | from npm, pypi, dockerhub, maven central, pkg.go.dev, etc.
With | the ability to efficiently work with a big repo, it would be | possible to store the artifacts in git, saving the trouble of | having to operate artifact mirrors. Additionally, it guarantees | that the artifacts for a given, known-buildable revision are | available offline. | guardian5x wrote: | As always, it depends on the application. It can definitely be | useful in some applications. | substation13 wrote: | Versioning data is great, but storing as diffs is inefficient | when 99% of the file changes each version. | reverius42 wrote: | We don't store as diffs, we store as snapshots -- and it's | efficient thanks to the way we do dedupe. See | https://xethub.com/assets/docs/how-xet-deduplication-works/ | ylow wrote: | Cofounder/CEO here! I think it's less about "versioning" and | more about the ability to modify with confidence, knowing that | you can go back in time anytime. (Minor clarification: we are | not quite storing diffs; holding snapshots just like Git + a | bunch of data dedupe) | krageon wrote: | > the ability to modify with confidence knowing that you can | go back in time anytime | | This is versioning | rafael09ed wrote: | Versioning is a technique. Backups and copy+paste+rename also | do it. | angrais wrote: | How does this differ from using Git LFS? | ylow wrote: | We are _significantly_ faster? :-) Also, block-level dedupe, | scalability, perf, visualization, mounting, etc. | polemic wrote: | There seem to be a lot of data version control systems built | around ML pipelines or software development needs, but not so | much around the sort of data editing that happens outside of | software development & analysis. | | Kart (https://kartproject.org) is built on git to provide data | version control for geospatial vector & tabular data. Per-row | (feature & attribute) version control and the ability to | collaborate with a team of people are sorely missing from those | workflows. It's focused on geographic use-cases, but you can work | with 'plain old tables' too, with MySQL, PostgreSQL and MSSQL | working copies (you don't have to pick - you can push and pull | between them). | dandigangi wrote: | One monorepo to rule them all and in the darkness pull them. | - Gandalf, probably | irrational wrote: | And in the darkness merge conflicts. | amelius wrote: | Does this fix the problem that Git becomes unreasonably slow when | you have large binary files in the repo? | | Also, why can't Git show me an accurate progress bar while | fetching? | reverius42 wrote: | Mostly! (At the moment, it doesn't fully fix the slowdown | associated with storing large binary files, but reduces it by | 90-99%. We're working on improving that to closer to 100% by | moving even the Merkle Tree storage outside the git repo | contents.) | | As for why git can't show you an accurate progress bar while | fetching (specifically when using an extension like git-lfs or | git-xet), this has to do with the way git extensions work -- | each file gets "cleaned" by the extension through a Unix pipe, | and the protocol for that is too simple to reflect progress | information back to the user. In git-xet, we do write a | percent-complete to stdout so you get some more info (but a | real progress bar would be nice).
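To illustrate the clean/smudge mechanism described above: Git pipes each file through the filter, the "clean" side stashes the real bytes elsewhere and emits a small pointer into the object database, and the "smudge" side reverses this on checkout. Below is a toy sketch in Python with hypothetical names (toy_filter.py, a local ~/.toy-store standing in for remote storage); real git-lfs and git-xet filters do far more.

    # toy_filter.py: a toy clean/smudge filter showing how git-lfs-style
    # extensions hook into Git. A sketch with assumed names, not git-xet code.
    #
    # Wiring uses the same extension points git-lfs and git-xet rely on:
    #   git config filter.toy.clean  "toy_filter.py clean"
    #   git config filter.toy.smudge "toy_filter.py smudge"
    #   echo '*.bin filter=toy' >> .gitattributes
    import hashlib, os, sys

    STORE = os.path.expanduser("~/.toy-store")  # stand-in for remote storage

    def clean():
        # Git pipes the working-tree file in; stash the bytes by hash and
        # emit a small pointer, which is all that enters the Git object DB.
        data = sys.stdin.buffer.read()
        digest = hashlib.sha256(data).hexdigest()
        os.makedirs(STORE, exist_ok=True)
        with open(os.path.join(STORE, digest), "wb") as f:
            f.write(data)
        sys.stdout.write(f"toy-pointer-v1\nsha256:{digest}\nsize:{len(data)}\n")

    def smudge():
        # On checkout, Git pipes the pointer in and expects the real bytes out.
        digest = sys.stdin.read().splitlines()[1].split(":", 1)[1]
        with open(os.path.join(STORE, digest), "rb") as f:
            sys.stdout.buffer.write(f.read())

    if __name__ == "__main__":
        clean() if sys.argv[1] == "clean" else smudge()

Since the filter sees one file at a time over a plain pipe, there is no channel for aggregate progress, which is why an accurate overall progress bar is hard to provide.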
| Game_Ender wrote: | The tl;dr is that "xet" is like Git LFS (it stores pointers in | Git, with the data in a remote server, and uses smudge filters to | make this transparent) with some additional features: | | - Automatically includes all files >256KB in size | | - By default data is de-duplicated in 16KB chunks instead of | whole files (with the ability to customize this per file type). | | - Has a "mount" command to allow read-only browsing without | downloading | | When launching on HN it would be better if the team was a bit | more transparent with the internals. I get that "we made a better | Git LFS" doesn't market as well. But you can couple that with a | credible vision and story about how you are better and where | you are headed next. Instead this is mostly closer to market | speak of "trust our magic solution to solve your problem". | nightpool wrote: | These details seemed.... really clear to me from the post the | OP made? Did you just not read it, or have they updated it | since you commented? | | (excerpt from the OP post: | | > Unlike Git LFS, we don't just store the files. We use | content-defined chunking and Merkle Trees to dedupe against | everything in history. This allows small changes in large files | to be stored compactly. Read more here: | https://xethub.com/assets/docs/how-xet-deduplication-works) | culanuchachamim wrote: | Maybe a silly question: | | Why do you need 1TB for repos? What do you store inside, besides | code and some images? | layer8 wrote: | Some docker images? ;) | lazide wrote: | A whole lot of images? | | I personally would love to be able to store datasets next to | code for regression testing, easier deployment, easier dev | workstation spin-up, etc. | culanuchachamim wrote: | Still, 1TB? | | Once you get to that amount of images it would be much easier | to manage it with some file storage solution. | | Or am I missing something important? | lazide wrote: | All of them require having some sort of parallel | authentication, synchronization, permissions management, | change tracking, etc. | | Which is a huge hassle, and a lot of work I'd rather not | do. | | My current photogrammetry dataset is well over 1TB, and it | isn't a lot for the industry by any stretch of the | imagination. | | In fact, the only thing that considers it 'a lot' and is | hard to work with is git. | dafelst wrote: | Repositories for games are often larger than 1TB, and with | things like UE5's Nanite becoming more viable, they're only | going to get bigger. | Wojtkie wrote: | Can I upload a full .pbix file to this and use it for versioning? | If so, I'd use it in a heartbeat. | ylow wrote: | CEO/Cofounder here. We are file format agnostic and will | happily take everything. Not too familiar with the needs around | pbix, but please do try it out and let us know what you think! | COMMENT___ wrote: | What about SVN? | | Besides other features, Subversion supports representation | sharing. So adding new textual or binary files with identical | data won't increase the size of your repository. | | I'm not familiar with ML data sets, but it seems that SVN may | work great with them. It already works great for huge and small | game dev projects. | iFire wrote: | https://github.com/facebook/sapling is doing good work and they | are suggesting their git server for large repositories exists. | wnzl wrote: | Just in case you are wondering about alternatives: there is | Unity's Plastic https://unity.com/products/plastic-scm which | happens to use bidirectional sync with git.
I'm curious how this | solution compares to it! I'll definitely give it a try over the | weekend! | ziml77 wrote: | I was already upset about Codice Software pulling Semantic | Merge and only making it available as an integrated part of | Plastic SCM. Now that I see the reason such a useful tool was | taken away was to stuff the pockets of a large company, I'm | fuming. | | I know that they're well within their rights to do this as they | only ever offered subscription licensing for Semantic Merge, | but that doesn't make it suck less to lose access. | web007 wrote: | Please consider https://sso.tax/ before making that an | "enterprise" feature. | IshKebab wrote: | I mean yeah, that's working as intended, surely? Some of those | price differences are pretty egregious, but in general companies | have to actually make money, and charging more for features | that are mainly needed by richer customers is a very obvious | thing to do. | mdaniel wrote: | I believe the counter-argument is that they should charge for | _features_ but that security should be available to anyone. | Imagine if "passwords longer than 6 chars: now only $8/mo!" | | That goes double for products where paying for "enterprise" | is _only_ to get SAML, which at least in my experience causes | me to go shopping for an entirely different product because I | view it as extortion. | IshKebab wrote: | Security _is_ available for everyone. It's centralised | security that can be easily managed by IT that isn't. | | I don't see an issue with charging more for SSO, though as I | said some of the prices are egregious. | Alifatisk wrote: | Very sad to see Bitwarden in this list. | unqueued wrote: | I have a 1.96 TB git repo: | https://github.com/unqueued/repo.macintoshgarden.org-fileset (It | is a mirror of a Macintosh abandonware site) git | annex info . | | Of course, it uses pointer files for the binary blobs that are | not going to change much anyway. | | And the datalad project has neuroimaging repos that are tens of | TB in size. | | Consider whether you actually need to track differences in all of | your files. Honestly git-annex is one of the most powerful tools | I have ever used. You can use git for tracking changes in text, | but use a different system for tracking binaries. | | I love how satisfying it is to be able to store the index for | hundreds of gigs of files on a floppy disk if I wanted to. | bastardoperator wrote: | I actually encountered a 4TB git repo. After pulling all the | binary shit out of it the repo was actually 200MB. Anything that | promotes treating git like a filesystem is a bad idea in my | opinion. | frognumber wrote: | Yes... and no. The git userspace is horrible for this. The git | data model is wonderful. | | The git userspace would need to be able to easily: | | 1. Not grab all files | | 2. Not grab the whole version history | | ... and that's more-or-less it. At that point, it'd do great | with large files. | ylow wrote: | Exactly. For the giant repo use case, we have a mount feature | that will let you get a filesystem mount of any repo at any | commit very, very quickly. | TacticalCoder wrote: | What does a Merkle Tree bring here? (honest question) I mean: for | content-based addressing of chunks (and hence deduplication of | these chunks), a regular tree works too if I'm not mistaken (I | may be wrong but I literally wrote a "deduper" splitting files | into chunks and using content-based addressing to dedupe the | chunks: but I just used a dumb tree).
| | Is the Merkle tree used because it brings something other than | deduplication, like chunk integrity verification or something | like that? | V1ndaar wrote: | You say you support up to 1TB repositories, but from your pricing | page all I see is the free tier for up to 20GB and one for teams. | The latter doesn't have a price, only a contact option, and I | assume it will likely be too expensive for an individual. | | As someone who'd love to put their data into a git-like system, | this sounds pretty interesting. Aside from not offering a tier | for someone like me who would maybe have a couple of repositories | of size O(250GB), it's unclear how e.g. bandwidth would work & | whether other people could simply mount and clone the full repo | for free if desired, etc. | rajatarya wrote: | XetHub Co-founder here. We are still trying to figure out | pricing and would love to understand what sort of pricing tier | would work for you. | | In general, we are thinking about usage-based pricing (which | would include bandwidth and storage) - what are your thoughts | on that? | | Also, where would you be mounting your repos from? We have | local caching options that can greatly reduce the overall | bandwidth needed to support data center workloads. | V1ndaar wrote: | Thanks for the reply! | | Generally usage-based pricing sounds fair. In the end, for | cases like mine where it's "read rarely, but should be | available publicly long term", it would need to compete with | pricing offered by the big cloud providers. | | I'm about to leave my academic career and I'm thinking about | how to make sure all my detector data will be available to | other researchers in my field in the future. Aside from the | obvious candidate https://zenodo.org it's an annoying problem, | as usually most universities I'm familiar with only archive | data internally, which is hard to access for researchers from | different institutions. As I don't want to rely on a single | place to have that data available I'm looking for an | additional alternative (that I'm willing to pay for out of my | own pocket; it just shouldn't be a financial burden). | | In particular, while still taking data a couple of years ago I | would have loved being able to commit each day's data taking | in the same way as I commit code. That way having things | timestamped, backed up, and all possible notes that came up | that day associated straight in the commit message would have | been very nice. | | Regarding mounting, I don't have any specific needs there | anymore. Just thinking about how other researchers would be | able to clone the repo to access the data. | blagie wrote: | My preferences on pricing: | | First, it's all open-source, so I can take it and run it. | Second, you provide a hosted service, and by virtue of being | the author, you're the default SaaS host. You charge a | premium over AWS fees for self-hosting, which works out to: | | 1. Enough to sustain you. | | 2. Less than the cost of doing dev-ops myself (AWS fees + | engineer). | | 3. A small premium over potential cut-rate competitors. | | You offer value-added premium services too. Whether that's | economically viable, I don't know. ___________________________________________________________________ (page generated 2022-12-13 23:01 UTC)