[HN Gopher] Data Version Control
___________________________________________________________________

Data Version Control

Author : HerrMonnezza
Score  : 131 points
Date   : 2022-10-01 16:09 UTC (1 day ago)

(HTM) web link (dvc.org)
(TXT) w3m dump (dvc.org)

  | throwawaybutwhy wrote:
  | The package phones home. One has to set an env var or fix several
  | lines of code to prevent that.
    | shcheklein wrote:
    | Hey, yes, we've decided to keep it opt-out for now, and it
    | collects fully anonymized basic statistics. Here is the full
    | policy: https://dvc.org/doc/user-guide/analytics
    |
    | It should be easy to opt out through `dvc config core.analytics
    | false` or an env variable `DVC_ANALYTICS=False`.
    |
    | Could you please clarify the `several lines of code` part? We
    | were trying to make it very open and visible what we collect
    | (it prints a large message when it starts) and to make it easy
    | to disable.
      | prepend wrote:
      | This seems pretty anti-user, since most users prefer opt-in.
      | It seems pretty shady to keep in behavior that users don't
      | like and that potentially harms them (you think it's fully
      | anonymized).
      |
      | That's your prerogative, as it's your project, but it makes
      | me wonder what else you're doing that's against users' best
      | interests and in your own.
        | shcheklein wrote:
        | We are fully aware that it raises concerns. Trust me, it
        | hurts my feelings as well. E.g. on the websites (dvc.org,
        | cml.dev, etc.) we don't use any cookies, GA, etc.
        |
        | We've tried to make it as open as possible - the code is
        | available (it's open source), we write openly about this at
        | the very start, we have a policy online, and we made it
        | easy to opt out. If you have other ideas for how to make it
        | even more friendly and more visible, please let us know.
        |
        | Still, we've preferred so far to keep it opt-out since it's
        | crucial for us to see major product trends (which features
        | are being used more, product growth MoM, etc.). Opt-in at
        | this stage realistically won't give us this information.
          | prepend wrote:
          | Yet there are many successful projects that don't collect
          | this information. So it's not crucial for them, but it is
          | crucial for you.
          |
          | I think the challenge I have is that since you're getting
          | the IP address, there will be an opportunity for abuse.
          | And there seems to be some rule that any data that can be
          | misused will eventually be misused.
          |
          | Since you're not willing to make it opt-in, I think
          | perhaps the only other way would be to support an
          | automated distro that doesn't include it, so users are at
          | least able to easily choose a version.
          |
          | I admire you for responding to this thread and to me, as
          | it's definitely not easy. I just feel like one of the
          | main benefits of open source is its alignment with user
          | benefits, so it's discouraging when an open source
          | project chooses code that users don't want.
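
For readers who want the opt-out in one place, this is roughly what
the two mechanisms mentioned above look like (a minimal sketch based
on the commands quoted in the thread; the `--global` variant is an
assumption about `dvc config` scoping, so verify it against the docs):

    # persist the opt-out in the current repo's DVC config
    dvc config core.analytics false
    # assumed: set it once for every repo on this machine
    dvc config --global core.analytics false
    # or disable it for the current shell session only
    export DVC_ANALYTICS=False
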
            | shcheklein wrote:
            | Right, many projects use opt-in; there are many that
            | have opt-out, though:
            |
            | https://docs.brew.sh/Analytics
            | https://docs.npmjs.com/policies/privacy#how-does-npm-
            | collect... VS Code, etc.
            |
            | > I think the challenge I have is that since you're
            | getting the IP address, there will be an opportunity
            | for abuse.
            |
            | Yes! And we are migrating to a new package /
            | infrastructure because of this -
            | https://github.com/iterative/telemetry-python (DVC's
            | sister tool MLEM is already on it; it's not sending
            | (saving) IP addresses, nor using GA or any other
            | third-party tools; data is saved into BigQuery, and
            | eventually we'll make it publicly accessible -
            | https://mlem.ai/doc/user-guide/analytics - to be fully
            | GDPR compliant). It's a legacy system that DVC had in
            | place. There was no intention to use those IP addresses
            | in any way.
            |
            | > I think perhaps the only other way would be to
            | support an automated distro that doesn't include it, so
            | users are at least able to easily choose a version.
            |
            | Thanks. To some extent a brew-like policy (not sending
            | anything significant before there is a chance to
            | disable it, with a clear, explicit message) should
            | mitigate this, but I'll check whether it works this way
            | now and whether it can be improved.
  | [deleted]
  | sva_ wrote:
  | I wondered how they'll make money:
  |
  | https://www.crunchbase.com/organization/iterative-ai/company...
    | nerdponx wrote:
    | I think their plan was/is to make money on corporate licenses
    | and support, as well as SaaS/cloud products.
    | machinekob wrote:
    | They won't; they can make the investor money back only by
    | selling the company to Amazon/Microsoft/Google, but in this
    | economy it won't happen.
  | adhocmobility wrote:
  | If you just want a git for large data files, and your files don't
  | get updated too often (e.g. an ML model deployed in production
  | which gets updated every month), then git-lfs is a nice solution.
  | Bitbucket and GitHub both have support for it.
    | kortex wrote:
    | I've used both extensively. Git-lfs has always been a
    | nightmare. Because each tracked large file can be in one of two
    | states - binary, or "pointer" - it's super easy for the folder
    | to get all fouled up. It would be unable to "clean" or
    | "smudge", since either would cause some conflict. If you
    | accidentally pushed in the wrong state, you could "infect" the
    | remote and be really hosed. I had this happen numerous times
    | over about 2 years of using lfs, and each time the only
    | solution was some aggressive rewriting of history.
    |
    | That, combined with the nature of re-using the same filename
    | for the metadata files, meant that it was common for folks to
    | commit the binary and push it. Again, lots of history rewriting
    | to get git sizes back down.
    |
    | Maybe there exist solutions to my problems, but I had spent
    | hours wrestling with it trying to fix these bad states, and it
    | caused me much distress.
    |
    | Also, configuring the backing store was generally more painful,
    | especially if you needed >2GB.
    |
    | DVC was easy to use from the first moment. The separate meta
    | files meant that it can't get into mixed clean/smudge states.
    | If you aren't in a cloud workflow already, the backing store
    | was a bit tricky, but even without AWS I made it work.
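
The "separate meta files" workflow kortex describes is DVC's standard
add/push cycle; a minimal sketch (the remote name, bucket, and paths
are placeholders):

    # hash the data, move it into .dvc/cache, and write a small
    # data/images.dvc pointer file next to the data
    dvc add data/images
    # only the pointer and the updated .gitignore enter git history
    git add data/images.dvc data/.gitignore
    git commit -m "Track image dataset with DVC"
    # configure a storage backend and upload the cached blobs
    dvc remote add -d myremote s3://my-bucket/dvc-store
    dvc push

Because the pointer file lives beside the data rather than replacing
it, there is no clean/smudge filter that can leave the checkout in a
mixed state.
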
    | kernelsanderz wrote:
    | I do feel like git-lfs is a good solution. But once you have
    | 10s or 100s of GB of files (e.g. a computer vision project),
    | this gets pretty pricey.
    |
    | Ideally I'd love to use git-lfs on top of S3, directly. I've
    | looked into git-annex and various git-lfs proxies, but I'm not
    | sure they're maintained well enough to trust them with
    | long-term data storage.
    |
    | Huggingface datasets are built on git-lfs, and it works really
    | well for them for storage of large datasets. Ideally I'd love
    | for AWS to offer this as a hosted thin layer on top of S3, or
    | for some well-funded or supported community effort to do the
    | same, and in a performant way.
    |
    | If you know of any such solution, please let me know!
    | simonw wrote:
    | It seems to be the solution Hugging Face have picked too.
  | LaserToy wrote:
  | Can it be used for large and fast-changing datasets?
  |
  | Example: 100 TB, writes every 10 mins.
  |
  | Or: 1 TB, parquet, 40% is rewritten daily.
    | nerdponx wrote:
    | DVC is expressly for tracking artifacts that are files on disk,
    | and only by comparing their MD5 hashes. So it can definitely
    | track the parquet files, but you are not going to get row or
    | field diffs or anything like that.
    |
    | Maybe Pachyderm or Dolt would be better tools here.
      | bumblebritches5 wrote:
      | AlotOfReading wrote:
      | Why would you use MD5 in anything written in the last 5
      | years? The SHA family is faster on modern hardware, and there
      | aren't trivial collisions floating around out there.
        | kortex wrote:
        | It was definitely a bad choice. I wasn't there, so I can
        | only speculate. My guess is that it is sort of ubiquitous
        | and thus low-hanging fruit and the devs didn't know better,
        | or the related corollary: it's what S3 uses for ETags, so
        | it probably seemed logical. Either way, it seems like
        | someone did it without knowing better, no one agrees on a
        | fix or on whether a change is even necessary, and thus it's
        | stuck for now.
        |
        | There's an ongoing discussion about replacing/configuring
        | the hash function, and it looks like there might be some
        | movement toward replacing the hash and other speedups in
        | 3.0:
        |
        | https://github.com/iterative/dvc/issues/3069
        |
        | > We not only want to switch to a different algorithm in
        | 3.0, but to also provide better
        | performance/ui/architecture/ecosystem for data management,
        | and all of that while not seizing releases with new
        | features (experiements, dvc machine, plots, etc) and bug
        | fixes for 2.0, so we've been gradually rebuilding that and
        | will likely be ready for 3.0 in the upcoming months. -
        | https://github.com/iterative/dvc/issues/3069#issuecomment-93...
        | nerdponx wrote:
        | Don't quote me on the specific hash algorithm, maybe it's
        | SHA. The point is that it's just comparing modification
        | times and hashes.
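
For the curious, the comparison nerdponx describes is plain content
hashing; a rough sketch of what DVC records for a tracked file (the
file name and hash are illustrative, and the cache layout shown is
the DVC 2.x scheme, so check it against your version):

    # DVC stores the file's MD5 in the .dvc metafile
    md5sum data/train.parquet
    # -> 1a2b3c4d...  data/train.parquet
    # the blob itself is cached content-addressably under
    #   .dvc/cache/1a/2b3c4d...
    # on later runs an unchanged hash means nothing is re-copied
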
      | snthpy wrote:
      | What about Apache Iceberg for those?
  | tomthe wrote:
  | Can anyone compare this to DataLad [1], which someone introduced
  | to me as "git for data"?
  |
  | [1] https://www.datalad.org/
    | remram wrote:
    | It doesn't use git-annex like DataLad does. That alone is a
    | huge benefit, given the state of that tool.
      | imiric wrote:
      | I'm curious, what's the problem with git-annex?
      |
      | I've considered using it before as an alternative to Git LFS.
        | niccl wrote:
        | Things that I don't like about it:
        |
        | * git diff doesn't work in any sensible way
        |
        | * if you forget and do `git add` instead of `git annex
        | add`, everything is fine, but you've now spoilt the nice
        | thing that git annex does of de-duping files (git annex
        | only stores one copy of identical files)
        |
        | * for our use case (which I'm sure is the wrong way of
        | doing things) it's possible to overwrite the single copy of
        | a file that git annex stores, which rather spoils the point
        | of the thing. I do think it's down to the way we use it,
        | though, so it's not specifically a git annex problem.
        |
        | The _great_ thing about git annex is it can be self-hosted.
        | For various reasons we can't put our source data in one of
        | the systems that uses git-lfs.
        |
        | We've got about 800 GB of data in git annex, and I've been
        | happy with it despite the limitations.
          | hpfr wrote:
          | If you configure annex.largefiles, git add should work
          | with the annex. I start with something like
          | git annex config --set annex.largefiles 'largerthan=1kb
          | and (not (mimeencoding=us-ascii or mimeencoding=utf-8))'
          |
          | > By default, git-annex add adds all files to the annex
          | (except dotfiles), and git add adds files to git (unless
          | they were added to the annex previously). When
          | annex.largefiles is configured, both git annex add and
          | git add will add matching large files to the annex, and
          | the other files to git. --https://git-
          | annex.branchable.com/git-annex/
          |
          | Note that git add will add large files unlocked, though,
          | since (as far as I understand) it's assumed you're still
          | modifying them, for safety:
          |
          | > If you use git add to add a file to the annex, it will
          | be added in unlocked form from the beginning. This allows
          | workflows where a file starts out unlocked, is modified
          | as necessary, and is locked once it reaches its final
          | version. --https://git-annex.branchable.com/git-annex-
          | unlock/
          | remram wrote:
          | Yes, it definitely serves a valid use-case; I feel like
          | someone should try to bring some competition there. A
          | modern equivalent with fewer gotchas, maybe in Rust/Go,
          | maybe using a fuse mount and content-defined chunking
          | (borg/restic/...-style), would be amazing.
            | kernelsanderz wrote:
            | I'd love to see a well-supported git-lfs compatible
            | client/proxy (so you could more easily move backends)
            | that could run on top of S3/object storage. Yes, and
            | written in a modern language like golang/rust for
            | performance / parallelism. There are some node.js and
            | various other git-lfs proxies out there, but they're
            | not well enough maintained that I could count on them
            | being around and working in another 5 years. git-annex
            | at least has been around for a while, even though it
            | has its issues.
            |
            | Huggingface uses git-lfs for large datasets with good
            | success. git-lfs on GitHub gets very pricey at higher
            | volumes of data. I would love the affordability of
            | object storage, just with a better git blob storage
            | interface, that will be around in the future.
            |
            | Most of these systems do their own hash calculations
            | and are not interchangeable with each other. I feel
            | like git-lfs has the momentum in data science at the
            | moment, but it needs some better options for people who
            | want a low-cost storage option that they can control.
            |
            | Huggingface is great, but it's one more service to
            | onboard if you're in an enterprise. And data
            | privacy/retention/governance means that many people
            | would like their data to reside on their own
            | infrastructure.
            |
            | If AWS were to give us a low-cost git-lfs hosted
            | service on top of S3, it would be very popular.
            |
            | If anyone knows of some good alternatives, please let
            | us know!
            | kernelsanderz wrote:
            | Did some more research to see if anything had changed
            | in this space. I found two interesting projects
            | (haven't used them myself yet, though):
            |
            | One in C# (with support for auth):
            | https://github.com/alanedwardes/Estranged.Lfs
            |
            | One in Rust (but no auth; you have to run a reverse
            | proxy): https://github.com/jasonwhite/rudolfs
            |
            | Both seem interesting. Anyone use these?
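
Pointing a repo at a self-hosted endpoint like these is mostly
client-side configuration; a minimal sketch (the endpoint URL and
file pattern are placeholders for wherever Estranged.Lfs or rudolfs
is deployed, and auth setup will vary by proxy):

    # record the custom LFS endpoint in a committed .lfsconfig file
    git config -f .lfsconfig lfs.url https://lfs.example.com/my-org/my-repo
    # then the usual git-lfs flow
    git lfs install
    git lfs track "*.ckpt"
    git add .gitattributes .lfsconfig
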
      | remram wrote:
      | It lives in this weird wiki that seems to be read-only most
      | of the time. I don't think it's alive. Its use of hard links
      | also causes too many problems, of the silent-corruption
      | variety.
        | hpfr wrote:
        | Ikiwiki's definitely a bit weird, but I've been
        | experimenting with git-annex recently, and it worked fine
        | every time I commented. Seems like it's chugging along:
        | https://git-annex.branchable.com/recentchanges/
        |
        | When does it use hard links? As far as I remember, it used
        | symlinks unless you used something like annex.hardlink
        | (described in the man page:
        | https://git-annex.branchable.com/git-annex/)
    | benhurmarcel wrote:
    | And what about Dolt?
    |
    | https://docs.dolthub.com/introduction/what-is-dolt
      | shcheklein wrote:
      | Dolt is for tabular data. It's like SQLite but with branching
      | and versioning at the DB level. DVC is file-based. It saves
      | large files, directories, etc. to one of the supported
      | storages - S3, GCP, Azure, etc. It's more like git-lfs in
      | that sense.
      |
      | Another difference is that for DVC, (surprisingly) data
      | versioning itself is just one of the fundamental layers
      | needed to provide holistic ML experiment tracking and
      | versioning. So DVC has a layer to describe an ML project, run
      | it, and capture and version inputs/outputs. In that sense DVC
      | becomes a more opinionated / high-level tool, if that makes
      | sense.
  | bs7280 wrote:
  | What value does this provide that I can't get by versioning my
  | data in partitioned parquet files on S3?
    | shcheklein wrote:
    | I think parquet won't help with images, video, or ML models.
    |
    | Also, one thing is to physically provide a way to version data
    | (e.g. partitioned parquet files, cloud versioning, etc.), but
    | another is to have a mechanism for saving / codifying the
    | dataset version into the project. E.g., to answer the question
    | of which version of the data a model was built with, you would
    | need to save some identifier / hash / list of files that were
    | used. DVC takes care of that part as well.
    |
    | (It also has mechanics to cache data that you download,
    | Makefile-like pipelines, etc.)
  | smeagull wrote:
  | I don't think this tool can encompass everything you need to
  | manage ML models and data sets, even if you limit it to
  | versioning data.
  |
  | I'd need such a tool to manage features, checkpoints and labels.
  | This doesn't do any of that. Nor does it really handle merging
  | multiple versions of data.
  |
  | And I'd really like the code to be handled separately from the
  | data. Git is not the place to do this, because the choice of
  | picking pairs of code and data should happen at a higher level
  | and be tracked along with the results - that's not going in a
  | repo - MLFlow or Tensorboard handles it better.
  | lizen_one wrote:
  | DVC had the following problems when I tested it (half a year
  | ago):
  |
  | It gets super slow (waiting minutes) when there are a few
  | thousand files tracked. Thousands of files have to be tracked if
  | you have, e.g., a 10 GB file per day and region, plus artifacts
  | generated from them.
  |
  | You are encouraged (it only can track artifacts) if you model
  | your pipeline in DVC (think like make). However, it cannot run
  | tasks in parallel. So it takes a lot of time to run a pipeline
  | even while you are on a beefy machine and only one core is used.
  | Obviously, you cannot run other tools (e.g. snakemake) to
  | distribute/parallelize on multiple machines. Running one (part of
  | a) stage also has some overhead, because it does commits/checks
  | after/before running the executable of the task.
  |
  | Sometimes you get merge conflicts if you run a (partially
  | parametrized) stage on one machine and the other part on another
  | machine manually. These are cumbersome to fix.
  |
  | Currently, I think they are more focused on ML features like
  | experiment tracking (I prefer other, more mature tools here)
  | instead of performance and data safety.
  |
  | There is an alternative implementation from a single developer (I
  | cannot find it right now) that fixes some problems. However, I do
  | not use it because it probably will not have the same development
  | progress and testing as DVC.
  |
  | This sounds negative, but I think it is currently one of the best
  | tools in this space.
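
For context on the make-like modelling being criticized: a DVC
pipeline is declared in dvc.yaml and executed with `dvc repro`. A
minimal sketch (stage names, scripts, and paths are placeholders):

    # a two-stage pipeline; dvc repro runs stages sequentially in
    # dependency order, re-running a stage only when its deps change
    cat > dvc.yaml <<'EOF'
    stages:
      prepare:
        cmd: python prepare.py data/raw data/clean
        deps: [prepare.py, data/raw]
        outs: [data/clean]
      train:
        cmd: python train.py data/clean model.pkl
        deps: [train.py, data/clean]
        outs: [model.pkl]
    EOF
    dvc repro
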
    | DougBTX wrote:
    | What's best if parallel step processing is required?
    | jdoliner wrote:
    | DVC is great for use cases that don't get to this scale or have
    | these needs, and the issues here are non-trivial to solve. I've
    | spent a lot of time figuring out how to solve them in
    | Pachyderm, which is good for use cases where you do need higher
    | levels of scale or might run into merge conflicts with DVC.
    | There are trade-offs, though. DVC is definitely easier for a
    | single developer / data scientist to get up and running with.
      | nerdponx wrote:
      | I think it's worth noting that DVC can be used to track
      | artifacts that have been generated by other tools. For
      | example, you could use MLFlow to run several model
      | experiments, but at the end track the artifacts with DVC.
      | Personally, I think that this is the best way to use it.
      |
      | However, I agree that in general it's best for smaller
      | projects and use cases. For example, it still shares the
      | primary deficiency of Make in that it can only track files on
      | the file system, and not things like ensuring a database
      | table has been created (unless you 'touch' your own sentinel
      | files).
    | bagavi wrote:
    | The alternative tool you are referring to is `Dud`, I believe.
    |
    | DVC is the best tool I found, in spite of being dead slow and
    | complex (trying to do many things).
    |
    | What alternatives would you recommend?
    | remram wrote:
    | > You are encouraged if you model your pipeline in DVC.
    |
    | Encouraged to do what?
    |
    | You might want to slow down on the use of parentheses, we are
    | both getting lost in them.
      | nerdponx wrote:
      | I assume they meant to say "you are encouraged to use DVC to
      | run your model and experiment pipeline". They want to
      | encourage you to do this because they are trying to build a
      | business around being a data science ops ecosystem. But the
      | truth is that DVC is not a great tool for running
      | "experiments" searching over a parameter space. It could be
      | improved in that regard, but that's just not what I use it
      | for, nor is it what I recommend it to other people for.
      |
      | However, it's fantastic for _tracking_ artifacts throughout a
      | project that have been generated by other means, for keeping
      | those artifacts tightly in sync with Git, and for making it
      | easy to share those artifacts without forcing people to
      | re-run expensive pipelines.
        | shcheklein wrote:
        | > But the truth is that DVC is not a great tool for running
        | "experiments" searching over a parameter space.
        |
        | Would love your feedback on what's missing there! We've
        | been improving it lately - e.g.:
        |
        | - Hydra support: https://dvc.org/doc/user-guide/experiment-
        | management/hydra
        |
        | - VS Code extension: https://marketplace.visualstudio.com/
        | items?itemName=Iterativ...
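
The parameter-space workflow under discussion is DVC's experiments
feature; a minimal sketch of a sweep (the `train.lr` parameter is a
hypothetical entry in params.yaml):

    # queue runs that override a value from params.yaml
    dvc exp run --queue -S train.lr=0.01
    dvc exp run --queue -S train.lr=0.001
    # execute everything that was queued
    dvc exp run --run-all
    # compare params and metrics across the runs in a table
    dvc exp show
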
          | kernelsanderz wrote:
          | Last I checked, it wasn't easy to use something like
          | Optuna to do hyperparameter tuning with Hydra/DVC.
          |
          | Ideally I'd like the tool I use for data versioning
          | (DVC/git-lfs/git-annex) to be orthogonal to the one I use
          | for hyperparameter sweeping (DVC/Optuna/SageMaker
          | experiments), orthogonal to the one I use for
          | configuration management (DVC/Hydra/plain YAML), and to
          | the one I use for experiment DAG management
          | (DVC/Makefile).
          |
          | Optuna is becoming very popular in the data-science/deep-
          | learning ecosystem at the moment. It would be great to
          | see more composable tools, rather than having to opt
          | all-in to a given ecosystem.
          |
          | Love the work that DVC is doing to tackle these difficult
          | problems, though!
___________________________________________________________________
(page generated 2022-10-02 23:00 UTC)