[HN Gopher] Data Version Control
       Data Version Control
       Author : HerrMonnezza
       Score  : 131 points
       Date   : 2022-10-01 16:09 UTC (1 days ago)
 (HTM) web link (dvc.org)
 (TXT) w3m dump (dvc.org)
       | throwawaybutwhy wrote:
       | The package phones home. One has to set an env var or fix several
       | lines of code to prevent that.
         | shcheklein wrote:
         | Hey, yes, we've decided to keep it opt-out for now and it
         | collects fully anonymized basic statistics. Here is the full
         | policy: https://dvc.org/doc/user-guide/analytics .
         | It should be easy to opt out through `dvc config core.analytics
         | false` or an env variable `DVC_ANALYTICS=False`.
         | Could you please clarify about the `several lines of code`? We
         | were trying to make it very open and visible what we collect
         | (it prints a large message when it starts) + make it easy to
         | disable it.
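         | A minimal sketch of the environment-variable route, in Python. The
         | variable name and value are the ones quoted above; the helper
         | names are hypothetical, and `run_dvc` assumes `dvc` is on PATH.

```python
import os
import subprocess


def env_without_dvc_analytics(base_env=None):
    """Return a copy of the environment with DVC analytics switched off."""
    env = dict(os.environ if base_env is None else base_env)
    # Name and value as quoted in the thread: DVC_ANALYTICS=False
    env["DVC_ANALYTICS"] = "False"
    return env


def run_dvc(args):
    """Run a dvc subcommand with analytics disabled (needs dvc installed)."""
    return subprocess.run(
        ["dvc", *args],
        env=env_without_dvc_analytics(),
        check=True,
    )
```

         | For example, `run_dvc(["status"])` would run `dvc status` with the
         | variable set for that process only, leaving the shell untouched.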
           | prepend wrote:
           | This seems pretty anti user since most users prefer opt in.
           | Seems pretty shady to keep in behavior that users don't like
           | and potentially harms them (you think it's fully anonymized).
           | That's your prerogative as it's your project but makes me
           | think what else you're doing that's against users' best
           | interest and in your own.
             | shcheklein wrote:
             | We are fully aware that it raises concerns. Trust me it
             | hurts my feelings as well. E.g. on the websites (dvc.org,
             | cml.dev, etc) - we don't use any cookies, GA, etc.
             | We've tried to make it as open as possible - code is
             | available (it's open source), we write openly about this at
             | the very start, we have a policy online, made it easy to
             | opt-out. If you have other ideas how to make it even more
             | friendly, more visible, etc - let us know please.
             | Still, we've preferred so far to keep it opt-out since it's
             | crucial for us to see major product trends (which features
             | are being used more, product growth MoM etc). Opt-in at
             | this stage realistically won't give us this information.
               | prepend wrote:
               | Yet there are many successful projects that don't collect
               | this information. So it's not crucial for them but is
               | crucial for you.
               | I think the challenge I have is that since you're getting
               | IP address that will be an opportunity to abuse. And
               | there seems to be some rule that any data that can be
               | misused will eventually be misused.
               | Since you're not willing to make it opt-in, I think
               | perhaps the only other way would be to support an
               | automated distro that doesn't include it so users are at
               | least able to easily choose a version.
               | I admire you for responding to this thread and me as it's
               | definitely not easy. I just feel like one of the main
               | benefits of open source is its alignment with user
               | benefits so it's discouraging when an open source project
               | chooses code that users don't want.
               | shcheklein wrote:
               | Right, many projects use opt-in, there are many that have
               | opt-out though:
               | https://docs.brew.sh/Analytics
               | https://docs.npmjs.com/policies/privacy#how-does-npm-collect...
               | VS Code, etc.
               | > I think the challenge I have is that since you're
               | getting IP address that will be an opportunity to abuse.
               | Yes! And we are migrating to the new package /
               | infrastructure because of this -
               | https://github.com/iterative/telemetry-python (DVC's
               | sister tool MLEM is already on it and it's not sending
               | (saving) IP addresses, nor using GA or any other third-
               | party tools, data is saved into BigQuery and eventually
               | we'll make publicly accessible -
               | https://mlem.ai/doc/user-guide/analytics to be fully GDPR
               | compatible). It's a legacy system that DVC had in place.
               | There was no intention to use those IP addresses in some
               | way.
               | > I think perhaps the only other way would be to support
               | an automated distro that doesn't include it so users are
               | at least able to easily choose a version.
               | Thanks. To some extent brew-like policy (not sending
               | anything significant before there is a chance to disable
               | it and there is clear explicit message) should be
               | mitigating this, but I'll check if it works this way now
               | and if it can be improved.
         | [deleted]
         | sva_ wrote:
         | I wondered how they'll make money
         | https://www.crunchbase.com/organization/iterative-ai/company...
           | nerdponx wrote:
           | I think their plan was/is to make money on corporate licenses
           | and support, as well as SaaS/cloud products.
           | machinekob wrote:
           | They won't, they can make investor money back only from
           | selling company to Amazon/Microsoft/Google but in this
           | economy it won't happen.
       | adhocmobility wrote:
       | If you just want a git for large data files, and your files don't
       | get updated too often (e.g. an ML model deployed in production
       | which gets updated every month) then git-lfs is a nice solution.
       | Bitbucket and Github both have support for it.
         | kortex wrote:
         | I've used both extensively. Git-lfs has always been a
         | nightmare. Because each tracked large file can be in one of two
         | states - binary, or "pointer" - it's super easy for the folder
         | to get all fouled up. It would be unable to "clean" or
         | "smudge", since either would cause some conflict. If you
         | accidentally pushed in the wrong state, you could "infect" the
         | remote and be really hosed. I had this happen numerous times
         | over about 2 years of using lfs, and each time the only
         | solution was some aggressive rewriting of history.
         | That, combined with the nature of re-using the same filename
         | for the metadata files, meant that it was common for folks to
         | commit the binary and push it. Again, lots of history rewriting
         | to get git sizes back down.
         | Maybe there exist solutions to my problems but I had spent
         | hours wrestling with it trying to fix these bad states, and it
         | caused me much distress.
         | Also configuring the backing store was generally more painful,
         | especially if you needed >2GB.
         | DVC was easy to use from the first moment. The separate meta
         | files meant that it can't get into mixed clean/smudge states.
         | If you aren't in a cloud workflow already, the backing store
         | was a bit tricky, but even without AWS I made it work.
         | kernelsanderz wrote:
         | I do feel like git-lfs is a good solution. Once you have 10s or
         | 100s of GB of files (eg. a computer vision project), this gets
         | pretty pricey.
         | Ideally I'd love to use git-lfs on top of S3, directly. I've
         | looked into git-annex and various git-lfs proxies, but I'm not
         | sure they're maintained well enough to be trusting it with
         | long-term data storage.
         | Huggingface datasets are built on git-lfs and it works really
         | well for them for storage of large datasets. Ideally I'd love
         | for AWS to offer this as a hosted thin layer on top of S3, or
         | for some well funded or supported community effort to do the
         | same, and in a performant way.
         | If you know of any such solution, please let me know!
         | simonw wrote:
         | It seems to be the solution Hugging Face have picked too.
       | LaserToy wrote:
       | Can it be used for large and fast changing datasets?
       | Example: 100 TB, written to every 10 mins.
       | Or 1 TB of parquet, 40% of which is rewritten daily.
         | nerdponx wrote:
         | DVC is expressly for tracking artifacts that are files on disk,
         | and only by comparing their MD5 hashes. So it can definitely
         | track the parquet files, but you are not going to get row or
         | field diffs or anything like that.
         | Maybe Pachyderm or Dolt would be better tools here.
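         | To illustrate what hash-based tracking means in practice, here is
         | a rough Python sketch. It is not DVC's actual code; `file_md5` and
         | `has_changed` are made-up names, and MD5 is used only because it
         | is the hash mentioned above.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    """True if the file's current hash differs from the recorded one."""
    return file_md5(path) != recorded_md5
```

         | Any rewrite of a parquet file changes its digest, so a check like
         | this says that the file changed, but nothing about which rows or
         | fields did.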
           | bumblebritches5 wrote:
           | AlotOfReading wrote:
           | Why would you use MD5 in anything written in the last 5
           | years? The SHA family is faster on modern hardware and there
           | aren't trivial collisions floating around out there.
             | kortex wrote:
             | It was definitely a bad choice. I wasn't there so I can
             | only speculate. My guess is because it is sort of
             | ubiquitous and thus a low-hanging fruit and devs didn't
             | know better, or the related corollary, it's what S3 uses
             | for ETags, so it probably seemed logical. Either way, seems
             | like someone did it and didn't know better, no one agrees
             | on a fix or whether it's even necessary to change, and thus
             | it's stuck for now.
             | There's an ongoing discussion about replacing/configuring
             | the hash function, and it looks like there might be some
             | movement toward replacing the hash and other speedups in
             | 3.0
             | https://github.com/iterative/dvc/issues/3069
             | > We not only want to switch to a different algorithm in
             | 3.0, but to also provide better
             | performance/ui/architecture/ecosystem for data management,
             | and all of that while not seizing releases with new
             | features (experiements, dvc machine, plots, etc) and bug
             | fixes for 2.0, so we've been gradually rebuilding that and
             | will likely be ready for 3.0 in the upcoming months. -
             | https://github.com/iterative/dvc/issues/3069#issuecomment-93...
             | nerdponx wrote:
             | Don't quote me on the specific hash algorithm, maybe it's
             | SHA. Point is that it's just comparing modification times
             | and hashes.
         | snthpy wrote:
         | What about Apache Iceberg for those?
       | tomthe wrote:
       | Can anyone compare this to DataLad [1], which someone introduced
       | to me as "git for data"?
       | [1] https://www.datalad.org/
         | remram wrote:
         | Doesn't use git-annex like DataLad. That alone is a huge
         | benefit given the state of that tool.
           | imiric wrote:
           | I'm curious, what's the problem with git-annex?
           | I've considered using it before as an alternative to Git LFS.
             | niccl wrote:
             | things that I don't like about it:
             | * git diff doesn't work in any sensible way
             | * if you forget and do `git add` instead of `git annex
             | add`, everything is fine, but you've now spoilt the nice
             | thing that git annex does of de-duping files. (git annex
             | only stores one copy of identical files)
             | * for our use case (which I'm sure is the wrong way of
             | doing things) it's possible to overwrite the single copy of
             | a file that git annex stores, which rather spoils the point
             | of the thing. I do think it's down to the way we use it,
             | though, so not specifically a git annex problem
             | The _great_ thing about git annex is it can be self-hosted.
             | For various reasons we can't put our source data in one of
             | the systems that uses git-lfs.
             | We've got about 800 GB of data in git annex and I've been
             | happy with it despite the limitations.
               | hpfr wrote:
               | If you configure annex.largefiles, git add should work
               | with the annex. I start with something like
               | git annex config --set annex.largefiles 'largerthan=1kb
               | and not (mimeencoding=us-ascii or mimeencoding=utf-8)'
               | > By default, git-annex add adds all files to the annex
               | (except dotfiles), and git add adds files to git (unless
               | they were added to the annex previously). When
               | annex.largefiles is configured, both git annex add and
               | git add will add matching large files to the annex, and
               | the other files to git.
               | --https://git-annex.branchable.com/git-annex/
               | Note that git add will add large files unlocked, though,
               | since (as far as I understand) it's assumed you're still
               | modifying them for safety:
               | > If you use git add to add a file to the annex, it will
               | be added in unlocked form from the beginning. This allows
               | workflows where a file starts out unlocked, is modified
               | as necessary, and is locked once it reaches its final
               | version.
               | --https://git-annex.branchable.com/git-annex-unlock/
               | remram wrote:
               | Yes it definitely serves a valid use-case, I feel like
               | someone should try and bring some competition there. A
               | modern equivalent with fewer gotchas, maybe in Rust/Go,
               | maybe using a fuse mount and content-defined chunking
               | (borg/restic/...-style) would be amazing.
               | kernelsanderz wrote:
               | I'd love to see a well-supported git-lfs compatible
               | client/proxy (so you could more easily move backends)
               | that could run on top of S3/object storage. Yes, and
               | written in a modern language like golang/rust for
               | performance / parallelism. There's some node.js and
               | various other git-lfs proxies out there, but not well
               | enough maintained that I could count on them being around
               | and working in another 5 years. git-annex at least has
               | been around for a while, even though it has its issues.
               | Huggingface uses git-lfs for large datasets with good
               | success. git-lfs on GitHub gets very pricey at higher
               | volumes of data. Would love the affordability of object
               | storage, just with a better git blob storage interface,
               | that will be around in the future.
               | Most of these systems do their own hash calculations and
               | are not interchangeable with each other. I feel like
               | git-lfs has the momentum in data science at the moment,
               | but needs some better options for people who
               | want a low cost storage option that they can control.
               | Huggingface is great, but it's one more service to
               | onboard if you're in an enterprise. And data
               | privacy/retention/governance means that many people would
               | like their data to reside on their own infrastructure.
               | If AWS were to give us a low cost git-lfs hosted service
               | on top of S3 it would be very popular.
               | If anyone knows of some good alternatives, please let us
               | know!
               | kernelsanderz wrote:
               | Did some more research to see if anything had changed in
               | this space. I found two interesting projects (haven't
               | used them myself yet though):
               | One in C# (with support for auth)
               | https://github.com/alanedwardes/Estranged.Lfs
               | One in Rust (but no Auth, have to run reverse proxy)
               | https://github.com/jasonwhite/rudolfs
               | Both seem interesting. Anyone use these?
             | remram wrote:
             | It lives in this weird wiki that seems to be read-only most
             | of the time. I don't think it's alive. Its use of hard
             | links also causes too many problems, of the silent
             | corruption variety.
               | hpfr wrote:
               | Ikiwiki's definitely a bit weird, but I've been
               | experimenting with git-annex recently and it worked fine
               | every time I commented. Seems like it's chugging along:
               | https://git-annex.branchable.com/recentchanges/
               | When does it use hard links? As far as I remember it used
               | symlinks unless you used something like annex.hardlink
               | (described in the man page:
               | https://git-annex.branchable.com/git-annex/)
         | benhurmarcel wrote:
         | And what about Dolt?
         | https://docs.dolthub.com/introduction/what-is-dolt
           | shcheklein wrote:
           | Dolt is for tabular data. It's like SQLite but with
           | branching, versioning at the DB level. DVC is file-based. It
           | saves large files, directories, etc to one of the supported
           | storages - S3, GCP, Azure, etc. It's more like Git-lfs in
           | that sense.
           | Another difference is that for DVC (surprisingly) data
           | versioning itself is just one of the main fundamental layers
           | that is needed to provide holistic ML experiments tracking
           | and versioning. So, DVC has a layer to describe an ML
           | project, run it, capture and version inputs/outputs. In that
           | sense DVC becomes a more opinionated / high level tool if
           | that makes sense.
       | bs7280 wrote:
       | What value does this provide that I can't get by versioning my
       | data in partitioned parquet files on s3?
         | shcheklein wrote:
         | I think parquet won't help with images, video, ML models.
         | Also, one thing is to physically provide a way to version data
         | (e.g. partitioned parquet files, cloud versioning, etc, etc),
         | but another one is to also have a mechanism of saving /
         | codifying dataset version into the project. E.g. to answer the
         | question which version of data this model was built with you
         | would need to save some identifier / hash / list of files that
         | were used. DVC takes care of that part as well.
         | (it has mechanics to cache data that you download, make-file
         | like pipelines, etc)
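         | A rough illustration of "save some identifier / hash / list of
         | files" in Python. This is a sketch of the idea only, not DVC's
         | file format; the function names and the choice of SHA-256 are
         | assumptions.

```python
import hashlib
import json
from pathlib import Path


def dataset_manifest(root):
    """Map each file under root (relative POSIX path) to its SHA-256."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; fine for a small example.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def dataset_version(manifest):
    """Collapse a manifest into one id that changes when any file changes."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

         | Committing the resulting identifier (or the whole manifest) next
         | to the training code is one way to answer "which version of the
         | data was this model built with" later on.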
       | smeagull wrote:
       | I don't think this tool can encompass everything you need in
       | managing ML models and data sets, even if you limit it to
       | versioning data.
       | I'd need such a tool to manage features, checkpoints and labels.
       | This doesn't do any of that. Nor does it really handle merging
       | multiple versions of data.
       | And I'd really like the code to be handled separately from the
       | data. Git is not the place to do this. Because the choice of
       | picking pairs of code and data should happen at a higher level,
       | and be tracked along with the results - that's not going in a
       | repo - MLFlow or Tensorboard handles it better.
       | lizen_one wrote:
       | DVC has had the following problems, when I tested it (half a year
       | ago):
       | It gets super slow (waiting minutes) when there are a few thousand
       | files tracked. Thousands of files have to be tracked if you have
       | e.g. a 10GB file per day and region and artifacts generated from
       | it.
       | You are encouraged (it only can track artifacts) if you model
       | your pipeline in DVC (think like make). However, it cannot run
       | tasks in parallel. So it takes a lot of time to run a pipeline
       | while you are on a beefy machine and only one core is used.
       | Obviously, you cannot run other tools (e.g. snakemake) to
       | distribute/parallelize on multiple machines. Running one (part of
       | a) stage has also some overhead, because it does commit/checks
       | after/before running the executable of the task.
       | Sometimes you get merge conflicts if you run a (partially
       | parameterized) stage on one machine and the other part on the
       | other machine manually. These are cumbersome to fix.
       | Currently, I think they are more focused on ML features like
       | experiment tracking (I prefer other mature tools here) instead of
       | performance and data safety.
       | There is an alternative implementation from a single developer (I
       | cannot find it right now) that fixes some problems. However, I do
       | not use this because it probably will not have the same
       | development progress and testing as DVC.
       | This sounds negative but I think it is currently one of the
       | best tools in this space.
         | DougBTX wrote:
         | What's best if parallel step processing is required?
         | jdoliner wrote:
         | DVC is great for use cases that don't get to this scale or have
         | these needs. And the issues here are non-trivial to solve. I've
         | spent a lot of time figuring out how to solve them in Pachyderm
         | which is good for use cases where you do need higher levels of
         | scale or might run into merge conflicts with DVC. There's
         | trade-offs though. DVC is definitely easier for a single
         | developer / data scientist to get up and running with.
           | nerdponx wrote:
           | I think it's worth noting that DVC can be used to track
           | artifacts that have been generated by other tools. For
           | example, you could use MLFlow to run several model
           | experiments, but at the end track the artifacts with DVC.
           | Personally I think that this is the best way to use it.
           | However I agree that in general it's best for smaller
           | projects and use cases. For example, it still shares the
           | primary deficiency of Make in that it can only track files on
           | the file system, and not things like ensuring a database
           | table has been created (unless you 'touch' your own sentinel
           | files).
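           | The sentinel-file workaround might look roughly like this in
           | Python. It is only a sketch: `sqlite3` stands in for whatever
           | database is involved, and the function, table, and file names
           | are hypothetical.

```python
import sqlite3
from pathlib import Path


def create_table_with_sentinel(db_path, sentinel_path):
    """Create a table, then touch a file that a file-based tool can track."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS features "
                "(id INTEGER PRIMARY KEY, value REAL)"
            )
    finally:
        conn.close()
    # Reached only if the table step succeeded, so the sentinel's
    # existence and mtime stand in for "the table has been created".
    Path(sentinel_path).touch()
```

           | A Make- or DVC-style stage could then list the sentinel file as
           | its output, which gives the tool something on disk to compare.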
         | bagavi wrote:
         | The alternative tool you are referring to is `Dud`, I believe.
         | DVC is the best tool (I found) in spite of being dead slow and
         | complex (trying to do many things).
         | What alternatives would you recommend?
         | remram wrote:
         | > You are encouraged if you model your pipeline in DVC.
         | Encouraged to do what?
         | You might want to slow down on the use of parentheses, we are
         | both getting lost in them.
           | nerdponx wrote:
           | I assume they meant to say "you are encouraged to use DVC to
           | run your model and experiment pipeline". They want to
           | encourage you to do this because they are trying to build a
           | business around being a data science ops ecosystem. But the
           | truth is that DVC is not a great tool for running
           | "experiments" searching over a parameter space. It could be
           | improved in that regard, but that's just not what I use it
           | for nor is it what I recommend it to other people for.
           | However it's fantastic for _tracking_ artifacts throughout a
           | project that have been generated by other means, and for
           | keeping those artifacts tightly in sync with Git, and for
           | making it easy to share those artifacts without forcing
           | people to re-run expensive pipelines.
             | shcheklein wrote:
             | > But the truth is that DVC is not a great tool for running
             | "experiments" searching over a parameter space.
             | Would love your feedback on what's missing there! We've been
             | improving it lately - e.g.
             | - Hydra support:
             | https://dvc.org/doc/user-guide/experiment-management/hydra
             | - VS Code extension:
             | https://marketplace.visualstudio.com/items?itemName=Iterativ...
               | kernelsanderz wrote:
               | Last I checked it wasn't easy to use something like
               | optuna to do hyperparameter tuning with hydra/DVC.
               | Ideally I'd like the tool I use for data versioning
               | (DVC/git-lfs/git-annex) to be orthogonal to the one I use
               | for hyperparameter sweeping (DVC/optuna/SageMaker
               | experiments), to the one I use for configuration
               | management (DVC/Hydra/plain YAML), and to the one I use
               | for experimental DAG management (DVC/Makefile).
               | Optuna is becoming very popular in the data-science/deep
               | learning ecosystem at the moment. It would be great to
               | see more composable tools, rather than having to opt all-
               | in into a given ecosystem.
               | Love the work that DVC is doing to tackle these
               | difficult problems though!
       (page generated 2022-10-02 23:00 UTC)