[HN Gopher] Who needs MLflow when you have SQLite? ___________________________________________________________________ Who needs MLflow when you have SQLite? Author : edublancas Score : 202 points Date : 2022-11-16 14:55 UTC (8 hours ago) (HTM) web link (ploomber.io) (TXT) w3m dump (ploomber.io) | praveenhm wrote: | what is an alternative to MLflow other than SQLite, like Kubeflow | or Metaflow? | crucialfelix wrote: | Weights and Balances https://wandb.ai/site | pcerdam wrote: | Weights and *Biases :) | kuba_dmp wrote: | neptune.ai https://neptune.ai/ | the83 wrote: | Comet: https://www.comet.com/site/ | mmq wrote: | https://github.com/polyaxon | isoprophlex wrote: | Yeah, MLFlow is a shitshow. The docs seem designed to confuse, | the API makes Pandas look good, and the internal data model is | badly designed and exposed, as the article says. | | But hordes of architects and managers who almost have a clue | have been conditioned to want and expect mlflow. And it's baked | into Databricks too, so for most purposes you'll be stuck with | it. | | Props to the author for daring to challenge the status quo. | idomi wrote: | How many data scientists that use Databricks for modeling do | you know? | isoprophlex wrote: | It's ubiquitous. I've consulted for a 100-person company that | built a data product on top of some IoT data. Everything was | in Databricks, literally everything. (Not endorsing that, | just an observation) | | Talking to a 2000+ person org now that is standardizing data | science across the org using... you guessed it | idomi wrote: | Pretty interesting. I think this is part of this notion of | releasing half-baked products: some of the stuff in there | is really cool, just enough to get you in, but it doesn't | scale and usually is complex to deploy/use. | dachryn wrote: | it's forced upon many of them that are in finance, banking, | insurance, ... 
| | Mainly because those tend to run on Microsoft Azure, which | has no decent analytics offering and is pushing Databricks | extremely hard. The CTO or whatever just pushes Databricks. | On paper it checks all the boxes: MLOps, notebooks, | experiment management. It just does all of those things very | badly, but the exec doesn't care. They only care about the | Microsoft credits. All this just to avoid using Jupyter, so the | compliance teams stay happy as well, because Microsoft sales | people scared them away from open source. | akdor1154 wrote: | What would you go with instead for collaborative notebooks? | | I ask because normally I tend pretty strongly towards the | "NO, just let the DSes/analysts work how they want to" view, | which in this case would be running Jupyter locally. | However, DBr's notebooks seem genuinely useful. | | Is your issue "but I don't need Spark", or "I wanna code in | a Python project, not a notebook", or something else? | | Imo if DBr cut their wedding to Spark and provided a | Python-only nb environment they'd have a killer offering on | their hands. | nerdponx wrote: | My team very nearly had this happen to us. | | We pushed back on it very, very, very hard, and finally | convinced "IT" not to turn off our big Linux server running | JupyterHub. We actually ended up using Databricks (PySpark, | Delta Lake, hosted MLFlow) quite a bit for various | purposes, and were happy to have it available. | | But the thought of forcing us into it as our _only_ | computing platform was a spine-chilling nightmare. | Something that only a person who has no idea what data | analysts and data scientists actually do all day would | decide to do. | chaps wrote: | "the API makes Pandas look good" | | It sparks joy in my heart whenever I see shade cast against | pandas. | lordgroff wrote: | Every time I open up pandas I jealously remember the | expressive beauty of R for these tasks. 
But because we're all | "serious", of course we must use Python for production lest we | not be serious. | laichzeit0 wrote: | To be fair, taking R to production is a goddamn nightmare. | chaxor wrote: | R is a trash of a language. It doesn't have any sense of | coherency to it at all. They keep trying to fix the | underlying problems by duct-taping paradigms onto it over | and over (S3, S4, R6, etc.). There's never a clear sense | of the best way to do anything, but plenty of options to | do a thing in a very hacky 'script-kiddy' way. Looking | out at the community of different projects, it becomes | clear that everyone is pretty lost as to what design | principles should be used for certain tasks, so every | repo has its own way of doing things (I know personal | style occurs in other languages, but commonalities are | much less recognizable in R projects). It's tragic that | such a large community uses it. | jmt_ wrote: | Trash language is a bit harsh. I'm not sure I would try | to put an R project into production or build a huge | project with it but, at the very least, R/RStudio was | the best scientific calculator I've ever used. It was | particularly great during college. | lordgroff wrote: | Yep, this is a mark of someone that's never used R but | has heard a lot of incredibly ill-informed criticism | around it. | | One look at dplyr code next to pandas would of course | disabuse anyone of the notion that R is trash, and the | tragedy is that Python, in its current state, will never | have anything like that. That's the advantage of the language | being influenced by Lisp vs not. | tomrod wrote: | I've heavily used R several times. | | I agree that it is a trash language and that, aside from the | fact that many frontier academic ideas are available and some | plotting preferences are solidly prescriptive, it should | be thrown into the trash bin. | | Python, Julia when it gets its druthers for TTFP, Octave, | Fortran, C, and eventually Rust. 
These are the tools I've | found in use over and over and over again across | business, government, and non-profits. | | Everywhere R is used by an org, I have seen major gaps in | capacity to deliver, specifically because R doesn't scale | well. | nerdponx wrote: | Try to separate the language from its standard library. | Neither one is "trash". | | I agree that the standard library is what you might call | "a chaotic disorganized mess". | tomrod wrote: | I'm not emotionally invested in tools, so I am happy to | identify the user experience and operational experience | as "trash." | | "Trash", despite its connotations of lacking value, is | really just a chaotic disorganized mess of something made | by artifice with dubious reclaim/reuse/recycle value. | Being a subjective assessment, it is natural that one | person's trash is a treasure to another. | nerdponx wrote: | I take issue with your implication that I'm emotionally | invested in something when I shouldn't be. You are free | to dislike R and not use it, but to claim that it's | "trash" is to wrongly disavow its usefulness for the many | people that do find it useful, and to cast aspersions on | the judgement of all those people. | whatever1 wrote: | I have never seen a worse-documented library. Initially I | thought that they were lazy; now I realize that it cannot be | documented because it is a total mess of a library held | together with tape. | | Close second is the plotly library. | nerdponx wrote: | The Pandas documentation has improved quite a bit. Last I | checked, the only part of the reference docs with a big gap | was the description of "extension arrays" and accessors. | | The _user guide_ material absolutely needs work, and the | examples in the reference docs tend to be a little | contrived. But I absolutely have seen worse-documented | libraries, such as Gunicorn and Pydantic. 
| claytonjy wrote: | I'm surprised to see Pydantic in here; I've used Pandas | and Pydantic both quite a lot, and have found the | Pydantic docs to be quite good! Also a much smaller | library with a saner API, and thus easier to document | well. | __mharrison__ wrote: | Genuinely curious what you have against the Pandas | documentation. It has some of the best docstrings I've | seen. | | (I also wrote a Pandas book or two... So there's that) | chaps wrote: | Docstrings are one thing, but functionality discovery, | picking up from scratch, troubleshooting, etc. are... not | fun, nor easy with the documentation. If you know it well | already and use it a lot, it's easier to forgive its | documentation faults since you can wave off the problems | as "that's just learning something new". | | But for a lot of people who use it infrequently, its | documentation is a frustrating mess. Simple problems turn | into significant time sinks of trying to find which page | of the documentation to look at. | | A lot of issues are made worse by shit-awful interop | between libraries that claim to fully support dataframes | but often fail in non-obvious ways... meaning back to the | documentation mines. | | I'd argue that the very existence of a market for a single | author to write two books about it is indicative of | documentation problems. | 333luke wrote: | What makes the documentation so bad in your opinion? I'm | not arguing, but curious, since I use pandas all day at my | job and can't think of any times the docs weren't clear to | me. (Plotly I have had some annoying times with!) | bobertlo wrote: | I think the R docs are the intended reference material for | pandas ;) | dekhn wrote: | What bothers me the most is the egregious data types for any | argument. If it's a string, do this. If it's a list, do that. | If it's a dictionary of lists, do this other thing. | | No, I want you to force me to provide my data in the right | way and raise a noisy exception if I don't. 
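The wish in the comment above - accept exactly one input shape and raise a noisy exception otherwise - can be sketched in a few lines of plain Python. `make_frame` and its dict-of-lists convention are illustrative inventions for this sketch, not part of pandas or any real library:

```python
def make_frame(columns):
    """Accept ONLY a dict mapping str column names to equal-length lists.

    Any other input shape (string, list of tuples, etc.) fails loudly
    instead of being silently coerced.
    """
    if not isinstance(columns, dict):
        raise TypeError(f"expected dict of lists, got {type(columns).__name__}")
    lengths = set()
    for name, values in columns.items():
        if not isinstance(name, str) or not isinstance(values, list):
            raise TypeError("keys must be str and values must be list")
        lengths.add(len(values))
    if len(lengths) > 1:
        raise ValueError(f"columns have mismatched lengths: {sorted(lengths)}")
    return dict(columns)

frame = make_frame({"x": [1, 2], "y": [3, 4]})  # the one accepted shape
try:
    make_frame([("x", [1, 2])])  # wrong shape: raises TypeError immediately
except TypeError as e:
    print(e)
```

The trade-off is exactly the one debated in the thread: less magic for sloppy one-off analysis, but failures surface at the call site instead of deep inside a dispatch chain.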
| nerdponx wrote: | Series and DataFrame have "alternate constructors" for this | purpose, and the loc/iloc accessors give you a bit more | control. | | I agree that the magic type auto-detection is a bit too | magical and sloppy, but you have to realize that data | analysts and scientists have historically been incredibly | sloppy programmers who _wanted_ as much magic as possible. | It's only in recent years that researchers have begun to | value some amount of discipline in their research code. | mostdataisnice wrote: | Where does the article say that? | isoprophlex wrote: | About exposing the data inside MLFlow | | > I found the query feature extremely limiting (if my | experiments are stored in a SQL table, why not allow me to | query them with SQL). | guangyeu wrote: | As noted in an earlier comment, I think there is a false | equivalence between end-to-end MLOps platforms like MLflow and | tools for experiment tracking. The project looks like a solid | tracking solution for individual data scientists, but it is not | designed for collaboration among teams or organizations. | | > There were a few things I didn't like: it seemed too much to | have to start a web server to look at my experiments, and I found | the query feature extremely limiting (if my experiments are | stored in a SQL table, why not allow me to query them with SQL). | | While a relational database (like SQLite) can store | hyperparameters and metrics, it cannot scale to the many aspects | of experiment tracking for a team/organization, from visual | inspection of model performance results to sharing models to | lineage tracking from experimentation to production. As noted in | the article, you need a GUI on top of a SQL database to make | model experimentation meaningful. The MLflow web service allows | you to scale across your teams/organizations with interactive | visualizations, built-in search & ranking, shareable snapshots, | etc. 
You can run it across a variety of production-grade | relational DBs, so users can query the data directly through the | SQL database or through a UI that makes it easier to search for | those not interested in using SQL. | | > I also found comparing the experiments limited. I rarely have a | project where a single (or a couple of) metric(s) is enough to | evaluate a model. It's mostly a combination of metrics and | evaluation plots that I need to look at to assess a model. | Furthermore, the numbers/plots themselves have no value in | isolation; I need to benchmark them against a base model, and | doing model comparisons at this level was pretty slow from the | GUI. | | The MLflow UI allows you to compare thousands of models from the | same page in tabular or graphical format. It renders the | performance-related artifacts associated with a model, including | feature importance graphs, ROC & precision-recall curves, and any | additional information that can be expressed in image, CSV, HTML, | or PDF format. | | > If you look at the script's source code, you'll see that there | are no extra imports or calls to log the experiments, it's a | vanilla Python script. | | MLflow already provides low-code solutions for MLOps, including | autologging. After running a single line of code - | mlflow.autolog() - every model you train across the most | prominent ML frameworks (including but not limited to | scikit-learn, XGBoost, TensorFlow & Keras, PySpark, LightGBM, and | statsmodels) is automatically tracked with MLflow, including all | relevant hyperparameters, performance metrics, model files, | software dependencies, etc. All of this information is made | immediately available in the MLflow UI. | | Addendum: As noted, there is a false equivalence between an | end-to-end MLOps lifecycle platform like MLflow and tools for | experiment tracking. 
To succeed with end-to-end MLOps, | teams/organizations also need projects to package code for | reproducibility on any platform across many different package | versions, deploy models in multiple environments, and a registry | to store and manage these models - all of which is provided by | MLflow. | | It is battle-tested, with hundreds of developers and thousands of | organizations using widely adopted open source standards. I | encourage you to chime in on the MLflow GitHub on any issues and | PRs, too! | czumar wrote: | +1. I'd also like to note that it's very easy to get started | with MLflow; our quickstart walks you through the process of | installing the library, logging runs, and viewing the UI: | https://mlflow.org/docs/latest/quickstart.html. | | We'd love to work with the author to make MLflow Tracking an | even better experiment tracking tool and immediately benefit | thousands of organizations and users on the platform. MLflow is | the largest open source MLOps platform, with over 500 external | contributors actively developing the project and a maintainer | group dedicated to making sure your contributions & | improvements are merged quickly. | bfung wrote: | How about a side-by-side comparison? | | Far too often, these "X is bad, use my homebrew Y instead" | articles don't show a comparison to X, which doesn't help | illustrate "why Y instead". | | You know... <cheeky>For science.</cheeky> | benjaminwootton wrote: | The elephant in the room with data is that we don't need a lot of | the fancy and powerful technology. SQL against a relational | database gets us extraordinarily far. Add some Python scripts | where we need some imperative logic and glue code, and a sprinkle | of CI/CD if we really want to professionalise the work of data | scientists. I think this covers the vast majority of situations. | | Despite being around it for some time, I'm not sure big data or | machine learning needed to be a thing for the vast majority of | businesses. 
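The "SQL against a relational database gets us extraordinarily far" approach above needs nothing beyond the Python standard library. A minimal sketch, assuming an illustrative one-table schema (the column names and run values here are invented for the example, not the article's actual schema):

```python
import json
import sqlite3

# One table holds every run: hyperparameters as a JSON blob, plus a metric.
con = sqlite3.connect("experiments.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS experiments (
        id INTEGER PRIMARY KEY,
        params TEXT,      -- hyperparameters serialized as JSON
        accuracy REAL
    )
""")

# Each training run appends one row (values here are illustrative).
runs = [({"n_estimators": 50}, 0.91), ({"n_estimators": 100}, 0.94)]
con.executemany(
    "INSERT INTO experiments (params, accuracy) VALUES (?, ?)",
    [(json.dumps(p), acc) for p, acc in runs],
)
con.commit()

# Comparing experiments is then just SQL, as the article advocates.
best = con.execute(
    "SELECT params, accuracy FROM experiments ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)  # highest-accuracy run
con.close()
```

From here, any SQL client, a pandas `read_sql`, or a notebook cell can slice the runs however needed, with no tracking server in the loop.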
| bob1029 wrote: | > SQL against a relational database gets us extraordinarily | far. | | I think it gets us all the way once you consider the ability to | expose domain-specific functions to SQL that are serviced by | your application code. | | I've always been of the mindset that you can do anything with | SQL if you are clever enough. | citizenpaul wrote: | Unless your income depends on carrying out the exact demands of | some money guy whose most common phrase while using a computer | is "it won't let me", and they want "big data". | | Then you just suck it up and build one of the totally | unnecessary big data systems that have been excreted all over | the business world these days. I don't think the problem is | that devs are over-engineering. | | I wonder what it's called; makes me think of tragedy of the | commons but probably not quite right. | morelisp wrote: | Maybe like 20 years ago you were right, but today there's a | generation that's _been working for 10 years_ on systems | built like that. They don't know any better, and in most | cases nobody is around to teach them otherwise. | tomrod wrote: | "Hierarchies and Bureaucracies," by Jean Tirole. I know because | this was the phenomenon I wanted to study in grad school, only | to find he scooped me (on this and several items) by several | decades. | | Edit: Tirole, Jean. "Hierarchies and bureaucracies: On the | role of collusion in organizations." JL Econ. & Org. 2 | (1986): 181. | chasil wrote: | The article mentions this workflow: | | "Let's now execute the script multiple times, one per set of | parameters, and store the results in the experiments.db SQLite | database... After finishing executing the experiments, we can | initialize our database (experiments.db) and explore the | results." | | Be warned that issuing queries while DML is in process can | result in SQLITE_BUSY, and the default behavior is to abort the | transaction, resulting in lost data. 
| | Setting WAL mode for greater concurrency between a writer and | reader(s) can lead to corruption if the IPC structures are not | visible: | | "To accelerate searching the WAL, SQLite creates a WAL index in | shared memory. This improves the performance of read | transactions, but the use of shared memory requires that all | readers must be on the same machine [and OS instance]." | | If the database will not be entirely left alone during DML, | then the busy handler must be addressed. | habibur wrote: | None of these are a problem for the workload discussed. | | When I am working with SQLite, I am most likely accessing it | from a single machine. | | And in this case of ML, most likely from one process, running | multiple times in serial. | isoprophlex wrote: | Yeah, and even if you do need to do proper big-dataset ML... a | SQL box and maybe something like blob storage for large | artifacts (S3, Azure storage account, whatever) is all you need | as well. But if your boss bought The MLOps Experience, you | gotta do what the cool kids are doing! | navbaker wrote: | I work in an environment where there are multiple tech teams | developing models for multiple use cases on VMs and GPU clusters | spread across our corporate intranet. Once you move beyond a | single dev working on a model on their laptop, you absolutely | need something that can handle not just metrics tracking, but | making the model binaries available and providing a means to | ensure reproducibility by the rest of the team. That's what | MLFlow is providing for us. The API is a mess, but at least we | didn't have to code up some bespoke in-house framework; we just | put some engineers on task to play around with it for a few hours | to figure out the nuances of basic interactions, and deployed it. | edublancas wrote: | Agree. Once you have a team, you need to have a service they | can all interact with. 
This release is a first step; we want to | get the user experience right for an individual and then think | of how to expand that to teams. Ultimately, the two things | we're the most excited about are 1) you don't need to add any | extra code (and it works with all libraries, not a pre-defined | set) and 2) SQL as the query language. | spicyramen_ wrote: | cdong wrote: | I don't get why a lot of people are calling mlflow a shitshow | when it has done so much to get data scientists out of recording | experiments via CSV. I can log models and parameters and use the | UI to track different runs. After comparisons, I can use the | registry to register different stages. If you have other model | diagnostic charts, you can log the artifact as well. I think | mlflow v2 has autologging included, so why all the fuss? | nerdponx wrote: | People tend to forget that first movers rarely have the best | design. MLFlow (and DVC) brought us out of the | dark ages. Now we can build better tools, with the benefit of | hindsight. | | Claiming that something is "broken" or "trash" when you mean "I | don't like it" is a good way to make yourself feel big and | smart, but it's not actually constructive. | cameronfraser wrote: | There are those who create and those who complain on the | internet about tools they've used one time | isoprophlex wrote: | Okay, that's coming across as a pretty snide remark aimed at | me, so I'll bite. | | Yes, I can understand why you comment that. I don't like | blind slagging of free software either. | | But there are ALSO those whose day job it is, and has been | for the last 2 years, to use a badly designed, overcomplex | horrorshow of a tool that could be replaced easily by | something better... if it wasn't for the lock-in effects and | strong marketing. | | So I'm venting my frustration and at the same time | expressing my gratitude to the person who made something | fresh, that shows us things can be better. 
| | I can't build the replacement to MLFlow myself, but I can | cheer on people who do, and let them know their efforts are | sorely needed. | phr0k wrote: | guangyeu wrote: | Could you provide context on why SQLite would replace MLflow? | From the standpoint of model tracking (recording and querying | experiments), projects (packaging code for reproducibility on | any platform), deployment of models in multiple environments, a | registry for storing and managing models, and now recipes (to | simplify model creation and deployment), MLflow helps with the | MLOps life cycle. | [deleted] | edublancas wrote: | Fair point. MLflow has a lot of features to cover the | end-to-end dev cycle. This SQLite tracker only covers the | experiment tracking part. | | We have another project to cover the orchestration/pipelines | aspect: https://github.com/ploomber/ploomber and we have plans | to work on the rest of the features. For now, we're focusing on | those two. | mostdataisnice wrote: | SQLite is literally a backend for MLflow, so the argument being | made really is that you should just use SQL when you can, which | is kind of adjacent to any criticisms of MLflow | edublancas wrote: | Is querying the underlying SQL database officially supported in | MLflow? Last time I used it, it wasn't documented. I took a | look at the database and it wasn't end-user friendly. | mostdataisnice wrote: | As someone replied above, it's because SQL is just one backend, | and it's weird to expose an API that only works on one backend. | Once you have many devs working together, you need a remote | server. If you have a remote abstracted backend, it needs to | have a unified API surface so the same client can talk to any | backend. You might argue "this interface _should_ be SQL", | and to that I would say there are many file stores (like your | local file system) that are not easy to control with SQL. 
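For anyone querying the tracking database directly, the SQLITE_BUSY and WAL caveats raised earlier in the thread are worth guarding against up front. A minimal sketch (the `experiments.db` filename matches the article; the 5-second timeout is an arbitrary illustrative choice):

```python
import sqlite3

# WAL lets one writer and several readers proceed concurrently, but only
# when all of them are on the same machine, as quoted above. A busy
# timeout makes a blocked statement retry for a while instead of
# aborting immediately with SQLITE_BUSY and losing the transaction.
con = sqlite3.connect("experiments.db", timeout=5.0)  # Python-side wait on locks
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
con.execute("PRAGMA busy_timeout=5000")  # SQLite-side retry window, in ms
print(mode)  # "wal" once the pragma takes effect on a file database
con.close()
```

With these two pragmas set, a reader polling the table while a training script writes to it degrades to a short wait rather than an aborted transaction.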
| afrnz wrote: | You can also use mlflow locally with SQLite | (https://www.mlflow.org/docs/latest/tracking.html#scenario-2-...). | Even though I haven't tried querying the db directly ... | frgtpsswrdlame wrote: | Wow, this looks perfect for what I need right now - just a bit of | lightweight tracking. | nerdponx wrote: | DVC also fills the "lightweight tracking" niche, although it | relies on automatically creating Git branches as its technique | for tracking experiments. I personally find that distasteful, | so I don't use it specifically for experiment tracking, but the | feature is there. | | The company behind DVC is also building a handful of other | related tools, e.g. | https://iterative.ai/blog/iterative-studio-model-registry | wxnx wrote: | Hm, in what way do you find that DVC requires creating new | branches for experiment tracking? | | I find the following workflow works well, for example: | | 1. Define steps depending on a `config.yml`. | | 2. Run an initial experiment (with an initial config) and | commit the results. | | 3. Update config (preserving the alternate config and using | symlinks from `config.yml` to various new configs if | necessary), re-run, and commit. | | 4. Results are then all preserved in your git history. | shcheklein wrote: | It doesn't require creating a branch when you iterate; it | requires creating a branch or commit if you want to share it | with the team - see it on GitHub or in Studio. But even those | lightweight iterations (https://dvc.org/doc/command-reference/exp/run) | could be shared as well via a Git server - they won't be | visible via the UI in GH/Studio at the moment. | | Happy to provide more details on how it's done. It's actually | quite an interesting technical thing - a custom Git namespace: | https://iterative.ai/blog/experiment-refs | edublancas wrote: | If you need help, you can open an issue on GitHub | (https://github.com/ploomber/ploomber-engine) or join our | Slack! 
(https://ploomber.io/community/) | geminicoolaf wrote: | What about BentoML? | LeanderK wrote: | I think MLflow is a good idea (very) badly executed. I would like | to have a library that combines: | | - simple logging of (simple) metrics during and after training | | - simple logging of all arguments the model was created with | | - simple logging of a textual representation of the model | | - simple logging of general architecture details (number of | parameters, regularisation hyperparameters, learning rate, number | of epochs, etc.) | | - and of course checkpoints | | - simple archiving of the model (and relevant data) | | and all that without much (coding) overhead and only using a | shared filesystem (!) And with an easy notebook integration. | MLflow just has way too many unnecessary features and is | unreliable and complicated. When it doesn't work it's so | frustrating; it's also quite often super slow. But I always end | up creating something like MLflow when working on an architecture | for a long time. | | EDIT: having written this... I feel like trying to write my own | simple library after finishing the paper. A few ideas have | already accumulated in my notes that would make my life easier. | | EDIT2: I actually remember trying to use SQLite to manage my | models! But the server I worked on was locked down, and going | through the process to get somebody to install SQLite for me was | just not worth it. It also was not available on the cluster for big | experiments, where it would be even more work to get it, so I | gave up on the idea of trying SQLite. | Fiahil wrote: | > I think MLFlow is a good idea (very) badly executed. | | Oh yes, I'm glad to see others with a similar opinion. | pletnes wrote: | Sqlite is in Python's stdlib, so how can this be an issue? Was | there no local filesystem whatsoever? | tekknolagi wrote: | sqlite bindings are in the stdlib but not the library itself. 
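The bindings-vs-library distinction above is observable at runtime: the stdlib `sqlite3` module is the binding, and it reports the version of the SQLite C library it was linked against (on a build where the library was unavailable, importing the module would fail outright):

```python
import sqlite3

# Version of the underlying SQLite C library the binding links to.
print(sqlite3.sqlite_version)
# Same information as a comparable tuple, e.g. (3, 39, 4).
print(sqlite3.sqlite_version_info)
```

This is a quick way to confirm, on a locked-down server like the one described, whether a working SQLite is already present before asking anyone to install anything.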
| imachine1980_ wrote: | I'm asking from ignorance: what is the effect, in this | context, of not having the library itself? | funklute wrote: | Using the bindings is only possible if the library itself | is already installed (since the bindings directly make | use of the library, under the hood). | edublancas wrote: | I'm happy to collaborate with you; let's build the best | experiment tracker out there! Feel free to ping me at | eduardo@ploomber.io | smehta73 wrote: | Have you used Comet? It basically does everything you are | asking for and is a lot more user-friendly than MLFlow. | nerdponx wrote: | Isn't Comet a proprietary SaaS? I like MLFlow because I can | run it on my own computer if I want to. | tomrod wrote: | Check out Flyte and union.ml. No personal affiliation, just | good projects in the vein of | airflow/prefect/mlflow/kubeflow. | YetAnotherNick wrote: | I really like guild.ai. The best thing is that their | developers assumed people would be lazy, so it automatically | makes flags for global variables and tracks them. ___________________________________________________________________ (page generated 2022-11-16 23:00 UTC)