[HN Gopher] XetCache: Improve Jupyter notebook reruns by caching...
___________________________________________________________________

XetCache: Improve Jupyter notebook reruns by caching cells

Author : skadamat
Score  : 73 points
Date   : 2023-12-19 15:18 UTC (7 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| morkalork wrote:
| Tempting, but it also sounds like a nightmare for writing
| notebooks with reproducible results?

| skadamat wrote:
| Using XetCache a bunch daily, I've honestly found the
| _opposite_. Here are a few situations:
|
| - My notebook queries a database for some data -> I run some
| downstream calculations. Re-querying returns slightly different
| data, but using cached outputs means I can replicate the work.
|
| - In another notebook, I was trying to replicate my coworker's
| work where he queried OpenAI a few thousand times. The results
| may be different when I run it, since OpenAI changed something.
|
| When notebooks are used well, it's in the context of internal
| data / ML project collaboration. So being able to flash-freeze
| a cell for your collaborators helps you juggle and pass around
| notebooks more easily with confidence.
|
| Shipping notebooks directly to production is a whole other set
| of challenges that I have my own, separate opinions on, but
| that's a different beast entirely.

| ylow wrote:
| Author here.
|
| The goal in a way is _better_ reproducibility. The memo hashes
| the contents of the inputs, so if your cell is deterministic
| (and you cover all the inputs), the memo should give you the
| right answers. Of course, if you want to rerun everything you
| can always just delete the memos, but properly used it should
| be a strict performance improvement.
|
| Of course there are many improvements that can be made
| (tracking if dependent functions have changed, etc.) and of
| course there are inherent limitations. This is very much "work
| in progress". But it is quite useful right now!
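|
| Roughly, the gist (heavily simplified sketch, not the actual
| XetCache implementation; the `memo` decorator and the `.memo/`
| on-disk layout here are hypothetical):
|
|     import functools, hashlib, os, pickle
|
|     def memo(func):
|         # Cache func's result on disk, keyed by a hash of the
|         # pickled inputs; identical inputs load the stored
|         # result instead of recomputing.
|         @functools.wraps(func)
|         def wrapper(*args, **kwargs):
|             key = hashlib.sha256(pickle.dumps(
|                 (func.__name__, args, sorted(kwargs.items()))
|             )).hexdigest()
|             path = os.path.join(".memo", key + ".pkl")
|             if os.path.exists(path):
|                 with open(path, "rb") as f:
|                     return pickle.load(f)   # memo hit
|             result = func(*args, **kwargs)
|             os.makedirs(".memo", exist_ok=True)
|             with open(path, "wb") as f:
|                 pickle.dump(result, f)      # memo miss: store
|             return result
|         return wrapper
|
| Deleting the `.memo/` directory is the "just delete the memos"
| escape hatch mentioned above.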
| mirashii wrote:
| > The goal in a way is better reproducibility. The memo hashes
| the contents of the inputs, so if your cell is deterministic
| (and you cover all the inputs), the memo should give you the
| right answers.
|
| That's an enormously large if, and without any way to detect or
| manage when that's not the case, it's hard to see how this
| could aid reproducibility. This adds a new failure mode for
| reproducibility, where an important side effect is not rerun,
| altering the results.

| hedgehog wrote:
| Think of this like incremental builds via make etc. You don't
| necessarily trust the output 100%, but it speeds up work so
| much that it's worth it, and you can always do a clean build at
| key points where you want more confidence.

| ltbarcly3 wrote:
| That only works for things entirely encapsulated by the value
| of the cell, which isn't a useful amount of stuff. This will
| only help with primitive toy examples where you are slinging
| numpy arrays or other easily hashed data with no encapsulation,
| indeterminacy, or order dependence.
|
| Sorry, if it were possible to just generically cache
| computations it would be done everywhere for everything. This
| is just going to help with toy examples.
|
| The only way this leads to "better reproducibility" is if it
| fails to recompute things that it should have recomputed. If
| the computation was actually deterministic from its inputs, the
| best and literally the only possible thing this can do is
| exactly what recomputing it would do, but faster. Frankly, that
| you said it is more reproducible is enough evidence for me that
| you are doing something fundamentally broken.

| juujian wrote:
| There is a similar feature built into R Markdown that I have
| used a couple of times. That, and eval=FALSE, which stops you
| from accidentally rerunning large queries, is actually really
| helpful for figuring out complex computations on large
| datasets.

| krastanov wrote:
| This is very neat, but these days I am truly enamoured with
| "reactive" notebooks. They are rather orthogonal to the concern
| here. Reactive notebooks automatically build a dependency tree
| between cells. When you edit a cell, all cells depending on it
| get re-evaluated.
|
| It is extremely valuable for reproducibility.
|
| My favorite example is Pluto Notebooks in Julia (very Julia-
| specific), but I have seen (maybe less polished) similar tools
| in other languages too: https://plutojl.org/
|
| On the other hand, when it comes to caching, Mandala (in
| Python) brings the best of both caching and reactivity. A truly
| amazing memoization tool for computational graphs, much more
| sophisticated and more powerful than the alternatives:
| https://github.com/amakelov/mandala

| ylow wrote:
| Author here. Mandala looks really cool. Thanks for the
| recommendation!

| pgbovine wrote:
| Very cool idea! I was also very interested in this problem
| during grad school... prototyped an approach by hacking
| CPython, but the code (Python 2.6? from the 2010 era) has long
| bitrotted:
| https://pg.ucsd.edu/publications/IncPy-memoization-in-Python...

| pgbovine wrote:
| Also, while I have your attention here, since you wrote that
| related post on (not) vector DBs... what would you recommend
| for a newbie to get started with RAG? Let's say I have a large
| collection of text files on my computer that I want to use for
| RAG. The options out there seem bewildering. Is there something
| simple akin to Ollama for RAG?

| zzleeper wrote:
| If you want to get something done quickly, try llama index.
|
| If you want to learn/hack, pick an easy vectordb, get an
| OpenAI API account, and do a quick attempt.
|
| Then you can switch to a local LLM and embedder, which helps a
| bit in learning what the pain points are.

| jihadjihad wrote:
| There was an interesting post yesterday that captured some of
| these ideas:
|
| https://news.ycombinator.com/item?id=38681115

| westurner wrote:
| Dockerfile and Containerfile also cache outputs as layers.
|
| `docker build --layers` is the default:
| https://docs.podman.io/en/latest/markdown/podman-build.1.htm...
|
| container/common//docs/Containerfile.5.md:
| https://github.com/containers/common/blob/main/docs/Containe...

| westurner wrote:
| It may be better to just start with managing caching with code.
|
| In the standard library, the @functools.cache and
| @functools.cached_property function and method decorators cache
| results in RAM only (@functools.lru_cache is the bounded LRU
| variant). https://docs.python.org/3/library/functools.html
|
| Dask docs > "Automatic Opportunistic Caching":
| https://docs.dask.org/en/stable/caching.html#automatic-oppor...
| ; dask.cache.Cache(size_bytes:int)... "Cache tasks, not
| expressions"
|
| Pickles are not a safe way to deserialize untrusted data; there
| is no data-only pickling protocol. So caching arbitrary cell
| objects (or e.g. stack frames) to disk, as pickles at least,
| creates a risk of code injection if the serialized data
| contains executable code objects.
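|
| The classic illustration of why loading pickles from a
| writable cache directory is dangerous (a sketch; the class name
| is made up, the mechanism is standard pickle behavior):
|
|     import pickle
|
|     class Evil:
|         # pickle calls __reduce__ to decide how to serialize;
|         # returning (callable, args) makes pickle.loads() call
|         # that callable at deserialization time.
|         def __reduce__(self):
|             import os
|             return (os.system, ("echo arbitrary code ran",))
|
|     payload = pickle.dumps(Evil())
|     pickle.loads(payload)  # prints "arbitrary code ran"
|
| Anyone who can write to the cache files can therefore run code
| in the next notebook session that loads them.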
|
| Similarly, the file permissions on e.g. rr traces matter:
| https://en.wikipedia.org/wiki/Rr_(debugging)
|
| Dataclasses in the standard library help with object
| serialization, but not with caching cell outputs containing
| code objects.
|
| Apache Arrow and Parquet also require a schema for efficient
| serialization.
|
| LRU: Least-Recently Used
|
| MRU: Most-Recently Used
|
| Out-of-order execution in notebooks can waste both human time
| and CPU time. If the prompt numbers aren't sequential, what you
| ran in a notebook is not necessarily what others will get when
| running that computation graph in order; so it's best practice
| to "Restart and Run All" to test the actual control flow before
| committing or pushing.
|
| There are ways to run and test notebooks _in order_ on or
| before git commit (and compare their output with the output
| from the previous run), like software with tests.

| ylow wrote:
| The issue comes when cells take many minutes or even hours to
| run (intentionally or not). The ideal is indeed sequential, and
| this helps me a lot with maintaining the sequential ordering,
| as it simplifies and speeds up the "restart and run all"
| process.

| westurner wrote:
| Oh, I understand the issue.
|
| E.g. dask-labextension does not implicitly do dask.cache for
| you.
|
| How are the objects serialized? Are code objects serialized to
| files on disk, what is the permission umask on such files, and
| what directory (/var/cache) should they be SELinux-labeled
| with (when code is run from an on-disk cache rather than from
| the source)? Because if you can write to those cache files,
| you control the execution flow of the notebook (which is
| already unreproducibly out-of-order without this
| consideration).

| IanCal wrote:
| I agree about being explicit with this, though I will warn
| about using the Python functools caching, because it changes
| the logic of your program. Because it doesn't copy data, any
| return value that's mutable is risky. It's also not _obvious_
| that this happens, and even less obvious if you're not keenly
| aware of this kind of issue being likely.
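|
| A small example of the trap (the `get_config` function is
| hypothetical, just to show the aliasing):
|
|     from functools import lru_cache
|
|     @lru_cache(maxsize=None)
|     def get_config():
|         return {"retries": 3}
|
|     cfg = get_config()
|     cfg["retries"] = 99   # mutates the object inside the cache
|     print(get_config())   # {'retries': 99} for every caller now
|
| Returning immutable values, or defensively copying at the call
| site, avoids this.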
| ltbarcly3 wrote:
| The inputs and outputs of anything that is slow enough to care
| about caching are so large in an ML context that caching is
| unreasonable. The one exception would be people recomputing
| entire notebooks over and over, but again, caching in some
| generic way that can't tell whether the underlying data changed
| is going to break all kinds of stuff.
|
| Don't use notebooks for real work, guys. I know you have a PhD,
| but that's not an excuse to refuse to learn software
| engineering. We have lots of ways to define data dependencies
| and conditionally rebuild just the parts that need rebuilding.
| Look at any build system.

| skadamat wrote:
| Notebooks aren't necessarily for production, but they are a
| really great way to explore data & models and move at the speed
| of thought. They're for prototyping, collaboration, and
| feedback.
|
| I sense that most of the frustration teams have with notebooks
| is when they try to ship notebooks, which is likely a mistake
| (in many but not all cases), as you're pointing out!

| amakelov wrote:
| I see two concerns here:
|
| - inputs/outputs being high volume: the inputs/outputs that are
| large are often also things that don't change over the course
| of a project (e.g. a dataset or a model). So you don't really
| need to cache the object itself, just a (typically short
| string) immutable reference to it. As long as the object can be
| looked up at runtime, everything's fine;
|
| - detecting changes in data: content hashing is the general way
| to tell if a result changed; using `joblib.dump` and then
| hashing the resulting string provides a good starting
| implementation (a minimal sketch of this follows below), though
| certainly there are some corner cases to be aware of.
|
| Both of these approaches are available/used in mandala
| (https://github.com/amakelov/mandala; disclosure: I'm the
| author), which uses content hashing to tell when data (or even
| code/code dependencies) have changed, and gives you a generic
| caching decorator for functions which can then look up large
| objects by reference. This is the way I used it for e.g. my
| mechanistic interpretability work, which is often of the form
| "one big model + lots of analyses producing tiny artifacts
| based on it".
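|
| A minimal version of the content-hashing idea (illustrative
| sketch only, not mandala's actual internals):
|
|     import hashlib, io
|     import joblib  # third-party; pip install joblib
|
|     def content_hash(obj) -> str:
|         # Serialize to bytes with joblib (fast for objects
|         # containing numpy arrays), then hash the bytes.
|         buf = io.BytesIO()
|         joblib.dump(obj, buf)
|         return hashlib.sha256(buf.getvalue()).hexdigest()
|
| Two objects with equal content get equal keys, so a cache keyed
| on content_hash(inputs) notices when upstream data actually
| changes. The corner cases are objects whose serialized bytes
| differ despite equal semantics (e.g. dict insertion order).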
| david_draco wrote:
| How does this differ from the very neat joblib.Memory?
| https://joblib.readthedocs.io/en/latest/generated/joblib.Mem...

| bravura wrote:
| Man, I want to love joblib, but I can't.
|
| One of the smartest ML researchers I know swears by it.
|
| But for whatever reason, in my workflow of prototyping new ML
| approaches, optimizing and unoptimizing different preprocessing
| steps, and sometimes migrating data across the cloud, I always
| seem to start with joblib and then shed it pretty quickly in
| favor of large JSONL.gz and sqlite3 etc. checkpoints that I
| create after key steps.

| simon_acca wrote:
| Some prior art: https://github.com/nextjournal/clerk
|
| > Clerk notebooks always evaluate from top to bottom. Clerk
| builds a dependency graph of Clojure vars and only recomputes
| the needed changes to keep the feedback loop fast.

| amakelov wrote:
| This is neat and self-contained! But as someone running
| experiments with a high degree of interactivity, I often have
| an orthogonal requirement: add more computations to the _same_
| cell without recomputing previous computations done in the cell
| (or in other cells).
|
| For a concrete example, often in an ML project you want to
| study how several quantities vary across several parameters. A
| straightforward workflow for this is: write some nested loops,
| collect results in Python dictionaries, and finally put
| everything together in a dataframe and compare (by plotting or
| otherwise).
|
| However, after looking at the results, maybe you spot some
| trend and wonder if it will continue if you tweak one of the
| parameters by using a new value for it; of course, you also
| want to look at the previous values and bring everything
| together in the same plot(s). You now have a problem: either
| re-run the cell (thus losing previous work, which is annoying
| even if you only have to wait a minute - you know it's a wasted
| minute!), or write the new computation in a new cell, possibly
| with a lot of redundancy (which over time makes the notebook
| hard to navigate and keep consistent).
|
| So, this and other considerations eventually convinced me that
| the _function_ is more natural than the cell as an interface /
| boundary at which caching should be implemented, at least for
| my use cases (coming from ML research). I wrote a framework
| based on this idea, with lots of other features (some quite
| experimental/unusual) to turn this into a feasible experiment
| management tool - check it out at
| https://github.com/amakelov/mandala
|
| P.S.: I notice you use `pickle` for the hashing - `joblib.dump`
| is faster with objects containing numpy arrays, which covers a
| lot of useful ML things.
___________________________________________________________________
(page generated 2023-12-19 23:00 UTC)