[HN Gopher] XetCache: Improve Jupyter notebook reruns by caching...
       ___________________________________________________________________
        
       XetCache: Improve Jupyter notebook reruns by caching cells
        
       Author : skadamat
       Score  : 73 points
       Date   : 2023-12-19 15:18 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | morkalork wrote:
       | Tempting but also sounds like a nightmare for writing notebooks
       | with reproducible results?
        
         | skadamat wrote:
         | Using XetCache a bunch daily, I've honestly found the
         | _opposite_. Here are a few situations:
         | 
         | - My notebook queries a database for some data -> I run some
         | downstream calculations. Re-querying returns slightly different
         | data, but using cached outputs means I can replicate the work.
         | 
          | - In another notebook, I was trying to replicate my coworker's
         | work where he queried OpenAI a few thousand times. The results
         | may be different when I run it, since OpenAI changed something.
         | 
         | When notebooks are used well, it's in the context of internal
         | data / ML project collaboration. So being able to flash-freeze
         | a cell for your collaborators helps you juggle and pass around
         | notebooks more easily with confidence.
         | 
          | Shipping notebooks directly to production is a whole other
         | set of challenges that I have my own, separate opinions on but
         | that's a different beast entirely.
        
         | ylow wrote:
         | Author here.
         | 
         | The goal in a way is _better_ reproducibility. The memo hashes
         | the contents of the inputs, so if your cell is deterministic
         | (and you cover all the inputs), the memo should give you the
         | right answers. Of course if you want to rerun everything you
         | can always just delete the memos, but properly used it should
         | be a strict performance improvement.
         | 
          | Of course there are many improvements that can be made
          | (tracking whether dependent functions have changed, etc.),
          | and of course there are inherent limitations. This is very
          | much a "work in progress", but it is quite useful right now!
        
           | mirashii wrote:
           | > The goal in a way is better reproducibility. The memo
           | hashes the contents of the inputs, so if your cell is
           | deterministic (and you cover all the inputs), the memo should
           | give you the right answers.
           | 
           | That's an enormously large if, and without any way to detect
           | or manage when that's not the case, it's hard to see how this
           | could aid reproducibility. This adds a new failure mode for
           | reproducibility where an important side-effect is not rerun,
           | altering the results.
        
             | hedgehog wrote:
             | Think of this like incremental builds via make etc. You
             | don't necessarily trust the output 100% but it speeds up
             | work so much it's worth it, you can always do a clean build
             | at key points where you want more confidence.
        
           | ltbarcly3 wrote:
           | That only works for things entirely encapsulated by the value
           | of the cell, which isn't a useful amount of stuff. This will
            | only help with primitive toy examples where you are slinging
            | numpy arrays or other easily hashed data with no
            | encapsulation, indeterminacy, or order dependence.
           | 
           | Sorry, if it were possible to just generically cache
           | computations it would be done everywhere for everything. This
           | is just going to help with toy examples.
           | 
            | The only way this leads to "better reproducibility" is if it
            | fails to recompute things that it should have recomputed. If
            | the computation was actually deterministic from its inputs,
            | the best and literally the only thing this can do is exactly
            | what recomputing it would do, but faster. Frankly, that you
            | said it is more reproducible is enough evidence for me that
            | you are doing something fundamentally broken.
        
       | juujian wrote:
        | There is a similar feature built into R Markdown that I have
        | used a couple of times. That, and eval=FALSE, which stops you
        | from accidentally rerunning large queries, is actually really
        | helpful for figuring out complex computations on large datasets.
        
       | krastanov wrote:
       | This is very neat, but these days I am truly enamoured with
       | "reactive" notebooks. They are rather orthogonal to the concern
       | here. Reactive notebooks automatically build a dependency tree
       | between cells. When you edit a cell, all cells depending on it
       | get re-evaluated.
       | 
       | It is extremely valuable for reproducibility.
       | 
       | My favorite example is Pluto Notebooks in Julia (very Julia-
       | specific), but I have seen (maybe less polished) similar tools in
       | other languages too: https://plutojl.org/
       | 
        | On the other hand, when it comes to caching, Mandala (in
        | Python) brings together the best of both caching and
        | reactivity. It is a truly amazing memoization tool for
        | computational graphs, much more sophisticated and powerful
        | than the alternatives:
       | https://github.com/amakelov/mandala
        
         | ylow wrote:
         | Author here. Mandala looks really cool. Thanks for the
         | recommendation!
        
           | pgbovine wrote:
           | very cool idea! i was also very interested in this problem
           | during grad school ... prototyped an approach by hacking
           | CPython, but the code (python 2.6? from 2010 era) has long
           | bitrotted: https://pg.ucsd.edu/publications/IncPy-
           | memoization-in-Python...
           | https://pg.ucsd.edu/publications/IncPy-memoization-in-
           | Python...
        
           | pgbovine wrote:
           | also, while i have your attention here, since you wrote that
           | related post on (not) vector db's ... what would you
           | recommend for a newbie to get started with RAG? let's say i
           | have a large collection of text files on my computer that i
           | want to use for RAG. the options out there seem bewildering.
           | is there something simple akin to Ollama for RAG?
        
             | zzleeper wrote:
             | If you want to get something done quickly, try llama index.
             | 
             | If you want to learn/hack, pick an easy vectordb, get an
             | OpenAI API account, and do a quick attempt
             | 
             | Then you can switch to a local LLM and embedder, and it
             | helps a bit in learning what the pain points are
        
         | jihadjihad wrote:
         | There was an interesting post yesterday that captured some of
         | these ideas:
         | 
         | https://news.ycombinator.com/item?id=38681115
        
       | westurner wrote:
       | Dockerfile and Containerfile also cache outputs as layers.
       | 
       | `docker build --layers` is the default:
       | https://docs.podman.io/en/latest/markdown/podman-build.1.htm...
       | 
       | container/common//docs/Containerfile.5.md:
       | https://github.com/containers/common/blob/main/docs/Containe...
        
       | westurner wrote:
       | It may be better to just start with managing caching with code.
       | 
        | In the standard library, the @functools.cache and
        | @functools.cached_property function and method decorators cache
        | results in RAM only.
       | https://docs.python.org/3/library/functools.html
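        | 
        | For example (standard-library memoization; the cache lives
        | only in the current process):
        | 
        |   from functools import cache
        | 
        |   @cache
        |   def fib(n: int) -> int:
        |       # computed once per distinct n, then kept in RAM
        |       return n if n < 2 else fib(n - 1) + fib(n - 2)
        | 
        |   fib(300)  # repeat calls are just dictionary lookups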
       | 
       | Dask docs > "Automatic Opportunistic Caching":
       | https://docs.dask.org/en/stable/caching.html#automatic-oppor... ;
       | dask.cache.Cache(size_bytes:int) ... "Cache tasks, not
       | expressions"
       | 
        | Pickles are not a safe way to deserialize data; there is no
        | data-only pickling protocol.
       | 
       | So caching arbitrary cell objects (or e.g. stack frames) to disk,
       | as pickles at least, creates risk of code injection if the
       | serialized data contains executable code objects.
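        | 
        | The standard demonstration of why: unpickling can run
        | arbitrary code chosen by whoever produced the bytes:
        | 
        |   import os, pickle
        | 
        |   class Evil:
        |       def __reduce__(self):
        |           # tells pickle to call os.system(...) at load time
        |           return (os.system, ("echo owned",))
        | 
        |   blob = pickle.dumps(Evil())
        |   pickle.loads(blob)  # executes the shell command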
       | 
       | Similarly, the file permissions on e.g. rr traces.
       | https://en.wikipedia.org/wiki/Rr_(debugging)
       | 
        | Dataclasses in the standard library help with object
       | serialization, but not with caching cell outputs containing code
       | objects.
       | 
        | Apache Arrow and Parquet also require a schema for efficient
        | serialization.
       | 
       | LRU: Least-Recently Used
       | 
        | MRU: Most-Recently Used
       | 
       | Out-of-order execution in notebooks may or may not have wasted
       | cycles of human time and CPU time. If the prompt numbers aren't
       | sequential, what you ran in a notebook is not necessarily what
       | others will get when running that computation graph in order; and
       | so it's best practice to "Restart and Run All" to test the actual
       | control flow before committing or pushing.
       | 
       | There are ways to run and test notebooks _in order_ on or before
       | git commit (and compare their output with the output from the
       | previous run) like software with tests.
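        | 
        | For example, papermill is one such tool; it executes a
        | notebook top to bottom and writes the executed copy out for
        | comparison:
        | 
        |   import papermill as pm
        | 
        |   # runs every cell in order; raises if any cell errors
        |   pm.execute_notebook("analysis.ipynb", "analysis.run.ipynb")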
        
         | ylow wrote:
         | The issue comes when cells take many minutes or even hours to
         | run (intentionally or not). The ideal is indeed sequential, and
         | this helps me a lot with maintaining the sequential ordering as
         | it simplifies and speeds up the "restart and run all" process.
        
           | westurner wrote:
           | Oh I understand the issue.
           | 
           | E.g. dask-labextension does not implicitly do dask.cache for
           | you.
           | 
            | How are the objects serialized? Are code objects written to
            | files on disk, what is the permission umask on such files,
            | and in what directory (/var/cache) should they be
            | selinux-labeled? If the notebook is running code loaded from
            | the cache instead of from the source, then anyone who can
            | write to those cache files controls the execution flow of
            | the notebook (which is already hard to reproduce when cells
            | are run out of order).
        
         | IanCal wrote:
         | I agree about being explicit with this, though I will warn
         | about using the python functools caching, because it changes
         | the logic of your program. Because it doesn't copy data, any
          | return that's mutable is risky. It's also not _obvious_ that
          | this happens, and even less obvious if you're not keenly
          | aware of this kind of issue being likely.
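          | 
          | A small illustration of the hazard:
          | 
          |   from functools import cache
          | 
          |   @cache
          |   def default_settings():
          |       # mutable return value; @cache does not copy it
          |       return {"threshold": 0.5}
          | 
          |   s = default_settings()
          |   s["threshold"] = 0.9   # edits the cached object in place
          |   print(default_settings())  # now {'threshold': 0.9}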
        
       | ltbarcly3 wrote:
        | The inputs and outputs of anything that is slow enough to care
        | about caching are so large in an ML context that caching is
        | unreasonable. The one exception would be people recomputing
       | entire notebooks over and over, but again caching in some generic
       | way that can't tell if underlying data changed or not is going to
       | break all kinds of stuff.
       | 
        | Don't use notebooks for real work, guys. I know you have a PhD,
        | but that's not an excuse to refuse to learn software
        | engineering. We have lots of ways to define data dependencies
        | and conditionally rebuild just the parts that need to be
        | rebuilt. Look at any build
       | system.
        
         | skadamat wrote:
         | Notebooks aren't necessarily for production but are a really
         | great way to explore data & models and move at the speed of
          | thought. They're for prototyping, collaboration, and feedback.
         | 
          | I sense that most of the frustration teams have with notebooks
          | comes when they try to ship them, which is likely a mistake
          | (in many but not all cases), as you're pointing out!
        
         | amakelov wrote:
         | I see two concerns here:
         | 
         | - inputs/outputs being high volume: the inputs/outputs that are
         | large are often also things that don't change over the course
         | of a project (e.g. a dataset or a model). So you don't really
         | need to cache the object itself, just a (typically short
         | string) immutable reference to it. As long as the object can be
         | looked up at runtime, everything's fine;
         | 
         | - detecting changes in data: content hashing is the general way
         | in which you can tell if a result changed; using `joblib.dump`
         | and then hashing the resulting string provides a good starting
         | implementation, though certainly there are some corner cases to
         | be aware of.
         | 
         | Both of these approaches are available/used in mandala
         | (https://github.com/amakelov/mandala; disclosure: I'm the
         | author), which uses content hashing to tell when data (or even
         | code/code dependencies) have changed, and gives you a generic
         | caching decorator for functions which can then look up large
         | objects by reference; this is the way I used it for e.g. my
         | mechanistic interpretability work, which is often of the form
         | one big model + lots of analyses producing tiny artifacts based
         | on it.
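          | 
          | A sketch of the content-hashing approach above (illustrative,
          | not mandala's actual implementation):
          | 
          |   import hashlib, io
          |   import joblib
          |   import numpy as np
          | 
          |   def content_hash(obj) -> str:
          |       # serialize with joblib (fast for numpy-heavy
          |       # objects), then hash the resulting bytes
          |       buf = io.BytesIO()
          |       joblib.dump(obj, buf)
          |       return hashlib.sha256(buf.getvalue()).hexdigest()
          | 
          |   content_hash(np.arange(5)) == content_hash(np.arange(5))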
        
       | david_draco wrote:
        | How does this differ from the very neat joblib.Memory
       | https://joblib.readthedocs.io/en/latest/generated/joblib.Mem...?
        
         | bravura wrote:
         | Man I want to love joblib, but I can't.
         | 
          | One of the smartest ML researchers I know swears by it.
         | 
         | But for whatever reason, in my workflow of prototyping new ML
         | approaches, optimizing and unoptimizing different preprocessing
         | steps, and sometimes migrating data across the cloud, I always
         | seem to start with joblib and then shed it pretty quickly in
         | favor of large JSONL.gz and sqlite3 etc checkpoints that I
         | create after key steps.
        
       | simon_acca wrote:
       | Some prior art: https://github.com/nextjournal/clerk
       | 
       | > Clerk notebooks always evaluate from top to bottom. Clerk
       | builds a dependency graph of Clojure vars and only recomputes the
       | needed changes to keep the feedback loop fast.
        
       | amakelov wrote:
       | This is neat and self-contained! But as someone running
       | experiments with a high degree of interactivity, I often have an
       | orthogonal requirement: add more computations to the _same_ cell
       | without recomputing previous computations done in the cell (or in
       | other cells).
       | 
       | For a concrete example, often in an ML project you want to study
       | how several quantities vary across several parameters. A
       | straightforward workflow for this is: write some nested loops,
       | collect results in python dictionaries, finally put everything
       | together in a dataframe and compare (by plotting or otherwise).
       | 
       | However, after looking at the results, maybe you spot some trend
       | and wonder if it will continue if you tweak one of the parameters
       | by using a new value for it; of course, you also want to look at
       | the previous values and bring everything together in the same
       | plot(s). You now have a problem: either re-run the cell (thus
       | losing previous work, which is annoying even if you have to wait
       | 1 minute - you know it's a wasted minute!), or write the new
       | computation in a new cell, possibly with a lot of redundancy
       | (which over time makes the notebook hard to navigate and keep
       | consistent).
       | 
       | So, this and other considerations eventually convinced me that
        | the _function_ is more natural than the cell as an
        | interface/boundary at which caching should be implemented, at
        | least for my
       | use cases (coming from ML research). I wrote a framework based on
       | this idea, with lots of other features (some quite
       | experimental/unusual) to turn this into a feasible experiment
       | management tool - check it out at
       | https://github.com/amakelov/mandala
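        | 
        | For instance, with any function-level disk cache (joblib.Memory
        | here, purely as a stand-in), extending a parameter sweep only
        | recomputes the new combinations:
        | 
        |   from joblib import Memory
        | 
        |   memory = Memory("./cache", verbose=0)
        | 
        |   @memory.cache
        |   def run_experiment(model, lr):
        |       # placeholder for the slow training/eval step
        |       return {"model": model, "lr": lr, "loss": 0.0}
        | 
        |   grid = [("small", 1e-2), ("small", 1e-3),
        |           ("large", 1e-2), ("large", 1e-3)]
        |   results = {p: run_experiment(*p) for p in grid}
        |   # add ("large", 1e-4) later: only the new call runs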
       | 
       | P.S.: I notice you use `pickle` for the hashing - `joblib.dump`
       | is faster with objects containing numpy arrays, which covers a
       | lot of useful ML things
        
       ___________________________________________________________________
       (page generated 2023-12-19 23:00 UTC)