[HN Gopher] Don't Pickle Your Data
       ___________________________________________________________________
        
       Don't Pickle Your Data
        
       Author : behnamoh
       Score  : 46 points
       Date   : 2022-08-11 20:03 UTC (2 hours ago)
        
 (HTM) web link (www.benfrederickson.com)
 (TXT) w3m dump (www.benfrederickson.com)
        
       | chaxor wrote:
       | Last time I checked (i.e. performed several benchmarks upon),
       | parquet with Zstd was about the best way to store compressed data
       | for really fast and small files.
       | 
       | Zstd is quite good, and is now (iirc) in the linux kernel.
       | 
       | People may have some issue with parquet being column based, which
       | can make inserts a little slower for example, but for a large
       | mostly-set database it is a very good choice. A tsv.zst file
       | could be another way to go as well. But like others, I really
       | wish hdf5 had some of these compression features and wasn't so
       | dang slow.
        
       | ris wrote:
       | Don't Assume Things About Others' Use Cases.
       | 
       | In cases where I'm doing some sort of interactive or exploratory
       | data analysis with structures of complex python objects and want
       | to stash a copy of what I'm working with in case the next thing I
       | do screws things up or, who knows, I lose power - being able to
       | quickly pickle something and have an amount of confidence I'll be
       | able to get it back in a sensible state is very useful.
       | 
       | I've also used it for debug dumps in experimental software so I
       | have a chance of reproducing odd cases it comes across.
        
         | hansvm wrote:
         | I made a simple library for just such a purpose if you're
         | interested. You can wrap a whole module (like requests or
         | pandas) and cache every function/coroutine result to disk.
         | https://github.com/hmusgrave/ememo
         | 
         | I mainly use it for web scraping to be polite while I figure
         | out the remote API, but I'm sure somebody could have another
         | use.
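
A minimal sketch of the disk-caching idea described above, using only the standard library. It is not the API of the linked library; `disk_memo`, `CACHE_DIR`, and `slow_square` are hypothetical names, and it assumes the arguments and results are picklable and that the cache directory is written only by your own process.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="disk_memo_"))


def disk_memo(func):
    """Cache func's return value on disk, keyed by its pickled arguments."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key_bytes = pickle.dumps((func.__qualname__, args, sorted(kwargs.items())))
        path = CACHE_DIR / (hashlib.sha256(key_bytes).hexdigest() + ".pkl")
        if path.exists():
            # Acceptable only because this process wrote the file itself.
            return pickle.loads(path.read_bytes())
        result = func(*args, **kwargs)
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper


calls = []


@disk_memo
def slow_square(n):
    calls.append(n)  # records real invocations, not cache hits
    return n * n


first = slow_square(12)   # computed and written to disk
second = slow_square(12)  # read back from disk
```

The second call never reaches the function body, which is the property that makes this useful for polite web scraping.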
        
       | solarkraft wrote:
       | > Pickle is slow
       | 
       | ... Python is slow. But "slow" means "plenty fast" nowadays and
       | the development speed advantage is immense.
       | 
       | > unpickling malicious data can cause security issues
       | 
       | Why would I do that?
       | 
       | I can't read the linked page because it seems to be down/the link
       | is broken, so I don't know whether this includes user data that
       | is present before pickling and then turns out to be an issue after
       | pickling. Then I would worry, otherwise ... yeah, I'm not gonna
       | unpickle random data.
       | 
       | > Just use JSON
       | 
       | How do I effortlessly restore objects including their methods
       | from JSON?
        
         | marcosdumay wrote:
         | > How do I effortlessly restore objects including their methods
         | from JSON?
         | 
         | The recommendation from the title is usually made instead of
         | something like "deserializing executable data is harmful". That
         | is exactly the one question where the answer is "don't".
         | 
         | It's not exactly the unpickling process that is the problem.
         | It's how you established that the data isn't malicious. It is
         | very hard to use pickle without creating some local privilege
         | escalation possibilities. And at the end of the process, you
         | usually don't get any capability that replicating the code on
         | both sides of the communication channel wouldn't give you.
         | 
         | (The problem isn't specific to Python either. There was a time
         | when that kind of functionality was very hyped in both
         | industry and academia. For example, Java also got something
         | similar that they had to retract. The famous GNU Hurd OS (the
         | one that would never finish) was supposed to do that on the
         | system level.)
        
           | xhevahir wrote:
           | The Mozart/Oz people came up with pickle, I think.
        
         | NotTameAntelope wrote:
         | Instantiate a new object of the class with the JSON as
         | arguments, is one way.
         | 
         | I've built a bunch of these systems, keeping your data separate
         | solves a lot of future problems.
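
One way to read the suggestion above, as a rough sketch: keep the methods in the class and store only the data as JSON. The `Account` class and its fields are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Account:
    name: str
    balance: int

    def deposit(self, amount: int) -> None:
        self.balance += amount


# Serialize only the data...
saved = json.dumps(asdict(Account("alice", 10)))

# ...and rebuild the object by passing the JSON fields as constructor arguments.
restored = Account(**json.loads(saved))
restored.deposit(5)  # methods come from the class definition, not from the file
```

This only works directly for flat fields that JSON can represent; nested objects need their own conversion step.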
        
         | LtWorf wrote:
         | The benchmark is bad. Because after you load a json you can't
         | really use it. Well to use it you must check lists are lists
         | for real, objects are really objects and have the keys you
         | think they should have and so on.
         | 
         | The alternative is using something like typedload (which I
         | wrote) or pydantic in addition to json load, to avoid
         | cluttering the code with the countless and error prone checks
         | one must do to use untrusted json.
         | 
         | In the end dealing with untrusted json directly is terrible.
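
To make the cost of those checks concrete, here is a hand-written sketch of the validation that has to follow `json.loads` on untrusted input. The `parse_user` function and its expected fields are hypothetical; libraries such as the ones named above exist to generate this kind of code for you.

```python
import json


def parse_user(raw: str) -> dict:
    """Parse untrusted JSON and verify its shape before anything uses it."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name = data.get("name")
    tags = data.get("tags")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("'tags' must be a list of strings")
    return {"name": name, "tags": tags}


good = parse_user('{"name": "ada", "tags": ["admin"]}')

try:
    parse_user('{"name": "ada", "tags": "admin"}')  # a string, not a list
    rejected = False
except ValueError:
    rejected = True
```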
        
           | theamk wrote:
           | if you are dealing with untrusted data, pickle is not an
           | option at all, it lacks security.
        
         | cratermoon wrote:
         | >> unpickling malicious data can cause security issues
         | 
         | > Why would I do that?
         | 
         | If you pickle data from an untrusted source, say a web form
         | submission, and then later unpickle it. See
         | https://cwe.mitre.org/data/definitions/502.html
        
           | ademarre wrote:
           | _> If you pickle data from an untrusted source . . . and then
           | later unpickle it_
           | 
           | That is not exactly right. The risk is when you unpickle data
           | that was pickled by someone else or that was tampered with
           | after you pickled it.
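
For the tampering case, one common mitigation is to sign the pickle bytes and verify the signature before loading. This is a sketch under the assumption that the key is stored somewhere the attacker cannot read; `SECRET_KEY`, `dump_signed`, and `load_signed` are illustrative names.

```python
import hashlib
import hmac
import pickle

# Hypothetical key; in practice it must live outside the data store.
SECRET_KEY = b"hypothetical-key-kept-out-of-the-data-store"
TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes


def dump_signed(obj) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag over the pickle bytes."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def load_signed(blob: bytes):
    """Verify the tag first; unpickle only if the bytes are unmodified."""
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


blob = dump_signed({"user": "ada", "roles": ["admin"]})
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])  # flip one bit in the payload
```

This protects against modification after you pickled the data. It does nothing for data pickled by someone else who also holds the key.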
        
             | cratermoon wrote:
             | Look closer at the CWE and the linked examples: An attacker
             | can construct an illegitimate serialized object, like an
             | auth token or session ID, that instantiates one of Python's
             | subprocesses to execute arbitrary commands.
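
A harmless demonstration of the mechanism, as a sketch: `__reduce__` lets a pickle name any importable callable to run at load time. Here it is `eval` on a fixed arithmetic string; an attacker would pick something like `os.system`. The `Payload` class and the `NoGlobalsUnpickler` helper are illustrative names, and the restricted unpickler follows the `find_class` override pattern from the standard library documentation.

```python
import io
import pickle


class Payload:
    def __reduce__(self):
        # The pickle will say: "to rebuild me, call eval('6 * 7')".
        return (eval, ("6 * 7",))


malicious = pickle.dumps(Payload())
result = pickle.loads(malicious)  # eval runs during loading


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuse every global lookup, so only plain built-in data can load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_loads(data: bytes):
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


try:
    safe_loads(malicious)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

plain = safe_loads(pickle.dumps({"ok": [1, 2, 3]}))
```

Restricting globals narrows the attack surface but also removes the ability to restore arbitrary objects, which is the feature most people use pickle for.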
        
         | TremendousJudge wrote:
         | There's also the much faster cPickle. It may just be fast
         | enough for your needs. If it isn't, then you start exploring
         | other options.
        
           | kzrdude wrote:
           | the regular pickle module uses "cPickle" transparently. It
           | should not be worth mentioning since Python 3.x.
           | 
           | The article is 8 years old, so it kind of misses this detail.
        
           | IshKebab wrote:
           | That was included in the benchmarks.
        
         | IshKebab wrote:
         | > But "slow" means "plenty fast" nowadays
         | 
         | Not in my experience. "Slow" means "it seems fast enough now
         | and I'm sure we'll have time to rewrite it in a fast language
         | once it's grown to a monster that processes 1000 times the data
         | it does now... right?".
         | 
         | > Why would I do that?
         | 
         | Because you are using someone else's code and make the fairly
         | reasonable assumption that deserialising data doesn't cause
         | arbitrary code execution... But of course it's all your fault
         | because you didn't read their code to see that it's using
         | Pickle!
         | 
         | > How do I effortlessly restore objects including their methods
         | from JSON?
         | 
         | You don't. You shouldn't.
        
         | vore wrote:
         | One thing that's not mentioned is that pickled data is
         | effectively fossilized once you've pickled it. If you want to
         | change the layout of a class and have objects unpickle
         | correctly, it can be an ordeal, as objects are unpickled by
         | their class name, and you need both the original class and the
         | new class to correctly unpickle and migrate.
         | 
         | If you instead selectively pick what you want to serialize
         | about your data and keep the representations separate, you can
         | change the internal model easily without having a huge impact
         | on the serialized model.
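
A sketch of what keeping the representations separate can look like: the stored record carries a version number, and the loader migrates old records. The `User` class, the field names, and the "version 1 stored a single name field" history are all invented for the example.

```python
import json

SCHEMA_VERSION = 2


class User:
    def __init__(self, first: str, last: str):
        self.first = first
        self.last = last

    def to_record(self) -> dict:
        """Explicit serialized model, independent of the class internals."""
        return {"version": SCHEMA_VERSION, "first": self.first, "last": self.last}

    @classmethod
    def from_record(cls, record: dict) -> "User":
        if record.get("version", 1) == 1:
            # Hypothetical v1 layout: one "name" field, no version key.
            first, _, last = record["name"].partition(" ")
            return cls(first, last)
        return cls(record["first"], record["last"])


# An old record still loads after the class layout changed.
old = User.from_record(json.loads('{"name": "Ada Lovelace"}'))

# New records round-trip through the current layout.
new = User.from_record(json.loads(json.dumps(old.to_record())))
```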
        
       | jessikat wrote:
       | JSON really is a terrible serialization format. Even JavaScript
       | can't safely deserialize JSON without silent data corruption.
       | I've had to stringify numbers because of JavaScript, and there
       | were no errors. Perhaps that's the fault of JavaScript, but I
       | find the lack of encoding the numerical storage type to be a bug
       | rather than a feature.
        
         | windows_sucks wrote:
         | would love to see an example of the data corruption you're
         | talking about
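
One plausible example of the kind of corruption meant here, offered as an illustration rather than the parent commenter's actual case: JavaScript parses every JSON number as a 64-bit float, so integers above 2**53 can change silently. Python's `json` keeps integers exact, so the sketch emulates a double-only parser with `parse_int=float`.

```python
import json

raw = '{"id": 9007199254740993}'  # 2**53 + 1, valid JSON

# Python keeps arbitrary-precision integers.
exact = json.loads(raw)["id"]

# Emulate a parser that stores every number as a double.
as_double = json.loads(raw, parse_int=float)["id"]
lost = int(as_double) != exact  # the value changed and no error was raised
```

Sending such identifiers as strings, as the parent describes, avoids the problem.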
        
       | solarkraft wrote:
       | (2014)
        
       | ohiovr wrote:
       | I found unpickling a lot slower than json loading.
        
         | nomel wrote:
         | I've found it to be much faster, with large amounts of data,
         | like numpy arrays. And, some things aren't possible to convert
         | to JSON, without writing a bunch of code to do the
         | serialization/deserialization, which often makes things slow
         | again.
        
         | LtWorf wrote:
         | But then you have to check that the "list" is really a list,
         | that the objects do have the keys, that the strings are
         | strings.
         | 
         | This should be factored into the cost, and it wasn't in the
         | benchmark.
        
       | cratermoon wrote:
       | Much the same can be leveled against Java's serialized objects.
       | The OWASP top 10 from 2017 even had "Insecure Deserialization" at
       | #8. The 2021 update[1] changes it to "Software and Data Integrity
       | Failures", still at #8. It's CWE-502: Deserialization of
       | Untrusted Data[2], where Python and Java are specifically
       | mentioned.
       | 
       | 1 https://owasp.org/www-project-top-ten/
       | 
       | 2 https://cwe.mitre.org/data/definitions/502.html
        
       | jleahy wrote:
       | Should be (2014).
       | 
       | More interestingly, as much as numpy and everybody advises
       | against it, I believe that pickling data into a zstd stream is
       | one of the fastest ways of storing sets of large matrices.
       | 
       | The 'recommended' alternatives include numpy.save (uncompressed,
       | which is bad when lz4 is faster than memcpy and you're saving to
       | disk), numpy.savez (uncompressed zip files, even worse),
       | numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's
       | worst formats and also using zlib), etc. I wish it wasn't the
       | case, but it certainly seems like a good argument for pickle.
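
The shape of the "pickle into a compressed stream" approach, as a sketch. Two substitutions are made so it runs with only the standard library: gzip stands in for zstd, and nested lists of floats stand in for numpy matrices. The speed and size claims above are about zstd and are not reproduced by this stand-in.

```python
import gzip
import io
import pickle

# Stand-in data: four 50x50 "matrices" as nested lists of floats.
matrices = [[[float(i * j) for j in range(50)] for i in range(50)] for _ in range(4)]

# Write the pickle directly into a compressing stream.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as stream:
    pickle.dump(matrices, stream, protocol=pickle.HIGHEST_PROTOCOL)

raw_size = len(pickle.dumps(matrices, protocol=pickle.HIGHEST_PROTOCOL))
packed_size = len(buffer.getvalue())

# Read it back through a decompressing stream.
buffer.seek(0)
with gzip.GzipFile(fileobj=buffer, mode="rb") as stream:
    restored = pickle.load(stream)
```

With a real file, `buffer` would be replaced by an opened file object; the pickle calls stay the same.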
        
         | a-dub wrote:
         | even though all the metadata is weird and overengineered, i
         | would probably still use hdf5 as it provides for interop with
         | other numerical computing environments (matlab, julia).
         | 
         | also hdf5 is at least securable. pickle streams are not
         | designed for that. it's good to be able to send your data to
         | others.
         | 
         | fwiw. matlab .mat files are hdf5 at their core.
         | 
         | i should also note that json is pretty bad for numerical data.
         | the specification says nothing about how much precision to
         | retain and printf/scanf is ridiculously slow for storing
         | floats.
        
           | jleahy wrote:
           | hdf5 is extremely slow however, pickle+zstd is faster and
           | results in smaller files.
        
             | welterde wrote:
             | I kind of feel like you are doing something wrong with
             | HDF5, since for my use cases it's the fastest solution by
             | far.
        
             | a-dub wrote:
             | hdf5+zstd would likely be comparable.
             | 
             | good luck loading those pickle files 5y from now.
        
               | jleahy wrote:
               | hdf5+zstd is not a thing (or at least not a thing that's
               | interoperable or usable 5y from now). I just wish there
               | was a good off-the-shelf solution, this stuff is not
               | difficult.
        
       | northisup wrote:
       | Who is out there using pickle because they think it is a good
       | idea? We use it because it is easy and built into the language
       | and handles datetime by default!
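
The datetime point is easy to check. A short sketch of the difference, plus the usual JSON workaround of storing an ISO 8601 string (the workaround is an addition, not something the comment proposes):

```python
import json
import pickle
from datetime import datetime, timezone

stamp = datetime(2022, 8, 11, 20, 3, tzinfo=timezone.utc)

# json refuses datetime objects out of the box.
try:
    json.dumps({"when": stamp})
    json_ok = True
except TypeError:
    json_ok = False

# pickle round-trips them with no extra code.
via_pickle = pickle.loads(pickle.dumps({"when": stamp}))

# With JSON you convert by hand, for example through an ISO 8601 string.
encoded = json.dumps({"when": stamp.isoformat()})
via_json = datetime.fromisoformat(json.loads(encoded)["when"])
```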
        
       | 0xbadcafebee wrote:
       | I agree with using JSON for most things, but YAML is another data
       | serialization format with a lot more features
       | (https://yaml.org/spec/1.2.2/). It too is a security risk, but
       | you can use a 'safe' version of it. If you use it, use the great
       | ruamel.yaml library. (No idea how slow it is, but probably slow)
        
       | meatmanek wrote:
       | They forgot another major problem: You can only reliably unpickle
       | data using the same (or same-enough) code that pickled it. If
       | your class definitions have changed or moved around, unpickling
       | can break.
        
         | kangalioo wrote:
         | Reminds me of C's dumping structs to disk via memcpy
        
       | AussieWog93 wrote:
       | From the conclusion of the article: >Pickle on the other hand is
       | slow, insecure, and can be only parsed in Python. The only real
       | advantage to pickle is that it can serialize arbitrary Python
       | objects
       | 
       | i.e., a bunch of drawbacks that don't really matter at all for the
       | average home-made Python script, plus the "minor" advantage of
       | being able to pickle literally anything and have it "just work".
       | 
       | None of the other options out there let you build a foolproof
       | "save button" in 3 lines of code.
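
For reference, a sketch of the "save button" being described. The state dictionary and the temporary file path are placeholders; the save itself is the two-line `with` block.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder state with types JSON cannot represent directly (set, tuple).
state = {"step": 3, "scores": [0.5, 0.75], "seen": {"a", "b"}, "pos": (1, 2)}
save_path = Path(tempfile.mkdtemp()) / "state.pkl"  # hypothetical location

# Save.
with open(save_path, "wb") as fh:
    pickle.dump(state, fh)

# Load.
with open(save_path, "rb") as fh:
    restored = pickle.load(fh)
```

The set and the tuple come back as a set and a tuple, which is the "just works" part.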
        
       | ridiculous_fish wrote:
       | What are some alternatives to pickling which can handle cyclic
       | references?
       | 
       | I've looked into ORMs but these are invasive in terms of needing
       | to annotate your classes and fields.
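
To make the question concrete: pickle preserves cycles, plain JSON rejects them. One non-invasive pattern, sketched below as an assumption rather than a recommendation from the thread, is to store nodes in a table and express references as table keys. The node layout and field names are made up.

```python
import json
import pickle

# Two dicts that point at each other.
parent = {"name": "parent"}
child = {"name": "child", "parent": parent}
parent["child"] = child

# pickle keeps the cycle intact.
clone = pickle.loads(pickle.dumps(parent))
cycle_kept = clone["child"]["parent"] is clone

# json.dumps detects the cycle and refuses.
try:
    json.dumps(parent)
    json_ok = True
except ValueError:
    json_ok = False

# Id-based encoding: nodes live in a table, references are table keys.
table = {
    "nodes": {
        "0": {"name": "parent", "child": "1"},
        "1": {"name": "child", "parent": "0"},
    },
    "root": "0",
}
decoded = json.loads(json.dumps(table))

# Rebuild in two passes: create every node, then wire up the references.
nodes = {key: {"name": rec["name"]} for key, rec in decoded["nodes"].items()}
for key, rec in decoded["nodes"].items():
    for field in ("child", "parent"):
        if field in rec:
            nodes[key][field] = nodes[rec[field]]
rebuilt = nodes[decoded["root"]]
```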
        
       ___________________________________________________________________
       (page generated 2022-08-11 23:00 UTC)