[HN Gopher] Don't Pickle Your Data
___________________________________________________________________
Author : behnamoh
Score  : 46 points
Date   : 2022-08-11 20:03 UTC (2 hours ago)

(HTM) web link (www.benfrederickson.com)

| chaxor wrote:
| Last time I checked (i.e. ran several benchmarks on it),
| parquet with Zstd was about the best way to store compressed data
| in really fast and small files.
|
| Zstd is quite good, and is now (iirc) in the Linux kernel.
|
| People may have some issue with parquet being column based, which
| can make inserts a little slower for example, but for a large
| mostly-set database it is a very good choice. A tsv.zst file
| could be another way to go as well. But like others, I really
| wish hdf5 had some of these compression features and wasn't so
| dang slow.
| ris wrote:
| Don't Assume Things About Others' Use Cases.
|
| In cases where I'm doing some sort of interactive or exploratory
| data analysis with structures of complex Python objects and want
| to stash a copy of what I'm working with in case the next thing I
| do screws it up or, who knows, I lose power - being able to
| quickly pickle something and have an amount of confidence I'll be
| able to get it back in a sensible state is very useful.
|
| I've also used it for debug dumps in experimental software so I
| have a chance of reproducing odd cases it comes across.
| hansvm wrote:
| I made a simple library for just such a purpose if you're
| interested. You can wrap a whole module (like requests or
| pandas) and cache every function/coroutine result to disk.
| https://github.com/hmusgrave/ememo
|
| I mainly use it for web scraping to be polite while I figure
| out the remote API, but I'm sure somebody could have another
| use.
| solarkraft wrote:
| > Pickle is slow
|
| ... Python is slow. But "slow" means "plenty fast" nowadays and
| the development speed advantage is immense.
|
| > unpickling malicious data can cause security issues
|
| Why would I do that?
|
| I can't read the linked page because it seems to be down/the link
| is broken, so I don't know whether this includes user data that
| is present before pickling and then turns out to be an issue
| after pickling. Then I would worry, otherwise ... yeah, I'm not
| gonna unpickle random data.
|
| > Just use JSON
|
| How do I effortlessly restore objects including their methods
| from JSON?
| marcosdumay wrote:
| > How do I effortlessly restore objects including their methods
| from JSON?
|
| The recommendation from the title is usually made instead of
| something like "deserializing executable data is harmful". That
| is exactly the one question where the answer is "don't".
|
| It's not exactly the unpickling process that is the problem.
| It's how you established that the data isn't malicious. It is
| very hard to use pickle without creating some local privilege
| escalation possibilities. And at the end of the process, you
| usually don't get any capability that replicating the code on
| both sides of the communication channel wouldn't give you.
|
| (The problem isn't specific to Python either. There was a time
| when that kind of functionality was very hyped in both
| industry and academia. For example, Java also got something
| similar that they had to retract. The famous GNU Hurd OS (the
| one that would never finish) was supposed to do that on the
| system level.)
| xhevahir wrote:
| The Mozart/Oz people came up with pickle, I think.
| NotTameAntelope wrote:
| Instantiating a new object of the class with the JSON as
| arguments is one way.
|
| I've built a bunch of these systems; keeping your data separate
| solves a lot of future problems.
| LtWorf wrote:
| The benchmark is bad, because after you load JSON you can't
| really use it. Well, to use it you must check that lists really
| are lists, that objects really are objects and have the keys
| you think they should have, and so on.
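A minimal sketch of the kind of checks described above, using only the standard library. The `Point` class and its `x`, `y`, and `tags` fields are hypothetical, chosen just for illustration; real payloads will need their own rules.

```python
import json
from dataclasses import dataclass


@dataclass
class Point:
    # Hypothetical record type; the field names are assumptions for this sketch.
    x: float
    y: float
    tags: list


def point_from_json(text: str) -> Point:
    """Parse untrusted JSON and verify its shape before building a Point."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in ("x", "y"):
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key!r} must be a number")
    tags = data.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("'tags' must be a list of strings")
    return Point(float(data["x"]), float(data["y"]), tags)
```

Even for three fields the checks outnumber the line that does the actual parsing, which is the cost being argued about here.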
|
| The alternative is using something like typedload (which I
| wrote) or pydantic in addition to json load, to avoid
| cluttering the code with the countless and error-prone checks
| one must do to use untrusted json.
|
| In the end dealing with untrusted json directly is terrible.
| theamk wrote:
| if you are dealing with untrusted data, pickle is not an
| option at all, it lacks security.
| cratermoon wrote:
| >> unpickling malicious data can cause security issues
|
| > Why would I do that?
|
| If you pickle data from an untrusted source, say a web form
| submission, and then later unpickle it. See
| https://cwe.mitre.org/data/definitions/502.html
| ademarre wrote:
| > If you pickle data from an untrusted source . . . and then
| later unpickle it
|
| That is not exactly right. The risk is when you unpickle data
| that was pickled by someone else or that was tampered with
| after you pickled it.
| cratermoon wrote:
| Look closer at the CWE and the linked examples: An attacker
| can construct an illegitimate, serialized object, like an
| auth token or sessionID that instantiates one of Python's
| subprocesses to execute arbitrary commands
| TremendousJudge wrote:
| There's also the much faster cPickle. It may just be fast
| enough for your needs. If it isn't, then you start exploring
| other options.
| kzrdude wrote:
| the regular pickle module uses "cPickle" transparently. It
| should not be worth mentioning since Python 3.x.
|
| The article is 8 years old, so it kind of misses this detail.
| IshKebab wrote:
| That was included in the benchmarks.
| IshKebab wrote:
| > But "slow" means "plenty fast" nowadays
|
| Not in my experience. "Slow" means "it seems fast enough now
| and I'm sure we'll have time to rewrite it in a fast language
| once it's grown to a monster that processes 1000 times the data
| it does now... right?".
|
| > Why would I do that?
|
| Because you are using someone else's code and make the fairly
| reasonable assumption that deserialising data doesn't cause
| arbitrary code execution... But of course it's all your fault
| because you didn't read their code to see that it's using
| Pickle!
|
| > How do I effortlessly restore objects including their methods
| from JSON?
|
| You don't. You shouldn't.
| vore wrote:
| One thing that's not mentioned is that pickled data is
| effectively fossilized once you've pickled it. If you want to
| change the layout of a class and have objects unpickle
| correctly, it can be an ordeal, as objects are unpickled by
| their class name, and you need both the original class and the
| new class to correctly unpickle and migrate.
|
| If you instead selectively pick what you want to serialize
| about your data and keep the representations separate, you can
| change the internal model easily without having a huge impact
| on the serialized model.
| jessikat wrote:
| JSON really is a terrible serialization format. Even JavaScript
| can't safely deserialize JSON without silent data corruption.
| I've had to stringify numbers because of JavaScript, and there
| were no errors. Perhaps that's the fault of JavaScript, but I
| find the lack of encoding the numerical storage type to be a bug
| rather than a feature.
| windows_sucks wrote:
| would love to see an example of the data corruption you're
| talking about
| solarkraft wrote:
| (2014)
| ohiovr wrote:
| I found unpickling a lot slower than json loading.
| nomel wrote:
| I've found it to be much faster, with large amounts of data,
| like numpy arrays. And, some things aren't possible to convert
| to JSON, without writing a bunch of code to do the
| serialization/deserialization, which often makes things slow
| again.
| LtWorf wrote:
| But then you have to check that the "list" is really a list,
| that the objects do have the keys, that the strings are
| strings.
|
| This should be factored into the cost, and it wasn't in the
| benchmark.
| cratermoon wrote:
| Much the same criticism can be leveled at Java's serialized
| objects. The OWASP top 10 from 2017 even had "Insecure
| Deserialization" at #8. The 2021 update[1] changes it to
| "Software and Data Integrity Failures", still at #8. It's
| CWE-502: Deserialization of Untrusted Data[2], where Python and
| Java are specifically mentioned.
|
| [1] https://owasp.org/www-project-top-ten/
|
| [2] https://cwe.mitre.org/data/definitions/502.html
| jleahy wrote:
| Should be (2014).
|
| More interestingly, as much as numpy and everybody advises
| against it, I believe that pickling data into a zstd stream is
| one of the fastest ways of storing sets of large matrices.
|
| The 'recommended' alternatives include numpy.save (uncompressed,
| which is bad when lz4 is faster than memcpy and you're saving to
| disk), numpy.savez (uncompressed zip files, even worse),
| numpy.savez_compressed (zlib zip, awful), hdf5 (one of the
| world's worst formats and also using zlib), etc. I wish it wasn't
| the case, but it certainly seems like a good argument for pickle.
| a-dub wrote:
| even though all the metadata is weird and overengineered, i
| would probably still use hdf5 as it provides for interop with
| other numerical computing environments (matlab, julia).
|
| also hdf5 is at least securable. pickle streams are not
| designed for that. it's good to be able to send your data to
| others.
|
| fwiw. matlab .mat files are hdf5 at their core.
|
| i should also note that json is pretty bad for numerical data.
| the specification says nothing about how much precision to
| retain and printf/scanf is ridiculously slow for storing
| floats.
| jleahy wrote:
| hdf5 is extremely slow however, pickle+zstd is faster and
| results in smaller files.
| welterde wrote:
| I kind of feel like you are doing something wrong with
| HDF5, since for my use cases it's the fastest solution by
| far.
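A rough sketch of the "pickle into a compressed stream" approach discussed in this subthread. zstd needs a third-party binding, so this uses the standard library's `lzma` as a stand-in; that substitution is an assumption of this sketch, not what the commenters benchmarked, and it says nothing about relative speed. The helper names `dump_compressed` and `load_compressed` are hypothetical.

```python
import lzma
import pickle


def dump_compressed(obj, path):
    """Pickle obj straight into an LZMA-compressed file."""
    with lzma.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read back an object written by dump_compressed (trusted files only)."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)
```

Swapping the compressor should only mean replacing the two `lzma.open` calls with another file-like wrapper. The security caveat from elsewhere in the thread still applies: only load files you wrote yourself.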
| a-dub wrote:
| hdf5+zstd would likely be comparable.
|
| good luck loading those pickle files 5y from now.
| jleahy wrote:
| hdf5+zstd is not a thing (or at least not a thing that's
| interoperable or usable 5y from now). I just wish there
| was a good off-the-shelf solution, this stuff is not
| difficult.
| northisup wrote:
| Who is out there using pickle because they think it is a good
| idea? We use it because it is easy and built into the language
| and handles datetime by default!
| 0xbadcafebee wrote:
| I agree with using JSON for most things, but YAML is another data
| serialization format with a lot more features
| (https://yaml.org/spec/1.2.2/). It too is a security risk, but
| you can use a 'safe' version of it. If you use it, use the great
| ruamel.yaml library. (No idea how slow it is, but probably slow)
| meatmanek wrote:
| They forgot another major problem: You can only reliably unpickle
| data using the same (or same-enough) code that pickled it. If
| your class definitions have changed or moved around, unpickling
| can break.
| kangalioo wrote:
| Reminds me of C's dumping structs to disk via memcpy
| AussieWog93 wrote:
| From the conclusion of the article:
|
| > Pickle on the other hand is slow, insecure, and can be only
| parsed in Python. The only real advantage to pickle is that it
| can serialize arbitrary Python objects
|
| ie, a bunch of drawbacks that don't really matter at all for the
| average home-made Python script, plus the "minor" advantage of
| being able to pickle literally anything and have it "just work".
|
| None of the other options out there let you build a foolproof
| "save button" in 3 lines of code.
| ridiculous_fish wrote:
| What are some alternatives to pickling which can handle cyclic
| references?
|
| I've looked into ORMs but these are invasive in terms of needing
| to annotate your classes and fields.
___________________________________________________________________
(page generated 2022-08-11 23:00 UTC)