[HN Gopher] Analyzing multi-gigabyte JSON files locally ___________________________________________________________________ Analyzing multi-gigabyte JSON files locally Author : bubblehack3r Score : 99 points Date : 2023-03-18 15:41 UTC (7 hours ago) (HTM) web link (thenybble.de) (TXT) w3m dump (thenybble.de) | jahewson wrote: | I had to parse a database backup from Firebase, which was, | remarkably, a 300GB JSON file. The database is a tree rooted at a | single object, which means that any tool that attempts to stream | individual objects always wanted to buffer this single 300GB root | object. It wasn't enough to strip off the root either, as the | really big records were arrays a couple of levels down, with a | few different formats depending on the schema. For added fun our | data included some JSON serialised inside strings too. | | This was a few years ago and I threw every tool and language I | could at it, but they were either far too slow or buffered | records larger than memory; even the fancy C++ SIMD parsers did | this. I eventually got something working in Go and it was | impressively fast and ran on my MacBook, but we never ended up | using it as another engineer just wrote a script that read the | entire database from the Firebase API record-by-record throttled | over several days, lol. | simonw wrote: | I've used ijson in Python for this kind of thing in the past, | it's pretty effective: https://pypi.org/project/ijson/ | kosherhurricane wrote: | What I would have done is first create a map of the file, just | the keys and shapes, without the data. That way I can traverse | the file. And then mmap the file to traverse and read the data. | | A couple of dozen lines of code would do it. | zerop wrote: | The other day I discovered duckdb on HN, which allows firing SQL on | JSON. But I am not sure if it can handle this much data. | version_five wrote: | For a hacky solution, I've often just used grep, tr, awk, etc. If | it's a well-structured file and all the records are the same or | similar enough, it's often possible to grep your way into getting | the thing you want on each line, and then use awk or sed to parse | out the data. Obviously there are lots of ways this can break down, but 9GB | is nothing if you can make it work with these tools. I have found | jq much slower. | philwelch wrote: | Yeah, if the JSON is relatively flat, converting to TSV makes | the data fairly trivial to consume using awk and other classic | command line tools. I did a lot of this when a past employer | decided they couldn't afford Splunk. | jeffbee wrote: | One thing that will greatly help with `jq` is rebuilding it so it | suits your machine. The package of jq that comes with Debian or | Ubuntu Linux is garbage that targets k8-generic (on the x86_64 | variant), is built with debug assertions, and uses the GNU system | allocator, which is the worst allocator on the market. Rebuilding | it targeting your platform, without assertions, and with tcmalloc | makes it twice as fast in many cases. | | On this 988MB dataset I happen to have at hand, compare Ubuntu jq | with my local build, with hot caches on an Intel Core i5-1240P: | `time parallel -n 100 /usr/bin/jq -rf ../program.jq ::: *` -> 1.843s | `time parallel -n 100 ~/bin/jq -rf ../program.jq ::: *` -> 1.121s | | I know it stinks of Gentoo, but if you have any performance | requirements at all, you can help yourself by rebuilding the | relevant packages. Never use the upstream mysql, postgres, redis, | jq, ripgrep, etc etc.
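Picking up the ijson suggestion from earlier in the thread: a minimal sketch of the streaming approach might look like the code below. It assumes the file's top level is a JSON array; the filename and the `score` field are hypothetical stand-ins for whatever the real data contains.

    import ijson  # streaming JSON parser; pip install ijson

    count = 0
    total = 0
    with open("comments.json", "rb") as f:        # hypothetical file name
        # "item" yields each element of a top-level JSON array one at a
        # time, so memory use stays flat regardless of file size.
        for record in ijson.items(f, "item"):
            count += 1
            total += record.get("score", 0)       # hypothetical field

    print(count, "records, average score", total / max(count, 1))

For newline-delimited JSON the same effect needs no special library at all: loop over the file and call json.loads on each line.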
| jeffbee wrote: | I guess another interesting fact worth mentioning here is the | "efficiency" cores on a modern Intel CPU are every bit as good | as the performance cores for this purpose. The 8C/8T Atom side | of the i5-1240P has the same throughput as the 4C/8T Core side | for this workload. I get 1.79s using CPUs 0-7 and 1.82s on CPUs | 8-15. | berkle4455 wrote: | Just use clickhouse-local or duckdb. Handling data measured in | terabytes is easy. | Animats wrote: | Rust's serde-json will iterate over a file of JSON without | difficulty, and will write one from an iterative process without | building it all in memory. I routinely create and read multi- | gigabyte JSON files. They're debug dumps of the scene my | metaverse viewer is looking at. | | Streaming from large files was routine for XML, but for some | reason, JSON users don't seem to work with streams much. | ginko wrote: | This is something I did recently. We have this binary format we | use for content traces. You can dump it to JSON, but that turns a | ~10GB file into a ~100GB one. I needed to check some aspects of this | with Python, so I used ijson[1] to parse the JSON without having | to keep it in memory. | | The nice thing is that our dumping tool can also output JSON to | STDOUT so you don't even need to dump the JSON representation to | the hard disk. Just open the tool in a subprocess and pipe the | output to the ijson parser. Pretty handy. | | [1] https://pypi.org/project/ijson/ | maCDzP wrote: | I like SQLite and JSON columns. I wonder how fast it would be if | you save the whole JSON file in one record and then query SQLite. | I bet it's fast. | | You could probably use that one record to then build tables in | SQLite that you can query. | funstuff007 wrote: | Anyone who's generating multi-GB JSON files on purpose has some | explaining to do. | ghshephard wrote: | Logs. jsonl is a popular streaming format. | funstuff007 wrote: | I guess, but you can grep JSONL just like you can a regular | log file. As such, you don't need any sophisticated tools as | discussed in this article. | | > 2. Each Line is a Valid JSON Value | | > 3. Line Separator is '\n' | | https://jsonlines.org/ | ghshephard wrote: | Yes - 100%. I spend _hours_ a day blasting through line json | and I _always_ pre-filter with egrep, and only move to | things like jq with the hopefully (dramatically) reduced | log size. | | Also - with line json - you can just grab the first 10,000 | or so lines and tweak your query with that before throwing | it against the full log structure as well. | | With that said - this entire thread has been gold - lots of | useful strategies for working with large json files. | 19h wrote: | To analyze and process the pushshift Reddit comment & submission | archives we used Rust with simd-json and currently get to around | 1-2GB/s (that's including the decompression of the zstd | stream). Still takes a load of time when the decompressed files | are 300GB+. | | Weirdly enough we ended up networking a bunch of Apple silicon | MacBooks together as the Ryzen 32C servers didn't even come close | to matching its performance :/ | cube2222 wrote: | OctoSQL[0] or DuckDB[1] will most likely be much simpler, while | going through 10 GB of JSON in a couple seconds at most. | | Disclaimer: author of OctoSQL | | [0]: https://github.com/cube2222/octosql | | [1]: https://duckdb.org/ | hamilyon2 wrote: | Clickhouse is by far the best way to analyze 10GB of JSON. | | The latest batch of features adds near-native json support.
Coupled | with the ability to add extracted columns, it makes the whole process | easy. It is fast, you can use familiar SQL syntax, and you are not | constrained by RAM limits. | | It is a bit hard if you want to iteratively process the file line-by- | line or use advanced SQL. And you have the one-time cost of writing a | schema. Apart from that, I can't think of any downsides. | | Edit: clarify a bit | tylerhannan wrote: | There was an interesting article on this recently... | | https://news.ycombinator.com/item?id=31004563 | | It prompted quite some conversation and discussion and, in the | end, an updated benchmark across a variety of tools | https://colab.research.google.com/github/dcmoura/spyql/blob/... | conveniently right at the 10GB dataset size. | mastax wrote: | Dask looks really cool, I hope I remember it exists next time I | need it. | | I've been pretty baffled, and disappointed, by how bad Python is | at parallel processing. Yeah, yeah, I know: the GIL. But so much | time and effort has been spent engineering around every other | flaw in Python and yet this part is still so bad. I've tried | every "easy to use" parallelism library that gets recommended and | none of them has satisfied. Always: "couldn't pickle this | function", or spawning loads of processes that use up all my RAM | for no visible reason but don't use any CPU or make any | indication of progress. I'm sure I'm missing something, I'm not a | Python guy. But every other language I've used has an easy-to-use | stateless parallel map that hasn't given me any trouble. | isoprophlex wrote: | I've been seeing python at least once every week for a looooong | time. Years. A decade maybe. You are not missing anything. It's | a big steamy pile of horse manure. | isoprophlex wrote: | Nice writeup, but is jq & GNU parallel or a notebook full of | python spaghetti the best (least complex) tool for the job? | | DuckDB might be nice here, too. See | https://duckdb.org/2023/03/03/json.html | samwillis wrote: | DuckDB is awesome. As a comparison, I have a dataset that | starts life as a 35gb set of json files. Imported into Postgres | it's ~6gb, and a key query I run takes 3 min 33 seconds. | | Imported into DuckDB (still about ~6gb for all columns), the | same SQL query takes 1.1 seconds! | | The key thing is that the columns (for all rows) the query | scans total only about 100mb, so DuckDB has a lot less to scan. | But on top of that, its vectorised query execution is | incredibly quick. | | https://mobile.twitter.com/samwillis/status/1633213350002798... | pletnes wrote: | I found that exporting big tables as a bunch of parquet files | is faster and uses less memory than duckdb's internal format. | pletnes wrote: | Duckdb is fantastic. Doesn't need a schema, either. | zeitlupe wrote: | Spark is my favorite tool to deal with jsons. It can read as many | jsons as you want - in any format, in arbitrarily nested folder | structures - offers parallelization, and is great for flattening | structs. I've never run into memory issues (or never run out of | workarounds) so far. | pidge wrote: | Yeah, given that everything is now multi-core, it makes sense | to use a natively parallel tool for anything compute-bound. And | Spark will happily run locally and (unlike previous big data | paradigms) doesn't require excessive mental contortions. | | Of course while you're at it, you should probably just convert | all your JSON into Parquet to speed up successive queries...
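As a rough illustration of the JSON-to-Parquet conversion suggested above, a sketch using DuckDB's Python API (the approach described in the linked duckdb.org post) could look like the following. It assumes DuckDB 0.7 or later, where read_json_auto is available; the filename and column names are made up.

    import duckdb

    con = duckdb.connect()

    # One pass over the JSON to infer a schema and rewrite the data as
    # columnar Parquet (read_json_auto is DuckDB's schema-inferring reader).
    con.execute("""
        COPY (SELECT * FROM read_json_auto('logs.json'))
        TO 'logs.parquet' (FORMAT PARQUET)
    """)

    # Later queries hit the Parquet file and scan only the columns they touch.
    rows = con.execute("""
        SELECT event_type, count(*) AS n
        FROM 'logs.parquet'
        GROUP BY event_type
        ORDER BY n DESC
        LIMIT 10
    """).fetchall()
    print(rows)

Whether the one-off conversion pays for itself depends on how many queries follow it, which is exactly the trade-off debated in the replies above.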
| cpuguy83 wrote: | Jq does support slurp mode so you should be able to do this using | that... granted I've never attempted this and the syntax is very | different. | | --- edit --- | | I used the wrong term, the correct term is streaming mode. | Groxx wrote: | It does work, but it is a _huge_ headache to use, in part | because the documentation around it is nowhere near enough to | understand how to use it. | | If I used it regularly I'd probably develop a feel for it and | be much faster - it is _reasonable_ , just abnormal and | extremely low level, and much harder to use with other jq | stuff. But I almost always start looking for alternatives well | before I reach that point. | hprotagonist wrote: | i would seriously consider sqlite-utils here. | | https://sqlite-utils.datasette.io/en/stable/cli.html | qbasic_forever wrote: | Was going to post the same thing, I suspect converting the | dataset to a SQLite db would be infinitely more fast and | productive than pecking away at it with pandas and such. | sgarland wrote: | At this size, I doubt it. While SQLite can read JSON if | compiled with support for it, it stores it as TEXT. The only | native indexing possible for that that I'm aware of is full- | text search, and I suspect the cardinality of JSON characters | would make that inefficient. Not to mention that the author | stated they didn't have enough memory to store the entire | file, so with a DB you'd be reading from disk. | | MySQL or Postgres with their native JSON datatypes _might_ be | faster, but you still have to load it in, and | storing/indexing it in either of those is [0] its own [1] | special nightmare full of footguns. | | Having done similar text manipulation and searches with giant | CSV files, parallel and xsv [2] is the way to go. | | [0]: https://dev.mysql.com/doc/refman/8.0/en/json.html | | [1]: https://www.postgresql.org/docs/current/datatype- | json.html | | [2]: https://github.com/BurntSushi/xsv | simonw wrote: | The trick sqlite-utils uses is to automatically create a | SQLite table schema that matches the shape of the JSON | object - specifically of you give it a JSON array of | objects. | | It doesn't handle streaming JSON out of the box though, so | you'd need to write some custom code on top of something | like ijson to avoid loading the entire JSON file into | memory first. | philwelch wrote: | SQLite is great for datasets that fit comfortably into | memory, but otherwise it starts to struggle. | simonw wrote: | That's not accurate: SQLite works great with data that's | too big to fit in memory provided you create indexes and | write queries that can take advantage of them - just like | any other relational database (MySQL/PostgreSQL etc). | fbdab103 wrote: | Come again? It is a database. The entire raison-d'etre is | for managing bigger-than-memory data. On the SQLite | forum[0], Dr. Hipp indicated a customer approached them | that they were using a ~40 TB SQLite database. Now, you | could argue it is less good at that scale, particularly | analytical workloads, but SQLite has an enormous amount of | headroom that you are unlikely to encounter. | | [0]: https://sqlite.org/forum/info/6e13e1911f1fe39c | hprotagonist wrote: | happily, i have multiple gigabytes of memory ... | philwelch wrote: | Sure, but a 40 GB SQLite database on a machine with 16 GB | of RAM is not gonna be happy | qbasic_forever wrote: | You're not going to do better with pandas or similar | tools. If it can't fit in memory, it's going to be | painful. 
SQLite is the least painful in my experience, | and it sets you up for working with the data in a proper | DB like postgres or similar for when you get fed up with | the memory constraints. | philwelch wrote: | I wouldn't use pandas in that situation either. | kosherhurricane wrote: | In the original article, they were trying to read a 10GB | file into 30GB RAM. | Groxx wrote: | tbh my usual strategy is to drop into a real programming language | and use whatever JSON stream parsing exists there, and dump the | contents into a half-parsed file that can be split with `split`. | Then you can use "normal" tools on one of those pieces for fast | iteration, and simply `cat * | ...` for the final slow run on all | the data. | | Go is quite good for this, as it's extremely permissive about | errors and structure, has very good performance, and comes with a | streaming parser in the standard library. It's pretty easy to be | finished after only a couple minutes, and you'll be bottlenecked | on I/O unless you did something truly horrific. | | And when jq isn't enough because you need to do joins or | something, shove it into SQLite. Add an index or three. It'll | massively outperform almost anything else unless you need rich | text content searches (and even then, a fulltext index might be | just as good), and it's plenty happy with a terabyte of data. | rvanlaar wrote: | Recently had 28GB json of IOT data with no guarantees on the data | structure inside. | | Used simdjson [1] together with python bindings [2]. Achieved | massive speedups for analyzing the data. Before it was in the | order of minutes, then it became fast enough to not leave my | desk. Reading from disk became the bottleneck, not cpu power and | memory. | | [1] https://github.com/simdjson/simdjson [2] | https://pysimdjson.tkte.ch/ | jmmv wrote: | Some random comments: | | * A few GBs of data isn't really that much. Even /considering/ | the use of cloud services just for this sounds crazy to me... but | I'm sure there are people out there that believe it's the only | way to do this (not the author, fortunately). | | * "You might find out that the data doesn't fit into RAM (which | it well might, JSON is a human-readable format after all)" -- if | I'm reading this right, the author is saying that the parsed data | takes _more_ space than the JSON version? JSON is a text format | and interning it into proper data structures is likely going to | take _less_ space, not more. | | * "When you're ~trial-and-error~iteratively building jq commands | as I do, you'll quickly grow tired of having to wait about a | minute for your command to succeed" -- well, change your workflow | then. When tackling new queries, it's usually a good idea to | reduce the data set. Operate on a few records until you have the | right query so that you can iterate as fast as possible. Only | once you are confident with the query, run it on the full data. | | * Importing the data into a SQLite database may be better overall | for exploration. Again, JSON is slow to operate on because it's | text. Pay the cost of parsing only once. | | * Or write a custom little program that streams data from the | JSON file without buffering it all in memory. JSON parsing | libraries are plentiful so this should not take a lot of code in | your favorite language. | vlovich123 wrote: | > JSON is a text format and interning it into proper data | structures is likely going to take _less_ space, not more. 
| | ~~Yeah but parsing it can require ~2x the RAM available and | push you into swap / make it not possible.~~ | | > Or write a custom little program that streams data from the | JSON file without buffering it all in memory. JSON parsing | libraries are plentiful so this should not take a lot of code | in your favorite language. | | What is the state of SAX JSON parsing? I used yajl a long time | ago but not sure if that's still the state of the art (and it's | C interface was not the easiest to work with). | | EDIT: Actually, I think the reason is that you typically will | have pointers (8 bytes) in place of 2 byte demarcations in the | text version (eg "", {}, [] become pointers). It's very hard to | avoid that (maybe impossible? Not sure) and no surprise that | Python has a problem with this. | saidinesh5 wrote: | There's libraries like ijson. At my last job, i wrote a | wrapper around this C++ library called rapidjson to parse | streaming json. It worked quite well. | | Here's the write up about it: | https://dinesh.cloud/2022/streaming-json-for-fun-and-profit/ | And here's the code: https://github.com/multiversal- | ventures/json-buffet | | The API isn't the best. I'd have preferred an iterator based | solution as opposed to this callback based one. But we worked | with what rapidjson gave us for the proof of concept. The | reason for this specific implementation was we wanted to | build an index to query the server directly about it's huge | json files (compressed size of 20+GB per file) using http | range queries. | bastawhiz wrote: | > JSON is a text format and interning it into proper data | structures is likely going to take _less_ space, not more. | | If you're parsing to structs, yes. Otherwise, no. Each object | key is going to be a short string, which is going to have some | amount of overhead. You're probably storing the objects as hash | tables, which will necessarily be larger than the two bytes | needed to represent them as text (and probably far more than | you expect, so they have enough free space for there to be | sufficiently few hash collisions). | | JSON numbers are also 64-bit floats, which will almost | universally take up more bytes per number than their serialized | format for most JSON data. | vlovich123 wrote: | I think even structs have this problem because typically you | heap allocate all the structs/arrays. You could try to arena | allocate contiguous objects in place, but that sounds hard | enough that I doubt that anyone bothers. Using a SAX parser | is almost certainly the tool you want to use. | matja wrote: | > JSON numbers are also 64-bit floats | | In common implementations they are, but RFC 8259 and ECMA-404 | do not specify the range, precision, or underlying | implementation for the storage of numbers in JSON. | | A implementation that guarantees interoperability between all | implementations of JSON would use an arbitrary-sized number | format, but they seldom do. | | No idea what ISO/IEC 21778:2017 says because it's not free. | bastawhiz wrote: | I mean, I suppose you could use a numeric format that takes | more than 64 bits, but you certainly couldn't use one that | uses fewer. You have no way to know whether a number | _should be_ an int or a float. And since JavaScript (the JS | in JSON) encodes with 64 bits, you 'd be losing precision | if you choose to use any less. 
The point is that an integer | written as one byte in the JSON _will_ take up more space | (at least seven bytes more) when decoded, all the way up to | numbers that take seven characters. | gpderetta wrote: | (subsets of )ASN.1, LEB128, and of course UTF-8 are all | examples of encoding binary integers with variable | length. | bastawhiz wrote: | Sure, and JSON is as well. 0 takes three fewer bytes to | encode than 1234. But if I parse JSON and my numbers come | back as a type backed by UTF-8, I'm going to be left | confused and angry. Nobody expects their JSON parser to | give them back numeric values that their CPU can't | natively perform operations on. | zigzag312 wrote: | > if I'm reading this right, the author is saying that the | parsed data takes _more_ space than the JSON version? JSON is a | text format and interning it into proper data structures is | likely going to take _less_ space, not more. | | Utf8 json strings will get converted to utf16 strings in some | languages, doubling the size of strings in memory compared to | the size on disk. | taeric wrote: | It still surprises me how many have the intuition that loading | the data will take more space than the file. | | Even more annoying when it is. (Compressed or binary formats | not withstanding.) | vlovich123 wrote: | Isn't that almost always true unless the format is | specifically designed to take the same amount on disk and in | RAM? If that's the case I think typically it means the file | on disk is larger than it needs to be. That's because you | have to convert the file to some kind of in-memory | representation that's easier to work with to manipulate it | and what not which requires structure that isn't necessary at | rest. | | Can you give an example where the file on disk is going to be | larger than what it is in memory? Provided you're not just | reading it and working with it as an opaque binary blob. | hnlmorg wrote: | XML would probably fall into that category. | | But I do agree with your point more generally speaking. | ot wrote: | > Or write a custom little program that streams data from the | JSON file without buffering it all in memory. JSON parsing | libraries are plentiful so this should not take a lot of code | in your favorite language. | | Several years ago I wrote a paper [1] on representing the parse | tree of a JSON document in a tiny fraction of the JSON size | itself, using succinct data structures. The representation | could be built with a single pass of the JSON, and basically | constant additional memory. | | The idea was to pre-process the JSON and then save the parse | tree, so it could be kept in memory over several passes of the | JSON data (which may not fit in memory), avoiding to re-do the | parsing work on each pass. | | I don't think I've seen this idea used anywhere, but I still | wonder if it could have applications :) | | [1] | http://groups.di.unipi.it/~ottavian/files/semi_index_cikm.pd... | saidinesh5 wrote: | > * "You might find out that the data doesn't fit into RAM | (which it well might, JSON is a human-readable format after | all)" -- if I'm reading this right, the author is saying that | the parsed data takes _more_ space than the JSON version? JSON | is a text format and interning it into proper data structures | is likely going to take _less_ space, not more. | | Not to mention, even when using bad data structures (eg. | hashmap of hashmaps..), One can just add a large enough | swapfile and brute force their way through it no? 
| [deleted] | the_duke wrote: | > JSON is a text format and interning it into proper data | structures is likely going to take _less_ space, not more. | | That depends a lot on the language and the json library. | | Lets take `{"ab": 22}` as an example. | | That's 10 bytes. | | In a language like Rust and using the serde library, this could | be deserialized directly into a struct with one integer, let's | pick a u32. So that would only be four bytes. | | But if it was deserialized to serdes dynamic Value type, this | would be : a HashMap<String, u32>, which has a constant size of | 48 bytes, plus an allocation of I don't know how much (first | allocation will cover more than one entry), plus 16 bytes | overhead for the string, plus 2 bytes for the actual string | contents, plus the 4 bytes for the u32. So that's already over | ~90 bytes, a lot more than the JSON. | | Dynamic languages like Python also have a lot of overhead for | all the objects. | | Keys can of course be interned, but not that many default JSON | parser libraries do that afaik. | btown wrote: | I feel like many of the same principles from V8 (the engine | for Chrome/Node.js) hidden classes would apply here. It's not | just that you're interning keys; you're actually creating a | set of optimized data classes that are transitioned to when | you encounter a key that wasn't seen before. In such a mode, | you're not even storing the keys, just a pointer to the type | information tree. Only when you have a sub-object whose keys | are truly unpredictable do you go into "dictionary mode (aka | 'slow mode')" - which I think says it all! | | https://v8.dev/docs/hidden-classes | DonHopkins wrote: | The key names of all the identically shaped objects are | repeated and can all be interned into sharing the same | strings, saving a huge amount of memory. That's why csv files | can be much more compact that the equivalent json files: they | only list the keys once in the first row. | mindslight wrote: | "ab" is 4 bytes, while a 64 bit pointer to an interned | string is 8 bytes. It would seem that the savings of CSV | would be better analogized to static typing - the type is | defined once, then each record only contains data. | | I had the same intuition as the original comment. But no, | the relative sizes of data formats aren't that | straightforward. One could intern symbols to say 16 bit | values, or one could infer structural types and compress | the data that way. But those are both creating additional | assumptions and processing that likely aren't done by | commonly available tools. | rootw0rm wrote: | i'm using bincode to serialize btrees and the files are | smaller than i expected...it's pretty cool. must be storing | my u32s as u16s or u8s sometimes. | pkaye wrote: | How about something like a Go interface which is essentially | two pointers and the data itself. | Groxx wrote: | That's the same thing they're describing, and equivalent to | what Go does for map[string]int64 (plus or minus minor | details about how strings are optimized). | | The map has its two slices (pointer + length + capacity = | 8*3 times two slices) and the string needs a separate | allocation somewhere because it's essentially a pointer to | a slice of bytes. All of which is true for almost all | reasonably efficiency-focused languages, Rust and Go | included - it's just how you make a compact hashmap. | Yoric wrote: | Does Go perform interning of strings? | Groxx wrote: | Compile-time strings, I think so? 
In that all identical | strings simply point to the same data. Otherwise no. | closeparen wrote: | I would read "proper data structure" charitably, as a struct | or record type, maybe a tuple. And if it must be a hashmap, | then at least using symbols and not strings for keys. | Nihilartikel wrote: | If you're doing interactive analysis, converting the json to | parquet is a great first step.. After that duckdb or spark are a | good way to go. I only fall back to spark if some aggregations | are too big to fit in RAM. Spark spills to disk and subdivides | the physical plans better in my experience.. | lmeyerov wrote: | Yep! | | We do the switch to parquet, and then as they say, use dask so | we can stick with python for interesting bits as SQL is | relatively anti-productive there | | Interestingly, most of the dask can actually be dask_cudf and | cudf nowadays: dask/pandas on a GPU, so can stay in the same | computer, no need for distributed, even if TBs etc of json | pradeepchhetri wrote: | Well if you need to convert json to parquet to do anything | fast, then what is the meaning ? You will end up wasting way | more resource in that conversion itself that your benefit is | all equalized in the cost of extra storage utilization (since | now you have json and parquet files both). The whole point is | to do fast operations in json itself. Try out | clickhouse/clickhouse-local. | closeparen wrote: | If you're doing interactive analysis, generally you're going | to have multiple queries, so it can be worthwhile to pay the | conversion cost once upfront. You don't necessarily retain | the JSON form, or at least not for as long. | [deleted] | thakoppno wrote: | Would sampling the JSON down to 20MB and running jq | experimentally until one has found an adequate solution be a | decent alternative approach? | | It depends on the dataset one supposes. | epalm wrote: | Yeah, I do this when querying sql databases. I limit the data | to some small/local range, iteratively work on the query, and | when I'm happy with the local results, I remove the filter and | get the big results. | ddulaney wrote: | I really like using line-delimited JSON [0] for stuff like this. | If you're looking at a multi-GB JSON file, it's often made of a | large number of individual objects (e.g. semi-structured JSON log | data or transaction records). | | If you can get to a point where each line is a reasonably-sized | JSON file, a lot of things gets way easier. jq will be streaming | by default. You can use traditional Unixy tools (grep, sed, etc.) | in the normal way because it's just lines of text. And you can | jump to any point in the file, skip forward to the next line | boundary, and know that you're not in the middle of a record. | | The company I work for added line-delimited JSON output to lots | of our internal tools, and working with anything else feels | painful now. It scales up really well -- I've been able to do | things like process full days of OPRA reporting data in a bash | script. | | [0]: https://jsonlines.org/ ___________________________________________________________________ (page generated 2023-03-18 23:00 UTC)