[HN Gopher] A SQLite extension for reading large files line-by-line
___________________________________________________________________

A SQLite extension for reading large files line-by-line

Author : polyrand
Score  : 104 points
Date   : 2022-07-30 16:45 UTC (6 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| cube2222 wrote:
| Hey!
|
| OctoSQL[0] author here, this is really impressive! I like this
| much more than the approach taken by other sqlite-based tools,
| which first load stuff into SQLite and then let you query it.
|
| On the other hand, it does have a con: it doesn't automatically
| infer the schema of the input JSON, so you still have to
| manually parse the raw lines. Maybe it would be possible to
| surmount this by exposing a JSON-dedicated file-reading
| function which also does schema inference (I'm not
| knowledgeable about SQLite internals)?
|
| One piece of feedback is with regard to the benchmarks: I think
| it would be worth it to add additional benchmarks which work on
| slightly more complex datasets than the one used here. I did a
| comparison of this vs OctoSQL on the Brazil dataset, and - as
| expected - sqlite-lines wiped the floor with it.
| However, then I ran the following queries on a slightly more
| complex dataset (the Amazon review dataset in this case, from
| SPyQL's benchmark notebook[1]):
|
|     ~> time OCTOSQL_NO_TELEMETRY=1 ./octosql "SELECT COUNT(*),
|        AVG(overall) FROM books.json
|        WHERE reviewerName = 'James'"
|     +-------+-------------------+
|     | count |    avg_overall    |
|     +-------+-------------------+
|     | 3010  | 4.402325581395349 |
|     +-------+-------------------+
|
|     real    0m49.805s
|     user    0m32.169s
|     sys     0m9.163s
|
|     ~> time ./lines0-linux-amd64-sqlite3 :memory: "SELECT COUNT(*),
|        AVG(json_extract(line, '$.overall'))
|        FROM lines_read('books.json')
|        WHERE json_extract(line, '$.reviewerName') = 'James'"
|     3010|4.40232558139535
|
|     real    1m47.933s
|     user    1m27.024s
|     sys     0m11.559s
|
| and as you can see, the results go in a very different
| direction.
|
| But anyhow, congrats on the project, and I'm pumped to see what
| you come up with next!
|
| [0]: https://github.com/cube2222/octosql
|
| [1]: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...
| alexgarcia-xyz wrote:
| Hey thanks for sharing!
|
| Re inferring the schema of input JSON: that would be slick!
| Though SQLite does have some limitations here with table-valued
| functions vs virtual tables. I won't go into the specifics, but
| something like this isn't possible in SQLite:
|
|     select name, age from lines_json_read('students.json');
|
| The "name" and "age" dynamic columns aren't possible when using
| the "table function" syntax, but something like this is
| possible using traditional "virtual table" syntax:
|
|     create virtual table students using
|       lines_json_read(filename="students.json");
|     select name, age from students;
|
| It's a small difference, but definitely possible! Parsing JSON
| in C is tricky, though, but I would definitely accept
| contributions that figure it out.
|
| And re benchmarks - thanks for sharing! Yeah, they're pretty
| basic, so I would love to add more complex ones.
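[Editor's note: the benchmark query above can be sketched in plain Python to show what a streaming, line-by-line NDJSON scan looks like. This is an illustrative sketch, not either tool's implementation; the field names follow the thread's books.json example, and a tiny in-memory sample stands in for the real multi-GB dataset.]

```python
import io
import json

def stream_aggregate(lines, name):
    """Stream NDJSON line-by-line (as lines_read does) and compute
    COUNT(*) and AVG(overall) for reviews by the given reviewer,
    without loading the whole file into memory."""
    count, total = 0, 0.0
    for line in lines:
        row = json.loads(line)  # each line is parsed exactly once
        if row.get("reviewerName") == name:
            count += 1
            total += row["overall"]
    return count, (total / count if count else None)

# Tiny stand-in for books.json (the real dataset is GBs of NDJSON;
# in practice you would pass an open file object instead).
sample = io.StringIO(
    '{"reviewerName": "James", "overall": 5.0}\n'
    '{"reviewerName": "Ana", "overall": 3.0}\n'
    '{"reviewerName": "James", "overall": 4.0}\n'
)
print(stream_aggregate(sample, "James"))  # -> (2, 4.5)
```

Because the generator consumes one line at a time, memory stays constant regardless of file size - the same property that makes lines_read safe on 100GB+ files.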
| With the books.json example, I think what's happening is that
| with SQLite's JSON functions, the JSON is parsed anew in each
| json_extract call - so it parses twice for each row in that
| query. I also suspect that the long strings in "reviewText"
| might slow things down, but I can't be sure. Once I get some
| free time I'll add OctoSQL to the benchmark suite, along with
| this new books dataset.
| tstack wrote:
| Another option would be to create the virtual table used for
| the table-valued function using a schema to define the
| columns:
|
|     CREATE VIRTUAL TABLE students_json_read
|       USING lines_json_read(schema="students-schema.json");
|
| Then, that one table could be used for multiple files:
|
|     SELECT * FROM students_json_read('school1-students.json');
|     SELECT * FROM students_json_read('school2-students.json');
| simonw wrote:
| I love this example from the docs:
|
|     select
|       line -> '$.id' as id,
|       line -> '$.name' as name
|     from lines_read("my-file.ndjson");
|
| This is using the new -> JSON operator, which was added to
| SQLite a few months ago.
|
| The lines_read() thing there is a table-valued function, which
| means it returns a virtual table that you can query. This is a
| streaming operation, which means it's safe to run it against a
| 100GB+ file without worrying about it sucking all of the data
| into memory at once.
|
| Where this gets really useful is when you combine it with a
| create table statement:
|
|     create table names as
|       select
|         line -> '$.id' as id,
|         line -> '$.name' as name
|       from lines_read("my-file.ndjson");
|
| This should efficiently create a table with that exact subset
| of the data pulled from the newline-delimited JSON file.
| jokoon wrote:
| I've read that spatialite is not well maintained, is that true?
| bicijay wrote:
| That's interesting - is there any way to add "custom delimiter
| templates" (or alternative tools)?
| I have some files where each line contains different kinds of
| information, but the fields are delimited by their index within
| the row - so, for example, the customer name runs from letter
| index 3 to letter index 10.
| alexgarcia-xyz wrote:
| Hmm, so you have lines that have fields in fixed indexed
| positions? Maybe something like this:
|
|     $ cat hn.txt
|     11alex    a
|     12brian   b
|     12craig   c
|
|     $ cat test.sql
|     .header on
|     .mode box
|     .load dist/lines0
|     select
|       substr(line, 1, 2) as id,
|       substr(line, 3, 7) as name,
|       substr(line, 11, 2) as age
|     from lines_read('hn.txt')
|
|     $ sqlite3x :memory: '.read test.sql'
|     +----+---------+-----+
|     | id |  name   | age |
|     +----+---------+-----+
|     | 11 | alex    | a   |
|     | 12 | brian   | b   |
|     | 12 | craig   | c   |
|     +----+---------+-----+
|
| Unfortunately SQLite's substr function is a bit awkward, but if
| each field has constant indices it may work.
| wodenokoto wrote:
| These, I believe, are what are called fixed-width files.
|
| If you use pandas, there is a reader for those: pd.read_fwf()
|
| I'm sure there are many more languages that support this; it's
| just the only one where I had a need.
| bicijay wrote:
| Exactly that - all banks here in Brazil are exporting
| transactions with this kind of format (it's called a CNAB file
| over here, and they are probably using COBOL systems). You have
| different kinds of CNAB files: CNAB 200, where you have lines
| with 200 characters, CNAB 400...
|
| Inside those files you have multiple chunks, and every line
| contains information about a transaction. So if you want to get
| the transaction amount, you would look at all the characters
| between character n10 and n20, for example. (Those positions
| are fixed and follow the bank's template.)
|
| We had success parsing it with python/node, but I think
| spinning up a SQLite instance, reading straight from the file
| and doing some calculations would be much faster.
| qbasic_forever wrote:
| Very cool looking!
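[Editor's note: the fixed-width slicing discussed above can be reproduced with Python's stdlib sqlite3 module, staging raw lines in a one-column table as a stand-in for lines_read (which needs the extension loaded). The sample rows and offsets follow the hn.txt example; real CNAB layouts vary by bank.]

```python
import sqlite3

# Stand-in for lines_read(): stage raw fixed-width lines in a
# one-column table, then slice fields out with substr().
# Offsets mirror the hn.txt example: id = chars 1-2,
# name = chars 3-9, age = char 11.
db = sqlite3.connect(":memory:")
db.execute("create table raw(line text)")
db.executemany(
    "insert into raw values (?)",
    [("11alex    a",), ("12brian   b",), ("12craig   c",)],
)

rows = db.execute(
    """
    select
      substr(line, 1, 2)        as id,
      trim(substr(line, 3, 7))  as name,
      trim(substr(line, 11, 2)) as age
    from raw
    """
).fetchall()
print(rows)
# -> [('11', 'alex', 'a'), ('12', 'brian', 'b'), ('12', 'craig', 'c')]
```

Note that SQLite's substr() is 1-indexed and takes (start, length), so a field spec like "characters 10 through 20" becomes substr(line, 10, 11).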
| I love using SQL for ad-hoc data analysis; my brain just groks
| the language so much more easily than other data query
| languages like python + pandas.
| randyrand wrote:
| Nice simple & reliable approach to a hard perf problem.
| freecodyx wrote:
| I like the fact that the author is extensively documenting the
| source code, following the sqlite spirit.
| alexgarcia-xyz wrote:
| Hey, author here, happy to answer any questions! Also check out
| this notebook for a deeper dive into sqlite-lines, along with a
| slick WASM demonstration and more thoughts on the codebase
| itself: https://observablehq.com/@asg017/introducing-sqlite-lines
|
| I really dig SQLite, and I believe SQLite extensions will push
| it to another level. I rarely reach for Pandas or other
| "traditional" tools and query languages, and instead opt for
| plain ol' SQLite and other extensions. As a shameless plug, I
| recently started a blog series on SQLite and related tools and
| extensions if you want to learn more! Next week I'll be
| publishing more SQLite extensions for parsing HTML + making
| HTTP requests:
| https://observablehq.com/@asg017/a-new-sqlite-blog-series
|
| A few other SQLite extensions:
|
| - xlite, for reading Excel files, in Rust:
|   https://github.com/x2bool/xlite
|
| - sqlean, several small SQLite extensions in C:
|   https://github.com/nalgeon/sqlean
|
| - mergestat, several SQLite extensions for developers (mainly
|   Github's API) in Go: https://github.com/mergestat/mergestat
| simonw wrote:
| I'm so interested to hear about the HTTP requests stuff.
|
| I've played a tiny bit with SQLite Python extensions that
| perform HTTP requests, but I'm very unsure of the implications
| of this kind of thing.
|
| What happens to the rest of the SQLite query engine if your
| custom function hits a page like https://httpbin.org/delay/10,
| with a deliberate 10 second delay before the request returns?
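[Editor's note: the blocking behaviour being asked about can be demonstrated with a Python scalar UDF standing in for the extension's HTTP function. This is a hypothetical sketch - the extension under discussion is C, not a Python UDF, and slow_fetch/http_get are made-up names - but the serial, per-row execution model is the same.]

```python
import sqlite3
import time

# A scalar UDF that simulates a slow HTTP fetch: SQLite calls it
# once per row, serially, so the whole query blocks for
# roughly rows * delay.
def slow_fetch(url):
    time.sleep(0.05)  # pretend this is a network round-trip
    return f"response from {url}"

db = sqlite3.connect(":memory:")
db.create_function("http_get", 1, slow_fetch)
db.execute("create table urls(u text)")
db.executemany(
    "insert into urls values (?)",
    [("https://example.com/a",), ("https://example.com/b",)],
)

start = time.monotonic()
rows = db.execute("select http_get(u) from urls").fetchall()
elapsed = time.monotonic() - start

print(len(rows))       # 2 rows fetched
print(elapsed >= 0.1)  # -> True: two serial 50 ms "requests"
```

Nothing else on this connection can proceed until the query finishes, which matches the "entire script delays" answer below; other connections are a separate question of locking mode.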
|
| Maybe this is fine - your transaction takes 10 seconds longer
| but everything else continues to work as normal. I've not done
| the work to figure that out.
| alexgarcia-xyz wrote:
| Yup, if it hits https://httpbin.org/delay/10 then the entire
| script will delay for 10 seconds (unless you have a custom,
| lower timeout setting). Which means you can only make HTTP
| requests serially, which sucks, but it's much simpler than
| coordinating with other tools.
|
| It'll come out in a week, but let me know if you want to play
| with it beforehand!
| simonw wrote:
| How will it affect other SQLite connections against the same
| database?
|
| My guess is they'll be OK, but presumably any transactions will
| lock tables for the duration?
| alexgarcia-xyz wrote:
| I think you're right, though I'm not an expert with multiple
| connections. I think if the DB is in WAL mode and you're only
| issuing read-only requests, then there will be no lock, and
| long HTTP requests won't lock anything up. But if you issue a
| write request that locks tables, then other write requests will
| be blocked during long requests.
|
| But if you're in journal mode then there's not much you can do.
| Please correct me if I'm wrong!
| simonw wrote:
| How difficult do you think it would be to compile SpatiaLite (a
| big hairy C codebase) to WASM and load it as a dynamic module,
| like you've done with sqlite-lines?
| alexgarcia-xyz wrote:
| This has been done by other people already!
|
| - https://github.com/jvail/spl.js
|
| - https://observablehq.com/@visionscarto/hello-spl-js
| simonw wrote:
| Oh wow! I wonder how hard it would be to load that module into
| https://github.com/simonw/datasette-lite
|
| Hah, turns out I have an issue for that already, which I'd
| forgotten about:
| https://github.com/simonw/datasette-lite/issues/30
___________________________________________________________________
(page generated 2022-07-30 23:00 UTC)