[HN Gopher] A SQLite extension for reading large files line-by-line
       ___________________________________________________________________
        
       A SQLite extension for reading large files line-by-line
        
       Author : polyrand
       Score  : 104 points
       Date   : 2022-07-30 16:45 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cube2222 wrote:
       | Hey!
       | 
       | OctoSQL[0] author here, this is really impressive! I like this
       | much more than the approach taken by other sqlite-based tools
       | which first load stuff into SQLite and then let you query it.
       | 
        | On the other hand, it does have the downside that it doesn't
        | automatically infer the schema of the input JSON, so you still
        | have to manually parse the raw lines. Maybe it would be possible
        | to surmount this by exposing a JSON-dedicated file-reading
        | function which also does schema inference (I'm not knowledgeable
        | about SQLite internals)?
       | 
       | One piece of feedback is with regard to the benchmarks: I think
       | it would be worth it to add additional benchmarks which work on
       | slightly more complex datasets than the one used here. I did a
       | comparison of this vs OctoSQL on the Brazil dataset, and - as
       | expected - sqlite-lines wiped the floor with it. However, then I
       | ran the following queries on a slightly more complex dataset (the
       | Amazon review dataset in this case, from SPyQL's benchmark
        | notebook[1]):
        | 
        |     ~> time OCTOSQL_NO_TELEMETRY=1 ./octosql "SELECT COUNT(*), AVG(overall)
        |        FROM books.json
        |        WHERE reviewerName = 'James'"
        |     +-------+-------------------+
        |     | count |    avg_overall    |
        |     +-------+-------------------+
        |     |  3010 | 4.402325581395349 |
        |     +-------+-------------------+
        |     real    0m49.805s
        |     user    0m32.169s
        |     sys     0m9.163s
        | 
        |     ~> time ./lines0-linux-amd64-sqlite3 :memory: "SELECT COUNT(*),
        |        AVG(json_extract(line, '$.overall'))
        |        FROM lines_read('books.json')
        |        WHERE json_extract(line, '$.reviewerName') = 'James'"
        |     3010|4.40232558139535
        |     real    1m47.933s
        |     user    1m27.024s
        |     sys     0m11.559s
       | 
       | and as you can see, the results go in a very different direction.
       | 
       | But anyhow, congrats on the project, and I'm pumped to see what
       | you come up with next!
       | 
       | [0]: https://github.com/cube2222/octosql
       | 
       | [1]:
       | https://github.com/dcmoura/spyql/blob/master/notebooks/json_...
        
         | alexgarcia-xyz wrote:
         | Hey thanks for sharing!
         | 
          | Re inferring the schema of input JSON: That would be slick!
          | Though SQLite does have some limitations here with table-valued
          | functions vs virtual tables. I won't go into the specifics, but
          | something like this isn't possible in SQLite:
          | 
          |     select name, age from lines_json_read('students.json');
          | 
          | The "name" and "age" dynamic columns aren't possible when using
          | the "table function" syntax, but something like this is
          | possible using traditional "virtual table" syntax:
          | 
          |     create virtual table students using
          |       lines_json_read(filename="students.json");
          | 
          |     select name, age from students;
          | 
          | It's a small difference, but definitely possible! Parsing JSON
          | in C is tricky, though; I'd definitely accept contributions
          | that figure it out.
         | 
          | And re benchmarks - thanks for sharing! Yeah, they're pretty
          | basic, so I'd love to add more complex ones. With the
          | books.json example, I think what's happening is that SQLite
          | re-parses the JSON in every json_extract call - so it parses
          | each row's line twice in that query. I also suspect that the
          | long strings in "reviewText" might slow things down, but I
          | can't be sure. Once I get some free time I'll add OctoSQL to
          | the benchmark suite, along with this new books dataset.
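          | 
          | One possible workaround, just a sketch (assuming SQLite 3.38+
          | for ->> and 3.35+ for materialized CTEs; I haven't benchmarked
          | it): pull both fields in a single json_extract call, so each
          | big line is parsed once and only the tiny result array gets
          | re-parsed:
          | 
          |     -- sketch: multi-path json_extract returns a JSON array
          |     -- of both values, so each line is parsed only once
          |     with parsed as materialized (
          |       select json_extract(line, '$.overall', '$.reviewerName') as pair
          |       from lines_read('books.json')
          |     )
          |     select count(*), avg(pair ->> '$[0]')
          |     from parsed
          |     where pair ->> '$[1]' = 'James';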
        
           | tstack wrote:
            | Another option would be to create the virtual table used for
            | the table-valued function using a schema to define the
            | columns:
            | 
            |     CREATE VIRTUAL TABLE students_json_read USING
            |       lines_json_read(schema="students-schema.json");
            | 
            | Then, that one table could be used for multiple files:
            | 
            |     SELECT * FROM students_json_read('school1-students.json');
            |     SELECT * FROM students_json_read('school2-students.json');
        
       | simonw wrote:
        | I love this example from the docs:
        | 
        |     select
        |       line -> '$.id' as id,
        |       line -> '$.name' as name
        |     from lines_read("my-file.ndjson");
       | 
       | This is using the new -> JSON operator which was added to SQLite
       | a few months ago.
       | 
       | The lines_read() thing there is a table-valued function, which
       | means it returns a virtual table that you can query. This is a
       | streaming operation which means it's safe to run it against a
       | 100GB+ file without worrying about it sucking all of the data
       | into memory at once.
       | 
        | Where this gets really useful is when you combine it with a
        | create table statement:
        | 
        |     create table names as
        |     select
        |       line -> '$.id' as id,
        |       line -> '$.name' as name
        |     from lines_read("my-file.ndjson");
       | 
       | This should efficiently create a table with that exact subset of
       | the data pulled from the newline-delimited JSON file.
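        | 
        | One nuance worth noting (standard SQLite 3.38+ behavior, not
        | specific to sqlite-lines): -> returns its result as JSON, so
        | extracted strings keep their double quotes, while ->> returns a
        | plain SQL value. For clean text columns you'd likely want ->>:
        | 
        |     -- sketch: ->> unwraps JSON strings into plain SQL text
        |     create table names as
        |     select
        |       line ->> '$.id' as id,
        |       line ->> '$.name' as name
        |     from lines_read("my-file.ndjson");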
        
       | jokoon wrote:
        | I've read that SpatiaLite is not well maintained - is that true?
        
       | bicijay wrote:
        | That's interesting. Is there any way to add "custom delimiter
        | templates" (or alternative tools)? I have some files where each
        | line contains different kinds of information, but those are
        | separated by their index on the row - so, for example, customer
        | name runs from character index 3 to character index 10.
        
         | alexgarcia-xyz wrote:
          | Hmm, so you have lines that have fields in fixed indexed
          | positions? Maybe something like this:
          | 
          |     $ cat hn.txt
          |     11alex    a
          |     12brian   b
          |     12craig   c
          | 
          |     $ cat test.sql
          |     .header on
          |     .mode box
          |     .load dist/lines0
          |     select
          |       substr(line, 1, 2) as id,
          |       substr(line, 3, 7) as name,
          |       substr(line, 11, 2) as age
          |     from lines_read('hn.txt');
          | 
          |     $ sqlite3x :memory: '.read test.sql'
          |     +----+---------+-----+
          |     | id |  name   | age |
          |     +----+---------+-----+
          |     | 11 | alex    | a   |
          |     | 12 | brian   | b   |
          |     | 12 | craig   | c   |
          |     +----+---------+-----+
          | 
          | Unfortunately SQLite's substr function is a bit awkward, but
          | if each field has constant indices it may work.
        
         | wodenokoto wrote:
          | These, I believe, are what are called fixed-width files.
          | 
          | If you use pandas, there is a reader for those: `pd.read_fwf()`
          | 
          | I'm sure there are many more languages that support this; it's
          | just the only one where I've had a need.
        
           | bicijay wrote:
            | Exactly that - all banks here in Brazil export transactions
            | with this kind of format (it's called a CNAB file over here,
            | and they are probably using COBOL systems). You have
            | different kinds of CNAB files: CNAB 200, where you have
            | lines with 200 characters, CNAB 400...
            | 
            | Inside those files you have multiple chunks, and every line
            | contains information about a transaction. So if you want to
            | get the transaction amount, you would look for all the
            | characters between character 10 and character 20, for
            | example. (Those positions are fixed and follow the bank's
            | template.)
            | 
            | We had success parsing it with Python/Node, but I think
            | spinning up a SQLite instance, reading straight from the
            | file, and doing some calculations would be much faster, as
            | in the sketch below.
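            | 
            | Something like this might be a starting point (a sketch: the
            | positions and record marker are made up, not a real CNAB
            | layout):
            | 
            |     -- sketch with hypothetical positions: amount in
            |     -- characters 10-20 as digits with two implied decimals
            |     select sum(cast(substr(line, 10, 11) as integer)) / 100.0
            |       as total_amount
            |     from lines_read('cnab400.txt')
            |     where substr(line, 1, 1) = '1';  -- detail records only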
        
       | qbasic_forever wrote:
        | Very cool looking! I love using SQL for ad-hoc data analysis; my
        | brain just groks the language so much more easily than other
        | data-querying tools like Python + pandas.
        
       | randyrand wrote:
        | Nice simple & reliable approach to a hard perf problem.
        
       | freecodyx wrote:
        | I like the fact that the author extensively documents the source
        | code, following the SQLite spirit.
        
       | alexgarcia-xyz wrote:
        | Hey, author here, happy to answer any questions! Also check out
        | this notebook for a deeper dive into sqlite-lines, along with a
        | slick WASM demonstration and more thoughts on the codebase itself:
        | https://observablehq.com/@asg017/introducing-sqlite-lines
        | 
        | I really dig SQLite, and I believe SQLite extensions will push it
        | to another level. I rarely reach for Pandas or other
        | "traditional" tools and query languages, and instead opt for
        | plain ol' SQLite and other extensions. As a shameless plug, I
        | recently started a blog series on SQLite and related tools and
        | extensions if you want to learn more! Next week I'll be
        | publishing more SQLite extensions for parsing HTML + making HTTP
        | requests: https://observablehq.com/@asg017/a-new-sqlite-blog-series
       | 
       | A few other SQLite extensions:
       | 
       | - xlite, for reading Excel files, in Rust
       | https://github.com/x2bool/xlite
       | 
       | - sqlean, several small SQLite extensions in C
       | https://github.com/nalgeon/sqlean
       | 
        | - mergestat, several SQLite extensions for developers (mainly
        | GitHub's API) in Go https://github.com/mergestat/mergestat
        
         | simonw wrote:
         | I'm so interested to hear about the HTTP requests stuff.
         | 
         | I've played a tiny bit with SQLite Python extensions that
         | perform HTTP requests, but I'm very unsure of the implications
         | of this kind of thing.
         | 
         | What happens to the rest of the SQLite query engine if your
         | custom function hits a page like https://httpbin.org/delay/10
         | with a deliberate 10 second delay before the request returns?
         | 
         | Maybe this is fine - your transaction takes 10 seconds longer
         | but everything else continues to work as normal. I've not done
         | the work to figure that out.
        
           | alexgarcia-xyz wrote:
            | Yup, if it hits https://httpbin.org/delay/10 then the entire
            | script will delay for 10 seconds (unless you have a custom,
            | lower timeout setting). Which means you can only make HTTP
            | requests serially, which sucks, but it's much simpler than
            | coordinating with other tools.
           | 
           | It'll come out in a week, but lmk if you want to play with it
           | beforehand!
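            | 
            | To illustrate the serial behavior (a sketch; http_get here
            | is a hypothetical scalar function, not necessarily the
            | final API):
            | 
            |     -- sketch: each row's request runs one after another
            |     -- inside the query, so 3 slow URLs = 3x the delay
            |     select line as url, http_get(line) as body
            |     from lines_read('urls.txt')
            |     limit 3;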
        
             | simonw wrote:
             | How will it affect other SQLite connections against the
             | same database?
             | 
             | My guess is they'll be OK, but presumably any transactions
             | will lock tables for the duration?
        
               | alexgarcia-xyz wrote:
                | I think you're right, though I'm not an expert with
                | multiple connections. I think if the DB is in WAL mode
                | and you're only issuing read-only queries, then there
                | will be no lock, and long HTTP requests won't lock
                | anything up. But if you issue a write that locks tables,
                | then other writes will be blocked for the duration of
                | the long request.
                | 
                | But if you're in rollback-journal mode then there's not
                | much you can do. Please correct me if I'm wrong!
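                | 
                | For reference, switching a database to WAL mode is just
                | a pragma (standard SQLite, nothing extension-specific):
                | 
                |     -- WAL mode: readers keep working while one writer
                |     -- holds the write lock
                |     pragma journal_mode = wal;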
        
         | simonw wrote:
          | How difficult do you think it would be to compile SpatiaLite
          | (a big hairy C codebase) to WASM and load it as a dynamic
          | module, like you've done with sqlite-lines?
        
           | alexgarcia-xyz wrote:
           | This has been done by other people already!
           | 
           | - https://github.com/jvail/spl.js
           | 
           | - https://observablehq.com/@visionscarto/hello-spl-js
        
             | simonw wrote:
             | Oh wow! I wonder how hard it would be to load that module
             | into https://github.com/simonw/datasette-lite
             | 
              | Hah, turns out I have an issue for that already, which I'd
              | forgotten about:
              | https://github.com/simonw/datasette-lite/issues/30
        
       ___________________________________________________________________
       (page generated 2022-07-30 23:00 UTC)