[HN Gopher] Dsq: Commandline tool for running SQL queries agains...
       ___________________________________________________________________
        
       Dsq: Commandline tool for running SQL queries against JSON, CSV,
       Parquet, etc.
        
       Author : eatonphil
       Score  : 233 points
       Date   : 2022-01-11 15:20 UTC (7 hours ago)
        
 (HTM) web link (datastation.multiprocess.io)
 (TXT) w3m dump (datastation.multiprocess.io)
        
       | houqp wrote:
       | Neat! I have also built a similar project in Rust
       | https://github.com/roapi/roapi/tree/main/columnq-cli :)
        
         | eatonphil wrote:
         | Nice project!
        
       | [deleted]
        
       | nonbirithm wrote:
       | They really ought to be a standard, language-agnostic way for
       | querying nested config file formats with identifiers by now. For
       | all the flak XML gets, at least it has XPath. YAML has YAML Path,
       | but it's only supported by a single Python lib.
        
         | emmanueloga_ wrote:
         | Augeas comes to mind [1], although it seems like a pretty niche
         | project (since you mention a "standard" ...).
         | 
         | 1: http://augeas.net
        
       | newman314 wrote:
       | dsq references a benchmark done by q
       | (https://github.com/harelba/q/blob/master/test/BENCHMARK.md) that
       | indicates that octosql is significantly slower.
       | 
       | However, octosql's GH repo claims otherwise.
       | 
       | Does anyone have any real world experience that they can share on
       | these tools?
        
         | eatonphil wrote:
         | Yeah frankly the q benchmark isn't the best even though dsq
         | compares not terribly in it. It isn't well documented and
         | exercises a very limited amount of functionality and isn't very
         | rigorous from what I can see. That said, the caching q does is
         | likely very solid (and not something dsq does).
         | 
         | The biggest risk (both in terms of SQL compliance and
         | performance) I think with octosql (and cube2222 is here
         | somewhere to disagree with me if I'm wrong) is that they have
         | their own entire SQL engine whereas textql, q and dsq use
         | SQLite. But q is also in Python whereas textql, octosql, and
         | dsq are in Go.
         | 
         | In the next few weeks I'll be posting some benchmarks that I
         | hope are a little fairer (or at least well-documented and
         | reproducible). Though of course it would be appropriate to have
         | independent benchmarks too since I now have a dog in the fight.
         | 
         | On a tangent, once the go-duckdb binding [0] matures I'd love
         | to offer duckdb as an alternative engine flag within dsq (and
         | DataStation). Would be neat to see.
         | 
         | [0] https://github.com/marcboeker/go-duckdb
        
           | em500 wrote:
           | Another interesting comparison would be clickhouse-local:
           | 
           | https://altinity.com/blog/2019/6/11/clickhouse-local-the-
           | pow...
        
             | eatonphil wrote:
             | Awesome post, thanks for the link. I had no clue Clickhouse
             | could do this. Shows off some more CLI tools in there too.
             | 
             | The big issue with ClickHouse is the incredibly non-
             | standard SQL dialect and the random bugs that remain. It's
             | an amazing project for analytics but you definitely have to
             | be willing to hack around its SQL language (I say this as a
             | massive fan of ClickHouse).
             | 
             | I wonder: does this mean I can embed ClickHouse in
             | arbitrary software as a library? I'd be curious to provide
             | that as an option in dsq.
        
               | em500 wrote:
               | > I wonder: does this mean I can embed ClickHouse in
               | arbitrary software as a library? I'd be curious to
               | provide that as an option in dsq.
               | 
               | Not sure, but it's Apache licensed, so you likely can
               | make it work if you want to. But realize that
               | clickhouse(-local) is much heavier than sqlite / duckdb
               | based solutions: the compiled binary is around 200mb
               | iirc.
        
               | eatonphil wrote:
               | Ah yeah that's pretty large.
        
           | cube2222 wrote:
           | Hey there! OctoSQL author here.
           | 
           | I've released OctoSQL v0.4.0 recently[0] which is a grand
           | rewrite and is 2-3 orders of magnitude faster than OctoSQL
           | was before. It's also much more robust overall with static
           | typing and the plugin system for easy extension. The q
           | benchmarks are over a year old and haven't been updated to
           | reflect that yet.
           | 
           | Take a look at the README[1] for more details.
           | 
           | My benchmarks should be reproducible, you can find the script
           | in the benchmarks/ repo directory.
           | 
           | Btw if we're already talking @eatonphil I'd appreciate you
           | updating your benchmarks to reflect these changes.
           | 
           | As far as the custom query engine goes - yes, there are both
           | pros and cons to that. In the case of OctoSQL I'm building
           | something that's much more dynamic - a full-blown dataflow
           | engine - and can subscribe to multiple datasources to
           | dynamically update queries as source data changes. This also
           | means it can support streaming datasources. That is not
           | possible with the other systems. It also means I don't have
           | to load _everything_ into a SQLite database before querying -
           | I can optimize which columns I need to even read.
           | 
           | OctoSQL also let's you work with actual databases in your
           | queries - like PostgreSQL or MySQL - and pushes predicates
           | down to them, so it doesn't have to dump your whole database
           | tables. That's useful if you need to do cross-db joins, or
           | JSON-file-with-database joins.
           | 
           | As far as SQL compliance goes it gets hairy in the details -
           | as usual. The overall dialect is based on MySQL as I'm using
           | a parser based on vitess's one, but I don't support some
           | syntax, and add original syntax too (type assertions,
           | temporal extensions, object and type notation).
           | 
           | Stay tuned for a lot more supported datasources, as the
           | plugin system lets me work on that much quicker.
           | 
           | [0]:https://github.com/cube2222/octosql/releases/tag/v0.4.0
           | 
           | [1]:https://github.com/cube2222/octosql#readme
        
             | eatonphil wrote:
             | Neat! I'll give your script a shot.
        
       | eatonphil wrote:
       | I shared the first version of this in a Show HN last month [0].
       | The big update since then is that it now supports loading
       | multiple files and doing SQL joins on them. You can see examples
       | of that in this post.
       | 
       | The repo is here [1] and the README has a comparison against some
       | of the other great tools in the space like q, octosql, and
       | textql.
       | 
       | [0] https://news.ycombinator.com/item?id=29643835
       | 
       | [1] https://github.com/multiprocessio/dsq
        
         | data_ders wrote:
         | way cool Phil! I've been struggling to grok the value prop of
         | data station for a while now, but dsq crystallizes it for me!
        
           | eatonphil wrote:
           | Interesting! and thanks! It has definitely not been
           | straightforward to explain. Mostly my fault not working on
           | tutorials, docs and videos. Thankfully DataStation
           | performance, features and install-time are now in a pretty
           | good spot so I am going to be focusing more on the education
           | side.
        
       | avmich wrote:
       | LINQ went the other way - a language for queries against (among
       | other things) relational databases, different than SQL.
       | 
       | Wonder how, say, jq would look like working with tables in RDBMS.
        
       | rahimiali wrote:
       | If you're curious how they've adapted SQL to query JSON, here is
       | the trick: The JSON "Must be an array of objects. Nested object
       | fields are ignored".
        
       | chrisweekly wrote:
       | See also "lnav" -- https://lnav.org -- for an awesome CLI
       | powertool with embeded SQLite.
        
         | cschneid wrote:
         | Thank you for this, it's a tool I didn't realize I was missing.
        
       | da39a3ee wrote:
       | I really like this idea. jq is great too, but the jq language
       | just seems to be too different from anything else for me, so it
       | only works if I use it every day, which I don't. But I guess an
       | issue is that I'll still need to use jq in order to get my data
       | into the suitable array-of-objects format for dsq?
        
         | eatonphil wrote:
         | For now yeah. Eventually I'd like to support pre transforms or
         | something within dsq itself.
        
       | wolfi1 wrote:
       | just for clarification: one needs datastation in order for dsq to
       | run? so dsq is not self-contained? (I'm not familiar with the Go
       | eco system)
        
         | eatonphil wrote:
         | Good question! No you don't need to install DataStation to run
         | dsq. dsq just imports Go libraries from DataStation during
         | compile-time.
         | 
         | dsq is a totally standalone binary.
        
       | thriftwy wrote:
       | I dream of a database which would use CSV (with metadata/indices
       | in CSV comment lines) as its storage. You can even use commented-
       | out padding to fit data to blocks, etc.
       | 
       | Imagine when you don't have to dump/convert data because you can
       | always open it with OpenOffice.
        
         | bachmeier wrote:
         | What I've done in the past is:
         | 
         | Column 1 is the creation timestamp
         | 
         | Column 2 is the modification timestamp
         | 
         | On read, you update if the creation timestamp exists but the
         | modification timestamp is later, otherwise you insert. Your
         | app(s) can do a simple file append for writes. You even have
         | the full version history at every point in time.
         | 
         | I did a lot of looking but didn't find a command-line tool that
         | automated this process. It works fine for small projects of
         | e.g. 100,000 records. Wouldn't work well for things like a
         | notes app, because you'd be storing every modification as a new
         | entry.
        
         | remcinerney wrote:
         | That's interesting. I'm trying to imagine your workflow, and
         | thinking about what serverless SQL platforms like Amazon Athena
         | let you do now - i.e., you can more or less dump CSV files in
         | blob storage and query them. Is that what you meant?
        
         | LittlePeter wrote:
         | In PostgreSQL you can use file_fdw extension:
         | 
         | https://www.postgresql.org/docs/current/file-fdw.html
        
           | bachmeier wrote:
           | I think the parent comment was about reading and writing csv
           | files. The documentation you linked says "Access to data
           | files is currently read-only."
        
             | LittlePeter wrote:
             | I missed that. You are right.
        
         | mr_toad wrote:
         | PrestoDB. Or one of the many SaaS offerings like Athena,
         | BigQuery or USQL.
        
         | cerved wrote:
         | I'm not sure I share that dream..
         | 
         | CSV is so horribly non-standardized and horrible to parse. JSON
         | appears a much more suitable candidate
        
         | sundarurfriend wrote:
         | recfiles [1] are a sort of poor man's plaintext database that
         | you can edit with any text editor. I found manually entering
         | the data mildly annoying and repetitive, but visidata [2]
         | supports the format, so I've been meaning to learn to use that
         | for easier data entry and for tabular visualization.
         | 
         | [1]
         | https://www.gnu.org/software/recutils/manual/recutils.html#I...
         | [2] https://www.visidata.org/
         | 
         | Bonus: A nice concise intro that someone wrote last week:
         | https://chrismanbrown.gitlab.io/28.html
        
         | laumars wrote:
         | AWS Athena can do this. Dump your CSVs in S3 and query them in
         | Athena.
         | 
         | Personally I use sqlite for smaller datasets and wrap that
         | around a CSV importer.
        
           | thriftwy wrote:
           | Why dump CSVs when you can outright store in them?
        
             | laumars wrote:
             | Sorry, I don't understand your question. But just in case
             | this answers it:
             | 
             | S3 is cloud storage on AWS. Athena can work directly off
             | the CSVs stored on S3.
             | 
             | Where I said "dump" it was just a colourful way of saying
             | "transfer your files to...". I appreciate "dump" can also
             | mean different things with databases so maybe that wasn't
             | the best choice of word in my part. Sorry for any confusion
             | there.
        
       ___________________________________________________________________
       (page generated 2022-01-11 23:00 UTC)