[HN Gopher] Dsq: Commandline tool for running SQL queries agains... ___________________________________________________________________ Dsq: Commandline tool for running SQL queries against JSON, CSV, Parquet, etc. Author : eatonphil Score : 233 points Date : 2022-01-11 15:20 UTC (7 hours ago) (HTM) web link (datastation.multiprocess.io) (TXT) w3m dump (datastation.multiprocess.io) | houqp wrote: | Neat! I have also built a similar project in Rust | https://github.com/roapi/roapi/tree/main/columnq-cli :) | eatonphil wrote: | Nice project! | [deleted] | nonbirithm wrote: | There really ought to be a standard, language-agnostic way of | querying nested config file formats with identifiers by now. For | all the flak XML gets, at least it has XPath. YAML has YAML Path, | but it's only supported by a single Python lib. | emmanueloga_ wrote: | Augeas comes to mind [1], although it seems like a pretty niche | project (since you mention a "standard" ...). | | 1: http://augeas.net | newman314 wrote: | dsq references a benchmark done by q | (https://github.com/harelba/q/blob/master/test/BENCHMARK.md) that | indicates that octosql is significantly slower. | | However, octosql's GH repo claims otherwise. | | Does anyone have any real-world experience that they can share on | these tools? | eatonphil wrote: | Yeah frankly the q benchmark isn't the best even though dsq | compares not terribly in it. It isn't well documented, exercises | a very limited amount of functionality, and isn't very | rigorous from what I can see. That said, the caching q does is | likely very solid (and not something dsq does). | | The biggest risk (both in terms of SQL compliance and | performance) I think with octosql (and cube2222 is here | somewhere to disagree with me if I'm wrong) is that they have | their own entire SQL engine whereas textql, q and dsq use | SQLite. But q is also in Python whereas textql, octosql, and | dsq are in Go.
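To make the "textql, q and dsq use SQLite" point concrete, the load-then-query approach those tools share can be sketched with Python's stdlib alone. This is an illustration of the general technique, not dsq's actual code; the sample rows and the table name `t` are invented:

```python
import json
import sqlite3

# A stand-in for a JSON input file: an array of objects,
# the shape these tools expect.
rows = json.loads('[{"name": "alice", "age": 42}, {"name": "bob", "age": 7}]')

# Load the rows into an in-memory SQLite table, then let SQLite
# itself run the SQL instead of implementing a query engine.
db = sqlite3.connect(":memory:")
columns = sorted(rows[0])  # deterministic column order: age, name
db.execute(f"CREATE TABLE t ({', '.join(columns)})")
db.executemany(
    f"INSERT INTO t VALUES ({', '.join('?' for _ in columns)})",
    [tuple(row[c] for c in columns) for row in rows],
)

result = db.execute("SELECT name FROM t WHERE age > 10").fetchall()
print(result)  # [('alice',)]
```

The trade-off the comment describes falls out of this: SQL compliance comes for free from SQLite, but everything must be loaded before the first query runs.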
| | In the next few weeks I'll be posting some benchmarks that I | hope are a little fairer (or at least well-documented and | reproducible). Though of course it would be appropriate to have | independent benchmarks too since I now have a dog in the fight. | | On a tangent, once the go-duckdb binding [0] matures I'd love | to offer duckdb as an alternative engine flag within dsq (and | DataStation). Would be neat to see. | | [0] https://github.com/marcboeker/go-duckdb | em500 wrote: | Another interesting comparison would be clickhouse-local: | | https://altinity.com/blog/2019/6/11/clickhouse-local-the- | pow... | eatonphil wrote: | Awesome post, thanks for the link. I had no clue ClickHouse | could do this. Shows off some more CLI tools in there too. | | The big issue with ClickHouse is the incredibly non-standard | SQL dialect and the random bugs that remain. It's | an amazing project for analytics but you definitely have to | be willing to hack around its SQL language (I say this as a | massive fan of ClickHouse). | | I wonder: does this mean I can embed ClickHouse in | arbitrary software as a library? I'd be curious to provide | that as an option in dsq. | em500 wrote: | > I wonder: does this mean I can embed ClickHouse in | arbitrary software as a library? I'd be curious to | provide that as an option in dsq. | | Not sure, but it's Apache licensed, so you likely can | make it work if you want to. But realize that | clickhouse(-local) is much heavier than sqlite / duckdb | based solutions: the compiled binary is around 200mb | iirc. | eatonphil wrote: | Ah yeah that's pretty large. | cube2222 wrote: | Hey there! OctoSQL author here. | | I've released OctoSQL v0.4.0 recently[0] which is a grand | rewrite and is 2-3 orders of magnitude faster than OctoSQL | was before. It's also much more robust overall with static | typing and the plugin system for easy extension. The q | benchmarks are over a year old and haven't been updated to | reflect that yet.
| | Take a look at the README[1] for more details. | | My benchmarks should be reproducible, you can find the script | in the benchmarks/ repo directory. | | Btw if we're already talking @eatonphil I'd appreciate you | updating your benchmarks to reflect these changes. | | As far as the custom query engine goes - yes, there are both | pros and cons to that. In the case of OctoSQL I'm building | something that's much more dynamic - a full-blown dataflow | engine - and can subscribe to multiple datasources to | dynamically update queries as source data changes. This also | means it can support streaming datasources. That is not | possible with the other systems. It also means I don't have | to load _everything_ into a SQLite database before querying - | I can optimize which columns I need to even read. | | OctoSQL also lets you work with actual databases in your | queries - like PostgreSQL or MySQL - and pushes predicates | down to them, so it doesn't have to dump your whole database | tables. That's useful if you need to do cross-db joins, or | JSON-file-with-database joins. | | As far as SQL compliance goes it gets hairy in the details - | as usual. The overall dialect is based on MySQL as I'm using | a parser based on vitess's one, but I don't support some | syntax, and add original syntax too (type assertions, | temporal extensions, object and type notation). | | Stay tuned for a lot more supported datasources, as the | plugin system lets me work on that much quicker. | | [0]:https://github.com/cube2222/octosql/releases/tag/v0.4.0 | | [1]:https://github.com/cube2222/octosql#readme | eatonphil wrote: | Neat! I'll give your script a shot. | eatonphil wrote: | I shared the first version of this in a Show HN last month [0]. | The big update since then is that it now supports loading | multiple files and doing SQL joins on them. You can see examples | of that in this post.
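The multi-file joins mentioned above rest on the same SQLite-embedding idea: each input file becomes its own table, and the join itself is ordinary SQL handled by SQLite. A minimal Python sketch of that principle, with invented sample data standing in for a hypothetical users.json and orders.csv (again an illustration, not dsq's actual code):

```python
import sqlite3

# Two hypothetical inputs, already parsed into rows.
users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 9.5}, {"user_id": 1, "total": 3.0}]

# One SQLite table per input file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id, name)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(u["id"], u["name"]) for u in users])
db.execute("CREATE TABLE orders (user_id, total)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(o["user_id"], o["total"]) for o in orders])

# The cross-file join is plain SQL once both tables are loaded.
result = db.execute(
    "SELECT u.name, SUM(o.total) FROM users u "
    "JOIN orders o ON o.user_id = u.id GROUP BY u.name"
).fetchall()
print(result)  # [('alice', 12.5)]
```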
| | The repo is here [1] and the README has a comparison against some | of the other great tools in the space like q, octosql, and | textql. | | [0] https://news.ycombinator.com/item?id=29643835 | | [1] https://github.com/multiprocessio/dsq | data_ders wrote: | way cool Phil! I've been struggling to grok the value prop of | DataStation for a while now, but dsq crystallizes it for me! | eatonphil wrote: | Interesting! and thanks! It has definitely not been | straightforward to explain. Mostly my fault not working on | tutorials, docs and videos. Thankfully DataStation | performance, features and install-time are now in a pretty | good spot so I am going to be focusing more on the education | side. | avmich wrote: | LINQ went the other way - a language for queries against (among | other things) relational databases, different than SQL. | | Wonder how, say, jq would look working with tables in an RDBMS. | rahimiali wrote: | If you're curious how they've adapted SQL to query JSON, here is | the trick: The JSON "Must be an array of objects. Nested object | fields are ignored". | chrisweekly wrote: | See also "lnav" -- https://lnav.org -- for an awesome CLI | powertool with embedded SQLite. | cschneid wrote: | Thank you for this, it's a tool I didn't realize I was missing. | da39a3ee wrote: | I really like this idea. jq is great too, but the jq language | just seems to be too different from anything else for me, so it | only works if I use it every day, which I don't. But I guess an | issue is that I'll still need to use jq in order to get my data | into the suitable array-of-objects format for dsq? | eatonphil wrote: | For now yeah. Eventually I'd like to support pre transforms or | something within dsq itself. | wolfi1 wrote: | just for clarification: one needs datastation in order for dsq to | run? so dsq is not self-contained? (I'm not familiar with the Go | ecosystem) | eatonphil wrote: | Good question! No you don't need to install DataStation to run | dsq.
dsq just imports Go libraries from DataStation during | compile-time. | | dsq is a totally standalone binary. | thriftwy wrote: | I dream of a database which would use CSV (with metadata/indices | in CSV comment lines) as its storage. You can even use | commented-out padding to fit data to blocks, etc. | | Imagine not having to dump/convert data because you can | always open it with OpenOffice. | bachmeier wrote: | What I've done in the past is: | | Column 1 is the creation timestamp | | Column 2 is the modification timestamp | | On read, you update if the creation timestamp exists but the | modification timestamp is later, otherwise you insert. Your | app(s) can do a simple file append for writes. You even have | the full version history at every point in time. | | I did a lot of looking but didn't find a command-line tool that | automated this process. It works fine for small projects of | e.g. 100,000 records. Wouldn't work well for things like a | notes app, because you'd be storing every modification as a new | entry. | remcinerney wrote: | That's interesting. I'm trying to imagine your workflow, and | thinking about what serverless SQL platforms like Amazon Athena | let you do now - i.e., you can more or less dump CSV files in | blob storage and query them. Is that what you meant? | LittlePeter wrote: | In PostgreSQL you can use the file_fdw extension: | | https://www.postgresql.org/docs/current/file-fdw.html | bachmeier wrote: | I think the parent comment was about reading and writing csv | files. The documentation you linked says "Access to data | files is currently read-only." | LittlePeter wrote: | I missed that. You are right. | mr_toad wrote: | PrestoDB. Or one of the many SaaS offerings like Athena, | BigQuery or USQL. | cerved wrote: | I'm not sure I share that dream... | | CSV is so horribly non-standardized and horrible to parse.
JSON | appears to be a much more suitable candidate | sundarurfriend wrote: | recfiles [1] are a sort of poor man's plaintext database that | you can edit with any text editor. I found manually entering | the data mildly annoying and repetitive, but visidata [2] | supports the format, so I've been meaning to learn to use that | for easier data entry and for tabular visualization. | | [1] | https://www.gnu.org/software/recutils/manual/recutils.html#I... | [2] https://www.visidata.org/ | | Bonus: A nice concise intro that someone wrote last week: | https://chrismanbrown.gitlab.io/28.html | laumars wrote: | AWS Athena can do this. Dump your CSVs in S3 and query them in | Athena. | | Personally I use sqlite for smaller datasets and wrap that | around a CSV importer. | thriftwy wrote: | Why dump CSVs when you can outright store in them? | laumars wrote: | Sorry, I don't understand your question. But just in case | this answers it: | | S3 is cloud storage on AWS. Athena can work directly off | the CSVs stored on S3. | | Where I said "dump" it was just a colourful way of saying | "transfer your files to...". I appreciate "dump" can also | mean different things with databases so maybe that wasn't | the best choice of word on my part. Sorry for any confusion | there. ___________________________________________________________________ (page generated 2022-01-11 23:00 UTC)