[HN Gopher] Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and...
___________________________________________________________________

Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheet

Author : houqp
Score  : 161 points
Date   : 2022-09-24 15:59 UTC (7 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| mmastrac wrote:
| The one thing everyone here is missing so far is that it's a Rust
| binary, distributed on PyPI. That's brilliant.
| jonahx wrote:
| Can you explain the advantages of this vs cargo?
| proto_lambda wrote:
| cargo is not a binary distribution.
| houqp wrote:
| Most users already have pip installed, so they won't need to
| install a Rust toolchain.
| simonw wrote:
| I wrote a bit about this pattern here:
| https://simonwillison.net/2022/May/23/bundling-binary-tools-...
| einpoklum wrote:
| You can get a statically linked binary release from GitHub which
| depends on nothing (I think).
| samwillis wrote:
| I'm all in on using PyPI for binary distribution. Couple that
| with a Python venv and you have a brilliant system for
| per-project dependencies.
|
| I created this project for distributing Node via PyPI:
| https://pypi.org/project/nodejs-bin/
| henrydark wrote:
| It is pretty cool. py-spy has also been doing this for a few
| years:
|
| https://github.com/benfred/py-spy
| playingalong wrote:
| Bye bye, jq and your awful query syntax.
| gavinray wrote:
| 1) roapi is built with some wicked cool tech
|
| 2) the author once answered some questions I posted on
| DataFusion, so they're cool in my book
|
| Those are my anecdotes.
| tootie wrote:
| AWS Athena offers something similar. You can build tables off of
| structured text files (like log files) in S3 and run SQL
| queries.
| ramraj07 wrote:
| What's the performance like, though?
| bachmeier wrote:
| As I commented on a recent similar discussion, these tools can't
| be used for update or insert.
| As useful as querying might be, it's terribly misleading to
| claim to "run SQL" if you can't change the data, since that's
| such a critical part of an SQL database.
| TAForObvReasons wrote:
| The title is an editorialization. The project is very careful to
| emphasize that it is for reading data:
|
| > Create full-fledged APIs for slowly moving datasets without
| writing a single line of code.
|
| Even the name of the project, "ROAPI", has "read only" in it.
| gavinray wrote:
| Question: I've built something that supports full CRUD, and
| queries that span multiple data sources with optimization and
| pushdown.
|
| What kind of headline would make you want to read/try such a
| thing?
|
| (I'm planning on announcing it + releasing the code on HN but
| have never done so before)
| porker wrote:
| Show HN: Read and update Arrow, Parquet and xxxx files using SQL
| gavinray wrote:
| It works on databases and arbitrary data sources too, though.
| tomrod wrote:
| 90% of SQL usage, or more, is SELECT in slowly changing data
| contexts.
| andygrove wrote:
| I think it is worth pointing out that this tool does support
| querying Delta Lake (the author of ROAPI is also a major
| contributor to the native Rust implementation of Delta Lake).
| Delta Lake certainly supports transactions, so ROAPI can query
| transactional data, although the writes would not go through
| ROAPI.
| mgradowski wrote:
| What you're really saying is that the database presented in the
| OP is not useful because it only handles DQL.
|
| 1. SQL can be thought of as being composed of several smaller
| languages: DDL, DQL, DML, DCL.
|
| 2. columnq-cli is only a CLI to a query engine, not a database.
| As such, it only supports DQL by design.
|
| 3. I have the impression that outside of data engineering/DBA
| work, people are rarely taught the distinction between OLTP and
| OLAP workloads [1]. The latter often utilizes immutable data
| structures (e.g.
| columnar storage with column compression), or provides limited
| DML support; see e.g. the limitations of the DELETE statement in
| ClickHouse [2], or the list of supported DML statements in
| Amazon Athena [3]. My point -- as much as this tool is useless
| for transactional workloads, it is perfectly capable of some
| analytical workloads.
|
| [1] Opinion, not a fact.
|
| [2] https://clickhouse.com/docs/en/sql-reference/statements/dele...
|
| [3] https://docs.aws.amazon.com/athena/latest/ug/functions-opera...
| ebfe1 wrote:
| This is cool... It totally reminded me of several similar tools
| that have popped up on HN in the past, so I did a quick search:
|
| clickhouse-local - https://news.ycombinator.com/item?id=22457767
|
| q - https://news.ycombinator.com/item?id=27423276
|
| textql - https://news.ycombinator.com/item?id=16781294
|
| simpql - https://news.ycombinator.com/item?id=25791207
|
| We need a benchmark, I think ;)
| tanin wrote:
| Shameless plug: a desktop app, https://superintendent.app
| skybrian wrote:
| Looks like it also supports SQLite for input, but not for
| output. That might be a nice addition.
| whimsicalism wrote:
| Trino can do this as well.
| cube2222 wrote:
| This looks really cool! Especially since it uses DataFusion
| underneath, it is probably blazingly fast.
|
| If you like this, I recommend taking a look at OctoSQL [0],
| which I'm the author of.
|
| It's plenty fast, and it's easy to add new data sources to as
| external plugins.
|
| It can also handle endless streams of data natively, so you can
| do running groupings on, e.g., tailed JSON logs.
|
| Additionally, it's able to push down predicates to the database
| below, so if you're selecting 10 rows from a 1-billion-row
| table, it'll fetch just those 10 rows instead of getting them
| all and filtering in memory.
|
| [0]: https://github.com/cube2222/octosql
___________________________________________________________________
(page generated 2022-09-24 23:00 UTC)
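Editor's note: for readers unfamiliar with the read-only "SQL on files" pattern the thread discusses, here is a minimal, tool-agnostic sketch in Python. It loads a CSV into an in-memory SQLite table and exposes only a SELECT (DQL) path, the distinction mgradowski draws above. This is not how columnq/ROAPI works internally (they are built on Apache Arrow and DataFusion); the sample data and table name are invented for illustration.

```python
import csv
import io
import sqlite3

# Sample CSV text standing in for a file on disk.
csv_text = """name,points
alpha,3
beta,7
gamma,5
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load the CSV into an in-memory SQLite table. The load step is the
# only write; everything exposed to the user afterward is DQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(r["name"], int(r["points"])) for r in rows],
)

# A read-only analytical query: no UPDATE/INSERT offered to the caller.
result = conn.execute(
    "SELECT name, points FROM t WHERE points > 4 ORDER BY points DESC"
).fetchall()
print(result)  # -> [('beta', 7), ('gamma', 5)]
```

A real query engine like DataFusion skips the load-into-a-database step entirely and evaluates the query directly over columnar (Arrow) data, which is what makes the DQL-only design natural for these tools.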