[HN Gopher] Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and...
       ___________________________________________________________________
        
       Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheet
        
       Author : houqp
       Score  : 161 points
       Date   : 2022-09-24 15:59 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mmastrac wrote:
       | The one thing everyone here is missing so far is that it's a Rust
       | binary, distributed on PyPi. That's brilliant.
        
         | jonahx wrote:
         | Can you explain the advantages of this vs cargo?
        
           | proto_lambda wrote:
           | cargo is not a binary distribution.
        
           | houqp wrote:
           | Most users already have pip installed, so they won't need to
           | install a rust toolchain.
        
         | simonw wrote:
         | I wrote a bit about this pattern here:
         | https://simonwillison.net/2022/May/23/bundling-binary-tools-...
        
         | einpoklum wrote:
         | You can get a statically-linked binary release from GitHub
         | which depends on nothing (I think).
        
         | samwillis wrote:
         | I'm all in on using PyPI for binary distribution. Couple that
         | with Python Venv and you have a brilliant system for per
         | project dependancies.
         | 
         | I created this project for distributing Node via PyPI:
         | https://pypi.org/project/nodejs-bin/
        
       | henrydark wrote:
       | It is pretty cool. py-spy has also been doing this for a few
       | years
       | 
       | https://github.com/benfred/py-spy
        
       | playingalong wrote:
       | Bye bye jq and your awful query syntax.
        
       | gavinray wrote:
       | 1) roapi is built with some wicked cool tech
       | 
       | 2) the author once answered some questions I posted on
       | Datafusion, so they're cool in my book
       | 
       | Here are my anecdotes.
        
       | tootie wrote:
       | AWS Athena offers something similar. You can build tables off of
       | structured text files (like log files) in S3 and run SQL queries.
        
         | ramraj07 wrote:
         | What's the performance like though?
        
       | bachmeier wrote:
       | As I commented on a recent similar discussion, these tools can't
       | be used for update or insert. As useful as querying might be,
       | it's terribly misleading to claim to "run SQL" if you can't
       | change the data, since that's such a critical part of an SQL
       | database.
        
         | TAForObvReasons wrote:
         | The title is an editorialization. The project is very careful
         | to emphasize that it is for reading data:
         | 
         | > Create full-fledged APIs for slowly moving datasets without
         | writing a single line of code.
         | 
         | Even the name of the project "ROAPI" has "read only" in the
         | name.
        
         | gavinray wrote:
         | Question: I've built something that supports full CRUD, and
         | queries that span multiple data sources with optimization and
         | pushdown
         | 
         | What kind of headline would make you want to read/try such a
         | thing?
         | 
         | (I'm planning on announcing it + releasing code on HN but have
         | never done so before)
        
           | porker wrote:
           | Show HN: Read and update Arrow, Parquet and xxxx files using
           | SQL
        
             | gavinray wrote:
             | It works on databases and arbitrary data sources too though
        
         | tomrod wrote:
         | 90% of SQL usage, or more, is select in slowly changing data
         | contexts.
        
         | andygrove wrote:
         | I think it is worth pointing out that this tool does support
         | querying Delta Lake (the author of ROAPI is also a major
         | contributor the native Rust implementation of Delta Lake).
         | Delta Lake certainly supports transactions, so ROAPI can query
         | transactional data, although the writes would not go through
         | ROAPI.
        
         | mgradowski wrote:
         | What you're really saying is that the database presented in OP
         | is not useful because it only handles DQL.
         | 
         | 1. SQL can be thought of as being composed of several smaller
         | lanuages: DDL, DQL, DML, DCL.
         | 
         | 2. columnq-cli is only a CLI to a query engine, not a database.
         | As such, it only supports DQL by design.
         | 
         | 3. I have the impression that outside of data engineering/DBA,
         | people are rarely taught the distinction between OLTP and OLAP
         | workloads [1]. The latter often utilizes immutable data
         | structures (e.g. columnar storage with column compression), or
         | provides limited DML support, see e.g. the limitations of the
         | DELETE statement in ClickHouse [2], or the list of supported
         | DML statements in Amazon Athena [3]. My point -- as much as
         | this tool is useless for transactional workloads, it is
         | perfectly capable of some analytical workloads.
         | 
         | [1] Opinion, not a fact.
         | 
         | [2] https://clickhouse.com/docs/en/sql-
         | reference/statements/dele...
         | 
         | [3] https://docs.aws.amazon.com/athena/latest/ug/functions-
         | opera...
        
       | ebfe1 wrote:
       | This is cool...Totally reminded me about several tools pop up on
       | HN every now and then in the past for similar task so i did a
       | quick search:
       | 
       | clickhouse-local - https://news.ycombinator.com/item?id=22457767
       | 
       | q - https://news.ycombinator.com/item?id=27423276
       | 
       | textql - https://news.ycombinator.com/item?id=16781294
       | 
       | simpql- https://news.ycombinator.com/item?id=25791207
       | 
       | We need a benchmark i think..;)
        
         | tanin wrote:
         | Shameless plug. A desktop app: https://superintendent.app
        
       | skybrian wrote:
       | Looks like it also supports SQLite for input, but not for output.
       | That might be a nice addition.
        
       | whimsicalism wrote:
       | Trino can do this as well.
        
       | cube2222 wrote:
       | This looks really cool! Especially using datafusion underneath
       | means that it probably is blazingly fast.
       | 
       | If you like this, I recommend taking a look at OctoSQL[0], which
       | I'm the author of.
       | 
       | It's plenty fast and easier to add new data sources for as
       | external plugins.
       | 
       | It can also handle endless streams of data natively, so you can
       | do running groupings on i.e. tailed JSON logs.
       | 
       | Additionally, it's able to push down predicates to the database
       | below, so if you're selecting 10 rows from a 1 billion row table,
       | it'll just get those 10 rows instead of getting them all and
       | filtering in memory.
       | 
       | [0]: https://github.com/cube2222/octosql
        
       ___________________________________________________________________
       (page generated 2022-09-24 23:00 UTC)