[HN Gopher] Apache Arrow Flight SQL: Accelerating Database Access
       ___________________________________________________________________
        
       Apache Arrow Flight SQL: Accelerating Database Access
        
       Author : ionychal
       Score  : 82 points
       Date   : 2022-02-16 14:55 UTC (8 hours ago)
        
 (HTM) web link (arrow.apache.org)
 (TXT) w3m dump (arrow.apache.org)
        
       | carlineng wrote:
       | Mark Raasveldt and Hannes Muhleisen at CWI wrote about these
       | challenges in their 2017 paper Don't Hold My Data Hostage [0].
       | It's great to see progress on this front.
       | 
       | [0] https://ir.cwi.nl/pub/26415
        
       | eatonphil wrote:
       | tldr; it's a new wire protocol that columnar databases can opt
       | into that is better optimized for columnar data than ODBC/JDBC.
       | 
       | Key quotes:                   While it can be directly used for
       | database access, it is not a direct replacement for JDBC/ODBC.
       | Instead, Flight SQL serves as a concrete wire protocol/driver
       | implementation that can support a JDBC/ODBC driver and reduces
       | implementation burden on databases.              Flight SQL means
       | database servers can implement a standard interface that is
       | designed around Apache Arrow and columnar data from the start.
       | Just like how Arrow provides a standard in-memory format, Flight
       | SQL saves developers from having to design and implement an
       | entirely new wire protocol. As mentioned, Flight already
       | implements features like encryption on the wire and
       | authentication of requests, which databases do not need to re-
       | implement.
        
         | gavinray wrote:
         | This is true, but to clarify and expand a bit, you add support
         | for FlightSQL to some datasource by adding Protobuf RPC
         | handlers to it. This doesn't necessarily have to run in-
         | process, the FlightSQL RPC service might run side-by-side.
         | 
         | The datasource also doesn't necessarily have to be a database.
         | It just needs to implement the RPC methods. So you could
         | implement your own data-access layer over any datasource (flat
         | file, an external API, a regular DBMS, etc) by providing an
         | implementation for "get all schemas", "get all tables",
         | "execute a query", etc.
         | 
         | Some useful quotes:                 > The idea is that the
         | driver is a "one-size-fits-all" driver -- a user or tool vendor
         | only needs to supply a generic driver that can connect to an
         | infinite number of databases. This is even future-proof -- if a
         | new database comes out, it can work with existing tools as long
         | as an Arrow Flight SQL endpoint is provided. In fact, by adding
         | an Arrow Flight SQL endpoint they would automatically enable
         | JDBC connectivity too.            > Not only will Arrow Flight
         | SQL reduce the technical burden on applications and users, but
         | it leverages Arrow, which means it will provide better
         | performance. ... having a single reference JDBC driver allows
         | any data source that adds an Arrow Flight SQL endpoint to get
         | JDBC "for free" as an onramp. So the selling point is add an
         | Arrow Flight SQL endpoint to your data source and automatically
         | get JDBC connectivity.
         | 
         | I have been experimenting in my free time with building a
         | platform that autogenerates GraphQL CRUD API's on top of
         | arbitrary datasources.
         | 
         | It lets you do federated/distributed queries and cross-
         | databsource joins.
         | 
         | So you can connect it to a Postgres DB, a MySQL DB, a MongoDB,
         | and a CSV file and then run a query that joins and filters data
         | across them in a single operation.
         | 
         | I am using Apache Calcite for this, but am interested in
         | potentially offloading the query plan execution using
         | FlightSQL/Substrait[0] for better performance, since I am
         | targeting OLTP workloads and it's latency-sensitive.
         | 
         | https://github.com/GavinRay97/GraphQLCalcite
         | 
         | [0] https://substrait.io/
        
       | lmeyerov wrote:
       | Exciting, remember whiteboard conversations about this years ago,
       | and benchmarking early ideas like Turbodc :)
       | 
       | I could not find: Is there a list of conformant DBs & their
       | implementation statuses? Been curious both for Flight and Flight
       | SQL....
        
         | lidavidm wrote:
         | There is not a list (though right now it is zero,
         | unfortunately); Dremio has been doing some work around this
         | (and contributed the spec and implementation) though I can't
         | speak for their plans.
         | 
         | Just a minor distinction - Flight is lower-level and might be
         | used to build a wire protocol for a database, but a "database
         | supporting Flight" still requires something like Flight SQL to
         | specify how the database expects to serve requests over Flight.
        
       | pauldix wrote:
       | This is great and makes a ton of sense as a refinement to their
       | work on Flight previously. InfluxDB IOx already supports Flight,
       | but we'll almost certainly be updating to support Flight SQL
       | before we launch. We've been thinking about adding Postgres wire
       | protocol support, but this would be even better if enough
       | downstream clients and projects end up adding SQL Flight support.
       | 
       | Great work to the Arrow team!
        
         | pauldix wrote:
         | Oh, and for anyone interested in pitching in on the Rust
         | implementation, there's an issue logged here along with some
         | discussion: https://github.com/apache/arrow-rs/issues/1323
        
       | cogman10 wrote:
       | I'm not tuned into Arrow all that much. I've read some of the
       | about and stuff, but the code examples (to my eye) look really
       | complex and complicated. [1]
       | 
       | Could someone point me to a more glossy "arrow flight sql for
       | dummies" examples? What I'm gleaning from this (or am I wrong?)
       | is you could use a JDBC driver + arrow jdbc client and write...
       | SQL? Or is it something a lot different?
       | 
       | Is this the sort of thing where you could just add a plugin to
       | postgres and be arrowified or something?
       | 
       | [1]
       | https://github.com/apache/arrow/blob/release-7.0.0/java/flig...
        
         | lidavidm wrote:
         | There are several community members actively working on making
         | things more accessible in general. (Flight SQL definitely needs
         | some polish here.)
         | 
         | For Arrow itself, you could start with the Cookbook [1] (and
         | feel free to file issues if there are things that are unclear,
         | or that you would like to learn more about [2])
         | 
         | The example there is very long, but much of it is repetitive
         | (that said, it could certainly be improved). But to your
         | specific question, what you would eventually do is write a
         | plugin or proxy for Postgres (or another database) implementing
         | these RPC methods. Then you could submit SQL queries using the
         | Flight SQL client (which calls those RPC methods internally) to
         | get back columnar data. You could wrap this all in a JDBC
         | driver as well (which is being actively worked on [3]), in
         | which case you'd be mostly ignorant of what's going on
         | underneath; the benefit there would (hopefully) be faster data
         | access due to the lower layers being columnar (and zero-copy
         | where possible), and enabling parallelism due to Flight
         | (Postgres might not support this for instance, but other
         | databases could, if they could expose internal parallelism or
         | partitioning). Hope that helps.
         | 
         | [1]: https://arrow.apache.org/cookbook/ [2]:
         | https://github.com/apache/arrow-cookbook/ [3]:
         | https://github.com/apache/arrow/pull/12254
        
           | cogman10 wrote:
           | Ok, I think I'm getting it. Then some of the remaining work
           | (it sounds like) is getting that flight sql proxy/plugin
           | setup for 3rd party datastores but once that is done you
           | could hook up and run either flight directly through the
           | flight client or via ODBC/JDBC wrappers that make it feel a
           | little more sqly.
        
       ___________________________________________________________________
       (page generated 2022-02-16 23:00 UTC)