[HN Gopher] Apache Arrow Flight SQL: Accelerating Database Access ___________________________________________________________________ Apache Arrow Flight SQL: Accelerating Database Access Author : ionychal Score : 82 points Date : 2022-02-16 14:55 UTC (8 hours ago) (HTM) web link (arrow.apache.org) (TXT) w3m dump (arrow.apache.org) | carlineng wrote: | Mark Raasveldt and Hannes Muhleisen at CWI wrote about these | challenges in their 2017 paper Don't Hold My Data Hostage [0]. | It's great to see progress on this front. | | [0] https://ir.cwi.nl/pub/26415 | eatonphil wrote: | tldr; it's a new wire protocol that columnar databases can opt | into that is better optimized for columnar data than ODBC/JDBC. | | Key quotes: While it can be directly used for | database access, it is not a direct replacement for JDBC/ODBC. | Instead, Flight SQL serves as a concrete wire protocol/driver | implementation that can support a JDBC/ODBC driver and reduces | implementation burden on databases. Flight SQL means | database servers can implement a standard interface that is | designed around Apache Arrow and columnar data from the start. | Just like how Arrow provides a standard in-memory format, Flight | SQL saves developers from having to design and implement an | entirely new wire protocol. As mentioned, Flight already | implements features like encryption on the wire and | authentication of requests, which databases do not need to re- | implement. | gavinray wrote: | This is true, but to clarify and expand a bit, you add support | for FlightSQL to some datasource by adding Protobuf RPC | handlers to it. This doesn't necessarily have to run in- | process, the FlightSQL RPC service might run side-by-side. | | The datasource also doesn't necessarily have to be a database. | It just needs to implement the RPC methods. So you could | implement your own data-access layer over any datasource (flat | file, an external API, a regular DBMS, etc) by providing an | implementation for "get all schemas", "get all tables", | "execute a query", etc. | | Some useful quotes: > The idea is that the | driver is a "one-size-fits-all" driver -- a user or tool vendor | only needs to supply a generic driver that can connect to an | infinite number of databases. This is even future-proof -- if a | new database comes out, it can work with existing tools as long | as an Arrow Flight SQL endpoint is provided. In fact, by adding | an Arrow Flight SQL endpoint they would automatically enable | JDBC connectivity too. > Not only will Arrow Flight | SQL reduce the technical burden on applications and users, but | it leverages Arrow, which means it will provide better | performance. ... having a single reference JDBC driver allows | any data source that adds an Arrow Flight SQL endpoint to get | JDBC "for free" as an onramp. So the selling point is add an | Arrow Flight SQL endpoint to your data source and automatically | get JDBC connectivity. | | I have been experimenting in my free time with building a | platform that autogenerates GraphQL CRUD API's on top of | arbitrary datasources. | | It lets you do federated/distributed queries and cross- | databsource joins. | | So you can connect it to a Postgres DB, a MySQL DB, a MongoDB, | and a CSV file and then run a query that joins and filters data | across them in a single operation. | | I am using Apache Calcite for this, but am interested in | potentially offloading the query plan execution using | FlightSQL/Substrait[0] for better performance, since I am | targeting OLTP workloads and it's latency-sensitive. | | https://github.com/GavinRay97/GraphQLCalcite | | [0] https://substrait.io/ | lmeyerov wrote: | Exciting, remember whiteboard conversations about this years ago, | and benchmarking early ideas like Turbodc :) | | I could not find: Is there a list of conformant DBs & their | implementation statuses? Been curious both for Flight and Flight | SQL.... | lidavidm wrote: | There is not a list (though right now it is zero, | unfortunately); Dremio has been doing some work around this | (and contributed the spec and implementation) though I can't | speak for their plans. | | Just a minor distinction - Flight is lower-level and might be | used to build a wire protocol for a database, but a "database | supporting Flight" still requires something like Flight SQL to | specify how the database expects to serve requests over Flight. | pauldix wrote: | This is great and makes a ton of sense as a refinement to their | work on Flight previously. InfluxDB IOx already supports Flight, | but we'll almost certainly be updating to support Flight SQL | before we launch. We've been thinking about adding Postgres wire | protocol support, but this would be even better if enough | downstream clients and projects end up adding SQL Flight support. | | Great work to the Arrow team! | pauldix wrote: | Oh, and for anyone interested in pitching in on the Rust | implementation, there's an issue logged here along with some | discussion: https://github.com/apache/arrow-rs/issues/1323 | cogman10 wrote: | I'm not tuned into Arrow all that much. I've read some of the | about and stuff, but the code examples (to my eye) look really | complex and complicated. [1] | | Could someone point me to a more glossy "arrow flight sql for | dummies" examples? What I'm gleaning from this (or am I wrong?) | is you could use a JDBC driver + arrow jdbc client and write... | SQL? Or is it something a lot different? | | Is this the sort of thing where you could just add a plugin to | postgres and be arrowified or something? | | [1] | https://github.com/apache/arrow/blob/release-7.0.0/java/flig... | lidavidm wrote: | There are several community members actively working on making | things more accessible in general. (Flight SQL definitely needs | some polish here.) | | For Arrow itself, you could start with the Cookbook [1] (and | feel free to file issues if there are things that are unclear, | or that you would like to learn more about [2]) | | The example there is very long, but much of it is repetitive | (that said, it could certainly be improved). But to your | specific question, what you would eventually do is write a | plugin or proxy for Postgres (or another database) implementing | these RPC methods. Then you could submit SQL queries using the | Flight SQL client (which calls those RPC methods internally) to | get back columnar data. You could wrap this all in a JDBC | driver as well (which is being actively worked on [3]), in | which case you'd be mostly ignorant of what's going on | underneath; the benefit there would (hopefully) be faster data | access due to the lower layers being columnar (and zero-copy | where possible), and enabling parallelism due to Flight | (Postgres might not support this for instance, but other | databases could, if they could expose internal parallelism or | partitioning). Hope that helps. | | [1]: https://arrow.apache.org/cookbook/ [2]: | https://github.com/apache/arrow-cookbook/ [3]: | https://github.com/apache/arrow/pull/12254 | cogman10 wrote: | Ok, I think I'm getting it. Then some of the remaining work | (it sounds like) is getting that flight sql proxy/plugin | setup for 3rd party datastores but once that is done you | could hook up and run either flight directly through the | flight client or via ODBC/JDBC wrappers that make it feel a | little more sqly. ___________________________________________________________________ (page generated 2022-02-16 23:00 UTC)