[HN Gopher] Apache Drill's Future
___________________________________________________________________
 
  Apache Drill's Future
 
  Author : bsg75
  Score  : 90 points
  Date   : 2020-05-20 18:46 UTC (4 hours ago)
 
  web link (mail-archives.apache.org)
 
  | vhold wrote:
  | Please correct me if I have this wrong, but my vague
  | understanding is that the data representation heart of Apache
  | Drill lives on in the rather active Apache Arrow project.
  |
  | https://stackoverflow.com/questions/53533506/what-is-the-dif...
  |
  | https://github.com/apache/arrow/commit/e6905effbb9383afd2423...
  |
  | And the platform/tools side of Drill now lives on as Dremio,
  | which uses Apache Arrow.
  |
  | https://github.com/dremio/dremio-oss
  |
  | So the essence of Drill still lives, but it became half Apache
  | project and half vendor controlled and supported, and the root
  | of that split is now orphaned.
  | cube2222 wrote:
  | Sad to see this happen, as I really like the idea!
  |
  | If you're interested in Drill, check out OctoSQL[0]. It shares
  | the same vision of querying multiple datasources using pure
  | SQL, and pushing down as many operations as possible to the
  | underlying datasource.
  |
  | Moreover, there's been a huge rewrite under way this past year;
  | it's ready to use on the master branch but not yet released (it
  | will be available soon).
  |
  | It adds Kafka and Parquet support and, most importantly, first-
  | class unbounded stream support, including temporal SQL
  | extensions for working with event-time metadata (instead of
  | system time), so you can use things like live-updated time-
  | window aggregations on incoming Kafka streams. It also now uses
  | on-disk Badger storage as the primary way to store its state,
  | so you can do group bys / joins with lots of keys, and restarts
  | of OctoSQL won't alter the final result (exactly-once
  | semantics).
  |
  | Make sure to check it out; it's also very simple to get going
  | locally!
  |
  | Disclosure: I'm one of the main contributors.
  |
  | [0]: https://github.com/cube2222/octosql
  | bsg75 wrote:
  | Does OctoSQL support reading columnar compressed formats (ex.
  | Parquet) from distributed storage (ex. S3)?
  | cube2222 wrote:
  | No - we don't support Excel, JSON, CSV and Parquet datasources
  | other than local files yet, though that's definitely planned
  | and would be very easy to add.
  | willvarfar wrote:
  | How does it compare to prestosql?
  | cube2222 wrote:
  | Much less mature.
  |
  | There's no distributed execution yet.
  |
  | Easy to get going locally.
  |
  | I'm not sure, but I think Presto is in-memory? We're optimizing
  | for SSDs to easily support big states, and achieve durability
  | this way. SSDs are still plenty fast.
  |
  | Definitely better streaming support. We're very much
  | concentrating on good streaming ergonomics. OctoSQL isn't a
  | batch execution engine at its heart, it's a streaming one.
  | chrisjc wrote:
  | Sad to hear this, since I've been following Drill for a while.
  | From what I understand, Drill was based on the Google Dremel
  | paper, hence the name.
  |
  | https://research.google/pubs/pub36632/
  |
  | Wondering if maybe the Spark REPL or Apache Zeppelin might be a
  | decent replacement for Drill.
  | rhacker wrote:
  | One thing I like about Spark is the number of data sources it
  | supports, including its ability to write out to a table or
  | whatever. If you want to set up a JDBC or ODBC server, you can
  | also access Spark SQL:
  |
  | https://spark.apache.org/docs/latest/sql-distributed-sql-eng...
  |
  | From an analytics perspective, a lot of people like the ability
  | to connect to a JDBC or ODBC source.
  | kkwteh wrote:
  | I was thinking of using Apache Drill for converting CSVs to
  | Parquet. What else do people use for that?
  | poorman wrote:
  | https://pandas.pydata.org/pandas-docs/stable/reference/api/p...
  | nerdponx wrote:
  | +1, in my experience the Pandas CSV parser seems very robust
  | and more than sufficiently fast.
  |
  | The only faster/better CSV parser I've used with any frequency
  | is the fread function in the R package data.table. When I used
  | R in the past, Parquet was much less popular, but I think the
  | arrow package now supports writing to Parquet.
  | ecnahc515 wrote:
  | Maybe Presto/Athena
  | lurker458 wrote:
  | I've also been looking for that. In an ideal world there would
  | be a small, fast, standalone CLI tool that can convert CSV to
  | Parquet. There is a (sadly, unfinished) Parquet writer Rust
  | library in the Arrow repository that looks promising. All
  | approaches I've tried so far (Spark, pyarrow, Drill, ...)
  | require everything and the kitchen sink. So far I've settled on
  | a Java CLI tool that uses Jackson + org.apache.parquet
  | internally, but it's CPU-bound and has a huge number of Maven
  | dependencies.
  | mochomocha wrote:
  | pyarrow
  | bsg75 wrote:
  | Spark
  | falaki wrote:
  | The Spark community put a lot of effort into a feature-rich CSV
  | data source. Often the most challenging part of ingesting CSV
  | is parsing it. There are many flavors, each with different
  | corner cases for NULLs, comments, headers, etc.
  | PaulHoule wrote:
  | How many Apache projects are there in the "Big Data" space? It
  | seems every time I look around I see a new one.
  | chrisjc wrote:
  | 49, according to the project list page. Some of them are in the
  | attic, some in incubation.
  |
  | https://projects.apache.org/projects.html?category#big-data
  |
  | Were you looking for a single project for all your big-data
  | needs?
  | catawbasam wrote:
  | Looking for some continuity after those splashy Strata
  | presentations.
  | TallGuyShort wrote:
  | Strata is very corporate. If you want what you see in Strata
  | presentations, buy it from the vendors (Cloudera, Microsoft,
  | etc.)
  |
  | What you see on apache.org is what gets put in presentations
  | at, say, the Hadoop Summit.
  |
  | edit: On a more helpful note, Apache Spark is probably as close
  | as you can get to a single project for all your big data needs,
  | if that is what one wants out of the box from an open-source
  | project. It includes a SQL framework and a streaming framework,
  | and either bundles or improves upon more general work done in
  | Hadoop, etc. It can be pretty vendor-controlled at times, but
  | its birth was in academia, making it pretty different from the
  | other projects that were mostly born as components in already-
  | established commercial platforms. There are pros and cons to
  | that, of course.
  | bradhe wrote:
  | I wonder how many Apache projects actually survive this sort of
  | event?
  | agacera wrote:
  | Apache Drill is an interesting project. Of all the MPP engines
  | that appeared a few years ago, it was the one most similar to
  | BigQuery (the first public version) and the most flexible.
  |
  | However, the competition was fierce, and each Big Data vendor
  | (MapR, Cloudera and HortonWorks) was pushing its own solution:
  | Drill, Impala and Hive on Tez. Competition is always a good
  | thing, but it fragmented the user base too much, so no clear
  | winner emerged.
  |
  | At the same time, Spark SQL got sufficiently better to replace
  | these tools in most use cases, and Presto (from Facebook) got
  | the traction and the user base that none of these projects had
  | by being vendor agnostic (its adoption by AWS in Athena and EMR
  | also helped boost its popularity).
  | bsg75 wrote:
  | IIRC, earlier in the project, a differentiator for Drill was to
  | be the ability to run drillbit processes across servers, and
  | run distributed queries from one of them, with Zookeeper as a
  | coordinator.
  | This would have been a simple approach to distributed queries
  | where secondary extraction and loading into a distributed
  | filesystem or Parquet file was not desired. [1]
  |
  | Unfortunately, to date, distributed queries will fail if the
  | paths _and_ files are not symmetric in name - all file paths
  | and names must exist on all nodes - so the "in situ" approach
  | is not available. It appears the project focused on querying
  | distributed file systems like HDFS and S3, and therefore had a
  | lot of competition.
  |
  | I hope some group picks up where HPE orphaned Drill after the
  | MapR acquisition and pivots to a pure distributed-worker
  | approach. Running a drillbit on nodes where the data originates
  | could be useful; the original example was SQL over HTTP logs
  | directly from webservers.
  |
  | [1] https://mapr.com/blog/drill-your-big-data-today-apache-
  | drill...
  | qeternity wrote:
  | I've not spent much time with it, but I've never exactly
  | understood what Presto is. Is it just map reduce across
  | databases?
  | WookieRushing wrote:
  | It's basically a compute engine that maintains all state in
  | memory and does distributed computations similarly to Spark.
  |
  | The big thing it adds is that it isn't stuck to any storage
  | format. It has connectors that let you load data into it from
  | basically anything, from MySQL dbs to HDFS files to whatever.
  | So you can do cross-database joins and just not care about
  | where the data lives. You can also output to almost any
  | database.
  | qeternity wrote:
  | Not being snarky, but is that basically a SQL interface for map
  | reduce across databases?
  | Boxxed wrote:
  | It's basically federated SQL. Nothing to do with map/reduce,
  | really.
  | qeternity wrote:
  | It loads everything from the data sources (presumably pushing
  | as much down to the underlying database as possible) and then
  | does SQL in RAM. Pretty much the definition of map reduce.
  | Federated SQL wouldn't give a speedup across a single database,
  | but Presto does.
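The "federated SQL with pushdown" model debated above can be sketched in miniature: each source evaluates as much of the query as it can locally, and the engine joins the partial results in its own memory. This toy uses two SQLite databases and made-up tables purely for illustration; it is the idea, not Presto itself.

```python
# Toy illustration of a federated join with pushdown (not Presto).
import sqlite3

# Two separate "data sources", each a standalone database.
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])

orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, 9.5), (1, 20.0), (2, 5.0)])

# "Pushdown": each source runs its own scan/aggregation locally...
users = dict(users_db.execute("SELECT id, name FROM users"))
totals = orders_db.execute(
    "SELECT user_id, SUM(total) FROM orders GROUP BY user_id")

# ...and the engine joins the partial results in memory.
report = {users[uid]: total for uid, total in totals}
print(report)
```

The pushed-down GROUP BY is why federation can still speed things up: only two aggregated rows cross the wire, not three order rows. A real engine adds a planner that decides what each connector can evaluate.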
  | kyllo wrote:
  | Is there any form of distributed query engine in use today that
  | _doesn't_ fit the definition of the MapReduce pattern? Is
  | describing a distributed query engine as MapReduce still a
  | meaningful distinction from some other, non-MapReduce approach?
  | chrisjc wrote:
  | "Presto is an open-source distributed SQL query engine
  | optimized for low-latency, ad-hoc analysis of data. It supports
  | the ANSI SQL standard, including complex queries, aggregations,
  | joins, and window functions. Presto can process data from
  | multiple data sources including the Hadoop Distributed File
  | System (HDFS) and Amazon S3"
  |
  | TIL that Presto is available in EMR.
  | bsg75 wrote:
  | It's a distributed SQL engine that can query data from various
  | database engines (via connectors or JDBC drivers), as well as
  | structured file formats like CSV or Parquet (using the Hive
  | metastore).
  |
  | Presto does not manage storage itself, but instead focuses on
  | fronting those data sources with a single access point, with
  | the option to federate (join) different sources in a single
  | query.
  | dmix wrote:
  | Some context:
  |
  | Drill: https://en.wikipedia.org/wiki/Apache_Drill
  |
  | MapR sold to Hewlett-Packard Enterprise (HPE):
  | https://en.wikipedia.org/wiki/MapR
  | timClicks wrote:
  | I'm really impressed with the mission of Drill - write SQL for
  | disparate data sources - but I've actually never installed it.
  | When I have a bunch of parquet/csv/... files sitting around, I
  | can normally slurp them in with pandas.
  | srl wrote:
  | For those interested, the relevant rules seem to be here:
  | https://www.apache.org/foundation/voting.html
  |
  | As far as I can tell, the implication is that there are now
  | fewer than three people interested enough to participate in
  | code reviews, and ASF rules require at least three +1 votes for
  | basically anything to happen.
  | rectang wrote:
  | That's basically the idea, though the details are subtly
  | different.
  |
  | What the ASF won't let you do, if you can't muster the votes,
  | is actually make a _release_ - that takes 3 votes from people
  | on the Drill PMC (Project Management Committee). If you can't
  | get 3 PMC votes, the project cannot even make security releases
  | and must be retired.
  |
  | As to who can commit code, from the ASF's standpoint any person
  | with commit rights can do so at any time. However, the project
  | may impose additional constraints, such as requiring a code
  | review.
___________________________________________________________________
(page generated 2020-05-20 23:00 UTC)