[HN Gopher] Apache Drill's Future
       ___________________________________________________________________
        
       Apache Drill's Future
        
       Author : bsg75
       Score  : 90 points
       Date   : 2020-05-20 18:46 UTC (4 hours ago)
        
 (HTM) web link (mail-archives.apache.org)
 (TXT) w3m dump (mail-archives.apache.org)
        
       | vhold wrote:
       | Please correct me if I have this wrong, but my vague
       | understanding is that the data representation heart of Apache
       | Drill lives on in the rather active Apache Arrow project.
       | 
       | https://stackoverflow.com/questions/53533506/what-is-the-dif...
       | 
       | https://github.com/apache/arrow/commit/e6905effbb9383afd2423...
       | 
       | And the platform/tools side of Drill now lives on as Dremio,
       | which uses Apache Arrow.
       | 
       | https://github.com/dremio/dremio-oss
       | 
        | So the essence of Drill lives on, but it became half Apache
        | project and half vendor-controlled and supported, and the root
        | of that split is now orphaned.
        
       | cube2222 wrote:
       | Sad to see this happen as I really like the idea!
       | 
       | If you're interested in Drill, check out OctoSQL[0]. It shares
       | the same vision of querying multiple datasources using pure SQL,
        | and pushing down as many operations as possible to the underlying
       | datasource.
       | 
        | Moreover, a huge rewrite has been under way this past year; it's
        | ready to use on the master branch, but not yet released (it will
        | be available soon).
       | 
        | It adds Kafka and Parquet support and, most importantly, first-
        | class unbounded stream support, including temporal SQL
        | extensions for working with event-time metadata (instead of
        | system time), so you can use things like live-updated time-
        | window aggregations on incoming Kafka streams. It also now uses
        | on-disk Badger storage as the primary way to store its state, so
        | you can do group-bys and joins with lots of keys, and restarts
        | of OctoSQL won't alter the final result (exactly-once
        | semantics).
       | 
       | Make sure to check it out, it's also very simple to get going
       | locally!
       | 
       | Disclosure: I'm one of the main contributors.
       | 
       | [0]:https://github.com/cube2222/octosql
        
         | bsg75 wrote:
         | Does OctoSQL support reading columnar compressed formats (ex.
         | Parquet) from distributed storage (ex. S3) ?
        
           | cube2222 wrote:
            | No, we don't support Excel, JSON, CSV, or Parquet
            | datasources other than as local files yet, though that's
            | definitely planned and would be very easy to add.
        
         | willvarfar wrote:
         | How does it compare to prestosql?
        
           | cube2222 wrote:
           | Much less mature.
           | 
           | There's no distributed execution yet.
           | 
           | Easy to get going locally.
           | 
            | I'm not sure, but I think Presto is in-memory? We're
            | optimizing for SSDs to easily support large state, and to
            | achieve durability this way. SSDs are still plenty fast.
           | 
           | Definitely better streaming support. We're very much
           | concentrating on good streaming ergonomics. OctoSQL isn't a
           | batch execution engine at its heart, it's a streaming one.
        
       | chrisjc wrote:
       | Sad to hear this since I've been following Drill for a while.
        | From what I understand, Drill was based on the Google Dremel
       | paper, hence the name.
       | 
       | https://research.google/pubs/pub36632/
       | 
       | Wondering if maybe Spark REPL or Apache Zeppelin might be a
       | decent replacement for Drill.
        
         | rhacker wrote:
          | One thing I like about Spark is the number of data sources it
          | supports, including its ability to write out to a table or
          | whatever. If you want to set up a JDBC or ODBC server, you can
          | also access Spark SQL:
         | 
         | https://spark.apache.org/docs/latest/sql-distributed-sql-eng...
         | 
         | From an analytics perspective, a lot of people like the ability
         | to connect to a JDBC or ODBC source.
        
       | kkwteh wrote:
       | I was thinking of using Apache Drill for converting CSVs to
       | Parquet. What else do people use for that?
        
         | poorman wrote:
         | https://pandas.pydata.org/pandas-docs/stable/reference/api/p...
        
           | nerdponx wrote:
           | +1, in my experience the Pandas CSV parser seems very robust
           | and more than sufficiently fast.
           | 
           | The only faster/better CSV parser I've used with any
           | frequency is the fread function in the R package data.table.
           | When I used R in the past, Parquet was much less popular, but
           | I think now the arrow package supports writing to Parquet.
        
         | ecnahc515 wrote:
         | Maybe Presto/Athena
        
         | lurker458 wrote:
         | I've also been looking for that. In an ideal world there would
         | be a small, fast, standalone cli tool that can convert csv to
         | parquet. There is a (sadly, unfinished) parquet writer Rust
         | library in the Arrow repository that looks promising. All
         | approaches I've tried so far (spark, pyarrow, drill, ...)
         | require everything and the kitchen sink. So far I've settled on
         | a java cli tool that uses jackson + org.apache.parquet
         | internally, but it's cpu bound and has a huge amount of maven
         | dependencies.
        
         | mochomocha wrote:
         | pyarrow
        
         | bsg75 wrote:
         | Spark
        
           | falaki wrote:
            | The Spark community has put a lot of effort into a feature-
            | rich CSV data source. Often the most challenging part of
            | ingesting CSV is parsing it: there are many flavors, each
            | with different corner cases for NULLs, comments, headers,
            | etc.
        
       | PaulHoule wrote:
       | How many Apache projects are there in the "Big Data" space? It
       | seems every time I look around I see a new one.
        
         | chrisjc wrote:
          | 49, according to the project list page. Some of them are in
          | the Attic, some in incubation.
         | 
         | https://projects.apache.org/projects.html?category#big-data
         | 
         | Were you looking for a single project for all your big-data
         | needs?
        
           | catawbasam wrote:
           | Looking for some continuity after those splashy Strata
           | presentations.
        
             | TallGuyShort wrote:
             | Strata is very corporate. If you want what you see in
             | Strata presentations, buy it from the vendors (Cloudera,
             | Microsoft, etc.)
             | 
             | What you see on apache.org is what gets put in
             | presentations at say, the Hadoop Summit.
             | 
             | edit: On a more helpful note, Apache Spark is probably as
             | close as you can get to a single project for all your big
             | data needs if that is what one wants out-of-the-box from an
             | open-source project. It includes a SQL framework, streaming
             | framework, either bundles or improves upon more general
              | work done in Hadoop, etc. It can be pretty vendor-
              | controlled at times, but its birth was in academia, making
              | it pretty different from the other projects, which were
              | mostly born as components in already-established
              | commercial platforms. There are pros and cons to that, of
              | course.
        
       | bradhe wrote:
        | I wonder how many Apache projects actually survive this sort of
        | event?
        
       | agacera wrote:
        | Apache Drill is an interesting project. Of all the MPP engines
        | that appeared a few years ago, it was the most similar to
        | BigQuery (the first public version) and the most flexible.
        | 
        | However, the competition was fierce and each Big Data vendor
        | (MapR, Cloudera and Hortonworks) was pushing its own solution:
        | Drill, Impala and Hive on Tez, respectively. Competition is
        | always a good thing, but it fragmented the user base too much,
        | so no clear winner emerged.
       | 
       | At the same time, Spark SQL got sufficiently better to replace
       | these tools in most use cases and Presto (from Facebook) got the
       | traction and the user base that none of these projects had by
       | being vendor agnostic (and its adoption by AWS in Athena and EMR
       | also helped boost its popularity).
        
         | bsg75 wrote:
         | IIRC, earlier in the project, a differentiator for Drill was to
         | be the ability to run drillbit processes across servers, and
         | run distributed queries from one of them with Zookeeper as a
         | coordinator. This would have been a simple approach to
         | distributed queries where secondary extract and loading into a
         | distributed filesystem or Parquet file was not desired [1]
         | 
         | Unfortunately to date, distributed queries will fail if the
         | paths _and_ files are not symmetric in name - all file paths
         | and names must exist on all nodes - therefore the "in situ"
         | approach is not available. It appears the project focused on
         | querying distributed file systems like HDFS and S3 and
         | therefore had a lot of competition.
         | 
          | I hope some group picks up where HPE orphaned Drill after the
          | MapR acquisition and pivots to a pure distributed-worker
          | approach. Running a drillbit on the nodes where the data
          | originates could be useful; the original example was SQL over
          | HTTP logs directly from webservers.
         | 
         | [1] https://mapr.com/blog/drill-your-big-data-today-apache-
         | drill...
        
         | qeternity wrote:
         | I've not spent much time, but I've never exactly understood
         | what Presto is. Is it just map reduce across databases?
        
           | WookieRushing wrote:
           | Its basically a compute engine that maintains all state in
           | memory and does distributed computations similarly to Spark.
           | 
           | The big thing it adds is that it isn't stuck to any storage
           | format. It has a connector that lets you load data into it
           | from basically anything, from mysql dbs to hdfs files to
           | whatever. So you can do cross database joins and just not
           | care about where the data lives. You can also output to
           | almost any database too.
        
             | qeternity wrote:
              | Not being snarky, but is that basically a SQL interface
              | for map-reduce across databases?
        
               | Boxxed wrote:
               | It's basically federated SQL. Nothing to do with
               | map/reduce, really.
        
               | qeternity wrote:
                | It loads everything from the data sources (presumably
                | pushing as much down to the underlying database as
                | possible) and then does SQL in RAM. That's pretty much
                | the definition of map-reduce. Federated SQL wouldn't
                | give a speedup across a single database, but Presto
                | does.
        
               | kyllo wrote:
                | Is there any form of distributed query engine in use
                | today that _doesn't_ fit the definition of the MapReduce
                | pattern? Is describing a distributed query engine as
                | MapReduce still a meaningful distinction from some
                | other, non-MapReduce approach?
        
           | chrisjc wrote:
           | "Presto is an open-source distributed SQL query engine
           | optimized for low-latency, ad-hoc analysis of data. It
           | supports the ANSI SQL standard, including complex queries,
           | aggregations, joins, and window functions. Presto can process
           | data from multiple data sources including the Hadoop
           | Distributed File System (HDFS) and Amazon S3"
           | 
           | TIL that Presto is available in EMR.
        
           | bsg75 wrote:
           | Its a distributed SQL engine that can query files from
           | various database engines (via connectors or JDBC drivers),
           | including structured file formats like CSV or Parquet (using
           | the Hive metastore).
           | 
           | Presto does not manage storage itself, but instead focuses on
           | fronting those data sources with a single access point, with
           | the option to federate (join) different sources in a single
           | query.
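            | 
            | A federated query then looks like plain SQL spanning two
            | catalogs; the catalog/schema/table names below are made up
            | for illustration:

```sql
-- Join a MySQL table against Parquet files registered in the
-- Hive metastore ("mysql" and "hive" are hypothetical catalogs)
SELECT u.name, sum(p.amount) AS total
FROM mysql.appdb.users AS u
JOIN hive.events.purchases AS p
  ON u.id = p.user_id
GROUP BY u.name;
```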
        
       | dmix wrote:
       | Some context
       | 
       | Drill: https://en.wikipedia.org/wiki/Apache_Drill
       | 
       | MapR sold to Hewlett-Packard Enterprise (HPE):
       | https://en.wikipedia.org/wiki/MapR
        
       | timClicks wrote:
        | I'm really impressed with the mission of Drill - write SQL
        | against disparate data sources - but I've actually never
        | installed it. When I have a bunch of Parquet/CSV/... files
        | sitting around, I can normally slurp them in with pandas.
        
       | srl wrote:
       | For those interested, the relevant rules seem to be here:
       | https://www.apache.org/foundation/voting.html
       | 
       | As far as I can tell, the implication is that there are now fewer
       | than three people interested enough to participate in code
       | reviews, and ASF rules require at least three +1 votes for
       | basically anything to happen.
        
         | rectang wrote:
          | That's basically the idea, though the details are subtly
          | different.
         | 
         | What the ASF won't let you do if you can't muster the votes is
         | actually make a _release_ -- that takes 3 votes from people on
          | the Drill PMC (Project Management Committee). If you can't get
         | 3 PMC votes, the project cannot even make security releases and
         | must be retired.
         | 
         | As to who can commit code, from the ASF's standpoint any person
         | with commit rights can do so at any time. However, the project
         | may impose additional constraints, such as requiring a code
         | review.
        
       ___________________________________________________________________
       (page generated 2020-05-20 23:00 UTC)