[HN Gopher] InfluxDB is betting on Rust and Apache Arrow for nex...
       ___________________________________________________________________
        
       InfluxDB is betting on Rust and Apache Arrow for next-gen data
       store
        
       Author : mhall119
       Score  : 105 points
       Date   : 2020-11-10 18:26 UTC (4 hours ago)
        
 (HTM) web link (www.influxdata.com)
 (TXT) w3m dump (www.influxdata.com)
        
       | thamer wrote:
       | I'm using Apache Arrow to store CSV-like data and it works
       | amazingly well.
       | 
        | The datasets I work with contain a few billion records with 5-10
        | fields each, almost all 64-bit longs. I originally started with
        | CSV and soon switched to a binary format with fixed-size records
        | (mmap'd), which gave great performance improvements, but Arrow
        | won me over with its flexibility, the size gains from columnar
        | compression, and its much greater performance on queries that
        | span a single column or a small number of them.
       | 
       | For anyone who has to process even a few million records locally,
       | I would highly recommend it.
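        | 
        | For a sense of what that looks like, here's a minimal sketch
        | (hypothetical field names) using PyArrow's Feather V2/IPC
        | format:
        | 
        |   import pyarrow as pa
        |   import pyarrow.feather as feather
        | 
        |   # Build a small columnar table of 64-bit fields.
        |   table = pa.table({
        |       "ts": pa.array([1, 2, 3], type=pa.int64()),
        |       "value": pa.array([10, 20, 30], type=pa.int64()),
        |   })
        |   feather.write_feather(table, "records.feather",
        |                         compression="zstd")
        | 
        |   # A query touching one column reads only that column back.
        |   subset = feather.read_table("records.feather",
        |                               columns=["value"])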
        
         | stevesimmons wrote:
         | Arrow + Parquet is brilliant!
         | 
         | Right now I'm writing tools in Python (Python!) to analyse
          | several 100TB datasets in S3. Each dataset is made up of 1000+
          | 6GB Parquet files (tables UNLOADed from an AWS Redshift db).
         | Parquet's columnar compression gives a 15x reduction in on-disk
         | size. Parquet also stores chunk metadata at the end of each
         | file, allowing reads to skip over most data that isn't
         | relevant.
         | 
         | And once in memory, the Arrow format gives zero-copy
         | compatibility with Numpy and Pandas.
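          | 
          | As a minimal sketch (hypothetical file and column names), the
          | column-pruned read path into pandas looks like:
          | 
          |   import pyarrow.parquet as pq
          | 
          |   # Only the named columns are read; the footer metadata
          |   # lets the reader skip chunks that aren't relevant.
          |   table = pq.read_table("part-0001.parquet",
          |                         columns=["user_id", "amount"])
          |   df = table.to_pandas()  # zero-copy for compatible dtypes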
         | 
         | If you try this with Python, make sure you use the latest 2.0.0
         | version of PyArrow [1]. Two other interesting libraries for
         | manipulating PyArrow Tables and ChunkedArrays are fletcher [2]
          | and graphique [3].
         | 
         | [1] I use: conda install -c conda-forge pyarrow python-snappy
         | 
         | [2] https://github.com/xhochy/fletcher
         | 
         | [3] https://github.com/coady/graphique
        
           | pauldix wrote:
           | Yeah, Parquet is awesome. One of the things we really want to
            | do here is to push DataFusion (the Rust-based SQL execution
            | engine) to work directly on Parquet files, pushing down
            | predicates and other operations so it can work on the data
            | while it's still compressed.
           | 
           | You pay such a high overhead marshalling that data into an
           | Arrow RecordBatch. Best thing ever is to work with the
           | Parquet file and not even decompress the chunks that you
           | don't need. Of course, this assumes that you're writing
           | summary statistics as part of the metadata, which we plan to
           | do.
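            | 
            | To illustrate the idea via the PyArrow API (a hedged
            | sketch with a hypothetical file and predicate; the IOx
            | work itself is in Rust):
            | 
            |   import pyarrow.parquet as pq
            | 
            |   pf = pq.ParquetFile("chunk.parquet")
            |   meta = pf.metadata
            |   for rg in range(meta.num_row_groups):
            |       stats = meta.row_group(rg).column(0).statistics
            |       # Skip row groups whose min/max can't satisfy the
            |       # predicate, without decompressing their pages.
            |       if stats and stats.has_min_max and stats.max >= 1000:
            |           batch = pf.read_row_group(rg)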
        
             | nevi-me wrote:
             | There's rudimentary statistics support, but I've found that
             | it accounts for a lot of the write time (I wrote some tests
             | last weekend, I can ping your team when I put them on GH).
             | 
             | Improving our stats writing could yield a lot of benefits.
             | I'll open JIRAs for this in the next few days.
        
           | gizmodo59 wrote:
            | Parquet + Arrow reminds me of a fast SQL engine for data
            | lakes called Dremio.
           | 
           | https://www.dremio.com/webinars/apache-arrow-calcite-
           | parquet...
           | 
            | They also have an OSS version on GitHub.
        
             | pauldix wrote:
             | They're heavily into Arrow. A few years ago they
             | contributed Gandiva, an LLVM expression compiler for super
             | fast processing.
             | https://arrow.apache.org/blog/2018/12/05/gandiva-donation/
             | 
             | It's one of the reasons I like being all in on Arrow. Why
             | do everything ourselves when a ton of other smart people
             | are working on this too?
        
               | nevi-me wrote:
                | Talking about Gandiva, something that's open for the
                | taking:
                | https://issues.apache.org/jira/browse/ARROW-5315
                | (creating Gandiva bindings for Rust).
               | 
                | I think DataFusion is mature enough that we could plug
                | Gandiva into it.
               | 
               | Disclaimer: I work on the Arrow Rust implementation
        
               | pauldix wrote:
               | Ah yes, of course, hi Nevi :). Thank you again for all
               | your work on the Rust implementation. We're obviously big
               | fans.
               | 
               | Gandiva bindings is definitely something we should look
               | into, but I'm guessing there's much lower hanging fruit
               | within DataFusion in terms of optimizing, particularly
               | for our use case.
        
               | nevi-me wrote:
               | Thanks Paul :)
               | 
                | I think with compute functions/kernels there's plenty of
                | low-hanging fruit, so we'll be able to add a lot to Arrow
                | without yet needing Gandiva bindings. The Rust stdsimd
                | [0] work will also enable us to make better use of SIMD
                | in about a year from now (I hope).
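                | 
                | For context, the kind of vectorized kernel we mean
                | (shown here via PyArrow; the Rust crate exposes
                | analogous kernels):
                | 
                |   import pyarrow as pa
                |   import pyarrow.compute as pc
                | 
                |   a = pa.array([1, 2, 3], type=pa.int64())
                |   b = pa.array([10, 20, 30], type=pa.int64())
                |   pc.add(a, b)  # element-wise columnar kernel
                |   pc.sum(b)     # aggregate kernel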
               | 
               | [0] https://github.com/rust-lang/stdsimd
        
         | polskibus wrote:
          | Did you have to implement the your-format-to-Arrow
          | conversion/abstraction layer manually, or was it already
          | available? Could you give some pointers on how to roll your
          | own binary-format-to-Arrow query engine? Why didn't you use
          | Parquet for storage?
        
           | mhall119 wrote:
            | InfluxDB IOx will use Parquet files for persistent storage.
        
       | pauldix wrote:
       | InfluxDB creator here. I've actually been working on this one
       | myself and am excited to answer any questions. The project is
       | InfluxDB IOx (short for iron oxide, pronounced eye-ox). Among
       | other things, it's an in-memory columnar database with object
       | storage as the persistence layer.
        
         | Yoric wrote:
         | What's it best suited for?
        
         | rubiquity wrote:
         | In that sense, InfluxDB will start to have a decent amount of
         | overlap with something like Snowflake or Redshift.
        
           | mhall119 wrote:
            | Snowflake was actually specifically mentioned in the blog
            | post as an example of who else is doing this.
        
           | pauldix wrote:
           | Yes, this will likely be the case, but it's not a specific
           | goal. We're focusing our efforts right now on the InfluxDB
           | time series use cases we see most. More general data
           | warehousing work isn't our focus, but we expect to pick up
           | quite a bit of that along the way as we develop and as the
           | underlying Arrow tools develop.
        
       | sciurus wrote:
       | InfluxData is arguably playing catch-up with Thanos, Cortex, and
       | other scale-out Prometheus backends for the metrics use case.
       | Given that, I wonder why they decided to write a new storage
        | backend from scratch instead of building on the work Thanos and
       | Cortex have done. Those two competing projects are successfully
       | sharing a lot of code that allows all data to be stored in object
       | storage like S3.
       | 
       | https://grafana.com/blog/2020/07/29/how-blocks-storage-in-co...
        
         | throwawaytsdb wrote:
          | I agree, they appear to be playing catch-up on many fronts.
         | Notably, with Cortex, Tempo, and Loki, Grafana Labs seem to
         | have pulled way ahead in advancing a successful open-source
         | cloud observability strategy.
         | 
         | InfluxData have a long history of writing (and rewriting) their
         | own storage engines, so choosing to do it again is
         | unsurprising. I guess this sort of hints that the current
         | TSM/TSI have probably reached their performance and scalability
         | limits and will be EOL before too long.
         | 
         | What I find interesting is that this project is already almost
         | a year old and only has six contributors (two of whom look like
         | external contractors). It seems more like a fun side project
         | than the future core of the database that is supposed to be
         | deployed into production next year.
        
           | pauldix wrote:
           | I think the best new projects are created by a small focused
           | team. Adding too many people too early actually slows things
           | down. But, of course, I'm biased.
           | 
           | The thing about this getting to production next year is that
            | we're doing it in our cloud, which is a services-based system
           | where we can bring all sorts of operational tooling to bear.
           | Out of band backups, usage of cloud native services, shadow
           | serving, red/green deploys, and all sorts of things.
           | Basically, it's easier to deploy a service to production once
           | you've built a suite of operational tools to make it possible
           | to do reliably while testing under production workloads that
           | don't actually face the customer.
           | 
           | As for us rewriting the core of the database, that's true.
           | But I think you're unrealistic about what the data systems
           | look like in closed source SaaS providers as they advance
           | through orders of magnitude of scale. Hint: they rewrite and
           | redo their systems.
           | 
            | As for Grafana: MetricsTank was their first, Cortex wasn't
            | developed there, and Loki and Tempo look like interesting
            | projects.
           | 
           | None of those things has the exact same goal as InfluxDB. And
           | InfluxDB isn't meant to be open source DataDog. That's not
           | our thing. We want to be a platform for building time series
           | applications across many use cases, some of which might be
           | modern observability. It also doesn't preclude you from
           | pairing InfluxDB with those other tools.
        
           | mhall119 wrote:
           | To be clear, InfluxDB IOx will be using Apache Arrow,
           | DataFusion and Parquet, and contributing to those upstream
           | projects, not creating our own thing.
        
           | nemothekid wrote:
           | > _What I find interesting is that this project is already
           | almost a year old and only has six contributors_
           | 
            | I don't think this is a fair criticism. Postgres only has 7
            | core contributors, and InfluxDB is far less complex,
            | targeting a much simpler use case.
        
             | throwawaytsdb wrote:
             | PostgreSQL might have only seven people on the "core team",
             | but the active contributors list is much longer:
             | https://wiki.postgresql.org/wiki/Committers
             | 
             | Furthermore, hundreds of people have been responsible for
             | its development over the years:
             | https://www.postgresql.org/community/contributors/
             | 
             | As an additional, possibly more relevant example - the
             | Cortex project has 158 contributors:
             | https://github.com/cortexproject/cortex
             | 
             | My larger point is that, in contrast with the new project,
             | InfluxDB currently has ~400 contributors. I'm certain that
             | many dozens of those were involved in getting the current
             | storage engine to a stable place. And now that hard work is
             | on a path to being deprecated by moving to a completely new
             | language and set of underlying technologies.
             | 
             | Taking the project from a handful of contributors to a
             | production-ready technology within an existing ecosystem is
             | a non-trivial task. I'm sure it will come together
             | eventually, but the commitment to ship it "early next year"
             | seems unlikely to me.
        
               | pauldix wrote:
               | We'll be producing builds early next year. Those won't be
               | anything we're recommending for production. Our goal is
               | to have an early alpha in our own cloud environment by
               | the end of Q1. I stress alpha. But we'll also have a
               | bunch of tooling around it (which we've already built for
               | the other parts of our cloud product) to backup data out
               | of band, monitor it, shadow test it against production
               | workloads, etc.
               | 
               | We're also building on top of the work of a bunch of
               | others that have built Arrow and libraries within the
               | Rust ecosystem.
               | 
                | When the open source version GAs, I don't really know.
                | But we're doing this out in the open so people can see,
                | comment, and maybe even contribute. Who knows, maybe
                | after a few years you'll be a convert ;)
        
         | Thaxll wrote:
          | Also: https://github.com/VictoriaMetrics/VictoriaMetrics from
          | the guy that did FastHTTP (the fastest HTTP lib for Go).
        
         | pauldix wrote:
         | Those two systems are designed to work with Prometheus style
         | metrics, which are very specific. You have metrics, labels,
         | float values, and millisecond epochs.
         | 
          | I'm not totally sure how they index things, but I would guess
          | that it's by time series, with an inverted-index-style mapping
          | from the metric and label data to the underlying time series.
          | This means they'll have the same problems working with high-
          | cardinality data that I outlined in the blog post.
         | 
          | InfluxDB aims to hit a broader audience than just metrics. We
          | think the table model is great, particularly for event time
          | series, which we want to be best in class for. A columnar
          | database is better suited for analytics queries, and given the
          | right structure it's every bit as good for metrics queries.
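          | 
          | Roughly, that inverted-index layout looks like this (a hedged
          | sketch, not any project's actual code):
          | 
          |   from collections import defaultdict
          | 
          |   # Each "key=value" label pair maps to the set of series
          |   # IDs carrying it (a posting list).
          |   index = defaultdict(set)
          | 
          |   def index_series(series_id, labels):
          |       for key, value in labels.items():
          |           index[f"{key}={value}"].add(series_id)
          | 
          |   index_series(1, {"host": "a", "region": "us"})
          |   index_series(2, {"host": "b", "region": "us"})
          |   index["region=us"]  # -> {1, 2}
          | 
          | Every unique label value gets its own posting set, which is
          | why high cardinality makes this structure expensive.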
        
       | candiddevmike wrote:
       | What a ride. You're close to releasing Influx 2.0 without a clear
       | migration strategy for your customers, and then you think it's a
       | good idea to announce yet another storage rewrite? Why should
        | customers stick with you guys when you have a track record of
        | shipping half-baked software, rewriting it, and leaving people
        | out in the cold?
        
         | pauldix wrote:
         | Users of InfluxDB 1.x can upgrade to 2.0 today. It's an in-
         | place upgrade and just requires a quick command to make it
         | work. Further, InfluxDB 2.0 also has the API for InfluxDB 1.x.
         | We've been putting out InfluxDB 1.x releases while we've been
         | developing 2.0. One of the reasons we waited so long to
         | finalize GA for 2.0 was because we had to make sure there was a
         | clean migration path and there is.
         | 
          | Our cloud 1 customers will be able to upgrade to our cloud 2
          | offering, but in the meantime, their existing
         | installations get the same 24x7 coverage and service we've been
         | providing for years.
         | 
         | As for how this will be deployed, it will be a seamless
         | transition for our cloud customers when we do so. Data,
         | monitoring and analytics companies replace their backend data
         | planes multiple times over the course of their lifetime.
         | 
          | For our Enterprise customers, we'll provide an upgrade path,
          | but not until this product is mature enough to ship as an on-
          | premises binary, which will only get a chance to be upgraded a
          | few times a year.
         | 
          | The only difference here is that we're doing it as open
          | source, while those companies always do theirs behind closed
          | doors. I'm sure most of our users and many of our customers
          | prefer our open source approach.
        
         | mhall119 wrote:
          | InfluxDB 1.x users can upgrade to InfluxDB 2.0 pretty easily;
          | there is an `upgrade` command that will convert your metadata
          | from 1.x to 2.0.
         | 
         | Your time series data doesn't even need to be touched, it'll
         | "just work" after the upgrade.
        
           | candiddevmike wrote:
           | This is different than the previous `influxd migrate` command
           | that never seemed to work, right?
        
             | mhall119 wrote:
             | For a while the 2.0 development branch was using a
             | different storage file format than 1.x, which required
             | migrating your time series data.
             | 
              | But by the 2.0 Release Candidate that was reverted so that
              | it uses the same file format as 1.x, and the 2.0
              | functionality was backported onto it, so the upgrade path
              | from 1.x to 2.0 is much simpler now than it was going to be.
        
       | jdub wrote:
       | I didn't see it in the post, but a huge amount of this is due to
       | Andy Grove's work on Rust implementations of Apache Arrow and
       | DataFusion.
       | 
       | I imagine he's happy where he is, but I hope there's some
       | opportunity for InfluxDB to give credit and support for his great
       | work.
        
         | pauldix wrote:
          | It absolutely is, and we've been contributing back. An even
          | bigger part of this is based on Wes McKinney's work on Arrow.
         | Andy is great and he's been helpful as we've been working with
         | DataFusion.
        
       | gizmodo59 wrote:
       | Another SQL engine on data lake that heavily uses arrow is
       | Dremio.
       | 
       | https://www.dremio.com/webinars/apache-arrow-calcite-parquet...
       | 
       | https://github.com/dremio/dremio-oss
       | 
       | If you have parquet on S3 using an engine like Dremio can give
       | you good results. Some key innovations in open source on data
       | analytics:
       | 
        | Arrow - Columnar in-memory format
        | Gandiva - LLVM-based execution kernel
        | Arrow Flight - Wire protocol based on Arrow
        | Project Nessie - A git-like workflow for data lakes
       | 
       | https://arrow.apache.org/
       | https://arrow.apache.org/docs/format/Flight.html
       | https://arrow.apache.org/blog/2018/12/05/gandiva-donation/
       | https://github.com/projectnessie/nessie
        
         | StreamBright wrote:
          | What service that uses Apache Arrow could I replace Athena /
          | PrestoDB with?
        
           | gizmodo59 wrote:
            | Looking at Dremio's website, they seem to be a good
            | competitor to Presto/Athena for some use cases.
            | 
            | Alternative solutions depend on your use case. If it's about
            | querying S3 data, then Dremio/Athena/Presto are good.
        
       | jaymebb wrote:
        | Very cool! I'm curious how far Influx will then move into being
        | a general-purpose columnar database system outside of typical
        | time-series workloads, becoming more of a general-purpose OLAP
        | DB for analytical "data science" workloads?
       | 
       | Will there be any type of transactional guarantees (ACID) using
       | MVCC or similar?
       | 
       | Is the execution engine vectorised?
        
         | pauldix wrote:
         | Execution is vectorized, but that's Arrow really. We'd like
         | this to be useful for general OLAP workloads, but the focus for
         | the next year is definitely going to be on our bread and butter
         | time series stuff.
         | 
          | That being said, Arrow Flight will be a first-class RPC
          | mechanism, which makes it quite nice for data science stuff,
          | as you can get data into a dataframe in Pandas or R in a few
          | lines of code with almost zero serialization/deserialization
          | overhead.
         | 
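          | For example (a minimal sketch, with a hypothetical server
          | address and ticket):
          | 
          |   import pyarrow.flight as flight
          | 
          |   client = flight.connect("grpc://localhost:8082")
          |   reader = client.do_get(flight.Ticket(b"SELECT * FROM cpu"))
          |   df = reader.read_all().to_pandas()  # near-zero ser/de cost
          | 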
         | This isn't meant to be a transactional system. More like a data
         | processing system for data in object storage. I'm curious what
         | your need is there for OLAP workloads, can you tell me more?
        
       ___________________________________________________________________
       (page generated 2020-11-10 23:02 UTC)