hngopher.com

       [HN Gopher] IOx: InfluxData's New Storage Engine
       ___________________________________________________________________
        
       Author : resizeitplz
       Score  : 129 points
       Date   : 2022-10-26 14:51 UTC (8 hours ago)
        
 (HTM) web link (www.influxdata.com)
 (TXT) w3m dump (www.influxdata.com)
        
       | toinbis wrote:
       | Happy longtime InfluxDB user here. I wanted to congratulate Paul
       | and the team on reaching this milestone. Followed IOx development
       | a bit - can't wait to finally test it out!
        
       | michael_j_ward wrote:
       | Just want to say congratulations to the team!
       | 
       | 2 years and 9,500+ commits is a hell of a feat.
       | 
       | https://github.com/influxdata/influxdb_iox
        
       | mildbyte wrote:
       | Just wanted to also give a shout out to Apache DataFusion[0] that
       | IOx relies on a lot (and contributes to as well!).
       | 
       | It's a framework for writing query engines in Rust that takes
       | care of a lot of heavy lifting around parsing SQL, type casting,
       | constructing and transforming query plans and optimizing them.
       | It's pluggable, making it easy to write custom data sources,
       | optimizer rules, query nodes etc.
       | 
       | It has very good single-node performance (there's even a way to
       | compile it with SIMD support) and Ballista [1] extends that to
       | build it into a distributed query engine.
       | 
       | Plenty of other projects use it besides IOx, including
       | VegaFusion, ROAPI, and Cube.js's pre-aggregation store. We're
       | heavily using it to build Seafowl [2], an analytical database
       | that's optimized for running SQL queries directly from the
       | user's browser (caching, CDNs, low latency, some WASM support,
       | all that fun stuff).
       | 
       | [0] https://github.com/apache/arrow-datafusion
       | 
       | [1] https://github.com/apache/arrow-ballista
       | 
       | [2] https://github.com/splitgraph/seafowl
        
         | pauldix wrote:
         | DataFusion is great, we're happy to be contributing to it. Also
         | excited to see so many people around the world picking it up
         | and contributing as well. With our development efforts on IOx,
         | it's like a strong tailwind. But we put a ton of effort into
         | helping manage community efforts (thanks to alamb, our developer
         | on IOx who is also on the Arrow PMC!).
        
           | andygrove wrote:
           | Original author of DataFusion/Ballista here. Having alamb and
           | others from InfluxData involved has been a huge help in
           | driving the project forward and helping build an active
           | community behind the project. It is genuinely hard to keep up
           | with the momentum these days!
        
             | menaerus wrote:
             | Hi, I just had a glance over the DataFusion project. Very
             | interesting work, which I will definitely be keeping track
             | of, but I've got a genuine question. Do you sometimes find
             | development in Rust a little bit challenging for large-scale
             | and performance-sensitive work?
             | 
             | I say this because I've noticed quite a few PRs fixing
             | (large) performance regressions which, to my understanding,
             | were mostly introduced by unforeseen or unexpected Rust
             | compiler subtleties that led to less than optimal code
             | generation. One example was a naive, simple-looking
             | abstraction that brought performance down by something like
             | 50% in TPC-H benchmarks. This really struck me, especially
             | because the root cause seems quite hard to identify, and I
             | would like to hear about your experiences first hand. Thanks
             | a bunch!
        
           | nevi-me wrote:
           | Your initial experiments and decision to build on arrow-rs
           | have been great for the project. Thank you and everyone
           | involved.
        
         | ignoramous wrote:
         | > _We're heavily using it to build Seafowl, an analytical
         | database that's optimized for running SQL queries directly from
         | the user's browser..._
         | 
         | Interesting. Where does _seafowl_ fit in when I compare it
         | with, say, data-stack-in-a-box approach, for ex: meltano + dbt
         | + duckdb + superset [0]? Is my thinking right that _seafowl_
         | possibly replaces both _duckdb_ (with IOx) and superset (if
         | there's a web front-end)?
         | 
         | Incidentally, dagster had an article up just yesterday making a
         | case for poor-man's datalake with dbt + dagster + duckdb [1].
         | What does _splitgraph_ replace if I were to use _it_ in a
         | similar setup?
         | 
         | Thanks.
         | 
         | [0] https://archive.is/DxU1e
         | 
         | [1] https://archive.is/5ikU4
        
           | mildbyte wrote:
           | Great question! With Seafowl, the idea is different from what
           | the modern data stack addresses. It's trying to simplify
           | public-facing Web-based visualizations: apps that need to run
           | analytical queries on large datasets and can be accessed by
           | users all around the world. This is why we made the query API
           | easily cacheable by CDNs and Seafowl itself easy to deploy at
           | the edge, e.g. with Fly.io.
           | 
           | It's a fairly different use case from DuckDB (query execution
           | for Web applications vs fast embedded analytical database for
           | notebooks) and the rest of the modern data stack (which
           | mostly is about analytics internal to a company). Just to
           | clarify, we're not related to IOx directly (only via us both
           | using Apache DataFusion).
           | 
           | If we had to place Seafowl _inside_ of the modern data stack,
           | it'd be mostly a warehouse, but one that is optimized for
           | being queried from the Internet, rather than by a limited set
           | of internal users. Or, a potential use case could be
           | extracting internal data from your warehouse to Seafowl in
           | order to build public applications that use it.
           | 
           | We don't currently ship a Web front-end and so can't serve as
           | a replacement to Superset: it's exposed to the developer as
           | an HTTP API that can be queried directly from the end user's
           | Web browser. But we have some ideas around a frontend
           | component: some kind of a middleware, where the Web app can
           | pre-declare the queries it will need to run at build time and
           | we can compute some pre-aggregations to speed those up at
           | runtime. Currently we recommend querying it with Observable
           | [0] for an end-to-end query + visualization experience (or
           | use a different viz library like d3/Vega).
           | 
           | Re: the second question about Splitgraph for a data lake, the
           | intention behind Splitgraph is to orchestrate all those tools
           | and there the use case is indeed the modern data stack in a
           | box. It's kind of similar to dbt Labs's Sinter [1] which was
           | supposed to be the end-to-end data platform before they
           | focused on dbt and dbt Cloud instead: being able to run
           | Airbyte ingestion and dbt transformations, act as a data
           | warehouse (using PostgreSQL and a columnar store extension),
           | and let users organize and discover data, all at the same
           | time. There's a lot of
           | baggage in Splitgraph though, as we moved through a few
           | iterations of the product (first Git/Docker for data, then a
           | platform for the modern data stack). Currently we're thinking
           | about how to best integrate Splitgraph and Seafowl in order
           | to build a managed pay-as-you-go Seafowl, kind of like Fauna
           | [2] for analytics.
           | 
           | Hope this helps!
           | 
           | [0] https://observablehq.com/@seafowl/interactive-visualization-...
           | 
           | [1] https://www.getdbt.com/blog/whats-in-a-name/
           | 
           | [2] https://fauna.com/
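
On the point above about making a query API easily cacheable by CDNs: a minimal sketch of one way this can work, assuming a GET endpoint whose path is a hash of the query text. The `/q/<digest>` path, the `example.com` host, and the function name are hypothetical illustrations, not Seafowl's documented API. The idea is only that identical queries map to identical URLs, which is what lets an HTTP cache or CDN serve repeated requests without reaching the database.

```python
import hashlib


def cacheable_query_url(base_url: str, sql: str) -> str:
    """Build a GET URL whose path is the SHA-256 digest of the query text.

    Hypothetical scheme: the same SQL always yields the same URL, so an
    HTTP cache in front of the database can key on the URL alone.
    """
    # Strip surrounding whitespace only; inner whitespace is left alone
    # because it may be significant inside string literals.
    digest = hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()
    return f"{base_url.rstrip('/')}/q/{digest}"


# The same query text always produces the same URL (and thus one cache entry).
url_a = cacheable_query_url("https://example.com", "SELECT 1")
url_b = cacheable_query_url("https://example.com/", "  SELECT 1\n")
print(url_a == url_b)
```

In a scheme like this the query text itself still has to reach the server somehow (for example in a request header, or by registering queries ahead of time); the sketch covers only the cache key.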
        
       | PaulWaldman wrote:
       | > Unbounded cardinality
       | 
       | This has been the largest criticism of InfluxDB in the past.
       | Kudos to the team for acknowledging and solving it!
       | 
       | > IOx supports SQL natively and our cloud customers can connect
       | using Postgres-compatible clients like psql, Grafana's Postgres
       | data source, and BI tools like PowerBI and Tableau.
       | 
       | Initially InfluxDB had InfluxQL, a SQL-like language for querying
       | data. Then they transitioned to Flux, indicating it was superior
       | to writing complex SQL queries over time series data. Now they
       | are highlighting native SQL support. Since this was only
       | announced today, hopefully there will be clear messaging on which
       | query languages will be supported going forward.
       | 
       | It's also worth noting that queries can be executed over an
       | HTTP API that platforms like PowerBI can consume today.
       | 
       | > First introduced in 2020 as the open source project InfluxDB
       | IOx, the new storage engine is the product of sustained
       | development by InfluxData and considerable contribution from the
       | InfluxDB open source developer community. Today, the new engine
       | based on IOx arrives first in InfluxData's multi-tenant InfluxDB
       | Cloud service, available to developers worldwide.
       | 
       | Will this later be available in an OSS package for self-hosting?
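
For readers unfamiliar with the cardinality criticism mentioned above: a minimal sketch, assuming the usual definition of series cardinality as the number of distinct measurement + tag set combinations. The measurement, tag names, and workloads below are made up for illustration, and the code says nothing about how IOx itself stores or indexes data. It only shows why a tag with unbounded values (such as a per-request ID) makes the series count grow with the data.

```python
from itertools import product


def series_cardinality(points):
    """Count distinct (measurement, tag set) combinations among points.

    Each point is a (measurement, tags) pair, where tags is a dict.
    """
    return len({(m, frozenset(tags.items())) for m, tags in points})


# Hypothetical bounded workload: 3 hosts x 2 regions.
hosts = ["a", "b", "c"]
regions = ["eu", "us"]
bounded = [("cpu", {"host": h, "region": r}) for h, r in product(hosts, regions)]

# Hypothetical unbounded workload: every point carries a unique request_id tag.
unbounded = [
    ("cpu", {"host": "a", "region": "eu", "request_id": str(i)})
    for i in range(1000)
]

print(series_cardinality(bounded))    # 6 series, fixed by the tag values
print(series_cardinality(unbounded))  # 1000 series, one per point written
```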
        
         | pauldix wrote:
         | Hi, post author and founder of InfluxDB here. We're supporting
         | Flux (our scripting and query language), InfluxQL (our original
         | SQL-like language), and SQL (specifically the Postgres dialect
         | as that's what DataFusion supports). The query engine is
         | DataFusion, which is part of the Apache Arrow project. We
         | contribute to it significantly. So that's what's built in
         | natively. We support Flux and InfluxQL through separate Go
         | processes that use an API to connect to the core DB, although
         | we're working on native InfluxQL support (a Rust-based
         | InfluxQL parser that will yield DataFusion logical query
         | plans).
         | 
         | Right now we're focused on our cloud offering. We'll have
         | official open source releases and documentation in the future.
        
         | minhazm wrote:
         | The SQL support is likely because they're using DataFusion
         | which already has pretty good SQL support, so it's sort of
         | "free".
         | 
         | https://arrow.apache.org/datafusion/user-guide/sql/sql_statu...
        
           | alamb wrote:
           | Author here -- it is "free" in the sense that all the effort
           | we put into DataFusion flows directly into IOx. But we do put
           | a lot of effort into DataFusion.
        
             | minhazm wrote:
             | I didn't mean to imply it's free as in no effort goes into it.
             | Just that the underlying library provides it so it's less
             | effort on top of the already significant effort going into
             | DataFusion itself.
        
               | alamb wrote:
               | Ah -- got it! This is the beauty of aligning ourselves
               | with technologies like Arrow, Parquet and DataFusion. We
               | can share as well as benefit from the efforts of the
               | broader community.
        
       | _peter_ wrote:
       | Isn't InfluxDB rewriting their storage engine for the nth time?
       | It makes me have a little less faith in their project to be
       | honest.
        
         | mhall119 wrote:
         | The original TSM engine is still used by InfluxDB v2 OSS.
         | 
         | The InfluxDB Cloud platform uses a variation of TSM that's
         | tailored for a distributed SaaS rather than stand-alone nodes
         | (this was originally intended to be used in InfluxDB v2 OSS as
         | well, but alpha-testing showed that the old engine performed
         | better there so it ultimately was reverted for the beta
         | release).
         | 
         | So IOx is really the first major new storage engine in
         | InfluxDB.
        
         | dgnorton wrote:
         | Member of the engineering team here - I would break the history
         | into 3 phases:
         | 
         | 1) Alpha / Beta phase where we experimented with several off-
         | the-shelf key-value stores (RocksDB, LevelDB, & BoltDB). During
         | this early phase, we learned from observing a wide variety of
         | workloads / use-cases that we needed a custom built engine to
         | achieve our early performance goals. But, using these off-the-
         | shelf key-value stores allowed our (at the time) very small
         | team to focus on developing a useful beta product and gathering
         | user feedback.
         | 
         | 2) TSM storage engine for 1.0 - Developed from scratch based on
         | our learnings from phase 1, this was the first production
         | storage engine that shipped with 1.0 in 2016 and carried us
         | through 2.0. It served as the workhorse for 3 - 4 years as both
         | the number of users and size of their workloads skyrocketed,
         | eventually bumping into architectural limits of TSM.
         | 
         | 3) IOx - equipped with a larger engineering team and years of
         | experience with a wide variety of workloads and use-cases, IOx
         | was developed to handle users' rapidly growing time series
         | workloads.
        
         | c4wrd wrote:
         | I would argue the other way and praise them for the storage
         | engine changes. Each iteration has had drawbacks, but based on
         | the real-world reported usage they've made decisions to better
         | support what customers are asking for and actually running
         | into, as opposed to trying to iterate on the same engine over
         | and over and making assumptions about real-world usage. Sure,
         | there are drawbacks, but at the end of the day they're
         | continuing to make good improvements for their customers.
        
       | mrsun wrote:
       | Will InfluxDB IOx eventually replace InfluxDB v2?
        
         | mhall119 wrote:
         | IOx is the data storage layer. It will replace the current TSM
         | data storage system in InfluxDB, but it won't replace InfluxDB
         | as a whole.
        
           | digerata wrote:
           | Personally, very excited to see this happening. Huge
           | congrats!
           | 
           | Some constructive criticism around naming... You don't have
           | to have Flux in every single damn thing you create!
           | 
           | InfluxDB IOx is not replacing InfluxDB v2 because... It's
           | just a new storage engine.
           | 
           | For querying we have Flux or InfluxQL...
        
       | otoolep wrote:
       | Congrats to the team at InfluxDB - great to see this released.
        
       ___________________________________________________________________
       (page generated 2022-10-26 23:00 UTC)