[HN Gopher] IOx: InfluxData's New Storage Engine
___________________________________________________________________
Author : resizeitplz
Score : 129 points
Date : 2022-10-26 14:51 UTC (8 hours ago)
(HTM) web link (www.influxdata.com)
(TXT) w3m dump (www.influxdata.com)
| toinbis wrote:
| Happy longtime InfluxDB user here. I wanted to congratulate Paul and the team on reaching this milestone. Followed IOx development a bit - can't wait to finally test it out!
| michael_j_ward wrote:
| Just want to say congratulations to the team!
|
| 2 years and 9,500+ commits is a hell of a feat.
|
| https://github.com/influxdata/influxdb_iox
| okay_dude_q wrote:
| mildbyte wrote:
| Just wanted to also give a shout out to Apache DataFusion [0] that IOx relies on a lot (and contributes to as well!).
|
| It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.
|
| It has very good single-node performance (there's even a way to compile it with SIMD support) and Ballista [1] extends that to build it into a distributed query engine.
|
| Plenty of other projects use it besides IOx, including VegaFusion, ROAPI, and Cube.js's pre-aggregation store. We're heavily using it to build Seafowl [2], an analytical database that's optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).
|
| [0] https://github.com/apache/arrow-datafusion
|
| [1] https://github.com/apache/arrow-ballista
|
| [2] https://github.com/splitgraph/seafowl
| pauldix wrote:
| DataFusion is great, we're happy to be contributing to it. Also excited to see so many people around the world picking it up and contributing as well.
| With our development efforts on IOx, it's like a strong tailwind. But we put a ton of effort into helping manage community efforts (thanks, alamb, our developer on IOx who is also on the Arrow PMC!).
| andygrove wrote:
| Original author of DataFusion/Ballista here. Having alamb and others from InfluxData involved has been a huge help in driving the project forward and helping build an active community behind the project. It is genuinely hard to keep up with the momentum these days!
| menaerus wrote:
| Hi, I just had a glance over the DataFusion project. Very interesting work out there, which I will definitely be keeping track of, but I've got a genuine question. Do you sometimes find development in Rust a little bit challenging for large-scale and performance-sensitive work?
|
| I say this because I've noticed several PRs fixing (large) performance regressions which, to my understanding, were mostly introduced by unforeseen or unexpected Rust compiler subtleties that led to less than optimal code generation. One example of such an event was a naive, simple-looking abstraction that brought performance down by something like 50% in TPC-H benchmarks. This really struck me, especially because it seems quite hard to identify the root cause, and I would like to hear about your experiences first hand. Thanks a bunch!
| nevi-me wrote:
| Your initial experiments and decision to build on arrow-rs have been great for the project. Thank you and everyone involved.
| [deleted]
| ignoramous wrote:
| > _We're heavily using it to build Seafowl, an analytical database that's optimized for running SQL queries directly from the user's browser..._
|
| Interesting. Where does _seafowl_ fit in when I compare it with, say, a data-stack-in-a-box approach, e.g. meltano + dbt + duckdb + superset [0]?
| Is my thinking right that _seafowl_ possibly replaces both _duckdb_ (with IOx) and superset (if there's a web front-end)?
|
| Incidentally, dagster had an article up just yesterday making a case for a poor-man's datalake with dbt + dagster + duckdb [1]. What does _splitgraph_ replace if I were to use _it_ in a similar setup?
|
| Thanks.
|
| [0] https://archive.is/DxU1e
|
| [1] https://archive.is/5ikU4
| mildbyte wrote:
| Great question! With Seafowl, the idea is different from what the modern data stack addresses. It's trying to simplify public-facing Web-based visualizations: apps that need to run analytical queries on large datasets and can be accessed by users all around the world. This is why we made the query API easily cacheable by CDNs and Seafowl itself easy to deploy at the edge, e.g. with Fly.io.
|
| It's a fairly different use case from DuckDB (query execution for Web applications vs fast embedded analytical database for notebooks) and the rest of the modern data stack (which mostly is about analytics internal to a company). Just to clarify, we're not related to IOx directly (only via us both using Apache DataFusion).
|
| If we had to place Seafowl _inside_ of the modern data stack, it'd be mostly a warehouse, but one that is optimized for being queried from the Internet, rather than by a limited set of internal users. Or, a potential use case could be extracting internal data from your warehouse to Seafowl in order to build public applications that use it.
|
| We don't currently ship a Web front-end and so can't serve as a replacement for Superset: it's exposed to the developer as an HTTP API that can be queried directly from the end user's Web browser. But we have some ideas around a frontend component: some kind of a middleware, where the Web app can pre-declare the queries it will need to run at build time and we can compute some pre-aggregations to speed those up at runtime.
| Currently we recommend querying it with Observable [0] for an end-to-end query + visualization experience (or use a different viz library like d3/Vega).
|
| Re: the second question about Splitgraph for a data lake, the intention behind Splitgraph is to orchestrate all those tools, and there the use case is indeed the modern data stack in a box. It's kind of similar to dbt Labs's Sinter [1], which was supposed to be the end-to-end data platform before they focused on dbt and dbt Cloud instead: being able to run Airbyte ingestion, dbt transformations, be a data warehouse (using PostgreSQL and a columnar store extension), let users organize and discover data at the same time. There's a lot of baggage in Splitgraph though, as we moved through a few iterations of the product (first Git/Docker for data, then a platform for the modern data stack). Currently we're thinking about how to best integrate Splitgraph and Seafowl in order to build a managed pay-as-you-go Seafowl, kind of like Fauna [2] for analytics.
|
| Hope this helps!
|
| [0] https://observablehq.com/@seafowl/interactive-visualization-...
|
| [1] https://www.getdbt.com/blog/whats-in-a-name/
|
| [2] https://fauna.com/
| PaulWaldman wrote:
| > Unbounded cardinality
|
| This has been the largest criticism of InfluxDB in the past. Kudos to the team for acknowledging and solving it!
|
| > IOx supports SQL natively and our cloud customers can connect using Postgres-compatible clients like psql, Grafana's Postgres data source, and BI tools like PowerBI and Tableau.
|
| Initially InfluxDB had InfluxQL, a SQL-like language for querying data. Then they transitioned to Flux, indicating it was superior to writing complex SQL queries over time series data. Now they are highlighting native SQL support. Since this was only announced today, hopefully there will be clear messaging on which query languages will be supported going forward.
|
| It's also worth noting that queries can be executed over an HTTP API that platforms like PowerBI can consume today.
|
| > First introduced in 2020 as the open source project InfluxDB IOx, the new storage engine is the product of sustained development by InfluxData and considerable contribution from the InfluxDB open source developer community. Today, the new engine based on IOx arrives first in InfluxData's multi-tenant InfluxDB Cloud service, available to developers worldwide.
|
| Will this later be available in an OSS package for self-hosting?
| pauldix wrote:
| Hi, post author and founder of InfluxDB here. We're supporting Flux (our scripting and query language), InfluxQL (our original SQL-like language), and SQL (specifically the Postgres dialect as that's what DataFusion supports). The query engine is DataFusion, which is part of the Apache Arrow project. We contribute to it significantly. So that's what's built in natively. We support Flux and InfluxQL through separate Go processes that use an API to connect to the core DB, although we're working on native InfluxQL support (it's a Rust-based InfluxQL parser that will yield DataFusion logical query plans).
|
| Right now we're focused on our cloud offering. We'll have official open source releases and documentation in the future.
| minhazm wrote:
| The SQL support is likely because they're using DataFusion which already has pretty good SQL support, so it's sort of "free".
|
| https://arrow.apache.org/datafusion/user-guide/sql/sql_statu...
| alamb wrote:
| Author here -- it is "free" in the sense that all the effort we put into DataFusion flows directly into IOx. But we do put a lot of effort into DataFusion.
| minhazm wrote:
| I didn't mean to imply it's free as in no effort goes into it. Just that the underlying library provides it, so it's less effort on top of the already significant effort going into DataFusion itself.
| alamb wrote:
| Ah -- got it!
| This is the beauty of aligning ourselves with technologies like Arrow, Parquet and DataFusion. We can share as well as benefit from the efforts of the broader community.
| _peter_ wrote:
| Isn't InfluxDB rewriting their storage engine for the nth time? It makes me have a little less faith in their project to be honest.
| mhall119 wrote:
| The original TSM engine is still used by InfluxDB v2 OSS.
|
| The InfluxDB Cloud platform uses a variation of TSM that's tailored for a distributed SaaS rather than stand-alone nodes (this was originally intended to be used in InfluxDB v2 OSS as well, but alpha-testing showed that the old engine performed better there so it ultimately was reverted for the beta release).
|
| So IOx is really the first major new storage engine in InfluxDB.
| [deleted]
| dgnorton wrote:
| Member of the engineering team here - I would break the history into 3 phases:
|
| 1) Alpha / Beta phase where we experimented with several off-the-shelf key-value stores (RocksDB, LevelDB, & BoltDB). During this early phase, we learned from observing a wide variety of workloads / use-cases that we needed a custom-built engine to achieve our early performance goals. But using these off-the-shelf key-value stores allowed our (at the time) very small team to focus on developing a useful beta product and gathering user feedback.
|
| 2) TSM storage engine for 1.0 - Developed from scratch based on our learnings from phase 1, this was the first production storage engine that shipped with 1.0 in 2016 and carried us through 2.0. It served as the workhorse for 3 - 4 years as both the number of users and size of their workloads skyrocketed, eventually bumping into architectural limits of TSM.
|
| 3) IOx - equipped with a larger engineering team and years of experience with a wide variety of workloads and use-cases, IOx was developed for the rapidly growing time series workloads that users need to handle.
| c4wrd wrote:
| I would argue the other way and praise them for the storage engine changes. Each iteration has had drawbacks, but based on the real-world reported usage they've made decisions to better support what customers are asking for and actually running into, as opposed to trying to iterate on the same engine over and over and making assumptions of real-world usage. Sure, there are drawbacks, but at the end of the day they're continuing to make good improvements for their customers.
| mrsun wrote:
| Will InfluxDB IOx eventually replace InfluxDB v2?
| mhall119 wrote:
| IOx is the data storage layer. It will replace the current TSM data storage system in InfluxDB, but it won't replace InfluxDB as a whole.
| digerata wrote:
| Personally, very excited to see this happening. Huge congrats!
|
| Some constructive criticism around naming... You don't have to have Flux in every single damn thing you create!
|
| InfluxDB IOx is not replacing InfluxDB v2 because... It's just a new storage engine.
|
| For querying we have Flux or InfluxQL...
| otoolep wrote:
| Congrats to the team at InfluxDB - great to see this released.
___________________________________________________________________
(page generated 2022-10-26 23:00 UTC)