[HN Gopher] Materialize Raises a $32M Series B
       ___________________________________________________________________
        
       Materialize Raises a $32M Series B
        
       Author : austinbirch
       Score  : 156 points
       Date   : 2020-12-02 15:55 UTC (7 hours ago)
        
 (HTM) web link (materialize.com)
 (TXT) w3m dump (materialize.com)
        
       | jstrong wrote:
        | Congrats to Frank McSherry and the rest of the Materialize team!
        | Very impressed by your project.
        
       | npiit wrote:
        | I wonder if the BSL will become the new standard for commercial
        | open source products. It's a good trade-off between freedom and
        | real-world business pressure.
        
         | nickstinemates wrote:
         | Doubt it. Lots of aversion to the license given its limited use
         | and some ambiguous terms/education around the various windows.
        
           | npiit wrote:
            | I think it can be; I know of a few other potentially
            | successful examples, like CockroachDB and ZeroTier. The BSL
            | makes the entire project basically FOSS for you and me, but
            | not for the big sharks, which I guess is much better for the
            | world than open-core and, of course, proprietary SaaS.
        
       | [deleted]
        
       | mrits wrote:
        | The headline refers to "incrementally updated materialized
        | views". How does a company get funding for a feature that has
        | already existed in other DBs for at least a decade?
        | 
        | E.g., Vertica refers to this as Live Aggregate Projections.
        | 
        | It's a cool concept but comes with huge caveats. Keeping track
        | of non-estimated cardinality for COUNT DISTINCT-type queries,
        | for example.
        
         | benesch wrote:
         | (Disclaimer: I'm one of the engineers at Materialize.)
         | 
         | > How does a company get funding for a feature that has already
         | existed in other DBs for at least a decade? ... It's a cool
         | concept but comes with huge caveats.
         | 
         | I think you answered your own question here. Incrementally-
         | maintained views in existing database systems typically come
         | with huge caveats. In Materialize, they largely don't.
         | 
         | Most other systems place severe restrictions on the kind of
         | queries that can be incrementally maintained, limiting the
         | queries to certain functions only, or aggregations only, or
         | only queries without joins--or if they do support maintaining
         | joins, often the joins must occur only on the involved tables'
         | keys. In Materialize, by contrast, there are approximately no
         | such restrictions. Want to incrementally-maintain a five-way
         | join where some of the join keys are expressions, not key
         | columns? No problem.
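          | 
          | For instance, something like this five-way join, with one join
          | key that is an expression, can be maintained as a view (a
          | sketch against a hypothetical schema):
          | 
          |     CREATE MATERIALIZED VIEW order_enrichment AS
          |       SELECT o.order_id, c.name AS customer, p.category,
          |              r.region_name,
          |              SUM(li.quantity * li.unit_price) AS order_total
          |       FROM orders o
          |       JOIN customers c ON o.customer_id = c.id
          |       JOIN line_items li ON li.order_id = o.order_id
          |       JOIN products p ON p.id = li.product_id
          |       JOIN regions r ON r.region_name = upper(c.region)
          |       GROUP BY o.order_id, c.name, p.category, r.region_name;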
         | 
          | That's not to say there aren't _some_ caveats. We don't yet
         | have a good story for incrementally-maintaining queries that
         | observe the current wall-clock time [0]. And our query
         | optimizer is still young (optimization of streaming queries is
         | a rather open research problem), so for some more complicated
         | queries you may not get the resource utilization you want out
         | of the box.
         | 
         | But, for many queries of impressive complexity, Materialize can
         | incrementally-maintain results far faster than competing
         | products--if those products can incrementally maintain those
         | queries at all.
         | 
         | The technology that makes Materialize special, in our opinion,
         | is a novel incremental-compute framework called differential
         | dataflow. There was an extensive HN discussion on the subject a
         | while back that you might be interested in [1].
         | 
         | [0]: https://github.com/MaterializeInc/materialize/issues/2439
         | 
         | [1]: https://news.ycombinator.com/item?id=22359769
        
           | Fede_V wrote:
           | This is one of my favorite types of HN comments: admits the
           | bias upfront, offers a meaningful technical answer, and links
           | to relevant documents for a deeper dive. Thank you so much!
        
           | mrits wrote:
           | Thanks for the explanation. I'm going to look more into this
           | as I'm working on a new service on top of Vertica. There is a
            | lot I don't like about Vertica, and I don't see alternatives
            | such as Snowflake as much of an improvement.
        
           | jamesblonde wrote:
           | What about the other big problem ignored here: does your
           | streaming platform separate compute and storage?
           | 
            | Because GCP Dataflow does. Flink doesn't. Dataflow lets you
            | elastically scale the compute you need (as do Snowflake and
            | Databricks). If you can't do that, materialized views will be
            | a more niche feature for bigger 24x7 deployments with
            | predictable workloads.
        
             | albertwang wrote:
             | As George points out above, we haven't added our native
              | persistence layer yet. Consistency guarantees are something
              | we care a lot about, so for many scenarios, we leverage the
             | upstream datastore (often Kafka).
             | 
             | But to answer your question, yes, our intention is to
             | support separate cloud-native storage layers.
        
             | jacques_chester wrote:
             | My dim and distant recollection is that Beam and/or GCP
             | Data Flow require someone to implement PCollections and
             | PTransforms to get the benefit of that magic. That's not a
             | trivial exercise, compared to writing SQL.
        
         | frankmcsherry wrote:
         | Hi, I work at Materialize.
         | 
         | You can read about Vertica's "Live Aggregate Projections" here:
         | 
         | https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/An...
         | 
         | In particular, there are important constraints like (among
         | others)
         | 
         | > The projections can reference only one table.
         | 
         | In Materialize you can spin up just about any SQL92 query, join
         | eight relations together, have correlated subqueries, count
         | distinct if you want. It is then all maintained incrementally.
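          | 
          | For example (a sketch against a hypothetical schema), an exact
          | COUNT(DISTINCT ...) is just declared and then kept up to date:
          | 
          |     CREATE MATERIALIZED VIEW active_users_by_region AS
          |       SELECT region, COUNT(DISTINCT user_id) AS users
          |       FROM events
          |       GROUP BY region;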
         | 
         | The lack of caveats is the main difference from the existing
         | systems.
        
         | jacques_chester wrote:
         | > _The headline refers to "incrementally updated materialize
         | views". How does a company get funding for a feature that has
         | already existed in other DBs for at least a decade?_
         | 
         | They're getting funding for doing it _much_ more efficiently.
         | 
          | I read through the background papers when it first popped up. This
         | is legitimate, deep computer science that other DBs don't yet
         | have.
        
         | hnmullany wrote:
        | Materialize is the real deal - a completely different
        | architecture under the hood. It originates from the Timely
        | Dataflow project and Naiad.
         | 
         | https://docs.rs/timely/0.11.1/timely/
        
       | acjohnson55 wrote:
       | I'm so psyched about Materialize.
       | 
        | An old coworker explained to me how his previous company used
        | DBT to create many different projections of messy data to
       | serve many applications, rather than trying to come up with the
       | One Canonical Representation. It truly blew my mind in terms of
       | thinking about how to model data within a business.
       | 
       | The huge limitation with this vision is that it only works in
       | places where you can tolerate some pretty significant staleness.
       | So the promise of this approach excludes most OLTP applications.
       | I simply assumed it wouldn't be reasonable to create something
       | that allows for unconstrained SQL-based transformations in real
       | time, and that no one was working on this. Oh well.
       | 
       | But several months back, I discovered Materialize and it was an
       | "oh shit" moment. Someone was actually doing this, and in a very
       | first principles-driven approach. I'm really excited for how this
       | project evolves.
        
       | coinwitcher wrote:
       | This is interesting given what AWS just announced (AWS Glue
       | Elastic Views):
       | 
       | https://news.ycombinator.com/item?id=25267734
        
       | georgewfraser wrote:
        | Materialize has tackled the hardest problem in data warehousing,
        | materialized views, which has _never really been solved_, and
        | built a solution on a completely new architecture. This solution
        | is useful by itself, but I'm also eagerly watching how their
        | roadmap [1] plays out as they go back and build out features like
        | persistence and start to look more like a full-fledged data
        | warehouse, but one with the first correct implementation of
        | materialized views.
       | 
       | [1] https://materialize.com/blog-roadmap/
        
         | adamnemecek wrote:
         | How were previous implementations of materialized views
         | deficient?
        
           | ahupp wrote:
           | Here's a nice writeup of Materialize:
           | 
           | https://lucperkins.dev/blog/new-db-tech-1/#materialize
           | 
            | Not really mentioned here, but in standard Postgres it can
            | be quite expensive to update the view, so you can only do it
            | periodically. Materialize keeps it up to date continuously.
        
           | georgewfraser wrote:
           | Joins were unavailable or subject to extreme limitations. Or
           | just plain wrong!
        
         | [deleted]
        
         | jameslk wrote:
         | Isn't this pretty similar to what Dremio does?
        
           | benesch wrote:
           | Dremio is a batch processor, not a stream processor. The
           | fundamental difference is that a batch processor will need to
           | recompute a query from scratch whenever the input data
           | changes, while a stream processor can incrementally update
           | the existing query result based on the change to the input.
           | 
           | This can make a huge difference when making small changes to
           | large datasets. Materialize can incrementally compute small
           | changes to very complicated queries in just a few
           | milliseconds, while with batch processors you're looking at
           | latency in the hundreds of milliseconds, seconds, or minutes,
           | depending on the size of the data.
           | 
           | Another way of looking at it is that in batch processors,
           | latency scales with the size of the total data, while in
           | stream processors, latency scales with the size of the
           | _updates_ to the data.
        
             | jameslk wrote:
             | I see, thank you for the explanation!
        
         | btown wrote:
         | For a primer on materialized views, and one of the key
         | rationales for Materialize's existence, there's no better
         | presentation than Martin Kleppman's "Turning the Database
         | Inside-Out" (2015). (At my company it's required viewing for
         | engineers across our stack, because every data structure _is_ a
         | materialized view no matter where on frontend or backend that
         | data structure lives.)
         | 
         | https://www.confluent.io/blog/turning-the-database-inside-ou...
         | 
         | Confluent is building an incredible business helping companies
         | to build these types of systems on top of Kafka, Samza, and
         | architectural principles originally developed at LinkedIn, but
         | more along the lines of "if you'd like this query to be
         | answered, or this recommender system to be deployed for every
         | user, we can reliably code a data pipeline to do so at LinkedIn
         | scale" than "you can run this query right away against our OLAP
         | warehouse without knowing about distributed systems." (If it's
         | more nuanced than this please correct me!)
         | 
         | On the other hand, Materialize could allow businesses to
         | realize this architecture, with its vast benefits to
         | millisecond-scale data freshness and analytical flexibility,
          | simply by writing SQL queries as if it were a traditional
         | system. As its capabilities expand beyond parity with SQL
         | (though I agree that's absolutely the best place for them to
         | start and optimize), there are tremendous wins here that could
         | power the next generation of real-time systems.
         | 
         | EDIT: some clarifications and additional examples
        
           | Liron wrote:
           | I also wrote a primer for why the world needs Materialize
           | [1]. It had a big discussion on HN [2], and Materialize's
           | cofounder said it was part of his motivation [3].
           | 
           | [1] https://medium.com/@lironshapira/data-denormalization-is-
           | bro...
           | 
           | [2] https://news.ycombinator.com/item?id=12613586
           | 
           | [3]
           | https://twitter.com/narayanarjun/status/1241450203095465986
        
             | quodlibetor wrote:
             | Ha! Your blog post was one of the reasons that I trusted in
             | the future of Materialize enough to decide to work here!
             | 
             | I agree, that is exactly the problem that I, in particular,
             | think we are solving.
        
         | dataplayer wrote:
         | What exactly are "materialized views"?
        
           | jacques_chester wrote:
           | Suppose you have normalized your data schema, up to at least
           | 3NF, perhaps even further up to 4NF, 5NF or (as Codd
           | intended) BCNF.
           | 
           | Great! You are now largely liberated from introducing many
           | kinds of anomaly at insertion time. And you'll often only
           | need to write once for each datum (modulo implementation
           | details like write amplification), because a normalised
           | schema has "a place for everything and everything in its
           | place".
           | 
           | Now comes time to query the data. You write some joins, and
           | all is well. But a few things start to happen. One is that
           | writing joins over and over becomes laborious. What you'd
           | really like is some denormalised intermediary views, which
           | transform the fully-normalised base schema into something
           | that's more convenient to query. You can also use this to
           | create an isolation layer between the base schema and any
           | consumers, which will make future schema changes easier and
           | possibly improve security.
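            | 
            | As a tiny sketch (hypothetical tables), such a convenience
            | view over a normalised pair of tables might look like:
            | 
            |     CREATE VIEW customer_orders AS
            |       SELECT c.customer_id, c.name, o.order_id, o.placed_at
            |       FROM customers c
            |       JOIN orders o ON o.customer_id = c.customer_id;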
           | 
           | The logical endpoint of doing so is the Data Warehouse
           | (particularly in the Kimball/star schema/dimensional
           | modelling style). You project your normalised data, which you
           | have high confidence in, into a completely different shape
           | that is optimised for fast summarisation and exploration. You
           | use this as a read-only database, because it massively
           | duplicates a lot of information that could otherwise have
           | been derived via query (for example, instead of a single
           | "date" field, you have fields for day of week, day of month,
           | day of year, week of year, whether it's a holiday ... I've
           | built tables which include columns like "days until major
           | conference X" and "days since last quarterly release").
           | 
           | Now we reach the first problem. It's too slow! Projecting
           | that data from the normalised schema requires a lot of
           | storage and compute. You realise after some scratching that
           | your goal all along was to pay that cost upfront so that you
           | can reap the benefits at query time. What you want is a
           | _view_ that has the physical characteristics of a _table_.
           | Meaning you want to write out the results of the query, but
            | still treat it like a view. You've "materialized" the view.
           | 
           | Now the second problem. Who, or what, does that projection?
           | Right now that role is filled by ETL, "Extract, Transform and
           | Load". Extract from the normalised system, transform it into
           | the denormalised version, then load that into a data
           | warehouse. Most places do this on a regular cadence, such as
           | nightly, because it just takes buckets and buckets of work to
           | regenerate the output every time.
           | 
           | Now enters Materialize, who have a secret weapon: timely
           | dataflow. The basic outcome is that instead of re-running an
           | _entire view query_ to regenerate the materialized view, they
           | can, from a given datum, determine exactly what will change
           | in the materialized view and _only_ update that. That makes
           | such views potentially thousands of times cheaper. You could
           | even run the normalised schema and the denormalised
           | projections on the same physical set of data -- no need for
           | the overhead and complexity of ETL, no need to run two
           | database systems, no need to _wait_ (without the added
           | complexity of a full streaming platform).
        
             | gen220 wrote:
              | That's a great description! Does Materialize describe how
             | they implement timely dataflow?
             | 
             | At my current company, we have built some systems like
              | this, where a downstream table is essentially a function of
             | a dozen upstream tables.
             | 
              | Whenever one of the upstream tables changes, its primary
             | key is published to a queue, some worker translates this
             | upstream primary key into a set of downstream primary keys,
             | and publishes these downstream primary keys to a compacted
             | queue.
             | 
             | The compacted queue is read by another worker, that
             | "recomputes" each dirty key, one-at-a-time, which involves
             | fetching the latest-and-greatest version of each upstream
             | table.
             | 
             | This last worker is the bottleneck, but it's optimized by
             | per-key caching, so we only fetch the latest-and-greatest
             | version once per update. It can also be safely and
             | arbitrarily parallelized, since the stream they read from
             | is partitioned on key.
        
               | albertwang wrote:
               | Here's a 15-minute introduction to Timely Dataflow by
               | Frank, our co-founder:
               | https://www.youtube.com/watch?v=yOnPmVf4YWo
        
               | scott_s wrote:
               | > Does materialize describe how they implement timely
               | dataflow?
               | 
               | It's open source
               | (https://github.com/TimelyDataflow/timely-dataflow), and
               | also extensively written about both in academic research
               | papers and documentation for the project itself. The
               | GitHub repo has pointers to all of that. See also
               | differential dataflow
                | (https://github.com/timelydataflow/differential-dataflow).
        
           | ako wrote:
            | It's a query whose results you save in a cache table, so the
            | next time it is queried, you can serve the results from the
            | cache.
            | 
            | Typically, in a traditional RDBMS, the query is defined as a
            | SQL view, which you either have to refresh manually or have
            | refreshed periodically.
            | 
            | Using streaming systems like Kafka, it's possible to
            | continuously update the cached results based on the incoming
            | data, so the result is a near-real-time, up-to-date query
            | result.
            | 
            | Writing the stream processing to update the materialized view
            | can be complex; using SQL, as Materialize enables you to do,
            | makes it a lot more productive.
        
           | derefr wrote:
           | Let's start with views. A database view is a "stored query"
           | that presents itself as a table, that you can further query
           | against.
           | 
            | If you have a view "bar":
            | 
            |     CREATE VIEW bar AS
            |       SELECT x * 2 AS a, y + 1 AS b FROM foo;
            | 
            | and then you `SELECT a FROM bar`, then the "question" you're
            | really asking is just:
            | 
            |     SELECT a FROM (SELECT x * 2 AS a, y + 1 AS b FROM foo)
            | 
            | -- which, with efficient query planning, boils down to:
            | 
            |     SELECT x * 2 AS a FROM foo
           | 
           | It's especially important to note that the `y + 1` expression
           | from the view definition isn't computed in this query. The
           | inner query from the view isn't "compiled" -- forced to be in
           | some shape -- but rather sits there in symbolic form,
           | "pasted" into your query, where the query planner can then
           | manipulate and optimize/streamline it further, to suit the
           | needs of the outer query.
           | 
           | -----
           | 
           | To _materialize_ something is to turn it from symbolic-
           | expression form, into  "hard" data -- a result-set of in-
           | memory row-tuples. Materialization is the "enumeration" in a
            | Streams abstraction, or the forcing of a "thunk" in a lazy-
            | evaluation language. It's the master screw that forces all
            | the activity
           | dependent on it -- that would otherwise stay abstract -- to
           | "really happen."
           | 
            | Databases _don't_ materialize anything unless they're forced
            | to. If you do a query like
            | 
            |     SELECT false FROM (SELECT * FROM foo WHERE x = 1)
           | 
           | ...no work (especially no IO) actually happens, because no
           | data from the inner query needs to be _materialized_ to
           | resolve the outer query.
           | 
           | Streaming data out of the DB to the user requires
           | serialization [= putting the data in a certain wire format],
           | and serialization requires materialization [= having the data
           | available in memory in order to read and re-format it.] So
           | whatever final shape the data returned from your outermost
           | query has when it "leaves" the DB, _that_ data will always
           | get materialized. But other processes internal to the DB may
           | sometimes require data to be materialized as well.
           | 
           | Materialization is costly -- it's usually the only thing
           | forcing the DB to actually read the data on disk, for any
           | columns it wasn't filtering by. Many of the optimizations in
           | RDBMSes -- like the elimination of that `y + 1` above -- have
           | the goal of avoiding materialization, and the disk-reads /
           | memory allocations / etc. that materialization requires.
           | 
           | -----
           | 
           | Those definitions out of the way, a "materialized view" is
           | something that acts similar to a view (i.e. is constructed in
           | terms of a stored query, and presents itself as a queriable
           | table) but which -- unlike a regular view -- has been pre-
           | materialized. The query for a matview is still stored, but at
           | some point in advance of querying, the RDBMS actually _runs_
           | that query, fully materializes the result-set from it, and
           | then caches it.
           | 
           | So, basically, a materialized view is a view with a cached
           | result-set.
           | 
           | Like any cache, this result-set cache increases read-time
           | efficiency in the case where the original computation was
           | costly. (There's no point in "upgrading" a view into a
           | matview if your queries against the plain view were already
           | cheap enough for your needs.)
           | 
           | But like any cache, it needs to be maintained, and can become
           | out-of-sync with its source.
           | 
            | Although materialized views are a long-standing RDBMS
            | feature, not all SQL RDBMSes implement them. MySQL/MariaDB
            | does not, for
           | example. (Which is why you'll find that much of the software
           | world just pretends matviews don't exist when designing their
           | DB architectures. If it ever needs to run on MySQL, it can't
           | use matviews.)
           | 
            | The naive approach that some other RDBMSes (e.g. Postgres)
            | take to materialized views is to offer only manual, full-
            | pass recalculation of the cached result-set, via some
            | explicit command (`REFRESH MATERIALIZED VIEW foo`). This
            | works with "small data"; but at scale, this approach can be
            | so time-consuming for large and complex backing queries that
            | by the time the cache is rebuilt, it's already out-of-date
            | again!
           | 
            | Because there are RDBMSes that either don't have matviews, or
            | don't have _scalable_ matviews, many application developers
            | just avoid the RDBMS's built-in matview abstraction and build
            | their own. Thus, another large swathe of the world's database
            | architecture will either use cron jobs to regularly
            | run+materialize a query and then dump its results back into
            | a table in the same DB, or define on-INSERT/UPDATE/DELETE
            | triggers on "primary" tables that transform and upsert data
            | into "secondary" denormalized tables. These are both
            | approaches to "simulating" matviews, portably, on an RDBMS
            | substrate that isn't guaranteed to have them.
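            | 
            | A minimal sketch of the trigger-maintained flavour (Postgres
            | 11+ syntax; hypothetical orders table; INSERT path only):
            | 
            |     -- Hand-rolled "matview": per-customer order totals.
            |     CREATE TABLE order_totals (
            |       customer_id int PRIMARY KEY,
            |       total       numeric NOT NULL
            |     );
            | 
            |     CREATE FUNCTION bump_order_total() RETURNS trigger AS $$
            |     BEGIN
            |       INSERT INTO order_totals (customer_id, total)
            |       VALUES (NEW.customer_id, NEW.amount)
            |       ON CONFLICT (customer_id) DO UPDATE
            |         SET total = order_totals.total + EXCLUDED.total;
            |       RETURN NEW;
            |     END;
            |     $$ LANGUAGE plpgsql;
            | 
            |     CREATE TRIGGER orders_to_totals
            |       AFTER INSERT ON orders
            |       FOR EACH ROW EXECUTE FUNCTION bump_order_total();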
           | 
           | Other RDBMSes (e.g. Oracle, SQL Server, etc.) _do_ have
           | scalable materialized views, a.k.a.  "incrementally
           | materialized" views. These work less like a view with a
           | cache, and more like a secondary table with write-triggers on
           | primary tables to populate it -- but all handled under-the-
           | covers by the RDBMS itself. You just define the matview, and
           | the RDBMS sees the data-dependencies and sets up the write-
           | through data flow.
           | 
           | Incrementally-materialized views are great for what they're
           | designed for (reporting, mostly); but they aren't intended to
           | be the bedrock for an entire architecture. Building matviews
           | on top of matviews on top of matviews gets expensive fast,
           | because even fancy enterprise RDBMSes like Oracle don't
            | _realize_, when populating table X, that writing to X will
           | in turn write to matview Y, which will in turn "fan out" to
            | matviews {A,B,C,D}, etc. These RDBMSes' matviews were never
           | intended to support complex "dataflow graphs" of updates like
           | this, and so there's too much overhead (e.g. read-write
           | contention on index locks) to actually make these setups
           | practical. And it's very hard for these DBMSes to change
           | this, as their matviews' caches are fundamentally reliant on
            | _database table_ storage engines, which just aren't the
           | right ADT to hold data with this sort of lifecycle.
           | 
           | -----
           | 
           | Materialize is an "RDBMS" (though it's not, really)
           | engineered from the ground up to make these sorts of dataflow
           | graphs of matviews-on-matviews-on-matviews practical, by
           | doing its caching completely differently.
           | 
           | Materialize looks like a SQL RDBMS from the outside, but
           | Materialize _is not_ a database -- not really. (Materialize
            | has no tables. You can't "put" data in it!) Instead,
            | Materialize is a data _streaming_ platform that caches any
            | intermediate materialized data it's forced to construct
           | during the streaming process, so that other consumers can
           | work off those same intermediate representations, without
           | recomputing the data.
           | 
           | If you've ever worked with Akka's Streams, or Elixir's Flows,
           | or for that matter with Apache Beam (nee Google Dataflow),
            | Materialize is that same kind of pipeline. But where all the
           | plumbing work of creating intermediate representations --
           | normally a procedural map/reduce/partition kind of thing --
           | is done by defining SQL matviews; and where the final output
           | isn't a fixed output of the pipeline, but rather comes from
           | running an arbitrary SQL query against any arbitrary matview
           | defined in the system.
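            | 
            | As a sketch of that stacking (hypothetical `orders` source,
            | Materialize-style SQL), views build on views and every layer
            | is kept fresh:
            | 
            |     CREATE MATERIALIZED VIEW order_totals AS
            |       SELECT customer_id, SUM(amount) AS total
            |       FROM orders
            |       GROUP BY customer_id;
            | 
            |     CREATE MATERIALIZED VIEW big_spenders AS
            |       SELECT customer_id
            |       FROM order_totals
            |       WHERE total > 10000;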
        
             | dragonwriter wrote:
             | > Most RDBMSes (e.g. Postgres) only offer manual (`REFRESH
             | MATERIALIZED VIEW foo`) full-pass recalculation of the
             | cached result-set for matviews.
             | 
             | "Most" here seems very much wrong, at least of major
             | products: Oracle has an option for on-commit (rather than
             | manual) and incremental/incremental-if-possible
             | (FAST/FORCED) refresh, so it is limited to neither only-
             | manual nor only-full-pass recalculation. SQL Server indexed
             | views (their matview solution) are automatically
             | incrementally updated as base tables change, they don't
             | even have an option for manual full-pass recalculation,
             | AFAICT. DB2 materialized query tables (their matview
             | solution) have an option for immediate (on-commit) refresh
             | (not sure if the algo here is always full-pass, but its at
             | a minimum not always manual.) Firebird and MySQL/MariaDB
             | don't have any support for materialized views at all
             | (though of course you can manually simulate them with
             | additional tables updated by triggers.) Postgres seems to
              | be the only major RDBMS with both materialized view _support_
             | and the limitation of only on-demand full-pass
             | recalculation of matviews (for that matter, except maybe
             | DB2 having the full-pass limitation, it seems to be the
             | only one with _either_ the only-manual _or_ only-full-pass
             | limitation.)
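              | 
              | For concreteness, the Oracle flavour is declared roughly
              | like this (a sketch; FAST refresh additionally requires
              | materialized view logs on the base table):
              | 
              |     CREATE MATERIALIZED VIEW order_totals
              |       REFRESH FAST ON COMMIT
              |       AS SELECT customer_id,
              |                 SUM(amount)   AS total,
              |                 COUNT(amount) AS amount_cnt,
              |                 COUNT(*)      AS cnt
              |          FROM orders
              |          GROUP BY customer_id;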
        
               | jacques_chester wrote:
               | I think that it's true that many databases offer
               | incremental updates and it's incorrect to say that manual
               | refreshes were the state of the art.
               | 
               | The important point is that Materialize can do it for
               | almost any query, very efficiently, compared to existing
               | options. That opens a lot of possibilities.
        
               | dragonwriter wrote:
               | > The important point is that Materialize can do it for
               | almost any query, very efficiently, compared to existing
               | options. That opens a lot of possibilities.
               | 
               | Yes, this does seem like a very big deal.
        
               | derefr wrote:
               | You're right; I updated my comment.
        
               | dragonwriter wrote:
               | That was a fantastic and illuminating update, thank you.
        
             | jacques_chester wrote:
             | This is an outstanding explanation. Much better than mine.
        
           | bluejekyll wrote:
           | "In computing, a materialized view is a database object that
           | contains the results of a query. For example, it may be a
           | local copy of data located remotely, or may be a subset of
           | the rows and/or columns of a table or join result, or may be
           | a summary using an aggregate function."
           | 
           | https://en.m.wikipedia.org/wiki/Materialized_view
        
           | findjashua wrote:
            | Updated results of a query - e.g., if you do some aggregation
            | or filtering on a table, or join two tables, or anything of
            | the sort - a materialized view will give you the updated
            | results of the query in a separate table.
        
       | temuze wrote:
        | I'm glad more people are tackling this problem. There still isn't
        | a good solution for real-time aggregation of data at large scale.
       | 
       | At a previous company, we dealt with huge data streams (~1TB data
       | / minute) and our customers expected real-time aggregations.
       | 
       | Making an in-house solution for this was incredibly difficult
       | because each customer's data differed wildly. For example:
       | 
        | - Customer A's shards might have so much cardinality that memory
        | becomes an issue.
        | 
        | - Customer B's shards might have so much throughput that CPU
        | becomes a constraint. Sometimes a single aggregation may have so
        | much throughput that you need to artificially increase the
        | cardinality and aggregate the aggregations!
       | 
       | This makes the optimal sharding strategy very complex. Ideally,
       | you want to bin-pack memory-constrained aggregations with CPU-
       | constrained aggregations. In my opinion, the ideal approach
       | involves detecting the cardinality of each shard and bin-packing
       | them.
        
         | jstrong wrote:
          | I've always found that solving a concrete problem, like you
          | were, is vastly easier than building a general-purpose database,
          | because you can make all the tradeoffs that benefit your exact
          | use case. But it sounds like that's not what you experienced.
          | Was it just how heterogeneous the clients' needs were? I guess
          | what I'm saying is, if you are capable of handling 1TB/minute,
          | it seems like you're plenty able to design the system yourself,
          | and would want to - but I'm interested in what I'm missing here.
        
       | mwcampbell wrote:
       | > All of this comes in a single binary that is easy to install,
       | easy to use, and easy to deploy.
       | 
       | And it looks like they chose a sensible license for that binary
       | [1], so they're not giving too much away.
       | 
       | I wonder though if they could have made this work as a
       | bootstrapped business, so they would answer only to customers,
       | not to investors chasing growth at all costs.
       | 
       | [1]: https://materialize.com/download/
        
         | offtop5 wrote:
         | Bootstrapping is fun until you can't make payroll.
         | 
         | If your goal is an exit, and you can raise this much, why not.
        
       | adamnemecek wrote:
       | This is a big win for Rust.
        
       | [deleted]
        
       | haggy wrote:
       | Can you point me at documentation for the fault tolerance of the
       | system? A huge issue for streaming systems (and largely unsolved
       | AFAIK) is being able to guarantee that counts aren't duplicated
       | when things fail. How does Materialize handle the relevant
       | failure scenarios in order to prevent inaccurate counts/sums/etc?
        
         | [deleted]
        
         | jgraettinger1 wrote:
          | This has been a solved problem for a few years now. The basic
          | trick is to publish "pending" messages to the broker which are
          | ACK'd by a later-written message, only after the transaction and
          | all its effects have been committed to stable storage
          | (somewhere). Meanwhile, you also capture consumption state (e.g.
          | offsets) into the same database and transaction within which
          | you're updating the materialization results of a streaming
          | computation.
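          | 
          | Concretely, the consumer side of that idea looks something like
          | this (a sketch with hypothetical tables; the offset bump and
          | the derived result commit atomically):
          | 
          |     BEGIN;
          |       -- How far we've consumed...
          |       UPDATE consumer_offsets
          |          SET last_offset = 42017
          |        WHERE topic = 'orders' AND partition_id = 3;
          |       -- ...and the effect of those messages, in one commit.
          |       INSERT INTO order_totals (customer_id, total)
          |       VALUES (7, 99.50)
          |       ON CONFLICT (customer_id) DO UPDATE
          |         SET total = order_totals.total + EXCLUDED.total;
          |     COMMIT;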
         | 
         | Here's [1] a nice blog post from the Kafka folks on how they
         | approached it.
         | 
          | Gazette [2] (I'm the primary architect) also solves it with
         | some different trade-offs: a "thicker" client, but with no
         | head-of-line blocking and reduced end-to-end latency.
         | 
         | Estuary Flow [3], built on Gazette, leverages this to provide
         | exactly-once, incremental map/reduce and materializations into
         | arbitrary databases.
         | 
         | [1]: https://www.confluent.io/blog/exactly-once-semantics-are-
         | pos...
         | 
         | [2]: https://gazette.readthedocs.io/en/latest/architecture-
         | exactl...
         | 
         | [3]: https://estuary.readthedocs.io/en/latest/README.html
        
           | haggy wrote:
            | Interesting! I'm going to read through what you linked.
            | Thanks for the info!
        
         | frankmcsherry wrote:
         | Hi! I work at Materialize.
         | 
         | I think the right starter take is that Materialize is a
         | deterministic compute engine, one that relies on other
         | infrastructure to act as the source of truth for your data. It
         | can pull data out of your RDBMS's binlog, out of Debezium
          | events you've put into Kafka, out of local files, etc.
         | 
         | On failure and restart, Materialize leans on the ability to
          | return to the assumed source of truth, again an RDBMS + CDC or
         | perhaps Kafka. I don't recommend thinking about Materialize as
         | a place to sink your streaming events _at the moment_ (there is
         | movement in that direction, because the operational overhead of
         | things like Kafka is real).
         | 
         | The main difference is that unlike an OLTP system, Materialize
         | doesn't have to make and persist non-deterministic choices
         | about e.g. which transactions commit and which do not. That
         | makes fault-tolerance a _performance_ feature rather than a
         | _correctness_ feature, at which point there are a few other
         | options as well (e.g. active-active).
         | 
         | Hope this helps!
        
       | [deleted]
        
       | beoberha wrote:
       | Late to the post, but if anyone wants a good primer on
       | Materialize (beyond what their actual engineers and a cofounder
       | are saying in the comments), check out the Materialize Quarantine
       | Database Lecture: https://db.cs.cmu.edu/events/db-seminar-
       | spring-2020-db-group...
        
         | mavelikara wrote:
         | The actual talk seems to be here:
         | https://www.youtube.com/watch?v=9XTg09W5USM
        
       | pgt wrote:
       | Materialize can help us manifest The Web After Tomorrow [^1].
       | 
        | My previous comments on why DDF (differential dataflow) is so
        | crucial to the future of the Web:
       | 
       | > "There is a big upset coming in the UX world as we converge
       | toward a generalized implementation of the "diff & patch" pattern
       | which underlies Git, React, compiler optimization, scene
       | rendering, and query optimization." --
       | https://news.ycombinator.com/item?id=21683385 also with links to
       | prior art like Adapton and Incremental.
       | 
       | > "DD (Differential Dataflow) is commercialized in Materialize"
       | -- https://news.ycombinator.com/item?id=24846119
       | 
       | > "Materialize exists to efficiently solve the view maintenance
       | problem" https://news.ycombinator.com/item?id=22888396
       | [^1]: https://tonsky.me/blog/the-web-after-tomorrow/
        
         | cocoflunchy wrote:
         | Thanks for this, I'm glad to see I'm not the only one tired of
         | writing everything twice (once in the frontend and once in the
         | backend). I'll revisit the links later.
        
       ___________________________________________________________________
       (page generated 2020-12-02 23:00 UTC)