[HN Gopher] Monarch: Google's Planet-Scale In-Memory Time Series...
       ___________________________________________________________________
        
       Monarch: Google's Planet-Scale In-Memory Time Series Database
        
       Author : mlerner
       Score  : 173 points
       Date   : 2022-05-14 16:12 UTC (6 hours ago)
        
 (HTM) web link (www.micahlerner.com)
 (TXT) w3m dump (www.micahlerner.com)
        
       | yegle wrote:
       | Google Cloud Monitoring's time series database is backed by
       | Monarch.
       | 
        | The query language is MQL, which closely resembles the internal
        | Python-based query language:
       | https://cloud.google.com/monitoring/mql
        
         | sleepydog wrote:
         | MQL is an improvement over the internal language, IMO. There
         | are some missing features around literal tables, but otherwise
         | the language is more consistent and flexible.
        
       | candiddevmike wrote:
        | Interesting that Google replaced a pull-based metric system
        | similar to Prometheus with a push-based system... I thought one
        | of the selling points of Prometheus and the pull-based dance was
        | how scalable it was?
        
         | lokar wrote:
         | It's sort of a pull/push hybrid. The client connects to the
         | collection system and is told how often to send each metric (or
         | group of them) back over that same connection. You configure
         | per target/metric collection policy centrally.
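          | 
          | As a rough sketch (none of this is Monarch's actual protocol;
          | the endpoint, message shapes, and field names below are made
          | up), the shape in Go would be something like: dial out once,
          | receive the centrally-configured schedule, then stream samples
          | back over the same connection.
          | 
          |   package main
          |   
          |   import (
          |       "encoding/json"
          |       "log"
          |       "net"
          |       "time"
          |   )
          |   
          |   // What the collector tells us to send, and how often.
          |   type schedule struct {
          |       Metric          string `json:"metric"`
          |       IntervalSeconds int    `json:"interval_seconds"`
          |   }
          |   
          |   // One measurement streamed back over the same connection.
          |   type sample struct {
          |       Metric string  `json:"metric"`
          |       Value  float64 `json:"value"`
          |       Micros int64   `json:"ts_micros"`
          |   }
          |   
          |   func readMetric(name string) float64 { return 42 } // stand-in
          |   
          |   func main() {
          |       // The client dials out; it never accepts connections.
          |       conn, err := net.Dial("tcp", "collector.example:9999")
          |       if err != nil {
          |           log.Fatal(err)
          |       }
          |       defer conn.Close()
          |   
          |       // Centrally-configured collection policy arrives first.
          |       var scheds []schedule
          |       if err := json.NewDecoder(conn).Decode(&scheds); err != nil {
          |           log.Fatal(err)
          |       }
          |   
          |       out := make(chan sample)
          |       for _, s := range scheds {
          |           s := s
          |           go func() {
          |               tick := time.Duration(s.IntervalSeconds) * time.Second
          |               for range time.Tick(tick) {
          |                   out <- sample{s.Metric, readMetric(s.Metric),
          |                       time.Now().UnixMicro()}
          |               }
          |           }()
          |       }
          |   
          |       // A single writer streams samples back to the collector.
          |       enc := json.NewEncoder(conn)
          |       for smp := range out {
          |           if err := enc.Encode(smp); err != nil {
          |               log.Fatal(err)
          |           }
          |       }
          |   }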
        
         | jeffbee wrote:
         | Prometheus itself has no scalability at all. Without
         | distributed evaluation they have a brick wall.
        
           | gttalbot wrote:
           | This. Any new large query or aggregation in the
           | Borgmon/Prometheus model requires re-solving federation, and
           | continuing maintenance of runtime configuration. That might
           | technically be scalable in that you could do it but you have
           | to maintain it, and pay the labor cost. It's not practical
           | over a certain size or system complexity. It's also friction.
           | You can only do the queries you can afford to set up.
           | 
           | That's why Google spent all that money to build Monarch. At
           | the end of the day Monarch is vastly cheaper in person time
           | and resources than manually-configured Borgmon/Prometheus.
           | And there is much less friction in trying new queries, etc.
        
           | halfmatthalfcat wrote:
            | Can you elaborate? I've run Prometheus at some scale and it's
           | performed fine.
        
             | lokar wrote:
             | You pretty quickly exceed what one instance can handle for
              | memory, CPU, or both. At that point you don't have any really
              | good options to scale while maintaining a flat namespace
             | (you need to partition).
        
               | preseinger wrote:
               | Sure? Prometheus scales with a federation model, not a
               | single flat namespace.
        
               | gttalbot wrote:
               | This means each new query over a certain size becomes a
               | federation problem, so the friction for trying new things
               | becomes very high above the scale of a single instance.
               | 
               | Monitoring as a service has a lot of advantages.
        
               | preseinger wrote:
               | Well you obviously don't issue metrics queries over
               | arbitrarily large datasets, right? The Prometheus
               | architecture reflects this invariant. You constrain
               | queries against both time and domain boundaries.
        
               | gttalbot wrote:
               | Monarch can support both ad-hoc and periodic, standing
               | queries of arbitrarily large size, and has the means to
               | spread the computation out over many intermediate mixer
               | and leaf nodes. It does query push-down so that the
               | "expensive" parts of aggregations, joins, etc., can be
               | done in massively parallel fashion at the leaf level.
               | 
               | It scales so well that many aggregations are set up and
               | computed for every service across the whole company (CPU,
               | memory usage, error rates, etc.). For basic monitoring
               | you can run a new service in production and go and look
               | at a basic dashboard for it without doing anything else
               | to set up monitoring.
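                | 
                | A toy sketch of the push-down idea (not Monarch's
                | code, just the shape): each leaf reduces its own
                | series to small partial aggregates, and the mixer
                | only merges partials, so no raw points cross the
                | network.
                | 
                |   package main
                |   
                |   import "fmt"
                |   
                |   type partial struct {
                |       Sum   float64
                |       Count int64
                |   }
                |   
                |   // Runs on each leaf, next to the data.
                |   func leafAggregate(
                |       series map[string][]float64) map[string]partial {
                |       out := make(map[string]partial)
                |       for key, vals := range series {
                |           p := out[key]
                |           for _, v := range vals {
                |               p.Sum += v
                |               p.Count++
                |           }
                |           out[key] = p
                |       }
                |       return out
                |   }
                |   
                |   // Runs on the mixer: cheap, independent of leaf size.
                |   func merge(parts []map[string]partial) map[string]partial {
                |       out := make(map[string]partial)
                |       for _, part := range parts {
                |           for key, p := range part {
                |               m := out[key]
                |               m.Sum += p.Sum
                |               m.Count += p.Count
                |               out[key] = m
                |           }
                |       }
                |       return out
                |   }
                |   
                |   func main() {
                |       a := leafAggregate(map[string][]float64{"job=web": {1, 2, 3}})
                |       b := leafAggregate(map[string][]float64{"job=web": {4, 5}})
                |       for k, p := range merge([]map[string]partial{a, b}) {
                |           fmt.Printf("%s mean=%.2f\n", k, p.Sum/float64(p.Count))
                |       }
                |   }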
        
               | gttalbot wrote:
               | Also, very large ad-hoc queries are supported, with
               | really good user isolation, so that (in general) users
               | don't harm each other.
        
           | preseinger wrote:
           | Prometheus is highly scalable?? What are you talking about??
        
             | dijit wrote:
             | It is not.
             | 
             | It basically does the opposite of what every scalable
             | system does.
             | 
              | To get HA you double your number of pollers.
              | 
              | To scale your queries you aggregate them into other
              | prometheii.
             | 
             | If this is scalability: everything is scalable.
        
               | preseinger wrote:
               | I don't understand how the properties you're describing
               | imply that Prometheus isn't scalable.
               | 
               | High Availability always requires duplication of effort.
               | Scaling queries always requires sharding and aggregation
               | at some level.
               | 
               | I've deployed stock Prometheus at global scale, O(100k)
               | targets, with great success. You have to understand and
               | buy into Prometheus' architectural model, of course.
        
               | dijit wrote:
                | In the ways in which you can scale Prometheus, you can
                | scale anything.
                | 
                | It does not, itself, have highly scalable properties
                | built in.
               | 
               | It does not do sharding, it does not do proxying, it does
               | not do batching, it does not do anything that would allow
               | it to run multiple servers and query over multiple
               | servers.
               | 
               | Look. I'm not saying that it doesn't work; but when I
               | read about borgmon and Prometheus: I understood the
               | design goal was intentionally not to solve these hard
               | problems, and instead use them as primitive time series
               | systems that can be deployed with a small footprint
               | basically everywhere (and individually queried).
               | 
               | I submit to you, I could also have an influxdb in every
               | server and get the same "scalability".
               | 
               | Difference being that I can actually run a huge influxdb
               | cluster with a dataset that exceeds the capabilities of a
               | single machine.
        
               | preseinger wrote:
               | It seems like you're asserting a very specific definition
               | of scalability that excludes Prometheus' scalability
               | model. Scalability is an abstract property of a system
               | that can be achieved in many different ways. It doesn't
               | require any specific model of sharding, batching, query
               | replication, etc. Do you not agree?
        
               | dijit wrote:
               | You need to go back to computer science class.
               | 
               | I'm not defining terms.
               | 
               | https://en.wikipedia.org/wiki/Database_scalability
               | 
               | Scalability means running a single workload across
               | multiple machines.
               | 
               | Prometheus intentionally does not scale this way.
               | 
               | I'm not being mean, it is fact.
               | 
                | It has made engineering design trade-offs, and one of
                | those means it is not built to scale. This is fine; I'm
                | not here pooping on your baby.
               | 
               | You can build scalable systems on top of things which do
               | not individually scale.
        
               | preseinger wrote:
               | Scalability isn't a well-defined term, and Prometheus
               | isn't a database. :shrug:
        
               | dijit wrote:
               | Wrong on both counts
               | 
               | Sorry for being rude, but this level of ignorance is
               | extremely frustrating.
        
               | jeffbee wrote:
               | Prometheus cannot evaluate a query over time series that
               | do not fit in the memory of a single node, therefore it
               | is not scalable.
               | 
               | The fact that it could theoretically ingest an infinite
               | amount of data that it cannot thereafter query is not
               | very interesting.
        
               | preseinger wrote:
               | It can? It just partitions the query over multiple nodes?
        
               | lokar wrote:
               | Lots of systems provide redundancy with 2X cost. It's not
               | that hard.
        
           | dilyevsky wrote:
           | You can set up dist eval similar to how it was done in
           | borgmon but you gotta do it manually (or maybe write an
            | operator to automate). One of Monarch's core ideas is to do
            | that behind the scenes for you.
        
             | jeffbee wrote:
             | Prometheus' own docs say that distributed evaluation is
             | "deemed infeasible".
        
               | dilyevsky wrote:
                | That's just, like... you know... their opinion, man.
               | https://prometheus.io/docs/prometheus/latest/federation/
        
               | bpicolo wrote:
               | Prometheus federation isn't distributed evaluation. It
               | "federates" from other nodes onto a single node.
               | 
               | > Federation allows a Prometheus server to scrape
               | selected time series from another Prometheus server
        
               | dilyevsky wrote:
                | Collect directly from shard-level Prometheus, then
                | aggregate using /federate at another level. That's how
                | Thanos also works, AFAIK.
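                | 
                | The mechanics of that hop are just a scrape of each
                | shard's /federate endpoint with match[] selectors
                | (normally done via a scrape_config with honor_labels
                | rather than hand-rolled HTTP; the shard host names
                | below are placeholders):
                | 
                |   package main
                |   
                |   import (
                |       "fmt"
                |       "io"
                |       "net/http"
                |       "net/url"
                |   )
                |   
                |   func main() {
                |       shards := []string{
                |           "http://prom-shard-0:9090",
                |           "http://prom-shard-1:9090",
                |       }
                |       // Only pull the selected/aggregated series.
                |       q := url.Values{}
                |       q.Add("match[]", `{job="api"}`)
                |   
                |       for _, s := range shards {
                |           resp, err := http.Get(s + "/federate?" + q.Encode())
                |           if err != nil {
                |               fmt.Println("scrape failed:", err)
                |               continue
                |           }
                |           body, _ := io.ReadAll(resp.Body)
                |           resp.Body.Close()
                |           fmt.Printf("%s: %d bytes of samples\n", s, len(body))
                |       }
                |   }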
        
           | buro9 wrote:
           | That's what Mimir solves
        
             | deepsun wrote:
             | How does it compare to VictoriaMetrics?
        
         | dilyevsky wrote:
          | It was originally push, but I think they went back to a sort
          | of scheduled pull mode after a few years. There was a very
          | in-depth review doc written about this internally which maybe
          | will get published some day.
        
           | atdt wrote:
           | What's the go/ link?
        
             | dilyevsky wrote:
             | Can't remember - just search on moma /s
        
         | gttalbot wrote:
         | Pull collection eventually became a real scaling bottleneck for
         | Monarch.
         | 
          | The way the "pull" collection worked was that there was an
          | external process-discovery mechanism, which the leaf used to
          | find the entities it was monitoring; the leaf backend
          | processes would connect to an endpoint that each monitored
          | entity's collection library listened on; and those entities'
          | collection libraries would stream the metric measurements
          | according to the schedules that the leaves sent.
         | 
         | Several problems.
         | 
          | First, the leaf-side data structures and TCP connections become
          | very expensive. If that leaf process is connecting to many, many
          | thousands of monitored entities, TCP buffers aren't free,
          | keep-alives aren't free, and neither are a host of other data
          | structures. Eventually this became an...interesting...fraction
          | of the CPU and RAM on these leaf processes.
         | 
         | Second, this implies a service discovery mechanism so that the
         | leaves can find the entities to monitor. This was a combination
         | of code in Monarch and an external discovery service. This was
          | a constant source of headaches and outages, as the appearance
         | and disappearance of entities is really spiky and
         | unpredictable. Any burp in operation of the discovery service
         | could cause a monitoring outage as well. Relatedly, the
         | technical "powers that be" decided that the particular
         | discovery service, of which Monarch was the largest user,
         | wasn't really something that was suitable for the
         | infrastructure at scale. This decision was made largely
         | independently of Monarch, but required Monarch to move off.
         | 
         | Third, Monarch does replication, up to three ways. In the pull-
         | based system, it wasn't possible to guarantee that the
         | measurement that each replica sees is the same measurement with
         | the same microsecond timestamp. This was a huge data quality
         | issue that made the distributed queries much harder to make
         | correct and performant. Also, the clients had to pay both in
         | persistent TCP connections on their side and in RAM, state
         | machines, etc., for this replication as a connection would be
          | made from each backend leaf process holding a replica for a
         | given client.
         | 
         | Fourth, persistent TCP connections and load balancers don't
         | really play well together.
         | 
         | Fifth, not everyone wants to accept incoming connections in
         | their binary.
         | 
         | Sixth, if the leaf process doesn't need to know the collection
         | policies for all the clients, those policies don't have to be
         | distributed and updated to all of them. At scale this matters
         | for both machine resources and reliability. This can be made a
         | separate service, pushed to the "edge", etc.
         | 
         | Switching from a persistent connection to the clients pushing
         | measurements in distinct RPCs as they were recorded eventually
         | solved all of these problems. It was a very intricate
         | transition that took a long time. A lot of people worked very
         | hard on this, and should be very proud of their work. I hope
          | some of them jump into the discussion! (At the very least
          | they'll add things I missed/didn't remember... ;^)
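          | 
          | For a sense of the push shape from the client side (the
          | endpoint and payload here are invented for illustration, not
          | the real API): each batch of measurements goes out as an
          | independent, short-lived RPC, so there is nothing for the
          | backend to discover or keep open, and it load-balances like
          | any other request.
          | 
          |   package main
          |   
          |   import (
          |       "bytes"
          |       "encoding/json"
          |       "log"
          |       "net/http"
          |       "time"
          |   )
          |   
          |   type point struct {
          |       Metric string  `json:"metric"`
          |       Value  float64 `json:"value"`
          |       Micros int64   `json:"ts_micros"`
          |   }
          |   
          |   func push(batch []point) error {
          |       body, err := json.Marshal(batch)
          |       if err != nil {
          |           return err
          |       }
          |       // One independent request per batch; all replicas can be
          |       // written with the same client-side timestamps.
          |       resp, err := http.Post("https://ingest.example/v1/write",
          |           "application/json", bytes.NewReader(body))
          |       if err != nil {
          |           return err
          |       }
          |       resp.Body.Close()
          |       return nil
          |   }
          |   
          |   func main() {
          |       for range time.Tick(10 * time.Second) {
          |           batch := []point{{"rpc_latency_ms", 12.5,
          |               time.Now().UnixMicro()}}
          |           if err := push(batch); err != nil {
          |               log.Println("push failed, retry next tick:", err)
          |           }
          |       }
          |   }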
        
       | nickstinemates wrote:
        | too small for me, I was looking more for the scale of the
       | universe.
        
         | yayr wrote:
         | in case this can be deployed single-handed it might be useful
         | on a spaceship... would need some relativistic time accounting
         | though.
        
       | kasey_junk wrote:
        | A _huge_ difference between Monarch and other TSDBs that isn't
        | outlined in this overview is that a storage primitive for schema
        | values is the histogram. Most (maybe all besides Circonus) TSDBs
        | try to create histograms at query time using counter primitives.
        | 
        | All of those query-time histogram aggregations are making pretty
        | subtle trade-offs that make analysis fraught.
        
         | teraflop wrote:
         | Is it really that different from, say, the way Prometheus
         | supports histogram-based quantiles?
         | https://prometheus.io/docs/practices/histograms/
         | 
         | Granted, it looks like Monarch supports a more cleanly-defined
         | schema for distributions, whereas Prometheus just relies on you
         | to define the buckets yourself and follow the convention of
         | using a "le" label to expose them. But the underlying
         | representation (an empirical CDF) seems to be the same, and so
         | the accuracy tradeoffs should also be the same.
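          | 
          | Concretely, on the Prometheus side the "schema" is just the
          | bucket upper bounds you pick up front, exposed as cumulative
          | _bucket series with an "le" label (the boundaries below are
          | arbitrary placeholders):
          | 
          |   package main
          |   
          |   import (
          |       "net/http"
          |       "time"
          |   
          |       "github.com/prometheus/client_golang/prometheus"
          |       "github.com/prometheus/client_golang/prometheus/promhttp"
          |   )
          |   
          |   var rpcLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
          |       Name:    "rpc_latency_seconds",
          |       Help:    "RPC latency.",
          |       Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5},
          |   })
          |   
          |   func main() {
          |       prometheus.MustRegister(rpcLatency)
          |       go func() {
          |           for range time.Tick(time.Second) {
          |               // Increments every bucket with le >= 0.042.
          |               rpcLatency.Observe(0.042)
          |           }
          |       }()
          |       http.Handle("/metrics", promhttp.Handler())
          |       http.ListenAndServe(":8080", nil)
          |   }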
        
           | spullara wrote:
           | Much different. When you are reporting histograms you can
           | combine them and see the true p50 or whatever across all the
           | individual systems reporting the metric.
        
             | [deleted]
        
             | nvarsj wrote:
             | Can you elaborate a bit? You can do the same in Prometheus
             | by summing the bucket counts. Not sure what you mean by
             | "true p50" either. With buckets it's always an
             | approximation based on the bucket widths.
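              | 
              | Rough sketch of what summing bucket counts gives you
              | (roughly the interpolation histogram_quantile does;
              | bounds and counts below are made up): you can merge
              | any number of instances and still get a quantile
              | estimate, but only to bucket-width precision.
              | 
              |   package main
              |   
              |   import "fmt"
              |   
              |   // Shared "le" upper bounds, in seconds.
              |   var bounds = []float64{.01, .05, .1, .5, 1}
              |   
              |   // Sum cumulative bucket counts across instances.
              |   func merge(instances [][]float64) []float64 {
              |       out := make([]float64, len(bounds))
              |       for _, counts := range instances {
              |           for i, c := range counts {
              |               out[i] += c
              |           }
              |       }
              |       return out
              |   }
              |   
              |   // Interpolate inside the bucket holding rank q*total.
              |   func quantile(q float64, cum []float64) float64 {
              |       rank := q * cum[len(cum)-1]
              |       lower, prev := 0.0, 0.0
              |       for i, c := range cum {
              |           if c >= rank {
              |               if c == prev {
              |                   return bounds[i]
              |               }
              |               w := bounds[i] - lower
              |               return lower + w*(rank-prev)/(c-prev)
              |           }
              |           lower, prev = bounds[i], c
              |       }
              |       return bounds[len(bounds)-1]
              |   }
              |   
              |   func main() {
              |       a := []float64{10, 40, 70, 95, 100}
              |       b := []float64{5, 20, 90, 98, 100}
              |       merged := merge([][]float64{a, b})
              |       fmt.Printf("approx p50 = %.3fs\n", quantile(0.5, merged))
              |   }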
        
               | spullara wrote:
               | Ah, I misunderstood what you meant. If you are reporting
               | static buckets I get how that is better than what folks
               | typically do but how do you know the buckets a priori?
               | Others back their histograms with things like
               | https://github.com/tdunning/t-digest. It is pretty
               | powerful as the buckets are dynamic based on the data and
               | histograms can be added together.
        
             | gttalbot wrote:
             | Yes. This. Also, displaying histograms in heatmap format
             | can allow you to intuit the behavior of layered distributed
             | systems, caches, etc. Relatedly, exemplars allowed tying
             | related data to histogram buckets. For example, RPC traces
             | could be tied to the latency bucket & time at which they
             | complete, giving a natural means to tie metrics monitoring
             | and tracing, so you can "go to the trace with the problem".
             | This is described in the paper as well.
        
         | spullara wrote:
         | Wavefront also has histogram ingestion (I wrote the original
         | implementation, I'm sure it is much better now). Hugely
         | important if you ask me but honestly I don't think that many
         | customers use it.
        
         | sujayakar wrote:
         | I've been pretty happy with datadog's distribution type [1]
         | that uses their own approximate histogram data structure [2]. I
         | haven't evaluated their error bounds deeply in production yet,
         | but I haven't had to tune any bucketing. The linked paper [3]
         | claims a fixed percentage of relative error per percentile.
         | 
         | [1] https://docs.datadoghq.com/metrics/distributions/
         | 
         | [2] https://www.datadoghq.com/blog/engineering/computing-
         | accurat...
         | 
         | [3] https://arxiv.org/pdf/1908.10693.pdf
        
         | hn_go_brrrrr wrote:
         | In my experience, Monarch storing histograms and being unable
         | to rebucket on the fly is a big problem. A percentile line on a
         | histogram will be incredibly misleading, because it's trying to
         | figure out what the p50 of a bunch of buckets is. You'll see
         | monitoring artifacts like large jumps and artificial plateaus
         | as a result of how requests fall into buckets. The bucketer on
         | the default RPC latency metric might not be well tuned for your
         | service. I've seen countless experienced oncallers tripped up
         | by this, because "my graphs are lying to me" is not their first
         | thought.
        
           | heinrichhartman wrote:
           | Circonus Histograms solve that by using a universal bucketing
           | scheme. Details are explained in this paper:
           | https://arxiv.org/abs/2001.06561
           | 
           | Disclaimer: I am a co-author.
        
           | kasey_junk wrote:
            | My personal opinion is that they should have done a
            | log-linear histogram, which solves the problems you mention
            | (with other trade-offs), but to me the big news was making
            | the DB flexible enough to have that data type.
            | 
            | Leaving the world of a single numeric type for each datum
            | will influence the next generation of open-source metrics
            | DBs.
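            | 
            | For the curious, the log-linear idea (in the spirit of
            | HDR/Circonus-style histograms; nothing Monarch-specific)
            | is roughly: group buckets by decade, then split each
            | decade into fixed linear steps, so bucket width never
            | exceeds a fixed fraction of the value and nobody has to
            | choose buckets per metric.
            | 
            |   package main
            |   
            |   import (
            |       "fmt"
            |       "math"
            |   )
            |   
            |   // Upper boundary of the bucket containing v (v > 0):
            |   // each decade [10^k, 10^(k+1)) is split into 90 equal
            |   // steps of width 10^(k-1), so a bucket is never wider
            |   // than 10% of the values it holds.
            |   func bucketUpperBound(v float64) float64 {
            |       exp := math.Floor(math.Log10(v))
            |       step := math.Pow(10, exp-1)
            |       return step * math.Ceil(v/step)
            |   }
            |   
            |   func main() {
            |       for _, v := range []float64{0.00042, 3.14, 27.3, 99999} {
            |           fmt.Printf("%-8g -> le %g\n", v, bucketUpperBound(v))
            |       }
            |   }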
        
           | jrockway wrote:
           | I definitely remember a lot of time spent tweaking histogram
           | buckets for performance vs. accuracy. The default bucketing
           | algorithm at the time was powers of 4 or something very
           | unusual like that.
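            | 
            | In Prometheus client terms that kind of scheme looks like
            | the snippet below (the start value and count are
            | placeholders, not whatever the internal defaults were);
            | geometric buckets get coarse fast, which is where the
            | tweaking pain comes from.
            | 
            |   package main
            |   
            |   import (
            |       "fmt"
            |   
            |       "github.com/prometheus/client_golang/prometheus"
            |   )
            |   
            |   func main() {
            |       // 1ms, 4ms, 16ms, 64ms, 256ms, ~1s, ~4s, ~16s
            |       fmt.Println(prometheus.ExponentialBuckets(0.001, 4, 8))
            |   }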
        
             | shadowgovt wrote:
              | It's because powers of four were great for the original
              | application of statistics on high-traffic services, where
              | the primary thing the user was interested in was deviations
              | from the norm, and with a high-traffic system the signal
              | for what the norm is would be very strong.
             | 
             | I tried applying it to a service with much lower traffic
             | and found the bucketing to be extremely fussy.
        
           | [deleted]
        
       | 8040 wrote:
       | I broke this once several years ago. I even use the incident
       | number in my random usernames to see if a Googler recognizes it.
        
         | ajstiles wrote:
         | Wow - that was a doozy.
        
         | ikiris wrote:
         | ahahahahahaha
        
         | orf wrote:
         | How did you break it?
        
           | ikiris wrote:
           | IIRC they didn't not break it.
        
         | [deleted]
        
         | zoover2020 wrote:
         | This is also why I love HN. So niche!
        
       | dijit wrote:
       | Discussion from 2020:
       | https://news.ycombinator.com/item?id=24303422
        
       | klysm wrote:
        | I don't really grasp why this is a useful spot in the trade-off
        | space from a quick skim. Seems risky.
        
         | dijit wrote:
         | There's a good talk on Monarch https://youtu.be/2mw12B7W7RI
         | 
         | Why it exists is laid out quite plainly.
         | 
          | The pain of it is we're all jumping on Prometheus (Borgmon)
          | without considering why Monarch exists. Monarch doesn't have a
          | good analogue outside of Google.
          | 
          | Maybe some weird mix of TimescaleDB backed by CockroachDB with
          | a Prometheus push gateway.
        
           | AlphaSite wrote:
           | Wavefront is based on FoundationDB which I've always found
           | pretty cool.
           | 
           | [1] https://news.ycombinator.com/item?id=16879392
           | 
            | Disclaimer: I work at VMware on an unrelated thing.
        
       | codethief wrote:
       | The first time I heard about Monarch was in discussions about the
       | hilarious "I just want to serve 5 terabytes" video[0].
       | 
       | [0]: https://m.youtube.com/watch?v=3t6L-FlfeaI
        
       | pm90 wrote:
       | A lot of Google projects seem to rely on other Google projects.
       | In this case Monarch relies on spanner.
       | 
        | I guess it's nice to publish at least the conceptual design so
        | that others can implement it in the "rest of the world" case.
        | Working with OSS can be painful, slow, and time-consuming, so this
        | seems like a reasonable middle ground (although selfishly I do
        | wish all of this were source-available).
        
         | praptak wrote:
         | Spanner may be hard to set up even with source code available.
         | It relies on atomic clocks for reliable ordering of events.
        
           | [deleted]
        
         | joshuamorton wrote:
          | I don't think there's any Spanner _necessity_, and IIRC Monarch
          | existed pre-Spanner.
        
           | gttalbot wrote:
           | Correct. Spanner is used to hold configuration state, but is
           | not in the serving path.
        
       | sydthrowaway wrote:
       | Stop overhyping software with buzzwords
        
       ___________________________________________________________________
       (page generated 2022-05-14 23:00 UTC)