[HN Gopher] Monarch: Google's Planet-Scale In-Memory Time Series...
___________________________________________________________________

Monarch: Google's Planet-Scale In-Memory Time Series Database

Author : mlerner
Score  : 173 points
Date   : 2022-05-14 16:12 UTC (6 hours ago)

(HTM) web link (www.micahlerner.com)
(TXT) w3m dump (www.micahlerner.com)

yegle wrote:
| Google Cloud Monitoring's time series database is backed by
| Monarch.
|
| The query language is MQL, which closely resembles the internal
| Python-based query language:
| https://cloud.google.com/monitoring/mql

sleepydog wrote:
| MQL is an improvement over the internal language, IMO. There are
| some missing features around literal tables, but otherwise the
| language is more consistent and flexible.

holly76 wrote:

candiddevmike wrote:
| Interesting that Google replaced a pull-based metric system
| similar to Prometheus with a push-based system... I thought one
| of the selling points of Prometheus and the pull-based dance was
| how scalable it was?

lokar wrote:
| It's sort of a pull/push hybrid. The client connects to the
| collection system and is told how often to send each metric (or
| group of them) back over that same connection. You configure
| per-target/metric collection policy centrally.

jeffbee wrote:
| Prometheus itself has no scalability at all. Without distributed
| evaluation they hit a brick wall.

gttalbot wrote:
| This. Any new large query or aggregation in the
| Borgmon/Prometheus model requires re-solving federation, plus
| ongoing maintenance of runtime configuration. That might
| technically be scalable, in the sense that you could do it, but
| you have to maintain it and pay the labor cost. It's not
| practical over a certain size or system complexity. It's also
| friction: you can only do the queries you can afford to set up.
|
| That's why Google spent all that money to build Monarch. At the
| end of the day Monarch is vastly cheaper in person-time and
| resources than manually configured Borgmon/Prometheus. And there
| is much less friction in trying new queries, etc.

halfmatthalfcat wrote:
| Can you elaborate? I've run Prometheus at some scale and it's
| performed fine.

lokar wrote:
| You pretty quickly exceed what one instance can handle for
| memory, CPU, or both. At that point you don't have any really
| good options to scale while maintaining a flat namespace (you
| need to partition).

preseinger wrote:
| Sure? Prometheus scales with a federation model, not a single
| flat namespace.

gttalbot wrote:
| This means each new query over a certain size becomes a
| federation problem, so the friction for trying new things
| becomes very high above the scale of a single instance.
|
| Monitoring as a service has a lot of advantages.

preseinger wrote:
| Well, you obviously don't issue metrics queries over arbitrarily
| large datasets, right? The Prometheus architecture reflects this
| invariant. You constrain queries against both time and domain
| boundaries.

gttalbot wrote:
| Monarch can support both ad-hoc and periodic, standing queries
| of arbitrarily large size, and has the means to spread the
| computation out over many intermediate mixer and leaf nodes. It
| does query push-down so that the "expensive" parts of
| aggregations, joins, etc., can be done in massively parallel
| fashion at the leaf level.
|
| It scales so well that many aggregations are set up and computed
| for every service across the whole company (CPU, memory usage,
| error rates, etc.).
| For basic monitoring you can run a new service in production and
| go look at a basic dashboard for it without doing anything else
| to set up monitoring.

gttalbot wrote:
| Also, very large ad-hoc queries are supported, with really good
| user isolation, so that (in general) users don't harm each
| other.

preseinger wrote:
| Prometheus is highly scalable?? What are you talking about??

dijit wrote:
| It is not.
|
| It basically does the opposite of what every scalable system
| does.
|
| To get HA you double your number of pollers.
|
| To scale your queries you aggregate them into other prometheii.
|
| If this is scalability: everything is scalable.

preseinger wrote:
| I don't understand how the properties you're describing imply
| that Prometheus isn't scalable.
|
| High availability always requires duplication of effort. Scaling
| queries always requires sharding and aggregation at some level.
|
| I've deployed stock Prometheus at global scale, O(100k) targets,
| with great success. You have to understand and buy into
| Prometheus' architectural model, of course.

dijit wrote:
| By the ways in which you can scale Prometheus, you can scale
| anything.
|
| It does not, itself, have highly scalable properties built in.
|
| It does not do sharding, it does not do proxying, it does not do
| batching; it does not do anything that would allow it to run as
| multiple servers and query over multiple servers.
|
| Look, I'm not saying that it doesn't work; but when I read about
| Borgmon and Prometheus, I understood the design goal was
| intentionally not to solve these hard problems, and instead to
| use them as primitive time series systems that can be deployed
| with a small footprint basically everywhere (and individually
| queried).
|
| I submit to you: I could also have an InfluxDB on every server
| and get the same "scalability".
|
| The difference is that I can actually run a huge InfluxDB
| cluster with a dataset that exceeds the capabilities of a single
| machine.

preseinger wrote:
| It seems like you're asserting a very specific definition of
| scalability that excludes Prometheus' scalability model.
| Scalability is an abstract property of a system that can be
| achieved in many different ways. It doesn't require any specific
| model of sharding, batching, query replication, etc. Do you not
| agree?

dijit wrote:
| You need to go back to computer science class.
|
| I'm not defining terms.
|
| https://en.wikipedia.org/wiki/Database_scalability
|
| Scalability means running a single workload across multiple
| machines.
|
| Prometheus intentionally does not scale this way.
|
| I'm not being mean, it is fact.
|
| It has made engineering design trade-offs, and one of those
| means it is not built to scale. This is fine; I'm not here
| pooping on your baby.
|
| You can build scalable systems on top of things which do not
| individually scale.

preseinger wrote:
| Scalability isn't a well-defined term, and Prometheus isn't a
| database. :shrug:

dijit wrote:
| Wrong on both counts.
|
| Sorry for being rude, but this level of ignorance is extremely
| frustrating.

jeffbee wrote:
| Prometheus cannot evaluate a query over time series that do not
| fit in the memory of a single node; therefore it is not
| scalable.
|
| The fact that it could theoretically ingest an infinite amount
| of data that it cannot thereafter query is not very interesting.

preseinger wrote:
| It can? It just partitions the query over multiple nodes?
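To make the query push-down gttalbot describes above concrete: each
leaf computes a partial aggregate over its local shard, and a mixer
merges the cheap partial results. Below is a minimal Go sketch of
that scatter-gather idea; the partialSum, leafAggregate, and
mixerMerge names are hypothetical illustrations, not Monarch's
actual API.

    package main

    import (
        "fmt"
        "sync"
    )

    // partialSum is the per-leaf result of a pushed-down
    // aggregation. Only this small struct crosses the network;
    // the expensive scan stays next to the data.
    type partialSum struct {
        sum   float64
        count int
    }

    // leafAggregate stands in for a leaf node scanning its local shard.
    func leafAggregate(shard []float64) partialSum {
        var p partialSum
        for _, v := range shard {
            p.sum += v
            p.count++
        }
        return p
    }

    // mixerMerge stands in for a mixer combining the partial
    // results into a global mean.
    func mixerMerge(parts []partialSum) float64 {
        var total partialSum
        for _, p := range parts {
            total.sum += p.sum
            total.count += p.count
        }
        if total.count == 0 {
            return 0
        }
        return total.sum / float64(total.count)
    }

    func main() {
        shards := [][]float64{{1, 2, 3}, {4, 5}, {6, 7, 8, 9}}
        parts := make([]partialSum, len(shards))
        var wg sync.WaitGroup
        for i, s := range shards {
            wg.Add(1)
            go func(i int, s []float64) { // fan out to the "leaves" in parallel
                defer wg.Done()
                parts[i] = leafAggregate(s)
            }(i, s)
        }
        wg.Wait()
        fmt.Println("global mean:", mixerMerge(parts)) // prints 5
    }

Because only the partial results move over the network, this shape
of query scales with the number of leaves rather than with the size
of any single node's memory.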
lokar wrote:
| Lots of systems provide redundancy with 2x cost. It's not that
| hard.

dilyevsky wrote:
| You can set up distributed evaluation similar to how it was done
| in Borgmon, but you have to do it manually (or maybe write an
| operator to automate it). One of Monarch's core ideas is to do
| that behind the scenes for you.

jeffbee wrote:
| Prometheus' own docs say that distributed evaluation is "deemed
| infeasible".

dilyevsky wrote:
| That's just like.. you know.. their opinion, man.
| https://prometheus.io/docs/prometheus/latest/federation/

bpicolo wrote:
| Prometheus federation isn't distributed evaluation. It
| "federates" from other nodes onto a single node.
|
| > Federation allows a Prometheus server to scrape selected time
| series from another Prometheus server

dilyevsky wrote:
| Collect directly from shard-level Prometheus, then aggregate
| using /federate at another level. That's how Thanos also works,
| AFAIK.

buro9 wrote:
| That's what Mimir solves.

deepsun wrote:
| How does it compare to VictoriaMetrics?

dilyevsky wrote:
| It was originally push, but I think they went back to a sort of
| scheduled pull mode after a few years. There was a very in-depth
| review doc written about this internally, which maybe will get
| published some day.

atdt wrote:
| What's the go/ link?

dilyevsky wrote:
| Can't remember - just search on moma /s

gttalbot wrote:
| Pull collection eventually became a real scaling bottleneck for
| Monarch.
|
| The way the "pull" collection worked: there was an external
| process-discovery mechanism, which the leaf used to find the
| entities it was monitoring; the leaf backend processes would
| connect to an endpoint that each monitored entity's collection
| library listened on; and those collection libraries would stream
| the metric measurements back according to the schedules that the
| leaves sent.
|
| Several problems.
|
| First, the leaf-side data structures and TCP connections became
| very expensive. If a leaf process is connecting to many, many
| thousands of monitored entities, TCP buffers aren't free,
| keep-alives aren't free, and neither is a host of other data
| structures. Eventually this became an...interesting...fraction
| of the CPU and RAM on these leaf processes.
|
| Second, this implies a service-discovery mechanism so that the
| leaves can find the entities to monitor. This was a combination
| of code in Monarch and an external discovery service, and it was
| a constant source of headaches and outages, as the appearance
| and disappearance of entities is really spiky and unpredictable.
| Any burp in the operation of the discovery service could cause a
| monitoring outage as well. Relatedly, the technical "powers that
| be" decided that the particular discovery service, of which
| Monarch was the largest user, wasn't really something that was
| suitable for the infrastructure at scale. This decision was made
| largely independently of Monarch, but required Monarch to move
| off.
|
| Third, Monarch does replication, up to three ways. In the
| pull-based system it wasn't possible to guarantee that the
| measurement each replica sees is the same measurement with the
| same microsecond timestamp. This was a huge data-quality issue
| that made the distributed queries much harder to make correct
| and performant. Also, the clients had to pay, both in persistent
| TCP connections on their side and in RAM, state machines, etc.,
| for this replication, as a connection would be made from each
| backend leaf process holding a replica for a given client.
|
| Fourth, persistent TCP connections and load balancers don't
| really play well together.
|
| Fifth, not everyone wants to accept incoming connections in
| their binary.
|
| Sixth, if the leaf process doesn't need to know the collection
| policies for all the clients, those policies don't have to be
| distributed and updated on all of them. At scale this matters
| for both machine resources and reliability. This can be made a
| separate service, pushed to the "edge", etc.
|
| Switching from a persistent connection to having the clients
| push measurements in distinct RPCs as they were recorded
| eventually solved all of these problems. It was a very intricate
| transition that took a long time. A lot of people worked very
| hard on this, and should be very proud of their work. I hope
| some of them jump in to the discussion! (At the very least
| they'll add things I missed/didn't remember... ;^)
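A minimal sketch of the push model the comment above ends with,
focused on the third problem: the client samples and timestamps a
measurement once, then fans the identical record out to each
replica, so all replicas agree on the microsecond timestamp. The
measurement and replica types here are hypothetical, not Monarch's
actual collection protocol.

    package main

    import (
        "fmt"
        "time"
    )

    // measurement is a hypothetical pushed record, not Monarch's schema.
    type measurement struct {
        metric string
        tsUsec int64 // one client-side timestamp shared by all replicas
        value  float64
    }

    // replica stands in for a leaf backend holding one copy of the data.
    type replica struct {
        name string
        data []measurement
    }

    func (r *replica) push(m measurement) {
        // In the push model this would be a distinct RPC per sample,
        // with no persistent per-client TCP connection held by the leaf.
        r.data = append(r.data, m)
    }

    func main() {
        replicas := []*replica{{name: "leaf-a"}, {name: "leaf-b"}, {name: "leaf-c"}}

        // Sample once, timestamp once, fan the identical record out.
        // Under pull, each replica would instead scrape at its own
        // time and could observe a different value and timestamp.
        m := measurement{
            metric: "rpc_latency_ms",
            tsUsec: time.Now().UnixMicro(),
            value:  42.0,
        }
        for _, r := range replicas {
            r.push(m)
        }

        for _, r := range replicas {
            fmt.Println(r.name, r.data[0].tsUsec) // identical timestamps
        }
    }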
nickstinemates wrote:
| Too small for me, I was looking more for the scale of the
| universe.

yayr wrote:
| In case this can be deployed single-handed, it might be useful
| on a spaceship... it would need some relativistic time
| accounting, though.

kasey_junk wrote:
| A _huge_ difference between Monarch and other TSDBs that isn't
| outlined in this overview is that a storage primitive for schema
| values is the histogram. Most TSDBs (maybe all besides Circonus)
| try to create histograms at query time using counter primitives.
|
| All of those query-time histogram aggregations are making pretty
| subtle trade-offs that make analysis fraught.

teraflop wrote:
| Is it really that different from, say, the way Prometheus
| supports histogram-based quantiles?
| https://prometheus.io/docs/practices/histograms/
|
| Granted, it looks like Monarch supports a more cleanly-defined
| schema for distributions, whereas Prometheus just relies on you
| to define the buckets yourself and follow the convention of
| using an "le" label to expose them. But the underlying
| representation (an empirical CDF) seems to be the same, and so
| the accuracy trade-offs should also be the same.

spullara wrote:
| Much different. When you are reporting histograms you can
| combine them and see the true p50 or whatever across all the
| individual systems reporting the metric.

[deleted]

nvarsj wrote:
| Can you elaborate a bit? You can do the same in Prometheus by
| summing the bucket counts. Not sure what you mean by "true p50"
| either. With buckets it's always an approximation based on the
| bucket widths.

spullara wrote:
| Ah, I misunderstood what you meant. If you are reporting static
| buckets, I get how that is better than what folks typically do,
| but how do you know the buckets a priori? Others back their
| histograms with things like
| https://github.com/tdunning/t-digest. It is pretty powerful, as
| the buckets are dynamic based on the data and histograms can be
| added together.

gttalbot wrote:
| Yes. This. Also, displaying histograms in heatmap format can
| allow you to intuit the behavior of layered distributed systems,
| caches, etc. Relatedly, exemplars allowed tying related data to
| histogram buckets. For example, RPC traces could be tied to the
| latency bucket & time at which they complete, giving a natural
| means to tie metrics monitoring and tracing, so you can "go to
| the trace with the problem". This is described in the paper as
| well.
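To illustrate the exchange above: merging pre-aggregated histograms
is element-wise addition of bucket counts (lossless when the
buckets match, as nvarsj notes), while any percentile read off the
merged histogram must be interpolated inside a bucket, which is the
approximation spullara and teraflop are weighing. A minimal Go
sketch with made-up bucket bounds, not Monarch's or Prometheus'
actual bucketing:

    package main

    import "fmt"

    // hist is a fixed-bucket histogram. The upper bounds below are
    // hypothetical, chosen only for illustration.
    type hist struct {
        bounds []float64 // upper bound of each bucket, ascending
        counts []uint64  // observations per bucket
    }

    // merge sums bucket counts element-wise; with identical buckets
    // this is lossless, which is what makes histograms mergeable
    // across many reporting systems.
    func (h *hist) merge(o hist) {
        for i := range h.counts {
            h.counts[i] += o.counts[i]
        }
    }

    // quantile estimates a percentile by linear interpolation inside
    // the bucket containing the target rank. This interpolation is
    // the approximation under discussion: coarse buckets produce
    // jumps and plateaus in the resulting percentile lines.
    func (h hist) quantile(q float64) float64 {
        var total uint64
        for _, c := range h.counts {
            total += c
        }
        rank := q * float64(total)
        var seen uint64
        lower := 0.0
        for i, c := range h.counts {
            if c > 0 && float64(seen+c) >= rank {
                frac := (rank - float64(seen)) / float64(c)
                return lower + frac*(h.bounds[i]-lower)
            }
            seen += c
            lower = h.bounds[i]
        }
        return h.bounds[len(h.bounds)-1]
    }

    func main() {
        bounds := []float64{1, 5, 25, 125}
        a := hist{bounds, []uint64{10, 40, 5, 1}} // one server's histogram
        b := hist{bounds, []uint64{20, 15, 3, 0}} // another server's
        a.merge(b)
        fmt.Printf("merged p50 ~= %.2f\n", a.quantile(0.5)) // ~2.24
    }

The interpolation step is also why poorly tuned fixed buckets
produce the jumps and artificial plateaus hn_go_brrrrr describes
below.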
spullara wrote:
| Wavefront also has histogram ingestion (I wrote the original
| implementation; I'm sure it is much better now). Hugely
| important if you ask me, but honestly I don't think that many
| customers use it.

sujayakar wrote:
| I've been pretty happy with Datadog's distribution type [1],
| which uses their own approximate histogram data structure [2]. I
| haven't evaluated its error bounds deeply in production yet, but
| I haven't had to tune any bucketing. The linked paper [3] claims
| a fixed percentage of relative error per percentile.
|
| [1] https://docs.datadoghq.com/metrics/distributions/
|
| [2] https://www.datadoghq.com/blog/engineering/computing-
| accurat...
|
| [3] https://arxiv.org/pdf/1908.10693.pdf

hn_go_brrrrr wrote:
| In my experience, Monarch storing histograms and being unable to
| rebucket on the fly is a big problem. A percentile line on a
| histogram will be incredibly misleading, because it's trying to
| figure out what the p50 of a bunch of buckets is. You'll see
| monitoring artifacts like large jumps and artificial plateaus as
| a result of how requests fall into buckets. The bucketer on the
| default RPC latency metric might not be well tuned for your
| service. I've seen countless experienced oncallers tripped up by
| this, because "my graphs are lying to me" is not their first
| thought.

heinrichhartman wrote:
| Circonus histograms solve that by using a universal bucketing
| scheme. Details are explained in this paper:
| https://arxiv.org/abs/2001.06561
|
| Disclaimer: I am a co-author.

kasey_junk wrote:
| My personal opinion is that they should have used a log-linear
| histogram, which solves the problems you mention (with other
| trade-offs), but to me the big news was making the db flexible
| enough to have that data type.
|
| Leaving the world of a single numeric type for each datum will
| influence the next generation of open source metrics DBs.

jrockway wrote:
| I definitely remember a lot of time spent tweaking histogram
| buckets for performance vs. accuracy. The default bucketing
| algorithm at the time was powers of 4, or something very unusual
| like that.

shadowgovt wrote:
| That's because powers of four were great for the original
| application: statistics on high-traffic services, where the
| primary thing the user was interested in was deviations from the
| norm, and with a high-traffic system the signal for what the
| norm is would be very strong.
|
| I tried applying it to a service with much lower traffic and
| found the bucketing to be extremely fussy.

[deleted]

8040 wrote:
| I broke this once several years ago. I even use the incident
| number in my random usernames to see if a Googler recognizes it.

ajstiles wrote:
| Wow - that was a doozy.

ikiris wrote:
| ahahahahahaha

orf wrote:
| How did you break it?

ikiris wrote:
| IIRC they didn't not break it.

[deleted]

zoover2020 wrote:
| This is also why I love HN. So niche!

dijit wrote:
| Discussion from 2020:
| https://news.ycombinator.com/item?id=24303422

klysm wrote:
| From a quick skim, I don't really grasp why this is a useful
| spot in the trade-off space. Seems risky.

dijit wrote:
| There's a good talk on Monarch: https://youtu.be/2mw12B7W7RI
|
| Why it exists is laid out quite plainly.
|
| The pain of it is that we're all jumping on Prometheus (Borgmon)
| without considering why Monarch exists. Monarch doesn't have a
| good analogue outside of Google.
|
| Maybe some weird mix of TimescaleDB backed by CockroachDB with a
| Prometheus push gateway.
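Returning to the bucketing subthread above (kasey_junk's log-linear
suggestion, heinrichhartman's universal bucketing, jrockway's
powers-of-4 default): a log-linear scheme computes the bucket from
the value itself, so every writer agrees on the bucketing and
histograms merge without per-metric tuning. A rough Go sketch; the
octave and sub-bucket constants are illustrative, not the actual
scheme of Circonus, Prometheus, or Monarch.

    package main

    import (
        "fmt"
        "math"
    )

    // bucketIndex maps a positive value to a log-linear bucket:
    // pick the power-of-two "octave" of the value, then split each
    // octave into 4 linear sub-buckets. Relative error per bucket
    // is bounded, so no one has to choose bounds a priori.
    func bucketIndex(v float64) int {
        if v <= 0 {
            return math.MinInt // out-of-band bucket for non-positive values
        }
        exp := math.Floor(math.Log2(v)) // logarithmic part
        frac := v/math.Pow(2, exp) - 1  // linear part, in [0, 1)
        return int(exp)*4 + int(frac*4) // values below 1 map to negative indices
    }

    func main() {
        // Nearby values land in the same or adjacent buckets at
        // every scale, with no per-metric configuration.
        for _, v := range []float64{1, 1.4, 2, 3.1, 100, 105, 1000} {
            fmt.Printf("%7.1f -> bucket %d\n", v, bucketIndex(v))
        }
    }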
AlphaSite wrote:
| Wavefront is based on FoundationDB, which I've always found
| pretty cool.
|
| [1] https://news.ycombinator.com/item?id=16879392
|
| Disclaimer: I work at VMware on an unrelated thing.

codethief wrote:
| The first time I heard about Monarch was in discussions about
| the hilarious "I just want to serve 5 terabytes" video [0].
|
| [0]: https://m.youtube.com/watch?v=3t6L-FlfeaI

pm90 wrote:
| A lot of Google projects seem to rely on other Google projects.
| In this case Monarch relies on Spanner.
|
| I guess it's nice to publish at least the conceptual design so
| that others can implement it in the "rest of the world" case.
| Working with OSS can be painful, slow, and time-consuming, so
| this seems like a reasonable middle ground (although selfishly I
| do wish all of this were source-available).

praptak wrote:
| Spanner may be hard to set up even with source code available.
| It relies on atomic clocks for reliable ordering of events.

[deleted]

joshuamorton wrote:
| I don't think there's any Spanner _necessity_, and IIRC Monarch
| existed pre-Spanner.

gttalbot wrote:
| Correct. Spanner is used to hold configuration state, but is not
| in the serving path.

sydthrowaway wrote:
| Stop overhyping software with buzzwords.
___________________________________________________________________
(page generated 2022-05-14 23:00 UTC)