[HN Gopher] CNCF's Cortex v1.0: scalable, fast Prometheus implem... ___________________________________________________________________ CNCF's Cortex v1.0: scalable, fast Prometheus implementation Author : netingle Score : 155 points Date : 2020-04-02 12:59 UTC (10 hours ago) (HTM) web link (grafana.com) (TXT) w3m dump (grafana.com) | netingle wrote: | Hi! Tom, one of the Cortex authors here. Super proud of the team | and this release - let me know if you have any questions! | ctovena wrote: | Great job Cortex team! Do you think this means Cortex will move | to incubation in the CNCF landscape? | netingle wrote: | I hope so! Goutham is applying for incubation as we speak... | RichiH wrote: | This will also depend on SIG o11y, the creation of which is | currently being voted on by the CNCF TOC. The TOC vote is looking | good, and projects which have been in sandbox for some time | are obvious candidates for early review. | number101010 wrote: | Hey Tom! | | Can you outline how Cortex differs from some of the other | available Prometheus backends? | netingle wrote: | Sure, check out this talk from PromCon I did with Bartek, the | Thanos author: https://grafana.com/blog/2019/11/21/promcon-recap-two-househ... | MetalMatze wrote: | Love that talk. :) | rfratto wrote: | Great job Cortex team! | throwaway50203 wrote: | Reminder: github star history is in no way a measure of quality. | mattmendick wrote: | Really exciting! Well done | nopzor wrote: | awesome job by the cortex team! | | there are a lot of good questions, and some confusion in this | thread. here is my view. note: i'm definitely biased; am the | co-founder/ceo at grafana labs. | | - at grafana labs we are huge fans of prometheus. it has become | the most popular metrics backend for grafana. we view cortex and | prometheus as complementary. we are also very active contributors | to the prometheus project itself. in fact, cortex vendors in | prometheus. 
| | - you can think of cortex as a scale-out, multi-tenant, highly | available "implementation" of prometheus itself. | | - the reason grafana labs put so many resources into cortex is | that it powers our grafana cloud product (which offers a | prometheus backend). like grafana itself, we are also actively | working on an enterprise edition of cortex that is designed to | meet the security and feature requirements of the largest | companies in the world. | | - yes, cortex was born at weaveworks in 2016. tom wilkie (vp of | product at grafana labs) co-created it while he worked there. | after tom joined grafana labs in 2018, we decided to pour a lot | more resources into the project, and managed to convince | weave.works to move it to the cncf. this was a great move for the | project and the community, and cortex has come a long, long way in | the last 2 years. | | once again, a big hat tip to everyone who made this release | possible. a big day for the project, and for prometheus users in | general! | | [edit: typos] | Florin_Andrei wrote: | I'm worried about this statement: | | > _Local storage is explicitly not production ready at this | time._ | | https://cortexmetrics.io/docs/getting-started/getting-starte... | | But I want a scale-out, multitenant implementation of | Prometheus with local storage that's ready for prod. What are | my options then? VictoriaMetrics? | netingle wrote: | There are a bunch of different solutions out there: Thanos, | Influx, federated Prometheus, etc. | | The local Cortex storage works pretty well, but we have a very | high bar for production worthiness. Right now I'd recommend | using Bigtable or DynamoDB, and if you're on-premise, Cassandra. | In the future the block storage will allow you to run minio. | gouthamve wrote: | The only one I know with "non-experimental" local storage is | VictoriaMetrics. But the big thing there is that data in VM | is not replicated, so when you lose a disk/node, you lose | that data. 
| | Having said that, both Thanos and Cortex have experimental | local-storage modes that are pretty good. You could also try | them for now while they get production ready. | simonrobb wrote: | M3 provides local storage that is not experimental, along with | cluster replication (which VictoriaMetrics does not provide), | and has a kubernetes operator to help scale out a cluster. | | Disclosure: I work on the TSDB underlying M3 (M3DB) at | Uber. Still worth checking out though! | Florin_Andrei wrote: | > _data in VM is not replicated, so when you lose a disk | /node, you lose that data_ | | The vmstorage component in VictoriaMetrics Server - is it | RAID0-like (striping) or RAID1-like (mirroring)? | | https://github.com/VictoriaMetrics/VictoriaMetrics/tree/clus... | prungta wrote: | I suggest checking out M3DB[1]. My team & I use it to serve | metrics for all of Uber; we have ~1500 hosts across various | clusters. It's serving us quite well. | | [1]: https://github.com/m3db/m3 | ecnahc515 wrote: | Thanos is probably one of the other popular choices. It's | being heavily used in production by a number of companies, | but I don't think they've branded it as "prod ready" in a 1.0 | release though. | sciurus wrote: | Thanos doesn't have production support for local storage | either. The only stable storage providers for it are | Google's, Amazon's, and Azure's object stores. | | https://thanos.io/storage.md/ | | Interestingly, it looks like Cortex's support for local | storage and object stores comes from using Thanos's storage | engine. So once it's production-ready in Thanos, it will | probably be production-ready in Cortex shortly thereafter. | | https://cortexmetrics.io/docs/operations/blocks-storage/ | | I think for Cortex your safest storage options now are | Bigtable, DynamoDB, or Cassandra. | ecnahc515 wrote: | I may have misinterpreted what they meant by local | storage! 
I was reading that as having a local copy of the | TSDB available to Prometheus (e.g. how Thanos works), | versus Cortex, which doesn't store metrics locally (IIRC). | | What you said is correct and makes sense. Though, I would | suspect either choice works with any S3-compatible API | that can run on local storage, but I know that isn't | necessarily what's meant by "local storage". | Florin_Andrei wrote: | "local storage" = I don't want to install yet another | gizmo just to store data, nor do I want to use an | external service for that | | Batteries included. | m0rphling wrote: | Please note the difference between _complimentary_ and | _complementary_. It's a common homophone confusion in English. | | The former means free of charge, or expressing praise or a | compliment. | | The latter means disparate things go well together and enhance | each other's qualities. | nopzor wrote: | thanks for the complimentary tip ;) fixed. | kapilvt wrote: | also props to https://weave.works for creating cortex, open- | sourcing it and moving it under cncf, something this blog post | leaves out. | stuff4ben wrote: | This was a Weaveworks project right? | gouthamve wrote: | Yes, it was created at Weaveworks, but it was later donated to | the CNCF and now the community is much bigger! Having said that, | Weaveworks is still a major contributor! | Rapzid wrote: | Dat architecture tho: https://cortexmetrics.io/docs/architecture/ | . Holy bi-gebus. | netingle wrote: | That's the "microservices" mode - you can run it as a single | process and the architecture becomes super boring. | | It's like looking at the module interdependencies of a reasonably | large piece of software; of course it's going to look | complicated. | zytek wrote: | Congrats to the Grafana team! | | If you're looking at scaling your Prometheus setup - also check | out VictoriaMetrics. | | Operational simplicity and scalability/robustness are what drive | me to it. 
| | I use it to send metrics from multiple Kubernetes clusters with | Prometheus - each cluster having Prom with a remote_write directive | to send metrics to a central VictoriaMetrics service. | | That way my "edge" Prometheus installations are practically | "stateless", easily set up using prometheus-operator. You don't | even need to add persistent storage to them. | ones_and_zeros wrote: | Isn't prometheus an implementation and not an interface? I have | "prometheus" running in my cluster; if it's not cortex, what | implementation am I using? | gouthamve wrote: | Yes, you're running the Prometheus server. But Cortex is a | Prometheus-API-compatible service that scales horizontally and | has multi-tenancy and other things built in. | netingle wrote: | Yes, Prometheus is an implementation - the HN title has a | limited number of words, so I thought "Prometheus | implementation" conveyed the fact that Cortex is trying to be a | 100% API-compatible implementation of Prometheus, but with | scalability, replication, etc. | cat199 wrote: | how about: | | CNCF's Cortex v1.0: scalable, fast Prometheus API | implementation ready for prod (grafana.com) | | saves 1 char. | ownagefool wrote: | It's kinda several things: | | - The OSS product | | - The Storage Format (I guess) | | - The Interface for pulling metrics | (https://github.com/OpenObservability/OpenMetrics) | | I haven't dug into cortex even a little, but the other comments | suggest it's API compatible, and essentially claim it's | production ready because it gives you things the OSS project | won't give you out of the box, e.g. long-term storage and RBAC. | | Looks like a good thing. | netingle wrote: | > wrapping prometheus and giving you that production | readyness that they're claiming the OSS project won't give | you out of the box | | No! Prometheus is and has been production ready for many | years. 
Cortex is a clustered/horizontally scalable | implementation of the Prometheus APIs, and Cortex has just gone | production ready. Sorry for the confusion. | ownagefool wrote: | Just want to say, I use prometheus. It's amazing. | | But readiness depends somewhat on your use case. If you're | on a multi-tenanted cluster and you don't want to explicitly | trust your users / admins, how do you stop them from | messing with your metrics whilst allowing them to maintain | their own? | | I typically did it via github flow; some others used the | operator to give us many proms; some others would just | suggest it's missing features. | | Indeed, I could probably word my example better though. | Apologies if I was putting words in your mouth. | RichiH wrote: | And I have Prometheus data from 2015, so I would argue | that's long-term. | outworlder wrote: | You are using Prometheus. | | However, Prometheus can use different storage backends. The | TSDB that it comes with is horrible. | | I mean, it's workable. And it can store an impressive amount of | data points. If you don't care about historical data or scale, | it may be all you need. | | However, if your scale is really large, or if you care about | the data, it may not be the right solution, and you'll need | something like Cortex. | | For instance, Prometheus' own TSDB has no 'fsck'-like tool. | From time to time, it does compaction operations. If your | process (or pod in K8s) dies, you may be left with duplicate | time series. And now you have to delete some (or a lot!) of | your data to recover. | | Prometheus documentation, last I checked, even says it is not | suitable for long-term storage. | sagichmal wrote: | > The TSDB that it comes with is horrible. | | The TSDB in Prometheus since 2.0 is excellent for its use | case. | ecnahc515 wrote: | The TSDB it uses is actually pretty state of the art. 
I think | your pain point is more that it's designed to be used on | local disk, but that doesn't mean it isn't possible to store | the TSDB remotely. In fact, this is exactly how Thanos works. | | The docs say Prometheus is not intended for long-term storage | because, without a remote_write configuration, all data is | persisted locally, and thus you will eventually hit limits on | the amount that can be stored and queried locally. However, | that is a limitation of how Prometheus is designed, not how | the TSDB is designed, and it can be overcome by using a | remote_write adapter. ___________________________________________________________________ (page generated 2020-04-02 23:00 UTC)
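[editor's note] The remote_write setup several commenters describe (an "edge" Prometheus per cluster shipping samples to a central Cortex or VictoriaMetrics) is a small config change on each Prometheus. A minimal sketch: the push path /api/prom/push and the X-Scope-OrgID tenant header come from the Cortex docs, but the hostname and tenant name here are made up, and the headers: field assumes a Prometheus version that supports per-remote_write headers.

```yaml
# prometheus.yml on an "edge" Prometheus: scrape locally, keep little
# or no local history, and ship all samples to a central store.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

remote_write:
  # cortex.example.internal is a placeholder for your central endpoint;
  # for VictoriaMetrics the path would be its own insert endpoint instead.
  - url: http://cortex.example.internal/api/prom/push
    headers:
      # Cortex is multi-tenant; the tenant is selected via this header.
      X-Scope-OrgID: team-a
```

With this in place the edge Prometheus is effectively stateless, as zytek notes: it only needs enough local disk to buffer samples while the remote end is unreachable.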