[HN Gopher] CNCF's Cortex v1.0: scalable, fast Prometheus implem...
       ___________________________________________________________________
        
       CNCF's Cortex v1.0: scalable, fast Prometheus implementation
        
       Author : netingle
       Score  : 155 points
       Date   : 2020-04-02 12:59 UTC (10 hours ago)
        
 (HTM) web link (grafana.com)
 (TXT) w3m dump (grafana.com)
        
       | netingle wrote:
       | Hi! Tom, one of the Cortex authors here. Super proud of the team
       | and this release - let me know if you have any questions!
        
         | ctovena wrote:
          | Great job Cortex team! Do you think this means Cortex will
          | move to incubation in the CNCF landscape?
        
           | netingle wrote:
            | I hope so! Goutham is applying for incubation as we speak...
        
             | RichiH wrote:
             | This will also depend on SIG o11y, the creation of which is
             | currently being voted on by CNCF TOC. TOC vote is looking
             | good and projects which have been in sandbox for some time
             | are obvious candidates for early review.
        
         | number101010 wrote:
         | Hey Tom!
         | 
         | Can you outline how Cortex differs from some of the other
         | available Prometheus backends?
        
           | netingle wrote:
           | Sure, check out this talk from PromCon I did with Bartek, the
           | Thanos author: https://grafana.com/blog/2019/11/21/promcon-
           | recap-two-househ...
        
             | MetalMatze wrote:
             | Love that talk. :)
        
       | rfratto wrote:
       | Great job Cortex team!
        
       | throwaway50203 wrote:
       | Reminder: github star history is in no way a measure of quality.
        
       | mattmendick wrote:
       | Really exciting! Well done
        
       | nopzor wrote:
       | awesome job by the cortex team!
       | 
       | there's a lot of good questions, and some confusion in this
       | thread. here is my view. note: i'm definitely biased; am the co-
       | founder/ceo at grafana labs.
       | 
       | - at grafana labs we are huge fans of prometheus. it has become
       | the most popular metrics backend for grafana. we view cortex and
       | prometheus as complementary. we are also very active contributors
       | to the prometheus project itself. in fact, cortex vendors in
       | prometheus.
       | 
       | - you can think of cortex as a scale-out, multi-tenant, highly
       | available "implementation" of prometheus itself.
       | 
        | - grafana labs has put so many resources into cortex because
        | it powers our grafana cloud product (which offers a prometheus
        | backend). like grafana itself, we are also actively working on
        | an enterprise edition of cortex that is designed to meet the
        | security and feature requirements of the largest companies in
        | the world.
       | 
       | - yes, cortex was born at weaveworks in 2016. tom wilkie (vp of
       | product at grafana labs) co-created it while he worked there.
       | after tom joined grafana labs in 2018, we decided to pour a lot
       | more resources into the project, and managed to convince
       | weave.works to move it to the cncf. this was a great move for the
       | project and the community, and cortex has come a long long way in
       | the last 2 years.
       | 
       | once again, a big hat tip to everyone who made this release
       | possible. a big day for the project, and for prometheus users in
       | general!
       | 
       | [edit: typos]
        
         | Florin_Andrei wrote:
         | I'm worried about this statement:
         | 
         | > _Local storage is explicitly not production ready at this
         | time._
         | 
         | https://cortexmetrics.io/docs/getting-started/getting-starte...
         | 
         | But I want a scale-out, multitenant implementation of
         | Prometheus with local storage that's ready for prod. What are
         | my options then? VictoriaMetrics?
        
           | netingle wrote:
           | There are a bunch of different solutions out there; Thanos,
           | Influx, federated Prometheus etc.
           | 
           | The local Cortex storage works pretty well but we have a very
           | high bar for production worthiness. Right now I'd recommend
            | using Bigtable or DynamoDB, and if you're on premise,
           | Cassandra. In the future the block storage will allow you to
           | run minio.
        
           | gouthamve wrote:
           | The only one I know with "non-experimental" local-storage is
           | VictoriaMetrics. But the big thing there is that data in VM
           | is not replicated, so when you lose a disk/node, you lose
           | that data.
           | 
           | Having said that, both Thanos and Cortex have experimental
           | local-storage modes that are pretty good. You could also try
           | them for now while they get production ready.
        
             | simonrobb wrote:
              | M3 provides local storage that is not experimental, and
              | on top of that offers cluster replication, which
              | VictoriaMetrics does not provide, plus a Kubernetes
              | operator to help scale out a cluster.
             | 
             | Disclosure: I work on the TSDB underlying M3 (M3DB) at
             | Uber. Still worth checking out though!
        
             | Florin_Andrei wrote:
              | > _data in VM is not replicated, so when you lose a
              | disk/node, you lose that data_
             | 
             | The vmstorage component in VictoriaMetrics Server - is it
              | RAID0-like (striping) or RAID1-like (mirroring)?
             | 
             | https://github.com/VictoriaMetrics/VictoriaMetrics/tree/clu
             | s...
        
           | prungta wrote:
           | I suggest checking out M3DB[1]. My team & I use it to serve
           | metrics for all of Uber, we have ~1500 hosts across various
           | clusters. It's serving us quite well.
           | 
           | [1]: https://github.com/m3db/m3
        
           | ecnahc515 wrote:
           | Thanos is probably one of the other popular choices. It's
           | being heavily used in production by a number of companies,
          | but I don't think they've branded it as "prod ready" in a
          | 1.0 release, though.
        
             | sciurus wrote:
             | Thanos doesn't have production support for local storage
             | either. The only stable storage providers for it are
             | google, amazon, and azure's object stores.
             | 
             | https://thanos.io/storage.md/
             | 
             | Interestingly, it looks like Cortex's support for local
             | storage and object stores comes from using Thanos's storage
             | engine. So once it's production ready in Thanos it will
             | probably be production-ready in Cortex shortly thereafter.
             | 
             | https://cortexmetrics.io/docs/operations/blocks-storage/
             | 
             | I think for Cortex your safest storage options now are
             | Bigtable, DynamoDB, or Cassandra.
        
               | ecnahc515 wrote:
                | I may have misinterpreted what they meant by local
                | storage! I was reading that as having a local copy of
                | the TSDB available to Prometheus (e.g. how Thanos
                | works), versus Cortex, which doesn't store metrics
                | locally (IIRC).
               | 
               | What you said is correct and makes sense. Though, I would
               | suspect either choice works with any S3 compatible API
               | that can run on local storage, but I know that isn't
               | necessarily what's meant by "local storage".
        
               | Florin_Andrei wrote:
               | "local storage" = I don't want to install yet another
               | gizmo just to store data, nor do I want to use an
               | external service for that
               | 
               | Batteries included.
        
         | m0rphling wrote:
          | Please note the difference between _complimentary_ and
          | _complementary_. It's a common homophone confusion in
          | English.
          | 
          | The former means free of charge, or expressing praise or a
          | compliment.
          | 
          | The latter means that different things go well together and
          | enhance each other's qualities.
        
           | nopzor wrote:
           | thanks for the complimentary tip ;) fixed.
        
       | kapilvt wrote:
       | also props to https://weave.works for creating cortex, open-
       | sourcing it and moving it under cncf, something this blog post
       | leaves out.
        
       | stuff4ben wrote:
       | This was a Weaveworks project right?
        
         | gouthamve wrote:
          | Yes, it was created at Weaveworks, but it was later donated
          | to the CNCF, and now the community is much bigger! Having
          | said that, Weaveworks is still a major contributor!
        
       | Rapzid wrote:
       | Dat architecture tho: https://cortexmetrics.io/docs/architecture/
       | . Holy bi-gebus.
        
         | netingle wrote:
          | That's the "microservices" mode - you can run it as a single
          | process and the architecture becomes super boring.
          | 
          | It's like looking at the module interdependencies of a
          | reasonably large piece of software; of course it's going to
          | look complicated.
        
       | zytek wrote:
       | Congrats to Grafana Team!
       | 
       | If you're looking at scaling your Prometheus setup - check out
       | also Victoria Metrics.
       | 
       | Operational simplicity and scalability/robustness are what drive
       | me to it.
       | 
        | I use it to send metrics from multiple Kubernetes clusters
        | running Prometheus - each cluster has Prom with a remote_write
        | directive sending metrics to a central VictoriaMetrics service.
       | 
       | That way my "edge" prometheus installations are practically
       | "stateless", easily set up using prometheus-operator. You don't
       | even need to add persistent storage to them.
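A minimal sketch of the setup described above, as it might appear in each edge cluster's Prometheus configuration. The hostname is a placeholder; single-node VictoriaMetrics listens on port 8428 and accepts the Prometheus remote_write protocol at /api/v1/write:

```yaml
# prometheus.yml on each "edge" cluster: ship every scraped sample to a
# central VictoriaMetrics instance instead of keeping long-term local state.
remote_write:
  - url: http://victoria-metrics.central.example:8428/api/v1/write
```

With this in place, the edge Prometheus only needs enough local disk to buffer samples during brief outages of the central service.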
        
       | ones_and_zeros wrote:
       | Isn't prometheus an implementation and not an interface? I have
       | "prometheus" running in my cluster, if it's not cortex, what
       | implementation am I using?
        
         | gouthamve wrote:
          | Yes, you're running the Prometheus server. But Cortex is a
          | Prometheus-API-compatible service that scales horizontally
          | and has multi-tenancy and other things built in.
        
         | netingle wrote:
          | Yes, Prometheus is an implementation - the HN title has a
          | limited number of characters, so I thought "Prometheus
          | implementation" conveyed the fact that Cortex was trying to
          | be a 100% API-compatible implementation of Prometheus, but
          | with scalability, replication, etc.
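To make "API compatible" concrete, here is a sketch of building a standard Prometheus instant-query request against a Cortex endpoint. The hostname, tenant name, and helper function are hypothetical; the real detail is that Cortex identifies tenants via the X-Scope-OrgID header, while the query path and response shape match the Prometheus HTTP API:

```python
# Sketch: any Prometheus API client can talk to Cortex; multi-tenant
# deployments additionally expect an X-Scope-OrgID header per request.
from urllib.parse import urlencode
from urllib.request import Request

def build_query_request(base_url: str, tenant: str, promql: str) -> Request:
    """Build a Prometheus-compatible instant-query request for one tenant."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': promql})}"
    return Request(url, headers={"X-Scope-OrgID": tenant})

# Hypothetical endpoint and tenant, for illustration only.
req = build_query_request("http://cortex.example.internal", "team-a", "up")
# urlopen(req) would return JSON in the same shape a Prometheus server emits.
```

Because the query surface is the same, tools like Grafana can point at Cortex as an ordinary Prometheus data source.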
        
           | cat199 wrote:
           | how about:
           | 
           | CNCF's Cortex v1.0: scalable, fast Prometheus API
           | implementation ready for prod (grafana.com)
           | 
           | saves 1 char.
        
         | ownagefool wrote:
         | It's kinda several things
         | 
         | - The OSS product
         | 
         | - The Storage Format (I guess)
         | 
         | - The Interface for pulling metrics
         | (https://github.com/OpenObservability/OpenMetrics)
         | 
         | I haven't dug into cortex even a little, but the other comments
         | are suggesting it's API compatible but essentially claiming
         | they're production ready because they'll give you things the
         | OSS project won't give you out of the box, i.e. long term
         | storage and RBAC.
         | 
         | Looks like a good thing.
        
           | netingle wrote:
           | > wrapping prometheus and giving you that production
           | readyness that they're claiming the OSS project won't give
           | you out of the box
           | 
           | No! Prometheus is and has been production ready for many
           | years. Cortex is a clustered/horizontally scalable
            | implementation of the Prometheus APIs, and Cortex has just gone
           | production ready. Sorry for the confusion.
        
             | ownagefool wrote:
             | Just want to say, I use prometheus. It's amazing.
             | 
              | But readiness depends somewhat on your use case. If
              | you're on a multi-tenanted cluster and you don't want to
              | explicitly trust your users / admins, how do you stop
              | them from messing with your metrics whilst allowing them
              | to maintain their own?
             | 
             | I typically did it via github flow, some others used the
             | operator to give us many proms, some others would just
             | suggest it's missing features.
             | 
              | Indeed, I could probably have worded my example better,
              | though. Apologies if I was putting words in your mouth.
        
             | RichiH wrote:
             | And I have Prometheus data from 2015, so I would argue
             | that's long-term.
        
         | outworlder wrote:
         | You are using Prometheus.
         | 
         | However, Prometheus can use different storage backends. The
         | TSDB that it comes with is horrible.
         | 
         | I mean, it's workable. And can store an impressive amount of
         | data points. If you don't care about historical data or scale,
         | it may be all you need.
         | 
         | However, if your scale is really large, or if you care about
         | the data, it may not be the right solution, and you'll need
         | something like Cortex.
         | 
            | For instance, Prometheus' own TSDB has no 'fsck'-like tool.
         | From time to time, it does compaction operations. If your
         | process (or pod in K8s) dies, you may be left with duplicate
         | time series. And now you have to delete some (or a lot!) of
         | your data to recover.
         | 
         | Prometheus documentation, last I checked, even says it is not
         | suitable for long-term storage.
        
           | sagichmal wrote:
           | > The TSDB that it comes with is horrible.
           | 
           | The TSDB in Prometheus since 2.0 is excellent for its use
           | case.
        
           | ecnahc515 wrote:
           | The TSDB it uses is actually pretty state of the art. I think
           | your pain point is more that it's designed for being used on
           | local disk, but that doesn't mean it isn't possible to store
           | the TSDB remotely. In fact, this is exactly how Thanos works.
           | 
            | The docs say Prometheus is not intended for long-term
            | storage because, without a remote_write configuration, all
            | data is persisted locally, and thus you will eventually
            | hit limits on the amount that can be stored and queried
            | locally. However, that is a limitation of how Prometheus
            | is designed, not how the TSDB is designed, and it can be
            | overcome by using a remote_write adapter.
        
       ___________________________________________________________________
       (page generated 2020-04-02 23:00 UTC)