[HN Gopher] Amazon Managed Service for Prometheus
       ___________________________________________________________________
        
       Amazon Managed Service for Prometheus
        
       Author : pdelgallego
       Score  : 115 points
       Date   : 2020-12-15 17:47 UTC (5 hours ago)
        
 (HTM) web link (aws.amazon.com)
 (TXT) w3m dump (aws.amazon.com)
        
       | pickledish wrote:
       | 14 cents per "query processing minute" sounds like it could add
       | up very fast. Prom queries can get somewhat complex and it's not
       | rare at all IME to have a dashboard making several multi-second
       | queries per load (whether that falls into "you're using
       | Prometheus wrong" being a separate discussion of course)
       | 
       | Edit: The example from their pricing page:
       | 
       | > We will assume you have 1 end user monitoring a dashboard for
       | an average of 2 hours per day refreshing it every 60 seconds with
       | 20 chart widgets per dashboard (assuming 1 PromQL query per
       | widget)... assuming 18ms per query for this example.
       | 
       | Comes out to over $3 per month in query costs. Replace this 1
       | person with a TV showing the dashboard all day, and the cost
       | jumps to $36, for just one dashboard and (again IME) overly fast
       | query estimates... o.O
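For what it's worth, AWS's example arithmetic checks out; a quick sketch (all rates and parameters taken from the pricing example quoted above):

```python
# Sketch of AWS's own pricing example (all parameters from the quote above).
PRICE_PER_QUERY_MINUTE = 0.14  # $0.14 per query processing minute

def monthly_query_cost(hours_per_day, refresh_s=60, widgets=20,
                       ms_per_query=18, days=30):
    """Monthly query cost for one dashboard viewer at the quoted rates."""
    refreshes_per_day = hours_per_day * 3600 / refresh_s
    query_seconds_per_day = refreshes_per_day * widgets * ms_per_query / 1000
    query_minutes_per_month = query_seconds_per_day * days / 60
    return query_minutes_per_month * PRICE_PER_QUERY_MINUTE

print(round(monthly_query_cost(2), 2))   # dashboard watched 2h/day: ~$3.02
print(round(monthly_query_cost(24), 2))  # always-on TV: ~$36.29
```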
        
         | edoceo wrote:
          | Now do six dashboards, 10 widgets each, multiple viewers,
          | 18h/day, and one slowish query on each dashboard. Seems like we
          | get to $100+ per month pretty quickly.
        
           | bboreham wrote:
           | Caching means that multiple viewers cost very little extra.
           | 
           | (I am a Cortex maintainer)
        
         | gravypod wrote:
         | Does it put any limits on cardinality of metrics? Grafana
         | cloud's offering was absolutely awful for my use cases. They
         | charge per-series so if you have metrics with a "pod=..." label
         | your prices go through the roof.
        
           | valyala wrote:
            | Grafana cloud sets high prices for high-cardinality metrics
            | because the underlying system - Cortex - isn't well optimized
            | for storing a high number of unique time series. For example,
            | it requires at least 15GB of RAM to process a million active
            | time series [1]. This means high infrastructure costs, which
            | increase pricing for end users. Other systems such as
            | VictoriaMetrics require up to 15x less RAM for the same
            | metric cardinality [2].
           | 
           | [1] https://github.com/cortexproject/cortex/blob/67648aabae70
           | f19...
           | 
           | [2] https://victoriametrics.github.io/#capacity-planning
        
           | heliodor wrote:
           | Plenty has been written about not using the
           | server/container/pod id as a label because it leads to high
           | cardinality which leads to poor performance (cost aside).
           | Time series databases have been purpose-built for certain
           | workloads and you can consider this their weakness.
        
             | gravypod wrote:
             | Plenty has also been written about the bugs/issues that
             | have cropped up that are only visible when inspecting what
             | regions/nodes/cgroups an issue is coming from [0]. My use
             | case wasn't exactly `pod=...` but it was very similar. It
             | was more like `device=...`. Also, for a huge application,
             | it's not uncommon to have 100s or even 1000s of metrics
             | that are important to application health/performance.
             | Constantly saying "do you really need X? It will cost us Y"
             | will lead to an extremely under-monitored application.
             | 
             | [0] - https://cloud.google.com/blog/products/management-
             | tools/sre-...
        
               | heliodor wrote:
               | Plenty of companies run their own servers because cloud
               | is too expensive at their scale. Same goes for metrics.
               | It's a direct result of one-price-fits-all pricing models
               | for software as well as pricing that is not correctly
               | tied to value.
        
           | kasey_junk wrote:
            | Every managed metrics system will put a limit on cardinality,
            | because in all mainstream metrics systems the cost of storing
            | and querying grows with cardinality. If they didn't limit it,
            | you could assume that you or some other customer would use up
            | the cluster's resources and cause an outage.
            | 
            | Like most metrics systems, under the covers Prometheus treats
            | each unique combination of label values as a separate time
            | series.
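The cardinality point can be made concrete: since every distinct label combination is its own series, series count multiplies across labels. A toy sketch (the label names and value counts are made up for illustration):

```python
from itertools import product

# Toy illustration: each distinct combination of label values is its own
# time series, so series count is the product of per-label value counts.
# (Label names and counts here are invented for the example.)
label_values = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
    "pod":    [f"pod-{i}" for i in range(100)],  # per-pod label: x100
}

series = list(product(*label_values.values()))
print(len(series))  # 3 * 3 * 100 = 900 series for one metric name
```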
        
           | webo wrote:
           | I like Weave Cloud's Prometheus hosting model -- it's per
           | host, which is predictable and forecastable.
        
       | latchkey wrote:
       | I just went through the "process" of installing Grafana, Loki,
        | Promtail and Prometheus on an Ubuntu box and it is almost like
        | the company behind all of this has gone out of their way to
       | make it hard. It isn't really _that_ difficult to get set up, but
       | it also isn't 'apt install' easy (you really want me to create my
       | own startup scripts?) and required me to build my own
       | documentation on how I installed everything.
        
         | rfratto wrote:
         | One of the Loki maintainers here (though I mostly work on other
         | stuff now). I promise it's not difficult on purpose.
         | 
          | We've put so much effort into optimizing the Kubernetes
          | experience that non-containerized installations haven't been
          | getting as much attention. We'd be thrilled to have system
         | packages for Loki that also set it up as a service, it's just
         | not something we've been able to spend time doing ourselves
         | yet.
        
           | latchkey wrote:
           | It isn't just loki, but the whole stack. Grafana is the only
           | project mentioned that has a debian installer.
           | 
           | The expectation that someone doing greenfield development is
           | going to jump into k8s just to use the software is kind of
           | weird.
        
             | qz2 wrote:
             | I'm deploying it (prom, alertmanager, pushgateway, grafana)
             | on native hardware via ansible and it's not difficult. Not
             | Loki (yet). It's all just go binaries you fire up with
             | systemd with a single config file.
             | 
             | I find it harder to deploy reliably on kubernetes with
             | persistent volumes etc.
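A bare-metal deployment like the one qz2 describes really can be as small as one unit file per binary; a minimal sketch (the paths, user, and flags here are illustrative assumptions, not official packaging):

```ini
# /etc/systemd/system/prometheus.service -- illustrative sketch only
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
```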
        
         | 0xbadcafebee wrote:
         | All of those who have spent their free time contributing to
         | Linux distributions are why 'apt install' is easy. You can
         | contribute too.
        
           | latchkey wrote:
           | As the co-founder of Apache Java and a 20+ year member of the
           | ASF, creator and contributor to hundreds of projects over the
           | years, I think I've contributed enough of my time to OSS. I'm
           | more than happy to let the new kids jump in. Thanks for the
           | 'advice'.
        
         | john_moscow wrote:
         | It's almost like the company behind it wants to see some profit
         | after pouring millions of dollars into developing these tools.
         | Except, in 2020 you cannot just have a closed-source easy-to-
         | use documented and supported product with a license fee. Not in
         | the server market, at least. Everything must be free and open-
         | source, and you are expected to make money by offering a hosted
         | service. Except, good luck competing with Big Cloud.
        
           | RocketSyntax wrote:
           | It's extremely worrisome. The incentive to spend your early
           | mornings, nights, and weekends building something awesome to
           | free yourself from corporate life is fading away. They need
           | to institute some kind of royalty program or at least
           | dedicate engineers to helping maintain the projects they make
           | into services.
           | 
           | Almost have to change gears and get into a scientific field
           | that isn't computer science.
        
       | WoahNoun wrote:
        | Everyone here complaining about the pricing on the managed
        | Grafana and Prometheus services has clearly never worked at a
       | shop using SumoLogic. Log/metric processing/querying is expensive
       | for a reason.
        
       | eminence32 wrote:
       | From the pricing section:
       | 
       | > AMP counts each metric sample ingested to the secured
       | Prometheus-compatible endpoint. AMP also calculates the stored
       | metric samples and metric metadata in gigabytes (GB), where 1GB
       | is 230 bytes.
       | 
       | Surely that's a typo, right?
        
         | biot wrote:
         | Likely a casualty of copy and paste that left out the
         | superscript formatting. 1GB is 2^30 bytes.
        
       | alexhf wrote:
       | I don't see any mention of Pushgateway. They'll need to add that
       | or I won't be able to monitor ephemeral jobs.
        
         | mchene wrote:
         | Hey... Marc here from AWS. I'm the PM lead for this service.
         | Thank you for the feedback. Pushgateway is important for our
         | customers and it is a feature we are looking to support as part
         | of our roadmap. For the time being, you can continue to use the
         | Pushgateway as you do today and remote write the metrics to AMP
         | for long term storage and querying!
        
       | pram wrote:
       | Yeah I dunno about this, and the grafana service. They're not
       | exactly complicated to run on their own. At this pricing you may
       | as well be on Datadog.
        
         | nrmitchi wrote:
         | I've commented fairly heavily in the related Grafana thread.
         | 
         | Prometheus is a bit of a different story. It _does_ have some
         | operational overhead when you get to a certain point, and
         | scaling it out is not always trivial.
         | 
         | Assuming it works, there is value-add on this one, and the
         | pricing is more in line with _active use_ (ie, a cost+ model,
         | which is more typical of AWS services)
        
           | [deleted]
        
           | valyala wrote:
           | Amazon Managed Service for Prometheus is based on Cortex. It
           | is quite expensive in terms of operational and infrastructure
           | costs compared to VictoriaMetrics [1] according to case
            | studies from VictoriaMetrics users [2]. This may explain
            | AMP's comparatively high costs.
           | 
           | [1] https://victoriametrics.github.io/FAQ.html#what-is-the-
           | diffe...
           | 
           | [2] https://victoriametrics.github.io/CaseStudies.html
           | 
            | Disclaimer: I'm a core developer of VictoriaMetrics, so feel
            | free to ask any questions about it or about our competitors
            | :)
        
         | zander312 wrote:
         | Scaling prometheus across multiple separate Kubernetes clusters
         | is a fking nightmare.
        
           | zytek wrote:
           | Try VictoriaMetrics. Deploy stateless Prometheuses that
            | remote_write to a central VictoriaMetrics instance.
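The setup zytek describes is just Prometheus's standard `remote_write` block; a sketch (the endpoint host is an assumption — VictoriaMetrics accepts the Prometheus remote write protocol on port 8428 by default):

```yaml
# Fragment of prometheus.yml on each edge Prometheus (host is illustrative)
remote_write:
  - url: "http://victoriametrics.internal:8428/api/v1/write"
```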
        
         | stevekemp wrote:
          | This seems the more interesting of the two; grafana is pretty
          | simple to set up and maintain. The harder part is handling the
         | metrics themselves, be it with influxdb, prometheus, or
         | something else.
        
         | markcartertm wrote:
          | Setting up one Prometheus server is easy. Scaling, HA, and
          | metrics retention for more than 3 days, not so much.
        
           | heliodor wrote:
           | Look at VictoriaMetrics (and the related products vmalert and
           | vmagent) for a much easier and pleasant experience as a drop-
           | in Prometheus replacement.
        
         | Thaxll wrote:
         | Prometheus is not easy to run at scale on the storage side.
        
           | pram wrote:
           | This is all relative but I don't personally think so. Not on
           | EC2+EBS, anyway. Certainly not as difficult as
           | running/scaling an ES or Kafka cluster.
        
             | Thaxll wrote:
             | It's a completely different problem because by default
              | Prometheus does not shard anything, so you're bound to a
              | single instance, whereas ES and Kafka are cluster-based.
        
         | 0xbadcafebee wrote:
         | You could say the same about any SaaS based on open source, but
         | people still find it useful
        
       | slyall wrote:
        | The pricing just for the ingest seems way off. $0.002 for 10,000
        | metrics might not seem like much, but even a simple node_exporter
        | will grab 700 metrics every 15 seconds.
        | 
        | That's $24/month just to ingest the cpu/ram/diskspace data from
        | each server. Plus storage and query costs.
       | 
       | At work I have a single r4.xlarge instance handling 1.3 million
       | metrics every 15 seconds. Storage is not clustered but cost is
       | only $500/month. It would cost me $45k/month just for the ingest
       | with the new managed service.
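slyall's figures are easy to verify; a quick sketch using the quoted $0.002 per 10,000 samples:

```python
# Rough check of the ingest-cost figures quoted in the comment above.
PRICE_PER_SAMPLE = 0.002 / 10_000  # $0.002 per 10,000 samples ingested

def monthly_ingest_cost(series, interval_s=15, days=30):
    """Cost to ingest `series` time series scraped every `interval_s` seconds."""
    samples = series * (86_400 / interval_s) * days
    return samples * PRICE_PER_SAMPLE

print(round(monthly_ingest_cost(700), 2))     # one node_exporter: ~$24/month
print(round(monthly_ingest_cost(1_300_000)))  # 1.3M series: ~$45k/month
```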
        
         | mchusma wrote:
          | Their pricing for these managed services used to be a "no-
          | brainer" (something like the cost of compute only, or maybe a
          | <30% upcharge). Managed Airflow was similarly very expensive
          | (maybe 3x the cost). Just not worth it. Bummer.
        
       | vishuk wrote:
        | Do we know which scalable Prometheus backend they are running?
        | Chronosphere? Thanos?
        
         | bboreham wrote:
         | It's Cortex, though the particular configuration shares a lot
         | of code with Thanos.
         | 
         | (I am a Cortex maintainer)
        
         | bmurphy1976 wrote:
         | The Grafana blog post mentions Cortex, something I'm not
         | familiar with:
         | 
         | https://grafana.com/blog/2020/12/15/announcing-amazon-manage...
        
       ___________________________________________________________________
       (page generated 2020-12-15 23:00 UTC)