       [HN Gopher] Grafana OnCall: an easy-to-use on-call management tool
       ___________________________________________________________________
        
       Author : sciurus
       Score  : 161 points
       Date   : 2021-11-09 17:16 UTC (5 hours ago)
        
 (HTM) web link (grafana.com)
 (TXT) w3m dump (grafana.com)
        
       | marcoboffi wrote:
       | But is it possible to send SMS/phone calls directly from Grafana
       | OnCall? If yes, is there pricing for it?
        
       | markbnj wrote:
       | I'm a grafana fan and a current user of PagerDuty. Maybe there's
       | more to the story but after reading the post I feel like using a
       | calendar integration to manage on-call schedules is the wrong
       | approach. Calendar events are a result of overlaying a rotation
       | on a date range: they're the output, not the input. I'm sure the
       | designers here have looked at how PD enables creating and editing
       | rotations. Curious to know their views on it.
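
A minimal sketch of the "rotation in, calendar events out" idea from the comment above. The function name `expand_rotation`, the example names, and the weekly shift length are illustrative assumptions, not how PagerDuty or Grafana OnCall model schedules.

```python
from datetime import date, timedelta


def expand_rotation(people, start, end, shift_days=7):
    """Return (shift_start, shift_end, person) tuples covering [start, end).

    The rotation (ordered people plus a shift length) is the input;
    the dated shifts are derived from it.
    """
    shifts = []
    cursor = start
    index = 0
    while cursor < end:
        # The last shift is clipped so it never runs past the range end.
        shift_end = min(cursor + timedelta(days=shift_days), end)
        shifts.append((cursor, shift_end, people[index % len(people)]))
        cursor = shift_end
        index += 1
    return shifts


# Hypothetical three-person weekly rotation over four weeks.
shifts = expand_rotation(
    ["alice", "bob", "carol"], date(2021, 11, 1), date(2021, 11, 29)
)
for shift_start, shift_end, person in shifts:
    print(shift_start, "->", shift_end, person)
```

Editing the rotation (order, shift length, members) and re-running the expansion regenerates every event, which is the workflow the comment argues for over editing calendar entries by hand.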
        
       | motakuk wrote:
       | Hey everyone, Matvey, ex-CEO of Amixr, is here. Ildar Iskhakov
       | and I started this project three years ago because we used to
       | be on-call ourselves and needed better tools. It was an amazing
       | journey from 0 to 1. Tons of coding, first customers,
       | fundraising, iterating, and finally the honor to join Grafana
       | Labs and build Grafana OnCall! I'll be happy to answer your
       | questions if you have any.
        
         | joaoqalves wrote:
         | It's great to see more competition in this space. Generally
         | speaking, what I miss in these "incident management" products
         | is also an integrated, flawless way to handle incidents _when_
         | they're happening. I'm talking about:
         | 
         | 1. Quickly creating a proper chat
         | 2. Quickly creating an incident document where you can pin chat
         |    messages and use it in the post-mortem. Ideally, pinning some
         |    graphs that you'd extract from your observability solutions
         | 3. Having a status page to put a small description for
         |    non-technical stakeholders.
         | 
         | PagerDuty covers some of this. Monzo's Response [1] and now
         | incident.io [2] try to cover it too. I'd like to have this
         | experience end-to-end.
         | 
         | 1 - https://github.com/monzo/response
         | 2 - https://incident.io/
        
           | igetspam wrote:
           | I use incident.io. Pretty happy with it. Very responsive
           | team.
        
           | hrpnk wrote:
           | Monzo's solution does not seem to be actively maintained, is
           | it?
           | 
           | +100 on the creation of incident chat rooms and pinning data
           | to re-use in incident docs. There is nothing worse than
           | copying the timeline events from one tool to a Google Doc.
        
             | joaoqalves wrote:
             | AFAIK, the creators created incident.io as a spin-off [1]
             | :) Smart move, I must say.
             | 
             | 1 - https://www.indexventures.com/perspectives/incidentio-
             | raises...
        
         | SeriousM wrote:
         | Hi! Thanks for sharing this news. Will this be available for
         | on-premise installations, and when?
        
           | motakuk wrote:
           | For now, we are focusing on rolling out Grafana OnCall in
           | Grafana Cloud. It's a very common use case to have such a
           | system outside of your infrastructure so it won't be affected
           | by issues there. It should stay alive even when everything
           | goes wrong.
           | 
           | We've already received multiple questions about OSS and
           | on-premises. We'll roll out the cloud version first, see how it
           | works, collect feedback and build (and share) future plans!
        
         | bilalq wrote:
         | This looks really neat. We don't use Grafana today. We're
         | running CloudWatch/insights and Squadcast for alerting, but
         | deep integration with the monitoring tool looks cool. Is this
         | usable with self-hosted or AWS managed Grafana?
        
           | motakuk wrote:
           | Yep! The idea of Grafana OnCall is to help you group,
           | deduplicate, route & deliver alerts from any source to
           | Slack/SMS/phone. It could be CloudWatch, DataDog, self-hosted
           | Alertmanager, or Grafana of course. The only requirement for
           | the alert source is to be able to generate a webhook and send
           | it to us.
        
             | bilalq wrote:
             | Can Grafana OnCall itself be self-hosted and/or run as a
             | part of Grafana itself? Your last response makes it sound
             | like it's a separate product with integrations rather than
             | an extension of Grafana. Is that correct?
        
               | motakuk wrote:
               | It's 100% part of the Grafana Cloud, not a separate
               | product. It's deeply integrated with the rest of Grafana.
               | 
               | At the same time, we've focused on making it useful for
               | those who don't use Grafana for monitoring. Feel free to
               | sign up for Grafana Cloud and use just OnCall if you want.
        
       | halfmatthalfcat wrote:
       | Is there really anybody else in the "Pager" category of SaaS
       | products other than PagerDuty that has any traction?
        
         | aiisjustanif wrote:
         | xMatters
        
         | bboreham wrote:
         | Opsgenie?
        
         | therealdrag0 wrote:
         | We use OpsGenie. Not sure how widely it's used, but given it's
         | Atlassian I'd guess a non-trivial amount.
        
         | bgm1975 wrote:
         | There's Splunk OnCall (formerly known as VictorOps). It's a
         | very decent solution.
        
         | bilalq wrote:
         | We started using Squadcast: https://squadcast.com
         | 
         | Their free and lower-priced tiers offer a lot of what others
         | have on their top/most expensive tiers. Also, integrations with
         | various alert sources are just easier in most cases. I spent I
         | don't know how long trying to get OpsGenie to work before I
         | gave up.
        
         | fredman wrote:
         | There is xMatters: https://www.xmatters.com/
         | 
         | Disclaimer: I work at xMatters.
        
         | Forfold wrote:
         | I work on/for an open source solution that we based off of
         | PagerDuty, called GoAlert: https://github.com/target/goalert
        
           | awestman wrote:
           | Yep. This is a great product. Has the features you need, is
           | super reliable and easy to manage.
        
           | craigching wrote:
           | Target uses GoAlert across the enterprise for all on-call.
           | Definitely enterprise capable!
        
         | abhishekjha wrote:
         | Also, what happens if PagerDuty goes down?
        
           | jq-r wrote:
           | Your service(s) going down and PagerDuty going fully down at
           | the same time is very unlikely. Even if it does happen, you're
           | probably going to get called by customer support, because
           | users never go down ;)
        
           | kevindong wrote:
           | In the year I used it, I never personally noticed it going
           | down. That being said, their SLA is only 99.9% delivery
           | within 5 minutes in any calendar month. The penalty for
           | missing that SLA is only 10% of that month's bill.
           | 
           | > Once an Incident is triggered, PagerDuty will deliver the
           | First Responder Alert within the Notification Delivery Period
           | for 99.9% of the notifications sent by PagerDuty for the
           | Customer during any calendar month. The "Notification
           | Delivery Period" is five (5) minutes and it is measured as
           | the time it takes PagerDuty to deliver a First Responder
           | Alert to telecommunication providers in accordance with the
           | Service configuration and Contact Information.
           | 
           | > ...
           | 
           | > If PagerDuty fails to meet the SLA set forth herein,
           | Customer may receive a service credit. Customer will be
           | eligible for a credit toward future fees owed to PagerDuty
           | for the PagerDuty Service. The Service Credit is calculated
           | as ten percent (10%) of the fees paid for or attributable to
           | the month when the alleged SLA breach occurred.
           | 
           | https://www.pagerduty.com/standard-service-level-agreement/
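
To put the quoted SLA terms in numbers, here is a small arithmetic sketch. The notification volume and the monthly fee are made-up inputs. Only the 99.9% threshold and the 10% credit come from the quoted text.

```python
def max_late_notifications(total_sent, late_per_thousand=1):
    """99.9% on time means at most 1 in 1,000 may miss the 5-minute window.

    Integer math avoids floating-point rounding on the 0.1% figure.
    """
    return total_sent * late_per_thousand // 1000


def service_credit_cents(monthly_fee_cents, credit_percent=10):
    """Credit owed if the SLA is missed: 10% of that month's fees."""
    return monthly_fee_cents * credit_percent // 100


# Hypothetical month: 10,000 notifications sent, a $500.00 bill.
allowed_late = max_late_notifications(10_000)
credit = service_credit_cents(50_000)
print(f"late notifications tolerated: {allowed_late}")
print(f"credit if the SLA is missed: ${credit / 100:.2f}")
```

Note that the quoted SLA measures delivery to telecommunication providers, so time spent inside the carrier network is outside this calculation.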
        
           | vorpalhex wrote:
           | It's very rare for them to go down. I think I can remember
           | one major outage during business hours in the last few years
           | at which point we just switched to manual monitoring for the
           | few hours.
           | 
           | If that is within your outage model, you'd probably want a
           | redundant on-call service I suppose, even if it's just
           | escalating to a single known email or sms group.
        
           | julianlam wrote:
           | Ideally, the services you use should handle that (detect a
           | non-200 and fire off a backup method like a slack webhook or
           | email.)
           | 
           | In reality, probably a lot of missed downtime events, and ops
           | sleeping peacefully I guess.
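
The "detect a non-200 and fire off a backup method" idea can be sketched as below. The channels are plain callables returning an HTTP-style status code so the example runs without a network. `notify_with_fallback`, `broken_pager`, and `slack_webhook` are hypothetical names, not part of any real client library.

```python
def notify_with_fallback(primary, fallbacks, message):
    """Try the primary channel, then each fallback in order.

    Each channel is a (name, send) pair where send(message) returns an
    HTTP-style status code. A non-200 status or a raised exception counts
    as a failure. Returns the name of the channel that accepted the
    message, or None if every channel failed.
    """
    for name, send in [primary, *fallbacks]:
        try:
            status = send(message)
        except Exception:
            status = None  # treat transport errors like a failed delivery
        if status == 200:
            return name
    return None


# Stub channels standing in for real HTTP calls (assumptions for the demo).
def broken_pager(message):
    return 503


def slack_webhook(message):
    return 200


delivered_via = notify_with_fallback(
    ("pager", broken_pager), [("slack", slack_webhook)], "db is down"
)
print(delivered_via)
```

In a real setup each `send` would wrap an HTTP request with a timeout, since a paging service that hangs is as bad as one that returns an error.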
        
         | armiiller wrote:
         | PagerTree - https://pagertree.com
        
           | coderchix wrote:
           | My team uses PagerTree. Easy to get started with, has the
           | tools you need without being overcomplicated.
        
         | saminzadeh wrote:
         | DataDog also launched their own Incident Management tool, not
         | sure how widely it's used:
         | https://www.datadoghq.com/blog/incident-response-with-datado...
        
         | haliskerbas wrote:
         | Technically Splunk On-Call. But I have a few pain points with
         | it, and I miss PagerDuty.
         | 
         | If you want to see which teams you are on as the currently
         | logged-in user, the only way to do it, as far as support told
         | me, is to search for yourself and then check that result.
        
           | rconti wrote:
           | I see my teams listed under my user profile. Or if I go to
           | the left side bar and click on my name, it says when I'm next
           | on-call for various teams. But the UI looks different than
           | last time I logged in a few weeks ago, so maybe something has
           | changed.
           | 
           | Disclaimer: Am an employee.
        
         | dvtrn wrote:
         | I've been seeing BetterUptime (which has an on-call feature)
         | recommended more and more, and have been keeping a passive eye
         | on it myself: https://betteruptime.com/incident-management
        
       | moepstar wrote:
       | A few more screenshots of the "Scheduling" options would've been
       | great...
       | 
       | We're (more or less) using OpsGenie's free tier, however their
       | scheduling never really "clicked" with me... not sure if I'm
       | special in that regard, however I find the UI/UX pretty...
       | weird...
        
       | CSDude wrote:
       | > Alerts from each integration: 300 per 5 minutes
       | 
       | > Alerts from the whole team: 500 per 5 minutes
       | 
       | > API requests per API key: 300 per 5 minutes
       | 
       | Product looks great but those API request limits are too low,
       | because alerts rain when you are having incidents and rate
       | limiting all of them is harmful. That's why other products have
       | deduplication keys / aliases so you don't miss important ones.
       | 
       | https://grafana.com/docs/grafana-cloud/oncall/oncall-api-ref...
        
         | named-user wrote:
         | How else do you think they are gonna make money?
        
         | CameronNemo wrote:
         | _That's why other products have deduplication keys / aliases
         | so you don't miss important ones._
         | 
         | Care to link to the docs? I'm interested.
        
           | CSDude wrote:
           | https://support.atlassian.com/opsgenie/docs/what-is-alert-
           | de...
           | 
           | https://support.pagerduty.com/docs/event-management
        
             | CameronNemo wrote:
             | Thanks for the links.
             | 
             | From the article:
             | 
             |  _With Grafana OnCall's automatic grouping of alerts within
             | Slack, you can avoid alert storms and reduce the noise your
             | teams are exposed to during an incident._
             | 
             | Seems like the same feature described using different
             | terminology.
        
               | EwanToo wrote:
               | The output alerts feature looks largely the same, but the
               | input API limits are the part in question.
               | 
               | What happens if you get 1000 API calls about "Alert 1"
               | and 1 API call about "Alert 2"?
               | 
               | You want both alerts to trigger on-call once, but will
               | alert 2 get through?
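
The scenario in this comment depends on whether rate limiting happens before or after deduplication. The sketch below contrasts the two orderings for a single 5-minute window. It is a toy model under stated assumptions (one window, the 300-request limit quoted earlier in the thread, and "Alert 2" arriving last), and it does not describe how Grafana OnCall actually behaves.

```python
def limit_then_dedupe(events, limit):
    """Reject raw requests past the limit first, then collapse duplicates."""
    accepted = events[:limit]
    return list(dict.fromkeys(accepted))  # order-preserving dedupe


def dedupe_then_limit(events, limit):
    """Collapse duplicates by key first, then apply the limit to unique alerts."""
    unique = list(dict.fromkeys(events))
    return unique[:limit]


# 1000 calls about "alert-1" followed by a single call about "alert-2".
events = ["alert-1"] * 1000 + ["alert-2"]

print(limit_then_dedupe(events, 300))
print(dedupe_then_limit(events, 300))
```

In the first ordering the single "alert-2" call falls outside the accepted requests and is lost. In the second, both alerts survive because the 1000 duplicates count once.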
        
         | deeblering4 wrote:
         | I'd think that receiving even 1/5th the rate limit in a 5
         | minute window would be disorienting enough to render alerting
         | effectively useless.
         | 
         | I'd question the configuration which fires that many alerts in
         | that time frame, and suggest improving alert aggregations and
         | dependencies to get the number down to one or a handful of
         | meaningful alerts.
        
           | curryst wrote:
           | The overhead of maintaining those configurations all the time
           | is usually too high to be worth it considering the benefit
           | and likelihood of reaping it.
           | 
           | Also, in my experience with those systems, they only make
           | sense to use very sparingly. Your monitoring becomes
           | extremely fragile when your aggregations and dependencies get
           | complicated enough that "what will our alerting system do
           | when X happens?" results in a flow chart with 18 steps.
           | 
           | If you aren't careful, you can end up making your
           | aggregations less useful than the raw alerts would be.
        
           | rmetzler wrote:
           | It would be great to have a dependency graph or labels in the
           | alerts, so they are easily mapped to the things that can
           | break and are important enough to be monitored.
           | 
           | We just had a short outage where an editor removed the index
           | page in the cms which is central to the site. It's stupid
           | that this is possible but we just operate the cms while we
           | build and operate everything around it for our customer.
           | 
           | I think a large part of our alerts were triggered all at
           | once but the one thing they had in common was that the alerts
           | all pointed to the index page in the cms. E.g. the public www
           | alert for index, the public api alert for index, the preview
           | www alert for index, the preview api alert for index....
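
A rough sketch of the labelling idea in this comment: if every alert carries a label naming the thing it depends on, alerts that fire together can be grouped by that label. The `resource` label, its values, and the extra `disk usage` alert are hypothetical. The four index alerts mirror the ones listed above.

```python
from collections import defaultdict


def group_by_label(alerts, label):
    """Group alert names by the value of one label.

    Alerts without the label land in an "unlabelled" bucket.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = alert["labels"].get(label, "unlabelled")
        groups[key].append(alert["name"])
    return dict(groups)


alerts = [
    {"name": "public www index", "labels": {"resource": "cms:index"}},
    {"name": "public api index", "labels": {"resource": "cms:index"}},
    {"name": "preview www index", "labels": {"resource": "cms:index"}},
    {"name": "preview api index", "labels": {"resource": "cms:index"}},
    # Hypothetical unrelated alert, added to show a second group.
    {"name": "disk usage", "labels": {"resource": "db-1"}},
]

grouped = group_by_label(alerts, "resource")
for resource, names in grouped.items():
    print(resource, "<-", ", ".join(names))
```

With this grouping the four simultaneous alerts collapse into one entry pointing at the CMS index page, which is the common cause the comment describes.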
        
       | steveBK123 wrote:
       | For a product that's been around 12 years, I've been surprised at
       | how minimally featured PagerDuty is.
       | 
       | It's missing stuff like national holiday awareness, integration
       | with vacation calendars, a better UI for swapping days/overrides,
       | etc.
       | 
       | PD schedule checking and trade negotiation becomes yet another
       | thing in the long list of things I need to do when taking a day
       | off. HR system request off, Department Outlook calendar update,
       | PagerDuty coverage check, Outlook out-of-office status & auto-
       | replies, Slack set away, update status AND pause notifications.
       | 
       | I suppose that's because as an on-call developer I am not the
       | user. The user, management who bought the product, gets KPIs &
       | pretty graphs, so they are happy.
        
         | ethbr0 wrote:
         | Every delightful, successful developer product is eventually
         | doomed to become JIRA.
        
           | rvnx wrote:
           | A multi-billion USD success story?
        
             | ethbr0 wrote:
             | That's one way to look at it.
        
       ___________________________________________________________________
       (page generated 2021-11-09 23:00 UTC)