[HN Gopher] Grafana OnCall: an easy-to-use on-call management tool
___________________________________________________________________
Grafana OnCall: an easy-to-use on-call management tool

Author : sciurus
Score : 161 points
Date : 2021-11-09 17:16 UTC (5 hours ago)

(HTM) web link (grafana.com)
(TXT) w3m dump (grafana.com)

| marcoboffi wrote:
| But is it possible to send SMS/phone calls directly from Grafana
| OnCall? If yes, is there pricing?
| markbnj wrote:
| I'm a Grafana fan and a current user of PagerDuty. Maybe there's
| more to the story but after reading the post I feel like using a
| calendar integration to manage on-call schedules is the wrong
| approach. Calendar events are a result of overlaying a rotation
| on a date range: they're the output, not the input. I'm sure the
| designers here have looked at how PD enables creating and editing
| rotations. Curious to know their views on it.
| motakuk wrote:
| Hey everyone, Matvey, ex-CEO of Amixr, is here. Ildar Iskhakov
| and I started this project three years ago because we used to
| be on-call ourselves and needed better tools. It was an amazing
| journey from 0 to 1. Tons of coding, first customers,
| fundraising, iterating, and finally the honor to join Grafana
| Labs and build Grafana OnCall! I'll be happy to answer your
| questions if you have any.
| joaoqalves wrote:
| It's great to see more competition in this space. Generally
| speaking, what I miss in these "incident management" products
| is also an integrated, flawless way to handle incidents _when_
| they're happening. I'm talking about:
|
| 1. Quickly creating a proper chat
| 2. Quickly creating an incident document where you can pin chat
|    messages and use it in the post-mortem. Ideally, pinning some
|    graphs that you'd extract from your observability solutions
| 3. Having a status page to put a small description for
|    non-technical stakeholders.
|
| PagerDuty covers some of this.
| Monzo's Response [1] and now
| incident.io [2] try to cover it too. I'd like to have this
| experience end-to-end.
|
| 1 - https://github.com/monzo/response
| 2 - https://incident.io/
| igetspam wrote:
| I use incident.io. Pretty happy with it. Very responsive
| team.
| hrpnk wrote:
| Monzo's solution does not seem to be actively maintained, does
| it?
|
| +100 on the creation of incident chat rooms and pinning data
| to re-use in incident docs. There is nothing worse than
| copying the timeline events from one tool to a Google Doc.
| joaoqalves wrote:
| AFAIK, the creators created incident.io as a spin-off [1]
| :) Smart move, I must say.
|
| 1 - https://www.indexventures.com/perspectives/incidentio-raises...
| SeriousM wrote:
| Hi! Thanks for sharing this news. Will this be available for
| on-premise installations, and when?
| motakuk wrote:
| For now, we are focusing on rolling out Grafana OnCall in
| Grafana Cloud. It's a very common use case to have such a
| system outside of your infrastructure so it won't be affected
| by issues there. It should be alive even when everything
| goes wrong.
|
| We've already received multiple questions about OSS and
| on-premises. We'll roll out the cloud version first, see how it
| works, collect feedback and build (and share) future plans!
| bilalq wrote:
| This looks really neat. We don't use Grafana today. We're
| running CloudWatch/insights and Squadcast for alerting, but
| deep integration with the monitoring tool looks cool. Is this
| usable with self-hosted or AWS managed Grafana?
| motakuk wrote:
| Yep! The idea of Grafana OnCall is to help you to group,
| deduplicate, route & deliver alerts from any source to
| Slack/SMS/phone. It could be CloudWatch, DataDog, self-hosted
| Alertmanager, or Grafana of course. The only requirement for
| the alert source is to be able to generate a webhook and send
| it to us.
| bilalq wrote:
| Can Grafana OnCall itself be self-hosted and/or run as a
| part of Grafana itself?
| Your last response makes it sound
| like it's a separate product with integrations rather than
| an extension of Grafana. Is that correct?
| motakuk wrote:
| It's 100% part of Grafana Cloud, not a separate
| product. It's deeply integrated with the rest of Grafana.
|
| At the same time, we've focused on making it useful for those who
| don't use Grafana for monitoring. Feel free to sign up for
| Grafana Cloud and use just OnCall if you want.
| halfmatthalfcat wrote:
| Is there really anybody else in the "Pager" category of SaaS
| products other than PagerDuty that has any traction?
| aiisjustanif wrote:
| xMatters
| bboreham wrote:
| Opsgenie?
| therealdrag0 wrote:
| We use OpsGenie. Not sure how widely it's used, but given it's
| Atlassian I'd guess a non-trivial amount.
| bgm1975 wrote:
| There's Splunk OnCall (formerly known as VictorOps). It's a
| very decent solution.
| bilalq wrote:
| We started using Squadcast: https://squadcast.com
|
| Their free and lower-priced tiers offer a lot of what others
| have on their top/most expensive tiers. Also, integrations with
| various alert sources are just easier in most cases. I spent I
| don't know how long trying to get OpsGenie to work before I
| gave up.
| fredman wrote:
| There is xMatters: https://www.xmatters.com/
|
| Disclaimer: I work at xMatters.
| Forfold wrote:
| I work on/for an open source solution that we based off of
| PagerDuty, called GoAlert: https://github.com/target/goalert
| awestman wrote:
| Yep. This is a great product. Has the features you need, is
| super reliable and easy to manage.
| craigching wrote:
| Target uses GoAlert across the enterprise for all on-call.
| Definitely enterprise capable!
| abhishekjha wrote:
| Also what happens if PagerDuty goes down?
| jq-r wrote:
| Your service(s) going down and PagerDuty going fully down at
| the same time is very unlikely to happen.
| Even if it does, you're probably
| going to get called by customer support because users never
| go down ;)
| kevindong wrote:
| In the year I used it, I never personally noticed it going
| down. That being said, their SLA is only 99.9%
| delivery in any calendar month within 5 minutes. The penalty
| for missing that SLA is only 10% of that month's bill.
|
| > Once an Incident is triggered, PagerDuty will deliver the
| First Responder Alert within the Notification Delivery Period
| for 99.9% of the notifications sent by PagerDuty for the
| Customer during any calendar month. The "Notification
| Delivery Period" is five (5) minutes and it is measured as
| the time it takes PagerDuty to deliver a First Responder
| Alert to telecommunication providers in accordance with the
| Service configuration and Contact Information.
|
| > ...
|
| > If PagerDuty fails to meet the SLA set forth herein,
| Customer may receive a service credit. Customer will be
| eligible for a credit toward future fees owed to PagerDuty
| for the PagerDuty Service. The Service Credit is calculated
| as ten percent (10%) of the fees paid for or attributable to
| the month when the alleged SLA breach occurred.
|
| https://www.pagerduty.com/standard-service-level-agreement/
| vorpalhex wrote:
| It's very rare for them to go down. I think I can remember
| one major outage during business hours in the last few years,
| at which point we just switched to manual monitoring for those
| few hours.
|
| If that is within your outage model, you'd probably want a
| redundant on-call service I suppose, even if it's just
| escalating to a single known email or SMS group.
| julianlam wrote:
| Ideally, the services you use should handle that (detect a
| non-200 and fire off a backup method like a Slack webhook or
| email).
|
| In reality, probably a lot of missed downtime events, and ops
| sleeping peacefully I guess.
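The fallback idea in the comment above (detect a non-200 from the paging service and fire off a backup method) could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not any vendor's API: `notify_with_fallback` and the `Sender` type are hypothetical names, and each sender is assumed to be a callable that takes the alert text and returns an HTTP status code.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a sender delivers the message and returns an HTTP status code.
Sender = Callable[[str], int]


def notify_with_fallback(
    message: str, senders: Iterable[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each (name, sender) pair in order.

    Returns the name of the first channel that answers with a 2xx status,
    or None if every channel fails.
    """
    for name, send in senders:
        try:
            status = send(message)
        except Exception:
            # A network error counts as a failed delivery, same as a bad status.
            continue
        if 200 <= status < 300:
            return name
    return None
```

In practice the first sender would wrap the paging service's events endpoint and the later ones a Slack webhook or an email relay; a `None` result means nobody was reached and should itself be surfaced somewhere.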
| armiiller wrote:
| PagerTree - https://pagertree.com
| coderchix wrote:
| My team uses PagerTree. Easy to get started with, has the
| tools you need without being overcomplicated.
| saminzadeh wrote:
| DataDog also launched their own Incident Management tool, not
| sure how widely it's used:
| https://www.datadoghq.com/blog/incident-response-with-datado...
| haliskerbas wrote:
| Technically Splunk On-call. But I have a few pain points with
| it, and I miss PagerDuty.
|
| If you want to see what teams you are on as the currently
| logged-in user, the only way to do it, as far as support told
| me, is to search for yourself and then check that result.
| rconti wrote:
| I see my teams listed under my user profile. Or if I go to
| the left side bar and click on my name, it says when I'm next
| on-call for various teams. But the UI looks different than
| last time I logged in a few weeks ago, so maybe something has
| changed.
|
| Disclaimer: Am an employee.
| dvtrn wrote:
| I've been seeing them recommended more and more, and I
| have been keeping a passive eye on BetterUptime (which has an
| on-call feature): https://betteruptime.com/incident-management
| moepstar wrote:
| A few more screenshots of the "Scheduling" options would've been
| great...
|
| We're (more or less) using OpsGenie's free tier, however their
| scheduling never really "clicked" with me... not sure if I'm
| special in that regard, however I find the UI/UX pretty...
| weird...
| CSDude wrote:
| > Alerts from each integration: 300 per 5 minutes
|
| > Alerts from the whole team: 500 per 5 minutes
|
| > API requests per API key: 300 per 5 minutes
|
| Product looks great but those API request limits are too low,
| because alerts rain down when you are having incidents and rate
| limiting all of them is harmful. That's why other products have
| deduplication keys / aliases so you don't miss important ones.
|
| https://grafana.com/docs/grafana-cloud/oncall/oncall-api-ref...
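To illustrate why deduplication keys matter when intake is rate limited, here is a small Python sketch. It is an assumption-laden toy, not how Grafana OnCall, Opsgenie, or PagerDuty actually implement intake: `DedupingIntake`, `receive`, and the single fixed window are all hypothetical. The point it shows is that repeats of a known key only bump a counter, so the limit is spent on distinct alerts rather than on raw API calls.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DedupingIntake:
    """Toy intake for ONE rate-limit window (window rollover is not modeled)."""

    limit: int  # max number of distinct alert groups accepted in the window
    groups: Dict[str, int] = field(default_factory=dict)  # dedup key -> occurrences
    dropped: int = 0  # alerts rejected because the window was full

    def receive(self, dedup_key: str) -> bool:
        """Return True if the alert was accepted (new group or merged repeat)."""
        if dedup_key in self.groups:
            # Repeat of a known alert: merge it, consuming no extra quota.
            self.groups[dedup_key] += 1
            return True
        if len(self.groups) >= self.limit:
            # Window is full of distinct alerts: this one is lost.
            self.dropped += 1
            return False
        self.groups[dedup_key] = 1
        return True
```

With `limit=300`, a storm of 1000 calls keyed `"alert-1"` occupies a single slot, so one later call keyed `"alert-2"` is still accepted. If the limit were instead counted per raw call, that later call would arrive after the quota was already exhausted.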
| named-user wrote:
| How else do you think they are gonna make money?
| CameronNemo wrote:
| _That's why other products have deduplication keys / aliases
| so you don't miss important ones._
|
| Care to link to the docs? I'm interested.
| CSDude wrote:
| https://support.atlassian.com/opsgenie/docs/what-is-alert-de...
|
| https://support.pagerduty.com/docs/event-management
| CameronNemo wrote:
| Thanks for the links.
|
| From the article:
|
| _With Grafana OnCall's automatic grouping of alerts within
| Slack, you can avoid alert storms and reduce the noise your
| teams are exposed to during an incident._
|
| Seems like the same feature described using different
| terminology.
| EwanToo wrote:
| The output alerts feature looks largely the same, but the
| input API limits are the part in question.
|
| What happens if you get 1000 API calls about "Alert 1"
| and 1 API call about "Alert 2"?
|
| You want both on-call pages to trigger once, but will Alert 2
| get through?
| deeblering4 wrote:
| I'd think that receiving even 1/5th the rate limit in a 5
| minute window would be disorienting enough to render alerting
| effectively useless.
|
| I'd question the configuration which fires that many alerts in
| that time frame, and suggest improving alert aggregations and
| dependencies to get the number down to one or a handful of
| meaningful alerts.
| curryst wrote:
| The overhead of maintaining those configurations all the time
| is usually too high to be worth it considering the benefit
| and likelihood of reaping it.
|
| Also, in my experience with those systems, they only make
| sense to use very sparingly. Your monitoring becomes
| extremely fragile when your aggregations and dependencies get
| complicated enough that "what will our alerting system do
| when X happens?" results in a flow chart with 18 steps.
|
| If you aren't careful, you can end up making your
| aggregations less useful than the raw alerts would be.
| rmetzler wrote:
| It would be great to have a dependency graph or labels in the
| alerts, so they are easily mapped to the things that can
| break and are important enough to be monitored.
|
| We just had a short outage where an editor removed the index
| page in the CMS which is central to the site. It's stupid
| that this is possible but we just operate the CMS while we
| build and operate everything around it for our customer.
|
| I think a large part of our alerts were triggered all at
| once but the one thing they had in common was that the alerts
| all pointed to the index page in the CMS. E.g. the public www
| alert for index, the public api alert for index, the preview
| www alert for index, the preview api alert for index....
| steveBK123 wrote:
| For a product that's been around 12 years, I've been surprised at
| how minimally featured PagerDuty is.
|
| Stuff like national holiday awareness, integration to vacation
| calendars, a better UI for swapping days/overrides, etc.
|
| PD schedule checking and trade negotiation becomes yet another
| thing in the long list of things I need to do when taking a day
| off. HR system request off, Department Outlook calendar update,
| PagerDuty coverage check, Outlook out-of-office status &
| auto-replies, Slack set away, update status AND pause notifications.
|
| I suppose that's because as an on-call developer I am not the
| user. The user, management who bought the product, gets KPIs &
| pretty graphs, so they are happy.
| ethbr0 wrote:
| Every delightful, successful developer product is eventually
| doomed to become JIRA.
| rvnx wrote:
| A multi-billion USD success story?
| ethbr0 wrote:
| That's one way to look at it.
___________________________________________________________________
(page generated 2021-11-09 23:00 UTC)