[HN Gopher] Grafana releases OnCall open source project
___________________________________________________________________
Grafana releases OnCall open source project
Author : netingle
Score : 283 points
Date : 2022-06-14 15:28 UTC (7 hours ago)
(HTM) web link (grafana.com)
(TXT) w3m dump (grafana.com)
| pphysch wrote:
| Seems like a solid replacement for Alertmanager for those already using Grafana OSS. Anyone planning on using both OnCall and Alertmanager?
| dString wrote:
| Doesn't AlertManager evaluate metrics and fire alerts?
|
| A quick look at OnCall suggests it is more for managing fired alerts than firing alerts.
|
| Their own screenshot has AlertManager as an alert source.
| remram wrote:
| Grafana used to be so simple; I don't know if I'm a fan of this direction towards many services.
|
| Having to run alertmanager and configure it in addition to Grafana was bad enough; now you need to run and configure another service if you want some extra functionality for those alerts? Are they going to keep maintaining acknowledgements and scheduled silences in AlertManager now that OnCall exists? Are we going to have "legacy notifications" in AlertManager when not running OnCall, the same way there are "legacy alerts" in Grafana when updating from Grafana 7 (pre-AlertManager)?
| pphysch wrote:
| AlertManager does not do the evaluations, and it does not connect to any metrics database; those are done by Prometheus/etc. and forwarded to AlertManager, which handles deduplication and routing among other things.
| juliennakache wrote:
| Looking forward to trying this out. I've always felt that PagerDuty was absurdly expensive for the feature set they were offering. It costs at least $250 per user for organizations larger than 5 people - even if you're not an engineer who is ever directly on call.
At my previous company, IT had to regularly send surveys to employees to assess if they _really_ needed to have a PagerDuty account. Alerts are key information in an organization that runs software in production, and you shouldn't have to pay $250/month just to be able to have some visibility into them. I'm hoping Grafana OnCall is able to fully replace PagerDuty.
| CSMastermind wrote:
| > I've always felt that PagerDuty was absurdly expensive for the feature set they were offering
|
| For anyone out there in the same spot, I'll say that I switched my last company to Atlassian's OpsGenie and it was a 10x cost savings for the same feature set.
| arccy wrote:
| the opsgenie api is really bad though if you want to manage it as code/declaratively
| dijit wrote:
| I really can't bring myself to ever recommend atlassian products though.
|
| If cost is the only measure: I understand. But time lost in various areas of the software package (performance alone! Before we get into weird UX paradigms and esoteric query languages, shoddy search systems, etc.) surely has an impact on cost. Having your employees spend a lot of time navigating janky software has a cost too.
| jlg23 wrote:
| Thanks, I think I finally understand why some friends of mine, who can implement this for any company in half a day, take $2000/day...
| motakuk wrote:
| Check this ;)
| https://github.com/grafana/oncall/tree/dev/tools/pagerduty-m...
| ildari wrote:
| Hey HN, Ildar here, one of the co-founders of Amixr and one of the software engineers behind Grafana OnCall. We finally open-sourced the product, and I'm really excited about that. Please try it out and leave your feedback!
| sandstrom wrote:
| I think it would be great if it was easier to mix and match Grafana SaaS and self-hosted products.
|
| For example, we need to run Loki ourselves, for security / privacy reasons, but wouldn't mind using hosted versions of Tempo, Prometheus and OnCall.
|
| Right now it isn't super-easy to link e.g. self-hosted loki search queries with SaaS-Prometheus.
| netingle wrote:
| It's very much our aim to make this mix of self-hosted and cloud services as easy as going all-cloud, but I agree we're not quite there yet.
|
| Do you mind if I ask what isn't super-easy about linking self-hosted loki search queries with SaaS-Prometheus? You should e.g. be able to add a Prometheus data source to your local Grafana (or securely expose your Loki to the internet and add a Loki data source to your Cloud Grafana).
| [deleted]
| this_was_posted wrote:
| glad to hear this got open sourced!
|
| for someone at grafana; noticed a dead link in the post: https://grafana.com/docs/oncall/main/
| nojito wrote:
| Unfortunate that it's AGPL. But this looks really great!
| josephcsible wrote:
| There's nothing unfortunate about the AGPLv3. Everything that it doesn't let you do is stuff that you shouldn't be doing anyway.
| [deleted]
| ucosty wrote:
| Why is that unfortunate? Unless you're looking to make proprietary changes to Grafana Oncall and host it as a SaaS, it's the same as running any other GPL software.
| nojito wrote:
| GPL and its variants are a no-go where I work.
| ketralnis wrote:
| To distribute I understand, but even just to use? Almost any desktop OS you run has GPL code somewhere in it.
| dividedbyzero wrote:
| Almost any desktop OS? I may be wrong but I don't think Windows and macOS contain any GPL code.
| warp wrote:
| Doesn't Windows 10 ship with WSL2 now? (which includes a full Linux kernel).
|
| Apple still ships bash under GPLv2 on current macOS versions. Apple hates GPLv3, which is why they're trying to switch away from bash to zsh, but for the time being they're still shipping bash.
| eeZah7Ux wrote:
| Then the problem is in the company and not in the license.
| woadwarrior01 wrote:
| Is Linux verboten at work?
| to11mtm wrote:
| Probably not.
|
| Linux usually gets a pass, because most times you're just deploying it and not mucking with source code.
|
| But a lot of places (I've worked at more that do than don't) will have rules about GPL/AGPL for libraries/infra as a whole, though. Often evaluated case-by-case, but it's rare I've seen AGPL stuff get approved for usage.
|
| I think some of it is not wanting to deal with the cost of vigilance; i.e. you can make sure that someone is using %thing% in a way that doesn't run afoul of AGPL right now, but does legal and upper management have confidence in that being true forever and always? Engineers are still human, and corporate management + legal teams tend to hate licensing folk tromping around.
|
| This results in refusals ranging from "This is internal for now but we will open it up later" (a fair concern) to "Somebody is worried that exposing it over the VPN to contractors would count as making it public" (IDK, I'm not a lawyer.)
| ucosty wrote:
| > Linux usually gets a pass, because most times you're just deploying it and not mucking with source code.
|
| That would apply for most uses of software, wouldn't it?
|
| > This results in refusals ranging from "This is internal for now but we will open it up later" (a fair concern) to "Somebody is worried that exposing it over the VPN to contractors would count as making it public" (IDK, I'm not a lawyer.)
|
| I've encountered variations of this problem at places I have worked. Education goes a long way to solving this, and this example of simple usage of (A)GPL software is easy enough to explain with examples.
| nojito wrote:
| Any Linux deploy is through RedHat but most local development here is using windows.
|
| No idea why Linux gets a pass though.
| ucosty wrote:
| Must be quite the paranoid business, given even tier 1 banks here (in the UK) will happily run GPL software.
| matsemann wrote:
| Running a service with a GPL license is different from including their code in your projects, though. So while it may be a blanket ban, it may be worth it to clarify the scope of that ban.
| bbkane wrote:
| LinkedIn built and uses https://iris.claims/ . I don't know how it compares to alternatives, but I find IRIS relatively easy to use.
| acatton wrote:
| https://drewdevault.com/2020/07/27/Anti-AGPL-propaganda.html
| Equiet wrote:
| It's surprising how seemingly difficult it is to build a good on-call scheduling system. Everything I tried so far (not naming the companies here) felt like the UX was the last thing on the developers' minds. Which is tolerable during business hours but really annoying at 2am.
|
| Is there some hidden complexity or is it just a consequence of engineers building a product for other engineers? Also, any tips on what worked for you?
| matsemann wrote:
| Have had lots of bad experiences with that from Pagerduty at least. Want to generate a schedule far in advance, so people know when they will be oncall and can plan/switch.
|
| Of course, in a few months we may have some new people having joined, some quit, or other circumstances. A single misclick when fixing that can invalidate the whole schedule and generate another. Infuriating.
|
| Or the UI itself. It might have become better the last two years, but having to click "next week" tens of times to see when I was scheduled (since I wasn't just interested in my next scheduled time but all of them) was annoying.
| raffraffraff wrote:
| Production helm chart link on this page leads to 404: https://grafana.com/docs/grafana-cloud/oncall/open-source/#p...
| Deritio wrote:
| I like what grafana labs does with grafana.
|
| I'm annoyed by their license choice.
|
| But apparently when you are grafana everything looks like a dashboard UI?
|
| Jokes aside, I will have a look, but I didn't like the screenshots before already.
I like the dashboardy thing for dashboards, but otherwise it's not really a good UI system for everything else.
| Maledictus wrote:
| What I really want is an Android app that keeps alerting until a page is ACKed or escalated.
| machinerychorus wrote:
| check out pushover, I use it for this exact case
|
| https://pushover.net/
| pphysch wrote:
| A bit disappointed by the architecture -- it's a Django stack with MySQL, Redis, RabbitMQ, and Celery -- for what is effectively AlertManager (a single golang binary) with a nicer web frontend + Grafana integration + etc.
|
| I'm curious why/if this architecture was chosen. I get that it started as a standalone product (Amixr), but in the current state it is hard to rationalize deploying this next to Grafana in my current containerless setting.
| alex_dev wrote:
| One of the most frustrating aspects of being a software engineer is dealing with others that love to over-engineer. Unfortunately, they make enough noise insisting that complex solutions are necessary that it gets managers scared about taking any easier, simpler solutions.
| skullone wrote:
| That seems like a perfectly reasonable architecture. If only all of us could work on battle-tested components like those during our job!
| contravariant wrote:
| For something that is supposed to add some more features to the basic email/HTTP message alert like grafana generates, I do wonder what extra features require an additional 2 databases, a message queue and a separate task queue.
| skullone wrote:
| probably keeps history, state, escalation flow, etc?
| goodpoint wrote:
| That's very bad. 99% of organizations don't have a volume of alerts that justifies any of MySQL, Redis and RabbitMQ.
|
| Complexity comes at a steep price when something critical (e.g. OnCall) breaks and you have to debug it in a hurry.
|
| Shoving everything in a container and closing the lid does not help.
| [deleted]
| motakuk wrote:
| I agree that a multi-component architecture is harder to deploy. We did our best and prepared tooling to make deployment an easy thing.
|
| Helm (https://github.com/grafana/oncall/tree/dev/helm/oncall), docker-compose files for hobby and dev environments.
|
| Besides deployment, there are two main priorities for the OnCall architecture: 1) It should be as "default" as possible. No fancy tech, no hacking around. 2) It should deliver notifications no matter what.
|
| We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message-bus-based pipeline with reliable and predictable migrations.
|
| It's important for such a tool to be based on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data. OnCall will continue delivering to other destinations and will deliver to Slack once it's up.
|
| The architecture you see in the repo has been live for 3+ years now. We were able to perform a few hundred data migrations without downtime, and had no major downtimes or data loss. So I'm pretty happy with this choice.
| Deritio wrote:
| Your message bus assumption sounds like one of the most ridiculous claims I've heard.
|
| Sorry, but why is rabbitmq really necessary?
| slotrans wrote:
| You don't need Rabbit, Celery, or Redis. You should be able to replace MySQL with SQLite. Then it would be _radically_ easier to deploy.
| sergiomattei wrote:
| It's curious to see people questioning the stack choices of apps they haven't built and problems they haven't faced either.
|
| They chose this stack, it works for them. They've put it through its paces in production.
|
| It's as boring as it gets.
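[Editor's note] The delivery guarantee described above (a task claimed by a worker that later dies is eventually picked up by another worker) is the at-least-once semantics that Celery gets from RabbitMQ acknowledgements. Here is a minimal, self-contained Python sketch of that idea using an in-memory visibility-timeout queue; it is an illustration only, not OnCall's actual code.

```python
import time

class AtLeastOnceQueue:
    """Toy illustration of at-least-once delivery: a claimed task that is
    never acknowledged (e.g. the worker died) is re-queued after a timeout."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.pending = []    # tasks waiting to be claimed
        self.in_flight = {}  # task -> deadline for acknowledgement

    def enqueue(self, task):
        self.pending.append(task)

    def claim(self, now=None):
        """A worker takes a task; it must ack() before the deadline."""
        now = time.monotonic() if now is None else now
        # Re-queue anything whose worker went silent past its deadline.
        for task, deadline in list(self.in_flight.items()):
            if now > deadline:
                del self.in_flight[task]
                self.pending.append(task)
        if not self.pending:
            return None
        task = self.pending.pop(0)
        self.in_flight[task] = now + self.visibility_timeout
        return task

    def ack(self, task):
        """Worker finished: remove the task for good."""
        self.in_flight.pop(task, None)

q = AtLeastOnceQueue(visibility_timeout=30.0)
q.enqueue("notify-slack")
first = q.claim(now=0.0)    # worker A claims the task, then crashes
second = q.claim(now=10.0)  # nothing available: A's deadline is t=30
retry = q.claim(now=31.0)   # deadline passed, task is claimable again
```

Worker A claims the task at t=0 and never acks; at t=31 the deadline has passed and the task becomes claimable again, so no alert is lost (though it may be delivered more than once, which is why acknowledgements and idempotent delivery matter).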
| throwaway892238 wrote:
| A MySQL database cluster, and a local copy of a SQL database on a single file on a single filesystem, are not close to the same thing. Except they both have "SQL" in the name.
|
| One of them allows a thousand different nodes on different networks to share a single dataset with high availability. The other can't share data with any other application, doesn't have high availability, is constrained by the resources of the executing application node, has obvious performance limits, limited functionality, no commercial support, etc. etc.
|
| And we're talking about a product that's intended for dealing with on-call alerts. The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.
|
| I know the HN hipsters are all gung-ho for SQLite, but let's try to rein in the hype train.
| slotrans wrote:
| I don't need _any_ of that stuff, and nor does anyone who would use this. People who need clustered high-availability stuff are _paying for PagerDuty or VictorOps_.
|
| This is for tiny shops with 4 servers. And tiny shops with 4 servers don't have time to spin up a horrendous stack like this. I was excited to see this announcement until I saw all the moving pieces. No thanks!
| Spivak wrote:
| And this is the on-prem version of those tools. Just because it isn't the tool you wanted doesn't mean it's not good.
| throwaway892238 wrote:
| If you only have 4 servers, make a GitHub Action (or, hell, since we're assuming one node with SQLite, a cron job on one of your 4 servers) that _curl_ s your servers every 5 minutes and sends you a text when they're down. You don't need a Lamborghini to get groceries.
| pphysch wrote:
| This discussion is in the context of a self-contained app called Grafana OnCall, which is built on Django, which does not _particularly_ care which RDBMS you are using.
| | At the very least, SQLite should be the default database | for this product, and users can swap it out with their | MySQL database cluster if they really are Google-scale. | gjulianm wrote: | > The entire point is to alert when things are crashing, | so you would want it to be highly available. As in, | running on more than one node. | | An important question to ask is how much availability are | you actually gaining from the setup. It wouldn't be the | first time I see a system moving from single-node to | multinode and being less available than before due to the | extra complexity and moving pieces. | [deleted] | gen220 wrote: | I think your decisions were reasonable, as is the opinion of | the person you're responding to. | | To be fair, even in its current form, it should be possible | to operate this system with sqlite (i.e. no db server) and | in-process celery workers (i.e. no rabbit MQ) if configured | correctly, assuming they're not using MySQL-specific features | in the app. | | Using a message bus, a persistent data store behind a SQL | interface, and a caching layer are all good design choices. I | think the OP's concern is less with your particular | implementations, and more with the principle of preventing | operators from bringing their own preferred implementation of | those interfaces to the table. | | They mentioned that it makes sense because you were a | standalone product, so stack portability was less of a | concern. But as FOSS, you're opening yourself up to different | standards on portability. | | It requires some work on the maintainer to make the | application tolerant to different fulfillments of the same | interfaces. But it's good work. It usually results in cleaner | separation of concerns between application logic and | caching/message bus/persistence logic, for one. It also | allows your app to serve a wider audience: for example, those | who are locked-in to using Postgres/Kafka/Memcached. 
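[Editor's note] On the Django side, the portability point above is mostly a settings concern: the ORM hides the RDBMS behind the ENGINE setting. A hypothetical settings.py fragment sketching an SQLite default with an env-driven override follows; the variable names are made up, and Grafana OnCall's real settings differ.

```python
# Hypothetical settings.py helper: default to zero-config SQLite, and let
# the operator swap in MySQL/Postgres via environment variables.
# Illustrative only -- not Grafana OnCall's actual configuration.
def database_config(env):
    engine = env.get("DATABASE_ENGINE", "sqlite3")
    if engine == "sqlite3":
        # Single-file database: nothing extra to install or operate.
        return {"ENGINE": "django.db.backends.sqlite3",
                "NAME": env.get("DATABASE_NAME", "oncall.sqlite3")}
    # Anything else (e.g. "mysql", "postgresql") needs a running server.
    return {
        "ENGINE": "django.db.backends." + engine,
        "NAME": env.get("DATABASE_NAME", "oncall"),
        "USER": env.get("DATABASE_USER", "oncall"),
        "PASSWORD": env.get("DATABASE_PASSWORD", ""),
        "HOST": env.get("DATABASE_HOST", "127.0.0.1"),
        "PORT": env.get("DATABASE_PORT", ""),
    }

default = database_config({})  # no env vars set: sqlite3 backend
clustered = database_config({"DATABASE_ENGINE": "mysql",
                             "DATABASE_HOST": "db.internal"})
```

In a real settings module this would be wired in as DATABASES = {"default": database_config(os.environ)}.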
| raffraffraff wrote:
| Nothing wrong with that. I managed 7+ Sensu "clusters" at a previous job, and its stack was a ruby server, Redis and RabbitMQ. But I completely ditched RabbitMQ and used Redis for the queue and data. Simpler, more performant and more reliable (even if the feature was marked _experimental_ ). Our alerts were really spammy, and we had ~8k servers (each running a bunch of containers) per cluster, so these things were busy. Each cluster was 3x small nodes (6GB memory, 2 CPU). Memory usage was minuscule, typically <300MB. Any box could be restarted without any impact because Redis just operated in (failover) mode and Sensu was horizontally scalable.
|
| I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.
| minusf wrote:
| not gonna argue that a single binary is the ultimate deploy solution but running a django app is not that difficult (although i am biased cause i do that for a living).
|
| i love django projects but mysql, celery and rabbitmq -- no thanks.
| pphysch wrote:
| Don't get me wrong, I love Django and think it's a great framework for writing internal tools like this. Redis gets a pass too since Django has native support for it in 4.0+. It's really the (IMHO unnecessary) combo of MySQL+RabbitMQ+Celery that turns me off.
|
| Redis itself has had solid support for building reliable distributed task streaming for nearly 4 years (Redis consumer groups were introduced in 2018).
| lazyant wrote:
| Curious as to what architecture you would have preferred, or what this pretty standard stack (that can be deployed to k8s) is not giving you.
| pphysch wrote:
| Any of the following:
|
| Python(Django)+Redis+[SQLite]
|
| Python(Django)+Postgres
|
| [Compiled Go binary]+[SQLite]
|
| SQLite barely even counts as an architectural dependency TBH :)
| theptip wrote:
| For a simple low-scale app you can often do without Redis and Celery/RMQ if you just push everything into Postgres.
|
| Far less scalable, but it is dramatically simpler to deploy. Often gets you surprisingly far though. Would be interesting to know how many monitored integrations could be supported by that flow.
| picozeta wrote:
| How does a message queue work via Postgres? Many people (including me) use Redis to run background jobs.
| theptip wrote:
| Here's the option I'm familiar with (siblings have others too):
|
| https://github.com/malthe/pq
|
| Doesn't have all the plumbing you'd want; there is a wrapper (https://github.com/bretth/django-pq/) that seems to give you an entrypoint command more like `celery worker ...` but I've not investigated it closely.
| minusf wrote:
| https://github.com/procrastinate-org/procrastinate
|
| https://github.com/gavinwahl/django-postgres-queue
| infogulch wrote:
| lmgtfy https://www.crunchydata.com/blog/message-queuing-using-nativ...
| slotrans wrote:
| This is a very confused question. The data store you keep your queued items in is completely orthogonal to what a message queue actually is.
|
| A simple way to use an RDBMS as a message queue, that has been in use since before most HN readers were born, is roughly:
| - enqueue an item by inserting a row into a table with a status of QUEUED
| - use a SELECT FOR UPDATE, or UPDATE...LIMIT 1, or similar, to atomically claim and return the first status=QUEUED item, while setting its status to RUNNING (setting a timestamp is also recommended)
| - when the work is complete, update the status to DONE
|
| There are more details to it obviously but that's the outline.
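[Editor's note] The outline above translates almost line-for-line into SQL. The following sketch uses SQLite so it runs anywhere; on Postgres you would typically claim with SELECT ... FOR UPDATE SKIP LOCKED instead of the compare-and-set UPDATE shown here. Table and column names are made up for illustration.

```python
import sqlite3

# A status-column queue in an RDBMS, per the outline above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    status TEXT DEFAULT 'QUEUED',
    claimed_at TEXT)""")

def enqueue(payload):
    conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))

def claim():
    """Atomically move the oldest QUEUED item to RUNNING and return it."""
    row = conn.execute(
        "SELECT id, payload FROM queue WHERE status = 'QUEUED' "
        "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    # The status check in the WHERE clause makes this a compare-and-set:
    # if another worker claimed the row first, rowcount is 0 and we retry.
    cur = conn.execute(
        "UPDATE queue SET status = 'RUNNING', claimed_at = datetime('now') "
        "WHERE id = ? AND status = 'QUEUED'", (row[0],))
    return row if cur.rowcount == 1 else claim()

def done(item_id):
    conn.execute("UPDATE queue SET status = 'DONE' WHERE id = ?", (item_id,))

enqueue("send email to ops")
item = claim()        # (1, 'send email to ops')
done(item[0])
status = conn.execute("SELECT status FROM queue WHERE id = ?",
                      (item[0],)).fetchone()[0]
```

The "more details" slotrans alludes to include retry counts, a visibility timeout for RUNNING rows whose worker died, and an index on status.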
|
| The first software company I worked for was using this basic approach to queue outbound emails (and phone and fax... it was 2005!), millions per day, on an Oracle DB that _also_ ran the entire rest of the business. It's not hard.
| gjulianm wrote:
| I bet quite a lot, probably at least 10-50 per second without doing anything special for performance, i.e. multiple queries per alert, calling different APIs, things like that. I don't know of many places that are dealing with alerts measured in "per second" as a unit.
|
| Not to mention that having multiple components doesn't mean it's "scalable" by default; it could happen that some part of the pipeline doesn't like multiple instances of something.
| chrisandchris wrote:
| Not OP, but one may interpret your response as "I don't understand why you prefer a single binary over this architecture that requires 6 different services and prefers k8s".
|
| IMHO, OP just stated that one could solve this with fewer dependencies and have the same (if not a better) result.
| pphysch wrote:
| Yes, thank you. I would be surprised if this same product couldn't be delivered with just Python(Django) + SQLite + Redis (assuming writing everything in Go is unrealistic). Spinning up a venv and launching a local Redis instance is significantly more reasonable than having to configure MySQL, RabbitMQ, and Celery.
| lazyant wrote:
| I missed that interpretation :(
|
| IMHO a fat binary written from scratch would have been a far worse choice than using a standard stack, both in terms of bugs and time, let alone Open Source contributions or any scalability.
|
| In terms of number of services, what do you get rid of that produces a better result? Maybe RMQ, and use a worse queue? Celery, and write your own task manager or use another dependency?
| gjulianm wrote:
| Installation in a regular system without Kubernetes?
Right now I can install Grafana, Prometheus and Alertmanager in a regular Linux system using distribution packages, and just worry about those programs themselves. If I want to install OnCall, I need not only OnCall but four other non-trivial dependencies that will still need configuration, management and troubleshooting. All for something that is going to deal with far less load than any of Grafana/Prometheus/Alertmanager. I honestly do not understand it.
| lazyant wrote:
| you can install this stack without kubernetes no? I don't see anything k8s-specific
| heavyset_go wrote:
| Yes, there is nothing Kubernetes-specific here, and this can be deployed using whatever container orchestration system you want.
| gjulianm wrote:
| The problem of adding dependencies, extra complexity and configuration still stands. I'm usually happy about Grafana/Prometheus deployments because the base installation is fairly simple and self-contained, but this looks like a bit of a mess.
| vhold wrote:
| AlertManager is one component of a more complicated infrastructure.
|
| https://prometheus.io/docs/introduction/overview/#architectu...
|
| https://kubernetes.io/docs/concepts/overview/components/
| pphysch wrote:
| OnCall also does nothing unless you have something external firing alerts for you. They both fill similar niches in a larger monitoring system; this does not excuse OnCall having a drastically more complex internal architecture.
| mkl95 wrote:
| > Django stack with MySQL, Redis, RabbitMQ, and Celery
|
| MySQL is a weird if not slightly disturbing choice. Other than that it's a boring, battle-tested stack that is relatively easy to scale. I agree that Go is nicer, but I'm biased by several years of dealing with horrific Flask / Django projects.
| heavyset_go wrote:
| That's a tried and true stack, and a very good one for maintaining sane levels of reliability, consistency, durability, etc.
Resource wise, at least with Celery, RabbitMQ and Django, | they're also pretty lean. | | It even ships in containers along with Docker Compose files and | Helm charts, which would suit the deployment use cases of 99% | of users. I understand that you're not using containers, but I | don't think that's a limitation that many are inflicting upon | themselves as of late, and if pressed, installing Docker | Compose takes about 5 minutes and you don't have to think about | it again. | MarquesMa wrote: | This. I find open source projects written in Go or Rust are | usually more pleasant to work with than Java, Django or Rails, | etc. They have less clunky dependencies, are less resource- | hungry, and can ship with single executables which make | people's life much easier. | | Just think about Gitea vs GitLab. | matsemann wrote: | Not sure why you include java in that, as you mostly get a | standalone file. No such thing as a jre in modern java | deployment. | | As for python, at least getting a dockerfile helps a lot. | Otherwise it's a huge mess to get running, yes. | | Python is still a hassle anyways, since the lack of true | multithreading means that you often need multiple | deployments, which the Celery usage here for instance shows. | Volundr wrote: | > Not sure why you include java in that, as you mostly get | a standalone file. No such thing as a jre in modern java | deployment. | | Maybe I'm behind the times, but I can't figure out what you | mean here. As far as I know 'java -jar' or servlets are | still the most common ways of running a Java app. Are you | talking graal and native image? | matsemann wrote: | For deploying your own stuff, most people do as before, | yes. But even then, it's at least still only a single jar | file, containing all dependencies. Not like a typical | python project where they ask you to run some command to | fetch dependencies and you have to pray it will work on | your system. 
|
| But using jlink for Java, one can package everything into a smaller runtime distributed together with the application. So then I feel it will be not much different from a Go executable.
|
| > _The generated JRE with your sample application does not have any other dependencies..._
|
| > _You can distribute your application bundled with the custom runtime in custom-runtime. It includes your application._
|
| From the guide here: https://access.redhat.com/documentation/en-us/openjdk/11/htm...
| FridgeSeal wrote:
| Python application deployments are all fun and games until suddenly the documentation starts unironically suggesting that you should "write your configuration as a Python script" that should get mounted to some random specific directory within the app, as if that could ever be a sane and rational idea.
| eeZah7Ux wrote:
| Hell no, I want stuff like OnCall packaged into a Linux distribution. I need something stable and reliable that receives security fixes.
|
| Maintaining tens of binaries pulled from random github projects over the years is a nightmare.
|
| (Not to mention all the issues around supply chain management, licensing issues, phoning home and so on)
| morelisp wrote:
| At this point I trust the Go modules supply chain considerably more than any free distro's packaging, which is ultimately pulling from GitHub anyway.
| dijit wrote:
| > At this point I trust the Go modules supply chain considerably more than any free distro's packaging
|
| What has happened in the package ecosystem to make you believe this? Is it velocity of updates or actual trust?
|
| I haven't heard of any malicious package maintainers.
| eeZah7Ux wrote:
| This is plain false. Most production-grade distributions do extensive vetting of the packages, both in terms of code and legal issues.
|
| Additionally, distribution packages are tested by a significant number of users before the release.
|
| Nothing of this sort happens around any language-specific package manager. You just get whatever happens to be around all software forges.
|
| Unsurprisingly, there have been many serious supply chain attacks in the last 5 years, none of which affected the usual big distros.
| morelisp wrote:
| No, Go modules implement a global TOFU checksum database. Obviously a compromised upstream at initial pull would not be affected, but distros (other than the well-scoped commercial ones) don't do anything close to security audits of every module they package either. Real-world untargeted supply chain attacks come from compromised upstreams, not long-term bad-faith actors. Go modules protect against that (as well as other forms of upstream incompetence that break immutable artifacts / deterministic builds).
|
| MVS also prevents unexpected upgrades just because someone deleted a lockfile.
| goodpoint wrote:
| It's very nice to see Python and AGPL used for this.
| ucosty wrote:
| Looks very cool, will have to give this a shot.
| motakuk wrote:
| Hello HN!
|
| Matvey Kukuy, ex-CEO of Amixr and head of the OnCall project here. We've been working hard for a few months to make this OSS release happen. I believe it should make incident response features (on-call rotations, escalations, multi-channel notifications) and best practices more accessible to the wider audience of SRE and DevOps engineers.
|
| Hope someone will finally be able to sleep well at night, being sure that OnCall will handle escalations and will alert the right person :)
|
| Please join our community on GitHub! The whole Grafana OnCall team is here to help you and to make this thing better.
| knicholes wrote:
| Being on-call has never made me sleep better at night!
| krab wrote:
| If I know someone else is on call and he's competent, I can sleep better.
| the_duke wrote:
| The docs link [1] is 404.
|
| Seems like the /main is the culprit.
|
| [1] https://grafana.com/docs/oncall/main/.
| motakuk wrote:
| Fixed: https://grafana.com/docs/grafana-cloud/oncall/
| pachico wrote:
| I love Grafana, don't get me wrong, but I have the sensation they are now in that position where companies that got a massive capital injection, and therefore a massive increase in workforce, release too much and too soon.
|
| It doesn't have anything to do, of course, with the fact that this morning we suddenly found that all our dashboards stopped working because we were upgraded to Grafana v9, for which there is no stable release nor documentation of breaking changes.
|
| Luckily they rolled back our account.
| danlimerick wrote:
| I apologize for the disruption we caused you when rolling out Grafana 9. We are working on improving our releases to Grafana Cloud and also on making sure that errors due to breaking changes in a major release won't affect customers in the future. As a Grafana Cloud customer, you shouldn't need to read docs about breaking changes when we upgrade your instance.
| pachico wrote:
| Dude, I hope you also read when I say that I love what you do, and your reply just confirms I'm putting my money in the right hands.
|
| I just wouldn't mind being the last to upgrade to a newer version :)
| greatgib wrote:
| I would give a huge marketing bullshit award for the following sentence:
|
| <<We offered Grafana OnCall to users as a SaaS tool first for a few reasons. It's a commonly shared belief that the more independent your on-call management system is, the better it will be for your entire operation. If something goes wrong, there will be a "designated survivor" outside of your infrastructure to help identify any issues.>>
|
| They tried to ensure that you use their SaaS offering because they supposedly care more about your own good than you do yourself. So humanist...
| ezrast wrote:
| The point isn't that their infrastructure is more reliable than yours, but that it's decoupled from yours.
If you run your | monitoring on the same infra as production, it's liable to go | down when production does, i.e. just when you need it most. | This is a real reason to outsource monitoring to a SaaS, just | like there are real reasons to self-host. | | I mean, obviously they chose to address the segment of the | market they could get more money out of first; I'm not | contesting that. But the bit you quoted is low-grade bullshit | at best. Hardly award-winning. | martypitt wrote: | Congrats - this looks great, and definitely something I was | wishing for during an incident earlier this week. | | A minor note, if anyone from Grafana is around - a bunch of the | links on the bottom of the announcement go to a 404. | motakuk wrote: | We're fixing that, thank you ;) | googletron wrote: | Very cool. I love what the Grafana team is up to. | anyfactor wrote: | Here is the repo: https://github.com/grafana/oncall | | AGPL 3.0 ___________________________________________________________________ (page generated 2022-06-14 23:00 UTC)