[HN Gopher] Show HN: Monitoror - Unified monitoring wallboard
Author : alex_d
Score  : 234 points
Date   : 2020-03-04 13:13 UTC (9 hours ago)
web link (monitoror.com)
| soygul wrote: | Care to explain why one would use this over something much more | capable like Grafana? [1] | | [1] https://github.com/grafana/grafana | snug wrote: | Grafana needs a backend datastore, and typically Prometheus | exporters on each app, etc., to get time-series data | into the backend. | | This seems to be checking endpoints for data at that specific | time, not really doing any complex calculations or anything of | that nature. | sciurus wrote: | How often do you have data that is | | 1) important enough to display on a dashboard 2) not | important enough to record so you can track it over time | | ? | ketzo wrote: | Build status immediately jumps to mind. | kqr wrote: | Getting any sort of interesting insight from that surely | requires the context of historical build statuses? | | How long will it be down for? When will it be down next? | How likely is it that it goes down next week? Is it just | me or has it been down a lot this month? | ThePadawan wrote: | I can only second the sibling comment. | | For my use case ("check when the cronjob X on this machine last | ran successfully"), setting up a data ingress pipeline which I | could later configure as a time series data source seems like 3 | times the effort it should actually take. | thehodge wrote: | Reminds me of http://dashing.io/ | sdiepend wrote: | Indeed, which has a successor named | https://smashing.github.io/. Unfortunately it's not being | developed very actively. | djsumdog wrote: | Dashing was kinda garbage though. There was no standard/sane | way to install new plugins. I haven't checked out the currently | maintained fork, but the original is dead/archived. 
I made the | following for Dashing for tracking Seattle Transit: | | https://github.com/sumdog/seatransit | BlackLotus89 wrote: | Could you grab and parse content with this? I'm not really using | CI stuff, but showing events (calendar), grabbing weather data or | output from other simple commands (health checks) could be of | use. Didn't find any of that in the example tiles. | alex_d wrote: | Yep, check the HTTP-FORMATTED tile :) | | You can display content from JSON, YAML or XML available over | HTTP. | BlackLotus89 wrote: | Oh, I didn't see that HTTP-RAW also returns the regex match. | Thanks, will give it a try. Any possibility for command | outputs, though? | alex_d wrote: | Put the output in a file and expose it with a simple HTTP | server :) | | I do not think that we will add a command call, since it | can be heavy and could introduce security | concerns. | blowski wrote: | Is this like an open-source Geckoboard? | alex_d wrote: | Kind of, yep :) | | But there is no graph/visualization support for now, and | Monitoror is more for IT monitoring right now. | | It will evolve to add more and more tile types. | | Feel free to create issues if you need specific tile types :) | chrissnell wrote: | Neat. I want to add crabby support: | | https://github.com/chrissnell/crabby | sub7 wrote: | Problem with this is that any half-competent team can put | something like this up in an hour or so. Wallboards are for high- | level stats - like 1 or 2 numbers the team should focus on. | | Maybe the tiles are super smart and can do uptime testing, log | monitoring, etc., in which case this should be positioned as an | uptime tester/log monitor, etc. | cstuder wrote: | Strangely the page doesn't say anything about it being Open | Source. | | It's MIT licenced by the way. | alex_d wrote: | You're right, I should add that to the landing page :) | | It's in the footer but... who looks at the footer? :p | reaperducer wrote: | Panic used to have an iOS app that did this. 
It was called Status | Board, and was magnificent. | | You could put it on an old iPad on an easel on your desk and | watch everything from RSS feeds to ping statistics. In an office | setting, you'd hook the 'Pad up to a cheap flat-screen TV so | everyone could see. | | Sadly, Panic discontinued it when it decided to go after the | video game market. | gumby wrote: | I had this app and it was useful; it's a shame that they | discontinued it, though I was glad they gave a clear explanation | as to why. | Brendinooo wrote: | They wrote about why they killed it: simply put, there weren't | enough sales to justify further development. | | https://panic.com/blog/the-future-of-status-board/ | SergeAx wrote: | But why do you need an app when a webpage is more than enough? | _jal wrote: | Because it cost like $10 and 5 minutes, could be set up by | non-web-plumbers, and was pretty out of the box. | iamben wrote: | Aside: this is one of the biggest lessons of my adult life. | Just because I _could_ make something doesn't mean I | _should_ make something. Learning to value your time is a | very underdeveloped skill. | ampdepolymerase wrote: | But...but.. something.. something Stallman...vendor lock-in...closed-platforms bad...something. | SergeAx wrote: | If that is sarcasm, then please mind that the original | comment author got really humped by this app's vendor | when it stopped working. Maybe Stallman got something | right after all? | reaperducer wrote: | It didn't stop working. It just doesn't get updated | anymore. | SergeAx wrote: | I really don't understand; it sounds too ignorant and | almost arrogant. | | A web browser doesn't need to be set up; it is already | included with the iPad (or any Android device, or any laptop or | desktop, or your smart TV, smart fridge or smart watch). | | You need to configure your app during first launch, pointing | it to your data source, and maybe entering a login and | password. 
It is _exactly_ the same amount of hassle | compared to opening a website and making it the default homepage | of your browser. | | I don't even want to talk about money, it's totally | irrelevant, $10 or $0. | reaperducer wrote: | The hard/technical part isn't launching the browser. It's | about building a continuously updating status page for | that browser to display. | stiray wrote: | One more thing, if I want a webpage, I will just write a plugin | for this: https://github.com/netdata/netdata | | It is meant as a system monitor but it can chew up pretty much | anything. And it is fast and lean. Really fast. | djsumdog wrote: | Have you found any tools for writing custom dashboards in | netdata? | | The docs are really light and custom dashboards seem to | involve hacking the default HTML and pulling out all the | components (javascript/divs) you need. I thought about | writing my own, but have too many projects already. | stiray wrote: | Set up netdata, check which plugin serves the type of | chart you want, find the shortest of that type, make a copy | and hack it until it works. I was doing it a while back | so I can't remember the details but it wasn't something | special to do... Ignore the docs, existing plugins are | all the documentation you need. | CubsFan1060 wrote: | For a terminal version of this, I really like | https://wtfutil.com/ | | Create a config file, and you get something similar. | EToS wrote: | Site down :-) 
| [deleted] | bilekas wrote: | Nice handy tool, one gripe is the scaling with different sizes, | the text does scale, but not the box modules.. Small thing but | really nice tool | dnadler wrote: | This is cool, but I'm running into a lot of issues with multiple | Jenkins tiles. The name from one is erroneously propagating to | following tiles :/ | gitgud wrote: | Nice design! Looks to be targeted at developers, but it could be | good for product managers too. Some tile ideas: issue counts, PR | counts, vanity metrics... plenty of room for extension :) | alex_d wrote: | There is already a GITHUB-COUNT tile type for issue/PR count :) | | And yep, we plan to add more and more tile types as users ask | for them. | | Thank you, glad you enjoy the design :D | grantler wrote: | The first UI config example has a PING tile, but the PING type seems | to be disabled by default, and I can't find how to enable it in | the docs. So maybe a good thing to make more clear for people | wanting to test quickly. | alex_d wrote: | You're right, I will change the config example for now. | | Check the note in the Ping section here: | https://monitoror.com/documentation/#ping | | I will work on making it more obvious/visible :) | | Thank you for your feedback! | djsumdog wrote: | I've been looking at different status board tools and the one | thing I've always found missing is dual-stack IPv4+IPv6 tests. | It'd be nice to be able to see that both protocols to a given | port are working as expected. | | I don't want to write my own, so I'll probably settle on one and | try to offer up a PR for dual-ip stack checks. I'll take a look | at this one too. 
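A dual-stack check of the kind described above could be sketched roughly as follows. This is a minimal illustration using only Python's standard `socket` module, not a Monitoror feature; the function names `check_family` and `check_dual_stack` and the default timeout are assumptions made for the example.

```python
import socket


def check_family(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over the given address family."""
    try:
        # Resolve only addresses of the requested family (AF_INET or AF_INET6).
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the host has no address of this family
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address, if any
    return False


def check_dual_stack(host, port, timeout=3.0):
    """Check the same TCP port over IPv4 and IPv6 independently."""
    return {
        "ipv4": check_family(host, port, socket.AF_INET, timeout),
        "ipv6": check_family(host, port, socket.AF_INET6, timeout),
    }
```

Calling `check_dual_stack("example.org", 443)` returns one boolean per protocol, so a wallboard tile could turn red when either of them fails rather than only when the host is unreachable altogether.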
| CSDude wrote: | I know people like wallboards and monitors but we found them | anti-pattern. If you find yourself looking at a | wallboard/dashboard, it should already be an automated alert. | sunbear-lover wrote: | By that logic a speedometer is an anti pattern and your car | should just send up an alert when you're speeding... since when | is getting accurate real-time information a bad thing? | jedberg wrote: | That's... actually true. It doesn't matter how fast you're | going unless you're speeding. And it distracts you by making | you look down. | | The only reason we don't have that yet is because the car | doesn't know the speed limit everywhere all the time. | timdorr wrote: | Funny thing, this is actually a feature in Teslas. You can | set it to chime once the speed limit is exceeded (in areas | where it knows the limit). Although, I've never seen anyone | turn that on. | kube-system wrote: | I strongly disagree. I can think of a ton of reasons why a | driver may need (or even be legally required) to know their | speed regardless of speed limit: | | * when speed restricted by equipment (trailer, temporary | spare, etc) | | * when observing advisory speeds | | * when observing minimum speed requirements | | * as a reference for judging appropriate speeds under | inclement conditions | | * as a reference for judging appropriate | acceleration/deceleration rates when entering/exiting the | roadway | jedberg wrote: | Of course an alert system would have to be able to | understand all those things. That's why we don't have | that kind of system. | | A single number in isolation is rarely useful. Graphs | with trends are useful. Alerts are useful. | | The only reason we don't have alert based speeds is | _because_ it can 't get all the necessary information to | make a useful alert, so we compromise by telling you the | number. 
| | > as a reference for judging appropriate | acceleration/deceleration rates when entering/exiting the | roadway | | A perfect example of why a graph would be ideal here, not | a single number. | kube-system wrote: | > reference for judging appropriate | acceleration/deceleration | | > perfect example of why a graph would be ideal | | A gauge chart, maybe? :D | | But seriously, if we have a system that appropriately | judges everything on my laundry list above, you probably | won't need an alert system anymore because the cars will | be self-driving. | jaywalk wrote: | I have that in my 2020 Ford, and you can tune the alert | threshold from 0-5mph over the limit. I wouldn't even | consider turning it on unless I could set it to at least | 10mph, however. | | The vehicle has a camera that looks out for speed limit | signs, and then updates a little icon on the instrument | cluster with the current speed limit. It works very well. | zola wrote: | I was driving a car like that few times, with beeping | when the limit was exceeded. Annoying, but informative, | unless road sign was limiting vehicle mass to 10t and not | speed to 10km/h. It was carsharing vehicle, so I didn't | bother to turn it off. | alex_d wrote: | We are thinking about monetizing alert feature as a browser | extension or PWA for mobile based on Monitoror Core API :) | | A wallboard is useful to monitor project builds, CI servers, | even production things. | sirtoffski wrote: | I'll chime in here to say we use both at work. In a NOC at a | medium-sized ISP, we are getting hammered with alerts 24/7. | Some are not urgent, while others need to be actioned much | faster - I mean 100G transit link down is no good. | | We'd receive an automatic email about a large circuit going | down, we'd also receive a ticket about it; sometimes people | dont look at the tickets closely enough, other times people get | distracted with other topics, issues, etc. 
Having a large | screen with interface status monitoring has proven to be | effective enough; for example, someone walks by the monitor and | says "why is this thing red, is it supposed to be?"... and we | immediately know one of the larger interfaces is down. | | In an ideal world, we would not need it because every ticket | would be diligently dealt with... However, in the real world, | having a big red part of the screen flashing has proved quite | effective. | sparrish wrote: | If you're getting alerts for non-actionable events, you need | to do a better job of tuning your monitors and alerts. | | Alerts shouldn't be sent about anything that doesn't require | an action. | C1sc0cat wrote: | Yeah, anyone from Google here? It would be nice if you allowed us to | fine-tune the alerts from GSC - I came in today and found | 87 non-useful alerts in my inbox. | | I will have a look at the tool and have a play - I assume | you can have multiple pages :-) | | Would be cool to monitor the looks at GA " 1 2 3 4 5 .... | Many" sites I have an interest in | chrisandchris wrote: | There's a wide variety of "requires action". It might be | that it's fine to act within 1 hour or within 10 min. Both | deserve an alert, but only one requires you to immediately | stop your coffee break... | | In an ideal world, I agree. But sometimes an automated | system cannot perfectly decide on the severity of an | alert, which leads to some alerts being ignored, which is | fine. | sirtoffski wrote: | Well, the thing is, alerts are indeed for actionable events. | | For example, many remote locations have an on-site battery | backup, which would supply power in the event of losing | commercial power. Those are actioned in terms of notifying | field teams and deciding whether a specific location needs | to be placed on a generator. 
| | Imagine a hurricane disrupted commercial power grid and | there are thousands of "site on battery" alerts; somewhere | among them there is also an alert for OSPF down between two | core switches. | | Having a monitor with a large red warning saying "Link X at | location Y is down!" - is a pretty effective way to not | miss important notifications. | | I mean playing devil's advocate one might say "Then your | alerts should have better filtering system with the | important ones staying at the top of the page"... which is | true. A lot of smart design features can render dashboards | less relevant - however when there aren't enough resources | in a DevOps team to implement those solutions, a simple | dashboard can go a long way! | jrockway wrote: | I think wallboards can be interesting. Do you want an alert if | your site is suddenly trending on Twitter? If latency and error | rates are good, probably not. Would you be interested if you | walked by and noticed? Probably. | sm4rk0 wrote: | But visualising the data and alerting are two different things. | geofft wrote: | Yes, which is why you shouldn't use wall boards for alerting, | only for visualization. | | https://demo.monitoror.com/?configUrl=https://monitoror.com/. | .. is full of things that aren't visualizations at all (no | graphs, no sense of whether things are abnormal but not past | an alerting threshold, etc.) and are in fact alerts (the | website is fine, one PR failed, the QA nodes are ... doing | something but there isn't enough space to see what is wrong). | | If you want some graphs, great. If you want your team to look | up every few minutes and poll some graphs (or worse, some | colored rectangles) to figure out what they're supposed to be | doing, consider that polling is usually the wrong approach. | | (To be clear, this is a criticism of the choice of demo data, | not of the product overall. 
A product like this has its uses, | but "our alerting system is people looking up at the TV" is | not one of them.) | OJFord wrote: | What would your alert be for # open PRs (an example in the demo | linked from posted page)? How often would it fire? | | Whatever the answer, that's a different thing from this. Both | have their place. | geofft wrote: | Why do you want a display of open PRs at all? | | I think the fundamental question of all such tools is "Why | are we watching this, and what are we looking for," and there | are limited but nonzero good reasons to have a display. | "Someone should look at open PRs if there are too many" is a | bad one - the number doesn't tell you about the urgency of | the existing PRs. If you want to respond promptly, respond to | all of them promptly. | | "We need to know if we're falling behind" is a possible | reason to create an alert, not a dashboard. If you really | want people to drop what they're doing and triage issues if | there are too many, make an alert. If you don't, you'll just | get a rectangle that turns red at some point and train people | to ignore red rectangles on the board. (Relatedly: I added a | pageable alert to my team a few years back to check whether | there are a large number of non-pageable alerts, because it | usually means something has gone wrong at a low level and we | should investigate urgently. It's worked out pretty well, but | the alert looks only at tickets created by our monitoring | systems, not at tickets created by humans.) | | "We need to see if we're getting worse" is a reason to have | managers review graphs periodically, not a reason for anyone | to stare at a single display. You can't track long-term | trends from a status board. | | "I need to see what to work on" is a valid reason, but much | more useful in the form of a website you can visit on your | own computer with links to PRs, not a raw number on a TV | screen. 
(My team has a TV showing open tickets in our queue, | both support tickets and automated alerts, but we all have an | equivalent link locally, too. Showing the names of tickets is | useful for "Hey teammate, can you look at the second ticket | there? Sounds related to a thing you were working on.") | | I'd say there are roughly two useful cases for screens like | this. One is to show to internal customers, so they say "oh, | service X is yellow, so the slowness I'm seeing isn't just | me, I'll do something else for a while." But those screens | aren't primarily for the team that _owns_ the product, | they're for teams that depend on the product. (Such status boards | can be either automated or manual.) The other is to show | graphs of various metrics to see abnormal behavior, with the | idea that no action is ever triggered by someone looking at | the graph, but if you're _already_ investigating something, | it's useful to say "Hey, that's funny, this other thing | spiked at about the same time even though it's within | acceptable limits" and then you have a clue for | investigation. | wpietri wrote: | > Why do you want a display of open PRs at all? | | All PRs are WIP, and minimizing WIP is very valuable in | product development processes. See Reinertsen's _The | Principles of Product Development Flow_ for the math, but | basically high/unpredictable latency drastically limits | the pace of learning and causes a lot of upstream thrash | and waste. | | I remember talking with one team at the bird-themed social | media company that was frustrated with slow PRs; they | dropped average delay from 3-4 days to under 4 hours. They | said it made a huge experiential difference and they loved | the change. | geofft wrote: | Yes, I understand why you'd want to focus on solving the | number of open PRs. I agree that keeping that number down | is good. My question is _why do you want to put this on a | TV screen_. 
| | If you want people to focus on open PRs, tell them to | open GitHub on their computers, don't tell them to look | up at a TV screen periodically. Treat it like alerts: you | have a list of open things to deal with and you need to | get that number to zero. There's no threshold greater | than zero of a long-term acceptable number of open PRs. | | If the problem is that they have other things to look at | too, installing yet another TV screen won't solve that, | your team needs to make the management decision of what | to prioritize. Options include making a unified dashboard | of incidents/alerts/PRs/support tickets (and encoding | which ones sort to the top), setting up a PR review | rotation (i.e., for one week, completing reviews is your | top priority barring all-hands-on-deck incidents), | treating open PRs as alerts and escalating them if nobody | replies within 4 hours, removing other work by deciding | you'll deprioritize low-impact alerts (and hope that the | increased development velocity ends up solving problems), | etc. | wpietri wrote: | The notion with information radiators not that you tell | them to look up. The notion is that people naturally look | at things while walking around or when idle, so it's | valuable to make important things visible. It also serves | as a way to trigger and focus discussions. | | For example, consider the Kanban board. Here's one I | built a while back: | http://williampietri.com/writing/2015/the-big-board/ | | We loved having a physical map of what we were up to. | We'd have our daily stand-up around board and discuss it. | You'd know when something was completed, because you'd | see somebody move a card. I would often know when the | product manager was thinking about something he'd go over | to look right at it. That often sparked conversations. | And we'd all have a feel for how work was flowing, | something we'd talk about in our weekly retro. | | Could this have been replicated with a system of alerts? | No. 
Alerts are interruptive and necessarily threshold-driven. | I don't want my people caught in a cycle of | continuous reactivity to things that at some point in | history were seen as important enough to configure an | alert. Except for emergencies, I want them to be serene, | thoughtful, and proactive, which is very hard to achieve | if you're continuously juggling alerts. | | So I'd put up something with PR stats if it were | something I wanted us to be aware of. Especially so if it | were an item of concern in previous retros. Maybe that | would eventually lead to an alert (although I'd hope | not). But the first step in solving a problem is | understanding the problem, and I think information | radiators are great for that, especially when problems | are thorny and don't have obviously correct answers. | geofft wrote: | That's fair - I think part of it is also that you don't | really have a green vs. red state (which is a good part | of what I object to in the demo presentation), you just | have a general feel, and no specific state is defined as | an actual problem. (And most of what you're trying to | achieve is a shared sense of what's being done, which is | very different from a shared sense of what's broken and | needs fixing.) | CSDude wrote: | If you just want to have a nice visualization to look at some | numbers, fine. But if you want to detect problems, it's | ineffective. I have seen too many companies use them to actually | monitor the state of things and find problems with | charts, numbers, traffic lights, etc. | OJFord wrote: | But that's my point: it isn't _for_ alerting about | problems. Some things have a 'status' that might be | interesting, but isn't a problem, or something to fire an | alert on necessarily. | | You could have unintrusive notifications (inaudible etc.) 
| to 'alert' to such statuses I suppose, if they were kept in | view and not 'dismissed' (whatever that means for the | medium they came in) - but then really you're just | implementing a version of something like this Monitoror in | your inbox, phone notification tray, Telegram channel, or | whatever. | | You're not going to rip out logging, Prometheus, or | the UIs of the services this connects to just because you | have alerting, so I don't see why you would rip out this. It's like | Prometheus & Grafana for higher-level stuff. (Of course you | _could_ use those tools for this sort of monitoring too, | but that's not really the point.) | throw_away wrote: | You can do both. Especially at the beginning of a system's | lifecycle, when you don't really understand its behavior yet. | Lots of times, people wandering by have said hmm, that | doesn't seem right... Later, as we learned more, these | hunches evolved into more advanced automated alarms. | virgil_disgr4ce wrote: | A "nice visualization" is not necessarily just a | "pretty"/"shiny" thing to show off to people. Human beings | are highly visual creatures with outstanding visual pattern | recognition abilities. Maybe you personally don't get | anything out of them but the value of visualization is | proven. Here are a few sources to get you started: | https://www.csgsolutions.com/blog/15-statistics-prove-power-... | scoutt wrote: | I don't think it is intended to be stared at for 8 hours | straight. | | I think it's more like a clock: you look at it several times a | day, and not only when you hear an alarm. | tilolebo wrote: | Clocks are an anti-pattern... | | Why would you want to have the time displayed permanently? | It's such a distraction for developers. | | Just set automated alerts for lunch and end of day and that's | it. | dvtrn wrote: | _Clocks are an anti-pattern_ | | Well this is a first for me... | scoutt wrote: | Well, I have a clock on the wall in front of me that | permanently displays the time. 
I check the time several | times a day, for example to check how much time left I have | to do something before lunch or going home, or a having a | meeting. | | I don't know why are we discussing the practical uses of a | clock. I can't imagine a life where one is allowed to look | at a clock _only_ when an alarm or alert is triggered. | lioeters wrote: | Calendars, clocks, real-time notifications, and video chats | are all anti-patterns, distracting developers from their | zone of genius. Just send a concise email at the | beginning/end of the day/week. (One can only dream..) | RBerenguel wrote: | I've spotted interesting "things" from idly looking at our | dashboard while chatting with coworkers (and more than a few | were interesting enough to warrant a lot of investigation and | double-checking of metrics, providers and stack). They were not | alert-able, or not very easily unless we wrote some complex | time series analysis system for our internal metrics. | wpietri wrote: | Not at all. Alerts serve a different purpose. | | One of the most important things a team needs over the long | haul is a _feel_ for their system. Many people refer to this as | mechanical sympathy. And the way you develop that is long-term | exposure to rich data. | | Alerts are the red and yellow lights on your dashboard. But you | get mechanical sympathy by listening to the sound of the | engine, feel of the road, and the smell of things when you take | a peek under the hood. | | There are a lot of ways to achieve mechanical sympathy, of | course. And information radiators are easily misused; you have | to have the right information shown in the right ways for | people to develop a correlative, intuitive understanding of | what they've built. But nobody develops mechanical sympathy by | looking at dashboard lights alone. 
| TeMPOraL wrote: | > _you have to have the right information shown in the right | ways for people to develop a correlative, intuitive | understanding of what they 've built_ | | Lots of things have to be right for this to work, | unfortunately, and company dashboards I've seen so far tend | to be nowhere near it. | | For instance, the dashboard refreshed $PERIOD only makes | sense if you're showing data that updates $PERIOD, and if you | can respond to changes in that data $PERIOD. $PERIOD = "in | realtime" or "every minute" or "hourly" or whatever is | relevant in a given context. | | If you're looking at the dashboard much more frequently than | the data changes, you're wasting time. If the data changes | much more frequently than you're looking at it, you're likely | to miss things, as 'geofft mentions elsewhere in the thread. | And if you can't react to the data roughly as fast as it's | updating, there's no point in looking at it so often. All | those periods - recording, observing and reacting - must be | roughly similar for the always-on dashboard to be useful, | relative to generating reports every now and then. | | Panels full of lights and charts work on fighter jets or on | the bridge of the Enterprise, because the pilots/crew are in | a tight feedback control loop with their dashboards. | | (WRT. reacting in time, there are also error bars to | consider. For instance, people on a diet are advised to weigh | themselves weekly and not daily, because body mass varies by | +/- 2kg during the day, so a naive person checking weight | daily would get fixated on those random oscillations. It's | easier to tell regular people to reduce measurement frequency | than to explain to them what a low-pass filter is and how is | it relevant here. I have a feeling there's plenty of | dashboard misuse that amounts to that too.) 
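The low-pass filter mentioned in the weighing example can be sketched as an exponential moving average. This is an illustration only, not something from the thread or from Monitoror; the function name `exponential_moving_average`, the smoothing factor `alpha`, and the sample weights are assumptions made for the example.

```python
def exponential_moving_average(samples, alpha=0.2):
    """Smooth a sequence with a first-order low-pass filter (EMA).

    A smaller alpha means heavier smoothing: each output moves only a
    fraction alpha of the way from the previous output toward the new sample.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    current = None
    for x in samples:
        # The first sample seeds the filter; later samples are blended in.
        current = x if current is None else alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical daily weigh-ins (kg) that swing by +/- 2 kg around 80 kg.
daily_weights = [80.0, 82.0, 78.0, 82.0, 78.0]
trend = exponential_moving_average(daily_weights, alpha=0.5)
```

The smoothed series swings less than the raw readings, which is the same effect as weighing less often: the day-to-day oscillation is damped while a sustained change still shows up over time.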
| | -- | | Speaking of the Enterprise and "getting the feel for the | system", there's something that I'd like to try one day: make | a monitoring tool that translates various system metrics into | background sounds, creating an ambience similar to the one | you hear on the Enterprise-D[0][1]. I feel a somewhat | unobtrusive mix of background noises would be better to | develop "the feel for the system" than a visual dashboard. | Real-life examples of this are combustion engine's RPM, or | spinning rust hard drives, if anyone still remembers those. | | -- | | [0] - https://www.youtube.com/watch?v=UKBvaOLDem0 - the | bridge | | [1] - In Enterprise's engineering, there's a well-known | pulsating sound of the warp core; I can't find a good enough | YouTube video (whatever there is, apparently got broken by | YT's audio compression). This background pulsing correlated | to the speed Enterprise was traveling with. | wpietri wrote: | Agreed! I find dashboards at most companies disappointing. | And often for the same reason I find other stuff on their | walls disappointing: it's frequently irrelevant or actively | unhelpful to the work actually being performed. | | For me, good dashboards are like good checklists: they | should be living entities owned by the team in question and | regularly updated to address active concerns. And they | don't even have to be complex. Back before CI was in | fashion, I drove giant changes in a team's behavior just by | having a single LED indicator (the now-departed Ambient | Orb) show the state of the current build. Previously, the | build would stay broken for weeks at a time, only | converging to green around the time of release. Nobody | liked that, but they were used to it, so they'd just work | around it. But once it was visible and discussed, they | eventually got so the build was green almost all the time. | It was less painful and saved a bunch of time. | | I would absolutely love to try out a set of ambient audio | indicators. 
I suspect I'd want to try it along with a | visual dashboard, because the moment I hear something | anomalous, I'm going to want to look up and see the recent | history, so I can correlate the audio with what it | represents and what else is going on. | RBerenguel wrote: | Likewise! I found this some time ago: | https://dl.acm.org/doi/10.5555/1045502.1045526 | | but never tried it, I just left it as a fun idea for the | future. | AgloeDreams wrote: | You know, for some people I think that's true and for others | it's not. There is real value in making some data reactive | rather than proactive in communication. Knowing current active | traffic, open PRs, time til build is done, all that kind of | stuff is 'I would like to see it/check it...but I do not want | it to interrupt me.' | | People who deal with tens of interruptions at that level are | clearly not very productive. | | On the other hand, for the site returning non-200 or for API | issues, that should be an alert, for sure. | | Kinda surprised that Slack or MS Teams isn't in this market. | dexterdog wrote: | I find them great as a first look when there is a problem | because you can often pinpoint the problem just by looking at | the board. | wjossey wrote: | Strongly disagree. | | Understanding your metrics is a key part of so many roles, from | devops, to product teams, to marketers... | | Yes, you should be automating alerts whenever possible. Yes, | you should be putting up key metrics in a visible place so | everyone can see how the product is performing. | | I can't tell you how many times I caught an issue because I | knew our metrics backwards and forwards, but it didn't trip an | alert threshold. Not every issue follows a pattern easily | defined in a check, and human brains are incredible computers | capable of helping to fill in that gap. 
| geofft wrote: | > _I can't tell you how many times I caught an issue because | I knew our metrics backwards and forwards, but it didn't trip | an alert threshold._ | | So how many times was an issue _missed_ because you weren't | in the office, or because you were looking at your own screen | and not dashboards at the moment? | | Humans are incredibly powerful, but our whole job as SREs is | to make things reliable, repeatable, and scalable. We're | doing an industry-wide migration from elegantly hand-crafted | LAMP stacks running SSH to Kubernetes and infrastructure-as- | code, not because you can't fix problems with SSH (you can, | and you can usually fix them faster and better) but because | you can't _scalably_ fix problems with SSH. Similarly, if a | human found an issue and an alert didn't trip, I'd count that | as a bug/missing feature in the monitoring. | | It's valuable while you're still small and working out your | monitoring to keep a human in the loop - but at some point | you need to get rid of that single point of failure. By all | means, rely on a human to figure out where your alerting is | lacking (just like you rely on a human to write the | infrastructure-as-code), but you should eventually not rely | on human intervention to actually keep incidents from | happening. | pjmorris wrote: | > Similarly, if a human found an issue and an alert didn't | trip, I'd count that as a bug/missing feature in the | monitoring. | | The way that I took the GP's point was that humans can find | things that haven't yet been automated, while automation | can't (at least not yet, but I'd argue it'll take AGI for | that.) | geofft wrote: | Yes, I agree with this. But if you're _relying_ on humans | to look at dashboards to keep your actual service up in | the moment, you're not seriously committing to | automating (just like if you SSH to every machine you | Terraform to tweak things, you're not really committed to | Terraform).
| | What you should do is rely on automation to detect | problems and alert people, and in postmortems, look at | graphs and have humans say things like "Hey, this queue | kept steadily climbing for three hours before the outage" | or "We would have noticed it in this metric but it's so | noisy so we can't alert on it" or something. Then you can | write more automation (or focus on some prerequisite dev | work). | kqr wrote: | I don't think anyone is arguing that, though. A lot of | what humans notice, e.g. "we speculatively upped the | virtual file system cache and now the service has worse | throughput but better high nines response time", is not | something you can really build an alert for, and neither | is it something you really want an alert for -- but | absolutely something that would show up on a dashboard | you're intimate with. | | In other words, people are not arguing for replacing alerts | with humans, but rather arguing that continuously looking | at your metrics gives you a mental model for how your | system behaviour changes in response to changes in | configuration, whether intentional or not. | reaperducer wrote: | _So how many times was an issue missed because you weren't | in the office, or because you were looking at your own | screen and not dashboards at the moment?_ | | That's not a problem with dashboards. That's a problem with | training and staffing people. | | _because you can't scalably fix problems with SSH._ | | The number of businesses that need to worry about | scalability is vanishingly small compared to the number of | businesses that don't. Let's not pretend that one company's | problems are the same as another's. | | _you should eventually not rely on human intervention to | actually keep incidents from happening._ | | He didn't state that the dashboard was the only way his | organization kept tabs on things. He indicated that it was | only one way, and specifically stated that an alert system | also exists.
| tyrust wrote: | >That's not a problem with dashboards. That's a problem | with training and staffing people. | | Training and staffing people to look at dashboards? I've | never heard of this and it sounds awful. | reaperducer wrote: | "Hey, Mike. On your way to the Keurig, remember to glance | at the status panel on the wall and let us know if | something doesn't look right, OK?" | | Brutal. | InvisibleCities wrote: | Why should Mike have to remember this? Why should all of | your infrastructure depend on Mike not getting a text | from his wife while walking to the fridge for a La Croix? | geofft wrote: | > _That's not a problem with dashboards. That's a | problem with training and staffing people._ | | Again, the whole point of us being computer people is | that we think computers can solve problems in repeated, | reliable ways. You can run a highly reliable, say, | delivery-based bookstore by having a well-staffed group | of well-trained human phone operators who pass messages | onto human shippers. People did that (and they still do), | and it worked. But we have the thesis that you can do | this more efficiently and more reliably - in short, that | you can deliver more business value - by using computers | to automate the process. | | > _The number of businesses that need to worry about | scalability is vanishingly small compared to the number | of businesses that don't. Let's not pretend that one | company's problems are the same as another's._ | | I do fully agree that different companies have different | priorities, and in particular I think it's totally fine | to rely on humans in the loop while a system is still | young (or has just been redesigned) and you don't have a | good codified sense of how it behaves yet. However, | | 1) Wall-based dashboards aren't a best practice, any more | than SSHing to production servers is a best practice. | It's the right thing for some cases, some of the time.
| I'd agree with "It's a valuable skill, and it's been | useful;" I disagree with "It's so valuable you should | make sure everyone does it." If you have the option of | either getting good at alerts or getting good at | dashboards, spend your time getting good at alerts, | first. I'd say the same about infrastructure-as-code vs. | SSH-to-prod (and I say this as someone who regularly SSHs | to prod and is real good at single-machine old-school | sysadminnery). | | 2) Scalability isn't about absolute size, it's about how | much you can do with the resources you have. Small teams | and not-yet-profitable teams need to focus _more_ on | scalability (in the sense I'm using it) because they | simply can't staff enough people to cover up gaps in | operability. For example, you're much better off figuring | out how to set up HA and automated failover than saying | "We're too small for that," setting up a weekly pager | rotation with people on call 24 hours a day, and alerting | them so much they can't do non-toil work (or worse, | burning them out and having them find another job). | | Many years ago I was on a ~4-person team at my undergrad | computer club running web hosting. We ended up getting | popular enough that many real university applications | (course websites for submitting assignments, etc.) | depended on us. Our priority was that, as students, we | couldn't get paged during finals week because our | academics would take priority, and yet finals week was | the most critical time for the service to stay up. So we | got real good at HA, at reproducible deployments and | config management, etc. (I remember one time we spun up a | new server during finals week - and we didn't have to do | any fiddling to add it to the cluster precisely because | we'd automated the provisioning process.) We had web | pages with graphed metrics to inform our capacity | planning, but no dashboards that anyone was expected to | stare at, just alerts on full outages.
| _jal wrote: | You're both right. | | Instrumentation and alerts are vital - they leverage | inhuman persistence, patience and low cost. But alerts do | not substitute for a deep understanding of how your systems | work. | | A number of the more useful "pre-crime" alerts we have | derived from that - if I hadn't been elbow-deep in our | systems long enough to notice certain behaviors have non- | obvious second- and third-order effects downstream, we | wouldn't have the alerts at all. | geofft wrote: | So, I'm making a bit of a subtle claim - you should | absolutely be elbow-deep in your systems, and you should | be understanding things well enough to build these sorts | of proactive alerts, but you shouldn't rely on people | being elbow-deep for noticing problems in real time. | | If you're ever at the point where you catch a problem and | automated monitoring didn't, that's a bug in automated | monitoring. If you are really good at finding new bugs in | automated monitoring and more things to monitor because | you're spending your time getting a sense of how the | system behaves, _that's fantastic_, keep doing that. | (That is one of the good reasons for dashboards IMO - a | bunch of data to look at when you've already realized | something's wrong. Just don't use dashboards to make the | decision that something must be wrong.) If you don't | improve your automated monitoring and you're worried | things will start failing without humans watching | dashboards, then you're not solving your existing bugs. | achow wrote: | Strongly agree (with you). | | _From the very first formulation of Ubiquitous Computing, | the idea of a calmer and more environmentally integrated way | of displaying information has held intuitive appeal. Weiser | called this "calm computing".. When information can be | conveyed via calm changes in the environment, users are more | able to focus on their primary work tasks while staying aware | of non-critical information that affects them.
Research in | this sub-domain goes by various names including "ambient | displays", "peripheral displays", and "notification | systems"..._ | | A Taxonomy of Ambient Information Systems: Four Patterns of | Design | | https://www.cc.gatech.edu/~john.stasko/papers/avi06.pdf | C1sc0cat wrote: | An automated email is ok, but visually seeing a graph | flatline or a monitor turn red is much more likely to get | noticed. | tyrust wrote: | If you ignore alerting then it's likely that your alerts | are too noisy. See "alert fatigue". | C1sc0cat wrote: | It's that nice Mr Google's alerts | kkirsche wrote: | Cool item but didn't scale well for mobile (iOS iPhone XS Plus) | Jeremy1026 wrote: | It looks like it doesn't actually support changing the port | currently, despite the documentation saying it is possible. I | already use port 8080 so kind of stuck until I can use a | different port. | hsartoris wrote: | I got it to run on a different port just fine with the MO_PORT | environment variable, FWIW. | Jeremy1026 wrote: | Turns out it's just too early in the day. I wasn't saving the | variable beyond setting it. So when I switched terminals it | didn't exist. Put it in my bash profile and all is well. | alex_d wrote: | You can use a .env file too, or even put it before the | command like this: | | MO_PORT=8888 ./monitoror | | :) | [deleted] ___________________________________________________________________ (page generated 2020-03-04 23:00 UTC)