[HN Gopher] Show HN: Monitoror - Unified monitoring wallboard
       ___________________________________________________________________
        
       Show HN: Monitoror - Unified monitoring wallboard
        
       Author : alex_d
       Score  : 234 points
       Date   : 2020-03-04 13:13 UTC (9 hours ago)
        
 (HTM) web link (monitoror.com)
 (TXT) w3m dump (monitoror.com)
        
       | soygul wrote:
       | Care to explain why one would use this over something much more
       | capable like Grafana? [1]
       | 
       | [1] https://github.com/grafana/grafana
        
         | snug wrote:
         | Grafana needs a backend datastore, and typically prometheus
         | exporters on each app, etc to get timeseries data that gets
         | into the backend.
         | 
         | This seems to be checking endpoints for data at that specific
         | time, not really doing any complex calculations or anything of
         | that nature.
        
           | sciurus wrote:
           | How often do you have data that is
           | 
           | 1) important enough to display on a dashboard
           | 2) not important enough to record so you can track it over
           | time
           | 
           | ?
        
             | ketzo wrote:
             | Build status immediately jumps to mind.
        
               | kqr wrote:
               | Getting any sort of interesting insight from that surely
               | requires the context of historical build statuses?
               | 
               | How long will it be down for? When will it be down next?
               | How likely is it that it goes down next week? Is it just
               | me or has it been down a lot this month?
        
         | ThePadawan wrote:
         | I can only second the sibling comment.
         | 
         | For my use case ("check when the cronjob X on this machine last
         | ran successfully"), setting up a data ingress pipeline which I
         | could later configure as a time series data source seems like 3
         | times the effort it should actually take.
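A minimal sketch of a low-effort way to cover this use case, under stated assumptions: the cron job touches a marker file each time it succeeds, and a small check reads that file's modification time. The marker path and the age threshold are hypothetical, and this is not a Monitoror feature.

```python
import os
import time


def seconds_since_last_success(marker_path, now=None):
    """Return the age in seconds of the marker file, or None if it is missing.

    Assumes the cron job updates the file's mtime only after a successful run.
    """
    try:
        mtime = os.path.getmtime(marker_path)
    except FileNotFoundError:
        return None
    now = time.time() if now is None else now
    return max(0.0, now - mtime)


def is_fresh(marker_path, max_age_seconds, now=None):
    """Return True if the job succeeded within the last max_age_seconds."""
    age = seconds_since_last_success(marker_path, now)
    return age is not None and age <= max_age_seconds


if __name__ == "__main__":
    # Hypothetical marker path and a 25-hour budget for a daily job.
    print(is_fresh("/var/run/cronjob-x.ok", 25 * 3600))
```

The result could then be written to a file or served over HTTP for a wallboard tile to read, with no time series store involved.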
        
       | thehodge wrote:
       | Reminds me of http://dashing.io/
        
         | sdiepend wrote:
         | Indeed, which has a successor named
         | https://smashing.github.io/. Unfortunately it's not being
         | developed very actively.
        
         | djsumdog wrote:
         | Dashing was kinda garbage though. There was no standard/sane
         | way to install new plugins. I haven't checked out the currently
         | maintained fork, but the original is dead/archived. I made the
         | following for Dashing for tracking Seattle Transit:
         | 
         | https://github.com/sumdog/seatransit
        
       | BlackLotus89 wrote:
       | Could you grab and parse content with this? I'm not really using
       | CI stuff, but showing events (calendar), grabbing weather data or
       | output from other simple commands (health checks) could be of
       | use. Didn't find any of that in the example tiles
        
         | alex_d wrote:
         | Yep, check HTTP-FORMATTED tile :)
         | 
         | You can display content from JSON, YAML or XML available over
         | HTTP
        
           | BlackLotus89 wrote:
           | Oh, I didn't see that HTTP-RAW also returns the regex match.
           | Thanks, will give it a try. Any possibility for command
           | outputs though?
        
             | alex_d wrote:
             | Put the output in a file and expose it with a simple HTTP
             | server :)
             | 
             | I do not think that we will add some command call since it
             | can be heavy and can potentially add some security
             | concerns.
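The suggestion above (write the command output to a file, then expose it with a simple HTTP server) can be sketched in Python using only the standard library. The command, file name, directory, and port below are hypothetical, and the server binds to localhost only as a cautious default.

```python
import subprocess
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def write_command_output(command, out_file):
    """Run `command` (a list of arguments) and save its stdout to `out_file`."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(result.stdout)
    return out_file


def make_server(directory, port=8000):
    """Build an HTTP server that serves files from `directory` on localhost."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(directory))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


if __name__ == "__main__":
    # Hypothetical command and paths, chosen only for illustration.
    write_command_output(["uptime"], "status/uptime.txt")
    make_server("status").serve_forever()
```

In practice the command would be re-run on a schedule (for example from cron) so the served file stays current; an HTTP tile could then read http://127.0.0.1:8000/uptime.txt.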
        
       | blowski wrote:
       | Is this like an open-source Geckoboard?
        
         | alex_d wrote:
         | Kind of, yep :)
         | 
         | But there is no graph/visualization support for now, and
         | Monitoror is more for IT monitoring right now.
         | 
         | It will evolve to add more and more tile types.
         | 
         | Feel free to create issues if you need specific tile types :)
        
       | chrissnell wrote:
       | Neat. I want to add crabby support:
       | 
       | https://github.com/chrissnell/crabby
        
       | sub7 wrote:
       | Problem with this is that any half competent team can put
       | something like this up in an hour or so. Wallboards are for high
       | level stats - like 1 or 2 numbers the team should focus on.
       | 
       | Maybe the tiles are super smart and can do uptime testing, log
       | monitoring etc in which case this should be positioned as an
       | uptime tester/log monitor etc
        
       | cstuder wrote:
       | Strangely the page doesn't say anything about it being Open
       | Source.
       | 
       | It's MIT licenced by the way.
        
         | alex_d wrote:
         | You're right, I should add that to the landing page :)
         | 
         | It's in the footer but... who looks at the footer? :p
        
       | reaperducer wrote:
       | Panic used to have an iOS app that did this. It was called Status
       | Board, and was magnificent.
       | 
       | You could put it on an old iPad on an easel on your desk and
       | watch everything from RSS feeds to ping statistics. In an office
       | setting, you'd hook the 'Pad up to a cheap flat screen TV so
       | everyone could see.
       | 
       | Sadly, Panic discontinued it when it decided to go after the
       | video game market.
        
         | gumby wrote:
         | I had this app and it was useful; it's a shame that they
         | discontinued it though I was glad they gave a clear explanation
         | as to why.
        
         | Brendinooo wrote:
         | They wrote about why they killed it: simply put, there weren't
         | enough sales to justify further development.
         | 
         | https://panic.com/blog/the-future-of-status-board/
        
         | SergeAx wrote:
         | But why do you need an app when a webpage is more than enough?
        
           | _jal wrote:
           | Because it cost like $10 and 5 minutes, could be set up by
           | non-web-plumbers, and was pretty out of the box.
        
             | iamben wrote:
             | Aside: this is one of the biggest lessons of my adult life.
             | Just because I _could_ make something doesn't mean I
             | _should_ make something. Learning to value your time is a
             | very underdeveloped skill.
        
               | ampdepolymerase wrote:
               | But...but.. something.. something Stallman...vendor lock-
               | in...closed-platforms bad...something.
        
               | SergeAx wrote:
               | If this is sarcasm, then please mind that the original
               | comment author got really humped by this app's vendor
               | when it stopped working. Maybe Stallman got something
               | right after all?
        
               | reaperducer wrote:
               | It didn't stop working. It just doesn't get updated
               | anymore.
        
             | SergeAx wrote:
             | I really don't understand, it sounds too ignorant and
             | almost arrogant.
             | 
             | Web browser doesn't need to be set up, it is already
             | included with iPad (or any Android device, or any laptop or
             | desktop, or your smart TV, smart fridge or smart watch).
             | 
             | You need to configure your app during first launch, pointing
             | it to your data source, and maybe entering a login and
             | password. It is _exactly_ the same amount of hassle as
             | opening a web site and making it the default homepage of
             | your browser.
             | 
             | I don't even want to talk about money, it's totally
             | irrelevant, $10 or $0.
        
               | reaperducer wrote:
               | The hard/technical part isn't launching the browser. It's
               | about building a continuously updating status page for
               | that browser to display.
        
           | stiray wrote:
           | One more thing, if I want a webpage, I will just write a
           | plugin for this: https://github.com/netdata/netdata
           | 
           | It is meant as a system monitor but it can chew up pretty much
           | anything. And it is fast and lean. Really fast.
        
             | djsumdog wrote:
             | Have you found any tools for writing custom dashboards in
             | netdata?
             | 
             | The docs are really light and custom dashboards seem to
             | involve hacking the default HTML and pulling out all the
             | components (javascript/divs) you need. I thought about
             | writing my own, but have too many projects already.
        
               | stiray wrote:
               | Set up netdata, check which plugin serves the type of
               | chart you want, find the shortest of that type, make a copy
               | and hack it until it works. I was doing it a while back
               | so I can't remember the details but it wasn't something
               | special to do... Ignore the docs, existing plugins are
               | all the documentation you need.
        
       | CubsFan1060 wrote:
       | For a terminal version of this, I really like
       | https://wtfutil.com/
       | 
       | Create a config file, and you get something similar.
        
       | EToS wrote:
       | Site down :-)
        
       | bilekas wrote:
       | Nice handy tool, one gripe is the scaling with different sizes,
       | the text does scale, but not the box modules.. Small thing but
       | really nice tool
        
       | dnadler wrote:
       | This is cool, but I'm running into a lot of issues with multiple
       | Jenkins tiles. The name from one is erroneously propagating to
       | following tiles :/
        
       | gitgud wrote:
       | Nice design! Looks to be targeted at developers, but it could be
       | good for product managers too. Some tile ideas; Issue counts, PR
       | counts, vanity metrics... plenty of room for extension :)
        
         | alex_d wrote:
         | There is already GITHUB-COUNT tile type for issue/PR count :)
         | 
         | And yep, we plan to add more and more tile types as users ask
         | for them.
         | 
         | Thank you, glad you enjoy the design :D
        
       | grantler wrote:
       | The first UI config example has a PING tile, but PING type seems
       | to be disabled by default, and I can't find how to enable it in
       | the docs. So maybe a good thing to make more clear for people
       | wanting to test quickly.
        
         | alex_d wrote:
         | You're right, I will change the config example for now.
         | 
         | Check the note in the Ping section here:
         | https://monitoror.com/documentation/#ping
         | 
         | I will work on making it more obvious/visible :)
         | 
         | Thank you for your feedback!
        
       | djsumdog wrote:
       | I've been looking at different status board tools and the one
       | thing I've always found missing is dual-stack IPv4+IPv6 tests.
       | It'd be nice to be able to see that both protocols to a given
       | port are working as expected.
       | 
       | I don't want to write my own, so I'll probably settle on one and
       | try to offer up a PR for dual-ip stack checks. I'll take a look
       | at this one too.
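A dual-stack check of the kind described above can be sketched with Python's standard `socket` module: resolve the host once per address family and try a TCP connection on each. The function names and the timeout are assumptions for illustration; this is not code from Monitoror or any other status board.

```python
import socket


def port_open(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over `family`."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the host has no address in this family
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next resolved address, if any
    return False


def dual_stack_status(host, port, timeout=3.0):
    """Check the same port independently over IPv4 and IPv6."""
    return {
        "ipv4": port_open(host, port, socket.AF_INET, timeout),
        "ipv6": port_open(host, port, socket.AF_INET6, timeout),
    }


if __name__ == "__main__":
    # Hypothetical target; replace with a host you operate.
    print(dual_stack_status("example.com", 443))
```

Checking each family separately matters because a default client connection succeeds as soon as either protocol works, which hides a broken IPv6 (or IPv4) path.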
        
       | CSDude wrote:
       | I know people like wallboards and monitors but we found them to
       | be an anti-pattern. If you find yourself looking at a
       | wallboard/dashboard, it should already be an automated alert.
        
         | sunbear-lover wrote:
         | By that logic a speedometer is an anti pattern and your car
         | should just send up an alert when you're speeding... since when
         | is getting accurate real-time information a bad thing?
        
           | jedberg wrote:
           | That's... actually true. It doesn't matter how fast you're
           | going unless you're speeding. And it distracts you by making
           | you look down.
           | 
           | The only reason we don't have that yet is because the car
           | doesn't know the speed limit everywhere all the time.
        
             | timdorr wrote:
             | Funny thing, this is actually a feature in Teslas. You can
             | set it to chime once the speed limit is exceeded (in areas
             | where it knows the limit). Although, I've never seen anyone
             | turn that on.
        
             | kube-system wrote:
             | I strongly disagree. I can think of a ton of reasons why a
             | driver may need (or even be legally required) to know their
             | speed regardless of speed limit:
             | 
             | * when speed restricted by equipment (trailer, temporary
             | spare, etc)
             | 
             | * when observing advisory speeds
             | 
             | * when observing minimum speed requirements
             | 
             | * as a reference for judging appropriate speeds under
             | inclement conditions
             | 
             | * as a reference for judging appropriate
             | acceleration/deceleration rates when entering/exiting the
             | roadway
        
               | jedberg wrote:
               | Of course an alert system would have to be able to
               | understand all those things. That's why we don't have
               | that kind of system.
               | 
               | A single number in isolation is rarely useful. Graphs
               | with trends are useful. Alerts are useful.
               | 
               | The only reason we don't have alert based speeds is
               | _because_ it can't get all the necessary information to
               | make a useful alert, so we compromise by telling you the
               | number.
               | 
               | > as a reference for judging appropriate
               | acceleration/deceleration rates when entering/exiting the
               | roadway
               | 
               | A perfect example of why a graph would be ideal here, not
               | a single number.
        
               | kube-system wrote:
               | > reference for judging appropriate
               | acceleration/deceleration
               | 
               | > perfect example of why a graph would be ideal
               | 
               | A gauge chart, maybe? :D
               | 
               | But seriously, if we have a system that appropriately
               | judges everything on my laundry list above, you probably
               | won't need an alert system anymore because the cars will
               | be self-driving.
        
             | jaywalk wrote:
             | I have that in my 2020 Ford, and you can tune the alert
             | threshold from 0-5mph over the limit. I wouldn't even
             | consider turning it on unless I could set it to at least
             | 10mph, however.
             | 
             | The vehicle has a camera that looks out for speed limit
             | signs, and then updates a little icon on the instrument
             | cluster with the current speed limit. It works very well.
        
               | zola wrote:
               | I drove a car like that a few times, with beeping
               | when the limit was exceeded. Annoying, but informative,
               | unless a road sign was limiting vehicle mass to 10t and not
               | speed to 10km/h. It was a carsharing vehicle, so I didn't
               | bother to turn it off.
        
         | alex_d wrote:
         | We are thinking about monetizing an alert feature as a browser
         | extension or PWA for mobile based on Monitoror Core API :)
         | 
         | A wallboard is useful to monitor project builds, CI servers,
         | even production things.
        
         | sirtoffski wrote:
         | I'll chime in here to say we use both at work. In a NOC at a
         | medium-sized ISP, we are getting hammered with alerts 24/7.
         | Some are not urgent, while others need to be actioned much
         | faster - I mean a 100G transit link down is no good.
         | 
         | We'd receive an automatic email about a large circuit going
         | down, we'd also receive a ticket about it; sometimes people
         | don't look at the tickets closely enough, other times people get
         | distracted with other topics, issues, etc. Having a large
         | screen with interface status monitoring has proven to be
         | effective enough; for example, someone walks by the monitor and
         | says "why is this thing red, is it supposed to be?"... and we
         | immediately know one of the larger interfaces is down.
         | 
         | In an ideal world, we would not need it because every ticket
         | would be diligently dealt with... however in the real world,
         | having a big red part of the screen flashing has proved quite
         | effective.
        
           | sparrish wrote:
           | If you're getting alerts for non-actionable events, you need
           | to do a better job of tuning your monitors and alerts.
           | 
           | Alerts shouldn't be sent about anything that doesn't require
           | an action.
        
             | C1sc0cat wrote:
             | Yeah, anyone from Google here? It would be nice if you allowed
             | us to fine-tune the alerts from GSC - I came in today and found
             | 87 non-useful alerts in my inbox.
             | 
             | I will have a look at the tool and have a play - I assume
             | you can have multiple pages :-)
             | 
             | Would be cool to monitor the looks at GA " 1 2 3 4 5 ....
             | Many" sites I have an interest in
        
             | chrisandchris wrote:
             | There's a wide variety of "requires action". It might be
             | that it's fine to act within 1 hour or within 10min. Both
             | deserve an alert, but only one requires you to immediately
             | stop your coffee break...
             | 
             | In an ideal world, I agree. But sometimes an automated
             | system cannot perfectly decide about the severity of an
             | alert, which leads to some alerts being ignored, which is
             | fine.
        
             | sirtoffski wrote:
             | Well the thing is alerts are indeed for actionable events.
             | 
             | For example many remote locations have an on-site battery
             | backup, which would supply power in the event of losing
             | commercial power. Those are actioned in terms of notifying
             | field teams and deciding whether a specific location needs
             | to be placed on a generator.
             | 
             | Imagine a hurricane disrupted the commercial power grid and
             | there are thousands of "site on battery" alerts; somewhere
             | among them there is also an alert for OSPF down between two
             | core switches.
             | 
             | Having a monitor with a large red warning saying "Link X at
             | location Y is down!" - is a pretty effective way to not
             | miss important notifications.
             | 
             | I mean playing devil's advocate one might say "Then your
             | alerts should have better filtering system with the
             | important ones staying at the top of the page"... which is
             | true. A lot of smart design features can render dashboards
             | less relevant - however when there aren't enough resources
             | in a DevOps team to implement those solutions, a simple
             | dashboard can go a long way!
        
         | jrockway wrote:
         | I think wallboards can be interesting. Do you want an alert if
         | your site is suddenly trending on Twitter? If latency and error
         | rates are good, probably not. Would you be interested if you
         | walked by and noticed? Probably.
        
         | sm4rk0 wrote:
         | But visualising the data and alerting are two different things.
        
           | geofft wrote:
           | Yes, which is why you shouldn't use wall boards for alerting,
           | only for visualization.
           | 
           | https://demo.monitoror.com/?configUrl=https://monitoror.com/.
           | .. is full of things that aren't visualizations at all (no
           | graphs, no sense of whether things are abnormal but not past
           | an alerting threshold, etc.) and are in fact alerts (the
           | website is fine, one PR failed, the QA nodes are ... doing
           | something but there isn't enough space to see what is wrong).
           | 
           | If you want some graphs, great. If you want your team to look
           | up every few minutes and poll some graphs (or worse, some
           | colored rectangles) to figure out what they're supposed to be
           | doing, consider that polling is usually the wrong approach.
           | 
           | (To be clear, this is a criticism of the choice of demo data,
           | not of the product overall. A product like this has its uses,
           | but "our alerting system is people looking up at the TV" is
           | not one of them.)
        
         | OJFord wrote:
         | What would your alert be for # open PRs (an example in the demo
         | linked from posted page)? How often would it fire?
         | 
         | Whatever the answer, that's a different thing from this. Both
         | have their place.
        
           | geofft wrote:
           | Why do you want a display of open PRs at all?
           | 
           | I think the fundamental question of all such tools is "Why
           | are we watching this, and what are we looking for," and there
           | are limited but nonzero good reasons to have a display.
           | "Someone should look at open PRs if there are too many" is a
           | bad one - the number doesn't tell you about the urgency of
           | the existing PRs. If you want to respond promptly, respond to
           | all of them promptly.
           | 
           | "We need to know if we're falling behind" is a possible
           | reason to create an alert, not a dashboard. If you really
           | want people to drop what they're doing and triage issues if
           | there are too many, make an alert. If you don't, you'll just
           | get a rectangle that turns red at some point and train people
           | to ignore red rectangles on the board. (Relatedly: I added a
           | pageable alert to my team a few years back to check whether
           | there are a large number of non-pageable alerts, because it
           | usually means something has gone wrong at a low level and we
           | should investigate urgently. It's worked out pretty well, but
           | the alert looks only at tickets created by our monitoring
           | systems, not at tickets created by humans.)
           | 
           | "We need to see if we're getting worse" is a reason to have
           | managers review graphs periodically, not a reason for anyone
           | to stare at a single display. You can't track long-term
           | trends from a status board.
           | 
           | "I need to see what to work on" is a valid reason, but much
           | more useful in the form of a website you can visit on your
           | own computer with links to PRs, not a raw number on a TV
           | screen. (My team has a TV showing open tickets in our queue,
           | both support tickets and automated alerts, but we all have an
           | equivalent link locally, too. Showing the names of tickets is
           | useful for "Hey teammate, can you look at the second ticket
           | there? Sounds related to a thing you were working on.")
           | 
           | I'd say there are roughly two useful cases for screens like
           | this. One is to show to internal customers, so they say "oh,
           | service X is yellow, so the slowness I'm seeing isn't just
           | me, I'll do something else for a while." But those screens
           | aren't primarily for the team that _owns_ the product, they're
           | for teams that depend on the product. (Such status boards
           | can be either automated or manual.) The other is to show
           | graphs of various metrics to see abnormal behavior, with the
           | idea that no action is ever triggered by someone looking at
           | the graph, but if you're _already_ investigating something,
           | it's useful to say "Hey, that's funny, this other thing
           | spiked at about the same time even though it's within
           | acceptable limits" and then you have a clue for
           | investigation.
        
             | wpietri wrote:
             | > Why do you want a display of open PRs at all?
             | 
             | All PRs are WIP, and minimizing WIP is very valuable in
             | product development processes. See Reinertsen's _The
             | Principles of Product Development Flow_ for the math, but
             | basically high/unpredictable latency drastically limits
             | the pace of learning and causes a lot of upstream thrash
             | and waste.
             | 
             | I remember talking with one team at the bird-themed social
             | media company that was frustrated with slow PRs; they
             | dropped average delay from 3-4 days to under 4 hours. They
             | said it made a huge experiential difference and they loved
             | the change.
        
               | geofft wrote:
               | Yes, I understand why you'd want to focus on solving the
               | number of open PRs. I agree that keeping that number down
               | is good. My question is _why do you want to put this on a
               | TV screen_.
               | 
               | If you want people to focus on open PRs, tell them to
               | open GitHub on their computers, don't tell them to look
               | up at a TV screen periodically. Treat it like alerts: you
               | have a list of open things to deal with and you need to
               | get that number to zero. There's no threshold greater
               | than zero of a long-term acceptable number of open PRs.
               | 
               | If the problem is that they have other things to look at
               | too, installing yet another TV screen won't solve that,
               | your team needs to make the management decision of what
               | to prioritize. Options include making a unified dashboard
               | of incidents/alerts/PRs/support tickets (and encoding
               | which ones sort to the top), setting up a PR review
               | rotation (i.e., for one week, completing reviews is your
               | top priority barring all-hands-on-deck incidents),
               | treating open PRs as alerts and escalating them if nobody
               | replies within 4 hours, removing other work by deciding
               | you'll deprioritize low-impact alerts (and hope that the
               | increased development velocity ends up solving problems),
               | etc.
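One of the options listed above, treating open PRs as alerts and escalating them if nobody replies within 4 hours, can be sketched as a small filter. The tuple layout and function name are hypothetical; fetching PR data from a code host is left out, and only the 4-hour figure comes from the comment.

```python
from datetime import datetime, timedelta, timezone


def prs_to_escalate(open_prs, now, max_wait=timedelta(hours=4)):
    """Return the ids of open PRs that have waited too long without a reply.

    `open_prs` is an iterable of (pr_id, opened_at, first_reply_at) tuples,
    where first_reply_at is None if nobody has responded yet.
    """
    return [
        pr_id
        for pr_id, opened_at, first_reply_at in open_prs
        if first_reply_at is None and now - opened_at > max_wait
    ]


if __name__ == "__main__":
    now = datetime(2020, 3, 4, 18, 0, tzinfo=timezone.utc)
    # Illustrative data only.
    prs = [
        ("pr-1", now - timedelta(hours=5), None),
        ("pr-2", now - timedelta(hours=1), None),
        ("pr-3", now - timedelta(hours=9), now - timedelta(hours=8)),
    ]
    print(prs_to_escalate(prs, now))
```

The output of such a filter would feed an alerting channel rather than a TV screen, which is the distinction the comment draws.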
        
               | wpietri wrote:
               | The notion with information radiators is not that you tell
               | them to look up. The notion is that people naturally look
               | at things while walking around or when idle, so it's
               | valuable to make important things visible. It also serves
               | as a way to trigger and focus discussions.
               | 
               | For example, consider the Kanban board. Here's one I
               | built a while back:
               | http://williampietri.com/writing/2015/the-big-board/
               | 
               | We loved having a physical map of what we were up to.
               | We'd have our daily stand-up around the board and discuss it.
               | You'd know when something was completed, because you'd
               | see somebody move a card. I would often know when the
               | product manager was thinking about something because he'd go
               | over to look right at it. That often sparked conversations.
               | And we'd all have a feel for how work was flowing,
               | something we'd talk about in our weekly retro.
               | 
               | Could this have been replicated with a system of alerts?
               | No. Alerts are interruptive and necessarily
               | threshold-driven. I don't want my people caught in a cycle of
               | continuous reactivity to things that at some point in
               | history were seen as important enough to configure an
               | alert. Except for emergencies, I want them to be serene,
               | thoughtful, and proactive, which is very hard to achieve
               | if you're continuously juggling alerts.
               | 
               | So I'd put up something with PR stats if it were
               | something I wanted us to be aware of. Especially so if it
               | were an item of concern in previous retros. Maybe that
               | would eventually lead to an alert (although I'd hope
               | not). But the first step in solving a problem is
               | understanding the problem, and I think information
               | radiators are great for that, especially when problems
               | are thorny and don't have obviously correct answers.
        
               | geofft wrote:
               | That's fair - I think part of it is also that you don't
               | really have a green vs. red state (which is a good part
               | of what I object to in the demo presentation), you just
               | have a general feel, and no specific state is defined as
               | an actual problem. (And most of what you're trying to
               | achieve is a shared sense of what's being done, which is
               | very different from a shared sense of what's broken and
               | needs fixing.)
        
           | CSDude wrote:
           | If you just want to have a nice visualization to look at some
           | numbers, fine. But, if you want to detect problems, it's
           | ineffective. I saw too many companies do it to actually
           | monitor the state of things and find out problems with
           | charts, numbers, traffic lights etc.
        
             | OJFord wrote:
             | But that's my point, it isn't _for_ alerting about
             | problems, some things have a 'status' that might be
             | interesting, but isn't a problem, or something to fire an
             | alert on necessarily.
             | 
             | You could have unintrusive notifications (inaudible etc.)
             | to 'alert' to such statuses I suppose, if they were kept in
             | view and not 'dismissed' (whatever that means for the
             | medium they came in) - but then really you're just
             | implementing a version of something like this Monitoror in
             | your inbox, phone notification tray, Telegram channel, or
             | whatever.
             | 
             | You're not going to rip out logging, prometheus, or the
             | own UIs of the services that this connects to just because
             | you have alerting, so I don't see why you would rip out
             | this. It's like prometheus & grafana for higher level
             | stuff. (Of course you _could_ use those tools for this sort
             | of monitoring too, but that's not really the point.)
        
             | throw_away wrote:
             | You can do both, especially at the beginning of a system's
             | lifecycle, when you don't really understand its behavior
             | yet. Lots of times, people wandering by have said hmm, that
             | doesn't seem right... Later, as we learned more, these
             | hunches evolved into more advanced automated alarms.
        
             | virgil_disgr4ce wrote:
             | A "nice visualization" is not necessarily just a
             | "pretty"/"shiny" thing to show off to people. Human beings
             | are highly visual creatures with outstanding visual pattern
             | recognition abilities. Maybe you personally don't get
             | anything out of them but the value of visualization is
             | proven. Here are a few sources to get you started:
             | https://www.csgsolutions.com/blog/15-statistics-prove-
             | power-...
        
         | scoutt wrote:
         | I don't think it is intended to be stared at for 8 hours
         | straight.
         | 
         | I think it's more like a clock: you look at it several times a
         | day, and not only when you hear an alarm.
        
           | tilolebo wrote:
           | Clocks are an anti pattern...
           | 
           | Why would you want to have the time displayed permanently?
           | It's such a distraction for developers.
           | 
           | Just set automated alerts for lunch and end of day and that's
           | it.
        
             | dvtrn wrote:
             | _Clocks are an anti-pattern_
             | 
             | Well this is a first for me...
        
             | scoutt wrote:
             | Well, I have a clock on the wall in front of me that
             | permanently displays the time. I check the time several
             | times a day, for example to check how much time I have left
             | to do something before lunch or going home, or having a
             | meeting.
             | 
             | I don't know why we are discussing the practical uses of a
             | clock. I can't imagine a life where one is allowed to look
             | at a clock _only_ when an alarm or alert is triggered.
        
             | lioeters wrote:
             | Calendars, clocks, real-time notifications, and video chats
             | are all anti-patterns, distracting developers from their
             | zone of genius. Just send a concise email at the
             | beginning/end of the day/week. (One can only dream..)
        
         | RBerenguel wrote:
         | I've spotted interesting "things" from idly looking at our
         | dashboard while chatting with coworkers (and more than a few
         | were interesting enough to warrant a lot of investigation and
         | double-checking of metrics, providers and stack). They were not
         | alert-able, or not very easily unless we wrote some complex
         | time series analysis system for our internal metrics.
        
         | wpietri wrote:
         | Not at all. Alerts serve a different purpose.
         | 
         | One of the most important things a team needs over the long
         | haul is a _feel_ for their system. Many people refer to this as
         | mechanical sympathy. And the way you develop that is long-term
         | exposure to rich data.
         | 
         | Alerts are the red and yellow lights on your dashboard. But you
         | get mechanical sympathy by listening to the sound of the
         | engine, feel of the road, and the smell of things when you take
         | a peek under the hood.
         | 
         | There are a lot of ways to achieve mechanical sympathy, of
         | course. And information radiators are easily misused; you have
         | to have the right information shown in the right ways for
         | people to develop a correlative, intuitive understanding of
         | what they've built. But nobody develops mechanical sympathy by
         | looking at dashboard lights alone.
        
           | TeMPOraL wrote:
           | > _you have to have the right information shown in the right
           | ways for people to develop a correlative, intuitive
           | understanding of what they've built_
           | 
           | Lots of things have to be right for this to work,
           | unfortunately, and company dashboards I've seen so far tend
           | to be nowhere near it.
           | 
           | For instance, the dashboard refreshed $PERIOD only makes
           | sense if you're showing data that updates $PERIOD, and if you
           | can respond to changes in that data $PERIOD. $PERIOD = "in
           | realtime" or "every minute" or "hourly" or whatever is
           | relevant in a given context.
           | 
           | If you're looking at the dashboard much more frequently than
           | the data changes, you're wasting time. If the data changes
           | much more frequently than you're looking at it, you're likely
           | to miss things, as 'geofft mentions elsewhere in the thread.
           | And if you can't react to the data roughly as fast as it's
           | updating, there's no point in looking at it so often. All
           | those periods - recording, observing and reacting - must be
           | roughly similar for the always-on dashboard to be useful,
           | relative to generating reports every now and then.
           | 
           | Panels full of lights and charts work on fighter jets or on
           | the bridge of the Enterprise, because the pilots/crew are in
           | a tight feedback control loop with their dashboards.
           | 
           | (WRT. reacting in time, there are also error bars to
           | consider. For instance, people on a diet are advised to weigh
           | themselves weekly and not daily, because body mass varies by
           | +/- 2kg during the day, so a naive person checking weight
           | daily would get fixated on those random oscillations. It's
           | easier to tell regular people to reduce measurement frequency
           | than to explain to them what a low-pass filter is and how it
           | is relevant here. I have a feeling there's plenty of
           | dashboard misuse that amounts to that too.)
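For readers who haven't met a low-pass filter, here is a minimal sketch of the idea using an exponential moving average. The weights below are made-up illustrative numbers, and `alpha=0.2` is an arbitrary assumption, not a recommendation.

```python
def exponential_moving_average(samples, alpha=0.2):
    """First-order low-pass filter: each output blends the new sample
    with the previous output, so fast oscillations are damped."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    current = None
    for x in samples:
        # The first sample seeds the filter; later ones are blended in.
        current = x if current is None else alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical daily weigh-ins (kg) that bounce around from day to day.
daily_kg = [80.0, 81.8, 79.1, 80.9, 78.6, 80.4, 79.0, 80.7, 78.3, 79.9]
trend = exponential_moving_average(daily_kg)

raw_spread = max(daily_kg) - min(daily_kg)
trend_spread = max(trend) - min(trend)
print(f"raw spread: {raw_spread:.2f} kg, filtered spread: {trend_spread:.2f} kg")
```

The filtered series swings much less than the raw one, which is roughly what "weigh yourself weekly" achieves by hand. A dashboard that plots the filtered value instead of the raw one would be one way to avoid fixating on noise.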
           | 
           | --
           | 
           | Speaking of the Enterprise and "getting the feel for the
           | system", there's something that I'd like to try one day: make
           | a monitoring tool that translates various system metrics into
           | background sounds, creating an ambience similar to the one
           | you hear on the Enterprise-D[0][1]. I feel a somewhat
           | unobtrusive mix of background noises would be better to
           | develop "the feel for the system" than a visual dashboard.
           | Real-life examples of this are combustion engine's RPM, or
           | spinning rust hard drives, if anyone still remembers those.
           | 
           | --
           | 
           | [0] - https://www.youtube.com/watch?v=UKBvaOLDem0 - the
           | bridge
           | 
           | [1] - In Enterprise's engineering, there's a well-known
           | pulsating sound of the warp core; I can't find a good enough
           | YouTube video (whatever there is, apparently got broken by
           | YT's audio compression). This background pulsing correlated
           | with the speed at which the Enterprise was traveling.
        
             | wpietri wrote:
             | Agreed! I find dashboards at most companies disappointing.
             | And often for the same reason I find other stuff on their
             | walls disappointing: it's frequently irrelevant or actively
             | unhelpful to the work actually being performed.
             | 
             | For me, good dashboards are like good checklists: they
             | should be living entities owned by the team in question and
             | regularly updated to address active concerns. And they
             | don't even have to be complex. Back before CI was in
             | fashion, I drove giant changes in a team's behavior just by
             | having a single LED indicator (the now-departed Ambient
             | Orb) show the state of the current build. Previously, the
             | build would stay broken for weeks at a time, only
             | converging to green around the time of release. Nobody
             | liked that, but they were used to it, so they'd just work
             | around it. But once it was visible and discussed, they
             | eventually got so the build was green almost all the time.
             | It was less painful and saved a bunch of time.
             | 
             | I would absolutely love to try out a set of ambient audio
             | indicators. I suspect I'd want to try it along with a
             | visual dashboard, because the moment I hear something
             | anomalous, I'm going to want to look up and see the recent
             | history, so I can correlate the audio with what it
             | represents and what else is going on.
        
             | RBerenguel wrote:
             | Likewise! I found this some time ago:
             | https://dl.acm.org/doi/10.5555/1045502.1045526
             | 
             | but never tried it, I just left it as a fun idea for the
             | future.
        
         | AgloeDreams wrote:
         | You know, for some people I think that's true and for others
         | it's not. There is real value in making some data reactive
         | rather than proactive in communication. Knowing current active
         | traffic, open PRs, time til build is done, all that kind of
         | stuff is 'I would like to see it/check it...but I do not want
         | it to interrupt me.'
         | 
         | People who deal with tens of interruptions at that level are
         | clearly not very productive.
         | 
         | On the other hand, for the site returning non-200 or for API
         | issues, that should be an alert, for sure.
         | 
         | Kinda surprised that Slack or MS Teams isn't in this market.
        
         | dexterdog wrote:
         | I find them great as a first look when there is a problem
         | because you can often pinpoint the problem just by looking at
         | the board.
        
         | wjossey wrote:
         | Strongly disagree.
         | 
         | Understanding your metrics is a key part of so many roles, from
         | devops, to product teams, to marketers...
         | 
         | Yes, you should be automating alerts whenever possible. Yes,
         | you should be putting up key metrics in a visible place so
         | everyone can see how the product is performing.
         | 
         | I can't tell you how many times I caught an issue because I
         | knew our metrics backwards and forwards, but it didn't trip an
         | alert threshold. Not every issue follows a pattern easily
         | defined in a check, and human brains are incredible computers
         | capable of helping to fill in that gap.
        
           | geofft wrote:
           | > _I can't tell you how many times I caught an issue because
           | I knew our metrics backwards and forwards, but it didn't trip
           | an alert threshold._
           | 
           | So how many times was an issue _missed_ because you weren't
           | in the office, or because you were looking at your own screen
           | and not dashboards at the moment?
           | 
           | Humans are incredibly powerful, but our whole job as SREs is
           | to make things reliable, repeatable, and scalable. We're
           | doing an industry-wide migration from elegantly hand-crafted
           | LAMP stacks running SSH to Kubernetes and infrastructure-as-
           | code, not because you can't fix problems with SSH (you can,
           | and you can usually fix them faster and better) but because
           | you can't _scalably_ fix problems with SSH. Similarly, if a
           | human found an issue and alert didn't trip, I'd count that
           | as a bug/missing feature in the monitoring.
           | 
           | It's valuable while you're still small and working out your
           | monitoring to keep a human in the loop - but at some point
           | you need to get rid of that single point of failure. By all
           | means, rely on a human to figure out where your alerting is
           | lacking (just like you rely on a human to write the
           | infrastructure-as-code), but you should eventually not rely
           | on human intervention to actually keep incidents from
           | happening.
        
             | pjmorris wrote:
             | > Similarly, if a human found an issue and alert didn't
             | trip, I'd count that as a bug/missing feature in the
             | monitoring.
             | 
             | The way that I took the GP's point was that humans can find
             | things that haven't yet been automated, while automation
             | can't (at least not yet, but I'd argue it'll take AGI for
             | that.)
        
               | geofft wrote:
               | Yes, I agree with this. But if you're _relying_ on humans
               | to look at dashboards to keep your actual service up in
               | the moment, you're not seriously committing to
               | automating (just like if you SSH to every machine you
               | Terraform to tweak things, you're not really committed to
               | Terraform).
               | 
               | What you should do is rely on automation to detect
               | problems and alert people, and in postmortems, look at
               | graphs and have humans say things like "Hey, this queue
               | kept steadily climbing for three hours before the outage"
               | or "We would have noticed it in this metric but it's so
               | noisy so we can't alert on it" or something. Then you can
               | write more automation (or focus on some prerequisite dev
               | work).
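As an illustration of turning a postmortem observation like "this queue kept steadily climbing" into automation, here is a minimal sketch that fits a least-squares line to recent samples and flags a sustained upward slope. The function names, the sample data, and the `min_slope` threshold are all hypothetical; a real monitoring stack would express this in its own rule language.

```python
def slope_per_sample(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def steadily_climbing(values, min_slope):
    """True if the fitted trend rises by at least min_slope per sample."""
    return slope_per_sample(values) >= min_slope


# Hypothetical queue depths sampled at a fixed interval.
noisy_but_flat = [10, 12, 9, 11, 10, 12, 9, 11]
climbing = [100, 104, 111, 114, 121, 126, 129, 136]

print(steadily_climbing(noisy_but_flat, min_slope=0.5))
print(steadily_climbing(climbing, min_slope=0.5))
```

Fitting a trend over a window rather than comparing against a fixed threshold is one way to alert on a metric that is too noisy for a simple limit, though choosing the window and slope still takes the kind of human judgment discussed above.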
        
               | kqr wrote:
               | I don't think anyone is arguing that, though. A lot of
               | what humans notice, e.g. "we speculatively upped the
               | virtual file system cache and now the service has worse
               | throughput but better high nines response time", is not
               | something you can really build an alert for, and neither
               | is it something you really want an alert for -- but
               | absolutely something that would show up on a dashboard
               | you're intimate with.
               | 
               | In other words, people are not arguing for replacing
               | alerts with humans, but rather arguing that continuously
               | looking at your metrics gives you a mental model for how
               | your system behaviour changes in response to changes in
               | configuration, whether intentional or not.
        
             | reaperducer wrote:
             | _So how many times was an issue missed because you weren't
             | in the office, or because you were looking at your own
             | screen and not dashboards at the moment?_
             | 
             | That's not a problem with dashboards. That's a problem with
             | training and staffing people.
             | 
             | _because you can't scalably fix problems with SSH._
             | 
             | The number of businesses that need to worry about
             | scalability is vanishingly small compared to the number of
             | businesses that don't. Let's not pretend that one company's
             | problems are the same as another's.
             | 
             |  _you should eventually not rely on human intervention to
             | actually keep incidents from happening._
             | 
             | He didn't state that the dashboard was the only way his
             | organization kept tabs on things. He indicated that it was
             | only one way, and specifically stated that an alert system
             | also exists.
        
               | tyrust wrote:
               | >That's not a problem with dashboards. That's a problem
               | with training and staffing people.
               | 
               | Training and staffing people to look at dashboards? I've
               | never heard of this and it sounds awful.
        
               | reaperducer wrote:
               | "Hey, Mike. On your way to the Keurig, remember to glance
               | at the status panel on the wall and let us know if
               | something doesn't look right, OK?"
               | 
               | Brutal.
        
               | InvisibleCities wrote:
               | Why should Mike have to remember this? Why should all of
               | your infrastructure depend on Mike not getting a text
               | from his wife while walking to the fridge for a La Croix?
        
               | geofft wrote:
               | > _That's not a problem with dashboards. That's a
               | problem with training and staffing people._
               | 
               | Again, the whole point of us being computer people is
               | that we think computers can solve problems in repeated,
               | reliable ways. You can run a highly reliable, say,
               | delivery-based bookstore by having a well-staffed group
               | of well-trained human phone operators who pass messages
               | onto human shippers. People did that (and they still do),
               | and it worked. But we have the thesis that you can do
               | this more efficiently and more reliably - in short, that
               | you can deliver more business value - by using computers
               | to automate the process.
               | 
               | > _The number of businesses that need to worry about
               | scalability is vanishingly small compared to the number
               | of businesses that don't. Let's not pretend that one
               | company's problems are the same as another's._
               | 
               | I do fully agree that different companies have different
               | priorities, and in particular I think it's totally fine
               | to rely on humans in the loop while a system is still
               | young (or has just been redesigned) and you don't have a
               | good codified sense of how it behaves yet. However,
               | 
               | 1) Wall-based dashboards aren't a best practice, any more
               | than SSHing to production servers is a best practice.
               | It's the right thing for some cases, some of the time.
               | I'd agree with "It's a valuable skill, and it's been
               | useful;" I disagree with "It's so valuable you should
               | make sure everyone does it." If you have the option of
               | either getting good at alerts or getting good at
               | dashboards, spend your time getting good at alerts,
               | first. I'd say the same about infrastructure-as-code vs.
               | SSH-to-prod (and I say this as someone who regularly SSHs
               | to prod and is real good at single-machine old-school
               | sysadminnery).
               | 
               | 2) Scalability isn't about absolute size, it's about how
               | much you can do with the resources you have. Small teams
               | and not-yet-profitable teams need to focus _more_ on
               | scalability (in the sense I'm using it) because they
               | simply can't staff enough people to cover up gaps in
               | operability. For example, you're much better off figuring
               | out how to set up HA and automated failover than saying
               | "We're too small for that," setting up a weekly pager
               | rotation with people on call 24 hours a day, and alerting
               | them so much they can't do non-toil work (or worse,
               | burning them out and having them find another job).
               | 
               | Many years ago I was on a ~4-person team at my undergrad
               | computer club running web hosting. We ended up getting
               | popular enough that many real university applications
               | (course websites for submitting assignments, etc.)
               | depended on us. Our priority was that, as students, we
               | couldn't get paged during finals week because our
               | academics would take priority, and yet finals week was
               | the most critical time for the service to stay up. So we
               | got real good at HA, at reproducible deployments and
               | config management, etc. (I remember one time we spun up a
               | new server during finals week - and we didn't have to do
               | any fiddling to add it to the cluster precisely because
               | we'd automated the provisioning process.) We had web
               | pages with graphed metrics to inform our capacity
               | planning, but no dashboards that anyone was expected to
               | stare at, just alerts on full outages.
        
             | _jal wrote:
             | You're both right.
             | 
             | Instrumentation and alerts are vital - they leverage
             | inhuman persistence, patience and low cost. But alerts do
             | not substitute for a deep understanding of how your systems
             | work.
             | 
             | A number of the more useful "pre-crime" alerts we have
             | derived from that - if I hadn't been elbow-deep in our
             | systems long enough to notice certain behaviors have non-
             | obvious second- and third-order effects downstream, we
             | wouldn't have the alerts at all.
        
               | geofft wrote:
               | So, I'm making a bit of a subtle claim - you should
               | absolutely be elbow-deep in your systems, and you should
               | be understanding things well enough to build these sorts
               | of proactive alerts, but you shouldn't rely on people
               | being elbow-deep for noticing problems in real time.
               | 
               | If you're ever at the point where you catch a problem and
               | automated monitoring didn't, that's a bug in automated
               | monitoring. If you are really good at finding new bugs in
               | automated monitoring and more things to monitor because
               | you're spending your time getting a sense of how the
               | system behaves, _that's fantastic_, keep doing that.
               | (That is one of the good reasons for dashboards IMO - a
               | bunch of data to look at when you've already realized
               | something's wrong. Just don't use dashboards to make the
               | decision that something must be wrong.) If you don't
               | improve your automated monitoring and you're worried
               | things will start failing without humans watching
               | dashboards, then you're not solving your existing bugs.
        
           | achow wrote:
           | Strongly agree (with you).
           | 
           |  _From the very first formulation of Ubiquitous Computing,
           | the idea of a calmer and more environmentally integrated way
           | of displaying information has held intuitive appeal. Weiser
           | called this "calm computing".. When information can be
           | conveyed via calm changes in the environment, users are more
           | able to focus on their primary work tasks while staying aware
           | of non-critical information that affects them. Research in
           | this sub-domain goes by various names including "ambient
           | displays", "peripheral displays", and "notification
           | systems"..._
           | 
           | A Taxonomy of Ambient Information Systems: Four Patterns of
           | Design
           | 
           | https://www.cc.gatech.edu/~john.stasko/papers/avi06.pdf
        
           | C1sc0cat wrote:
           | An automated email is ok but visually seeing a graph
           | flatline or a monitor turn red is much more likely to get
           | noticed.
        
             | tyrust wrote:
             | If you ignore alerting then it's likely that your alerts
             | are too noisy. See "alert fatigue".
        
               | C1sc0cat wrote:
               | Its that nice Mr Googles alerts
        
       | kkirsche wrote:
       | Cool item but didn't scale well for mobile (iOS iPhone XS Plus)
        
       | Jeremy1026 wrote:
       | It looks like it doesn't actually support changing the port
       | currently, despite the documentation saying it is possible. I
       | already use port 8080 so kind of stuck until I can use a
       | different port.
        
         | hsartoris wrote:
         | I got it to run on a different port just fine with the MO_PORT
         | environment variable, FWIW.
        
           | Jeremy1026 wrote:
           | Turns out it's just too early in the day. I wasn't saving the
           | variable beyond setting it. So when I switched terminals it
           | didn't exist. Put it in my bash profile and all is well.
        
             | alex_d wrote:
             | You can use a .env file too, or even put it before the
             | command like this:
             | 
             | MO_PORT=8888 ./monitoror
             | 
             | :)
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2020-03-04 23:00 UTC)