[HN Gopher] Monitoring tiny web services
       ___________________________________________________________________
        
       Monitoring tiny web services
        
       Author : mfrw
       Score  : 108 points
       Date   : 2022-07-09 17:29 UTC (5 hours ago)
        
 (HTM) web link (jvns.ca)
 (TXT) w3m dump (jvns.ca)
        
       | dafelst wrote:
       | I like this apparent shift back to "small is okay" where not
       | every service has to be an overengineered allegedly hyper-
       | scalable distributed mess of five nines uptime with enterprise
       | logging, alerting and monitoring.
       | 
       | Those things are nice when you have a bazillion users and
       | downtime means hordes of unhappy users and dollars flushing away
       | at insane rates, but for the vast majority of hobby projects and
       | even mid stage startups, what is described in this article is
       | plenty good enough.
        
         | is_true wrote:
         | I've thought about posting an AskHN about simple infrastructure
         | for some time but I'm not sure how to word it to attract as
         | many responses as possible.
        
       | rozenmd wrote:
        | My particular favourite is how GraphQL servers respond with "200
        | OK" and send the errors in a key called "errors". That makes
        | regular healthchecks almost useless.
       | 
       | I ended up writing my own service[0] to detect problems with
       | graphql responses, before expanding it to cover websites and web
       | apps too.
       | 
       | -[0]: https://onlineornot.com
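        | 
        | A minimal sketch of that sort of semantic check (illustrative
        | only, not onlineornot's actual code): treat an HTTP 200 whose
        | body carries a non-empty "errors" key as unhealthy.

```python
import json

def graphql_healthy(status: int, body: str) -> bool:
    """Return True only if the HTTP status is 200 AND the GraphQL
    response envelope carries no errors."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    # GraphQL servers commonly report failures in an "errors" list
    # while still returning HTTP 200, so the body must be inspected.
    return not payload.get("errors")
```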
        
         | BiteCode_dev wrote:
         | Github answers 404 instead of a 403 when you try to access a
         | private repository while not being logged in.
         | 
          | I assume the rationale is to avoid leaking information about
          | what's private. But still, it's weird.
        
           | [deleted]
        
           | dmlittle wrote:
           | AWS S3 does the opposite when querying objects that don't
           | exist. If you don't have s3:ListObjects permissions on the
           | bucket you'll get a 403 error (you can't differentiate
           | between the object not existing vs. you don't have access to
           | it).
           | 
           | I think either approach is valid as long as you're
           | consistent. You can make a case for either 404 or 403 when
           | you don't have enough permissions. In GitHub's case you can
           | argue that it's a 404 because the resource does indeed not
           | exist through your auth context. In AWS' case you can argue
           | that a 403 makes sense because you don't have permission to
           | know the answer to your query.
        
         | OJFord wrote:
         | I honestly hate that so much, it's a relief to read someone
         | saying the same.
         | 
          | I sort of almost made myself feel a bit better about it by
          | thinking 'no, it's not REST, we _have_ reached the graphql
          | server successfully and got a... "successful" response from
          | _it_, it's sort of a "Layer 8" on top of HTTP'. The problem is
          | that none of the bloody tooling is 'Layer 8', so you end up in
          | browser dev tools with all these 200 responses and no idea
          | which ones are errorful. If any.
        
         | bdd wrote:
         | Google's uptime monitoring also allows writing JSONPath checks,
         | so one can monitor HTTP 200 JSON responses semantically.
        
       | KronisLV wrote:
        | Currently got the cheapest VPS that I could (in my case from
        | Time4VPS; some others might prefer Hetzner, or Scaleway Stardust
        | instances), set up Uptime Kuma on it
        | (https://github.com/louislam/uptime-kuma), and now have checks
        | every 5 minutes against 30+ URLs (could easily do every minute,
        | but don't need that sort of resolution yet).
       | 
       | It's integrated with Mattermost currently, seems to work pretty
       | well. Could also set it up on another VPS, for example on Hetzner
       | (which also has excellent pricing), could also integrate another
       | alerting method such as sending e-mails, or anything else that's
        | supported out of the box:
        | https://github.com/louislam/uptime-kuma/issues/284
       | 
        | Oh, and also Zabbix for the servers themselves. Honestly, if
        | things are as simple to set up as they are nowadays and you have
        | about 50 EUR per year per node (1 is usually enough, 2 is better
        | from a redundancy standpoint, since then it becomes feasible to
        | monitor the monitoring; others might go for 3 nodes for
        | important things), you don't even need to look at cloud services
        | or complex systems out there.
       | 
       | Of course, if someone knows of some affordable options for cloud
       | services, feel free to share!
       | 
       | I briefly checked the prices for a few and most of them are a
       | little bit more expensive than just getting a VPS, setting up
       | sshd to only use key based auth, throwing Let's Encrypt in front
       | of the web UI (or maybe additional auth, or making it accessible
       | only through VPN, whatever you want), adding fail2ban and
       | unattended updates, and doing some other basic configuration that
       | you probably have automated anyways.
       | 
       | The good news is that if you prefer cloud services and would
       | rather have that piece of your setup be someone else's problem,
       | they're not even an order of magnitude off in most cases - though
        | I've yet to see how Uptime Kuma in particular scales once I get
        | to 100 endpoints. Seems like at a certain scale it's a bit
       | cheaper to run your own monitoring, but at that point you might
       | still find it easier to just pay a vendor.
       | 
       | At the end of the day, there's lots of great options out there,
       | both cloud based and self-hosted, whichever is your personal
       | preference.
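        | 
        | The core of a self-hosted checker like that is small; a rough
        | sketch, where the notify hook (standing in for something like a
        | Mattermost webhook) and the check_fn injection point are
        | assumptions for illustration:

```python
import urllib.error
import urllib.request

def check(url, timeout=10.0):
    """Return (up, detail) for one URL. Any response of 400+ or any
    network error (DNS failure, timeout, refused) counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400, f"HTTP {resp.status}"
    except urllib.error.HTTPError as e:
        return False, f"HTTP {e.code}"
    except Exception as e:
        return False, type(e).__name__

def run_checks(urls, notify, check_fn=check):
    """Check every URL once, calling notify(url, detail) for each
    failure. Schedule this every 5 minutes (e.g. from cron) rather
    than sleeping in-process."""
    for url in urls:
        up, detail = check_fn(url)
        if not up:
            notify(url, detail)
```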
        
         | jacooper wrote:
          | You can get a free 4 vCPU / 24 GB RAM / 200 GB storage VPS
          | with Oracle Cloud's free tier.
        
         | perth wrote:
          | You can get a cheaper VPS through RamNode, at $15/year atm
        
           | KronisLV wrote:
           | That's pretty cool!
           | 
           | I guess I'd personally also mention Contabo as an affordable
           | host in general (though their web UI is antiquated),
           | especially their storage nodes:
           | https://contabo.com/en/storage-vps/
           | 
           | For the most part, though, use whichever host you've been
           | with for a few years (though feel free to experiment with
           | whatever new platforms catch your eye), but ideally still
           | have local backups for everything (as long as you don't have
           | to deal with regulations that'd make it not possible) so you
           | can migrate elsewhere.
        
       | tatoalo wrote:
       | I have been using cronitor[0] for a few months now and I have
       | been really satisfied with them so far!
       | 
       | [0]: https://cronitor.io
        
       | pkrumins wrote:
       | If you have a popular service, then one of the best approaches is
       | to have your users notify you when something is down or is
       | broken. This pattern follows the famous quote: "Given enough
       | eyeballs all bugs are shallow." I have employed this approach to
       | great success and haven't had a need for any monitoring services.
        
         | redleader55 wrote:
         | If users see the problem it is too late. You will be seen as
         | unable to keep the service up and the service will be seen as
         | flaky.
         | 
          | Also, the Holy Grail of monitoring is to be able to remediate
          | the problem automatically - this is pretty hard when users are
          | reporting it.
        
       | dimitar wrote:
       | If I have to do one thing to monitor a simple website I'm
       | probably going to use something that takes a screenshot
       | periodically and checks it for changes. There are open source
       | solutions but I just prefer to pay a bit for a managed service to
       | do it.
       | 
       | I think it covers quite a lot of things - the servers are up, DNS
       | is OK, assets are OK. It can also be a safety net in case of
       | other, more sophisticated monitoring fails to detect an unusual
       | state.
       | 
        | This doesn't work well for websites with a lot of JavaScript,
        | ads or widgets.
        
         | radus wrote:
         | What are the OSS solutions for this?
        
       | xrd wrote:
        | I installed Uptime Kuma (https://github.com/louislam/uptime-kuma)
        | on my dokku paas to monitor my dokku apps. It works great for
        | pure HTTP services, but it can also be used against things like
        | RTMP servers because it permits configuring a health check with
        | TCP pings. It sends me an email when things are down, and
        | supports retries, heartbeat intervals, and validating a string
        | in the retrieved HTML. I love it.
        
         | jslakro wrote:
          | I considered this option but then realized that if both sides,
          | the API/services and the uptime checker, live on the same
          | server, any problem impacting the server itself will take the
          | monitoring offline too.
        
       | jewel wrote:
       | Another approach that has been working great for me:
       | https://www.webalert.me. This app runs on your phone, you can
       | configure it to check once an hour if any content on a page
       | changes.
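        | 
        | The underlying change detection can be as simple as hashing the
        | fetched body and comparing against the previous hash (a sketch
        | with illustrative names; noisy pages would need timestamps and
        | ads stripped before hashing):

```python
import hashlib

def content_fingerprint(body):
    """Hash the raw page body; a different hash means the page changed."""
    return hashlib.sha256(body).hexdigest()

def has_changed(previous, body):
    """Compare the page body against the last stored fingerprint.
    Returns (changed, new_fingerprint); the first run, when no
    previous fingerprint exists, never counts as a change."""
    current = content_fingerprint(body)
    if previous is None:
        return False, current
    return current != previous, current
```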
        
       | blondin wrote:
       | have to say, this is exactly what kubernetes was designed to
       | solve. but the focus was on microservices and containers. and
        | things also got out of hand.
        
         | nickjj wrote:
         | > have to say, this is exactly what kubernetes was designed to
         | solve
         | 
          | Kubernetes probes are quite different, in my opinion.
         | 
         | Your Kubernetes liveness check will check if things are working
         | inside of your cluster which is great for a high frequency
         | checkup to potentially modify the state of your pod based on
         | the result.
         | 
         | But Uptime Robot is an end to end test. It tests a real
         | connection over the internet to your domain which exercises
         | external DNS, traffic flowing through any reverse proxies, your
         | SSL certificate, etc..
         | 
          | The two complement each other for different use cases.
        
         | dinvlad wrote:
         | I really wish managed Kubernetes offerings remained "free" for
         | small use, and would only expose "empty" nodes ready for full
         | utilization by end-user containers.
         | 
         | The reality however is that every managed node (like on GKE)
         | uses quite a lot of CPU and memory out of the box, for which
         | the user pays. On top of that there're cluster fees, just for
         | having it around. This makes it completely unfriendly to
         | hobbyist projects, unless one is ready to pay dozens of $s just
         | to have Kubernetes (prior to deploying any apps to it).
         | 
         | (And sure, there're free tiers here and there, but they never
         | solve this problem completely on any of the big cloud
         | providers, at least)
         | 
         | Compare that to managed "serverless" offerings (even pseudo-
         | compatible with K8s API like Cloud Run), which eliminate the
         | management fees, but impose a tax with latency. Oh well.
        
           | epelesis wrote:
           | One reason this is not feasible is that K8s is not designed
           | for secure multitenancy, so for every tenant, you'll need to
           | spin up an entire K8s control plane, which includes a
           | database and several services - this is what's driving the
           | cluster fees. Keep in mind that customers also expect managed
           | K8s to be highly available, so this cost is also going into
           | things like replicating data, setting up load balancers,
           | etc...
           | 
            | Compare this to a serverless offering that is multitenant by
            | design: the control plane is shared, making the overhead
            | cost of an extra user basically zero, which is why they
            | don't charge a fee like this.
           | 
           | IMO if you're a hobbyist interested in K8s, your best way to
           | go is to install K3s, which is a lightweight, API compatible
           | K8s alternative that runs on a single node. It's pretty nice
           | if you don't care about fault tolerance or High Availability.
           | 
           | https://k3s.io/
        
             | dinvlad wrote:
             | I'm not so sure about the economics of what you describe. I
              | think it could very well be that small customers consume
              | so little "bandwidth" that their resource requirements
              | could be subsumed entirely by larger users' capacity. It
             | doesn't make much sense that both large and small customers
             | have to pay the same cluster fee, for example - it would be
             | much more fair to charge more the more you use, and
              | approach "near zero" the less you use it.
             | 
             | At the end of the day, all resources are run by the cloud
             | provider on KVMs sharing the same physical machines
             | anyways, so it's up to them how much to charge. The fact
             | that both small and large customers get to pay for the same
             | amount of resources allocated for them, only means these
             | resources are not allocated in the most efficient manner.
             | So a cloud provider could fix this.
             | 
             | We should also not discount the net positive effect of
             | attracting more hobbyists and startups to your platform.
             | That's how AWS and GCP started, for example, but now
             | they're just focusing on more enterprise business so
             | smaller ones mean less to them (although AWS arguably less
             | so). But we shouldn't forget that while they don't
             | contribute as much to the revenue, they're essentially a
              | free advertising resource that makes your platform stay
             | "relevant" (and especially more so for burgeoning startups
             | that could grow to bring more revenue in the future!). The
             | moment they leave, the platform just becomes another IBM
             | that's bound to die, for better or worse.
             | 
             | On top of that, the anti-analogy with serverless for
             | control plane breaks down, because one could always run it
             | on the same shared pool of resources in gVisor or
             | Firecracker, just like with serverless.
        
       | daverobbins1 wrote:
       | Since everyone is posting their favorite free-tier monitoring
       | products - does anyone have a recommendation for a cloud product
       | that will allow us to create a group of ping monitors and alert
       | only if all monitors in the group are down for N minutes?
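        | 
        | If no product fits, the grouping logic itself is small enough
        | to self-host; a sketch where the class name, the grace window,
        | and the injected check_fn/clock are all assumptions:

```python
import time

class GroupMonitor:
    """Alert only when EVERY target in the group has been down for at
    least `grace` seconds - e.g. both WAN links of a satellite office."""

    def __init__(self, targets, grace=300, clock=time.monotonic):
        self.targets = targets
        self.grace = grace
        self.clock = clock
        self.all_down_since = None  # when the whole group went down

    def observe(self, check_fn):
        """Run one round of checks; return True if an alert should fire."""
        now = self.clock()
        if any(check_fn(target) for target in self.targets):
            self.all_down_since = None  # at least one target is up
            return False
        if self.all_down_since is None:
            self.all_down_since = now   # whole group just went down
        return now - self.all_down_since >= self.grace
```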
        
         | machinerychorus wrote:
         | You could hack that together with huginn pretty easily
         | 
         | https://github.com/huginn/huginn
        
           | zoover2020 wrote:
           | > [...] recommend a cloud product
           | 
           | Hacker mentality never left this site since inception :)
        
         | prakashn27 wrote:
         | I am curious for the use case of it. What group of servers do
         | you want to monitor?
        
           | daverobbins1 wrote:
           | We have dual internet connections coming into a satellite
           | office and we only want to be alerted if both are down.
        
       | bdd wrote:
       | You can get free uptime monitoring from Google Cloud. The limit
        | is 100 uptime checks per monitoring scope, which may mean either
        | a project or an organization depending on how you configure it,
        | IIUC.
        | https://cloud.google.com/monitoring/uptime-checks. The checks
        | are run from 6 locations around the world, so you can also
        | catch network issues - though you likely cannot do much about
        | those when you're running a tiny service. My uptime checks show
        | the probes come
       | from: usa-{virginia,oregon,iowa}, eur-belgium, apac-singapore,
       | sa-brazil-sao_paulo
       | 
       | Another neat monitoring thing I rely on is
       | https://healthchecks.io. Anything that needs to run periodically
       | checks in with the API at the start and the end of execution so
       | you can be sure they are running as they should, on time, and
       | without errors. Its free tier allows 20 checks.
        
         | ydant wrote:
         | healthchecks.io is a great service (and apparently can be self-
         | hosted - https://github.com/healthchecks/healthchecks) that I
         | use for both personal projects and at work.
         | 
          | It works really well for cron jobs - while a single ping
          | works, you can also hit the /start endpoint and then the
          | regular endpoint when finished, and get extra insights such
          | as runtime for your jobs.
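          | 
          | The start/finish pattern looks roughly like this (the ping
          | URL's UUID is a placeholder, and the injectable pinger is an
          | assumption so the wrapper can be exercised without the
          | network):

```python
import urllib.request

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder UUID

def ping(suffix=""):
    """Signal healthchecks.io: "" = success, "/start" = job started,
    "/fail" = job failed. Ping errors are swallowed so a monitoring
    hiccup never breaks the job itself."""
    try:
        urllib.request.urlopen(PING_URL + suffix, timeout=10)
    except OSError:
        pass

def monitored(job, pinger=ping):
    """Wrap a cron job with start/success/fail signals so the service
    can report runtime and catch missed or crashed runs."""
    pinger("/start")
    try:
        result = job()
    except Exception:
        pinger("/fail")
        raise
    pinger("")  # plain ping to the base URL marks success
    return result
```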
         | 
         | It would be nice if it had slightly more complex alerting rules
         | available - for example, a "this service should run
         | successfully at least once every X hours, but is fine to fail
         | multiple times otherwise" type alert.
         | 
         | We wanted to use it for monitoring some periodic downloads
         | (like downloading partners' reports), and the expectation is
         | the call will often time out or fail or have no data to
         | download, which is technically a "failure", but only if it goes
          | on for more than a day. Since healthchecks.io doesn't really
          | support this, we ended up writing our own "stale data"
          | monitoring logic and alerting inside the downloader, and just
          | use healthchecks.io to monitor the script not crashing.
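          | 
          | That kind of "stale data" logic boils down to a dead-man's-
          | switch timestamp check; a sketch with illustrative names:

```python
import time

def should_alert(last_success, max_age_hours, now=None):
    """The "succeed at least once every X hours" rule: individual
    failures are fine; alert only once the newest recorded success
    is older than the allowed window."""
    now = time.time() if now is None else now
    return (now - last_success) > max_age_hours * 3600
```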
        
         | jacooper wrote:
         | What is the interval for the checks ?
         | 
          | It's written that it's 100 per metric scope, but I don't know
          | what that means, really. (2)
         | 
          | Also, there seems to be no status monitor page?
         | 
         | 2- https://cloud.google.com/monitoring/uptime-checks
        
         | yawgmoth wrote:
         | Continuing the tooling thread: the free tier of
         | https://www.uptimetoolbox.com/ is quite good.
        
       ___________________________________________________________________
       (page generated 2022-07-09 23:00 UTC)