Show HN: SadServers - Test your Linux troubleshooting skills

Hello, I'm building SadServers.com, a SaaS where users can test their Linux troubleshooting skills on real Linux servers in a "Capture the Flag" fashion. I hope this is useful. To learn more about the project, please see https://github.com/fduran/sadservers

Author : fduran
Score  : 407 points
Date   : 2022-10-26 14:22 UTC (8 hours ago)
Link   : sadservers.com

BossingAround wrote:
| I'd love to get the actual VM content offline, packaged as Vagrantfiles or Containerfiles. Love the idea though! Go to Pluralsight and pitch it to them :)

fduran wrote:
| A few people have suggested offering the content offline as a Docker image, etc. Good idea, thanks.

computershit wrote:
| I love this idea, and I'll definitely try it out when provisioning for scenario machines is up again. Nice work.

N3Xxus_6 wrote:
| Well, this sucks, I wanted to try it lol. It's timing out for me or throwing an error.

deeblering4 wrote:
| > It's also my not-so-secret hope that a sophisticated enough version of SadServers could be used by tech companies (or for companies that carry on job interviews on their behalf) to automate or facilitate the Linux troubleshooting interview section.
|
| Yup, that's what I was afraid of.

KaiserPro wrote:
| But why? It's a real test that is repeatable, realistic and not _overly_ hard. Sure, for a junior software developer it's a bad fit, but for a DevOps/SRE/sysadmin role it's a great fit.
|
| It's certainly better than some crappy whiteboarding session, or worse, a take-home test.

[deleted]

pvg wrote:
| _Please don't post shallow dismissals, especially of other people's work.
|
| [...]
|
| Please don't pick the most provocative thing in an article or post to complain about in the thread._
|
| https://news.ycombinator.com/newsguidelines.html

fduran wrote:
| That doesn't mean that I'd charge individual users :-)
|
| Heck, I'm not even asking for an email (and I had to do extra session management coding for that).

technofiend wrote:
| The Red Hat Certified System Administrator, Red Hat Certified Engineer and similar tests require practical, general hands-on skills to solve broken systems. The performance tuning and troubleshooting exams go into more detail and more complex scenarios. There's no internet access, but resources are available if you understand how to use them. I would never suggest that people hire solely on those certs, but if someone takes the time to complete 7 hands-on tests for the Red Hat Certified Architect certification, it's a strong indicator they have skills.
|
| Even so, test taking can be stressful, but it's arguably less stressful than actual production support with people waiting on the result. Whether people really want to put candidates in a stressful situation is up to them. SadServers seems like it's somewhere in the middle compared to some of the things I've seen. One job interview put me in a room with a boot CD and an ancient computer with a CD-ROM drive so slow you got exactly one chance to boot the media and recover the system within the time limit. But the job was for a trading company, so if you couldn't handle that, they didn't want you. It was a fun exercise, but would I do that to someone else? Probably not.

lbotos wrote:
| Why are you afraid of this? My org has run a hands-on technical exam with a stack of Linux admin basics (I won't enumerate them here because people do their research), but they are _based on real problems we've had_ and the feedback is overwhelmingly "this was one of the best technical interviews I've ever had."
|
| We ask the engineer who is proctoring the interview to think about the following question: would you want to pair with that engineer again?
|
| If the answer is no, then we probably won't go further, because _pairing with engineers to troubleshoot is what we do every day_.
|
| Some great resumes have died on not knowing how to see what's running on port 80.

joenot443 wrote:
| If you give the person you're interviewing access to the same tools they'd have on a regular day on the job (Google, manpages, etc.), I'd say that's a fair and probably relatively enjoyable interview.
|
| Rejecting someone because they can't recall the correct netstat syntax doesn't seem like good hiring practice, but I assume in good faith that's not what you meant :)

yamtaddle wrote:
| Yeah, I google, tealdeer, "--help", and manpage anything I don't use at least once a week, every time. Usually I don't remember them otherwise, and if I think I do, I don't trust my memory that well. The only exception is if I remember enough to Ctrl+R them out of shell history faster than I can do those things. Actually, some of those I _do_ use often, but I couldn't possibly tell you how, because 99% of the time I only run a couple of commands and always pull them out of history unless it's one of the rare exceptional cases. I couldn't rsync for a particular outcome without consulting a reference to save my life, even though I use it often.
|
| And usually you only use a fairly small set of tools _that_ often, in any job, and which set depends on the employer, how things are set up, and what exactly you're doing.
|
| Oh, and somehow I get "-r" versus "-R" for "recursive" wrong almost every time, even for commands I type almost daily, unless I check first. It's weird. If tools could get on the same damn page about which one means "recursive", that'd be great.
|
| TL;DR I do have a pretty good idea what I'm doing, but I look like an absolute idiot if anyone watches me do it. It's even worse if I _know_ they're watching and we're not in some kind of relatively high-trust relationship (so, definitely not in an interview setting).

lbotos wrote:
| Exactly, all man pages and Google are fair game. We want to see _how they think_, not _rote memorization_.

Multicomp wrote:
| I love this point. Joke: are you hiring?
|
| I'm quite happy to try to demonstrate how I think, but I hate hate hate LeetCode because A) it's not relevant to showing how one thinks and B) I've read so much dunking on it on HN that I now stop interviews when they pull out HackerRank or live code to say 'without using the library, reverse this linked list'.

deeblering4 wrote:
| > Why are you afraid of this?
|
| > My org has run a hands-on technical exam with a stack of Linux admin basics ... they are based on real problems we've had and the feedback is overwhelmingly "this was one of the best technical interviews I've ever had."
|
| You essentially answered your own question.
|
| Putting thought into the interview process and working with candidates through real problems is valuable. I cannot say the same for outsourcing or "automating" this portion of an interview using a third-party SaaS.

mathverse wrote:
| People in higher-up positions like yourself will rarely be subjected to testing with tools like this. You are basically trying to remove the human from the equation and industrialize the whole process.

rednerrus wrote:
| What we're trying to do is respect people's time. We can learn more about someone's technical understanding in 30 minutes of hands-on exercises than we can in a full day of panel interviews.
| It's better for us, as we get a much better understanding of where you're at Linux-wise, and it's better for you because you only need to come to two hours of interviews, total. Seems like a win-win to me.

deeblering4 wrote:
| Framing a question like "a system has a high load average, what commands would you use to begin diagnosing that?" and taking that conversation as deep as the candidate can go is not time-consuming and doesn't require a panel of people.

mike_d wrote:
| In my experience this type of interview (and coding interviews in general) usually falls into one of two categories: 1) "I learned this neat trick and want to show candidates how smart I am" or 2) "I have this bug in prod and I want to see if you can fix it for me."
|
| If the interview was along the lines of upgrading the packages on the system, debugging why nginx was crashing, figuring out the specs of the system, etc., that is totally fine with me and, I believe, respectful of a candidate's time. Unfortunately it always turns into something else when people need to come up with new "challenges" for candidates.

deathanatos wrote:
| No, I'm trying to make sure the person who is interviewing for a job where they will deal with computers on a daily basis appears to have seen a computer at some prior point in their life.
|
| I wouldn't feel the need to do this if so many candidates didn't fail rudimentary tests. A SWE candidate MUST be able to write the function min(), in the language and tooling of their choice. But in an interview, a sizable fraction cannot. (The actual bar is far higher than min(), of course, but min() _ought to be trivial_.)

deathanatos wrote:
| Yeah, we did this at a previous employer.
|
| One example: we had them SSH in, then download and extract a tarball (the Linux source, but the content doesn't matter). Sometimes, they'd gunzip to stdout.
| The reaction tells you a lot. "lol _whoopsie_" followed by a quick fix: the person knows what they're doing. "uh... what is going on? did I break it?" followed by general cluelessness... maybe not.
|
| That did occasionally break tmux, though.
|
| Part of it was "what are the specs of this thing you're SSH'd into?", and we had one candidate who was _adamant_ the numbers must be wrong: 2 GiB is too little RAM, no machine is that small! Yeah, we didn't spin up a 128 GiB VM for your interview...

Volundr wrote:
| I never cease to be amazed at how few people realize just how little hardware is often required to get real work done. You'd be surprised how much that 2 GB VM with a couple of cores can handle!

sorongopowa wrote:
| I started with a single 1xx MHz core and 16 MB of RAM. And I'm sure some started with even less, lol.
|
| Supporting your point: hardware is awesome if you use it wisely.

icedchai wrote:
| My first Linux box was a 20 MHz 386SX laptop with 3 megs of RAM (1 meg on the motherboard, 2 in an expansion). I could barely run Linux 0.99.x. The distro was SLS, and it came on 12 or so floppy disks. I quickly upgraded to a 486 with 8 megs of RAM, then 20... which seemed incredible at the time (1994-ish).
|
| It's amazing how bloated today's software is...

rednerrus wrote:
| We do this in our org as well. 30 minutes of troubleshooting Linux issues is a good way to evaluate a candidate's experience. We run it as a team exercise with the candidate, so we also get the added bonus of seeing how they work in a team setting, how they communicate, etc.

aliqot wrote:
| I knew this is where it was headed :/

Nextgrid wrote:
| Is it bad though? The problem with LeetCode is that it's an extremely unrealistic test. This, on the other hand, seems like it actually tests real-world scenarios, and you can get there without grinding.
| I'm pretty sure I could pass all the tests they've currently got despite having no formal sysadmin experience, just using common developer knowledge, common sense and strategic Google-fu.

andrewmcwatters wrote:
| My only feedback is that this is unrealistic, because today developers wouldn't try to debug something; they'd just destroy the instance, push a commit, hope it fixed something infra-related, then recreate it.
|
| Why would you need to understand how something works? Just use containers. /s

[deleted]

vsareto wrote:
| Developers just need to understand everything because we need developers to do everything and meet all deadlines. We wouldn't dare consider a support role that could troubleshoot it, because then there would be no point to having developers that can do everything! /s

cube00 wrote:
| Support doesn't deliver features, we need new features! /s

grepLeigh wrote:
| If most developers can't debug a VM, then anyone who can will be able to charge a premium. If you're proficient in ops, remember that the next time you negotiate a compensation package.
|
| [Edited my compensation numbers to avoid downvotes - yikes]

andrewmcwatters wrote:
| I feel like you definitely have to target particular companies, and more specifically particular titles, and offer specific skills to do so.
|
| My guess is that trying to sell high-end services as a "principal software engineer" isn't going to be enough to justify that cash comp to a lot of people hiring.

grepLeigh wrote:
| I wouldn't think of it as trying to sell yourself as a "principal software engineer" on an open market.
|
| I'd make a list of the companies where hiring/scaling the ops team will make or break the business's value delivery, and filter by companies _aware_ of this.
|
| You can knock this out at the recruiting step, just by asking about open developer headcount vs. open SRE/ops headcount.
| Ask which direction that ratio seems to be going, and whether there's anyone you can talk to whose job it is to change that ratio (director or VP mandate).
|
| The referral network from working in ops at a hyperscaler company is a great way to break into the space.

andrewmcwatters wrote:
| Thanks for the heads up!

sshd wrote:
| This is so sad but so true!

edmcnulty101 wrote:
| If it's dumb and it works, it's not dumb.

10g1k wrote:
| "Have you turned it off and on again?"

hotpotamus wrote:
| Are you familiar with Trueability? https://www.trueability.com/
|
| It seems like a similar SaaS.

fduran wrote:
| Didn't know about this one. There are quite a few lab/sandbox SaaS products, but what I've seen so far is that they are more for training with a "follow the recipe" model (do this, do that to configure something) rather than "this (real) server is broken, fix it (with possibly different solutions)", which IMHO is more real-life and useful.

hotpotamus wrote:
| I believe the company was founded by some coworkers of mine from way back when at Rackspace, who often interviewed Linux admins with a lab VM. I assume they just automated the setup and spun it off as their own business. At least that's what happened as far as I can tell; I didn't know the parties involved.

jer0me wrote:
| New challenge: Fix SadServers' sad servers

Pr0ject217 wrote:
| Cool!

imwillofficial wrote:
| This is badass, just what I need!

dugmartin wrote:
| I'd suggest integrating https://bellard.org/jslinux/ and running the VM in the browser if you can; then you can scale without running out of resources.

m00dy wrote:
| Or a Linux kernel port to WebAssembly.

fduran wrote:
| Thanks, I've been looking at WASM, for example https://github.com/snaplet/postgres-wasm/tree/main/packages/... It would certainly simplify everything to "download a fat file".

jodrellblank wrote:
| Have you seen https://copy.sh/v86/ ?
| It doesn't run as fast as jslinux, but it's BSD-licensed, on GitHub, and supports resuming the VM from a snapshot.
|
| https://github.com/copy/v86

fduran wrote:
| Didn't know about this, thanks!

DeathArrow wrote:
| > Practice for your next SRE/DevOps interview.
|
| Are SREs and DevOps tasked with administration of operating systems?

jen_h wrote:
| Yeah. Random data point: one of my favorite SRE interviews ever (serious fun!) involved hands-on troubleshooting that eventually required gdb.

asmr wrote:
| Both SRE and DevOps are essentially evolved sysadmin roles. The DevOps philosophy is cross-functional, and many sysadmins have adopted a DevOps approach. The latest edition of the classic sysadmin book "The Practice of System and Network Administration" is now centered around DevOps.

KaiserPro wrote:
| > Are SREs and DevOps tasked with administration of operating systems?
|
| Yes, eventually.
|
| You can dress it up in all the fancy terms you like, but DevOps and SREs are sysadmins with better PR.
|
| It's critical that SREs understand _how_ to debug a system, so that they can work out how to put in fixes and/or design better systems.

dsr_ wrote:
| If you have ops somewhere in your responsibilities, then yes.

jabroni_salad wrote:
| Depends on what layer the issue is happening at. I know everyone thinks the OS has been abstracted away, but my ticket queue says otherwise. "yaml engineering" is just a control surface; I still need to pop the hood often.

BossingAround wrote:
| How do you automate something you can't do manually?

PanosJee wrote:
| Hack The Box -> Fix The Box

Timja wrote:
| The idea is really cool, but all I see is "Waiting for server..." and nothing happens.

kiyundai wrote:
| That's the trick, you failed the first challenge: "Did you try to turn it off and on again?"

apawloski wrote:
| Based on your architecture diagram, it looks like you're spinning up an instance per user?
| As you're probably finding now, you will hit AWS limits quickly.
|
| You might instead want to have a smaller pool of (larger) servers that you run co-resident VMs on with https://firecracker-microvm.github.io/. That will avoid account limits and also keep your AWS costs more predictable.

fduran wrote:
| Yes, thanks!

temp0826 wrote:
| I haven't fully grokked this yet, but one trick I've used in the past to get around limits is AWS Organizations, creating a sub-account per property. It's a bit more setup, but it can keep things cleaner administratively.

icedchai wrote:
| AWS will raise limits if you ask. Increasing EC2 instance limits is usually a quick turnaround.

andrewstuart2 wrote:
| At least in the tests I've done for a small startup recently, they've also implemented some automatic quota increases for EC2. I ran commands that would have exceeded (or did exceed) my quota, and a few minutes later got an email saying my quotas had been bumped.

ericbarrett wrote:
| Yes, the default limits are there to prevent abuse and runaway misconfigurations. They won't turn down revenue if you confirm it's intentional.

yamtaddle wrote:
| Just run them in Linux VMs with WASM, in the users' browsers. Make them all pay for it with higher utility bills and greater wear & tear on their hardware.
|
| _trollface.jpg_

freeone3000 wrote:
| This is actually a good idea for this: the user wants the education, so they can pay for it with their own hardware. Keep your costs low!

cogman10 wrote:
| Probably a better experience for everyone. You just have to distribute the image (rather than running VMs), and the user gets instantaneous responses.

BossingAround wrote:
| Why not spin up containers instead of VMs? Seems to me containers would fit much better than VMs.

cogman10 wrote:
| Bypassing container security is easier than bypassing VM security.
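
apawloski's suggestion (a fixed pool of larger hosts running co-resident microVMs, instead of one cloud instance per user) mostly comes down to slot bookkeeping. The Python sketch below models only that bookkeeping. It does not start Firecracker or call any AWS API. The names Host and HostPool are hypothetical, and so are the sizing values: 1 GiB per VM, 8 GiB held back for each host, and a 15-minute scenario lifetime. None of this describes how SadServers is actually built.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Host:
    """One large machine that runs many small co-resident VMs (hypothetical)."""
    name: str
    ram_gib: int
    # session id -> expiry timestamp, measured on the pool's clock
    sessions: dict = field(default_factory=dict)


class HostPool:
    """Toy placement bookkeeping: sessions land on a fixed set of hosts
    instead of each session launching its own cloud instance."""

    def __init__(self, hosts, vm_ram_gib=1, reserved_gib=8, ttl_s=15 * 60,
                 clock=time.monotonic):
        self.hosts = list(hosts)
        self.vm_ram_gib = vm_ram_gib      # assumed RAM per scenario VM
        self.reserved_gib = reserved_gib  # assumed RAM kept back for the host OS
        self.ttl_s = ttl_s                # assumed scenario lifetime in seconds
        self.clock = clock

    def slots(self, host):
        """How many VMs fit on a host under the RAM assumptions."""
        return max(0, (host.ram_gib - self.reserved_gib) // self.vm_ram_gib)

    def _expire(self, now):
        """Free slots whose scenario lifetime has run out."""
        for host in self.hosts:
            for sid in [s for s, until in host.sessions.items() if until <= now]:
                del host.sessions[sid]

    def allocate(self, session_id):
        """Place a session on the least-loaded host; None means the pool is full."""
        now = self.clock()
        self._expire(now)
        for host in sorted(self.hosts, key=lambda h: len(h.sessions)):
            if len(host.sessions) < self.slots(host):
                host.sessions[session_id] = now + self.ttl_s
                return host.name
        return None

    def release(self, session_id):
        """Free a slot early, e.g. when the user solves or abandons the scenario."""
        for host in self.hosts:
            host.sessions.pop(session_id, None)
```

With one 128 GiB host, these assumptions give (128 - 8) // 1 = 120 slots. The 121st request gets None, which a front end could turn into a "try again later" message instead of a failed RunInstances call. Capacity then grows by adding hosts rather than by launching an instance per visitor, so the vCPU quota is consumed per host. Shortening ttl_s frees slots faster, at the cost of shorter sessions.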

tamrix wrote:
| Then wouldn't that be the ultimate test ;)

spiffytech wrote:
| Containers have a history of escape vulnerabilities, for reasons like sharing a kernel with the host and other containers.
|
| VMs are designed from the ground up to isolate guests, rather than focusing on application deployment.
|
| Firecracker is the modern container alternative in untrusted compute scenarios, with Fly.io even converting container images into Firecracker VMs.

NovemberWhiskey wrote:
| > _Containers have a history of escape vulnerabilities_
|
| Generally agreed, but for this use case, do we care?

ilyt wrote:
| That's kind of a nice use case for the WASM machine/Linux emulators; then you just need to provide an image and the user can run it in the browser.
|
| > You might instead want to have a smaller pool of (larger) servers that you run co-resident VMs on with https://firecracker-microvm.github.io/. That will avoid account limits and also keep your AWS costs more predictable.
|
| I'd imagine (still waiting for it to load lmao) most of it could be containers too.

twalla wrote:
| Someone else linked https://github.com/copy/v86, which seems really neat.
|
| I like making jokes with coworkers about implementing this or that bit of infra with WASM-based tools, mostly to get a rise out of them. But each time I make the joke I look into some of the tools or projects, and the balance of joke to "I'm actually serious" shifts a little bit to the right.

lagrange77 wrote:
| Really cool idea.
|
| After choosing a problem, the endpoint you poll at https://sadservers.com/celery-progress/xxxx repeatedly returns {pending: true, current: 0, total: 100, percent: 0} for me.

b20000 wrote:
| Did you read up on the problems with LeetCode?

fduran wrote:
| Hi, I'm not sure what the question means. I came up with the scenarios rather than copying them from LeetCode, if that's what you mean.
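
lagrange77's report describes a progress endpoint that keeps answering {pending: true, current: 0, total: 100, percent: 0}. Timja describes a page stuck on "Waiting for server...". A client-side poll loop with a deadline and back-off would at least turn that into a visible error instead of an endless wait. The sketch below is hypothetical. The function name wait_for_progress is illustrative, and the sketch assumes the endpoint returns that JSON shape and that "percent" reaches 100 once the server is ready. It is not the actual SadServers front-end code.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout_s=5.0):
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def wait_for_progress(url, deadline_s=120.0, interval_s=1.0, max_interval_s=10.0,
                      fetch=fetch_json, sleep=time.sleep, clock=time.monotonic):
    """Poll a progress endpoint until it reports 100 percent.

    Assumes payloads shaped like {"pending": ..., "current": ..., "total": ...,
    "percent": ...}. Raises TimeoutError if the deadline passes first.
    """
    start = clock()
    while True:
        payload = fetch(url)
        if payload.get("percent", 0) >= 100:
            return payload
        if clock() - start >= deadline_s:
            raise TimeoutError(f"still not ready after {deadline_s}s: {payload}")
        sleep(interval_s)
        # Exponential back-off, capped, so a stuck job is not hammered.
        interval_s = min(interval_s * 2, max_interval_s)
```

The fetch, sleep and clock parameters exist so the loop can be exercised without a network or real waiting. In normal use only the URL would be passed. The back-off cap keeps a stuck task from being polled every second for the whole deadline.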

pxc wrote:
| I think they mean 'are you aware of the limitations of Leetcode-like tests and the downsides of their (over)use in hiring processes?'
|
| (FWIW I think this is a very cool and fun educational project regardless of what usefulness it might or might not have in IT hiring decisions, and I'm looking forward to playing with it)

vermon wrote:
| Seems like it's out of capacity:
|
|     An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
|
| Maybe something like https://leaningtech.com/webvm-server-less-x86-virtual-machin... would be cheaper and more reliable for this kind of thing?

fduran wrote:
| Yes, HN effect lol-sob.
|
| Mitigation: temporarily reducing server lifetime so more people can try.

warent wrote:
| Usually I roll my eyes when someone posts their own website to HN and it crashes under load. But given the nature and complexity of yours, I think there's room for understanding and patience :)

fduran wrote:
| Thanks. I did some stress-testing and the infra is scalable enough, but I forgot about the AWS quotas, my bad. I've requested a quota increase and servers are being killed off, so hopefully the issue will go away "soon".

Nextgrid wrote:
| Scaling this service without breaking the bank could become its own "sad server" scenario.
|
| I'd start by moving the test VMs to bare-metal servers running libvirt. You can get a 128 GB RAM server for ~110 EUR, and that should be able to run around 120 concurrent VMs assuming 1 GB of RAM for each (CPU isn't a major issue in this case).

mewse-hn wrote:
| I completed the first challenge and it was a lot of fun. _Spoiler:_ I'd never had to use the 'lsof' command before.

grepLeigh wrote:
| Very cool!
| This reminds me of the ops challenge at Slack. I'm not sure if they still do this, but the SRE/platform infra interview used to involve a VM running a malfunctioning LAMP stack.
|
| You'd get SSH access to the VM, then submit a diagnostic report of what was broken (and how you fixed it).
|
| It also reminded me of how Red Hat used to run their certification test (RHCE). I probably still have the live CDs for my RHCE lying around somewhere.

stevekemp wrote:
| I've had interviews like that in the past, and really enjoyed them. Much better than "Draw an architecture diagram for how you'd handle a serverless IoT application", where you lose points, silently, because you didn't pick something the interviewer expected you to.
|
| Usually a simple combination of immutable files, SELinux policies, and typos in configuration files was enough for most of the challenges. Though now and again you'd find they'd given you a server with packages removed, or not yet installed.

fduran wrote:
| Oh, that reminds me, I loved the original Stripe CTF. It's been 10 years already! https://twitter.com/fduran/status/240321390698442753

yubiox wrote:
| I can't get to the first problem because of the HN hug, but anyway there are fake ways to "solve" it, like renaming the log file (the check they use to decide it's solved is provided).

Timja wrote:
| Depends on how the broken program writes to the log.
|
| If it does
|
|     while true; do echo hello >> bad.log; done
|
| then renaming bad.log will not solve the challenge, because each >> redirection reopens bad.log by name and recreates it.

teddyh wrote:
| Replace it with a symlink to /dev/null! Or /dev/full if we feel like it.
|
| (Yes, these are bad solutions, since the instructions explicitly said to stop the process that is writing.)

fduran wrote:
| There are ways to cheat, but they're not that simple: a script checks for the solution, and a hash of that script is checked for modifications.

BossingAround wrote:
| This is a self-test, not a certification.
| The goal is not to defeat the verification, but to learn something. So yeah, it's perfectly acceptable that the tests are not bulletproof.

bm-rf wrote:
| I'm assuming you're spinning up an EC2 instance for each lab. What do you think about using pre-built Docker images for each challenge instead? That way they can spin up in just a couple of seconds. Might also be cheaper?

clvx wrote:
| Probably LXD would be better.

bravetraveler wrote:
| Not a bad idea, but something to consider: this limits the options for kernel-level things quite considerably.

fduran wrote:
| I wanted to do full VMs rather than Docker images, but yes, I could do Docker images, or dedicated big instances with VMs on top like somebody else is suggesting.

bravetraveler wrote:
| Commenting so I can give this a try later; I've routinely been the person these kinds of gremlins get escalated to.
|
| I've long wanted some sort of mock "things are broken, I want to see how you think" approach for sysadmins.

shagie wrote:
| In the "tricks of Hacker News" department:
|
|     188 points by fduran 3 hours ago | unvote | flag | hide | past | favorite | 68 comments
|
| If you click 'favorite', it will save the story to your favorites list. This is a publicly visible list (yours is https://news.ycombinator.com/favorites?id=bravetraveler and mine is https://news.ycombinator.com/favorites?id=shagie), which makes it easy to get bookmark-style functionality within HN.
|
| As I tend to favorite less often than I comment, it makes it easier to find the things I want to find again.

bravetraveler wrote:
| Much appreciated! I'm woeful about not using features like this; it's a character flaw at this point.
|
| The HN interface also tends to just have my eyes filter out those links... but that's no defense.
|
| Especially good to know that it's publicly viewable!
|
| Not that I'm particularly worried about being outed by anything I favorite here; it's just good to be mindful of the data we make and where it goes.

(page generated 2022-10-26 23:00 UTC)