[HN Gopher] Notes on Theory of Distributed Systems [pdf]
       ___________________________________________________________________
        
       Notes on Theory of Distributed Systems [pdf]
        
       Author : htfy96
       Score  : 68 points
       Date   : 2022-08-24 19:51 UTC (3 hours ago)
        
 (HTM) web link (www.cs.yale.edu)
 (TXT) w3m dump (www.cs.yale.edu)
        
       | infogulch wrote:
       | 15 pages of just TOC. 400+ pages of content
       | 
       | > These are notes for the Fall 2022 semester version of the Yale
       | course CPSC 465/565 Theory of Distributed Systems
       | 
       | There are a lot of algorithms, but I don't see CRDTs mentioned by
       | name. Perhaps it's most closely related to "19.3 Faster snapshots
       | using lattice agreement"?
        
         | dragontamer wrote:
         | > CRDTs
         | 
         | Wrong level of abstraction. This is clearly a lower level
         | course than that and discusses more fundamental ideas.
         | 
          | A quick look through chapter 6 does remind me of CRDTs, at
          | least the vector clock concept. Bits from other parts of the
          | course would probably need to be combined to build what is
          | usually called a CRDT.
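          | 
          | As a rough illustration of the vector clock idea (my own
          | minimal sketch, not code from the notes): each process keeps
          | one counter per process, bumps its own on local events, and
          | merges by element-wise max when it receives a message.
          | 
          |     # Toy vector clock for n processes; purely illustrative.
          |     def fresh(n):
          |         return [0] * n
          | 
          |     def tick(clock, i):
          |         # local event at process i
          |         c = list(clock)
          |         c[i] += 1
          |         return c
          | 
          |     def merge(a, b):
          |         # element-wise max, applied on message receipt
          |         return [max(x, y) for x, y in zip(a, b)]
          | 
          |     def happened_before(a, b):
          |         # a causally precedes b
          |         return all(x <= y for x, y in zip(a, b)) and a != b
          | 
          | That element-wise max is the same "join" operation a state-
          | based CRDT leans on, which is probably why chapter 6 feels
          | familiar.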
        
       | phtrivier wrote:
        | Is anyone teaching "Practice of boring Distributed Systems 101
        | for dummies on a budget with a tight schedule"?
       | 
       | As in, "we have a PHP monolith used by all of 12 people in the
       | accounting department, and for some reason we've been tasked with
       | making it run on multiple machines ("for redundancy" or
       | something) by next month.
       | 
        | The original developers left to start a Bitcoin scam.
       | 
        | Some exec read about the "cloud", but we'll probably get just
        | enough budget to buy a coffee for an AWS salesman.
       | 
       | Don't even dream of hiring a "DevOps" to deploy a kubernetes
       | cluster to orchestrate anything. Don't dream of hiring anyone,
       | actually. Or, paying anything, for that matter.
       | 
        | You had one machine; here is a second machine. That's a 100%
        | increase in your budget, now go get us some value with that!
       | 
       | And don't come back in three months to ask for another budget to
       | 'upgrade'."
       | 
        | Where would someone start?
       | 
        | (EDIT: To clarify, this is a tongue-in-cheek, hyperbolic
        | scenario, not a cry for immediate help. Thanks to all who
        | offered help ;)
        | 
        | Yet I'm curious about any resource on how to attack such
        | problems, because I can only find material on how to handle
        | large-scale, multi-million-user, high-availability stuff.)
        
         | keule wrote:
         | > As in, "we have a PHP monolith used by all of 12 people in
         | the accounting department, and for some reason we've been
         | tasked with making it run on multiple machines ("for
         | redundancy" or something) by next month.
         | 
         | Usually, your monolith has these components: a web server
         | (apache/nginx + php), a database, and other custom tooling.
         | 
          | > Where would someone start?
         | 
         | I think a first step is to move the database to something
         | managed, like AWS RDS or Azure Managed Databases. Herein lies
         | the basis for scaling out your web tier later. And here you
         | will find the most pain because there are likely: custom backup
         | scripts, cron jobs, and other tools that access the DB in
         | unforeseen ways.
         | 
          | If you get over that hump, you have taken the first big step
          | towards a more robust model. Your DB will have automated
          | backups, managed updates, failover, read replicas, etc. You
          | may or may not see a performance increase, even though you are
          | now effectively splitting your workload across two machines.
         | 
          | _THEN_ you can front your web tier with a load balancer, i.e.
          | you load balance to one machine. This gives you: better
          | networking, custom error pages, support for sticky sessions
          | (you will likely need them later), and better/more monitoring.
         | 
          | From there on you can start removing those custom scripts from
          | the web-tier machine and splitting this into an _actual_ load-
          | balanced infrastructure, going to two web-tier machines, where
          | traffic is routed using sticky sessions.
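          | 
          | To make the sticky-session part concrete, here is a toy sketch
          | of what "affinity" means (backend addresses and the session id
          | are made up; in practice the load balancer does this for you
          | via a cookie or ip-hash setting):
          | 
          |     import hashlib
          | 
          |     BACKENDS = ["10.0.0.11:80", "10.0.0.12:80"]  # web tier
          | 
          |     def pick_backend(session_id: str) -> str:
          |         # Hash the session cookie so the same session always
          |         # lands on the same machine; PHP file-based sessions
          |         # only exist on the box that created them.
          |         digest = hashlib.sha1(session_id.encode()).hexdigest()
          |         return BACKENDS[int(digest, 16) % len(BACKENDS)]
          | 
          |     print(pick_backend("sess-abc123"))  # deterministic choice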
         | 
         | Depending on the application design you can start introducing
         | containers.
         | 
         | Now, this approach will not give you a _cloud-native awesome
         | microservice architecture_ with CI/CD and devops. But it will
         | be enough to have higher availability and more robust handling
         | of the (predictable) load in the near future. And on the way,
         | you will remove bad patterns that eventually allow you to go to
         | a better approach.
         | 
         | I would be interested in hearing if more people face this
         | challenge. I don't know if guides exist around this on the
         | webs.
        
         | qntty wrote:
          | Sounds like you could be looking for something like VMware
          | vSphere, if primary-backup replication is what you want.
        
         | throwaway787544 wrote:
         | If someone would pay for it I'd write that book. There are lots
         | of different methods for different scenarios. There are some
         | books on it but they're either very dry and technical or have
         | very few examples.
         | 
         | Here's the cliffs notes version for your situation:
         | 
         | 1. Build a server. Make an image/snapshot of it.
         | 
         | 2. Build a second server from the snapshot.
         | 
          | 3. Use rsync to copy the files your PHP app writes from one
          | machine ('primary') to another ('secondary'); a small sketch
          | of this follows the list.
         | 
         | 4. To make a "safe" change, change the secondary server, test
         | it.
         | 
         | 5. To "deploy" the change, snapshot the secondary, build a new
         | third server, stop writes on the primary, sync over the files
         | to the third server one last time, point the primary hostname
         | at the third server IP, test this new primary server, destroy
         | the old primary server.
         | 
          | 6. If you ever need to "roll back" a change, you can do that
          | while there are still three servers up (blue/green), or deploy
          | a new server with the last working snapshot.
         | 
          | 7. Set up PagerDuty to wake you up if the primary dies. When it
          | does, point the primary's hostname at the IP of the second box.
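          | 
          | A minimal sketch of the sync in step 3 (the paths and the
          | 'secondary' host name are made up; point them at whatever your
          | PHP app actually writes):
          | 
          |     import subprocess
          | 
          |     SRC = "/var/www/app/uploads/"   # writable dir on primary
          |     DEST = "deploy@secondary:/var/www/app/uploads/"
          | 
          |     def sync():
          |         # -a preserves permissions/times, -z compresses,
          |         # --delete mirrors deletions; transport is ssh.
          |         subprocess.run(
          |             ["rsync", "-az", "--delete", "-e", "ssh", SRC, DEST],
          |             check=True,
          |         )
          | 
          |     if __name__ == "__main__":
          |         sync()
          | 
          | Run it from cron on the primary every few minutes; a one-line
          | rsync entry in the crontab does the same job.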
         | 
          | That's just one very simple way to do it. It is a redundant
          | active/passive distributed system with redundant storage and
          | immutable blue/green deployments. It can be considered high-
          | availability, although that term is somewhat loaded; ideally
          | you'd make as much of the system HA as possible: independent
          | network connections to the backbone, independent power drops,
          | UPSes, etc. (both for bare metal and VMs).
         | 
         | You can get much more complicated but that's good enough for
         | what they want (redundancy) and it buys you a lot of other
         | benefits.
        
         | fredsmith219 wrote:
          | I can't believe that 12 people would actually be stressing the
          | system. Could you meet the requirements of the project by
          | setting up the second machine as a hot backup at an offsite
          | location?
        
           | phtrivier wrote:
            | Maybe. How do I find the O'Reilly book that explains that?
            | And the petty details about knowing the first one is down and
            | starting the backup? And just enough data replication to
            | actually have some data on the second machine? Etc, etc...
           | 
            | My pet peeve with distributed and ops books is that they
            | usually start by laying out all those problems, but then move
            | on to either:
            | 
            | - explain how Big Tech has even bigger problems, before
            | explaining how you can fix Big Tech problems with Big Tech
            | budgets and headcount by deploying just one more layer of
            | distributed cache or queue that virtually ensures your app is
            | never going to work again (that's "Designing Data-Intensive
            | Applications", in bad faith.)
            | 
            | - or, not really explain anything, wave their hands chanting
            | "trade-offs, trade-offs", and start telling kids' stories
            | about Byzantine Generals.
        
             | EddySchauHai wrote:
             | What you're describing there sounds like general Linux
             | sysadmin to me?
        
               | phtrivier wrote:
                | Not entirely, I would argue, if you look at it from the
                | application developer's point of view.
               | 
               | You have to adapt parts of your app to handle the fact
               | that two machines might be handling the service (either
               | at the same time, or in succession.)
               | 
                | This has an impact on how you use memory, how you persist
                | stuff, etc...
               | 
               | None of which is rocket science, probably - but even
               | things that look "obvious" to lots of people get their
               | O'Reilly books, so...
               | 
               | But you're right that a part of the "distribution" of a
               | system is in the hands of ops more than devs.
        
               | EddySchauHai wrote:
               | I guess it's just experience to be honest. It happens
               | rarely, you might be lucky enough to be involved with
               | solving it, and then you focus on the important parts of
               | the project again. I've only worked in startups so don't
               | know about the 'Big Tech' solutions but a little
               | knowledge of general linux sysadmin, containers, and
               | queues has yet to block me :) Once the company is big
                | enough to need some complexity beyond that, I assume
                | there's enough money to hire someone to come in and put
                | everything into CNCF's 1000-layer tech stack.
               | 
               | Edit: Thinking on this, if I want to scale something it'd
               | be specific to the problem I'm having so some sort of
               | debugging process like https://netflixtechblog.com/linux-
               | performance-analysis-in-60... to find the root cause
               | would be generic advice. Then you can decide to scale
               | vertically/horizontally/refactor to solve the problem and
               | move on.
        
             | lmwnshn wrote:
             | More entertainment than how-to guide, and oriented more
             | towards developers than ops, but if you haven't read
             | "Scalability! But at what COST?" [0], I think you'll enjoy
             | it.
             | 
             | [0] https://www.frankmcsherry.org/graph/scalability/cost/20
             | 15/01...
        
             | arinlen wrote:
              | > _explain how Big Tech has even bigger problems, before
              | explaining how you can fix Big Tech problems with Big Tech
              | budgets and headcount (...)_
             | 
              | What do you have to say about the fact that the career
              | goal of those interested in this sort of topic is... to be
              | counted as part of the headcount of these Big Tech
              | companies while getting paid Big Tech salaries?
             | 
             | Because if you count yourself among those interested in the
             | topic, that's precisely the type of stuff you're eager to
             | learn so that you're in a better position to address those
             | problems.
             | 
             | What's your answer to that? Continue writing "hello world"
             | services with Spring Initializr because that's all you
             | need?
        
               | phtrivier wrote:
               | > Because if you count yourself among those interested in
               | the topic, that's precisely the type of stuff you're
               | eager to learn so that you're in a better position to
               | address those problems.
               | 
                | People will work on problems of different scales in a
                | career; will you agree that different scales of problem
                | call for different techniques?
               | 
                | I have no problem with FAANGs documenting how to fix
                | FAANG issues!
                | 
                | I'm a little bit concerned about FAANG-dev wannabes
                | applying the same techniques to non-FAANG issues, though,
                | for lack of training resources about the "not trivial but
                | a bit boring" techniques.
               | 
                | Your insight about the budget/salaries makes sense,
                | though: a book about "building your first boring IT
                | project right" is definitely not going to be a best
                | seller anytime soon :D!
        
               | enumjorge wrote:
               | Nothing wrong with having those aspirations, but sounds
               | like the parent commenter has non-Big-Tech-sized problems
               | he needs to solve now.
        
         | slt2021 wrote:
          | Distributed systems are usually for millions of users, not 12
          | users.
          | 
          | For your problem, you can start by configuring nginx to work
          | as a load balancer and spinning up a 2nd VM with the PHP app.
        
           | phtrivier wrote:
           | "But what if _the_ machine goes down ? What if it goes down
           | _during quarter earnings legally requested reporting
           | consolidation period_ ? We need _redundancy_ !!"
           | 
            | Also, philosophically, I guess, a "distributed" system
            | starts at "two machines". (And you actually get most of the
            | "fun" of distributed systems with "two processes on the same
            | machine".)
            | 
            | We're taught how to deal with "N=1" in school, and "N=all
            | fans of Taylor Swift in the same second" at FAANGs.
            | 
            | Yet I suspect most people will be working on "N=12, 5 hours a
            | day during office hours, except twice a year." And I'm not
            | sure what the reference techniques for that are.
        
             | arinlen wrote:
              | > _Also, philosophically, I guess, a "distributed" system
              | starts at "two machines"._
             | 
             | People opening a page in a browser that sends requests to a
             | server is already a distributed system.
             | 
             | A monolith sending requests to a database instance is
             | already a distributed system.
             | 
             | Having a metrics sidecar running along your monolith is
             | already a distributed system.
        
               | phtrivier wrote:
               | > A monolith sending requests to a database instance is
               | already a distributed system.
               | 
               | True, of course.
               | 
                | And even a simple setup like this brings in
                | "distribution" issues for the app developer:
               | 
                | When do you connect? When do you reconnect?
                | 
                | Where do you get your connection credentials from?
                | 
                | What should happen when those credentials have to change?
                | 
                | Do you ever decide to connect to a backup DB?
                | 
                | Do you ever switch your application logic to a mode where
                | you know the DB is down, but you still try to work
                | without it anyway?
               | 
               | Etc..
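                | 
                | For the "when do you reconnect?" question, the usual
                | answer is retrying with backoff. A minimal sketch, where
                | connect_fn stands in for whatever your DB driver's
                | connect call is:
                | 
                |     import time
                | 
                |     def connect_with_retry(connect_fn, attempts=5,
                |                            base_delay=0.5):
                |         # Try to connect, waiting longer after each
                |         # failure (exponential backoff).
                |         for i in range(attempts):
                |             try:
                |                 return connect_fn()
                |             except Exception:
                |                 # narrow this to your driver's
                |                 # connection-error class
                |                 if i == attempts - 1:
                |                     raise
                |                 time.sleep(base_delay * 2 ** i)
                | 
                | That only covers one of the questions; the credentials
                | and degraded-mode ones are policy decisions as much as
                | code.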
               | 
                | Those examples are specific to DBs, but in a distributed
                | system any other service brings in the same questions.
               | 
                | With experience you get opinions and intuitions about how
                | to attack each issue; my question is still: "if you had
                | to point a newcomer to some reference/book about those
                | questions, where would you point them?"
        
           | random_coder wrote:
           | It's a joke.
        
         | arinlen wrote:
         | > _As in, "we have a PHP monolith used by all of 12 people in
         | the accounting department, and for some reason we've been
         | tasked with making it run on multiple machines ("for
         | redundancy" or something) by next month._
         | 
         | I find this comment highly ignorant. The need to deploy a
         | distributed system is not always tied to performance or
         | scalability or reliability.
         | 
         | Sometimes all it takes is having to reuse a system developed by
         | a third party, or consume an API.
         | 
         | Do you believe you'll always have the luxury of having a single
         | process working on a single machine that does zero
         | communication over a network?
         | 
          | Hell, even a SPA calling your backend is a distributed system.
          | Is this not a terribly common use case?
         | 
          | Enough about these ignorant comments. They add nothing to the
          | discussion and are completely detached from reality.
        
           | phtrivier wrote:
           | I failed to make the requester sound more obnoxious than the
           | request.
           | 
            | My point is precisely that transitioning away from a single
            | app on a single machine is a natural and necessary part of a
            | system's life, but that I can't find satisfying resources on
            | how to handle this phase, as opposed to handling much higher
            | load.
           | 
           | Sorry for the missed joke.
        
         | salawat wrote:
          | The easiest starting point is modeling the problem between you
          | and your co-workers, paying painstaking attention to the flow
          | of knowledge.
         | 
          | Seriously. Most of the difficulty of distributed systems comes
          | from actually having to manage the flow of information between
          | distinct members of a networked composite. Every time someone
          | is out of the loop, what do you do?
         | 
         | Can you tell if someone is out of the loop? What happens if
         | your detector breaks?
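          | 
          | (In protocol terms that detector is a failure detector. A toy
          | timeout-based sketch, with made-up names and thresholds, just
          | to put a shape on the idea:)
          | 
          |     import time
          | 
          |     TIMEOUT = 5.0    # seconds of silence before suspecting
          |     last_heard = {}  # peer -> time of last heartbeat
          | 
          |     def heartbeat(peer):
          |         # call whenever any message arrives from peer
          |         last_heard[peer] = time.monotonic()
          | 
          |     def suspected(peer):
          |         # True if peer has gone quiet; note this can be
          |         # wrong (slow network, paused process), which is
          |         # exactly the "detector breaks" case.
          |         seen = last_heard.get(peer)
          |         if seen is None:
          |             return True
          |         return time.monotonic() - seen > TIMEOUT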
         | 
          | Try it with your coworkers. You have to be super serious about
          | running down the "but how did you know" parts.
         | 
          | Once you have a handle on the ways you trip, go hit the books,
          | and learn all the names for the SNAFUs you just acted out.
        
       | tychota wrote:
        | Why teach Paxos and not Raft? I thought Raft was easier to
        | grasp, and it's used a lot nowadays.
        
       ___________________________________________________________________
       (page generated 2022-08-24 23:00 UTC)