[HN Gopher] Notes on Theory of Distributed Systems [pdf]
___________________________________________________________________
Notes on Theory of Distributed Systems [pdf]
Author : htfy96
Score  : 68 points
Date   : 2022-08-24 19:51 UTC (3 hours ago)
(HTM) web link (www.cs.yale.edu)
(TXT) w3m dump (www.cs.yale.edu)
| infogulch wrote:
| 15 pages of just TOC. 400+ pages of content.
|
| > These are notes for the Fall 2022 semester version of the Yale course CPSC 465/565 Theory of Distributed Systems
|
| There are a lot of algorithms, but I don't see CRDTs mentioned by name. Perhaps it's most closely related to "19.3 Faster snapshots using lattice agreement"?
| dragontamer wrote:
| > CRDTs
|
| Wrong level of abstraction. This is clearly a lower-level course than that and discusses more fundamental ideas.
|
| A quick look through chapter 6 reminds me of CRDTs, at least the vector clock concept. Other bits from other parts of this course would probably need to be combined into what would be called a CRDT.
| phtrivier wrote:
| Is anyone teaching "Practice of Boring Distributed Systems 101 for Dummies on a Budget with a Tight Schedule"?
|
| As in, "we have a PHP monolith used by all of 12 people in the accounting department, and for some reason we've been tasked with making it run on multiple machines ("for redundancy" or something) by next month.
|
| The original developers left to start a Bitcoin scam.
|
| Some exec read about the "cloud", but we'll probably get just enough budget to buy a coffee for an AWS salesman.
|
| Don't even dream of hiring a "DevOps" to deploy a Kubernetes cluster to orchestrate anything. Don't dream of hiring anyone, actually. Or of paying for anything, for that matter.
|
| You had one machine; here is a second machine. That's a 100% increase in your budget, now go get us some value with that!
|
| And don't come back in three months to ask for another budget to 'upgrade'."
|
| Where would someone start?
|
| (EDIT: To clarify, this is a tongue-in-cheek hyperbole scenario, not a cry for immediate help. Thanks to all who offered help ;)
|
| Yet I'm curious about any resource on how to attack such problems, because I can only find material on how to handle large-scale, multi-million-user, high-availability stuff.)
| keule wrote:
| > As in, "we have a PHP monolith used by all of 12 people in the accounting department, and for some reason we've been tasked with making it run on multiple machines ("for redundancy" or something) by next month.
|
| Usually, your monolith has these components: a web server (apache/nginx + php), a database, and other custom tooling.
|
| > Where would someone start?
|
| I think a first step is to move the database to something managed, like AWS RDS or Azure Managed Databases. Herein lies the basis for scaling out your web tier later. And here you will find the most pain, because there are likely custom backup scripts, cron jobs, and other tools that access the DB in unforeseen ways.
|
| If you get over that hump you have taken your first big step towards a more robust model. Your DB will have automated backups, managed updates, failover, read replicas, etc. You may or may not see a performance increase from effectively splitting your workload across two machines.
|
| _THEN_ you can front your web tier with a load balancer, i.e. you load balance to one machine. This gives you: better networking, custom error pages, support for sticky sessions (you likely need them later), and better/more monitoring.
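|
| (To make "sticky sessions" concrete: a toy Python sketch of the routing decision the balancer makes per request. The addresses are made up, and a real balancer such as nginx or HAProxy does this for you with a built-in option; the point is just that session affinity is a routing decision, not app code.)
|
|     import hashlib
|
|     BACKENDS = ["10.0.0.11:8080", "10.0.0.12:8080"]  # web-tier machines (made up)
|     healthy = set(BACKENDS)                          # kept up to date by health checks
|
|     def pick_backend(client_ip: str) -> str:
|         # Hash the client so it keeps landing on the same machine while that
|         # machine is healthy -- which is what lets file- or memory-based PHP
|         # sessions survive the move to two web servers.
|         candidates = sorted(healthy) or BACKENDS
|         digest = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
|         return candidates[digest % len(candidates)]
|
|     pick_backend("192.0.2.7")  # same client -> same backend, every time
|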
| From there on you can start working on removing those custom scripts from the web-tier machine and start splitting this into an _actual_ load-balanced infrastructure, going to two web-tier machines where traffic is routed using sticky sessions.
|
| Depending on the application design you can start introducing containers.
|
| Now, this approach will not give you a _cloud-native awesome microservice architecture_ with CI/CD and devops. But it will be enough to have higher availability and more robust handling of the (predictable) load in the near future. And on the way you will remove bad patterns, which eventually allows you to move to a better approach.
|
| I would be interested in hearing whether more people face this challenge. I don't know if guides exist around this on the webs.
| qntty wrote:
| Sounds like you could be looking for something like VMware vSphere, if primary-backup replication is what you want.
| throwaway787544 wrote:
| If someone would pay for it I'd write that book. There are lots of different methods for different scenarios. There are some books on it, but they're either very dry and technical or have very few examples.
|
| Here's the CliffsNotes version for your situation:
|
| 1. Build a server. Make an image/snapshot of it.
|
| 2. Build a second server from the snapshot.
|
| 3. Use rsync to copy the files your PHP app writes from one machine ('primary') to the other ('secondary').
|
| 4. To make a "safe" change, change the secondary server and test it.
|
| 5. To "deploy" the change, snapshot the secondary, build a new third server, stop writes on the primary, sync the files over to the third server one last time, point the primary hostname at the third server's IP, test this new primary server, and destroy the old primary server.
|
| 6. If you ever need to "roll back" a change, you can do that while all three servers are still up (blue/green), or deploy a new server from the last working snapshot.
|
| 7. Set up PagerDuty to wake you up if the primary dies. When it does, change the hostname of the first box to point to the IP of the second box.
|
| That's just one way that is very simple. It is a redundant active/passive distributed system with redundant storage and immutable blue/green deployments. It can be considered high-availability, although that term is somewhat loaded; ideally you'd make as much of the system HA as possible, with independent network connections to the backbone, independent power drops, UPSes, etc. (both for bare metal and VMs).
|
| You can get much more complicated, but that's good enough for what they want (redundancy) and it buys you a lot of other benefits.
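|
| (A rough sketch of step 7's "detect and flip" glue, in Python. The hostnames and IPs are made up, and update_dns_record() stands in for whatever your DNS provider or load balancer actually exposes -- it shows the shape of the thing, not a drop-in script.)
|
|     import time
|     import urllib.request
|
|     PRIMARY_HEALTH_URL = "http://app-primary.internal.example/health"  # made up
|     SECONDARY_IP = "203.0.113.12"                                      # made up
|     FAILURES_BEFORE_FAILOVER = 3
|
|     def primary_is_up() -> bool:
|         try:
|             with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
|                 return resp.status == 200
|         except OSError:  # connection refused, timeout, HTTP error, ...
|             return False
|
|     def update_dns_record(name: str, ip: str) -> None:
|         # Replace with your DNS provider's / balancer's real API or CLI.
|         raise NotImplementedError
|
|     failures = 0
|     while True:
|         failures = 0 if primary_is_up() else failures + 1
|         if failures >= FAILURES_BEFORE_FAILOVER:
|             update_dns_record("app.example.com", SECONDARY_IP)  # the "flip"
|             break  # then page a human rather than flapping back and forth
|         time.sleep(30)
|
| (DNS TTLs and client caches mean the flip isn't instant, which is part of why this counts as "good enough redundancy" rather than real HA.)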
| fredsmith219 wrote:
| I can't believe that 12 people would actually be stressing the system. Could you meet the requirements of the project by setting up the second machine as a hot backup at an offsite location?
| phtrivier wrote:
| Maybe. How do I find the O'Reilly book that explains that? And the petty details about knowing the first one is down and starting the backup? And just enough data replication to actually have some data on the second machine? Etc, etc...
|
| My pet peeve with distributed-systems and ops books is that they usually start by laying out all those problems, but then move on to either:
|
| - explain how Big Tech has even bigger problems, before explaining how you can fix Big Tech problems with Big Tech budgets and headcount by deploying just one more layer of distributed cache or queue that virtually ensures your app is never going to work again (that's "Designing Data-Intensive Applications", described in bad faith), or
|
| - not really explain anything, wave their hands chanting "trade-offs, trade-offs", and start telling kids' stories about Byzantine Generals.
| EddySchauHai wrote:
| What you're describing there sounds like general Linux sysadmin to me?
| phtrivier wrote:
| Not entirely, I would argue, if you look at it from the application developer's side.
|
| You have to adapt parts of your app to handle the fact that two machines might be handling the service (either at the same time, or in succession).
|
| This has an impact on how you use memory, how you persist stuff, etc...
|
| None of which is rocket science, probably -- but even things that look "obvious" to lots of people get their O'Reilly books, so...
|
| But you're right that a part of the "distribution" of a system is in the hands of ops more than devs.
| EddySchauHai wrote:
| I guess it's just experience, to be honest. It happens rarely, you might be lucky enough to be involved with solving it, and then you focus on the important parts of the project again. I've only worked in startups so I don't know about the 'Big Tech' solutions, but a little knowledge of general Linux sysadmin, containers, and queues has yet to block me :) Once the company is big enough to need some complexity beyond that, I assume there's enough money to hire someone to come in and put everything into the CNCF's 1000-layer tech stack.
|
| Edit: Thinking on this, if I want to scale something it'd be specific to the problem I'm having, so some sort of debugging process like https://netflixtechblog.com/linux-performance-analysis-in-60... to find the root cause would be generic advice. Then you can decide to scale vertically/horizontally/refactor to solve the problem and move on.
| lmwnshn wrote:
| More entertainment than how-to guide, and oriented more towards developers than ops, but if you haven't read "Scalability! But at what COST?" [0], I think you'll enjoy it.
|
| [0] https://www.frankmcsherry.org/graph/scalability/cost/2015/01...
| arinlen wrote:
| > _explain how Big Tech has even bigger problems, before explaining how you can fix Big Tech problems with Big Tech budgets and headcount (...)_
|
| What do you have to say about the fact that the career goal of those interested in this sort of topic is... to be counted as part of the headcount of these Big Tech companies while getting paid Big Tech salaries?
|
| Because if you count yourself among those interested in the topic, that's precisely the type of stuff you're eager to learn so that you're in a better position to address those problems.
|
| What's your answer to that? Continue writing "hello world" services with Spring Initializr because that's all you need?
| phtrivier wrote:
| > Because if you count yourself among those interested in the topic, that's precisely the type of stuff you're eager to learn so that you're in a better position to address those problems.
|
| People will work on problems of different scales in a career; will you agree that different scales of problems call for different techniques?
|
| I have no problem with FAANGs documenting how to fix FAANG issues!
|
| I'm a little bit concerned about FAANG-dev wannabes applying the same techniques to non-FAANG issues, though, for lack of training resources about the "not trivial, but a bit boring" techniques.
|
| Your insight about the budget/salaries makes sense, though: a book about "building your first boring IT project right" is definitely not going to be a best seller anytime soon :D
| enumjorge wrote:
| Nothing wrong with having those aspirations, but it sounds like the parent commenter has non-Big-Tech-sized problems he needs to solve now.
| slt2021 wrote:
| Distributed systems are usually for millions of users, not 12 users.
|
| For your problem you can start by configuring nginx to work as a load balancer and spinning up a 2nd VM with the PHP app.
| phtrivier wrote:
| "But what if _the_ machine goes down? What if it goes down _during the legally required quarter-earnings reporting consolidation period_? We need _redundancy_!!"
|
| Also, philosophically, I guess, a "distributed" system starts at "two machines". (And you actually get most of the "fun" of distributed systems with "two processes on the same machine".)
|
| We're taught how to deal with "N=1" in school, and "N=all fans of Taylor Swift in the same second" at FAANGs.
|
| Yet I suspect most people will be working on "N=12, 5 hours a day during office hours, except twice a year." And I'm not sure what the reference techniques for that are.
| arinlen wrote:
| > _Also, philosophically, I guess, a "distributed" system starts at "two machines"._
|
| People opening a page in a browser that sends requests to a server is already a distributed system.
|
| A monolith sending requests to a database instance is already a distributed system.
|
| Having a metrics sidecar running alongside your monolith is already a distributed system.
| phtrivier wrote:
| > A monolith sending requests to a database instance is already a distributed system.
|
| True, of course.
|
| And even a simple setup like this brings in "distribution" issues for the app developer:
|
| When do you connect? When do you reconnect?
|
| Where do you get your connection credentials from?
|
| What should happen when those credentials have to change?
|
| Do you ever decide to connect to a backup DB?
|
| Do you ever switch your application logic to a mode where you know the DB is down, but you still try to work without it anyway?
|
| Etc.
|
| Those examples are specific to DBs, but in a distributed system any other service brings in the same questions.
|
| With experience you get opinions and intuitions about how to attack each issue; my question is still: "if you had to point a newcomer to some reference or book about those questions, where would you point?"
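|
| (To make it concrete, the kind of glue I mean looks roughly like this Python sketch -- the names and env vars are made up, and every app ends up answering these questions one way or another:)
|
|     import os
|     import time
|
|     DB_DSN = os.environ["DB_DSN"]                     # where do credentials come from?
|     FALLBACK_DSN = os.environ.get("DB_FALLBACK_DSN")  # do you ever use a backup DB?
|
|     def connect(dsn):
|         """Stand-in for whatever DB client library the app already uses."""
|         ...
|
|     def get_connection(retries=3, delay=1.0):
|         # When do you connect, when do you reconnect, and for how long do you try?
|         for attempt in range(retries):
|             for dsn in filter(None, [DB_DSN, FALLBACK_DSN]):
|                 try:
|                     return connect(dsn)
|                 except ConnectionError:
|                     time.sleep(delay * (attempt + 1))
|         return None  # ...and what does the app do with None, i.e. in "DB is down" mode?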
| random_coder wrote:
| It's a joke.
| arinlen wrote:
| > _As in, "we have a PHP monolith used by all of 12 people in the accounting department, and for some reason we've been tasked with making it run on multiple machines ("for redundancy" or something) by next month._
|
| I find this comment highly ignorant. The need to deploy a distributed system is not always tied to performance or scalability or reliability.
|
| Sometimes all it takes is having to reuse a system developed by a third party, or to consume an API.
|
| Do you believe you'll always have the luxury of a single process working on a single machine that does zero communication over a network?
|
| Hell, even an SPA calling your backend is a distributed system. Is this not a terribly common use case?
|
| Enough with these ignorant comments. They add nothing to the discussion and are completely detached from reality.
| phtrivier wrote:
| I failed to make the requester sound more obnoxious than the request.
|
| My point is precisely that transitioning away from a single app on a single machine is a natural and necessary part of a system's life, but that I can't find satisfying resources on how to handle this phase, as opposed to handling much higher load.
|
| Sorry for the missed joke.
| salawat wrote:
| The easiest starting point is modeling the problem between you and your co-workers, paying painstaking attention to the flow of knowledge.
|
| Seriously. Most of the difficulty of distributed systems is that you're actually having to manage the flow of information between distinct members of a networked composite. Every time someone is out of the loop, what do you do?
|
| Can you tell if someone is out of the loop? What happens if your detector breaks?
|
| Try it with your coworkers. You have to be super serious about running down the "but how did you know?" parts.
|
| Once you have a handle on the ways you trip up, go hit the books and learn the names of all the SNAFUs you just acted out.
| tychota wrote:
| Why teach Paxos and not Raft? I thought Raft was easier to grasp, and it's used a lot nowadays.
___________________________________________________________________
(page generated 2022-08-24 23:00 UTC)