hngopher.com

       [HN Gopher] How do you reason about a probabilistic distributed ...
       ___________________________________________________________________
        
       How do you reason about a probabilistic distributed system?
        
       Author : ahelwer
       Score  : 65 points
       Date   : 2020-09-11 16:10 UTC (6 hours ago)
        
 (HTM) web link (ahelwer.ca)
 (TXT) w3m dump (ahelwer.ca)
        
       | hinkley wrote:
       | Statistics was one of the first classes I really struggled with
       | in school.
       | 
       | When something grabs your attention, you start to suffer from
       | Frequency Bias (like 'all of a sudden' everyone has the same car
       | you bought or are going to buy). But in this case, everyone
       | _really_ does suck at statistics. Which probably explains why
       | when I learned what an actuary was and that they made about the
       | same as an anesthesiologist, I wasn 't that surprised.
       | 
       | The ironic part for me is that I kept thinking dependent
       | variables were independent, whereas most of the people I observe
       | think independent variables are dependent (gambler's fallacy,
       | etc), and so vastly underestimate the potential for a 'bad run'
       | of events.
       | 
       | When I still played WoW they posted an article about how they
       | tweaked probabilities a bit to avoid frustrating people. Much of
       | the questing in WoW is based on succeeding at an action a set
       | number of times that has a probability of failure, which turns it
       | into a normal distribution. Your buddy might have gotten lucky
       | and finished a quest in two minutes. Meanwhile, there's a 1 in a
       | million chance that you could still be working on that quest 5
       | hours later. And there are 10 million users...
       | 
       | So to prevent rage quitting, they made the success rate
       | dependent. Which they can do because the scenario is fictitious.
       | You can't do the same thing in the real world.
       | 
       | I'm not sure if there's a way through this or around it (except
       | maybe "always hire a professional"). More likely in 10 years we
       | will discuss how we should have just backed away slowly from
       | these probabilistic situations and done something else.
        
       | duutfhhh wrote:
       | That's a worthy idea to learn. I've actually found it quite
       | trivial, which means I'm either too smart or too dumb to
       | understand it. That MDP diagram reminded me of the Feynman's path
       | integral model of electrons: the double slit experiment looks as
       | if electrons take all possible, within constraints, paths and the
       | interference image is the sum of them. Except that in the MDP
       | diagram, parameters of the two slits can vary, within some
       | boundaries, and that varying changes the interference image, so
       | the question that MDP answers is "how the interference image can
       | change if we vary the slits parameters within these boundaries"?
        
         | hwayne wrote:
         | > That's a worthy idea to learn. I've actually found it quite
         | trivial, which means I'm either too smart or too dumb to
         | understand it.
         | 
         |  _Or_ that Andrew is very good at explaining things :)
        
       | mjb wrote:
       | Very cool post. I like this point particularly:
       | 
       | > In this case we can't assume, in general, that the thread
       | scheduler will assign each thread processor time with uniform
       | probability. That assumption makes even less sense in a fully
       | distributed system with processes running on separate computers
       | connected by a network.
       | 
       | I would love to see tools in this space mature some more. TLA+
       | has had a huge influence on how I approach distributed and
       | concurrency problems, and a large influence on the industry in
       | general. Modelling non-determinism (beyond the 'try all paths'
       | approach), and assigning probabilities to outcomes, seems like
       | the large missing piece needed to model real-world systems and
       | protocols.
       | 
       | In systems with limited redundancy, for example, there's always a
       | "transition to a failure state" path which tends to get ignored
       | in modelling with tools like TLA+. That's because these tools can
       | tell you that the whole thing _might_ fail (which is useful), but
       | not the probability that the whole thing would fail (which is
       | significantly more useful).
        
       | hinkley wrote:
       | One of the interesting things about Raft that doesn't get focused
       | on much is that strictly speaking Raft does not require that all
       | nodes be able to see each other, just that some nodes can see all
       | of the other nodes.
       | 
       | One of the reasons we use consensus protocols is because with
       | enough hardware involved, the machines cannot even agree about
       | which machines are up and responding to traffic.
       | 
       | In Raft, a leader has to be able to communicate with more than
       | half of the cluster. But the followers don't strictly have to be
       | able to see each other (and if I understand the election protocol
       | correctly, not even during election). The leader is the hub of a
       | wheel, not a member of a star-connected graph, and as long as the
       | 'spokes' are all working, there is no partition. Even if
       | individual links keep disappearing.
       | 
       | Is that true of the snowflake protocols mentioned here? Or do
       | they require a star topology to mostly be intact?
        
         | mjb wrote:
         | > One of the interesting things about Raft that doesn't get
         | focused on much is that strictly speaking Raft does not require
         | that all nodes be able to see each other, just that some nodes
         | can see all of the other nodes.
         | 
         | The same is true of Paxos. If you have a common view of
         | participant health, then consensus is much easier than in the
         | general async model, so Raft and Paxos (and Viewstamped
         | Replication, etc) are like this for a reason.
         | 
         | I don't know about SnowFlake, but there is a very interesting
         | tradeoff here between graph connectedness and resilience to the
         | failure of masters (e.g. if the topology is really a star, then
         | you have no real redundancy and you may as well have a single
         | master rather than using consensus).
        
       | stack_underflow wrote:
       | Couldn't message you on linkedin for some reason (looks like
       | they've locked the messaging feature to premium now??) - but just
       | FYI, looks like the twitter account linked on your blog in the
       | footer is owned by someone else.
        
         | ahelwer wrote:
         | Oh thanks! Looks like another Helwer reclaimed it. I think you
         | can message people on linkedin as part of a connect request
         | maybe?
        
       | hwayne wrote:
       | One thing I've started doing with my PRISM models that helps a
       | lot is using formulas as a poor man's strings. Like instead of
       | // Not-yet-flipped: 0, heads: 1, tails: 2
       | 
       | I'd do                   formula unflipped = 0;         formula
       | heads = 1;         formula tails = 2;
       | 
       | Another thing that helps a lot is writing templates and then
       | writing scripts that write out the spec for you. I'm doing this
       | with a queueing problem and it's godsent.
        
       ___________________________________________________________________
       (page generated 2020-09-11 23:01 UTC)