[HN Gopher] Finding Goroutine Bugs with TLA+
       ___________________________________________________________________
        
       Finding Goroutine Bugs with TLA+
        
       Author : headalgorithm
       Score  : 82 points
       Date   : 2020-09-25 15:55 UTC (7 hours ago)
        
 (HTM) web link (www.hillelwayne.com)
 (TXT) w3m dump (www.hillelwayne.com)
        
       | pron wrote:
       | I think this spec tries to do two things, only one of which fits
       | in with the purpose of TLA+. I think this is caused by some
       | ambiguity regarding the source of the bug: is it due to a
       | difficulty writing a subtle algorithm, or is it due to a
       | difficulty understanding the Go language's semantics.
       | 
       | If it's the latter, then TLA+ isn't the right tool, and neither
       | is SPIN. There are other tools for verifying language semantics.
       | If it's the former -- i.e. the author of the spec knows the
       | meaning of each line and its intent -- then TLA+ is a very good
       | tool, but then a simpler spec could suffice, perhaps something
       | like this: https://pron.github.io/files/Channels.pdf
        
         | hwayne wrote:
         | The point was to do demonstrate the bug while staying close to
         | the Go example. You can find the bug at a higher level, but
         | then there's the objection that you're not finding the bug in
         | the Go code, you're finding the bug in simpler toy code. That
         | raises the further objection that we only find the bug because
         | we were looking for it, which is why I showed we find the bug
         | even if we do a straightforward translation of the code.
        
       | wanghq wrote:
       | TLA: It's the thing I wish I know but haven't got a chance to
       | learn.
        
         | jaytaylor wrote:
         | I feel similarly, except it's the thing I wish was _easier_ to
         | learn! <3
        
           | Jtsummers wrote:
           | Hillel's book is good, as is his online tutorial
           | (https://learntla.com/introduction/).
        
           | dwohnitmok wrote:
           | Going through Lamport's video lectures
           | (https://lamport.azurewebsites.net/video/videos.html) is a
           | great way to get the basics down. It proceeds at a fairly
           | lesiurely pace so if you just block off some amount of time
           | every day (say a video or two a day which usually works out
           | to 30min - 1 hour depending on how well you want to practice
           | the concepts in the video) you'll be able to use TLA+ by the
           | end of a week or so.
        
           | Joker_vD wrote:
           | It's so low-level, every time I try to look into it I am
           | reminded of the 6th normal form, it's such a mess to gather
           | all the parts of the condition into one big conjunction...
        
         | amelius wrote:
         | You'd now be building more robust software, except 100 times
         | slower.
        
           | pron wrote:
           | TLA+ makes development faster, not slower (once you get a
           | feel for how to use it).
           | 
           | I mean, it would be slower if you'd spend time writing
           | mechanically-checked formal proofs, but that's working 10x
           | for just slightly more confidence than you can get with TLC
           | (a model-checker for TLA+) alone, so in almost all cases
           | people just stick to specifying and model-checking.
        
             | Jtsummers wrote:
             | I was working on something similar to say, stepped away,
             | and saw your post. So I'll add to it:
             | 
             | TLA+'s place is as part of your design process (you have
             | one, right?). It's the thing you do before committing any
             | code to the project or radically restructuring an existing
             | project (prototype code fits in the design process, but it
             | shouldn't be considered _committed_ most of the time). By
             | spending a bit of time thinking about the design, and using
             | tools like TLA+ to validate that design, you can save
             | yourself months and years of heartache down the road.
             | 
             | By way of anecdote, some colleagues were working on a
             | concurrent algorithm, needing to share data between two
             | embedded processors talking over a databus. A bit more time
             | spent on the design, versus the rather slapdash approach
             | they took, would've saved them 6 months of debugging and
             | tweaking until it worked. The issues were subtle, so it
             | "worked" but wasn't totally correct and tracing the errors
             | they were seeing back to that code was non-obvious (and
             | hard, debugging embedded isn't trivial). It was quite
             | literally a case where skipping a couple weeks of design
             | cost them months later.
        
               | sagichmal wrote:
               | > TLA+'s place is as part of your design process ... By
               | spending a bit of time thinking about the design, and
               | using tools like TLA+ to validate that design
               | 
               | I think you dramatically underestimate the amount of
               | effort required to produce a formal model of any
               | nontrivial system, and overestimate the stability of
               | system definitions in response to changing business
               | requirements.
        
               | hwayne wrote:
               | I think you dramatically underestimate the benefit of
               | writing a formal model! I've run classes where we've
               | managed to discover bugs in the client's actual
               | production system with an afternoon of modeling.
        
               | neolog wrote:
               | How important is it to discover new bugs? Surely any
               | large production system already has a backlog of tons of
               | known ones. So if I want to fix bugs in my system, I can
               | just go to the issue tracker and pick one.
        
               | hwayne wrote:
               | At the level that you're usually writing specs, the bugs
               | you discover are often very serious ones, such as
               | dropping data.
        
               | Jtsummers wrote:
               | Depends on what "discover" means. In the sense that
               | Hillel is speaking of here, it probably means more like:
               | Discover the bug exists and also discover, to a very high
               | degree of certainty if not absolute certainty, the
               | location of the bug in the code base.
               | 
               | That discovery is actionable, many other bug discoveries
               | are not without further investigation.
        
               | ahelwer wrote:
               | Eh. I used TLA+ for a few projects inside Microsoft. It
               | probably takes a couple of weeks to write a spec for a
               | complicated problem. Then less time to implement it,
               | because all the heavy lifting has already been done - in
               | many cases you're directly transcribing the spec logic
               | without having to think very hard.
               | 
               | The sort of things for which you write specs are
               | generally "core" functionality that doesn't change very
               | often. If people are constantly mucking around in that
               | area without writing specs, your system will be a pile of
               | crap.
        
               | Jtsummers wrote:
               | I'm not underestimating anything because I'm not
               | suggesting producing a formal model of arbitrary,
               | nontrivial systems. My critical point is: Design your
               | system before committing any code to the end result [0].
               | Formal methods and models are one tool in your designer's
               | kit.
               | 
               | [0] This statement should be qualified with: Especially
               | if it's at all novel to you, your team, your
               | organization, or the world. And the amount of time to
               | spend varies based on its true novelty. A team that's
               | made hundreds of RoR sites can churn a new one out in a
               | week with minimal upfront design effort. A team that's
               | making the next RoR should probably spend more time
               | thinking about what that means before making, or
               | committing code to, it.
        
               | pron wrote:
               | In practice, the effort is smaller and cheaper than
               | finding bugs in other ways. But yes, learning where and
               | how to best apply TLA+ takes some practice (less than
               | learning a programming language, but more than learning a
               | new build tool).
        
           | dwohnitmok wrote:
           | The crucial thing about TLA+ is that because it's not
           | actually executable code your investment in it is completely
           | up to you. You could decide to put in 100x the time if you
           | wanted, you could decide to put in 0.1x the time if you
           | wanted. Your investment in it is totally up to you and
           | completely independent of the time you'd spend in writing the
           | code.
        
           | zimpenfish wrote:
           | I think if people built software 100 times slower, they'd
           | probably end up with more robust software, TLA+ or no.
        
       | justicezyx wrote:
       | The barrier to TLA+ remains the high-cost of rewriting a piece of
       | code in another language.
       | 
       | The translation process itself could be error prone. The
       | disintegrated concepts requires considerable amount of mental
       | effort.
       | 
       | I am very interested in using the host language (the language
       | that used to write the software one wants to verify) compiler to
       | compile into a TLA+ spec.
        
         | dwohnitmok wrote:
         | This article notwithstanding, I don't think the right mental
         | framework to approach TLA+ is to think of it as a way to verify
         | your code. I view TLA+ as a formal design language. It is there
         | to verify and explore the consequences of architectural and
         | design decisions. It is also there to present a formal design
         | document that engineers can refer to as a common artifact when
         | discussing various design decisions.
         | 
         | Now architecture and design is a very vague domain. It could be
         | anything from specific algorithms all the way up to entire
         | distributed systems (and indeed TLA+ can be used for any of
         | those), but the point is that it is a separate entity from
         | code.
         | 
         | I also think in the current formal methods environment this is
         | where TLA+ can make the most impact. It is a far less risky
         | business proposition to use TLA+ to record design decisions
         | where the maturity of its toolchain has minimal impact on your
         | actual codebase vs using a formal methods toolchain that
         | directly affects the code you write, where the maturity of that
         | toolchain is an extremely important business decision.
        
         | hcarvalhoalves wrote:
         | I don't think you can call it "re-writing" when the
         | implementation has to handle a lot of the details that you can
         | and should ignore on the specification.
         | 
         | Trying to compile a specification _from_ the implementation
         | would be equivalent to trying to do  "top-down" via "bottom-
         | up".
        
         | Jtsummers wrote:
         | It's not rewriting, it's modeling. And that's a critical
         | distinction. You're able to elide many, if not most, of the
         | details and distill the effort to the critical algorithms and
         | design elements. I don't care _how_ the memory is stored, or
         | even how big, just that it exists (as an example). If I _do_
         | care, I can add that to the model.
         | 
         | And in the case of the example with Go as the implementation
         | language (in the linked article), many of the details (like how
         | channels work) can be reused between specs. This permits
         | analysis of Go programs generally, without needing to respecify
         | Go's concurrency mechanisms _every. single. time._
        
         | exdsq wrote:
         | I believe the main use case of TLA+ is to first design the code
         | as a spec and then implement it, rather than the other way
         | around.
        
           | kinmi wrote:
           | I'd like to learn more: how would one go about ensuring the
           | implementation was an accurate implementation of the
           | specification? Is there a process for doing this?
        
             | Jtsummers wrote:
             | To the best of my knowledge there's no universal, formal
             | method to translate the spec into code and verify the code
             | properly implements the spec. Just like without TLA+,
             | you're left to conduct your own testing and analysis to
             | verify your code is correct, but with TLA+ you at least
             | have a specification to verify it against (even if it's
             | tedious).
             | 
             | That said, some languages offer more capabilities and
             | tooling to support this. Check out Frama-C and its Mthread
             | plugin [0] (NB: I have no experience with either) for
             | something that could help. In Ada there's SPARK, but I'm
             | not sure the extent that it helps with verifying
             | concurrency properties.
             | 
             | In some languages, like Erlang, Elixir, and Go, the
             | concurrency mechanism is sufficiently well integrated into
             | the languages that it becomes very clear while reading the
             | code what's happening (especially if you properly segment
             | your code into functions and modules to aid in
             | readability). These can make the analysis much easier to
             | perform.
             | 
             | And regardless of language, if you have a specification and
             | design the code tends to come out _much_ cleaner, which
             | makes the manual task of reviewing and verifying these
             | critical parts less onerous. Additionally, it 's not as
             | hard to conceive of tests when you have the simpler spec to
             | look at versus the much larger code constructed from it.
             | 
             | [0] https://frama-c.com/mthread.html
        
             | thamer wrote:
             | There's no perfect answer and it's always been a limitation
             | to take into account. There are model-checking solutions
             | which take a spec and generate code, but you don't get this
             | with TLA+.
             | 
             | In a way, you could make the same argument about writing
             | tests: the code paths you exercise in your CI environment
             | are not quite the same as what's running in prod, and
             | neither is the environment in which the code is running
             | (e.g. the load your process or host is under, the memory
             | pressure, etc).
             | 
             | That said, don't let this small gap and the lack of a
             | perfect solution discourage you from using model checking,
             | there is still a lot to gain from adding it as one more
             | tool in your toolbox.
        
         | pron wrote:
         | I think that the specification in this article -- that tries to
         | do too many things at once -- gives a somewhat wrong impression
         | of how TLA+ is best used. It's not intended to find errors in
         | your code. See my other comment on this matter:
         | https://news.ycombinator.com/item?id=24593196
         | 
         | To apply your claim to how TLA+ _is_ best is to say that the
         | problem with architecture is the high cost of having two (or
         | more) representations of the building, once in steel and
         | concrete and once as a blueprint, and conclude it would be
         | better to not have a blueprint at all.
         | 
         | The spec is supposed to focus on different things than your
         | code and show the bigger picture. While it would be great to
         | then verify the correctness of your actual program, humanity
         | currently does not possess this capability. The biggest
         | programs ever verified "end-to-end" were ~10KLOC, or 1/5 the
         | size of jQuery, and required years of work by world experts.
        
         | k__ wrote:
         | I think it's like writing an outline for an article and writing
         | the actual article.
         | 
         | I mean, both tasks are _writing_ , but the outline is on
         | another abstraction level and makes the article a breeze to
         | write.
        
       | jeffbee wrote:
       | This program finds the problem not because TLA+ is nifty, but
       | because it sweeps out the parameter space, which is exactly what
       | the original Go program should have done in a unit test. What
       | happens when `len(pss)` is just below, at, or just above, or way
       | above `concurrencyProcesses`? That is the type of question that
       | unit tests are supposed to address. Go has a perfectly good
       | deadlock detector.
       | 
       | The real problem cannot be solved by re-modelling in a different
       | high-level language, because the real problem with that code is
       | it gets `pss` from a library call and `concurrencyProcesses` from
       | a global variable, when they should both be function parameters a
       | test case can exercise.
        
         | guessmyname wrote:
         | This is why I invest more time Fuzzing [1] than writing
         | software models or formal specifications.
         | 
         | Google's gofuzz [2] and dvyukov's go-fuzz [3] libraries have
         | been instrumental for my success every time I embark on a
         | journey to write a complex system with concurrency in its
         | heart. The list of bugs I have found in my co-workers programs
         | was so big we decided to spend a month configuring the fuzzing
         | tests and our continuous integration environment to do it
         | automatically. Code quality increased by an order of magnitude
         | and we are now more confident about the code we ship to
         | production.
         | 
         | [1] https://en.wikipedia.org/wiki/Fuzzing
         | 
         | [2] https://github.com/google/gofuzz
         | 
         | [3] https://github.com/dvyukov/go-fuzz
        
       ___________________________________________________________________
       (page generated 2020-09-25 23:00 UTC)