[HN Gopher] Finding Goroutine Bugs with TLA+ ___________________________________________________________________ Finding Goroutine Bugs with TLA+ Author : headalgorithm Score : 82 points Date : 2020-09-25 15:55 UTC (7 hours ago) (HTM) web link (www.hillelwayne.com) (TXT) w3m dump (www.hillelwayne.com) | pron wrote: | I think this spec tries to do two things, only one of which fits | in with the purpose of TLA+. I think this is caused by some | ambiguity regarding the source of the bug: is it due to a | difficulty writing a subtle algorithm, or is it due to a | difficulty understanding the Go language's semantics. | | If it's the latter, then TLA+ isn't the right tool, and neither | is SPIN. There are other tools for verifying language semantics. | If it's the former -- i.e. the author of the spec knows the | meaning of each line and its intent -- then TLA+ is a very good | tool, but then a simpler spec could suffice, perhaps something | like this: https://pron.github.io/files/Channels.pdf | hwayne wrote: | The point was to do demonstrate the bug while staying close to | the Go example. You can find the bug at a higher level, but | then there's the objection that you're not finding the bug in | the Go code, you're finding the bug in simpler toy code. That | raises the further objection that we only find the bug because | we were looking for it, which is why I showed we find the bug | even if we do a straightforward translation of the code. | wanghq wrote: | TLA: It's the thing I wish I know but haven't got a chance to | learn. | jaytaylor wrote: | I feel similarly, except it's the thing I wish was _easier_ to | learn! <3 | Jtsummers wrote: | Hillel's book is good, as is his online tutorial | (https://learntla.com/introduction/). | dwohnitmok wrote: | Going through Lamport's video lectures | (https://lamport.azurewebsites.net/video/videos.html) is a | great way to get the basics down. It proceeds at a fairly | lesiurely pace so if you just block off some amount of time | every day (say a video or two a day which usually works out | to 30min - 1 hour depending on how well you want to practice | the concepts in the video) you'll be able to use TLA+ by the | end of a week or so. | Joker_vD wrote: | It's so low-level, every time I try to look into it I am | reminded of the 6th normal form, it's such a mess to gather | all the parts of the condition into one big conjunction... | amelius wrote: | You'd now be building more robust software, except 100 times | slower. | pron wrote: | TLA+ makes development faster, not slower (once you get a | feel for how to use it). | | I mean, it would be slower if you'd spend time writing | mechanically-checked formal proofs, but that's working 10x | for just slightly more confidence than you can get with TLC | (a model-checker for TLA+) alone, so in almost all cases | people just stick to specifying and model-checking. | Jtsummers wrote: | I was working on something similar to say, stepped away, | and saw your post. So I'll add to it: | | TLA+'s place is as part of your design process (you have | one, right?). It's the thing you do before committing any | code to the project or radically restructuring an existing | project (prototype code fits in the design process, but it | shouldn't be considered _committed_ most of the time). By | spending a bit of time thinking about the design, and using | tools like TLA+ to validate that design, you can save | yourself months and years of heartache down the road. | | By way of anecdote, some colleagues were working on a | concurrent algorithm, needing to share data between two | embedded processors talking over a databus. A bit more time | spent on the design, versus the rather slapdash approach | they took, would've saved them 6 months of debugging and | tweaking until it worked. The issues were subtle, so it | "worked" but wasn't totally correct and tracing the errors | they were seeing back to that code was non-obvious (and | hard, debugging embedded isn't trivial). It was quite | literally a case where skipping a couple weeks of design | cost them months later. | sagichmal wrote: | > TLA+'s place is as part of your design process ... By | spending a bit of time thinking about the design, and | using tools like TLA+ to validate that design | | I think you dramatically underestimate the amount of | effort required to produce a formal model of any | nontrivial system, and overestimate the stability of | system definitions in response to changing business | requirements. | hwayne wrote: | I think you dramatically underestimate the benefit of | writing a formal model! I've run classes where we've | managed to discover bugs in the client's actual | production system with an afternoon of modeling. | neolog wrote: | How important is it to discover new bugs? Surely any | large production system already has a backlog of tons of | known ones. So if I want to fix bugs in my system, I can | just go to the issue tracker and pick one. | hwayne wrote: | At the level that you're usually writing specs, the bugs | you discover are often very serious ones, such as | dropping data. | Jtsummers wrote: | Depends on what "discover" means. In the sense that | Hillel is speaking of here, it probably means more like: | Discover the bug exists and also discover, to a very high | degree of certainty if not absolute certainty, the | location of the bug in the code base. | | That discovery is actionable, many other bug discoveries | are not without further investigation. | ahelwer wrote: | Eh. I used TLA+ for a few projects inside Microsoft. It | probably takes a couple of weeks to write a spec for a | complicated problem. Then less time to implement it, | because all the heavy lifting has already been done - in | many cases you're directly transcribing the spec logic | without having to think very hard. | | The sort of things for which you write specs are | generally "core" functionality that doesn't change very | often. If people are constantly mucking around in that | area without writing specs, your system will be a pile of | crap. | Jtsummers wrote: | I'm not underestimating anything because I'm not | suggesting producing a formal model of arbitrary, | nontrivial systems. My critical point is: Design your | system before committing any code to the end result [0]. | Formal methods and models are one tool in your designer's | kit. | | [0] This statement should be qualified with: Especially | if it's at all novel to you, your team, your | organization, or the world. And the amount of time to | spend varies based on its true novelty. A team that's | made hundreds of RoR sites can churn a new one out in a | week with minimal upfront design effort. A team that's | making the next RoR should probably spend more time | thinking about what that means before making, or | committing code to, it. | pron wrote: | In practice, the effort is smaller and cheaper than | finding bugs in other ways. But yes, learning where and | how to best apply TLA+ takes some practice (less than | learning a programming language, but more than learning a | new build tool). | dwohnitmok wrote: | The crucial thing about TLA+ is that because it's not | actually executable code your investment in it is completely | up to you. You could decide to put in 100x the time if you | wanted, you could decide to put in 0.1x the time if you | wanted. Your investment in it is totally up to you and | completely independent of the time you'd spend in writing the | code. | zimpenfish wrote: | I think if people built software 100 times slower, they'd | probably end up with more robust software, TLA+ or no. | justicezyx wrote: | The barrier to TLA+ remains the high-cost of rewriting a piece of | code in another language. | | The translation process itself could be error prone. The | disintegrated concepts requires considerable amount of mental | effort. | | I am very interested in using the host language (the language | that used to write the software one wants to verify) compiler to | compile into a TLA+ spec. | dwohnitmok wrote: | This article notwithstanding, I don't think the right mental | framework to approach TLA+ is to think of it as a way to verify | your code. I view TLA+ as a formal design language. It is there | to verify and explore the consequences of architectural and | design decisions. It is also there to present a formal design | document that engineers can refer to as a common artifact when | discussing various design decisions. | | Now architecture and design is a very vague domain. It could be | anything from specific algorithms all the way up to entire | distributed systems (and indeed TLA+ can be used for any of | those), but the point is that it is a separate entity from | code. | | I also think in the current formal methods environment this is | where TLA+ can make the most impact. It is a far less risky | business proposition to use TLA+ to record design decisions | where the maturity of its toolchain has minimal impact on your | actual codebase vs using a formal methods toolchain that | directly affects the code you write, where the maturity of that | toolchain is an extremely important business decision. | hcarvalhoalves wrote: | I don't think you can call it "re-writing" when the | implementation has to handle a lot of the details that you can | and should ignore on the specification. | | Trying to compile a specification _from_ the implementation | would be equivalent to trying to do "top-down" via "bottom- | up". | Jtsummers wrote: | It's not rewriting, it's modeling. And that's a critical | distinction. You're able to elide many, if not most, of the | details and distill the effort to the critical algorithms and | design elements. I don't care _how_ the memory is stored, or | even how big, just that it exists (as an example). If I _do_ | care, I can add that to the model. | | And in the case of the example with Go as the implementation | language (in the linked article), many of the details (like how | channels work) can be reused between specs. This permits | analysis of Go programs generally, without needing to respecify | Go's concurrency mechanisms _every. single. time._ | exdsq wrote: | I believe the main use case of TLA+ is to first design the code | as a spec and then implement it, rather than the other way | around. | kinmi wrote: | I'd like to learn more: how would one go about ensuring the | implementation was an accurate implementation of the | specification? Is there a process for doing this? | Jtsummers wrote: | To the best of my knowledge there's no universal, formal | method to translate the spec into code and verify the code | properly implements the spec. Just like without TLA+, | you're left to conduct your own testing and analysis to | verify your code is correct, but with TLA+ you at least | have a specification to verify it against (even if it's | tedious). | | That said, some languages offer more capabilities and | tooling to support this. Check out Frama-C and its Mthread | plugin [0] (NB: I have no experience with either) for | something that could help. In Ada there's SPARK, but I'm | not sure the extent that it helps with verifying | concurrency properties. | | In some languages, like Erlang, Elixir, and Go, the | concurrency mechanism is sufficiently well integrated into | the languages that it becomes very clear while reading the | code what's happening (especially if you properly segment | your code into functions and modules to aid in | readability). These can make the analysis much easier to | perform. | | And regardless of language, if you have a specification and | design the code tends to come out _much_ cleaner, which | makes the manual task of reviewing and verifying these | critical parts less onerous. Additionally, it 's not as | hard to conceive of tests when you have the simpler spec to | look at versus the much larger code constructed from it. | | [0] https://frama-c.com/mthread.html | thamer wrote: | There's no perfect answer and it's always been a limitation | to take into account. There are model-checking solutions | which take a spec and generate code, but you don't get this | with TLA+. | | In a way, you could make the same argument about writing | tests: the code paths you exercise in your CI environment | are not quite the same as what's running in prod, and | neither is the environment in which the code is running | (e.g. the load your process or host is under, the memory | pressure, etc). | | That said, don't let this small gap and the lack of a | perfect solution discourage you from using model checking, | there is still a lot to gain from adding it as one more | tool in your toolbox. | pron wrote: | I think that the specification in this article -- that tries to | do too many things at once -- gives a somewhat wrong impression | of how TLA+ is best used. It's not intended to find errors in | your code. See my other comment on this matter: | https://news.ycombinator.com/item?id=24593196 | | To apply your claim to how TLA+ _is_ best is to say that the | problem with architecture is the high cost of having two (or | more) representations of the building, once in steel and | concrete and once as a blueprint, and conclude it would be | better to not have a blueprint at all. | | The spec is supposed to focus on different things than your | code and show the bigger picture. While it would be great to | then verify the correctness of your actual program, humanity | currently does not possess this capability. The biggest | programs ever verified "end-to-end" were ~10KLOC, or 1/5 the | size of jQuery, and required years of work by world experts. | k__ wrote: | I think it's like writing an outline for an article and writing | the actual article. | | I mean, both tasks are _writing_ , but the outline is on | another abstraction level and makes the article a breeze to | write. | jeffbee wrote: | This program finds the problem not because TLA+ is nifty, but | because it sweeps out the parameter space, which is exactly what | the original Go program should have done in a unit test. What | happens when `len(pss)` is just below, at, or just above, or way | above `concurrencyProcesses`? That is the type of question that | unit tests are supposed to address. Go has a perfectly good | deadlock detector. | | The real problem cannot be solved by re-modelling in a different | high-level language, because the real problem with that code is | it gets `pss` from a library call and `concurrencyProcesses` from | a global variable, when they should both be function parameters a | test case can exercise. | guessmyname wrote: | This is why I invest more time Fuzzing [1] than writing | software models or formal specifications. | | Google's gofuzz [2] and dvyukov's go-fuzz [3] libraries have | been instrumental for my success every time I embark on a | journey to write a complex system with concurrency in its | heart. The list of bugs I have found in my co-workers programs | was so big we decided to spend a month configuring the fuzzing | tests and our continuous integration environment to do it | automatically. Code quality increased by an order of magnitude | and we are now more confident about the code we ship to | production. | | [1] https://en.wikipedia.org/wiki/Fuzzing | | [2] https://github.com/google/gofuzz | | [3] https://github.com/dvyukov/go-fuzz ___________________________________________________________________ (page generated 2020-09-25 23:00 UTC)