[HN Gopher] The Hardest Program I've Ever Written - How a code f... ___________________________________________________________________ The Hardest Program I've Ever Written - How a code formatter works (2015) Author : pcr910303 Score : 116 points Date : 2020-03-27 19:58 UTC (3 hours ago) (HTM) web link (journal.stuffwithstuff.com) (TXT) w3m dump (journal.stuffwithstuff.com) | vjeux wrote: | I wrote a super short description of the overall architecture of | prettier if people are interested: | https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascri... | Groxx wrote: | Previously: | | 2018: https://news.ycombinator.com/item?id=17271963 | | 2017: https://news.ycombinator.com/item?id=15063193 | | 2015: https://news.ycombinator.com/item?id=10195091 (shortly | after it was written) | wstrange wrote: | The dart formatter is awesome - kudos to Bob. | | All languages should come with a highly opinionated formatter. It | puts an end to useless style arguments. | andrewzah wrote: | I didn't really understand this until I started using golang. | Now I don't want to go back to other languages and have to deal | with varying syntaxes and manually running a formatter. | thedance wrote: | What's a language that lacks such a facility? ClangFormat | covers C/C++/Java/JavaScript/Objective-C/Protobuf/C#. gofmt | obviously can do Go. buildifier will rewrite your Bazel BUILD | files. | pje wrote: | ruby | frosted-flakes wrote: | Those aren't highly opionated _default_ formatters though. | The JavaScript community seems to have settled on Prettier | as the go-to formatter, but not everyone uses it, and even | Prettier had to add a bunch of options for certain things | that will _always_ be highly personal. Stuff like trailing | commas, tabs vs spaces (and how many spaces!?), line | lengths, and semicolons. | nicoburns wrote: | Every time I use one of these it ends up doing ridiculous | things to the formatting, like splitting what should really be | 1 line over 4 or 5. Which completely destroys whole-file | readability/scanability. Does nobody else find it really hard | to read codebases subject to these formatters? | cbhl wrote: | There are always corner cases. For example, clang-format at | $day_job doesn't understand semantically that I want my 4x4 | matrix constant to be formatted as 4 rows with 4 numbers | each; putting all 16 numbers in a single long array will get | a lower score. | | But as a whole, I found it easier to navigate the monolithic | code base when most of it has been auto-formatted, exceptions | aside. When applying a formatter to existing code, I usually | do that in a separate commit and then follow it with any | hand-rewritten lines afterwards (in case either the formatter | or my changes afterwards accidentally regress something). | | And once the formatter is enabled by default, it's usually | easy enough to come up with a set of smaller statements that | auto-format well to replace a longer statement that auto- | formats poorly. | lhorie wrote: | Don't just blindly run a formatter. | | I've developed a habit of refactoring long lines early (e.g. | breaking large expressions and assigning to intermediary | variables) to prevent the formatter from adding hard-to-read | mid-expression line breaks | deadbunny wrote: | This is basically the approach I take, write towards the | standards of the auto formatter as much as possible and | it'll usually handle the rest. | mholt wrote: | I just rewrote our Caddy v2 Caddyfile formatter this week [1]. | It's not as "smart" as a true code formatter, but what I thought | would be just a 1 hour task ended up taking all day. And I'm | lucky it wasn't longer! | | Lessons learned: | | - State is hard. There's something to be said for functional | languages, or at least functional patterns. | | - The Caddyfile is a very theoretically-impure syntax. | | - User input is always messier than you think it will be. (Spend | some time helping people on the forums, you'll see what I mean. | I'm a neat freak though, so I notice it...) | | [1]: | https://github.com/caddyserver/caddy/commit/7ee3ab7baa216599... | sfvisser wrote: | I never grasped why this is so hard, you don't need an | exponential blowup with some common sense heuristics. Just look | at the AST and put the breaks as high up in the tree as possible. | | For example: If I have a function application with several | arguments one of which is a slightly longer lambda expression, | don't even try to break the lambda before putting the arguments | (nicely aligned) on their own line. | | I'm a big fan of the ergonomics of prettier, but some of its | output is objectively ugly and seems the result of over- | complicating their search space. | rattray wrote: | Shameless plug that if you're bored and want a really tough, fun | challenge, the prettier project has many open tickets: | https://github.com/prettier/prettier/issues?q=is%3Aopen+is%3... | | Only two are tagged "difficulty:easy", likely for good reason - | when I contributed a few fixes a while back, I found it much more | challenging than implementing a language syntax, an unrelated | project which I was also doing at the same time. | | It's really fun, though, and watching colleagues code and use the | features of prettier that you built is a nice feeling - "I helped | write that code for you!" | qorrect wrote: | I really don't like this guys writing style. | QuinnWilton wrote: | Philip Wadler has a seminal paper [0] on implementing pretty | printers by modeling them using an algebra. There's | implementations of the algorithm in most programming languages | nowadays, and some state of the art advances that help with | performance [1] | | It's a very elegant algorithm, and is very pleasant to work. | Elixir's formatter uses a variant of it [2], and it appears in | the standard library [3]. I've personally made use of it to write | a code formatter for GraphQL queries, and it was a very small | amount of code for some powerful results. | | Definitely take a look if you ever need to do something like | this. | | [0] | https://homepages.inf.ed.ac.uk/wadler/papers/prettier/pretti... | | [1] https://jyp.github.io/pdf/Prettiest.pdf | | [2] | http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2... | | [3] https://hexdocs.pm/elixir/Inspect.Algebra.html | westoncb wrote: | I'm gonna take a look at Wadler's paper--but does anyone want | to take a shot at a high-level description of how an algebra | would be applicable to the problem of pretty printing? | | Any more general comments on applying algebras to solving | programming problems would also be appreciated. | | My rough understanding of the benefits/rationale atm: | | 1) the algebra is a nice/minimal factoring of the problem | | 2) being able to formally classify it means we know a bunch of | its properties, which tells us certain approaches are | definitely (im)possible (and these properties are drawn from a | fairly small set: e.g. associativity, distributivity, existence | of inverses) | | 3) If the algebra is explicitly represented in code (maybe in a | library), we can use that general structure to verify (and | generate?) certain aspects of our particular problem solution. | | Edit: The paper mentions "Over the years, Richard Bird and | others have developed the algebra of programming to a fine art" | --and I found Bird's book "Algebra of Programming". Seems like | what I'm looking for maybe, but would be grateful to hear of | any alternative resources as well. | QuinnWilton wrote: | You'd likely be more interested in the original paper [0] by | John Hughes that Wadler built off of. It gets into a ton more | detail about the algebraic structure of the documents they're | defining: | | > The question addressed in this paper is: how should | libraries of combinators be designed? Our goal is to show how | we can use formal specification of the combinators, and a | study of their algebraic properties, to guide both the design | and implementation of a combinator library. Our case study is | a library for pretty-printing. But the methods we present are | of wider applicability, and we also show how they can be used | to develop a number of different monads. | | Wadler's paper touches upon some of the same ideas, but the | primary goal of his paper is to build a prettier printer, | whereas Hughes aimed to demonstrate the power of algebraic | properties. | | [0] http://www.cse.chalmers.se/~rjmh/Papers/pretty.html | westoncb wrote: | This looks great--thanks! | QuinnWilton wrote: | It's a little bit different from the other resources I've | linked, but "Elements of Programming" [0] is another | excellent (though incredibly challenging) read. It's | written by Alexander Stepanov, and basically walks | through the way he built STL by decomposing algorithms | and data structures into abstract mathematical | structures. | | > Elements of Programming provides a different | understanding of programming than is presented elsewhere. | Its major premise is that practical programming, like | other areas of science and engineering,must be based on a | solid mathematical foundation. The book shows that | algorithms implemented in a real programming language, | such as C++, can operate in the most general mathematical | setting. For example, the fast exponentiation algorithm | is defined to work with any associative operation. Using | abstract algorithms leads to efficient, reliable, secure, | and economical software. | | I revisit it every few years and get a little bit further | every time before I start to struggle with the exercises. | | [0] http://elementsofprogramming.com/ | amykyta wrote: | Here is an interesting use | https://www.lexifi.com/instrumentbox/. Algebraic description | of financial contracts that then gives all sorts of common | features you want to analyze them 'for free'. | einpoklum wrote: | The author of that post has it _so easy_! Formatting well-formed | programs, in a young language with a short spec. He just starts | with an AST! | | No no my friend. The challenge is when the program is _not_ well- | formed, i.e. has errors, or worse yet - refers to to code | elsewhere, so you can't even decide if it's well-formed or not. | So no AST for you. You have to go into the dark and dangerous | word of speculative and partial parsing of language grammars, | representing ambiguity, replacing undefined, ambiguous or | erroneously-defined semantics with deductions from the author's | existing formatting and so on. That's where the real challenges | lie. | | And - it's not gonna be 3,853 lines, I can tell you this much. | eterm wrote: | I actually really like a "format on save; don't format without | AST" combination. | | It's really nice knowing as soon as you hit ctrl+s whether | you've just written in a syntax error or not. | bmc7505 wrote: | If you enjoyed reading this, you might want to check out | CodeBuff: | | https://github.com/antlr/codebuff | | CodeBuff is based on a paper by Terrence Parr and Juergen Vinju, | "Towards a Universal Code Formatter through Machine Learning": | | https://arxiv.org/abs/1606.08866 | dfox wrote: | 10 years ago I though that writing S-expression pretty printer | should be significantly easier than what the complexity of | typical CL/Scheme pretty printer (ie. Waters' XP as described in | AIM-1102) implies. Well that is not true ;) | | And since trying to do that I almost involuntarily format any | code I write by rigorously applying the pretty printing algorithm | of XP by combination of manual line breaks and Emacs' newline- | and-indent. | jansan wrote: | How do you indent block comments? Do they stay untouched? Do you | indent them the same as the previous line? If so, what to do if | there are already indents in the comment? Add the additional | indent? | | I wrote a simple but IMHO really nice XML prettifier a while ago | and remember that I finally gave up on block comments, because I | could not find a way to handle all cases in a satisfying manner. | So I decided to leave the indenting untouched. | munificent wrote: | _> How do you indent block comments? Do they stay untouched?_ | | Block comments are a little tricky and there are a handful of | heuristics to handle them. Generally they stick to the | preceding token. Block comments are rarely used in Dart and | when they are, it's often just a small marker to describe the | previous element like: foo(null /* missing | */, blah); | | So adhering it to the previous token usually does the right | thing. | | _> If so, what to do if there are already indents in the | comment?_ | | It doesn't touch the body of the comment itself. There's too | many ways for that to go wrong given all of the various things | a human might choose to stuff in there: ASCII art, tables, | prose, etc. | blumomo wrote: | I can't help myself, the required semicolon at the end of each | line in Dart programs keep hurting after writing for many years | in Python, JS, Kotlin and Swift. I grew up semicolon-less with | QBasic and learned semicolon since Delphi 1.0 and Java 1.0 but | today, wanting as few noise on the screen as possible, they are | hurting in my eyes. | seanmcdirmid wrote: | Semicolons are just extra signal to humans and compilers, they | shouldn't hurt eyes like the use of fixed width fonts do. | jakear wrote: | It's not that they hurt the eyes for me as much as they | introduce tons of micro delays and extra keystrokes when | editing. For instance, any "X to end of line" (cut, copy, | etc), you now need to think about what the destination is and | whether you should include the semi, change the selection, go | back and delete the semi later, etc. | | It sounds slight, but it really builds up over time. Once you | get used to "powerediting" without them it's hard to go back. | seanmcdirmid wrote: | End of line inference algorithms are incredibly fragile, | both by the compiler and by the human reader, having a | semi-colon can make things clearer in the absence of | another secondary signal (like indentation). There are so | many type errors in scala or typescript that can be solved | by adding a semicolon somewhere to eliminate some idiotic | ambiguity. | tom_ wrote: | Why not infer the end of a line when finding the end of a | line? ;) | | This is what most BASICs and assemblers did. | | (I think QBasic had a line continuation character? - or | maybe that came with Visual BASIC. I don't remember. But, | anyway, if you wanted, it was possible to continue a | source line past a line break, it just wasn't the | default.) | saagarjha wrote: | > End of line inference algorithms are incredibly fragile | | I am working on a language _right now_ with optional | semicolons (that is, newlines terminate statements unless | there is a semicolon, which does the same early) and it's | not really that hard. The only change is that instead of | looking for a newline character for statement termination | I often also need to look for a semicolon. | jakear wrote: | I think the issue is more when the new line _doesn't_ | terminate the statement. Like in JS: | | const a = doThing() | | (() => throw Error) | jackewiehose wrote: | Only-EOL like python is fine and always-semicolon like C | is fine. Sometimes-optional but sometimes-required | semicolon like Javascript is bad in my experience. | Javascripts C-like syntax doesn't cope well with missing | semicolons | munificent wrote: | I'm the author of the post and work on the Dart language. I | spent a bunch of time last year trying to figure out a good way | to make semicolons optional in Dart without breaking a lot of | stuff. | | Because of a handful of syntax quirks particular to Dart, it's | really difficult. Most other languages that have optional | semicolons were designed from the start to make them optional. | Dart was not. On top of that it: | | * Uses C-style variable declarations which means there's no | keyword to indicate the start of a variable declaration. | | * Likewise uses C syntax for function and even local function | declarations. Again no keyword marking the beginning of a | function. | | * Has complex syntax for things like function types which means | a type annotation can be arbitarily long and may split over | multiple lines. | | * Has lots of "contextual keywords" that are not true reserved | words but behave like them only in certain contexts. | | * Has a rich syntax that overloads a lot of tokens to mean | different things in different contexts. "{}" can be an empty | map, empty set, empty block, empty function body, or empty | lambda body. | | All of that makes optional semicolons just really difficult. I | haven't given up entirely. We have a thing now called "language | versioning" that lets different Dart libraries target different | specific versions of the language. That lets us evolve the | language in nominally "breaking" ways without actually breaking | existing code. (For example, this is how we will migrate the | world to non-nullable types.) | | That may give us a way to make other grammar simplifications | that would then also let us make semicolons optional. But | syntax changes like this are always difficult. People have | _very strong_ opinions about you changing their existing syntax | under them, and the value proposition in return isn 't super | compelling. | | There's an old saying, the best time to plant a tree is twenty | years ago. The second best time is today. | | Language syntax isn't always like that because the migration | cost once you have a lot of extant code (and users) can just be | too painful. The best time to make semicolons optional was | twenty years ago. The next best time may be today, but it may | just be never. ___________________________________________________________________ (page generated 2020-03-27 23:00 UTC)