[HN Gopher] The Hardest Program I've Ever Written - How a code f...
       ___________________________________________________________________
        
       The Hardest Program I've Ever Written - How a code formatter works
       (2015)
        
       Author : pcr910303
       Score  : 116 points
       Date   : 2020-03-27 19:58 UTC (3 hours ago)
        
 (HTM) web link (journal.stuffwithstuff.com)
 (TXT) w3m dump (journal.stuffwithstuff.com)
        
       | vjeux wrote:
       | I wrote a super short description of the overall architecture of
       | prettier if people are interested:
       | https://blog.vjeux.com/2017/javascript/anatomy-of-a-javascri...
        
       | Groxx wrote:
       | Previously:
       | 
       | 2018: https://news.ycombinator.com/item?id=17271963
       | 
       | 2017: https://news.ycombinator.com/item?id=15063193
       | 
       | 2015: https://news.ycombinator.com/item?id=10195091 (shortly
       | after it was written)
        
       | wstrange wrote:
       | The dart formatter is awesome - kudos to Bob.
       | 
       | All languages should come with a highly opinionated formatter. It
       | puts an end to useless style arguments.
        
         | andrewzah wrote:
         | I didn't really understand this until I started using golang.
         | Now I don't want to go back to other languages and have to deal
         | with varying syntaxes and manually running a formatter.
        
           | thedance wrote:
           | What's a language that lacks such a facility? ClangFormat
           | covers C/C++/Java/JavaScript/Objective-C/Protobuf/C#. gofmt
           | obviously can do Go. buildifier will rewrite your Bazel BUILD
           | files.
        
             | pje wrote:
             | ruby
        
             | frosted-flakes wrote:
             | Those aren't highly opionated _default_ formatters though.
             | The JavaScript community seems to have settled on Prettier
             | as the go-to formatter, but not everyone uses it, and even
             | Prettier had to add a bunch of options for certain things
             | that will _always_ be highly personal. Stuff like trailing
             | commas, tabs vs spaces (and how many spaces!?), line
             | lengths, and semicolons.
        
         | nicoburns wrote:
         | Every time I use one of these it ends up doing ridiculous
         | things to the formatting, like splitting what should really be
         | 1 line over 4 or 5. Which completely destroys whole-file
         | readability/scanability. Does nobody else find it really hard
         | to read codebases subject to these formatters?
        
           | cbhl wrote:
           | There are always corner cases. For example, clang-format at
           | $day_job doesn't understand semantically that I want my 4x4
           | matrix constant to be formatted as 4 rows with 4 numbers
           | each; putting all 16 numbers in a single long array will get
           | a lower score.
           | 
           | But as a whole, I found it easier to navigate the monolithic
           | code base when most of it has been auto-formatted, exceptions
           | aside. When applying a formatter to existing code, I usually
           | do that in a separate commit and then follow it with any
           | hand-rewritten lines afterwards (in case either the formatter
           | or my changes afterwards accidentally regress something).
           | 
           | And once the formatter is enabled by default, it's usually
           | easy enough to come up with a set of smaller statements that
           | auto-format well to replace a longer statement that auto-
           | formats poorly.
        
           | lhorie wrote:
           | Don't just blindly run a formatter.
           | 
           | I've developed a habit of refactoring long lines early (e.g.
           | breaking large expressions and assigning to intermediary
           | variables) to prevent the formatter from adding hard-to-read
           | mid-expression line breaks
        
             | deadbunny wrote:
             | This is basically the approach I take, write towards the
             | standards of the auto formatter as much as possible and
             | it'll usually handle the rest.
        
       | mholt wrote:
       | I just rewrote our Caddy v2 Caddyfile formatter this week [1].
       | It's not as "smart" as a true code formatter, but what I thought
       | would be just a 1 hour task ended up taking all day. And I'm
       | lucky it wasn't longer!
       | 
       | Lessons learned:
       | 
       | - State is hard. There's something to be said for functional
       | languages, or at least functional patterns.
       | 
       | - The Caddyfile is a very theoretically-impure syntax.
       | 
       | - User input is always messier than you think it will be. (Spend
       | some time helping people on the forums, you'll see what I mean.
       | I'm a neat freak though, so I notice it...)
       | 
       | [1]:
       | https://github.com/caddyserver/caddy/commit/7ee3ab7baa216599...
        
       | sfvisser wrote:
       | I never grasped why this is so hard, you don't need an
       | exponential blowup with some common sense heuristics. Just look
       | at the AST and put the breaks as high up in the tree as possible.
       | 
       | For example: If I have a function application with several
       | arguments one of which is a slightly longer lambda expression,
       | don't even try to break the lambda before putting the arguments
       | (nicely aligned) on their own line.
       | 
       | I'm a big fan of the ergonomics of prettier, but some of its
       | output is objectively ugly and seems the result of over-
       | complicating their search space.
        
       | rattray wrote:
       | Shameless plug that if you're bored and want a really tough, fun
       | challenge, the prettier project has many open tickets:
       | https://github.com/prettier/prettier/issues?q=is%3Aopen+is%3...
       | 
       | Only two are tagged "difficulty:easy", likely for good reason -
       | when I contributed a few fixes a while back, I found it much more
       | challenging than implementing a language syntax, an unrelated
       | project which I was also doing at the same time.
       | 
       | It's really fun, though, and watching colleagues code and use the
       | features of prettier that you built is a nice feeling - "I helped
       | write that code for you!"
        
       | qorrect wrote:
       | I really don't like this guys writing style.
        
       | QuinnWilton wrote:
       | Philip Wadler has a seminal paper [0] on implementing pretty
       | printers by modeling them using an algebra. There's
       | implementations of the algorithm in most programming languages
       | nowadays, and some state of the art advances that help with
       | performance [1]
       | 
       | It's a very elegant algorithm, and is very pleasant to work.
       | Elixir's formatter uses a variant of it [2], and it appears in
       | the standard library [3]. I've personally made use of it to write
       | a code formatter for GraphQL queries, and it was a very small
       | amount of code for some powerful results.
       | 
       | Definitely take a look if you ever need to do something like
       | this.
       | 
       | [0]
       | https://homepages.inf.ed.ac.uk/wadler/papers/prettier/pretti...
       | 
       | [1] https://jyp.github.io/pdf/Prettiest.pdf
       | 
       | [2]
       | http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2...
       | 
       | [3] https://hexdocs.pm/elixir/Inspect.Algebra.html
        
         | westoncb wrote:
         | I'm gonna take a look at Wadler's paper--but does anyone want
         | to take a shot at a high-level description of how an algebra
         | would be applicable to the problem of pretty printing?
         | 
         | Any more general comments on applying algebras to solving
         | programming problems would also be appreciated.
         | 
         | My rough understanding of the benefits/rationale atm:
         | 
         | 1) the algebra is a nice/minimal factoring of the problem
         | 
         | 2) being able to formally classify it means we know a bunch of
         | its properties, which tells us certain approaches are
         | definitely (im)possible (and these properties are drawn from a
         | fairly small set: e.g. associativity, distributivity, existence
         | of inverses)
         | 
         | 3) If the algebra is explicitly represented in code (maybe in a
         | library), we can use that general structure to verify (and
         | generate?) certain aspects of our particular problem solution.
         | 
         | Edit: The paper mentions "Over the years, Richard Bird and
         | others have developed the algebra of programming to a fine art"
         | --and I found Bird's book "Algebra of Programming". Seems like
         | what I'm looking for maybe, but would be grateful to hear of
         | any alternative resources as well.
        
           | QuinnWilton wrote:
           | You'd likely be more interested in the original paper [0] by
           | John Hughes that Wadler built off of. It gets into a ton more
           | detail about the algebraic structure of the documents they're
           | defining:
           | 
           | > The question addressed in this paper is: how should
           | libraries of combinators be designed? Our goal is to show how
           | we can use formal specification of the combinators, and a
           | study of their algebraic properties, to guide both the design
           | and implementation of a combinator library. Our case study is
           | a library for pretty-printing. But the methods we present are
           | of wider applicability, and we also show how they can be used
           | to develop a number of different monads.
           | 
           | Wadler's paper touches upon some of the same ideas, but the
           | primary goal of his paper is to build a prettier printer,
           | whereas Hughes aimed to demonstrate the power of algebraic
           | properties.
           | 
           | [0] http://www.cse.chalmers.se/~rjmh/Papers/pretty.html
        
             | westoncb wrote:
             | This looks great--thanks!
        
               | QuinnWilton wrote:
               | It's a little bit different from the other resources I've
               | linked, but "Elements of Programming" [0] is another
               | excellent (though incredibly challenging) read. It's
               | written by Alexander Stepanov, and basically walks
               | through the way he built STL by decomposing algorithms
               | and data structures into abstract mathematical
               | structures.
               | 
               | > Elements of Programming provides a different
               | understanding of programming than is presented elsewhere.
               | Its major premise is that practical programming, like
               | other areas of science and engineering,must be based on a
               | solid mathematical foundation. The book shows that
               | algorithms implemented in a real programming language,
               | such as C++, can operate in the most general mathematical
               | setting. For example, the fast exponentiation algorithm
               | is defined to work with any associative operation. Using
               | abstract algorithms leads to efficient, reliable, secure,
               | and economical software.
               | 
               | I revisit it every few years and get a little bit further
               | every time before I start to struggle with the exercises.
               | 
               | [0] http://elementsofprogramming.com/
        
           | amykyta wrote:
           | Here is an interesting use
           | https://www.lexifi.com/instrumentbox/. Algebraic description
           | of financial contracts that then gives all sorts of common
           | features you want to analyze them 'for free'.
        
       | einpoklum wrote:
       | The author of that post has it _so easy_! Formatting well-formed
       | programs, in a young language with a short spec. He just starts
       | with an AST!
       | 
       | No no my friend. The challenge is when the program is _not_ well-
       | formed, i.e. has errors, or worse yet - refers to to code
       | elsewhere, so you can't even decide if it's well-formed or not.
       | So no AST for you. You have to go into the dark and dangerous
       | word of speculative and partial parsing of language grammars,
       | representing ambiguity, replacing undefined, ambiguous or
       | erroneously-defined semantics with deductions from the author's
       | existing formatting and so on. That's where the real challenges
       | lie.
       | 
       | And - it's not gonna be 3,853 lines, I can tell you this much.
        
         | eterm wrote:
         | I actually really like a "format on save; don't format without
         | AST" combination.
         | 
         | It's really nice knowing as soon as you hit ctrl+s whether
         | you've just written in a syntax error or not.
        
       | bmc7505 wrote:
       | If you enjoyed reading this, you might want to check out
       | CodeBuff:
       | 
       | https://github.com/antlr/codebuff
       | 
       | CodeBuff is based on a paper by Terrence Parr and Juergen Vinju,
       | "Towards a Universal Code Formatter through Machine Learning":
       | 
       | https://arxiv.org/abs/1606.08866
        
       | dfox wrote:
       | 10 years ago I though that writing S-expression pretty printer
       | should be significantly easier than what the complexity of
       | typical CL/Scheme pretty printer (ie. Waters' XP as described in
       | AIM-1102) implies. Well that is not true ;)
       | 
       | And since trying to do that I almost involuntarily format any
       | code I write by rigorously applying the pretty printing algorithm
       | of XP by combination of manual line breaks and Emacs' newline-
       | and-indent.
        
       | jansan wrote:
       | How do you indent block comments? Do they stay untouched? Do you
       | indent them the same as the previous line? If so, what to do if
       | there are already indents in the comment? Add the additional
       | indent?
       | 
       | I wrote a simple but IMHO really nice XML prettifier a while ago
       | and remember that I finally gave up on block comments, because I
       | could not find a way to handle all cases in a satisfying manner.
       | So I decided to leave the indenting untouched.
        
         | munificent wrote:
         | _> How do you indent block comments? Do they stay untouched?_
         | 
         | Block comments are a little tricky and there are a handful of
         | heuristics to handle them. Generally they stick to the
         | preceding token. Block comments are rarely used in Dart and
         | when they are, it's often just a small marker to describe the
         | previous element like:                   foo(null /* missing
         | */, blah);
         | 
         | So adhering it to the previous token usually does the right
         | thing.
         | 
         |  _> If so, what to do if there are already indents in the
         | comment?_
         | 
         | It doesn't touch the body of the comment itself. There's too
         | many ways for that to go wrong given all of the various things
         | a human might choose to stuff in there: ASCII art, tables,
         | prose, etc.
        
       | blumomo wrote:
       | I can't help myself, the required semicolon at the end of each
       | line in Dart programs keep hurting after writing for many years
       | in Python, JS, Kotlin and Swift. I grew up semicolon-less with
       | QBasic and learned semicolon since Delphi 1.0 and Java 1.0 but
       | today, wanting as few noise on the screen as possible, they are
       | hurting in my eyes.
        
         | seanmcdirmid wrote:
         | Semicolons are just extra signal to humans and compilers, they
         | shouldn't hurt eyes like the use of fixed width fonts do.
        
           | jakear wrote:
           | It's not that they hurt the eyes for me as much as they
           | introduce tons of micro delays and extra keystrokes when
           | editing. For instance, any "X to end of line" (cut, copy,
           | etc), you now need to think about what the destination is and
           | whether you should include the semi, change the selection, go
           | back and delete the semi later, etc.
           | 
           | It sounds slight, but it really builds up over time. Once you
           | get used to "powerediting" without them it's hard to go back.
        
             | seanmcdirmid wrote:
             | End of line inference algorithms are incredibly fragile,
             | both by the compiler and by the human reader, having a
             | semi-colon can make things clearer in the absence of
             | another secondary signal (like indentation). There are so
             | many type errors in scala or typescript that can be solved
             | by adding a semicolon somewhere to eliminate some idiotic
             | ambiguity.
        
               | tom_ wrote:
               | Why not infer the end of a line when finding the end of a
               | line? ;)
               | 
               | This is what most BASICs and assemblers did.
               | 
               | (I think QBasic had a line continuation character? - or
               | maybe that came with Visual BASIC. I don't remember. But,
               | anyway, if you wanted, it was possible to continue a
               | source line past a line break, it just wasn't the
               | default.)
        
               | saagarjha wrote:
               | > End of line inference algorithms are incredibly fragile
               | 
               | I am working on a language _right now_ with optional
               | semicolons (that is, newlines terminate statements unless
               | there is a semicolon, which does the same early) and it's
               | not really that hard. The only change is that instead of
               | looking for a newline character for statement termination
               | I often also need to look for a semicolon.
        
               | jakear wrote:
               | I think the issue is more when the new line _doesn't_
               | terminate the statement. Like in JS:
               | 
               | const a = doThing()
               | 
               | (() => throw Error)
        
               | jackewiehose wrote:
               | Only-EOL like python is fine and always-semicolon like C
               | is fine. Sometimes-optional but sometimes-required
               | semicolon like Javascript is bad in my experience.
               | Javascripts C-like syntax doesn't cope well with missing
               | semicolons
        
         | munificent wrote:
         | I'm the author of the post and work on the Dart language. I
         | spent a bunch of time last year trying to figure out a good way
         | to make semicolons optional in Dart without breaking a lot of
         | stuff.
         | 
         | Because of a handful of syntax quirks particular to Dart, it's
         | really difficult. Most other languages that have optional
         | semicolons were designed from the start to make them optional.
         | Dart was not. On top of that it:
         | 
         | * Uses C-style variable declarations which means there's no
         | keyword to indicate the start of a variable declaration.
         | 
         | * Likewise uses C syntax for function and even local function
         | declarations. Again no keyword marking the beginning of a
         | function.
         | 
         | * Has complex syntax for things like function types which means
         | a type annotation can be arbitarily long and may split over
         | multiple lines.
         | 
         | * Has lots of "contextual keywords" that are not true reserved
         | words but behave like them only in certain contexts.
         | 
         | * Has a rich syntax that overloads a lot of tokens to mean
         | different things in different contexts. "{}" can be an empty
         | map, empty set, empty block, empty function body, or empty
         | lambda body.
         | 
         | All of that makes optional semicolons just really difficult. I
         | haven't given up entirely. We have a thing now called "language
         | versioning" that lets different Dart libraries target different
         | specific versions of the language. That lets us evolve the
         | language in nominally "breaking" ways without actually breaking
         | existing code. (For example, this is how we will migrate the
         | world to non-nullable types.)
         | 
         | That may give us a way to make other grammar simplifications
         | that would then also let us make semicolons optional. But
         | syntax changes like this are always difficult. People have
         | _very strong_ opinions about you changing their existing syntax
         | under them, and the value proposition in return isn 't super
         | compelling.
         | 
         | There's an old saying, the best time to plant a tree is twenty
         | years ago. The second best time is today.
         | 
         | Language syntax isn't always like that because the migration
         | cost once you have a lot of extant code (and users) can just be
         | too painful. The best time to make semicolons optional was
         | twenty years ago. The next best time may be today, but it may
         | just be never.
        
       ___________________________________________________________________
       (page generated 2020-03-27 23:00 UTC)