[HN Gopher] Postgres Language Server: Implementing the Parser
       ___________________________________________________________________
        
       Postgres Language Server: Implementing the Parser
        
       Author : todsacerdoti
       Score  : 123 points
       Date   : 2023-12-08 12:33 UTC (10 hours ago)
        
 (HTM) web link (supabase.com)
 (TXT) w3m dump (supabase.com)
        
       | kiwicopple wrote:
       | (i'm on the supabase team)
       | 
       | this is an update from the launch here:
       | 
       | https://news.ycombinator.com/item?id=37020610
       | 
       | The focus for the past few months has been getting the Parser
       | right. It has been a lot of work. I'll ping Philipp and get him
       | to join the discussion if there are any questions
       | 
       | We'd also love more contributors, if this sort of thing is up
       | your alley
        
         | specialist wrote:
         | > _...getting the Parser right. It has been a lot of work._
         | 
         | True.
         | 
         | I have an ANTLR4 grammar for SQL. Am noob. Getting things
         | "correct" was some effort. I have the quixotic goal of
         | accepting multiple dialects. It's a lot of time referencing
         | docs, playing with SQL fiddles, and learning how other grammars
         | work.
         | 
         | Now working on performance. Just as tough for noob me. TIL That
         | means removing any potential ambiquities at runtime. ANTLR4's
         | ALL(*) algorithm does some runtime magic whenever the happy
         | path fails. Which tanks performance. Great for making "good
         | enough" grammars. Not so great for chewing thru a corpus of
         | tests.
         | 
         | SQLite has a terrific test suite. Including generated 'random'
         | machine queries. My naive grammar took > 1000ms on these
         | monsters, when it worked. After a few, it'd just ABEND with out
         | of memory errors. Whoops!
         | 
         | By removing ambiquities, those monsters now take < 100ms. I
         | really shouldn't make any claims until all the tests pass. (I'm
         | still trying to figure out if I can handle both T-SQL style
         | bracketed [names] and Postgres style name[index] array
         | references, in the same grammar, without symmantic predicates.)
         | 
         | Any way. This experience has piqued my interest in PEG. I'm
         | worried my LL(k) grammar is too brittle. Making it hard to for
         | any one to maintain and adapt. As you well know, for sure. (For
         | a taste of this challenge, peek at some of the other available
         | ANTLR SQL grammars.) Oh well; that's a concern for later.
         | 
         | I'm looking forward to checking out Supabase's grammar. I'll
         | share mine (Show HN) when I think it's good enough.
        
           | steinroe wrote:
           | that's a huge task to take upon, looking forward to go
           | through it! compared to you, we have gone the "easy" way and
           | use the actual parser from the Postgres server. so no grammar
           | definition and the like. our work was mainly around adapting
           | libpg_query (which is build to parse executable SQL) for our
           | use case.
        
             | CoastalCoder wrote:
             | > use the actual parser from the Postgres server. so no
             | grammar definition and the like.
             | 
             | Doesn't Postgres use Lex/Yacc to describe the grammar?
        
               | sjansen wrote:
               | It actually uses Flex/Bison, and does take advantage of
               | the extended features they provide.
        
           | atonse wrote:
           | If we're all being postgres-specific, isn't there presumably
           | a grammar somewhere in postgres's code that is used to parse
           | the query SQL? Wouldn't you all want to use that same file as
           | a source so it's 100% identical to what Postgres is actually
           | doing?
           | 
           | Genuine question. But maybe the tooling is incompatible.
           | 
           | Also, have you looked at Azure Data Studio? I'm wondering how
           | much you can reuse from the work they did.
        
             | SR2Z wrote:
             | Not necessarily. It might be a recursive descent parser (to
             | generate better error messages).
        
         | claytongulick wrote:
         | Only tangentially related, but if it's of any interest I'm
         | about 90% done with a PEG grammar for the PostgREST query DSL
         | for similar reasons. Happy to share it if it'd be useful.
        
       | steinroe wrote:
       | hey, author here. Thanks for posting it!
       | 
       | A bit of background: a few months ago we announced a Postgres
       | language server[0]. A language server adds features like syntax
       | error diagnostic and autocomplete to your editor (vscode, neovim,
       | etc). We have iterated a lot on the parser over the past few
       | months and want to share an update today.
       | 
       | the parser is a core piece of any language server that constructs
       | syntax trees from the raw input string. Usually first an untyped
       | concrete syntax tree (cst) that represents the syntactic
       | structure of the input, and subsequently a typed abstract syntax
       | tree (ast) containing the meaning of the source.
       | 
       | In our implementation, we leverage the actual Postgres parser to-
       | do the heavy lifting. However, the parser is designed to parse
       | executable SQL -- not to provide language intelligence. For
       | example, it does not handle incomplete inputs, and outputs just
       | the ast, not the cst. To use it for a language server we had to
       | work around these limitations as good as possible.
       | 
       | While we leverage procedural macros in rust to generate a lot of
       | the repetitive parser code, there remains a portion that requires
       | a bit of manual work. But the groundwork is completed, and we can
       | finally start working on the data model and the actual server
       | next. Our aim is to bring this to a usable state as swiftly as
       | possible.
       | 
       | Huge shout-out to pg_analyze for creating and maintaining
       | libpg_query[1], without which this project would not be possible!
       | 
       | [0] https://news.ycombinator.com/item?id=37020610
       | 
       | [1] https://github.com/pganalyze/libpg_query
        
         | kageiit wrote:
         | Have you considered using something like a treesitter grammar?
         | It could solve the editor specific uses cases like highlighting
         | and even linting as it creates asts that are more amenable for
         | a language server implementation
        
           | chatmasta wrote:
           | At Splitgraph we compiled tree-sitter to wasm for Postgres
           | autocomplete. Indeed, one major reason we opted for it is
           | that it has error handling so it can work with incomplete
           | syntax (which is what your code is 90% of the time you're
           | editing it).
           | 
           | https://www.splitgraph.com/blog/parsing-pgsql-with-tree-
           | sitt...
        
             | steinroe wrote:
             | thanks for the link, very interesting read! and you are
             | right, libpg_query has its limitations.
             | 
             | the idea is to first implement the parser with libpg_query
             | and work around its limitations as good as possible. Since
             | the scan api also returns all tokens for invalid sql, the
             | language server will then have basic features and syntax
             | error diagnostics for invalid statements, and advanced
             | features for valid ones. once the server itself is done, we
             | want to go back to the parser and replace the libpg_query-
             | based parser with a more resilient alternative statement by
             | statement. ultimately, the libpg_query-based parser should
             | just be the fallback.
             | 
             | that being said, very excited that there is so much
             | development in postgres dx.
        
               | chatmasta wrote:
               | Yes, I think the optimal solution here is a combination
               | of tree-sitter for real time (as-you-type) with a
               | fallback to libpg_query. I mean it's technically the
               | other way around, since libpg_query is preferred when it
               | parses correctly. But yeah I think you inevitably need a
               | combination. Pretty sure TS does similar things in
               | VSCode.
        
         | gen220 wrote:
         | Thank you for doing this!
         | 
         | Every year I get frustrated with a PostgreSQL formatter, look
         | out into the webs for hope, and begrudgingly return to my sub-
         | par editing experience.
         | 
         | Please take your time to do a good job, and thank you, again!
        
         | anarazel wrote:
         | Have you investigated generating your parser from postgres'
         | gram.y instead of basically basing it on the bison output?
        
           | steinroe wrote:
           | that's a very interesting idea!
           | 
           | for now, our goal is to take the "easy" route with
           | libpg_query and build a language server that provides basic
           | lsp features for invalid sql, and advanced lsp features for
           | valid sql as fast as possible. we then want to go back to the
           | parser and replace the libpg_query-based approach with a more
           | resilient alternative. as of now, the plan is to implement a
           | handwritten recursive-descent statement by statement. will
           | definitely do research to what extend we could leverage
           | gram.y there, especially to potentially fast-track it.
        
             | anarazel wrote:
             | If some annotations or such would make that easier, it
             | might be possible to do that upstream. Postgres currently
             | has hand-generated code for tab-completion in psql and
             | that's uh, not great (it has gotten large enough that msvc
             | has trouble compiling the file).
        
       | zubairq wrote:
       | I Really like this!
        
       | netcraft wrote:
       | I can't wait for this. Just getting out a great AST would be
       | amazing for things like formatters, but I want to build things
       | like a library that knows everywhere you use a table in a
       | particular way or reference a column, or linting that can help
       | you identify problems with the way you do a join or forgot to
       | apply the right filter for multi-tenant queries.
        
         | steinroe wrote:
         | that is definitely the goal, both a formatter and a linter. we
         | want to add something like squawk and plpgsql_check directly to
         | the language server, so you get eslint-like dx. with both the
         | ast and the database schema at hand, you can basically add any
         | rule you like.
         | 
         | and this is far out, but eventually we are maybe even able to
         | combine the language server with declarative schema management
         | and have go-to-definition etc working.
         | 
         | [0] https://github.com/sbdchd/squawk/tree/master [1]
         | https://github.com/okbob/plpgsql_check
        
         | mhh__ wrote:
         | For a formatter you only need tokens and rules. ASTs are
         | unnecessary unless you want to do transformations at the same
         | time.
        
       | 2mm3mm3 wrote:
       | When will PG finally start using tasks instead of
       | superheavyweight threads? When will they automatically
       | precompile/hash statements to avoid reparsing? And when
       | materialized views will be automatically updated? Where is an
       | alternative to SQL Server Always On Availability Group released
       | >10 years ago? PostgreSQL is the only database where guys
       | recommend you to configure a separate connection pooler, because
       | their database is unable to handle even mid. size number of
       | connections.
        
         | kiwicopple wrote:
         | it's a great list. No database is perfect, and this is a good
         | set of features that are missing from Postgres
         | 
         | that said, the flexibility/extensibility of Postgres still
         | makes it a great choice. You could easily point at other
         | databases and produce a list similar (or longer) than this.
        
         | Yasuraka wrote:
         | I'll take vertical sharding until no longer possible and then
         | dropping all further requests over touching Always On
         | Availability Groups ever again
        
         | anarazel wrote:
         | When will you contribute?
        
         | paulddraper wrote:
         | lol, salty
        
       | epgui wrote:
       | Out of curiosity, why are parser combinators not the choice
       | approach for something like this? Is it mainly a performance
       | thing?
       | 
       | PS-- this is super exciting!
        
         | steinroe wrote:
         | in our specific case, we needed something handwritten for the
         | statement-level parser anyways. And the requirements for the LL
         | parser are very simple: extract individual sql statements from
         | a source input. we just compare the next n tokens with a list
         | of tokens from which any statement starts, which is
         | straightforward to implement with a handwritten LL parser.
         | 
         | so after all, I would say its a decision based on the special
         | requirements we have working around the limitation of
         | libpg_query. I think its also the fastest one, but this was not
         | the main reason.
        
           | epgui wrote:
           | Thanks for your response!
           | 
           | I'm a little confused by it though... Maybe I am
           | misunderstanding what you're telling me.
           | 
           | I would definitely consider a parser implemented with parser
           | combinators to be "handwritten". The main difference for me
           | is that parser combinators allow you to express a language
           | grammar in a more declarative way, which makes refactoring
           | and correctness verification much simpler. I think what you
           | describe as "straightforward" without combinators would be
           | "completely trivial" with combinators.
           | 
           | I know from reading recent comp sci papers (particularly in
           | the Journal of Functional Programming) that "the parser
           | problem" is not fully solved yet (!) and that there are
           | various theoretical limits to all currently known formal
           | approaches. I suspect that performance could be a limiting
           | factor for certain types of syntaxes, although I'm not sure
           | if Postgres' would be particularly problematic in this
           | regard.
        
             | steinroe wrote:
             | sorry, I think I misunderstood your question!
             | 
             | after all, we did not implement a "real" parser. we just
             | use libpg_query, the actual Postgres parser, and work
             | around its limitations as good as possible. The
             | implementation thereby required maximum flexibility. we
             | never define any grammar other than "a select statement
             | starts with a SELECT keyword".
        
       ___________________________________________________________________
       (page generated 2023-12-08 23:00 UTC)