[HN Gopher] Postgres Language Server: Implementing the Parser ___________________________________________________________________ Postgres Language Server: Implementing the Parser Author : todsacerdoti Score : 123 points Date : 2023-12-08 12:33 UTC (10 hours ago) (HTM) web link (supabase.com) (TXT) w3m dump (supabase.com) | kiwicopple wrote: | (i'm on the supabase team) | | this is an update from the launch here: | | https://news.ycombinator.com/item?id=37020610 | | The focus for the past few months has been getting the Parser | right. It has been a lot of work. I'll ping Philipp and get him | to join the discussion if there are any questions | | We'd also love more contributors, if this sort of thing is up | your alley | specialist wrote: | > _...getting the Parser right. It has been a lot of work._ | | True. | | I have an ANTLR4 grammar for SQL. Am noob. Getting things | "correct" was some effort. I have the quixotic goal of | accepting multiple dialects. It's a lot of time referencing | docs, playing with SQL fiddles, and learning how other grammars | work. | | Now working on performance. Just as tough for noob me. TIL That | means removing any potential ambiquities at runtime. ANTLR4's | ALL(*) algorithm does some runtime magic whenever the happy | path fails. Which tanks performance. Great for making "good | enough" grammars. Not so great for chewing thru a corpus of | tests. | | SQLite has a terrific test suite. Including generated 'random' | machine queries. My naive grammar took > 1000ms on these | monsters, when it worked. After a few, it'd just ABEND with out | of memory errors. Whoops! | | By removing ambiquities, those monsters now take < 100ms. I | really shouldn't make any claims until all the tests pass. (I'm | still trying to figure out if I can handle both T-SQL style | bracketed [names] and Postgres style name[index] array | references, in the same grammar, without symmantic predicates.) | | Any way. This experience has piqued my interest in PEG. I'm | worried my LL(k) grammar is too brittle. Making it hard to for | any one to maintain and adapt. As you well know, for sure. (For | a taste of this challenge, peek at some of the other available | ANTLR SQL grammars.) Oh well; that's a concern for later. | | I'm looking forward to checking out Supabase's grammar. I'll | share mine (Show HN) when I think it's good enough. | steinroe wrote: | that's a huge task to take upon, looking forward to go | through it! compared to you, we have gone the "easy" way and | use the actual parser from the Postgres server. so no grammar | definition and the like. our work was mainly around adapting | libpg_query (which is build to parse executable SQL) for our | use case. | CoastalCoder wrote: | > use the actual parser from the Postgres server. so no | grammar definition and the like. | | Doesn't Postgres use Lex/Yacc to describe the grammar? | sjansen wrote: | It actually uses Flex/Bison, and does take advantage of | the extended features they provide. | atonse wrote: | If we're all being postgres-specific, isn't there presumably | a grammar somewhere in postgres's code that is used to parse | the query SQL? Wouldn't you all want to use that same file as | a source so it's 100% identical to what Postgres is actually | doing? | | Genuine question. But maybe the tooling is incompatible. | | Also, have you looked at Azure Data Studio? I'm wondering how | much you can reuse from the work they did. | SR2Z wrote: | Not necessarily. It might be a recursive descent parser (to | generate better error messages). | claytongulick wrote: | Only tangentially related, but if it's of any interest I'm | about 90% done with a PEG grammar for the PostgREST query DSL | for similar reasons. Happy to share it if it'd be useful. | steinroe wrote: | hey, author here. Thanks for posting it! | | A bit of background: a few months ago we announced a Postgres | language server[0]. A language server adds features like syntax | error diagnostic and autocomplete to your editor (vscode, neovim, | etc). We have iterated a lot on the parser over the past few | months and want to share an update today. | | the parser is a core piece of any language server that constructs | syntax trees from the raw input string. Usually first an untyped | concrete syntax tree (cst) that represents the syntactic | structure of the input, and subsequently a typed abstract syntax | tree (ast) containing the meaning of the source. | | In our implementation, we leverage the actual Postgres parser to- | do the heavy lifting. However, the parser is designed to parse | executable SQL -- not to provide language intelligence. For | example, it does not handle incomplete inputs, and outputs just | the ast, not the cst. To use it for a language server we had to | work around these limitations as good as possible. | | While we leverage procedural macros in rust to generate a lot of | the repetitive parser code, there remains a portion that requires | a bit of manual work. But the groundwork is completed, and we can | finally start working on the data model and the actual server | next. Our aim is to bring this to a usable state as swiftly as | possible. | | Huge shout-out to pg_analyze for creating and maintaining | libpg_query[1], without which this project would not be possible! | | [0] https://news.ycombinator.com/item?id=37020610 | | [1] https://github.com/pganalyze/libpg_query | kageiit wrote: | Have you considered using something like a treesitter grammar? | It could solve the editor specific uses cases like highlighting | and even linting as it creates asts that are more amenable for | a language server implementation | chatmasta wrote: | At Splitgraph we compiled tree-sitter to wasm for Postgres | autocomplete. Indeed, one major reason we opted for it is | that it has error handling so it can work with incomplete | syntax (which is what your code is 90% of the time you're | editing it). | | https://www.splitgraph.com/blog/parsing-pgsql-with-tree- | sitt... | steinroe wrote: | thanks for the link, very interesting read! and you are | right, libpg_query has its limitations. | | the idea is to first implement the parser with libpg_query | and work around its limitations as good as possible. Since | the scan api also returns all tokens for invalid sql, the | language server will then have basic features and syntax | error diagnostics for invalid statements, and advanced | features for valid ones. once the server itself is done, we | want to go back to the parser and replace the libpg_query- | based parser with a more resilient alternative statement by | statement. ultimately, the libpg_query-based parser should | just be the fallback. | | that being said, very excited that there is so much | development in postgres dx. | chatmasta wrote: | Yes, I think the optimal solution here is a combination | of tree-sitter for real time (as-you-type) with a | fallback to libpg_query. I mean it's technically the | other way around, since libpg_query is preferred when it | parses correctly. But yeah I think you inevitably need a | combination. Pretty sure TS does similar things in | VSCode. | gen220 wrote: | Thank you for doing this! | | Every year I get frustrated with a PostgreSQL formatter, look | out into the webs for hope, and begrudgingly return to my sub- | par editing experience. | | Please take your time to do a good job, and thank you, again! | anarazel wrote: | Have you investigated generating your parser from postgres' | gram.y instead of basically basing it on the bison output? | steinroe wrote: | that's a very interesting idea! | | for now, our goal is to take the "easy" route with | libpg_query and build a language server that provides basic | lsp features for invalid sql, and advanced lsp features for | valid sql as fast as possible. we then want to go back to the | parser and replace the libpg_query-based approach with a more | resilient alternative. as of now, the plan is to implement a | handwritten recursive-descent statement by statement. will | definitely do research to what extend we could leverage | gram.y there, especially to potentially fast-track it. | anarazel wrote: | If some annotations or such would make that easier, it | might be possible to do that upstream. Postgres currently | has hand-generated code for tab-completion in psql and | that's uh, not great (it has gotten large enough that msvc | has trouble compiling the file). | zubairq wrote: | I Really like this! | netcraft wrote: | I can't wait for this. Just getting out a great AST would be | amazing for things like formatters, but I want to build things | like a library that knows everywhere you use a table in a | particular way or reference a column, or linting that can help | you identify problems with the way you do a join or forgot to | apply the right filter for multi-tenant queries. | steinroe wrote: | that is definitely the goal, both a formatter and a linter. we | want to add something like squawk and plpgsql_check directly to | the language server, so you get eslint-like dx. with both the | ast and the database schema at hand, you can basically add any | rule you like. | | and this is far out, but eventually we are maybe even able to | combine the language server with declarative schema management | and have go-to-definition etc working. | | [0] https://github.com/sbdchd/squawk/tree/master [1] | https://github.com/okbob/plpgsql_check | mhh__ wrote: | For a formatter you only need tokens and rules. ASTs are | unnecessary unless you want to do transformations at the same | time. | 2mm3mm3 wrote: | When will PG finally start using tasks instead of | superheavyweight threads? When will they automatically | precompile/hash statements to avoid reparsing? And when | materialized views will be automatically updated? Where is an | alternative to SQL Server Always On Availability Group released | >10 years ago? PostgreSQL is the only database where guys | recommend you to configure a separate connection pooler, because | their database is unable to handle even mid. size number of | connections. | kiwicopple wrote: | it's a great list. No database is perfect, and this is a good | set of features that are missing from Postgres | | that said, the flexibility/extensibility of Postgres still | makes it a great choice. You could easily point at other | databases and produce a list similar (or longer) than this. | Yasuraka wrote: | I'll take vertical sharding until no longer possible and then | dropping all further requests over touching Always On | Availability Groups ever again | anarazel wrote: | When will you contribute? | paulddraper wrote: | lol, salty | epgui wrote: | Out of curiosity, why are parser combinators not the choice | approach for something like this? Is it mainly a performance | thing? | | PS-- this is super exciting! | steinroe wrote: | in our specific case, we needed something handwritten for the | statement-level parser anyways. And the requirements for the LL | parser are very simple: extract individual sql statements from | a source input. we just compare the next n tokens with a list | of tokens from which any statement starts, which is | straightforward to implement with a handwritten LL parser. | | so after all, I would say its a decision based on the special | requirements we have working around the limitation of | libpg_query. I think its also the fastest one, but this was not | the main reason. | epgui wrote: | Thanks for your response! | | I'm a little confused by it though... Maybe I am | misunderstanding what you're telling me. | | I would definitely consider a parser implemented with parser | combinators to be "handwritten". The main difference for me | is that parser combinators allow you to express a language | grammar in a more declarative way, which makes refactoring | and correctness verification much simpler. I think what you | describe as "straightforward" without combinators would be | "completely trivial" with combinators. | | I know from reading recent comp sci papers (particularly in | the Journal of Functional Programming) that "the parser | problem" is not fully solved yet (!) and that there are | various theoretical limits to all currently known formal | approaches. I suspect that performance could be a limiting | factor for certain types of syntaxes, although I'm not sure | if Postgres' would be particularly problematic in this | regard. | steinroe wrote: | sorry, I think I misunderstood your question! | | after all, we did not implement a "real" parser. we just | use libpg_query, the actual Postgres parser, and work | around its limitations as good as possible. The | implementation thereby required maximum flexibility. we | never define any grammar other than "a select statement | starts with a SELECT keyword". ___________________________________________________________________ (page generated 2023-12-08 23:00 UTC)