[HN Gopher] Show HN: Pg_jsonschema - A Postgres extension for JS...
       ___________________________________________________________________
        
       Show HN: Pg_jsonschema - A Postgres extension for JSON validation
        
       pg_jsonschema is a solution we're exploring to allow enforcing more
       structure on json and jsonb typed postgres columns.  We initially
       wrote the extension as an excuse to play with pgx, the rust
       framework for writing postgres extensions. That let us lean on
       existing rust libs for validation (jsonschema), so the extension's
       implementation is only 10 lines of code :)
       https://github.com/supabase/pg_jsonschema/blob/fb7ab09bf6050...
       happy to answer any questions!
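
        A quick example of usage, based on the function described in the
        repo README:
        
        ```
        select json_matches_schema('{"type": "object"}', '{}');
        -- returns true when the instance document conforms to the schema
        ```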
        
       Author : oliverrice
       Score  : 128 points
       Date   : 2022-07-21 14:31 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pella wrote:
       | Thank you!
       | 
        | Some suggestions for the roadmap:
        | 
        | - a Dockerfile (a Dockerfile helps me a lot when trying out new
        | technologies)
        | 
        | - info about compatibility with the new PG 15
       | 
       | - CI/CD
        
         | dewey wrote:
         | Do you really need a Dockerfile for a library that consists of
         | one file?
        
           | pella wrote:
           | I do not have rust installed by default.
           | 
            | I mean, I don't think it's just 1 line:
           | 
           | - this is just a rust part: https://github.com/rust-
           | lang/docker-rust/blob/master/Dockerf...
           | 
           | - plus "postgresql-server-dev-14"
           | 
           | - ???
        
       | cpursley wrote:
       | I just love the work both Supabase & Hasura have done making
       | people aware of how powerful Postgres is.
        
       | shroompasta wrote:
        | How does this validate data with a variable number of keys sharing
        | the same value type, for example a to-do list?
       | 
        | my day-to-day to-do list varies in the number of tasks, but the
        | completion will always be a boolean:
        | 
        |     [
        |       {"task": "do Foo", "completed": false},
        |       {"task": "do Bar", "completed": false},
        |       {"task": "do Baz", "completed": false},
        |       ...
        |     ]
       | 
        | Also, what is the issue with doing schema validation before
        | inserting into the json column? That is what I'm doing with a
        | small microservice with Redis.
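        | 
        | (A schema for that shape can use JSON Schema's "items" keyword,
        | which applies one subschema to every element of an array, so the
        | number of tasks can vary freely. A minimal sketch, assuming the
        | json_matches_schema function from the repo README:)
        | 
        | ```
        | select json_matches_schema(
        |     '{
        |         "type": "array",
        |         "items": {
        |             "type": "object",
        |             "properties": {
        |                 "task": {"type": "string"},
        |                 "completed": {"type": "boolean"}
        |             },
        |             "required": ["task", "completed"]
        |         }
        |     }',
        |     '[{"task": "do Foo", "completed": false}]'
        | );
        | -- true for any number of {task, completed} items
        | ```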
        
       | hardwaresofton wrote:
       | This is awesome -- really excited that Supabase is picking this
       | up with their commitment to open source in general and PG in
       | particular.
       | 
       | Some prior art:
       | 
       | - https://github.com/gavinwahl/postgres-json-schema (mentioned in
       | the repo)
       | 
       | - https://github.com/furstenheim/is_jsonb_valid
       | 
       | pgx[0] is going to be pretty revolutionary for the postgres
       | ecosystem I think -- there is so much functionality that would
       | benefit from happening inside the database and I can't think of a
       | language I want to use at the DB level more than Rust.
       | 
       | [0]: https://github.com/tcdi/pgx
        
         | rad_gruchalski wrote:
         | I was working on a custom YugabyteDB rollout last year. Part of
         | that work was building some custom Postgres extensions for
         | additional multi-tenancy enforcement in a distributed cluster.
         | 
         | I must admit that the whole extension ecosystem in Postgres is
         | amazing, even without any additional language layer.
         | 
         | I don't write C very often but writing Postgres extensions was
         | so easy to approach, it blew my mind!
        
       | dividuum wrote:
       | Not experienced with Postgres and its ecosystem unfortunately,
       | but all those Postgres extensions popping up on hn lately
       | certainly make me envious. To someone with more insight: How
       | risky is it to rely on those extensions? I guess rust handles the
       | 'accidental data corruption or crashes' aspect. How difficult is
       | it to continue to use such an extension once the original author
       | walks away? Is the extension API somewhat (or perfectly?) stable?
        | Given that this is probably mostly used in CHECK constraints, I
        | guess it could be fairly easily removed or replaced in a running
        | installation?
        
         | CodeIsTheEnd wrote:
          | The most common constraint is whether DB-as-a-service
          | offerings support a given extension, since they don't allow
          | installing custom ones. Naturally, choosing to support an
         | extension across a fleet of hundreds of thousands of instances
         | (running dozens of different minor versions) isn't a decision
         | made lightly, so it can take a while for new extensions to be
         | supported.
         | 
         | Supported extensions on AWS RDS:
         | https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLRelea...
         | 
         | Supported extensions on Google Cloud SQL:
         | https://cloud.google.com/sql/docs/postgres/extensions
         | 
         | Supported extensions on Azure Database:
         | https://docs.microsoft.com/en-us/azure/postgresql/single-ser...
        
       | paulmd wrote:
        | Nice work. However, I am structurally dubious of putting too much
        | functionality onto a classical centralized RDBMS, since it can't
        | be scaled out if performance becomes a problem. It adds CPU load,
        | and it ties up a connection, which is a large memory load (as
        | implemented in postgres, connections are "expensive"), and since
        | this occurs inside a transaction, it holds locks/etc as well. I
        | know it's all compiled native code, so it's about as fast as it
        | can be, but it's a question of whether this is the right place to
        | do that, as a general concern.
       | 
       | I'd strongly prefer to have the application layer do generic
       | json-schema validation since you can spawn arbitrary containers
       | to spread the load. Obviously some things are unavoidable if you
       | want to maintain foreign-key constraints or db-level check
       | constraints/etc but people frown on check constraints sometimes
       | as well. Semantic validity should be checked before it gets to
       | the DB.
       | 
       | I was exploring a project with JSON generation views inside the
       | database for coupling the DB directly to SOLR for direct data
       | import, and while it worked fine (and performed fine with toy
       | problems) that was just always my concern... even there where
        | it's not holding write locks/etc, how much harder are you hitting
        | the DB for stuff that, ultimately, can be done slower but more
        | scalably in an application container?
        | 
        | YAGNI, I know, cross the bridge when it comes; but just as a
        | blanket architectural concern, that's not really where it
        | belongs imo.
       | 
       | In my case at least, probably it's something that could be pushed
       | off to followers in a leader-follower cluster as a kind of read
        | replica, but I dunno if that's how it's implemented or not. "Read
        | replicas" are a lot more fleshed out in Citus, Enterprise, and
        | the other commercial offerings built on raw Postgres iirc.
        
         | billythemaniam wrote:
          | All this does is make writing a CHECK constraint for a JSON
          | blob relatively straightforward. And yes, you should use CHECKs
          | in the database. In practice, it is very easy for application
          | code to either not check or to screw it up. It is much harder
          | to find and clean up bad data in the database than to prevent
          | it from being inserted. By making it schema config, it is much
          | easier for other engineers to audit and more likely to be
          | correct.
         | 
         | If CHECKs are causing perf issues, then okay maybe put the
         | problematic ones somewhere else.
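          | 
          | (A minimal sketch of the pattern; the table and column names
          | are hypothetical, and the function name comes from the repo
          | README:)
          | 
          | ```
          | create table todo(
          |     id int primary key,
          |     items json,
          |     check (
          |         json_matches_schema(
          |             '{"type": "array", "items": {"type": "object"}}',
          |             items
          |         )
          |     )
          | );
          | 
          | -- rejected by the CHECK: 1 is not an object
          | insert into todo(id, items) values (1, '[1]');
          | ```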
        
         | sgk284 wrote:
         | If data validation doesn't belong in the database, then what
         | does? At that point you're treating your RDBMS as a fancy
         | filesystem and drastically underutilizing the tool.
         | 
         | By centralizing data validation, it removes many potential
          | failure and inconsistency scenarios (w.r.t. different services
         | validating things differently).
         | 
         | Worrying about CPU, without seeing if it's a real problem for
         | your use case, is a premature optimization. Similar to worrying
         | about foreign key constraint checks being too expensive. This
          | is rarely the case, but if it winds up being a problem, you can
          | relax it later and move the check elsewhere in your stack (or
          | remove it entirely).
        
       | BrandiATMuhkuh wrote:
       | This is really cool. This will make it much simpler to convert
       | from firestore to supabase. I think that's the last missing
       | feature that firestore provided which supabase didn't.
       | 
        | We are already running a sync process between firestore and
        | postgres so we can do aggregations on JSON data. At this point
        | it's only a matter of time before we move to supabase.
        
       | rubenfiszel wrote:
       | This is absolutely brilliant.
       | 
        | In windmill, https://github.com/windmill-labs/windmill (self-
        | hostable AWS Lambda, OSS AGPLv3), we infer the jsonschema of your
        | script by doing static analysis, but so far we were not
        | validating the payload itself; if your script failed because of
        | an incorrect payload, that was your problem. Now, without any
        | additional effort, I will be able to add validation and great
        | error reporting "for free".
        
         | tmp_anon_22 wrote:
          | That project looks awesome; there are so many opportunities to
         | take entrenched cloud products that leave a lot to be desired
         | and turn them into smaller businesses that better serve the
         | customers that AWS largely ignores due to their preference for
         | volume.
        
           | rubenfiszel wrote:
            | Thanks, the self-hostable AWS lambda is actually just one
            | part. On top you have a flow builder and platform aiming to
            | provide an OSS self-hostable alternative to solutions like
            | pipedream, airplane, and aws step. I am not necessarily
            | aiming to replace AWS per se, because a great way to host it
            | is EC2; what I want to reduce is lock-in into proprietary
            | services and architecture complexity. Not every lambda needs
            | to scale to 1M req/s.
            | 
            | Supabase is actually a great comparable, complement, and
            | inspiration. Ideally, windmill allows tech teams and startups
            | to focus on the business logic of their internal tools, not
            | the infra to run it.
        
             | tmp_anon_22 wrote:
              | I particularly appreciate the project structure of the
              | linked repo. It's good to see actual production application
              | code beyond a hello-world todo-style app for languages like
             | Rust and toolkits like Svelte. I look forward to going
             | through the various pieces and learning them better by
             | example.
        
       | warmwaffles wrote:
        | If adding this check to an existing table with millions of
        | records, will it scan the entire table checking all of the
        | records, or just check when records are inserted or updated?
        
         | oliverrice wrote:
         | if you add a check constraint to an existing table only
         | new/updated records are checked. Existing records will not be
         | checked
        
           | mjw1007 wrote:
           | What you describe is not the default behaviour with
           | PostgreSQL (though it is available, via ALTER TABLE ADD
           | CONSTRAINT ... NOT VALID).
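            | 
            | (Concretely, with stock PostgreSQL DDL; the table, column,
            | and constraint names here are hypothetical:)
            | 
            | ```
            | -- default: the whole table is scanned, and existing rows
            | -- must already pass the check
            | alter table account add constraint metadata_is_valid
            |     check (json_matches_schema('{"type": "object"}', metadata));
            | 
            | -- with NOT VALID, the scan is skipped; only new/updated
            | -- rows are checked from here on
            | alter table account add constraint metadata_is_valid
            |     check (json_matches_schema('{"type": "object"}', metadata))
            |     not valid;
            | 
            | -- later, verify the old rows in one pass
            | alter table account validate constraint metadata_is_valid;
            | ```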
        
       | mekoka wrote:
       | As an aside, I'm a long time backend developer writing my first
       | mobile app with Dart/Flutter. I tried the popular backend
       | recommendation in that ecosystem. After weeks wasted on it, I
       | googled _" {PopularBackend} alternatives"_ out of frustration and
       | this thing called "Supabase" appeared in the results. What a
       | breath of fresh air it's been. It uses Postgres as a backend
        | (with PostgREST), which means I can put all those skills to good
       | use (you can go far with all the batteries Postgres comes
       | equipped with: row-level security, triggers, functions, etc).
       | It's open source and the free tier lets me do pretty much all I
       | need for development (and probably for a production MVP). Which
       | means I don't have to worry about "after questions" until after.
       | 
        | Supabase team, keep doing what you're doing!
        
         | kiwicopple wrote:
         | thank you for your kind words. I'll share this with the team -
         | they will really enjoy reading it
         | 
         | good luck building, and make sure to share the critical
         | feedback with us too. it really helps to improve the product
        
           | mekoka wrote:
           | > make sure to share the critical feedback with us too. it
           | really helps to improve the product
           | 
           | Will do.
        
       | legalcorrection wrote:
       | So we've come full circle and now JSON is just XML with lighter
       | syntax.
        
         | drewda wrote:
         | To a certain degree, yes. Agreed that it's funny and ironic.
         | But also from an overall software-engineering and developer-
         | experience perspective, this isn't the worst outcome...
        
         | alphalima wrote:
         | I don't view that as a loss. JSON by itself is just a string
          | with no context information, so either the context is in the
         | programmer's head, in a human-readable document, or in a schema
         | file.
         | 
         | Adding a schema to it seems natural to me (as someone who
         | prefers strong-typing), and the JSONSchema implementation seems
         | pretty lightweight and sensible.
         | 
          | I'd go even further down the route of mining XML's legacy and
          | add better XPath 2.0 + XQuery-style support. XQuery was a
          | great tool, so expressive. Though it did spawn XQueryX [1],
          | which I'm still not sure was a joke or not.
         | 
         | [1]
         | https://www.oreilly.com/library/view/xquery/0596006349/ch22s...
        
           | legalcorrection wrote:
           | It's funny to think about where we'd be if someone had come
           | up with a better xml syntax before json took off.
        
           | michelpp wrote:
           | Postgres does support jsonpath syntax which covers some of
           | those cases:
           | 
           | https://www.postgresql.org/docs/current/datatype-
           | json.html#D...
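            | 
            | (For example, filtering array elements by a predicate,
            | available since Postgres 12:)
            | 
            | ```
            | select jsonb_path_query(
            |     '{"tags": ["a", "b", 3]}',
            |     '$.tags[*] ? (@.type() == "string")'
            | );
            | -- returns "a" and "b"
            | ```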
        
         | yakubin wrote:
         | With the difference that it's not a markup language, and it
         | doesn't have those pesky namespaces.
        
         | dragonwriter wrote:
          | Nah, we still don't have JSL (either JSLT or JSL:FO). And,
          | given that JSON isn't markup, I don't see that as being likely,
          | at least not JSL:FO.
        
       | edmundsauto wrote:
       | I have a very dumb question: why would you use this instead of a
       | traditional schema? I thought the value of json columns was to be
        | partially schemaless and flexible.
       | 
       | What is the use case for this versus normal column definitions,
       | if you're looking to enforce schemas?
        
         | spprashant wrote:
         | Not the author.
         | 
          | Developers sometimes really just want to dump the data as
          | JSON. For them this means not having to write a lot of
          | boilerplate ORM or SQL templates, and shipping trivial
          | features quickly.
          | 
          | For example, user UI preferences are a good candidate for
          | something like this. You probably don't want to add a new
          | column just to remember the status of a checkbox from when the
          | user last logged in.
          | 
          | As a DBA, you probably still want to define a schema for this
          | data, so as to not cause unexpected web app crashes. It ensures
          | some level of data consistency without increasing the
          | maintenance overhead.
          | 
          | Obviously you wouldn't use it for business-critical data, in
          | my opinion.
        
         | shroompasta wrote:
         | I'm using a pg json column to store data as a second source of
         | truth for data that is primarily held on a separate
         | microservice.
         | 
         | The schema on that end is pretty intricate but to prevent
         | hitting two services for certain types of data, we just dump it
         | to a json column.
         | 
          | Furthermore, for a personal project of mine to help me with
          | productivity / daily schedules, I'm using a json column for a
          | to-do list in the schema of
          | 
          | { [some_todo_item]: boolean }
          | 
          | which can't traditionally be represented in pg columns, as the
          | to-do items are variable.
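          | 
          | (That shape, arbitrary keys with a fixed value type, maps onto
          | JSON Schema's "additionalProperties" keyword; a minimal
          | sketch, assuming the json_matches_schema function from the
          | repo README:)
          | 
          | ```
          | select json_matches_schema(
          |     '{"type": "object",
          |       "additionalProperties": {"type": "boolean"}}',
          |     '{"do Foo": true, "do Bar": false}'
          | );
          | -- true: any keys are allowed, values must be booleans
          | ```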
        
       | [deleted]
        
       | banashark wrote:
       | Very cool!
       | 
       | I remember when kicking the tires on postgrest/postgraphile that
       | I found validation and error handling to be one of the less
       | intuitive areas. Not the actual field-level constraints, but how
       | to adapt it to fit a fast-fail vs slow-fail model.
       | 
        | When I had attempted this before, the only ergonomic option was
        | fast-fail (the first check constraint violated would bubble the
        | error upward) rather than slow-fail (collect all invalid fields
        | and return the collection of errors, which IME is more common on
        | average web forms or api requests).
       | 
       | Looking at the single code file and tests, I see only singular
       | field errors. Has a more ergonomic approach to validation-error
       | collection been developed other than writing a large function to
       | iterate the new record fieldwise against the schema?
        
         | oliverrice wrote:
         | > fast-fail (the first check constraint violated would bubble
         | the error upward) rather than slow-fail (collect all invalid
         | fields and return the collection of errors, which IME is more
         | common on average web forms or api requests)
         | 
         | ... > Has a more ergonomic approach to validation-error
         | collection been developed other than writing a large function
         | to iterate the new record fieldwise against the schema?
         | 
          | That's an interesting problem that makes sense in the context of
         | form validation.
         | 
         | re: have the ergonomics improved -> no
         | 
         | but you could hack that behavior together like this:
         | 
          | ```
          | create or replace function handle_errors(errors text[])
          |     returns bool
          |     language plpgsql
          | as $$
          | begin
          |     -- passing checks contribute nulls below; strip them so a
          |     -- fully valid row yields an empty array
          |     errors := array_remove(errors, null);
          |     if errors = array[]::text[] then
          |         return true;
          |     end if;
          |     raise exception using
          |         errcode = '22000',
          |         message = (select string_agg(x, ', ')
          |                    from unnest(errors) v(x));
          | end;
          | $$;
          | 
          | create table account(
          |     id int primary key,
          |     email text,
          |     first_name text,
          |     check (
          |         handle_errors(
          |             array[]::text[]
          |             || case when email is null
          |                then 'email must not be empty' else null end
          |             || case when first_name is null
          |                then 'first_name must not be empty' else null end
          |         )
          |     )
          | );
          | 
          | insert into account(id, email, first_name)
          | values (1, null, null);
          | 
          | -- ERROR: email must not be empty, first_name must not be empty
          | ```
         | 
         | not that I'd recommend it :)
        
           | memco wrote:
            | Even in the fast-fail case it would help if it told you which
           | constraint failed. In the example from the readme it says the
           | row failed validation, but not why. Would be nice if it said
           | something like "items of `tags` must be type string, but
           | integer was given". Or even just that it was an invalid item
           | type found in tags.
        
           | Tostino wrote:
            | Heh, I really wish they supported this natively in the DB.
            | This would be super useful for CRUD stuff if you could just
            | set a GUC to get the fail-slow mode, and get a list of
            | errors returned to the client.
        
       | mdaniel wrote:
       | It would be valuable to know _which_ JSON-Schema it supports,
       | since there are currently 4 different versions that differ in
       | their capabilities (as one might expect). Related to that, does
       | it honor the  "$schema" key allowing the schema to declare which
       | version it is?
       | 
       | The postgres-json-schema alternative that's mentioned in the repo
       | also ships with what appears to be a conformance test suite; does
       | this carry the same, or was the focus more on speed?
        
         | dragonwriter wrote:
         | > It would be valuable to know which JSON-Schema it supports,
         | since there are currently 4 different versions that differ in
         | their capabilities (as one might expect).
         | 
         | There are 9 drafts, at least six of which seem to have support
         | in implementations. (Plus there is a different variation [an
         | "extended subset" of Draft 5] defined and used within OpenAPI
         | 3.0.)
        
         | oliverrice wrote:
         | > 4 different versions that differ in their capabilities
         | 
          | pg_jsonschema is a (very) thin wrapper around
          | https://docs.rs/jsonschema/latest/jsonschema/ and it supports
          | drafts 4, 6, and 7
         | 
         | > appears to be a conformance test suite; does this carry the
         | same, or was the focus more on speed?
         | 
         | it was mostly an excuse to play with pgx that went surprisingly
         | well. The performance is very good but safety was our main
         | interest.
        
       ___________________________________________________________________
       (page generated 2022-07-21 23:00 UTC)