[HN Gopher] PostgreSQL Templates
       ___________________________________________________________________
        
       PostgreSQL Templates
        
       Author : kiwicopple
       Score  : 102 points
       Date   : 2020-07-09 16:16 UTC (6 hours ago)
        
 (HTM) web link (supabase.io)
 (TXT) w3m dump (supabase.io)
        
       | thinkingkong wrote:
        | I've used this in the past for testing, depending on how much
        | seeding was required prior to the test and how much parallelism
        | was needed. Other than that I haven't been able to find any
        | useful real-world scenarios for this functionality. I suppose
        | if you wanted to have a source database and a bunch of
        | children, and could prevent connections to the source, it might
        | be helpful... somehow?
        
         | Squid_Tamer wrote:
          | I worked on a SaaS project whose customers are companies in an
         | old-fashioned industry. A common requirement was that their
         | data shouldn't be stored in the same database tables as any
         | other customer. They were fine with it being co-located on the
         | same cloud servers, using the same credentials... as long as
         | their data wasn't on the same tables. So I used postgres
         | templates to make a database per customer.
         | 
         | Didn't make much sense to me either but whatever pays the bills
         | I guess.
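          | 
          | For reference, cloning a per-customer database from a
          | prepared template is a one-liner (names here are
          | illustrative):
          | 
          |     createdb -T customer_base customer_acme
          | 
          | which is the same thing as CREATE DATABASE customer_acme
          | TEMPLATE customer_base; via psql.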
        
           | outworlder wrote:
            | Just don't fall into the trap of creating a database schema
            | per customer. Having different databases is better (as TXID
            | is not shared).
        
             | chrismorgan wrote:
             | Can you elucidate or provide some reference about this? I'm
             | not familiar with any problems in this space (it's not
             | clear to me what problems a shared TXID could cause), and
             | schemas seemed neat.
        
               | outworlder wrote:
               | That all depends on how many customers you expect to have
               | in a single database, and how many tables each one needs.
               | EDIT: also, transaction volume.
               | 
               | Ok, so let's imagine your application requires 100
               | database tables.
               | 
               | If you have 1 customer, you have 100 tables. Every
               | transaction performed in this database (including
               | updates) will increase the age(datfrozenxid) by one. You
               | now have 100 tables you need to vacuum.
               | 
                | If you only had one schema, that's no problem at all.
                | Even with the defaults and a high number of writes,
                | the default 3 autovacuum workers will quickly deal
                | with it, even though each worker can only vacuum one
                | table at a time.
               | 
               | Now your business is growing. 1 customer becomes 10,
               | 100... 10000?
               | 
                | At 10k customers, you now have 1 million tables (maybe
                | more, as TOAST tables are counted). Every single
                | transaction increases the 'age' of every table by one.
                | It doesn't matter that 99% of all tables have a low
                | TXID age; the overall DB "age" is that of the table
                | with the highest one.
               | 
                | The default 3 autovacuum workers will definitely NOT
                | cut it at this point (even with PG launching anti-
                | wraparound autovacuums). You can augment them by
                | "manually" running vacuum, but in that case the
                | overall database age will only decrease once it's done
                | vacuuming all 1 million tables, so start early (before
                | you reach the ~2.1 billion mark).
               | 
               | Alternatively, you could write a script that will sort
               | tables by age and vacuum individually. This may not help
               | as much as it seems, as the distribution will depend on
               | how much work your autovacuum workers have been able to
               | do so far. Not to mention, even getting the list of
               | tables will be slow at that point.
               | 
                | So now you have to drastically increase the number of
                | workers (and other settings, like work_mem) - which may
                | also mean higher costs - and you'll still have to watch
                | this like a hawk (or rather, your monitoring system has
                | to). There's no clear indication that the workers are
                | falling behind; you can only get an approximation by
                | trending TXID age.
               | 
               | You can make this work, but it is a pain. Even more so if
               | you haven't prepared for it. For a handful of customers
               | or a small number of tables or a small number of
               | transactions this may not matter. Our production systems
               | (~200 tables, excluding TOAST) started to fall behind
                | after a few thousand customers. We have had at least
                | one outage because we didn't even track this at the
                | time and the database shut down. Hindsight is 20/20,
                | but nobody remembered to add this to the monitoring
                | system.
               | 
                | Another, unrelated problem with multiple schemas:
                | database migrations are much more complex now, as you
                | have to upgrade every schema. This is both a blessing
                | and a curse. Database dumps also tend to take forever.
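                | 
                | If you want to trend this, the ages are all visible in
                | the system catalogs; a rough check looks something
                | like this (app_db stands in for one of your own
                | databases):
                | 
                |     # cluster-wide: oldest database first
                |     psql -c "SELECT datname, age(datfrozenxid)
                |         FROM pg_database ORDER BY 2 DESC"
                |     # within one database: tables holding the age back
                |     psql -d app_db -c "SELECT relname, age(relfrozenxid)
                |         FROM pg_class WHERE relkind = 'r'
                |         ORDER BY 2 DESC LIMIT 10"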
        
           | mulmen wrote:
            | It's really hard to join customer A's addresses table with
            | customer B's invoices when they are in different databases.
           | 
           | There's no reason to join across customers in this situation
           | and it saves you from making that type of mistake.
           | 
           | Colocating this on the same database and using the same
           | users/credentials just makes it easier operationally. It's
           | about isolating the data.
        
           | Tostino wrote:
            | There are good reasons for that. At an old company, over
            | the couple of years I was there, there were multiple (~5)
            | instances of clients asking for custom reports or data
            | dumps and being sent data from other clients (as well as
            | their own) because a tenant id was missed in some joins or
            | something similar. Yes, sloppy practices all around allowed
           | that to happen. That being said, it would have been much less
           | likely to happen if data segregation were enforced at the
           | database level (using either RLS, or separate
           | database/schema/tables per-client).
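            | 
            | As a rough sketch of the RLS variant (table, column and
            | setting names are made up for illustration), the
            | enforcement lives in the database itself:
            | 
            |     psql -d app_db -c "
            |         ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
            |         ALTER TABLE invoices FORCE ROW LEVEL SECURITY;
            |         -- queries only ever see the current tenant's rows
            |         CREATE POLICY tenant_isolation ON invoices
            |             USING (tenant_id =
            |                 current_setting('app.tenant_id')::bigint);
            |     "
            | 
            | Each session then does SET app.tenant_id = '...' once, and
            | a join that forgets the tenant filter can no longer leak
            | other tenants' rows.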
        
       | thinkingemote wrote:
        | I use templates to create a "quick save" backup of a database
        | during development:
       | 
       | createdb -T old_db new_db
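        | 
        | and restore it by dropping the working copy and cloning the
        | snapshot back (PostgreSQL refuses to drop or copy a database
        | that still has open connections, so disconnect first):
        | 
        | dropdb old_db && createdb -T new_db old_db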
        
         | silviogutierrez wrote:
          | Neat trick. Is this any faster than a pg_dump and restore
          | cycle? Note: there's plenty of optimization you can apply to
          | pg_dump and pg_restore; none of it is applied below.
         | 
          | Like so:
          | 
          |     function save_database() {
          |         mkdir -p /tmp/database-snapshots
          |         DATABASE_NAME=$PROJECT_NAME
          |         pg_dump -Fc "$DATABASE_NAME" \
          |             >"/tmp/database-snapshots/$DATABASE_NAME.dump"
          |     }
          | 
          |     function restore_database() {
          |         DATABASE_NAME=$PROJECT_NAME
          |         DATABASE_DUMP="/tmp/database-snapshots/$DATABASE_NAME.dump"
          |         if [ ! -f "$DATABASE_DUMP" ]; then
          |             echo "No dump to restore"
          |             return
          |         else
          |             psql postgres -c 'SELECT pg_terminate_backend(pid)
          |                 FROM pg_stat_activity WHERE pid <> pg_backend_pid()'
          |             psql postgres -c "DROP DATABASE IF EXISTS \"$DATABASE_NAME\""
          |             psql postgres -c "CREATE DATABASE \"$DATABASE_NAME\""
          |             pg_restore -Fc -j 8 -d "$DATABASE_NAME" "$DATABASE_DUMP"
          |         fi
          |     }
        
           | gkop wrote:
           | Yes, it's significantly faster because it doesn't require
           | rebuilding indexes (among other reasons probably). Downside
           | is it consumes more disk space.
           | 
           | Stellar is a tool that wraps the template mechanism, and has
           | some benchmarks: https://github.com/fastmonkeys/stellar
        
             | jpitz wrote:
             | My understanding has always been that creating a database
             | is essentially a series of file copy operations from the
             | template.
        
               | gkop wrote:
                | That may be true! The point is the template has the
                | indexes already built and ready to copy, whereas the
                | dump file does not.
        
       | majodev wrote:
        | We are using PostgreSQL templates to speed up our integration
        | tests. This allows us to have full-blown (and non-mocked)
        | databases that are isolated for each parallel-running test.
       | 
        | We have recently open-sourced our core server managing this (Go):
       | https://github.com/allaboutapps/integresql
       | 
       | Projects already utilizing Go and PostgreSQL may easily adapt
       | this concept for their testing via
       | https://github.com/allaboutapps/integresql-client-go
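        | 
        | The underlying mechanics are plain PostgreSQL; a minimal
        | sketch of the strategy (all names illustrative) looks like:
        | 
        |     # build the template once: migrate + seed
        |     createdb test_template
        |     psql -d test_template -f migrations_and_fixtures.sql
        |     # freeze it so nothing connects to (and dirties) it
        |     psql -c "ALTER DATABASE test_template
        |         WITH IS_TEMPLATE true ALLOW_CONNECTIONS false"
        |     # each parallel test worker gets a cheap clone
        |     createdb -T test_template test_worker_1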
        
         | progval wrote:
         | How does this compare (performance-wise) to simply truncating
         | all tables?
        
           | majodev wrote:
           | We did several benchmarks back in the day and it definitely
           | depends on the size of your migrations + fixtures that you
           | need in every test. Performance increases were simply huge
           | for us, despite the serial execution of our tests in our
           | legacy backends.
           | 
            | A new database scaffolded from a predefined template
            | becomes available in milliseconds to a few seconds,
            | whereas truncating, migrating up and seeding (fixtures)
            | would always take several seconds. We initially
            | experimented with recreating these test databases directly
            | via SQL dumps, then binary dumps, until we finally
            | realized that PostgreSQL templates are the best option in
            | these testing scenarios.
           | 
           | Sorry, I've got no benchmarks to share at this time.
        
         | miohtama wrote:
          | This is super nice! I have seen similar solutions for Python in
          | the past, but having an easy API + Docker makes this generic
          | and language agnostic.
         | 
          | Have you looked at tweaking PostgreSQL parameters to make it
          | faster for test runs, e.g. by disabling disk writes?
        
           | majodev wrote:
           | Yep, disabling fsync, synchronous_commit and full_page_writes
           | _ONLY_ for this PostgreSQL integration testing strategy gives
           | us an additional boost of ~35% in overall testing speed.
           | https://github.com/allaboutapps/go-
           | starter/blob/master/docke...
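            | 
            | If the test database runs in Docker, those flags can be
            | passed straight to the server process; a minimal sketch
            | (image tag and password are arbitrary):
            | 
            |     # NEVER do this in production: fsync=off trades
            |     # durability for speed
            |     docker run -d -e POSTGRES_PASSWORD=test postgres:12-alpine \
            |         postgres -c fsync=off -c synchronous_commit=off \
            |         -c full_page_writes=off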
           | 
           | Regarding being language agnostic: YES, clients for languages
            | other than Go are very welcome. The integresql server
            | actually provides a trivial RESTful JSON API and should be
            | fairly easy to integrate with any language. This is our
            | second iteration of this PostgreSQL test strategy; the
            | previous version came embedded in our legacy Node.js backend
            | stack and wasn't able to handle tests in parallel, which
            | became an additional hard requirement for us.
           | 
            | See our go-starter project for info on how we are using
            | integresql, e.g. the WithTestDatabase/WithTestServer utility
            | functions we use in all tests to "inject" the specific
           | database: https://github.com/allaboutapps/go-
           | starter/blob/master/inter...
        
       | FanaHOVA wrote:
       | Is this similar to doing something like `rails db:schema:load`
       | and then `rails db:seed`? i.e. set a brand new db to a certain
       | schema + preload a set of fixtures into it?
        
         | mccolin wrote:
         | The effect will be relatively similar -- you arrive at a new
         | database with the proper schema/structure in place, but the
         | difference is in execution. Doing this _within_ the database
         | engine is likely 10x faster than loading the schema through a
         | client library making a series of CREATE and ALTER statements
          | (a la the schema:load rake task).
        
       | KingOfCoders wrote:
        | We're using PostgreSQL templates to create a database for
        | every integration test worker (Jest).
       | 
        | This way every worker works on its own database. The speedup
        | was 5x-7x as tests now run in parallel.
        
         | jayd16 wrote:
         | Do you store test data in the template?
         | 
         | My team uses a migration framework (in our case Flyway) such
         | that we can easily create completely fresh DBs on a factory
          | fresh (+ user roles) instance of Postgres. We do this for our
         | testing as well. This means the schema tools are always tested
         | as well. Is there a benefit to using templates? They're faster
         | perhaps?
        
           | flavor8 wrote:
           | We use flyway to create a template. Then each test that
           | mutates the DB automatically invokes a cleanup routine which
           | recreates the test DB from the template. You get into hot
           | water around the connection pooling, but it's much, much
           | faster than refreshing using flyway.
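            | 
            | The hot water comes from PostgreSQL refusing to drop or
            | clone a database while other sessions are connected, so
            | the cleanup routine has to kick pooled connections first,
            | along these lines (test_db stands in for the real name):
            | 
            |     psql -c "SELECT pg_terminate_backend(pid)
            |         FROM pg_stat_activity
            |         WHERE datname = 'test_db' AND pid <> pg_backend_pid()"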
        
           | KingOfCoders wrote:
           | For 10 workers it takes <0.5 seconds to create 10 databases
           | from the template. Test data is created from within tests.
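            | 
            | A loop along these lines is all it takes (template and
            | database names illustrative):
            | 
            |     for i in $(seq 1 10); do
            |         createdb -T app_template "app_test_$i"
            |     done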
        
           | tln wrote:
           | I'm using templates as well for the speedup in testing. We
           | build a postgresql docker image with the migrations + seed
           | data, then create a few databases for parallel testing.
           | 
            | When migration or seed files have changed, build rules
            | rebuild the image. That way the price of migration (about 2
            | minutes for our Django app) isn't paid on every test run.
           | 
            | Creating a DB from the template takes about 2 seconds.
        
       ___________________________________________________________________
       (page generated 2020-07-09 23:01 UTC)