[HN Gopher] PostgreSQL Templates
___________________________________________________________________

PostgreSQL Templates

Author : kiwicopple
Score : 102 points
Date : 2020-07-09 16:16 UTC (6 hours ago)

(HTM) web link (supabase.io)
(TXT) w3m dump (supabase.io)

  | thinkingkong wrote:
  | I've used this in the past for testing, depending on how much seeding was required prior to the test and how much parallelism was required. Other than that I haven't been able to find any useful real-world scenarios for this functionality. I suppose if you wanted to have a source database and a bunch of children, and could prevent connections to the source, it might be helpful... somehow?
  | 
  | Squid_Tamer wrote:
  | I worked on a SaaS project whose customers are companies in an old-fashioned industry. A common requirement was that their data shouldn't be stored in the same database tables as any other customer. They were fine with it being co-located on the same cloud servers, using the same credentials... as long as their data wasn't in the same tables. So I used postgres templates to make a database per customer.
  | 
  | Didn't make much sense to me either, but whatever pays the bills I guess.
  | 
  | outworlder wrote:
  | Just don't fall into the trap of creating a database schema per customer. Having different databases is better (as the TXID is not shared).
  | 
  | chrismorgan wrote:
  | Can you elucidate or provide some reference about this? I'm not familiar with any problems in this space (it's not clear to me what problems a shared TXID could cause), and schemas seemed neat.
  | 
  | outworlder wrote:
  | That all depends on how many customers you expect to have in a single database, and how many tables each one needs. EDIT: also, transaction volume.
  | 
  | OK, so let's imagine your application requires 100 database tables.
  | 
  | If you have 1 customer, you have 100 tables. Every transaction performed in this database (including updates) will increase age(datfrozenxid) by one. You now have 100 tables you need to vacuum.
  | 
  | If you only had one schema, that's no problem at all. Even with the defaults and a high number of writes, the default 3 autovacuum workers will quickly deal with it, even though they can only vacuum one table at a time.
  | 
  | Now your business is growing. 1 customer becomes 10, 100... 10000?
  | 
  | At 10k customers, you now have 1 million tables (maybe more, as TOAST tables are counted). Every single transaction increases the 'age' of every table by one. It doesn't matter that 99% of all tables have low TXID ages; the overall DB "age" is that of the table with the highest count.
  | 
  | The default 3 autovacuum workers will definitely NOT cut it at this point (even with PG launching anti-wraparound autovacuums). You can augment them by "manually" running vacuum, but in that case the overall database age will only decrease once it's done vacuuming all 1 million tables, so start early (before you reach the ~2.1 billion mark).
  | 
  | Alternatively, you could write a script that sorts tables by age and vacuums them individually. This may not help as much as it seems, as the distribution will depend on how much work your autovacuum workers have been able to do so far. Not to mention, even getting the list of tables will be slow at that point.
  | 
  | So now you have to drastically increase the number of workers (and other settings, like work_mem) - which may also mean higher costs - and you'll still have to watch this like a hawk (or rather, your monitoring system has to). There's no clear indication that the workers are falling behind; you can only get an approximation by trending TXID age.
  | 
  | You can make this work, but it is a pain. Even more so if you haven't prepared for it. For a handful of customers, or a small number of tables, or a small number of transactions, this may not matter. Our production systems (~200 tables, excluding TOAST) started to fall behind after a few thousand customers. We have had at least one outage because we didn't even track this at the time and the database shut down. Hindsight is 20/20, but nobody remembered to add this to the monitoring system.
  | 
  | Another unrelated problem with multiple schemas: database migrations are much more complex now, you have to upgrade all schemas. This is both a blessing and a curse. Database dumps also tend to take forever.
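A minimal sketch of the "sort tables by age" query described above, using the standard pg_class/pg_namespace catalogs (the LIMIT is an arbitrary example value):

      -- List the relations with the oldest frozen transaction IDs first,
      -- so they can be vacuumed individually before wraparound becomes a risk.
      SELECT n.nspname AS schema_name,
             c.relname AS table_name,
             age(c.relfrozenxid) AS xid_age
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
      WHERE c.relkind IN ('r', 't', 'm')  -- ordinary tables, TOAST tables, matviews
      ORDER BY age(c.relfrozenxid) DESC
      LIMIT 50;

As the comment above notes, even this catalog scan gets slow once the table count runs into the millions.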
  | mulmen wrote:
  | It's really hard to join customer A's addresses table with customer B's invoices when they are in different databases.
  | 
  | There's no reason to join across customers in this situation, and it saves you from making that type of mistake.
  | 
  | Colocating this on the same database and using the same users/credentials just makes it easier operationally. It's about isolating the data.
  | 
  | Tostino wrote:
  | There are good reasons for that. At an old company, over the couple of years I was around there were multiple (~5) instances of clients asking for custom reports or data dumps and being sent data from other clients (as well as their own) because a tenant id was missed in some joins, or something similar. Yes, sloppy practices all around allowed that to happen. That being said, it would have been much less likely to happen if data segregation were enforced at the database level (using either RLS, or separate databases/schemas/tables per client).
  | 
  | thinkingemote wrote:
  | I use templates to create a "quick save" backup of a database quickly during development:
  | 
  |       createdb -T old_db new_db
  | 
  | silviogutierrez wrote:
  | Neat trick. Is this any faster than pg_dump and a restore? Note: there's plenty of optimization you can apply to pg_dump and pg_restore, none applied below.
  | 
  | Like so:
  | 
  |       function save_database() {
  |           mkdir -p /tmp/database-snapshots
  |           DATABASE_NAME=$PROJECT_NAME
  |           pg_dump -Fc "$DATABASE_NAME" >"/tmp/database-snapshots/$DATABASE_NAME.dump"
  |       }
  | 
  |       function restore_database() {
  |           DATABASE_NAME=$PROJECT_NAME
  |           DATABASE_DUMP="/tmp/database-snapshots/$DATABASE_NAME.dump"
  |           if [ ! -f "$DATABASE_DUMP" ]; then
  |               echo "No dump to restore"
  |               return
  |           else
  |               psql postgres -c 'SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid()'
  |               psql postgres -c "DROP DATABASE IF EXISTS \"$DATABASE_NAME\""
  |               psql postgres -c "CREATE DATABASE \"$DATABASE_NAME\""
  |               pg_restore -Fc -j 8 -d "$DATABASE_NAME" "$DATABASE_DUMP"
  |           fi
  |       }
  | 
  | gkop wrote:
  | Yes, it's significantly faster, because it doesn't require rebuilding indexes (among other reasons, probably). The downside is that it consumes more disk space.
  | 
  | Stellar is a tool that wraps the template mechanism, and has some benchmarks: https://github.com/fastmonkeys/stellar
  | 
  | jpitz wrote:
  | My understanding has always been that creating a database from a template is essentially a series of file copy operations from the template.
  | 
  | gkop wrote:
  | That may be true! The point is the template has the indexes already built, ready to copy, where the dump file does not.
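For comparison, a minimal sketch of the same "quick save"/restore cycle using the template mechanism instead of a dump (the database names are examples, and the source database must have no other active connections while it is being copied):

      -- "save": clone the current state; indexes are copied as-is at the file level
      CREATE DATABASE app_db_snapshot TEMPLATE app_db;

      -- "restore": drop the working copy and re-clone the snapshot
      DROP DATABASE app_db;
      CREATE DATABASE app_db TEMPLATE app_db_snapshot;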
  | majodev wrote:
  | We are using PostgreSQL templates to speed up our integration tests. This allows us to have full-blown (and non-mocked) databases which are isolated to each parallel running test.
  | 
  | We have recently open-sourced our core server managing this (Go): https://github.com/allaboutapps/integresql
  | 
  | Projects already utilizing Go and PostgreSQL may easily adapt this concept for their testing via https://github.com/allaboutapps/integresql-client-go
  | 
  | progval wrote:
  | How does this compare (performance-wise) to simply truncating all tables?
  | 
  | majodev wrote:
  | We did several benchmarks back in the day, and it definitely depends on the size of the migrations + fixtures that you need in every test. Performance increases were simply huge for us, despite the serial execution of our tests in our legacy backends.
  | 
  | A new database scaffolded from a predefined template becomes available in milliseconds to a few seconds, whereas truncating, migrating up and seeding (fixtures) would always take several seconds. We initially experimented with recreating these test databases directly via SQL dumps, then binary dumps, until we finally realized that PostgreSQL templates are the best option in these testing scenarios.
  | 
  | Sorry, I've got no benchmarks to share at this time.
  | 
  | miohtama wrote:
  | This is super nice! I have seen similar solutions for Python in the past, but having an easy API + Docker makes this generic and language agnostic.
  | 
  | Have you looked at tweaking PostgreSQL parameters to make it faster for test runs, e.g. by disabling disk writes?
  | 
  | majodev wrote:
  | Yep, disabling fsync, synchronous_commit and full_page_writes _ONLY_ for this PostgreSQL integration testing strategy gives us an additional boost of ~35% in overall testing speed. https://github.com/allaboutapps/go-starter/blob/master/docke...
  | 
  | Regarding being language agnostic: YES, clients for languages other than Go are very welcome. The integresql server actually provides a trivial RESTful JSON API and should be fairly easy to integrate with any language. This is our second iteration of this PostgreSQL test strategy; the last version came embedded in our legacy Node.js backend stack and wasn't able to handle tests in parallel, which became an additional hard requirement for us.
  | 
  | See our go-starter project for details on how we are using integresql, e.g. the WithTestDatabase/WithTestServer utility functions we use in all tests to "inject" the specific database: https://github.com/allaboutapps/go-starter/blob/master/inter...
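The three settings named above are standard PostgreSQL parameters; one way to apply them to a throwaway test instance (never to a server holding real data, since they trade crash safety for speed) is via ALTER SYSTEM followed by a configuration reload:

      -- Disposable test instances only: disable durability guarantees for faster tests
      ALTER SYSTEM SET fsync = off;
      ALTER SYSTEM SET synchronous_commit = off;
      ALTER SYSTEM SET full_page_writes = off;
      SELECT pg_reload_conf();  -- all three can be picked up without a restart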
  | FanaHOVA wrote:
  | Is this similar to doing something like `rails db:schema:load` and then `rails db:seed`? i.e. set a brand new db to a certain schema + preload a set of fixtures into it?
  | 
  | mccolin wrote:
  | The effect will be relatively similar - you arrive at a new database with the proper schema/structure in place - but the difference is in execution. Doing this _within_ the database engine is likely 10x faster than loading the schema through a client library making a series of CREATE and ALTER statements (a la the schema:load rake task).
  | 
  | KingOfCoders wrote:
  | We're using PostgreSQL templates to create a database from a template for every integration test worker (Jest).
  | 
  | This way every worker works on their own database. The speedup was 5x-7x, as tests are now running in parallel.
  | 
  | jayd16 wrote:
  | Do you store test data in the template?
  | 
  | My team uses a migration framework (in our case Flyway) such that we can easily create completely fresh DBs on a factory-fresh (+ user roles) instance of Postgres. We do this for our testing as well. This means the schema tools are always tested as well. Is there a benefit to using templates? They're faster, perhaps?
  | 
  | flavor8 wrote:
  | We use Flyway to create a template. Then each test that mutates the DB automatically invokes a cleanup routine which recreates the test DB from the template. You get into hot water around the connection pooling, but it's much, much faster than refreshing using Flyway.
  | 
  | KingOfCoders wrote:
  | For 10 workers it takes <0.5 seconds to create 10 databases from the template. Test data is created from within tests.
  | 
  | tln wrote:
  | I'm using templates as well, for the speedup in testing. We build a PostgreSQL docker image with the migrations + seed data, then create a few databases for parallel testing.
  | 
  | When migration or seed files have changed, rules rebuild the image. That way the price of migration (about 2 minutes for our Django app) isn't paid on every test run.
  | 
  | Creating a DB from a template is 2 seconds.
___________________________________________________________________
(page generated 2020-07-09 23:01 UTC)