[HN Gopher] Shipping Multi-Tenant SaaS Using Postgres Row-Level ... ___________________________________________________________________ Shipping Multi-Tenant SaaS Using Postgres Row-Level Security Author : capiki Score : 124 points Date : 2022-07-26 18:16 UTC (4 hours ago) (HTM) web link (www.thenile.dev) (TXT) w3m dump (www.thenile.dev) | kache_ wrote: | Context aware data access is really cool. And hard :) | mkurz wrote: | Be aware when using RLS with views: By default the RLS policy | will be executed with the permissions of the owner of the view | instead of with the permissions of the user executing the current | query. This way it can easily happen that the RLS policy will be | bypassed because the owner of the view is an admin account or the | same account that owns the underlying table (see the gotchas | section of the original post). | | However, upcoming PostgreSQL 15 adds support for security invoker | views: | https://github.com/postgres/postgres/commit/7faa5fc84bf46ea6... | That means you can then set the security_invoker attribute | when creating a view, and this "... causes the underlying base | relations to be checked against the privileges of the user of the | view rather than the view owner" (see | https://www.postgresql.org/docs/15/sql-createview.html). PG15 beta | 1 release notes: | https://www.postgresql.org/about/news/postgresql-15-beta-1-r... | simonw wrote: | I don't fully understand the performance implications here. | | Say I was using this for a blog engine, and I wanted to run this | SQL query: select * from entries; | | But I actually only want to get back entries that my current user | is allowed to view - where author_id = 57 for example. | | Would PostgreSQL automatically turn the above query into the | equivalent of this: select * from entries where | author_id = 57; | | And hence run quickly (assuming there's an index on that | author_id column)?
| | Or would it need to run an additional SQL query check for every | single row returned by my query to check row permissions, adding | up to a lot of extra overhead? | ossopite wrote: | yes, postgres will add such a condition to the query and in | simple cases like this is able to use a corresponding index | | unfortunately this can break down in more complex cases. | roughly, postgres trusts a limited set of functions and | operators not to leak information about rows (e.g. via error | messages) that the RLS policy says a query should not be able | to see. that set includes basic comparisons but not more | esoteric operations like JSON lookups. at some point postgres | will insist on checking the RLS policy result for a row before | doing any further work, which can preclude the use of indexes | a-dub wrote: | this is cool. next up, row level encryption with private | information retrieval methods for enabling queries and searches | homomorphically (on data encrypted by the client that the service | provider never has a key for). | spacemanmatt wrote: | If I were leaning into RLS today I would do it through PostgREST | pikdum wrote: | PostGraphile is also a really neat tool, using GraphQL instead | of REST. | ei8ths wrote: | I needed this two years ago. I was looking at this but couldn't | figure out how to do it with an existing db connection pool to | reuse connections. I might be migrating to this soon so that | tenants will be better isolated from each other. | mgl wrote: | Row-level security is always tricky and a hard-to-enforce | assumption, as this is not how we really use relational databases. | | Much bigger fan of the approach described here: | | Scalability, Allocation, and Processing of Data for Multitenancy | | https://stratoflow.com/data-scalability-allocation-processin... | nbevans wrote: | Using RLS to implement multi-tenancy is a terrible idea. Just | deploy a database per tenant. It's not hard. Why overcomplicate | it?
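ossopite's point about index use can be sketched with a minimal policy. This is a hypothetical illustration (the table, index, and setting names are assumptions, not from the article):

```sql
-- Hypothetical blog schema for the scenario simonw describes.
CREATE TABLE entries (id serial PRIMARY KEY, author_id int, body text);
CREATE INDEX entries_author_idx ON entries (author_id);

ALTER TABLE entries ENABLE ROW LEVEL SECURITY;

-- Plain integer equality is a leakproof operator, so the planner
-- can push this qual down into the scan and use entries_author_idx.
CREATE POLICY author_only ON entries
  USING (author_id = current_setting('app.current_author')::int);

-- As a non-owner role with app.current_author = '57',
--   EXPLAIN SELECT * FROM entries;
-- should plan an index scan, i.e. behave like the explicit
--   SELECT * FROM entries WHERE author_id = 57;
```

By contrast, a policy whose expression involves a non-leakproof function (e.g. a JSON lookup) forces Postgres to evaluate the policy before other work on each row, which is the case where index use can be lost.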
| bicijay wrote: | Deploying a database per tenant is not that easy. You have a | lot of new overhead, migrations become a pain in the ass | (they already are) and a lot of other little problems... | | I would say a database per tenant is overcomplicating it. | Gigachad wrote: | You end up building just a bit of automation around it and | it's fine. The migration isn't any harder, you just run an | ansible job or something to roll it out to all databases. | mbreese wrote: | A database per tenant makes the rest of the workflow | significantly easier though. No need to add clauses to SQL | WHERE statements for users/groups. Queries are faster (less | data). And data can be moved much easier between servers. | | Yes, it does add extra overhead at account creation, during | DB migrations, and for backups. | | But if you don't need cross-account or public data access, it | can make life much easier. | zie wrote: | > No need to add clauses to SQL WHERE statements for | users/groups. | | This is basically what RLS does for you. You specify the | access and you specify the current user (via a connection, | SET ROLE, etc). Then it does all that complicated query | filtering stuff for you, to ensure you don't screw it up. | | > Queries are faster (less data). And data can be moved | much easier between servers. | | Not really, the overhead is just different (and likely more | of it) in your solution. Neither approach is wrong; neither | is the only right one. | pg_bot wrote: | Deploying multiple databases is typically costly in the | infrastructure as a service space. Plus you have more | operational overhead in ensuring backups work and keeping | things secure. It's much easier to use Postgres' schemas to | segment the data within one single database. Frankly schemas | are much easier to reason about, maintain, scale, and keep | compliant than row level security. | [deleted] | andrewstuart wrote: | I once implemented RLS/Postgres for Django. | | It worked pretty well.
| | The basic mechanism was to intercept all outbound SQL queries and | wrap them in statements setting postgres session variables that | set up the RLS. | jzelinskie wrote: | As the developer of an external authorization system (full | disclosure)[0], I feel obligated to chime in on the critiques of | external authorization systems in this article. I don't think | they're far off base, as we do recommend RLS for use cases like | what the article covers, but anyway, here's my two cents: | | 1+2: Cost + Unnecessary complexity: this argument can be used | against anything that doesn't fit the given use case. There's no | silver bullet for any choice of solution. You should only adopt | the solution that makes the most sense for you, and vendors should | be candid about when they wouldn't recommend adopting their | solution -- it'd be bad for both the users and reputation of the | solution. | | 3: External dependencies: That depends on the toolchain. | Integration testing against SpiceDB is easier than Postgres, IMO | [1]. SpiceDB integration tests can run fully parallelized and can | also model check your schema so that you're certain there are no | flaws in your design. In practice, I haven't seen folks write | tests to assert that their assumptions about RLS are maintained | over time. The last place you want invariants to drift is | authorization code. | | 4: Multi-tenancy is core to our product: I'm not sure I'm | steel-manning this point, but I'll do my best. Most companies do not | employ authorization experts, and solutions worth their salt | should support modeling multi-tenant use cases in a safe way. | SpiceDB has a schema language with idioms and recommendations to | implement functionality like multi-tenancy, but still leaves it | in the hands of developers to construct the abstraction that | matches their domain[2]. | | [0]: https://github.com/authzed/spicedb | | [1]: https://github.com/authzed/examples/tree/main/integration-te...
| | [2]: https://docs.authzed.com/guides/schema | gwen-shapira wrote: | The blog explicitly said that if the requirements involve | actual authorization models (beyond simple tenancy) then RLS is | not the best fit (see: https://thenile.dev/blog/multi-tenant-rls#if-you-have-sophis...). | | I think this covers both the complexity aspect and the | difference between what you get from RLS and what external | authz brings to the table (schema, for example). | | I do think that RLS is a great way for a company without authz | experts to build a multi-tenant MVP safely. I've yet to see a | single pre-PMF company that worries about authorization beyond | that; this is a series-B concern in my experience. | semitones wrote: | Btw that's a localhost link | gwen-shapira wrote: | oops :) Too many tabs. Fixed and thank you. | paxys wrote: | Is having to write "SELECT [...] WHERE user_id=<123>" really | considered a security hole? Isn't that how like every service in | existence operates? Coming up with complicated auth systems and | patterns just because you are scared you will accidentally skip | that WHERE clause seems bizarre to me. | bvirb wrote: | A nice thing about using RLS is that the entire query will follow | the rules applied in the database. So complex queries with, | say, joins and/or subqueries will all automatically follow | the RLS policies as well. In our case we also have some global | lookup tables that don't have RLS policies which can also be | joined. | | We've found it pretty nice to cut out a whole class of possible | bugs by being able to defer it to the database level. At the | application level we end up with a wrapper that sets (and | guarantees unsetting) multi-tenant access to the correct | tenant, and then we never have to add "tenant_id = ..." | anywhere, regardless of the query. Since we would almost surely | forget it in some query eventually, it also cuts out quite a bit | of extra code.
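A sketch of what such a wrapper runs per request, assuming an RLS policy keyed on a custom `app.current_tenant` setting (the setting and table names are illustrative):

```sql
-- Per-request scoping inside a transaction. set_config(..., true)
-- behaves like SET LOCAL: the value disappears at COMMIT/ROLLBACK,
-- so it cannot leak to the next request on a pooled connection.
BEGIN;
SELECT set_config('app.current_tenant', 'tenant_42', true);
SELECT * FROM invoices;  -- RLS filters to tenant_42, no WHERE needed
COMMIT;                  -- setting reverts here automatically
```

set_config is the parameterizable equivalent of SET LOCAL, which matters in practice because most drivers cannot bind parameters in a bare SET statement.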
| | You can also do some cool stuff like add RLS policies for | read-only multi-tenant access. Then you can query data across | multiple tenants while enforcing that nothing accidentally gets | written. | nordsieck wrote: | > Is having to write "SELECT [...] WHERE user_id=<123>" really | considered a security hole? Isn't that how like every service | in existence operates? Coming up with complicated auth systems | and patterns just because you are scared you will accidentally | skip that WHERE clause seems bizarre to me. | | Is having to avoid use after free really considered a security | hole? Isn't that how like every program in existence operates? | Coming up with complicated languages and frameworks just | because you're scared you will accidentally use a variable | after it's been freed seems bizarre to me. | | As it turns out, humans are bad at being consistent, whereas | computers are much better. Maybe this particular solution | isn't "the right thing", but it's at least an attempt at | modifying the environment such that mistakes no longer happen. | And at a meta level, that is precisely the right thing to do. | thrownaway561 wrote: | I think this is mainly an issue when you're using raw SQL | statements. If you're using an ORM, there are many ways to add | a where clause to the statements automatically without having | to update your code everywhere. | galaxyLogic wrote: | When you say 'user_id' do you mean each end-user of the system | or each customer? | | I assume you have a few customers and then very many users | belonging to each customer. | mbreese wrote: | From my perspective, it isn't the security aspects that are | limiting, but the usability. | | If you want any access control that isn't a simple | user_id==123, SQL WHERE clauses can get complicated. | | Users, groups, or any kind of fine grained access control can | make simple queries non-trivial.
It's even worse if a user can | be authorized to view data across different accounts. | w-j-w wrote: | lemax wrote: | In my experience we've looked toward this kind of solution in a | large legacy single-tenant application that wants to go | multi-tenant with more safety guarantees. | fswd wrote: | I use this for a startup in a re-write of their solution. It | simplifies my queries and mutations, and security concerns. It | also dramatically reduces the complexity of my code. There are | also ROLES (Guest/Public user, Authenticated, Admin), and you can | combine the roles with Row Level Security. | | I like it so much I don't want to go back! | uhoh-itsmaciek wrote: | >Another issue we caught during testing was that some requests | were being authorized with a previous request's user id. | | This is the terrifying part about RLS to me: having to rely on | managing the user id as part of the database connection session | seems like an easy way to shoot yourself in the foot (especially | when combined with connection pooling). Adding WHERE clauses | everywhere isn't great, but at least it's explicit. | | That said, I've never used RLS, and I am pretty curious: it does | seem like a great solution other than that one gotcha. | shaicoleman wrote: | We're currently using the schema-per-tenant approach, and it's | working very well for us: | | * No extra operational overhead, it's just one database | | * Allows deleting a single schema, useful for GDPR compliance | | * Allows easily backing up/restoring a single schema | | * Easier to view and reason about the data from an admin point of | view | | * An issue in a single tenant doesn't affect other tenants | | * Downtime for maintenance is shorter (e.g. database migration, | non-concurrent REINDEX, VACUUM FULL, etc.) | | * Less chance of deadlocks, locking for updates, etc.
| | * Allows easier testing and development by subsetting tenants' | data | | * Smaller indexes, more efficient joins, faster table scans, more | optimal query plans, etc. With row level security, every index | needs to be a compound index | | * Easy path to sharding per tenant if needed. Just move some | schemas to a different DB | | * Allows having shared data and per-tenant data in the same | database. That doesn't work with the tenant-per-database approach | | There are a few cons, but they are pretty minor compared to the | alternative approaches: | | * A bit more code to deal with tenancy, migrations, etc. We | opted to write our own code rather than use an existing solution | | * A bit more hassle when dealing with PostgreSQL extensions. | It's best to install extensions into a separate extensions schema | | * Possible caching bugs, so you need to namespace the cache and | clear the query cache when switching tenants | | * The security guarantees of a per-tenant solution aren't perfect, | so you need to ensure you have no SQL injection vulnerabilities | bvirb wrote: | We ran a multi-tenant SaaS product for years w/ a | schema-per-tenant approach. For the most part it all worked pretty | great. | | We ran into issues here and there but always found a way to | work around them: | | * Incremental backups were a pain because of needing to lock so | many objects (# of schemas X # of tables per schema). | | * The extra code to deal w/ migrations was kinda messy (as you | mentioned). | | * Globally unique IDs become the combination of the row ID + | the tenant ID, etc... | | For us though the real deal-breaker turned out to be that we | wanted to have real foreign keys pointing to individual rows in | tenant schemas from outside of the tenant schema and we | couldn't. No way to fix that one, since with multi-schema the | "tenant" relies on DB metadata (the schema name). | | We ended up migrating the whole app to RLS (which itself was a | pretty interesting journey).
We were afraid of performance | issues since the multi-schema approach kinda gives you | partitioning for free, but with the index usage on the RLS | constraints we've had great performance (at least for our use | case!). | | After quite a bit of time working with both multi-schema & RLS | I probably wouldn't go back to multi-schema unless I had a real | compelling reason to do so due to the added complexity. I | really liked the multi-schema approach, and I think most of the | critiques of it I found were relatively easy to work around, | but RLS has been a lot simpler for us. | uhoh-itsmaciek wrote: | There are some other cons: | | Memory usage and I/O can be less efficient. Postgres handles | table data in 8kb pages, so even if you're just reading a | single row, that reads 8kb from disk and puts 8kb in the | Postgres buffer cache, with that row and whatever happens to be | next to it in the physical layout of the underlying table. | Postgres does this because of locality of reference: it's | cheaper to bulk-load data from disk, and, statistically | speaking, you may need the adjacent data soon. If each user is | touching separate tables, you're loading a page per row for | each user, and you're missing out on some of the locality | benefits. | | Another problem is monitoring (disclosure: I work for | pganalyze, which offers a Postgres monitoring service). The | pg_stat_statements extension can track execution stats of all | normalized queries in your database, and that's a very useful | tool to find and address performance problems. But whereas | queries like "SELECT * FROM posts WHERE user_id = 123" and | "SELECT * FROM posts WHERE user_id = 345" normalize to the same | thing, schema-qualified queries like "SELECT * FROM | user_123.posts" and "SELECT * FROM user_345.posts" normalize to | _different_ things, so you cannot easily consider their | performance in aggregate (not to mention bloating | pg_stat_statements by tracking so many distinct query stats). 
| This is the case even when you're using search_path so that | your schema is not explicitly in your query text. | | Also, performance of tools like pg_dump is not great with a ton | of database objects (tables and schemas) and, e.g., you can run | into max_locks_per_transaction [1] limits, and changing that | requires a server restart. | | I wouldn't say you should never do schema-based multi-tenancy | (you point out some good advantages above), but I'd be | extremely skeptical of using it in situations where you expect | to have a lot of users. | | [1]: https://www.postgresql.org/docs/current/runtime-config-locks... | bearjaws wrote: | This is such a killer feature in PG, my new job uses it and it | makes audits of our tenancy model dead simple. | | Coming from a SaaS company that used MySQL, we would get asked by | some customers how we guarantee we segmented their data, and it | always ended at the app layer. One customer (a Fortune 10 | company) asked if we could switch to SQL Server to get this | feature... | | Our largest customers ask how we do database multi-tenancy and we | point to our SDLC + PG docs and they go 'K'. | eastbound wrote: | I honestly don't understand how Oracle is still alive. Postgres | has so many of these killer features. | | Also, I wonder how others do tenant separation, what other | solutions there are. | Scarbutt wrote: | oracle has flashback | throwaway787544 wrote: | They're like a tick. Very good at burrowing in and hard to | remove. They have a lot of clients for whom a dozen million | dollars is a drop in the bucket, and moving away is a | decade-long millions-of-dollars project. | mritchie712 wrote: | Legacy. | | If you have thousands of lines of code relying on Oracle the | cost to migrate would be enormous. | toomuchtodo wrote: | Amazon has a great post on this topic. | | https://aws.amazon.com/blogs/aws/migration-complete-amazons-...
| | I thought it was cool they retrained their Oracle DBAs into | other roles as part of the project. | spacemanmatt wrote: | I work with a few former-Oracle DBAs in a PostgreSQL- | flavored consultancy now and they are aces. All the root- | cause analysis and organization skills transfer handily. | paulmd wrote: | Postgres is functionally and conceptually extremely | similar to Oracle. There are a few oddities (in | particular, oracle's "nulls are never in indexes" is | kinda weird) but the redo log is similar to the WAL, etc. | In most cases, similar approaches will perform similarly | and experience pretty much transfers over with a few | months of experience. | gav wrote: | Ignoring the cost, there's the risk/reward alignment you | see in large enterprises. | | Imagine you're a new CIO. You know you're probably looking | at a 3-5 year tenure at this new company and you want to | lead with some big wins to set the tone and show your | value. | | You're reviewing proposals from your senior leadership. One | of the options is an Oracle migration. It could cost a | million dollars to migrate, but you'd save a million | dollars a year going forward. Oracle runs your mission- | critical internal systems, any issues with the migration | and the system you migrate to is going to cause significant | financial and reputation damage. You'll have to defend this | decision if anything goes wrong, i.e. you've absorbed a lot | of risk but a lot less upside to you personally. | | What do you do? You put the proposal to the side and look | for something that has a lot better upside. | beckingz wrote: | Exactly. The risk/cost profile for migrations is bad: If | it goes well, decent return. If it goes poorly, | catastrophic. | abraae wrote: | Oracle has had the ability to do this for decades, so | whatever is keeping them alive, it's nothing to do with this | particular nifty Postgres feature. | revskill wrote: | Sharding is the only scalable way per my experience. 
The | point about scalability here is that I can control the load as | the data gets bigger. | e1g wrote: | Every B2B client who asked us how we handle multi-tenancy also | asked how we ensure their data is erased at the end of the | contract. Using a shared database with RLS means you have to go | through all DB backups, delete individual rows for that tenant, | then re-generate the backup. That's a non-starter, so we opted | for having one DB per tenant, which also makes sharding, | scaling, balancing, and handling data-residency challenges | easier. | axlee wrote: | It makes BI work an absolute hellscape as well. Tradeoffs. | brightball wrote: | I filled out a ton of enterprise questionnaires on this stuff | before and we just told people that it would be deleted when | the backups expired after X days because we didn't have the | capability to delete specific rows from our backups. Nobody | ever argued. | | There's not a single customer I've ever run across who's | going to halt a contract because you can't purge their data | from your backups fast enough. They're signing up because of | what you offer, not the termination clause. | coenhyde wrote: | I did the same with the same results | mikeodds wrote: | thirded | wahnfrieden wrote: | same. but! it's a liability nonetheless, go talk with | legal etc | e1g wrote: | A recent example from the other side: a client contacts me | and says they will have to exit from our _existing_ | contract unless we can update our (AWS) infrastructure to | use their (AWS) encryption keys for servers and databases | handling their tenancy. In Enterprise, some tenants are | very opinionated about what cloud you use and how their | data lives/flows within it. I run all our infosec, | including SOC2 & ISO27001 programs, and I know that using | their encryption keys is nothing but security theater. But | with $500k p.a. on the line, I also know when it's | showtime.
| voberoi wrote: | This is the way -- also never had an issue across US | healthcare and enterprise SaaS. | gervwyk wrote: | Also second this, we even split our AWS org into an AWS | account per tenant. Although this may become a problem if we | have 100s of clients. But it makes onboarding and off-loading | simple. | throw03172019 wrote: | Each client is running on their own instances / load | balancers? | HatchedLake721 wrote: | It depends on the annual contract value (ACV), doesn't it? | You can't give an AWS account to every $99 p/m plan, but | you can for enterprise $50-100k+ deals. | gervwyk wrote: | Yeah, that sums it up. I guess it means it can't be | labelled as "multi-tenant" then.. | SahAssar wrote: | I think the way to handle this (based on how many companies | handle GDPR compliance) is to not keep backups older than X | months (usually 3 months) and have a clause that all data | past that time is deleted. | bearjaws wrote: | We usually write a "reasonable best effort" clause into our | deletion terms: data will 100% be deleted from production within | 30 days and automatically fall out of backups 60 days from | there. This also helps since we can't control our downstream | vendors such as Twilio, AWS SES, etc, who all have their own | legal obligations and time frames. | | Even for large health systems they have been okay with it. | jerryjerryjerry wrote: | I think the TTL features provided by some DB vendors are | actually orthogonal to multi-tenancy: the former deal with | life-cycle policies for data, but the original problem of | deleting a certain user's data is more about data privacy | policy, though overlap may exist. | hobs wrote: | Agree - performance of row level security (at least on SQL | Server) is terrible; sharding by database is fairly easy. | jquery_dev wrote: | How do you manage your backend in this case? Do you have an | instance of the backend for each customer, or do you allow one | backend to make connections to all the DBs.
| | I'm interested in doing something similar and wondering about the | best way to handle the routing between the databases from a single | backend. | mason55 wrote: | It really depends on your requirements, both functional and | cost. Having a full stack per customer can be great for a | lot of reasons. It's probably the safest because you never | have to worry about something getting messed up in the code | and creating cross-customer access. Once you're sure the | environment is set up correctly you can sleep well at | night. You also don't have to worry about one customer | doing something to impact the performance of other | environments (unless you're using shared infra, like | putting all your DBs on a single cluster). And it can make | maintenance easier; for example, data migrations can | start with small and/or less important customers for | practice. It also can give you more flexibility if you need | to make special snowflakes (i.e. some big customer wants | weird IP whitelisting rules that can't work with your | normal env setup). | | Downsides are that it's probably more expensive and more | work. Even if your infra spin-up is totally automated, you | still need to keep track of all the environments, you still | need to keep your Infrastructure-as-Code (e.g. your | terraform scripts) up to date, more can go wrong when you | make changes, and there's more chance for environments to | drift. | | So, in short, separate stacks usually mean more safety & a | simpler application architecture in exchange for more cost | and more effort to manage the fleet. | sgarman wrote: | Am I right in my understanding that EVERY request that comes in | to their api creates a new connection to the database? What about | reusing connections with connection pools, or one level up, using | pgbouncer or similar? Can you actually use RLS while reusing | connections? | cstejerean wrote: | You can reuse the connection with a connection pool and use SET | ROLE when you check it out.
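A sketch of that checkout flow, assuming one database role per tenant (the role name is hypothetical):

```sql
-- At connection checkout from the pool:
SET ROLE tenant_57;    -- RLS policies now evaluate with current_user = tenant_57
SELECT * FROM entries; -- rows filtered by tenant_57's policies
-- Before returning the connection to the pool:
RESET ROLE;            -- avoid leaking tenant identity to the next request
```

Note that the pool wrapper has to guarantee the RESET ROLE, since any statement the app runs before it could otherwise execute as the previous tenant.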
| rst wrote: | Nope. Quoting the article itself: | | "In the traditional use case of direct db access, RLS works by | defining policies on tables that filter rows based on the | current db user. For a SaaS application, however, defining a | new db user for each app user is clunky. For an application use | case you can dynamically set and retrieve users using Postgres' | current_setting() function (i.e.: SET | app.current_app_user = 'usr_123' | | and SELECT | current_setting('app.current_app_user') | | )." | | The policies that they define reference these settings, so they | can do a "set" at the start of processing every web request, on | a pre-existing db connection. | lemax wrote: | It's possible to implement this without creating new | connections to the database for each request by using SET LOCAL | and wrapping every query in a transaction. Instead of applying | RLS based on the current user, you apply RLS based on the | parameter value you set at the beginning of the transaction. | You can set this parameter value based on the user session in | your application. | | Your RLS policy looks as follows: CREATE POLICY | tenant_${tableName}_isolation_policy ON "${tableName}" USING | ("tenant_id" = current_setting('app.current_tenant')); | | Your queries look something like this: BEGIN TRANSACTION; SET | LOCAL app.current_tenant = '${tenant}'; SELECT * FROM | some_table; END TRANSACTION; | | You can even initialize your writes with a `tenant_id` column | defaulted to your `current_setting('app.current_tenant')` ___________________________________________________________________ (page generated 2022-07-26 23:00 UTC)