[HN Gopher] Using PostgreSQL as a Data Warehouse ___________________________________________________________________ Using PostgreSQL as a Data Warehouse Author : cedricd Score : 148 points Date : 2021-05-10 19:39 UTC (3 hours ago) (HTM) web link (www.narrator.ai) (TXT) w3m dump (www.narrator.ai) | u678u wrote: | What would be great is some way to codify all this advice, e.g. run | PG in a Data Warehouse mode. Databases are too big and too | configurable, and without specialist DBAs anymore most people are | just guessing what works best. | cedricd wrote: | That's a great point. | | This isn't really what you're saying, but citus [1] ships a | distributed Postgres. A lot of the things they improve would | help massively with analytical workloads actually. | | 1: https://www.citusdata.com/ | riku_iki wrote: | You misspelled cit<r>us; the same typo is in the blog post. | cedricd wrote: | Ahh! So sorry. Fixed it. | cedricd wrote: | Also, what's your take? Do people use citus for analytical | workloads as well as production at scale? | | I'd assume yes, but I haven't personally used you guys. I'm | just aware of you and broadly how you scale Postgres. | riku_iki wrote: | Sorry, I am not familiar with citus. | [deleted] | richwater wrote: | > vacuum analyze after bulk insertion | | Ugh, this is a nightmare. I wish they would come up with a better | system than forcing this on users. | cedricd wrote: | Luckily Postgres' autovacuum works really well in normal | workloads. If there's an even mix of inserts spread throughout | time, then it's probably best to just rely on it. | | For data warehouses inserts can happen in bulk on a regular | cadence. In that case it can help to vacuum right after. I'm | not sure if it has a huge impact in practice. | efxhoy wrote: | Dr. Martin Loetzsch did a great video, ETL Patterns with | Postgres. He covers some really good topics: | | - Instead of updating tables, build their replacements under a | different name, then rename them.
This makes updating heavy-to-compute | tables instant. Works even for schemas: rebuild a schema | as schemaname_next, rename the current one to schemaname_old, then | rename schemaname_next to schemaname. | | - Keep all the source data raw and disable WAL; you don't need it | for ETL. | | - Set memory limits high. | | And lots of other good tips for doing ETL/DW in postgres. It's | here: https://www.youtube.com/watch?v=whwNi21jAm4 | | I really appreciate having data in postgres. It's often easy to | think that a specialised DW tool will solve all your problems, | but that often fails to consider things like: | | - Developer experience. Postgres runs very easily on a local | machine; more specialized solutions often don't, or are tricky to | set up. | | - Learning another tool costs time. A developer can learn | postgres really well in the time it takes them to figure out how | to use several more specialised tools. And many devs already know | postgres because it's pretty much the default DB nowadays. | | - Analytics queries often don't need to run at warp speed. | BigQuery might give you the answer in a second, but if postgres | does it in a minute and it's a weekly report, who cares? | | - Postgres is boring and has been around for many years now; it | will probably still be here in 10 years, so time spent learning it | is time well spent. More niche systems will probably be | superseded by fancier, faster replacements. | | I would go so far as to say you don't necessarily need to split out | your DW from your prod DB in every case. As soon as you start | splitting out a DW to a separate server you need some way to keep | it in sync, so you'll probably end up duplicating some business | logic for a report, maintaining some ingestion app, shuffling | data around S3 or whatever.
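A minimal SQL sketch of the swap-and-rename pattern described above; the table and column names (customer_stats, orders) are hypothetical:

```sql
-- Build the replacement under a different name; the expensive
-- aggregation runs without touching the live table.
CREATE TABLE customer_stats_next AS
SELECT customer_id, count(*) AS n_orders, sum(total) AS revenue
FROM orders
GROUP BY customer_id;

-- The swap itself is two near-instant catalog operations.
BEGIN;
ALTER TABLE customer_stats RENAME TO customer_stats_old;
ALTER TABLE customer_stats_next RENAME TO customer_stats;
COMMIT;

DROP TABLE customer_stats_old;
```

The schema-level variant works the same way with ALTER SCHEMA ... RENAME TO, and CREATE UNLOGGED TABLE skips WAL writes for staging data that can always be rebuilt, which is presumably what the "disable WAL" tip refers to.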
Keeping your analytics in your prod | DB (or just a snapshot of yesterday's DB) is often good enough and | means you will be more likely to avoid gnarly business rules | going out of sync between your app and your DW. | dgudkov wrote: | From a quick look I didn't notice any mention of columnar | storage. I would be very skeptical about any claim that a DB | without columnar storage "can work extremely well as a data | warehouse". | dreyfan wrote: | Take a look at Kimball/Star-Schema. It's worked extremely well | as a data warehouse technique for decades. That said, I think | modern offerings (e.g. Clickhouse) are superior in most use | cases, but it's definitely not impossible on a traditional | row-oriented RDBMS. | kthejoker2 wrote: | No mention of Greenplum? Literally a columnar DW built on top of | pg. | | Where are my DuckDB people? (Think SQLite for OLAP workloads.) | | https://duckdb.org/ | mattashii wrote: | > Avoid Common Table Expressions (or upgrade to 12) | | Upgrading to 12 (or 13) seems like the better option here, | whenever you're able to do so. The improvements are very much | worth it. | truculent wrote: | Yes, it is strange that "Avoid CTEs" made it into the TL;DR | instead. | cedricd wrote: | I updated the blog post :) | cedricd wrote: | Yes, that's a really great point. I should emphasize that more | clearly in the blog :). | mattashii wrote: | A few more items I've encountered: | | - Citus data, not Citrus data. | | - In tables, column order matters. Order fixed-width non-null | columns before all other columns (ideally so that there's no | unnecessary alignment padding). Then maybe some low-null-fraction | fixed-width columns, then the rest ordered by query access | frequency. A lot of time can be spent extracting tuple | values, and you can save a lot of time using the offset | caches by correctly tetris-ing your columns. Note that | dropped columns are set to null, so a table rewrite (by | redefining your table) may be in order for 100% performance.
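A toy illustration of the column-tetris advice above, with hypothetical table names. On 64-bit builds a bigint must start on an 8-byte boundary, so each 1-byte boolean placed before one drags in 7 bytes of padding per row:

```sql
-- Wasteful ordering: alignment padding after each boolean.
CREATE TABLE events_padded (
    is_active boolean,  -- 1 byte + 7 bytes padding
    user_id   bigint,   -- 8 bytes, 8-byte aligned
    is_bot    boolean,  -- 1 byte + 7 bytes padding
    event_id  bigint    -- 8 bytes
);

-- Tighter: 8-byte fixed-width columns first, narrow fixed-width
-- columns next, variable-width columns (text, jsonb) last.
CREATE TABLE events_packed (
    user_id   bigint,
    event_id  bigint,
    is_active boolean,
    is_bot    boolean,
    payload   text
);
```

Comparing SELECT pg_column_size(t.*) FROM events_padded t against the packed variant on the same data shows the per-row difference.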
| galkk wrote: | Isn't #2 something that the database engine should handle | by itself? Fixed-width, non-null is written in the DDL, so the | engine should be able to handle it. | arcticfox wrote: | Something painful I just learned today is that even Postgres >= | 12 isn't that great at planning CTEs. Querying the same CTE | twice seems to force it into materializing the CTE just like | Postgres <12 used to do. | | Fortunately there's a workaround - using the `with .. as not | materialized ()` hint sped up my query 100x. | wwweston wrote: | What situations would make people unable to upgrade to 12+? | aargh_aargh wrote: | One case where I decided against it was when I needed to access the | new postgres server (which I initially planned to be v13) | from an old machine (with a legacy distro with v9.6). v13 | introduces a new SCRAM-SHA-256 password hashing method by | default, and only libpq10 and newer support this method. For | some reason I couldn't or didn't want to either rehash the | passwords or upgrade the client, so I remained on a lower | version. Certainly not unsolvable, but I didn't have the time | to spend on a proper fix. | goatinaboat wrote: | _What situations would make people unable to upgrade to 12+?_ | | In my personal experience: the organisation doesn't invest in | a storage solution that offers snapshots/COW (e.g. ZFS or SAN | or whatever). Then they wait to upgrade until their disks | reach a usage level where the upgrade has to be done in place. Then | they become like rabbits in the headlights and never upgrade. | cedricd wrote: | Maybe it doesn't get prioritized until it's important. I know | PG upgrades are pretty straightforward, but sometimes people | don't want to touch something running well. | | That said, given the performance implications, if someone | wants to use PG as a warehouse upgrading to 12 is a | no-brainer. | cedricd wrote: | We support multiple data warehouses on our platform.
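(As an aside, a sketch of the hint arcticfox describes above, with hypothetical table and column names. Since Postgres 12 a CTE referenced only once is inlined automatically, but one referenced more than once is still materialized unless you say otherwise:)

```sql
WITH recent AS NOT MATERIALIZED (
    SELECT user_id, amount
    FROM orders
    WHERE created_at > now() - interval '7 days'
)
-- Referenced twice: materialized by default even on 12+.
-- NOT MATERIALIZED inlines each reference, so an index on
-- orders (user_id, created_at) can serve both branches.
SELECT * FROM recent WHERE user_id = 42
UNION ALL
SELECT * FROM recent WHERE user_id = 43;
```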
We recently | had to do a bit of work to get Postgres running, so we wrote a | high-level post about things to consider when running analytical | workloads on PG instead of normal production workloads. | rad_gruchalski wrote: | Is anybody aware of any serious research on Yugabyte vs Postgres? | | Yugabyte appears to an application as essentially Postgres 11.2 | with all psql features (even the row / column level security | mechanisms) but, apparently, handles replication and sharding | automagically (is it a DHT, similar to Cassandra?). | | Are there any serious technical comparisons with some | conclusions, without marketing bs? | zinclozenge wrote: | Yugabyte is multi-Raft, I believe. | oncethere wrote: | It's interesting to see how PG can be configured, but why not | just use a real warehouse like Snowflake? | | Also, do you have any numbers on how PG performs once it's | configured? | cedricd wrote: | Yeah, I think using Snowflake or BigQuery or something is | ultimately the better move. But sometimes folks use what they | know (what they're comfortable managing, tuning, deploying, | whatever). | | In my own testing PG performed very similarly to a 'real' | warehouse. It's hard to measure because I didn't have the same | datasets across several warehouses. Maybe in the future I'll | try running something against a few to see. | fiddlerwoaroof wrote: | I really wanted to migrate an analytics project I was working | on from Elasticsearch to Postgres; however, when we sat down | and ran production-scale proofs of concept for the change, | ClickHouse handily outclassed all the Postgres-based | solutions we tried. (A Real DBA might have been able to solve | this for us: I did some tuning, but I'm not an expert.) | ClickHouse, however, worked near-optimally out of the box. | willvarfar wrote: | Snowflake and BigQuery are cloud solutions. Some companies | have a need for self-hosted databases etc. | | And often having a homogeneous database stack is a plus.
If | your production systems are all MySQL, then trying to get | away with using MySQL for analytics too is a smart move etc. | | I've seen so many tiddly data warehouses. Most companies | don't need web scale, and they overbuild and overcomplicate | when they could be running on a simpler homogeneous stack etc. | acidbaseextract wrote: | I'm glad I'm not in the situation of needing to make the | judgement call, but Postgres' ecosystem might be part of the | answer. For example, Snowflake has basic geospatial support, | but PG has insane and useful stuff in PostGIS like | ST_ClusterDBSCAN: | https://postgis.net/docs/ST_ClusterDBSCAN.html | | Foreign data wrappers are another thing that might be | compelling -- dunno if Snowflake has an equivalent. | | I don't have any numbers but PG has served me fine for basic | pseudo-warehousing. Relative to real solutions, it's pretty bad | at truly columnar workloads: scans across a small number of | columns in wide tables. The "Use Fewer Columns" advice FTA is | solid. This hasn't been a deal breaker though. Analytical query | time has been fine if not great up to low tens of GB of table | size; beyond that it gets rough. | simonw wrote: | If your project has fewer than a few million rows, chances are | PostgreSQL will run any analytical query you like in just a few | seconds. So you can keep things simpler by avoiding adding any | extra tools. | adwww wrote: | > why not just use a real warehouse like Snowflake | | Customers who demand / are legally obliged to ensure their data | does not leave their territory is one big reason. | | Snowflake gives you some options, but if you get a big contract | with a customer in another region your entire analytics | platform is unavailable to that customer. | peterthehacker wrote: | > why not just use a real warehouse like Snowflake? | | Cost is the most common reason I've seen.
An RDS instance is | about 1/2 the cost of Redshift per CPU, and then Snowflake is | slightly more expensive than Redshift (often worth the extra | $$). | | Also, if you're dealing with less than 10GB of data the | difference in performance will be barely noticeable, so at | modest scale the most cost-effective solution is RDS. | hn2017 wrote: | No columnstore option like SQL Server? | mattashii wrote: | None baked in (yet). Maybe in 15 or 16; Zedstore (a project for | a PG columnstore table access method) is slowly getting more | and more feature-complete, and might be committed to the main | branch in one of the next few major releases. | arcticfox wrote: | I'm curious what people think of Swarm64 and/or TimescaleDB on | that front. | BenoitP wrote: | Citus, cited at the end, is a column store (single node or | distributed). | | https://github.com/citusdata/citus | CRConrad wrote: | Slightly peculiar juxtaposition in the subheading, "Differences | Between Data Warehouses and Relational Databases". Most data | warehouses _are_ relational databases (like, e.g., PostgreSQL). I | think you might want to use something like "Differences Between | Data Warehouses and Operational Databases" instead? | | Also, under "Reasons not to use indexes", #3 says: "Indexes add | additional cost on every insert / update". Yes, but then data | warehouses usually aren't continually updated during the workday. | It's a one-time hit sometime during the night, during your ETL | run. (Not a Pg DBA, but presumably you can shut off index updating | during data load, and then run it separately afterwards for | higher performance?) | SigmundA wrote: | Usually you drop indexes before the load, then recreate them after. | gervwyk wrote: | Probably going to get downvoted for this, but I feel like MongoDB | should get more love in these subs. | | We use it all the time and its aggregations can get really | advanced and perform well, to the level where we run most | analytics on demand.
Sure we're not pushing to "big data" levels, | max a few 100k records, but in reality I believe that's the | average size of the majority of business datasets (Just an | estimate, I have nothing to back this up). | | Been building with it for 5 years now and it's been a breeze. | Especially with Atlas. I think our team has not spent more than 3 | days in total on DB dev ops. And with Atlas there's Lucene text search | and data lakes, querying data from S3. What's not to love? | jmchuster wrote: | Yes, if you're only going to have 100k records, then basically | any type of store will work, and you should choose purely on | ergonomics. At that size, you can get away with not even having | indexes on a traditional SQL database. | ahmedelsama wrote: | Yeah, for this blog it was optimizing a Postgres table with | 100m records, so 1000x more, and thus all these issues came to | be. | neximo64 wrote: | I have used Mongo and it is a total nightmare; a bit of the | reasons are in your text. It's supposedly only good with Atlas, | which is a managed service. | | I haven't tried Atlas myself (and why would I? - I try and | avoid lock-in), but Postgres supports JSON column types, | which have been my go-to instead of Mongo, and it has been an | absolute breeze. Especially since it can be indexed and scanned | with Postgres SQL. | adwww wrote: | Mongo is nice to work with for applications, but for analytics | it's a bit of a nightmare - mostly around incompatibility | with external tools. | arcticfox wrote: | > Sure we're not pushing to "big data" levels, max a few 100k | records, but in reality I believe that's the average size of | the majority of business datasets (Just an estimate, I have | nothing to back this up) | | Excel on a laptop is also a viable option at that scale. | paulrbr wrote: | Combine these tips with those: | https://tech.fretlink.com/build-your-own-data-lake-for-repor...
(I'm the writer of that linked | article) and you get a really powerful, fully open-source, | extensive data-warehouse solution for many real-life use cases | (when your data doesn't exceed the 10^9 order of magnitude for | number of rows). | | Thanks Cedric for sharing your experience with using PG for | data-warehousing <3 | willvarfar wrote: | A new Postgres-based darling is TimescaleDB. It's a drop-in for | Postgres. | | It is a hybrid row/column store with excellent compression and | performance. | | It would be interesting to see how it compares if narrator | tried it out. Benchmarks would be cool. | | One very neat feature I am enamored by is "continuous | aggregates". These are materialized views that auto-update as you | change the fact table. | | Continuous aggregates are a great idea. InfluxDB had "continuous | queries" (but the implementation of Influx generally is not so | neat), and Firebolt has "aggregate indexes", which are much the | same thing. | | I think all OLAP DBs will eventually have them as a staple, and | that they will trickle down into OLTP too. | esturk wrote: | This recommendation reads backward. You made a suggestion of a | substitute that's "comparable", yet without any data, and then go | on to nudge the author to try it. How do we know you didn't just | pull up something just to have the author test it for you? | cedricd wrote: | Would TimescaleDB be much faster for analytical queries that | aren't necessarily segmented or filtered by time? | | My uninformed assumption is that if I do a group by over all rows in | a table, it may not perform better. | | I'll look into their continuous aggregates -- that could be one | way to get around the cost of aggregating everything if it's | done incrementally. | willvarfar wrote: | I'd guess any roughly sequentially keyed table ought to get good | insert performance. Think how common auto-increment is.
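For reference, the continuous aggregates mentioned upthread are declared in TimescaleDB roughly like this (a sketch with hypothetical names, assuming metrics is already a hypertable):

```sql
-- Incrementally maintained daily rollup over the metrics hypertable.
CREATE MATERIALIZED VIEW daily_stats
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS day,
       device_id,
       avg(value) AS avg_value
FROM metrics
GROUP BY day, device_id;
```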
| | And being HTAP, Timescale ought to do better than a classic pure | column store on upserts and non-appending inserts too. | | Of course, if your table is big and the keys are unordered you | still might get excellent performance if your access pattern | is read-heavy. | | Of course you can still mix in classic Postgres row-based | tables etc. Timescale just gives you a column-store choice | for each table. | gshulegaard wrote: | I would have to dig more into the specifics of your use case, but | my gut reaction is yes, it would be better. | | I do not have specific experience with TimescaleDB, but I | have some experience scaling PostgreSQL directly and with | Citus (which is similar, but not the same). But depending on | the nuances of your use case, I can envision a number of | scaling strategies in vanilla Postgres to handle it. | A lot of what Timescale and Citus do is abstract some | of those strategies and extend them. Which is just a vague | way of me saying: I think I could probably come up with a | scheme in vanilla Postgres to support your use case, and | since Timescale/Citus make those strategies even | easier/better, I am fairly confident they would also handle | that use case. | | As an example, I currently have a table in my current Citus | schema that is sharded by column hash (e.g. a "type" | enumerator) and further partitioned by time. The first part | (hash-based sharding) seems possibly sufficient for your use | case. | | Beyond the most simple applications in that domain, though, | there are more exotic options available to both Timescale and | Citus that could come into play. For example, I know Citus | recently incorporated some of their work on cstore_fdw | into Citus "natively" to allow columnar storage tables | directly: | | https://www.citusdata.com/blog/2021/03/05/citus-10-release-o...
However, at the | scale the author is talking about (~100m rows), throwing all the | data into BigQuery is very likely to be a better option for many | people. | | If rows are around 1kB, then full-dataset queries over 100m rows | will cost less than $0.50 each on BQ -- less if only a few columns | have to be scanned. Storage costs will be negligible and, unlike | pg, setup time will be pretty much nil. | cedricd wrote: | Yep. Fully agree. The point of the post wasn't to say that you | _should_ use PG as a data warehouse. Just that if it's what | you've got available (for various reasons), you _can_. | predictmktegirl wrote: | $0.50 queries? | Keyframe wrote: | Full scan. Still absurd that this is what we've come to | think of as tolerable. | phpnode wrote: | Reading 100GB from disk costs $0.50? This is absolutely | incredible to me; how has this become acceptable to the | industry? | wodenokoto wrote: | You don't pay for having a server running. You only pay for | storage and data read during queries. | | No server fees, and no fees when autoscaling up for heavy | computations. | phibz wrote: | Maybe I missed it, but there's no mention of denormalizing | strategies or data cubes, fact tables and dimension tables. | Structuring your data closer to how it's going to be analyzed is | vital for performance. It also gives you the opportunity to | cleanse, standardize, and conform your data. | | I ran a pg data warehouse in the 8.x and 9.x days with about 20TB | of data and it performed great. | cedricd wrote: | I think you're right, but it's a bit out of scope. Hard to give | generalizable advice around this, I think. | | What we personally do in practice is put everything into a | single time-series table with 11 columns. [1] | | 1: https://www.activityschema.com/ ___________________________________________________________________ (page generated 2021-05-10 23:00 UTC)