[HN Gopher] Using PostgreSQL as a Data Warehouse
       ___________________________________________________________________
        
       Using PostgreSQL as a Data Warehouse
        
       Author : cedricd
       Score  : 148 points
       Date   : 2021-05-10 19:39 UTC (3 hours ago)
        
 (HTM) web link (www.narrator.ai)
 (TXT) w3m dump (www.narrator.ai)
        
       | u678u wrote:
        | What would be great is some way to codify all this advice, e.g.
        | running PG in a "data warehouse mode". Databases are too big and
        | too configurable, and without specialist DBAs anymore most
        | people are just guessing at what works best.
        
         | cedricd wrote:
         | That's a great point.
         | 
          | This isn't quite what you're saying, but Citus [1] ships a
          | distributed Postgres. A lot of the things it improves would
          | actually help massively with analytical workloads.
         | 
         | 1: https://www.citusdata.com/
        
           | riku_iki wrote:
            | You misspelled it as cit<r>us; the same typo is in the blog
            | post.
        
             | cedricd wrote:
             | Ahh! So sorry. Fixed it.
        
             | cedricd wrote:
             | Also, what's your take? Do people use citus for analytical
             | workloads as well as production at scale?
             | 
             | I'd assume yes, but I haven't personally used you guys. I'm
             | just aware of you and broadly how you scale Postgres.
        
               | riku_iki wrote:
               | Sorry, I am not familiar with citus.
        
               | [deleted]
        
       | richwater wrote:
       | > vacuum analyze after bulk insertion
       | 
        | Ugh, this is a nightmare. I wish they would come up with a
        | better system than forcing this on users.
        
         | cedricd wrote:
          | Luckily Postgres' autovacuum works really well under normal
          | workloads. If inserts arrive in an even mix spread over time,
          | it's probably best to just rely on it.
          | 
          | For data warehouses, inserts tend to happen in bulk on a
          | regular cadence. In that case it can help to vacuum right
          | after, though I'm not sure it has a huge impact in practice.
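          | 
          | If you do vacuum after a load, it's a single statement (a
          | minimal sketch; "events" is a hypothetical table name):
          | 
          |       VACUUM ANALYZE events;
          | 
          | This refreshes planner statistics and sets visibility-map
          | bits in one pass, so queries right after the load get good
          | plans and can use index-only scans.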
        
       | efxhoy wrote:
        | Dr. Martin Loetzsch did a great video, "ETL Patterns with
        | Postgres". He covers some really good topics:
        | 
        | - Instead of updating tables, build their replacements under a
        | different name and then rename them. This makes updating a
        | heavy-to-compute table instant. It even works for whole schemas:
        | rebuild a schema as schemaname_next, rename the current one to
        | schemaname_old, then rename schemaname_next to schemaname (see
        | the sketch below).
        | 
        | - Keep all the source data raw, and disable WAL -- you don't
        | need it for ETL.
        | 
        | - Set memory limits high.
        | 
        | And lots of other good tips for doing ETL/DW in postgres. It's
        | here: https://www.youtube.com/watch?v=whwNi21jAm4
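        | 
        | A minimal sketch of the schema-swap trick (schema names are
        | placeholders):
        | 
        |       BEGIN;
        |       ALTER SCHEMA analytics RENAME TO analytics_old;
        |       ALTER SCHEMA analytics_next RENAME TO analytics;
        |       COMMIT;
        |       DROP SCHEMA analytics_old CASCADE;
        | 
        | Because DDL is transactional in Postgres, readers see the old
        | schema until the rename commits, so the swap is effectively
        | instant.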
       | 
       | I really appreciate having data in postgres. It's often easy to
       | think that a specialised DW tool will solve all your problems,
       | but that often fails to consider things like:
       | 
        | - Developer experience. Postgres runs very easily on a local
        | machine; more specialized solutions often don't, or are tricky
        | to set up.
       | 
       | - Learning another tool costs time. A developer can learn
       | postgres really well in the time it takes them to figure out how
       | to use several more specialised tools. And many devs already know
       | postgres because it's pretty much the default DB nowadays.
       | 
        | - Analytics queries often don't need to run at warp speed.
        | BigQuery might give you the answer in a second, but if postgres
        | does it in a minute and it's a weekly report, who cares?
       | 
       | - Postgres is boring and has been around for many years now, it
       | will probably still be here in 10 years so time spent learning it
       | is time well spent. More niche systems will probably be
       | superseded by fancier, faster replacements.
       | 
        | I would go so far as to say you don't necessarily need to split
        | your DW out from your prod DB in every case. As soon as you
        | split a DW out to a separate server, you need some way to keep
        | it in sync, so you'll probably end up duplicating some business
        | logic for a report, maintaining some ingestion app, or shuffling
        | data around S3. Keeping your analytics in your prod DB (or just
        | a snapshot of yesterday's DB) is often good enough, and it means
        | you're more likely to avoid gnarly business rules going out of
        | sync between your app and your DW.
        
       | dgudkov wrote:
       | From a quick look I didn't notice any mention of columnar
       | storage. I would be very skeptical about any claim that a DB
       | without columnar storage "can work extremely well as a data
       | warehouse".
        
         | dreyfan wrote:
          | Take a look at Kimball-style star schemas. They've worked
          | extremely well as a data warehouse technique for decades.
          | That said, I think modern offerings (e.g. ClickHouse) are
          | superior in most use cases, but it's definitely not
          | impossible on a traditional row-oriented RDBMS.
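          | 
          | A minimal star-schema sketch (all names hypothetical): a
          | narrow fact table of keys and measures, joined to small
          | dimension tables only when a query needs their attributes.
          | 
          |       CREATE TABLE dim_customer (
          |           customer_key serial PRIMARY KEY,
          |           name         text,
          |           region       text
          |       );
          | 
          |       CREATE TABLE fact_sales (
          |           customer_key int REFERENCES dim_customer,
          |           sale_date    date NOT NULL,
          |           quantity     int NOT NULL,
          |           amount       numeric(12,2) NOT NULL
          |       );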
        
       | kthejoker2 wrote:
       | No mention of Greenplum? Literally a columnar DW built on top of
       | pg.
       | 
       | Where are my DuckDB people? (Think SQLite for OLAP workloads.)
       | 
       | https://duckdb.org/
        
       | mattashii wrote:
       | > Avoid Common Table Expressions (or upgrade to 12)
       | 
       | Upgrading to 12 (or 13) seems like the better option here,
       | whenever you're able to do so. The improvements are very much
       | worth it.
        
         | truculent wrote:
         | Yes, it is strange that "Avoid CTEs" made it into the TL;DR
         | instead.
        
           | cedricd wrote:
           | I updated the blog post :)
        
         | cedricd wrote:
         | Yes, that's a really great point. I should emphasize that more
         | clearly in the blog :).
        
           | mattashii wrote:
            | A few more items I've encountered:
            | 
            | - Citus data, not Citrus data.
            | 
            | - In tables, column order matters. Put fixed-width NOT NULL
            | columns before all other columns, ideally ordered so that
            | there's no unnecessary alignment padding. Then maybe some
            | fixed-width columns with a low null fraction, then the rest
            | ordered by query access frequency. A lot of time can be
            | spent extracting tuple values, and you can save much of it
            | through the attribute-offset caches by correctly tetris-ing
            | your columns. Note that dropped columns are set to null, so
            | a table rewrite (by redefining your table) may be in order
            | for 100% performance.
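            | 
            | A sketch of the column-tetris idea (hypothetical table):
            | fixed-width NOT NULL columns first, wider alignment before
            | narrower, variable-width columns last.
            | 
            |       CREATE TABLE events (
            |           created_at timestamptz NOT NULL, -- 8 bytes
            |           account_id bigint NOT NULL,      -- 8 bytes
            |           kind       int NOT NULL,         -- 4 bytes
            |           payload    jsonb,                -- variable
            |           note       text                  -- variable
            |       );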
        
             | galkk wrote:
              | Isn't #2 something that the database engine should handle
              | by itself? Fixed-width and non-null are declared in the
              | DDL, so the engine should be able to handle it.
        
         | arcticfox wrote:
          | Something painful I learned just today is that even Postgres
          | >= 12 isn't that great at planning CTEs. Querying the same
          | CTE twice seems to force it to materialize the CTE, just like
          | Postgres <12 used to.
          | 
          | Fortunately there's a workaround: the `with .. as not
          | materialized ()` hint sped up my query 100x.
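          | 
          | For reference, a minimal sketch of that hint (table and
          | column names hypothetical):
          | 
          |       WITH recent AS NOT MATERIALIZED (
          |           SELECT user_id, created_at
          |           FROM events
          |           WHERE created_at > now() - interval '7 days'
          |       )
          |       SELECT *
          |       FROM recent a
          |       JOIN recent b USING (user_id);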
        
         | wwweston wrote:
         | What situations would make people unable to upgrade to 12+?
        
           | aargh_aargh wrote:
            | One case where I decided against it was when I needed to
            | access the new postgres server (which I initially planned
            | to be v13) from an old machine (a legacy distro with v9.6).
            | That setup defaulted to the newer SCRAM-SHA-256 password
            | hashing method, which only libpq10 and newer support. For
            | some reason I couldn't, or didn't want to, either rehash
            | the passwords or upgrade the client, so I remained on a
            | lower version. Certainly not unsolvable, but I didn't have
            | the time to spend on a proper fix.
        
           | goatinaboat wrote:
            | _What situations would make people unable to upgrade to 12+?_
            | 
            | In my personal experience: the organisation doesn't invest
            | in a storage solution that offers snapshots/COW (e.g. ZFS or
            | a SAN or whatever). Then they put off upgrading until their
            | disks reach a usage level where the upgrade has to be done
            | in place. Then they become like rabbits in the headlights
            | and never upgrade.
        
           | cedricd wrote:
           | Maybe it doesn't get prioritized until it's important. I know
           | PG upgrades are pretty straightforward, but sometimes people
           | don't want to touch something running well.
           | 
            | That said, given the performance implications, if someone
            | wants to use PG as a warehouse, upgrading to 12 is a no-
            | brainer.
        
       | cedricd wrote:
       | We support multiple data warehouses on our platform. We recently
       | had to do a bit of work to get Postgres running, so we wrote a
       | high-level post about things to consider when running analytical
       | workloads on PG instead of normal production workloads.
        
       | rad_gruchalski wrote:
        | Is anybody aware of any serious research comparing Yugabyte
        | vs Postgres?
        | 
        | Yugabyte appears to an application as essentially Postgres
        | 11.2 with all PostgreSQL features (even the row/column-level
        | security mechanisms) but apparently handles replication and
        | sharding automagically (is it a DHT, similar to Cassandra?).
        | 
        | Are there any serious technical comparisons with some
        | conclusions, without the marketing BS?
        
         | zinclozenge wrote:
         | yugabyte is multi-raft i believe
        
       | oncethere wrote:
       | It's interesting to see how PG can be configured, but why not
       | just use a real warehouse like Snowflake?
       | 
       | Also, do you have any numbers on how PG performs once it's
       | configured?
        
         | cedricd wrote:
         | Yeah, I think using Snowflake or BigQuery or something is
         | ultimately the better move. But sometimes folks use what they
         | know (what they're comfortable managing, tuning, deploying,
         | whatever).
         | 
         | In my own testing PG performed very similarly to a 'real'
         | warehouse. It's hard to measure because I didn't have the same
         | datasets across several warehouses. Maybe in the future I'll
         | try running something against a few to see.
        
           | fiddlerwoaroof wrote:
            | I really wanted to migrate an analytics project I was
            | working on from Elasticsearch to Postgres; however, when we
            | sat down and ran production-scale proofs of concept for the
            | change, ClickHouse handily outclassed all the Postgres-
            | based solutions we tried. (A real DBA might have been able
            | to solve this for us: I did some tuning, but I'm not an
            | expert.) ClickHouse, by contrast, worked near-optimally out
            | of the box.
        
           | willvarfar wrote:
            | Snowflake and BigQuery are cloud solutions; some companies
            | need self-hosted databases.
            | 
            | And often a homogeneous database stack is a plus. If your
            | production systems are all MySQL, then trying to get away
            | with using MySQL for analytics too is a smart move.
            | 
            | I've seen so many tiddly data warehouses. Most companies
            | don't need web scale, and they overbuild and overcomplicate
            | when they could be running on a simpler homogeneous stack.
        
         | acidbaseextract wrote:
         | I'm glad I'm not in the situation of needing to make the
         | judgement call, but Postgres' ecosystem might be part of the
         | answer. For example, Snowflake has basic geospatial support,
         | but PG has insane and useful stuff in PostGIS like
         | ST_ClusterDBSCAN:
         | https://postgis.net/docs/ST_ClusterDBSCAN.html
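          | 
          | A minimal sketch of what that looks like (assuming a
          | hypothetical "sites" table with a geometry column "geom"):
          | 
          |       SELECT id,
          |              ST_ClusterDBSCAN(geom, eps := 50, minpoints := 5)
          |                  OVER () AS cluster_id
          |       FROM sites;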
         | 
         | Foreign data wrappers are another thing that might be
         | compelling -- dunno if Snowflake has an equivalent.
         | 
         | I don't have any numbers but PG has served me fine for basic
         | pseudo-warehousing. Relative to real solutions, it's pretty bad
         | at truly columnar workloads: scans across a small number of
         | columns in wide tables. The "Use Fewer Columns" advice FTA is
          | solid. This hasn't been a deal breaker though. Analytical
          | query time has been fine, if not great, up to low tens of GB
          | of table size; beyond that it gets rough.
        
         | simonw wrote:
          | If your project has fewer than a few million rows, chances
          | are PostgreSQL will run any analytical query you like in just
          | a few seconds, so you can keep things simple by not adding
          | any extra tools.
        
         | adwww wrote:
          | > why not just use a real warehouse like Snowflake
          | 
          | One big reason: customers who demand, or are legally obliged
          | to ensure, that their data does not leave their territory.
          | 
          | Snowflake gives you some options, but if you win a big
          | contract with a customer in another region, your entire
          | analytics platform is unavailable to that customer.
        
         | peterthehacker wrote:
         | > why not just use a real warehouse like Snowflake?
         | 
          | Cost is the most common reason I've seen. An RDS instance is
          | about half the cost of Redshift per CPU, and Snowflake is
          | slightly more expensive than Redshift (though often worth the
          | extra $$).
          | 
          | Also, if you're dealing with less than 10GB of data, the
          | difference in performance will be barely noticeable, so at
          | modest scale the most cost-effective solution is RDS.
        
       | hn2017 wrote:
       | No columnstore option like SQL server?
        
         | mattashii wrote:
         | None baked in (yet). Maybe in 15 or 16; Zedstore (a project for
         | a PG columnstore table access method) is slowly getting more
         | and more feature-complete, and might be committed to the main
         | branch in one of the next few major releases.
        
         | arcticfox wrote:
         | I'm curious what people think of Swarm64 and/or TimescaleDB on
         | that front
        
         | BenoitP wrote:
          | Citus, cited at the end, now includes a column store as of
          | Citus 10 (single-node or distributed).
          | 
          | https://github.com/citusdata/citus
        
       | CRConrad wrote:
        | Slightly peculiar juxtaposition in the subheading, "Differences
        | Between Data Warehouses and Relational Databases". Most data
        | warehouses _are_ relational databases (like, e.g., PostgreSQL).
        | I think you might want something like "Differences Between Data
        | Warehouses and Operational Databases" instead?
        | 
        | Also, under "Reasons not to use indexes", #3 says: "Indexes add
        | additional cost on every insert / update". Yes, but data
        | warehouses usually aren't continually updated during the
        | workday. It's a one-time hit sometime during the night, during
        | your ETL run. (Not a Pg DBA, but presumably you can shut off
        | index updating during the data load and then run it separately
        | afterwards for higher performance?)
        
         | SigmundA wrote:
         | Usually you drop indexes before load, then recreate after.
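          | 
          | A minimal sketch of that pattern (table, index, and file
          | names hypothetical):
          | 
          |       DROP INDEX IF EXISTS idx_sales_customer;
          |       COPY sales FROM '/data/sales.csv' WITH (FORMAT csv);
          |       CREATE INDEX idx_sales_customer ON sales (customer_id);
          |       ANALYZE sales;
          | 
          | Building the index once over the loaded data is much cheaper
          | than maintaining it row by row during the load.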
        
       | gervwyk wrote:
        | Probably going to get downvoted for this, but I feel like
        | MongoDB should get more love in these subs.
        | 
        | We use it all the time, and its aggregations can get really
        | advanced and perform well, to the level where we run most
        | analytics on demand. Sure, we're not pushing to "big data"
        | levels -- max a few 100k records -- but I believe that's the
        | size of the average business dataset (just an estimate, I have
        | nothing to back this up).
        | 
        | I've been building with it for 5 years now and it's been a
        | breeze, especially with Atlas. I think our team has not spent
        | more than 3 days in total on DB devops. And Atlas adds Lucene
        | text search and data lakes for querying data from S3. What's
        | not to love?
        
         | jmchuster wrote:
         | Yes, if you're only going to have 100k records, then basically
         | any type of store will work, and you should choose purely on
         | ergonomics. At that size, you can get away with not even having
         | indexes on a traditional sql database.
        
         | ahmedelsama wrote:
          | Yeah, this blog was about optimizing a Postgres table with
          | 100m records -- so 1000x more, which is where all these
          | issues came from.
        
         | neximo64 wrote:
          | I have used Mongo and it is a total nightmare; some of the
          | reasons are in your text. It's supposedly only good with
          | Atlas, which is a managed service.
          | 
          | I haven't tried Atlas myself (and why would I? I try to avoid
          | lock-in), but Postgres supports json column types, which have
          | been my go-to instead of Mongo, and it has been an absolute
          | breeze -- especially since they can be indexed and scanned
          | with Postgres SQL.
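          | 
          | For instance, a jsonb column can get a GIN index that serves
          | containment queries (a minimal sketch; names hypothetical):
          | 
          |       CREATE INDEX idx_docs_data
          |           ON docs USING gin (data jsonb_path_ops);
          | 
          |       SELECT * FROM docs WHERE data @> '{"status": "active"}';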
        
         | adwww wrote:
          | Mongo is nice to work with for applications, but for
          | analytics it's a bit of a nightmare -- mostly due to
          | incompatibility with external tools.
        
         | arcticfox wrote:
         | > Sure we're not pushing to "big data" levels, max a few 100k
         | records, but in reality I believe that's the average size of
         | the majority of business datasets (Just an estimate, I have
         | nothing to back this up)
         | 
         | Excel on a laptop is also a viable option at that scale
        
       | paulrbr wrote:
        | Combine these tips with those in
        | https://tech.fretlink.com/build-your-own-data-lake-for-repor...
        | (I'm the writer of that linked article) and you get a really
        | powerful, fully open-source, extensive data-warehouse solution
        | for many real-life use cases (as long as your row count doesn't
        | exceed the 10^9 order of magnitude).
       | 
       | Thanks Cedric for sharing your experience with using PG for data-
       | warehousing <3
        
       | willvarfar wrote:
        | A new Postgres-based darling is TimescaleDB. It's a drop-in
        | extension for Postgres.
       | 
       | It is a hybrid row column store with excellent compression and
       | performance.
       | 
       | It would be interesting to see how it compares if narrator would
       | try it out. Benchmarks would be cool.
       | 
        | One very neat feature I'm enamored of is "continuous
        | aggregates". These are materialized views that auto-update as
        | you change the fact table.
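        | 
        | A minimal sketch of one (hypothetical table and columns;
        | assumes the TimescaleDB extension and that "measurements" is a
        | hypertable):
        | 
        |       CREATE MATERIALIZED VIEW daily_stats
        |       WITH (timescaledb.continuous) AS
        |       SELECT time_bucket('1 day', ts) AS day,
        |              device_id,
        |              avg(value) AS avg_value
        |       FROM measurements
        |       GROUP BY 1, 2;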
       | 
        | Continuous aggregates are a great idea. InfluxDB had
        | "continuous queries" (though Influx's implementation generally
        | isn't so neat), and Firebolt has "aggregate indexes", which are
        | much the same thing.
       | 
        | I think all OLAP DBs will eventually have them as a staple, and
        | that they will trickle down into OLTP too.
        
         | esturk wrote:
          | This recommendation reads backwards. You suggested a
          | substitute that's "comparable", yet offered no data, and then
          | nudged the author to try it. How do we know you didn't just
          | bring something up to have the author test it for you?
        
         | cedricd wrote:
         | Would TimescaleDB be much faster for analytical queries that
         | aren't necessarily segmented or filtered by time?
         | 
          | My uninformed assumption is that if I do a group-by over all
          | rows in a table, it may not perform any better.
         | 
         | I'll look into their continuous aggregates -- that could be one
         | way to get around the cost of aggregating everything if it's
         | done incrementally.
        
           | willvarfar wrote:
            | I'd guess any roughly sequentially keyed table ought to get
            | good insert performance. Think how common auto-increment
            | is.
            | 
            | And being HTAP, Timescale ought to do better than a classic
            | pure column store on upserts and non-appending inserts too.
            | 
            | Even if your table is big and the keys are unordered, you
            | still might get excellent performance if your access
            | pattern is read-heavy.
            | 
            | And you can still mix in classic Postgres row-based tables.
            | Timescale just gives you a column-store choice for each
            | table.
        
           | gshulegaard wrote:
           | I would have to dig more into specifics of your use case but
           | my gut reaction is yes, it would be better.
           | 
            | I don't have specific experience with TimescaleDB, but I
            | have some experience scaling PostgreSQL directly and with
            | Citus (which is similar, but not the same). Depending on
            | the nuances of your use case, I can envision a number of
            | scaling strategies in vanilla Postgres to handle it. A lot
            | of what Timescale and Citus do is abstract some of those
            | strategies and extend them. Which is a vague way of saying:
            | I could probably come up with a scheme in vanilla Postgres
            | to support your use case, and since Timescale/Citus make
            | those strategies even easier and better, I'm fairly
            | confident they would handle it as well.
           | 
           | As an example I currently have a table in my current Citus
           | schema that is sharded by column hash (e.g. "type"
           | enumerator) and further partitioned by time. The first part
           | (hash based sharding) seems possibly sufficient for your use
           | case.
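            | 
            | In Citus, the hash-sharding part is a single call (a sketch
            | with hypothetical names):
            | 
            |       SELECT create_distributed_table('events', 'type');
            | 
            | after which queries filtering on "type" route to a single
            | shard.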
           | 
           | Beyond the most simple applications in that domain though,
           | there are more exotic options available to both Timescale and
           | Citus that could come into play. For example, I know Citus
           | recently incorporated some of their work on the cstore_fdw
           | into Citus "natively" to allow columnar storage tables
           | directly:
           | 
            | https://www.citusdata.com/blog/2021/03/05/citus-10-release-o...
        
       | gnfargbl wrote:
       | I love postgres and my business relies on it. However, at the
       | scale the author is talking about (~100m rows), throwing all the
       | data into BigQuery is very likely to be a better option for many
       | people.
       | 
        | If rows are around 1kB, a full-dataset query over 100m rows
        | scans about 100GB, which at BigQuery's on-demand rate of $5/TB
        | costs less than $0.50 -- less still if only a few columns have
        | to be scanned. Storage costs will be negligible and, unlike pg,
        | setup time will be pretty much nil.
        
         | cedricd wrote:
          | Yep, fully agree. The point of the post wasn't to say that
          | you _should_ use PG as a data warehouse -- just that if it's
          | what you've got available (for various reasons), you _can_.
        
         | predictmktegirl wrote:
         | $0.5 queries?
        
           | Keyframe wrote:
            | Full scan. Still absurd what we've come to consider
            | tolerable.
        
             | phpnode wrote:
              | Reading 100GB from disk costs $0.50? This is absolutely
              | incredible to me. How has this become acceptable to the
              | industry?
        
           | wodenokoto wrote:
           | You don't pay for having a server running. You only pay for
           | storage and data read during queries.
           | 
           | No server fees and no fees when autoscaling up for heavy
           | computations.
        
       | phibz wrote:
        | Maybe I missed it, but there's no mention of denormalization
        | strategies, or of data cubes, fact tables, and dimension
        | tables. Structuring your data closer to how it's going to be
        | analyzed is vital for performance. It also gives you the
        | opportunity to cleanse, standardize, and conform your data.
       | 
       | I ran a pg data warehouse in the 8.x and 9.x days with about 20TB
       | of data and it performed great.
        
         | cedricd wrote:
         | I think you're right but it's a bit out of scope. Hard to give
         | generalizable advice around this I think.
         | 
         | What we personally do in practice is put everything into a
         | single time-series table with 11 columns. [1]
         | 
         | 1: https://www.activityschema.com/
        
       ___________________________________________________________________
       (page generated 2021-05-10 23:00 UTC)