[HN Gopher] Zheap - Reinvented PostgreSQL Storage
       ___________________________________________________________________
        
       Zheap - Reinvented PostgreSQL Storage
        
       Author : PhilipTrauner
       Score  : 117 points
       Date   : 2020-10-12 18:00 UTC (5 hours ago)
        
 (HTM) web link (cybertec-postgresql.github.io)
 (TXT) w3m dump (cybertec-postgresql.github.io)
        
       | yayzheap wrote:
        | This should solve one of the problems Uber had with PostgreSQL,
        | as discussed by Christophe Pettus back in 2017 [1].
       | 
       | 1- https://thebuild.com/presentations/uber-perconalive-2017.pdf
       | 
       | Previous discussion from 2018
       | https://news.ycombinator.com/item?id=16526623
        
       | teej wrote:
       | This seems neat!
       | 
       | Last year I helped a friend diagnose an issue they had with
       | Postgres. A database table with a scheduled full DELETE/INSERT
        | had slowed to the point of failure. It turned out that having
        | slightly less IO than needed caused the auto-VACUUM process to
        | fall further and further behind each time it ran.
       | 
       | My friend simply provisioned more IO and moved on. Another option
       | would be to rewrite the process to naturally produce fewer dead
       | rows. It would be great to have a third feasible option.
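        | 
        | (For context, how much I/O autovacuum is allowed to use is
        | governed by cost-based throttling. A sketch of the relevant
        | knobs; the values and table name are purely illustrative:)
        | 
        |     -- postgresql.conf; the defaults are fairly conservative
        |     -- autovacuum_vacuum_cost_limit = 200
        |     -- autovacuum_vacuum_cost_delay = 2ms   (20ms before PG 12)
        | 
        |     -- or relax the throttling for a single hot table
        |     ALTER TABLE busy_table
        |         SET (autovacuum_vacuum_cost_delay = 0);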
        
         | andreypopp wrote:
          | Such workflows with a scheduled DELETE/INSERT often mean that
          | the data is "derived"; PostgreSQL has an UNLOGGED tables
          | feature for that. Tables configured as UNLOGGED are not
          | written to the WAL and thus generate much less I/O. The
          | downside is that after a crash the table will be empty (until
          | the DELETE/INSERT workflow repopulates it again).
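          | 
          | A minimal sketch of that pattern (table and column names
          | invented for illustration):
          | 
          |     CREATE UNLOGGED TABLE report_cache (
          |         id      bigint PRIMARY KEY,
          |         payload text
          |     );
          | 
          |     -- scheduled refresh: no WAL is written for this table,
          |     -- so the rewrite causes far less I/O
          |     BEGIN;
          |     DELETE FROM report_cache;
          |     INSERT INTO report_cache
          |     SELECT g, 'row ' || g
          |     FROM generate_series(1, 100000) AS g;
          |     COMMIT;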
        
           | teej wrote:
            | UNLOGGED tables are new to me, thanks for sharing.
        
         | merb wrote:
          | Well, there is TRUNCATE for that (it's not MVCC-safe, though).
        
           | teej wrote:
           | The table needed to never be visible as empty, so truncate
           | wasn't a short term option.
        
             | takeda wrote:
              | I solved this by using partitions (at the time this was
              | done through inheritance, where you could build one table
              | out of several smaller ones), and just dropping the
              | partition that held the unneeded data.
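              | 
              | The modern equivalent uses declarative partitioning (a
              | sketch; table and partition names invented):
              | 
              |     CREATE TABLE events (
              |         event_time timestamptz NOT NULL,
              |         payload    text
              |     ) PARTITION BY RANGE (event_time);
              | 
              |     CREATE TABLE events_2020_10
              |         PARTITION OF events
              |         FOR VALUES FROM ('2020-10-01')
              |                      TO ('2020-11-01');
              | 
              |     -- retiring old data is a cheap metadata
              |     -- operation; no dead rows for vacuum to clean up
              |     DROP TABLE events_2020_10;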
        
       | malisper wrote:
       | For those unaware, Zheap is a new storage engine for Postgres
       | that handles updates and deletes in a different way. Currently,
       | when you update a row in Postgres, Postgres creates a new copy of
        | the row and marks the old one as deleted. This is done for
        | several reasons: in particular, it makes handling concurrency
        | easier and it makes rolling back transactions easier.
       | 
       | The issue with this approach is over time this leads to lots of
       | "dead rows" - deleted rows that are taking up space in your DB.
       | Postgres has a background job called the "vacuum" which
       | effectively garbage collects all deleted rows. Depending on your
       | settings, the vacuum can be a pretty expensive job and even with
       | the vacuum, you can still wind up using a lot more space than you
       | actually need.
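        | 
        | A quick way to watch this happen (a sketch; note that the
        | pg_stat counters are updated asynchronously, so they can lag
        | by a moment):
        | 
        |     CREATE TABLE counters (id int PRIMARY KEY, n int);
        |     INSERT INTO counters VALUES (1, 0);
        |     -- the old version of the row is now a "dead" tuple
        |     UPDATE counters SET n = n + 1;
        | 
        |     SELECT n_live_tup, n_dead_tup
        |     FROM pg_stat_user_tables
        |     WHERE relname = 'counters';
        | 
        |     -- reclaims the dead version's space for reuse
        |     VACUUM counters;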
       | 
       | Zheap addresses these problems by using a different approach.
       | When performing an update, instead of marking the old row as
       | deleted and inserting a new one, Postgres will replace the old
       | row with the new one and write a copy of the old row to a
       | separate file. This means the main table file doesn't need to be
       | larger than necessary to keep track of the dead rows.
       | 
       | Zheap does lead to lots of tricky scenarios. If for any reason
       | you need to access the old copy of the row, you have to fetch it
       | from the separate file. If the transaction that performed that
       | update is rolled back, you need to replace the new version of the
       | row with the old version of the row. This sounds straightforward,
       | but gets really tricky really fast. For example, what happens if
       | I have row X that takes up 1kb. I replace it with row Y that
       | takes up 500b. I then write row Z after row Y that takes up 500b.
       | If you want to rollback the original transaction, row X will no
       | longer fit in its original spot because row Z is now taking up
       | part of the space it used to occupy.
        
         | comboy wrote:
         | > row X that takes up 1kb. I replace it with row Y that takes
         | up 500b. I then write row Z after row Y that takes up 500b.
         | 
          | I thought there was a fixed amount of space allocated per row
          | and that variable-length blobs were stored elsewhere. Is my
          | understanding wrong?
        
           | eloff wrote:
           | Yes. Rows are variable length. Think null values, strings,
           | etc.
           | 
            | Only really large values that don't fit into a page are
            | stored elsewhere (TOAST), and only if they can't be
            | compressed and made to fit. I think the threshold is around
            | 2kb, a quarter of a page. Pages are 8kb.
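            | 
            | A rough way to see it (a sketch; the sizes are
            | approximate, and highly compressible values may stay
            | in-line):
            | 
            |     CREATE TABLE toast_demo (id int, payload text);
            |     -- ~1 MB of hard-to-compress text gets pushed out
            |     -- of line into the table's TOAST relation
            |     INSERT INTO toast_demo
            |     SELECT 1, string_agg(md5(random()::text), '')
            |     FROM generate_series(1, 30000);
            | 
            |     SELECT
            |       pg_size_pretty(
            |         pg_relation_size('toast_demo')) AS main_heap,
            |       pg_size_pretty(
            |         pg_total_relation_size('toast_demo')) AS total;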
        
         | pbalau wrote:
         | > I have row X that takes up 1kb. I replace it with row Y that
         | takes up 500b. I then write row Z after row Y that takes up
         | 500b. If you want to rollback the original transaction, row X
         | will no longer fit in its original spot because row Z is now
         | taking up part of the space it used to occupy.
         | 
          | But this can only happen if replacing X with Y and adding Z
          | are in the same transaction. If you roll back the transaction,
          | then you start by removing Z, then replacing Y with X. What am
          | I missing here?
        
         | aidos wrote:
         | Oof. That sounds super hairy. Don't you run into all sorts of
         | weirdness with things like bitmaps expecting to find a specific
         | version of a tuple at a specific location in the heap?
        
         | castorp wrote:
         | > Zheap does lead to lots of tricky scenarios. If for any
         | reason you need to access the old copy of the row, you have to
         | fetch it from the separate file. If the transaction that
         | performed that update is rolled back, you need to replace the
         | new version of the row with the old version of the row. This
         | sounds straightforward, but gets really tricky really fast.
         | 
         | This is pretty much what Oracle is doing.
        
           | takeda wrote:
            | EnterpriseDB (which started this project) specializes in
            | providing a PostgreSQL version that can emulate most of
            | Oracle's database functionality, so this is not surprising.
        
         | stingraycharles wrote:
          | Thanks for this. So I'm fairly familiar with how MVCC relates to
         | CoW, and the way that I read this is that it's kind of the
         | inverse of CoW: it's an in-place update and a copy of the old
         | data.
         | 
         | I can completely imagine the corner cases this ends up
         | touching, as there must be an insane amount of logic in
         | Postgres that assumes immutability of data on disk: if I read
         | data block X for transaction version 1234, it will always
         | remain the same.
         | 
         | It's a very interesting approach though. Once all the corner
            | cases are solved and dealt with, I wonder: are there any reasons
         | _not_ to choose zheap over the standard storage engine?
        
           | simcop2387 wrote:
            | The CoW approach likely plays better with flash-based storage
            | since rewrites happen less often. So if you've got a lot of
            | otherwise-spare storage then maybe it's a good idea. Though
            | if you run on top of something like f2fs, maybe it'll be
            | better in that respect than PG is on its own.
        
           | anarazel wrote:
           | > I can completely imagine the corner cases this ends up
           | touching, as there must be an insane amount of logic in
           | Postgres that assumes immutability of data on disk: if I read
           | data block X for transaction version 1234, it will always
           | remain the same.
           | 
           | Most of the code having such low level assumptions has been
           | centralized for PG12, as part of the introduction of
           | pluggable table storage. Also note that that assumption as
           | stated isn't actually true - we don't expect on-disk data to
           | be immutable. There is on-access "pruning" of dead rows,
           | there's VACUUM, there's new row versions on the same page
           | etc.
        
           | macdice wrote:
            | There are probably some cases where traditional heap would be
            | better than (ideal, finished) zheap. Some ideas: Crash
           | recovery is faster without an UNDO phase. Accessing data that
           | has been modified many times since you took your snapshot may
           | be slower if you have to walk back in time through multiple
           | hops to find the right version of each tuple (compared to the
           | regular heap, which also has to step over lots of dead tuples
           | in that case but can do so sequentially). There may be space
           | management problems when there are too many active
           | transactions interested in a page. There may be microscopic
           | overheads that show up in some workloads that don't benefit
           | from updating in place, perhaps because they are
           | insert/delete-only.
           | 
           | I guess both options would be good to have, and it's
           | interesting that recent SQL Server can do it both ways at the
           | user's option (see "Accelerated Database Recovery", something
           | that sounds conceptually a bit like PostgreSQL vacuum instead
           | of ARIES style rollback where you have to undo changes).
        
         | abhinav22 wrote:
          | Thanks for the explanation. In layman's terms, how big is the
          | benefit of separating the old rows from the main table vs the
          | cost of accessing them from a separate file?
          | 
          | Just curious why it's being done now, as it sounds like a
          | major design decision that would have been considered a long
          | time ago.
        
           | jasonwatkinspdx wrote:
           | It's a complex tradeoff. It depends on the specific workload
           | which scheme is better. There's no clear winner. This is why
           | it's being introduced as a pluggable choice vs a wholesale
           | change.
           | 
           | Andy Pavlo's class has a lot of good general information on
           | the topic: https://15721.courses.cs.cmu.edu/spring2020/slides
           | /03-mvcc1....
        
           | malisper wrote:
            | Postgres 12 introduced the pluggable storage API (table
            | access methods), which allows multiple different storage
            | engines to be used, with Zheap being one of the new ones.
            | 
            | IIRC, one of the big reasons for implementing pluggable
            | storage was Zheap.
           | 
           | FWIW, the existing implementation has worked really well for
           | most use cases.
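            | 
            | With that API, picking a storage engine looks roughly like
            | this (a sketch; it assumes a server build that includes
            | the zheap access method):
            | 
            |     -- per-table choice of table access method
            |     CREATE TABLE t (id int, v text) USING zheap;
            | 
            |     -- or make it the default for new tables
            |     SET default_table_access_method = 'zheap';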
        
         | arthurcolle wrote:
         | Have you stress tested it? How does this end up faring with
         | large scale deployments?
        
           | _bohm wrote:
           | If you look at the most recent status update on the project:
           | 
           | > 12-10-2020: Most regression tests are passing, but write-
           | speeds are still low.
        
         | magicalhippo wrote:
         | > replace the old row with the new one and write a copy of the
         | old row to a separate file
         | 
         | Does it do this in the reverse order? If not, how does it
          | prevent data loss in case the main table update succeeds but
         | writing to the "old copy" table fails?
        
           | macdice wrote:
           | Well, conceptually it happens atomically. The change is
           | described in a single WAL (write ahead log) record, and data
           | in the buffers representing the table and undo log isn't
           | allowed to touch the disk until the WAL is on disk, so in any
           | crash/restart scenario that leaves the job half done, we'll
           | simply do it again when we replay all changes since the last
           | checkpoint. A related question is: what happens to the
           | uncommitted change after that? The system recognises that the
           | change belongs to an uncommitted transaction, and rolls it
           | back, which in this case means that we'll copy the old copy
           | back into the main table before allowing any user to access
           | it, and free up the space in the undo log. This is
           | approximately how most other RDBMSes work.
        
             | magicalhippo wrote:
             | Ah of course, how silly of me to forget about the log.
             | Thank you, that makes perfect sense.
        
               | jasonwatkinspdx wrote:
               | It's more complex than the above. The key constraint is
               | actually that the transaction is not allowed to return as
               | committed until the WAL is stable on disk. Whether the
               | writes it did to various pages are allowed before that
               | point or held until after is an implementation choice,
               | and different databases make different choices. It gets
               | complex fast, but the short summary is that the recovery
               | protocols understand how to reconstruct the state after
               | the fact, redo what needs to be redone, and undo what
               | needs to be undone.
        
           | malisper wrote:
           | > Does it do this in the reverse order?
           | 
           | I would guess so, but I haven't looked up the implementation
           | so I'm not sure. There's a ton of race conditions that can
           | come up depending on the exact order you write things so I'm
           | sure the actual implementation is pretty messy.
        
       | AdamProut wrote:
       | I wonder how this approach will compare with an LSM tree style
       | index that has gained popularity recently (mostly via RocksDB).
       | LSM trees are also write optimized and allow for some
       | optimizations by batching up updates and merging them in bulk.
        
       | fanf2 wrote:
       | There is a super informative blog post from last week at
       | https://www.cybertec-postgresql.com/en/zheap-reinvented-post... -
       | discussed at https://news.ycombinator.com/item?id=24710765
        
       ___________________________________________________________________
       (page generated 2020-10-12 23:00 UTC)