[HN Gopher] Zheap - Reinvented PostgreSQL Storage
___________________________________________________________________
Zheap - Reinvented PostgreSQL Storage
Author : PhilipTrauner
Score : 117 points
Date : 2020-10-12 18:00 UTC (5 hours ago)
(HTM) web link (cybertec-postgresql.github.io)
(TXT) w3m dump (cybertec-postgresql.github.io)

| yayzheap wrote:
| This should solve one of the problems Uber had with PostgreSQL,
| reported by Christophe Pettus back in 2017 [1].
|
| [1] https://thebuild.com/presentations/uber-perconalive-2017.pdf
|
| Previous discussion from 2018:
| https://news.ycombinator.com/item?id=16526623

| teej wrote:
| This seems neat!
|
| Last year I helped a friend diagnose an issue they had with
| Postgres. A database table with a scheduled full DELETE/INSERT
| had slowed to the point of failure. It turned out that having
| slightly less I/O than needed led the auto-VACUUM process to get
| further and further behind each time it ran.
|
| My friend simply provisioned more I/O and moved on. Another
| option would be to rewrite the process to naturally produce
| fewer dead rows. It would be great to have a third feasible
| option.

| andreypopp wrote:
| Such workflows with scheduled DELETE/INSERT often mean that the
| data is "derived"; PostgreSQL has an UNLOGGED tables feature for
| that. Tables configured as UNLOGGED are not written to the WAL
| and thus generate much less I/O. The downside is that after a
| crash the table might be empty (until it is repopulated by the
| DELETE/INSERT workflow again).

| teej wrote:
| UNLOGGED tables are new to me, thanks for sharing.

| merb wrote:
| Well, there is TRUNCATE for that. (Not MVCC safe.)

| teej wrote:
| The table needed to never be visible as empty, so TRUNCATE
| wasn't a short-term option.

| takeda wrote:
| I solved this by using partitions (at the time it was done
| through inheritance, where you could make one table out of
| several smaller ones), and just dropping the table with the
| unneeded data.
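The failure mode teej describes - dead tuples accumulating faster than vacuum can reclaim them - can be illustrated with a toy model. This is a hypothetical sketch of MVCC bloat dynamics, not Postgres internals; the class, method names, and the budget/row numbers are all invented for illustration:

```python
# Toy model of a Postgres-style heap: DELETE only marks rows dead,
# and space is reclaimed later by a (budget-limited) vacuum pass.
class ToyHeap:
    def __init__(self):
        self.live = []   # visible row payloads
        self.dead = 0    # dead tuples awaiting vacuum

    def delete_all(self):
        # MVCC delete: rows stay on disk, just marked dead
        self.dead += len(self.live)
        self.live = []

    def insert(self, rows):
        self.live.extend(rows)

    def vacuum(self, budget):
        # vacuum reclaims at most `budget` dead tuples per pass
        reclaimed = min(budget, self.dead)
        self.dead -= reclaimed
        return reclaimed

    def disk_tuples(self):
        return len(self.live) + self.dead


heap = ToyHeap()
for cycle in range(10):
    heap.delete_all()
    heap.insert(range(1000))
    heap.vacuum(budget=600)  # I/O-starved: can't keep up with 1000 deletes/cycle

# Dead tuples grow by 400 per cycle after the first: the table bloats
# even though the live row count never exceeds 1000.
print(heap.disk_tuples())  # -> 4600
```

Provisioning more I/O corresponds to raising the vacuum budget above 1000, at which point the dead count stops growing between cycles.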
| malisper wrote:
| For those unaware, Zheap is a new storage engine for Postgres
| that handles updates and deletes in a different way. Currently,
| when you update a row in Postgres, Postgres creates a new copy
| of the row and marks the old one as deleted. This is done for
| several reasons: specifically, it makes handling concurrency
| easier and it makes rolling back transactions easier.
|
| The issue with this approach is that over time it leads to lots
| of "dead rows" - deleted rows that are taking up space in your
| DB. Postgres has a background job called the "vacuum" which
| effectively garbage collects all deleted rows. Depending on your
| settings, the vacuum can be a pretty expensive job, and even
| with the vacuum you can still wind up using a lot more space
| than you actually need.
|
| Zheap addresses these problems with a different approach. When
| performing an update, instead of marking the old row as deleted
| and inserting a new one, Postgres replaces the old row with the
| new one and writes a copy of the old row to a separate file.
| This means the main table file doesn't need to be larger than
| necessary to keep track of dead rows.
|
| Zheap does lead to lots of tricky scenarios. If for any reason
| you need to access the old copy of the row, you have to fetch it
| from the separate file. If the transaction that performed the
| update is rolled back, you need to replace the new version of
| the row with the old version. This sounds straightforward, but
| gets really tricky really fast. For example, suppose I have row
| X that takes up 1kb. I replace it with row Y that takes up 500b.
| I then write row Z after row Y that takes up 500b. If you want
| to roll back the original transaction, row X will no longer fit
| in its original spot because row Z is now taking up part of the
| space it used to occupy.

| comboy wrote:
| > row X that takes up 1kb. I replace it with row Y that takes up
| > 500b.
| > I then write row Z after row Y that takes up 500b.
|
| I thought there was a fixed amount of space allocated per row
| and varying-length blobs were stored elsewhere; is my
| understanding wrong?

| eloff wrote:
| Yes. Rows are variable length. Think null values, strings, etc.
|
| Only really large values that don't fit into a page are stored
| elsewhere, and only if they can't be compressed and made to fit.
| I think it's > 3kb, but I don't remember why that number comes
| to mind. Pages are 8kb.

| pbalau wrote:
| > I have row X that takes up 1kb. I replace it with row Y that
| > takes up 500b. I then write row Z after row Y that takes up
| > 500b. If you want to rollback the original transaction, row X
| > will no longer fit in its original spot because row Z is now
| > taking up part of the space it used to occupy.
|
| But this can only happen if replacing X with Y and adding Z are
| in the same transaction. If you roll back the transaction, you
| start by removing Z, then replacing Y with X. What am I missing
| here?

| aidos wrote:
| Oof. That sounds super hairy. Don't you run into all sorts of
| weirdness with things like bitmap scans expecting to find a
| specific version of a tuple at a specific location in the heap?

| castorp wrote:
| > Zheap does lead to lots of tricky scenarios. If for any reason
| > you need to access the old copy of the row, you have to fetch
| > it from the separate file. If the transaction that performed
| > that update is rolled back, you need to replace the new
| > version of the row with the old version of the row. This
| > sounds straightforward, but gets really tricky really fast.
|
| This is pretty much what Oracle is doing.

| takeda wrote:
| EnterpriseDB (which started this project) specializes in
| providing a PostgreSQL version that can emulate most of Oracle's
| database functionality, so this is not surprising.

| stingraycharles wrote:
| Thanks for this.
| So I'm fairly familiar with how MVCC relates to CoW, and the way
| that I read this is that it's kind of the inverse of CoW: it's
| an in-place update plus a copy of the old data.
|
| I can completely imagine the corner cases this ends up touching,
| as there must be an insane amount of logic in Postgres that
| assumes immutability of data on disk: if I read data block X for
| transaction version 1234, it will always remain the same.
|
| It's a very interesting approach though. Once all the corner
| cases are solved and dealt with, I wonder: are there any reasons
| _not_ to choose zheap over the standard storage engine?

| simcop2387 wrote:
| The CoW approach likely plays better with flash-based storage,
| since rewrites happen less often. So if you've got a lot of
| otherwise extra storage then maybe it's a good idea. Though if
| you run on top of something like f2fs, maybe it'll be better in
| that respect than PG is on its own.

| anarazel wrote:
| > I can completely imagine the corner cases this ends up
| > touching, as there must be an insane amount of logic in
| > Postgres that assumes immutability of data on disk: if I read
| > data block X for transaction version 1234, it will always
| > remain the same.
|
| Most of the code with such low-level assumptions was centralized
| for PG12, as part of the introduction of pluggable table
| storage. Also note that the assumption as stated isn't actually
| true - we don't expect on-disk data to be immutable. There is
| on-access "pruning" of dead rows, there's VACUUM, there are new
| row versions on the same page, etc.

| macdice wrote:
| There are probably some cases where the traditional heap would
| be better than (ideal, finished) zheap. Some ideas: crash
| recovery is faster without an UNDO phase.
| Accessing data that has been modified many times since you took
| your snapshot may be slower if you have to walk back in time
| through multiple hops to find the right version of each tuple
| (compared to the regular heap, which also has to step over lots
| of dead tuples in that case, but can do so sequentially). There
| may be space management problems when there are too many active
| transactions interested in a page. There may be microscopic
| overheads that show up in some workloads that don't benefit from
| updating in place, perhaps because they are insert/delete-only.
|
| I guess both options would be good to have, and it's interesting
| that recent SQL Server can do it both ways at the user's option
| (see "Accelerated Database Recovery", something that sounds
| conceptually a bit like PostgreSQL vacuum instead of ARIES-style
| rollback where you have to undo changes).

| abhinav22 wrote:
| Thanks for the explanation. In layman's terms, how big is the
| benefit of separating the old rows from the main table vs the
| cost of accessing them from a separate file?
|
| Just curious why it's being done now, as it sounds like a major
| design decision that would have been considered a long time ago.

| jasonwatkinspdx wrote:
| It's a complex tradeoff. Which scheme is better depends on the
| specific workload. There's no clear winner. This is why it's
| being introduced as a pluggable choice vs a wholesale change.
|
| Andy Pavlo's class has a lot of good general information on the
| topic:
| https://15721.courses.cs.cmu.edu/spring2020/slides/03-mvcc1....

| malisper wrote:
| In Postgres 12, the pluggable storage API was introduced. That
| allows the use of multiple different storage engines, with Zheap
| being one of the new ones.
|
| IIRC, one of the big reasons for implementing pluggable storage
| was Zheap.
|
| FWIW, the existing implementation has worked really well for
| most use cases.

| arthurcolle wrote:
| Have you stress tested it?
| How does this end up faring with large-scale deployments?

| _bohm wrote:
| If you look at the most recent status update on the project:
|
| > 12-10-2020: Most regression tests are passing, but write
| > speeds are still low.

| magicalhippo wrote:
| > replace the old row with the new one and write a copy of the
| > old row to a separate file
|
| Does it do this in the reverse order? If not, how does it
| prevent data loss in case the main table update succeeds but
| writing to the "old copy" table fails?

| macdice wrote:
| Well, conceptually it happens atomically. The change is
| described in a single WAL (write-ahead log) record, and the data
| in the buffers representing the table and undo log isn't allowed
| to touch the disk until the WAL is on disk, so in any
| crash/restart scenario that leaves the job half done, we'll
| simply do it again when we replay all changes since the last
| checkpoint. A related question is: what happens to the
| uncommitted change after that? The system recognises that the
| change belongs to an uncommitted transaction and rolls it back,
| which in this case means that we'll copy the old version back
| into the main table before allowing any user to access it, and
| free up the space in the undo log. This is approximately how
| most other RDBMSes work.

| magicalhippo wrote:
| Ah, of course, how silly of me to forget about the log. Thank
| you, that makes perfect sense.

| jasonwatkinspdx wrote:
| It's more complex than the above. The key constraint is actually
| that the transaction is not allowed to return as committed until
| the WAL is stable on disk. Whether the writes it did to various
| pages are allowed before that point or held until after is an
| implementation choice, and different databases make different
| choices. It gets complex fast, but the short summary is that the
| recovery protocols understand how to reconstruct the state after
| the fact, redo what needs to be redone, and undo what needs to
| be undone.
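The update/undo/rollback flow macdice describes can be sketched as a toy model. This is a hypothetical illustration of the ordering (WAL record first, then undo image, then in-place overwrite), not the actual zheap implementation; all names (`wal`, `undo`, `update`, `rollback`) are invented for the sketch:

```python
# Toy sketch of a zheap-style in-place update with an undo log and WAL.
# Hypothetical model only - real systems work at the page/buffer level.
wal = []    # write-ahead log: durable once flushed to disk
undo = {}   # undo log: old row images, keyed by (txid, slot)
table = {1: "old value"}


def update(txid, slot, new_value):
    # A single WAL record describes both the undo write and the
    # in-place change, so crash recovery replays them together.
    wal.append(("update", txid, slot, table[slot], new_value))
    undo[(txid, slot)] = table[slot]   # stash the old version first
    table[slot] = new_value            # then overwrite in place


def rollback(txid):
    # Copy the old images back into the table and free the undo space.
    for (t, slot), old in list(undo.items()):
        if t == txid:
            table[slot] = old
            del undo[(t, slot)]


update(txid=7, slot=1, new_value="new value")
rollback(7)
print(table[1])  # pre-update image restored: "old value"
```

The ordering matters: because the old image goes to the undo log before the table is overwritten, a rollback (or crash recovery's UNDO pass) always has the data it needs to restore the row.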
| malisper wrote:
| > Does it do this in the reverse order?
|
| I would guess so, but I haven't looked up the implementation so
| I'm not sure. There are a ton of race conditions that can come
| up depending on the exact order you write things, so I'm sure
| the actual implementation is pretty messy.

| AdamProut wrote:
| I wonder how this approach will compare with an LSM-tree-style
| index, which has gained popularity recently (mostly via
| RocksDB). LSM trees are also write-optimized and allow for some
| optimizations by batching up updates and merging them in bulk.

| fanf2 wrote:
| There is a super informative blog post from last week at
| https://www.cybertec-postgresql.com/en/zheap-reinvented-post...
| - discussed at https://news.ycombinator.com/item?id=24710765
___________________________________________________________________
(page generated 2020-10-12 23:00 UTC)