[HN Gopher] An unexpected find that freed 20GB of unused index s... ___________________________________________________________________ An unexpected find that freed 20GB of unused index space in PostgreSQL Author : haki Score : 285 points Date : 2021-02-01 14:17 UTC (8 hours ago) (HTM) web link (hakibenita.com) (TXT) w3m dump (hakibenita.com) | tantalor wrote: | Graphing "free storage" is meaningless and confusing; it should | be "used storage". | | Available storage depends on usage and capacity. | | Edit: I meant for this article; of course I believe it is useful | to track this in practice. | AnotherGoodName wrote: | Used makes sense for getting a feeling for pure performance | (smaller the better, as it's likely to be in memory). | | Available makes sense for knowing when things will just plain | break (reaching 0 = write failure for a DB). | | >Every few months we get an alert from our database monitoring | to warn us that we are about to run out of space. | | In this case they were avoiding their DB server breaking. They | didn't do this for performance reasons. | zufallsheld wrote: | If you pay for a fixed amount of storage and only use it | partially, you should also monitor free storage so you know | when you waste it. | fanf2 wrote: | Free storage is what matters because it makes it very obvious | when you are getting close to a disk-full outage. | brianberns wrote: | > REINDEX INDEX CONCURRENTLY index_name; | | > If for some reason you had to stop the rebuild in the middle, | the new index will not be dropped. Instead, it will be left in an | invalid state and consume space. | | Well, that sure sounds like a bug in PostgreSQL to me. | striking wrote: | Well, you can't just _delete_ it. It is an object that was | created by some user and there's no good reason for the | database to get rid of it automatically. The database keeps a | record of the invalid thing, even though it is invalid.
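The invalid leftover index described in that subthread can be found through the system catalogs and dropped by hand. A minimal sketch, assuming PostgreSQL 12+ (the dropped index name is illustrative; CONCURRENTLY-built leftovers typically carry a `_ccnew` suffix):

```sql
-- List indexes left INVALID by an interrupted CREATE INDEX / REINDEX
-- ... CONCURRENTLY, along with the space they consume
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- Drop a leftover without blocking writes (name is illustrative)
DROP INDEX CONCURRENTLY my_table_my_col_idx_ccnew;
```

Using DROP INDEX CONCURRENTLY avoids taking the exclusive lock a plain DROP INDEX would need on the parent table.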
| brianberns wrote: | The good reason to get rid of it automatically: It takes up | space. | | Is there any good reason to keep it? (The fact that it was | "created by some user" doesn't seem like much of a reason.) | | IMHO, creating an index should be atomic: Either you end up | with a valid index, or you end up with nothing. | voganmother42 wrote: | pretty well documented behavior as far as concurrent: | https://www.postgresql.org/docs/current/sql-createindex.html... | ivoras wrote: | Is the partial index technique to avoid indexed NULL data as | effective for PostgreSQL 13+? | | It looks like in v13+ PostgreSQL could create a single leaf for | NULL data and just store row pointers within it, which should | reduce data sizes at least a bit. | chrismeller wrote: | He actually mentioned index de-duplication earlier: | https://hakibenita.com/postgresql-unused-index-size#activati... | | If I had to guess, I would say that it doesn't accomplish | anything (or as much as you'd think) for null values simply | because there is no real data to store in either approach, you | just have a bunch of pointers either way. | petergeoghegan wrote: | NULL values are not special as far as deduplication is | concerned. They use approximately as much disk space as a | non-NULL integer column without deduplication, and compress | just as well with deduplication. Deduplication is effective | because it eliminates per-tuple overhead, so you see most of | the benefits even with index tuples that naturally happen to | have physically small keys. You'll still get up to a 3x | decrease in storage overhead for the index provided there is | low cardinality data (and not necessarily that low | cardinality, ~10 or so tuples per distinct value will get you | there). | | The NULL issue is documented directly -- see the "Note" box | here: | | https://www.postgresql.org/docs/devel/btree- | implementation.h... | mattashii wrote: | Not per se _as effective_, but it will still help a lot. 
NULL | tuples pre-pg13 take ~ 14 bytes each, and 18 bytes when | aligned. (= 2 (ItemID, location on page) + 6 (TID) + 2 (t_info) | + 4 (NULL bitmap) + 4 bytes alignment). When deduplication is | enabled for your index, then your expected tuple size becomes | just a bit more than 6 bytes (~ 50 TIDs* in one tuple => 2 | (ItemId) + 6 (alt tid) + 2 (t_info) + 4 (null bitmap) + 50 * 6 | (heap TIDs) / 50 => ~ 6.28 bytes/tuple). | | So, deduplication saves some 65% in index size for NULL-only | index-tuples, and the further 35% can be saved by using a | partial index (so, in this case, deduplication could have saved | 13GB). | | *note: last time I checked, REINDEX with deduplication enabled | packs 50 duplicates in one compressed index tuple. This varies | for naturally grown indexes, and changes with column types and | update access patterns. | [deleted] | mulander wrote: | Partial indexes are amazing but you have to keep in mind some | peculiarities. | | If your query doesn't contain a proper match with the WHERE | clause of the index - the index will not be used. It is easy to | forget about it or to get it wrong in subtle ways. Here is an | example from work. | | There was an event tracing structure which contained the event | severity_id. Id values 0-6 inclusive are user facing events. | Severity 7 and up are debug events. In practice all debug events | were 7 and there were no other values above 7. This table had a | partial index with WHERE severity_id < 7. I tracked down a | performance regression, when an ORM (due to programmer error) | generated WHERE severity_id != 7. The database is obviously not | able to tell that there will never be any values above 7, so the | index was not used, slowing down event handling. Turning the query | to match < 7 fixed the problem. The database might also not be | able to infer that the index can indeed be used, for example when | prepared statements are involved: WHERE severity_id < ?.
The | database will not be able to tell that all bindings of ? will | satisfy < 7, so it will not use the index (unless you are running PG | 12, in which case that might depend on the setting of plan_cache_mode[1], | but I have not tested that yet). | | Another thing is that HOT updates in PostgreSQL can't be | performed if the updated field is indexed, and that also includes | being part of a WHERE clause in a partial index. So you could | have a site like HN and think that it would be nice to index | stories WHERE vote > 100 to quickly find more popular stories. | That index however would nullify the possibility of a HOT update | whenever the vote tally was updated. Again, not a problem, but | you need to know the possible drawbacks. | | That said, they are great when used for the right purpose. Kudos | to the author for a nice article! | | [1] - https://postgresqlco.nf/doc/en/param/plan_cache_mode/ | GordonS wrote: | > The database is obviously not able to tell that there will | never be any values above 7 | | You say "obviously", but with updated statistics this is exactly the | kind of thing you might expect the planner to know | and use to aid index decisions. | | I'm a huge fan of Postgres, coming to it around 5 years ago | after at least 10 previous years with SQL Server, but I have hit | a few things like this in that time. IME the planner is much | more fickle about how you specify your predicates than SQL | Server is. | londons_explore wrote: | All statistics in postgres are considered best-effort | guidance. Even if the statistics are wrong, they can never | impact the correctness of the results. | mulander wrote: | No, I don't think statistics can let you get away with this. | Databases are concurrent, you can't guarantee that a | different session will not insert a record that invalidates | your current statistics. | | You could argue that it should be able to use it if the table | has a check constraint preventing severity_id above 7 from ever | being inserted.
That is something that could be done; I don't | know if PostgreSQL does it (I doubt it) or how feasible it | would be. | | Is SQL Server able to make an assumption like that purely | based on statistics? Genuine question. | GordonS wrote: | > No, I don't think statistics can let you get away with | this. Databases are concurrent, you can't guarantee that a | different session will not insert a record that invalidates | your current statistics. | | Of course you can't make a guarantee like that, but why | would you need to? Statistics are there to guide planner | choices, not make cast-iron predictions. | cmeacham98 wrote: | Because the database has to decide whether or not to use | the index. If it decides to use the index, and there | _are_ values above 7, then it will misbehave (the query | will miss those results). Now of course the database | could then scan the rows for values above 7 it missed, but | at that point there's no point in using the index and | you might as well have row scanned for the original | query. | | As a result, the database has to be 100% sure that there | are no values _at all_ above 7 to safely and efficiently | use the index, e.g. when there's a constraint. | mulander wrote: | Let us say that in our example table we have 100 000 | records with severity_id < 7, 200 000 with severity_id = 7 | and 3 records with severity_id = 8. | | Statistics claim 100k id < 7, 200k id = 7 and 0 with id > | 7. The last 3 updates could have happened right before | our query; the statistics didn't update yet. | | Let us assume that we blindly trust the statistics and | they currently state that there are absolutely no values | with severity_id > 7, and you have a query WHERE | severity_id != 7 and a partial index on severity_id < 7. | | If you trust the statistics and actually use the index, | the rows containing severity_id = 8 will never be | returned by the query even if they exist.
So by using the | index you only scan 100k rows and never touch the | remaining ~200k. However this query can't be answered | without scanning all ~300k records. This means that on | the same database you would get two different results for | the exact same query if you decided to drop the index | after the first run. The database can't fall back and | change the plan during execution. | | Perhaps I misunderstood you originally. I thought you | suggested that the database should be able to know that | it can still use the index because currently the | statistics claim that there are no records that would | make the result incorrect. You are of course correct | that the statistics are there to guide the planner | choices, and that is how they are used within PostgreSQL - | however some plans will give different results if your | assumptions about the data are wrong. | mattb314 wrote: | I doubt it? At least the number of times the "last updated" | column appears in SQL Server stats [1] leads me to believe | it collects stats asynchronously with updates to the table. | | The only system I've heard of that relies on up-to-date | statistics for correctness is Snowflake (long but | interesting talk here [2]), where having accurate max/mins | for each micro-partition is really helpful for cutting down | the amount of data in the large range scan queries common | in BI. I'd guess that being a BI system, Snowflake can get | away with higher row update latency too. | | [1] https://www.sqlshack.com/sql-server-statistics-and-how-to-pe... | | [2] https://www.youtube.com/watch?v=CPWn1SZUZqE | lucian1900 wrote: | Partial indexes can be useful in any case where one value has | much higher cardinality than others. | | Indexing boolean columns is often only useful if one of the | values is uncommon and the index is partial to only include those | uncommon rows. | mnw21cam wrote: | Agreed. To explain why this is the case, consider that table in | the story that had 99% NULL values.
If you were to try to run | "SELECT FROM table WHERE column IS NULL", then PostgreSQL | wouldn't use the index anyway, because it would be faster to | just read sequentially through the entire table and filter out | the 1% that don't match. | maweki wrote: | That would highly depend on what you select. If the query | could be answered by the index only, like COUNT(*), it would | probably use the index. You are right if you want to query | any data from that row that's not in the index. | mnw21cam wrote: | I might be out of touch a little with Postgres (I last used | it in 2010), but my impression was that COUNT(*) still | needed to scan the actual table in order to exclude rows | that had been deleted in a transaction, due to the way | multi-version concurrency worked. Is this something that | has been improved since then? | zeroimpl wrote: | They support index-only scans now, so there is some sort | of optimization which bypasses the table lookup, at least | in certain cases. | mjw1007 wrote: | Summary: if you have an index on a column which is mostly NULL, | consider using a partial index covering only the records where | it's non-NULL. | latch wrote: | Another benefit of partial indexes is to limit a unique | constraint: | | create unique index users_email on users(email) where status != | 'delete' | SomeHacker44 wrote: | Be very careful, then, as the optimizer will (usually?) not | use the index if the condition is not part of the query. | jackTheMan wrote: | you can create views for that, then it will always be part | of the query. | malinens wrote: | Too bad MySQL does not have partial indexes. | | We have one huge table I want to add some indexes to for specific | cases (for max 1% of records) but the server will not have enough | memory for it if I add those indexes for all records :/ | fraktl wrote: | MySQL has pluggable storage engines. TokuDB does what you're | after (adds indexes on the fly, as well as altering tables on the | fly without overloading the server).
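The mostly-NULL pattern from the article and the predicate-matching caveat from mulander's subthread can be sketched together in PostgreSQL. All table, column, and index names here are illustrative, not taken from the article:

```sql
-- A nullable column that is NULL for the vast majority of rows
CREATE TABLE orders (
    id          bigserial PRIMARY KEY,
    canceled_at timestamptz  -- NULL for ~99% of rows
);

-- Partial index: only the rare non-NULL rows take up index space
CREATE INDEX orders_canceled_at_ix ON orders (canceled_at)
    WHERE canceled_at IS NOT NULL;

-- Used: "canceled_at > ..." implies "canceled_at IS NOT NULL"
EXPLAIN SELECT * FROM orders
WHERE canceled_at > now() - interval '1 day';

-- Not used: this predicate does not imply the index's WHERE clause,
-- so the planner falls back to scanning the table
EXPLAIN SELECT * FROM orders WHERE canceled_at IS NULL;

-- latch's variant: a partial UNIQUE index, enforcing uniqueness only
-- among live rows (assumes a hypothetical users(email, status) table)
CREATE UNIQUE INDEX users_email_live_ix ON users (email)
    WHERE status != 'deleted';
```

As SomeHacker44 notes, queries must repeat (or imply) the index's WHERE clause, or the planner will ignore the partial index.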
| wolf550e wrote: | Altering a table online without using pt-online-schema-change | doesn't help if they want an index that covers only some of | the keys but not others. | hu3 wrote: | This page about TokuDB reads: | | > TokuDB has been deprecated by its upstream maintainer. It | is disabled from MariaDB 10.5 and has been removed in | MariaDB 10.6 - MDEV-19780. We recommend MyRocks as a long-term | migration path. | | https://mariadb.com/kb/en/tokudb/ | | Is MyRocks comparable? | crazygringo wrote: | As long as you've got primary keys on the huge table, there's a | hacky solution -- create a second table with columns for just | the first table's primary key and the columns you're indexing, | plus your desired index, and ensure you always | write/update/delete both tables simultaneously using | transactions. Then when needed, use the index on the second | table and join it to your first with the primary key. | | Annoying, but it should work for most queries, I'd expect, | without too much SQL. | | I've definitely "rolled my own indexing" like this in the past, | though it's more often been duplicating strings into a custom | "collation" or other transformations. | | Another solution is simply to split your table in two, with the | same columns in both, and the index only on one of the tables. | But of course that really depends on your business logic -- | queries that need to retrieve data from both tables together | can get pretty hairy/slow, and if you've got auto-incrementing | PKEYs then avoiding collisions between the two tables can be | tricky on its own. So this is definitely the less general | solution. | | Of course it certainly would be nicer if MySQL supported partial | indexes. It seems so useful, I'm surprised it didn't happen | long ago. | lucian1900 wrote: | The first approach is one of the steps towards normalising a | database. | crazygringo wrote: | Actually it's the opposite of database normalization. | | Normalizing removes data redundancy.
This adds data | redundancy. | | When I design a database structure, it's common to start | with the most normalized representation possible. And then | to denormalize the minimum necessary for performance | reasons -- duplicating rows and/or columns just like here | so certain data can be retrieved more quickly, whenever | indexes aren't powerful or featured enough. | fauigerzigerk wrote: | I think what lucian1900 may be thinking is that instead | of create table purchase_order ( id int primary key, | ordered_on timestamptz not null, customer_id int not null | references customer, canceled_on timestamptz ); | | you could have create table purchase_order ( | id int primary key, ordered_on timestamptz not null, | customer_id int not null references customer ); | create table order_cancelation ( order_id int | primary key references purchase_order, canceled_on | timestamptz not null ); | | This is indeed a better normalised schema and it allows | you to index order_cancelation.canceled_on without | worrying about nulls. | lucian1900 wrote: | Exactly, that's what I thought was being described. | crazygringo wrote: | Oh then, absolutely. I was assuming a constraint that | columns couldn't be removed from the original table. But | if you can, then your example is an even better solution. | abfan1127 wrote: | Could you create a temporary high-memory slave MySQL server, | sync it from the master, add the index, sync back to the master, and | decommission the temporary high-memory server? I don't know enough | about master/slave operations to know if it would work. | wolf550e wrote: | 'malinens doesn't have the storage space to store an index | over a column if all values in the column are indexed. They | want to index only some values, but not others. This feature | does not exist in MySQL. | matsemann wrote: | > _Clear bloat in tables_ | | Ohh, we've had issues with this.
We have this table that's mostly | ephemeral data, so rows are constantly inserted and then deleted | after a certain amount of time. Due to a bug the deletion didn't | work for a while and the db grew very large. Fixed the deletion, | but no amount of vacuuming actually allows us to fully reclaim | that space so we don't have to pay for it. | | At the same time, the extra cost is probably negligible compared | to spending more energy fixing it. | hinkley wrote: | The problem we always ran into with deletes is them triggering | full table scans because our indexes weren't set up correctly | to test foreign key constraints properly. Constant game of | whack-a-mole that everyone quickly grew tired of. Also, more | indexes increase the slope of the line for insert operations | as data size grows. | | Another solution is tombstoning data so you never actually do a | DELETE, and partial indexes go a long way to making that scale. | It removes the log(n) cost of all of the dead data on every | subsequent insert. | mulander wrote: | > The problem we always ran into with deletes is them | triggering full table scans because our indexes weren't set | up correctly to test foreign key constraints properly. | | This is a classic case where partitioning shines. Let's say | those are logs. You partition it monthly and want to retain 3 | months of data. | | - M1 - M2 - M3 | | When M4 arrives you drop partition M1. This is a very fast | operation and the space is returned to the OS. You also don't | need to vacuum after dropping it. When you arrive at M5 you | repeat the process by dropping M2. | | > Another solution is tombstoning data so you never actually | do a DELETE, and partial indexes go a long way to making that | scale. It removes the log(n) cost of all of the dead data on | every subsequent insert. | | If you are referring to PostgreSQL then this would actually | be worse than outright doing a DELETE.
PostgreSQL is copy-on-write, so an UPDATE to an is_deleted column will create a new | copy of the record and a new entry in all its indexes. The | old one would still need to be vacuumed. You will accumulate | bloat faster and vacuums will have more work to do. | Additionally, since is_deleted would be part of partial | indexes like you said, a deleted record would also incur a | copy in all indexes present on the table. | | Compare that to just doing the DELETE, which would just store | the transaction ID of the query that deleted the row in xmax, | and a subsequent vacuum would be able to mark it as reusable | by further inserts. | alexfromapex wrote: | It seems like this is an optimization that Postgres should handle | internally, doesn't it? | boomer918 wrote: | Partial indexes can flip query plans if the covered part becomes | so small that it won't be represented when sampled by the stats | collector. The planner could then decide that the index scan | isn't worth it and could try an alternative, less efficient index | if one exists. | tbrock wrote: | Yeah, and sadly using the index in those scenarios could be even | more worthwhile due to the high selectivity it has. | | Is PG smart enough to avoid that if the query patterns are | frequently or exclusively covered by the index? | gangstead wrote: | The included query for finding which indexes in your database | could benefit from a partial index is amazing. Thanks for putting | the extra effort into this post. | nieve wrote: | The article includes a couple of useful queries unrelated to the | "find" and led me to these useful bloat-detection resources: | https://wiki.postgresql.org/wiki/Show_database_bloat | https://github.com/ioguix/pgsql-bloat-estimation | pierrebai wrote: | The chart seems to show an uptick of 2GB, not 20GB. Am I missing | something? | pottertheotter wrote: | This has nothing to do with the content, but the design of this | page really stuck out to me.
It's very easy to read and doesn't | have fluff. But it still feels modern (in the good way). It's | perfectly balanced. | paxys wrote: | Agreed! It's a perfect example of how you can make a website | with "modern" features (responsive design, accessible, mobile | friendly, dark mode, static pages, embeds) without it being a | bloated mess. | de6u99er wrote: | When I did my Oracle DBA training 15 years ago, I learnt about | database reorgs. | | It basically means exporting your database (or tables) and | importing it again. What happens is that deleted data, which | doesn't necessarily free up space (Oracle sometimes reuses the | freed-up space), doesn't get exported. | | https://www.iri.com/blog/vldb-operations/database-reorgs-why... | | https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUEST... | brianwawok wrote: | A vacuum full basically does this for a table, copying the data | from location A to location B, cleaning up junk. I think index | rebuilding may take a separate command? | Tostino wrote: | Vacuum full does an index rebuild automatically. Since a | vacuum full builds an entire new heap table, the old indexes | are all pointing to the incorrect locations for all tuples, | so it has no choice but to rebuild. | brianwawok wrote: | Excellent, there you go. | | Is there a way to just do the index rebuild part, short of | dropping an index and adding it back? | jvolkman wrote: | I believe pg_repack can do that. | Jedd wrote: | This is described about a quarter of the way into TFA. | | > REINDEX INDEX CONCURRENTLY index_name; | | (A vanilla REINDEX will lock the table, preventing | writes, while it runs. The CONCURRENTLY variant creates a new | index, replicates any updates to the original while it | does so, and then does an atomic switcheroo at the end.)
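The rebuild flow discussed in this subthread, with a before-and-after size check; a sketch assuming PostgreSQL 12+ (which introduced REINDEX ... CONCURRENTLY) and an illustrative index name:

```sql
-- Index size before the rebuild
SELECT pg_size_pretty(pg_relation_size('orders_canceled_at_ix'));

-- Rebuild without blocking writes; a plain REINDEX would hold a lock
-- that blocks writes to the table for the duration
REINDEX INDEX CONCURRENTLY orders_canceled_at_ix;

-- Size after: bloat accumulated from dead tuples is gone
SELECT pg_size_pretty(pg_relation_size('orders_canceled_at_ix'));
```

If the concurrent rebuild is interrupted, the half-built replacement is left behind as an INVALID index and must be dropped by hand, as discussed earlier in the thread.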
| jeffbee wrote: | Dumping and reloading databases used to be mandatory for major | PostgreSQL updates, which is one of the reasons PostgreSQL | wasn't suitable for production workloads until recently, and | also why it resisted fixing bugs in vacuum, index, and | compaction for many years. | AbacusAvenger wrote: | Whoah, that's news to me. | | I used PostgreSQL fairly recently (a year or so ago?) and | ended up abandoning it after I was forced to do the | export/import dance through a few version upgrades. | | When did that requirement go away? | dialamac wrote: | It never did. | | The difference is that you can use logical replication | since 10 to prevent downtime during upgrade. | | Which, if you were using it a year ago, could have been done. | jskrablin wrote: | Ever since pg_upgrade became available -- since 8.4 or so: | https://www.percona.com/blog/2019/04/12/fast-upgrade-of-lega... | | Dump and reload is ok if you have a small database or can | afford hours of downtime... if not, use pg_upgrade. | Uberphallus wrote: | Since 9 there's pg_upgrade; personally I never had an issue | and it was very fast, so the downtime is on the order of a | few minutes, which is ok for my use case. YMMV. | avereveard wrote: | "pg_upgrade does its best to make sure the old and new | clusters are binary-compatible, e.g., by checking for | compatible compile-time settings, including 32/64-bit | binaries. It is important that any external modules are | also binary compatible, though this cannot be checked by | pg_upgrade." | | This makes me very nervous though; I've got at least two extensions | (trigrams and gists) and maybe they work, maybe not. I just | prefer the peace of mind of an old-fashioned dump. ___________________________________________________________________ (page generated 2021-02-01 23:01 UTC)