[HN Gopher] Index Merges vs. Composite Indexes in Postgres and M...
       ___________________________________________________________________
        
       Index Merges vs. Composite Indexes in Postgres and MySQL
        
       Author : Sirupsen
       Score  : 132 points
       Date   : 2022-11-27 19:02 UTC (9 hours ago)
        
 (HTM) web link (sirupsen.com)
 (TXT) w3m dump (sirupsen.com)
        
       | magicalhippo wrote:
       | A composite index can also be used for a partial index scan[1],
       | so if you're frequently looking for just int100 as well as the
       | int100,int1000 combination (as per the article's example), then a
       | composite index int100,int1000 can be used for queries just
       | filtering on int100.
       | 
       | The order of the columns in the composite index might matter
       | then. We got some nice savings by reordering columns in indexes
        | and changing the join order in queries (our DB wasn't _that_
        | smart), or adding "useless" filters to the where clause,
       | allowing us to consolidate multiple indexes into one composite
       | one.
       | 
        | [1]: I might be using the wrong terminology here; I'm not
        | talking about a partial index[2], but about using only the first
        | N dimensions of an M-dimensional index.
       | 
       | [2]: https://www.postgresql.org/docs/current/indexes-partial.html
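        | 
        | A minimal sketch of that leftmost-prefix behavior, using the
        | article's column names (the table and index names here are made
        | up):
        | 
        |     create index idx_int100_int1000 on t (int100, int1000);
        | 
        |     -- can use the composite index: filters on the leftmost column
        |     select * from t where int100 = 42;
        | 
        |     -- can use the composite index: filters on both columns
        |     select * from t where int100 = 42 and int1000 = 7;
        | 
        |     -- generally can't seek this index: the leading column is
        |     -- missing from the predicate
        |     select * from t where int1000 = 7;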
        
         | winrid wrote:
         | There are some DBs where the order of fields doesn't matter in
         | the compound index if you're just searching by one field (but
         | of course is less efficient). I think Oracle supports this.
        
           | chrsig wrote:
           | > (but of course is less efficient)
           | 
           | This means the order _does_ matter
        
             | winrid wrote:
             | Yes...?
             | 
              | Sorry, my comment is not semantically accurate enough for
              | you.
        
           | dspillett wrote:
            | All DBs can search an index by just one field if it is the
            | first field covered by the index; few can use an index if
            | your predicate concerns the second field and not the first.
           | 
            | IIRC Oracle does support this and calls it a skip-scan. If
            | the first column in the index is not very selective this can
            | be very efficient, but if the first column _is_ good in
            | terms of selectivity then a skip-scan will be very
            | inefficient and the DB might as well just scan the whole
            | index. A good choice of index usually has a fairly selective
            | column as its first, and unless storage space is very
            | cramped you are likely better off just defining an extra
            | index without the column that you want to leave out of some
            | queries. This means a skip-scan isn't helpful as often as
            | one might think, which is one of the reasons DB engine
            | maintainers give for not spending time implementing and
            | maintaining the feature.
           | 
            | There are some use cases where having the option to skip-
            | scan is definitely a significant bonus, but they are rare
            | enough that I understand why most engine maintainers have
            | not implemented it.
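            | 
            | Roughly the shape of a query where a skip-scan pays off (a
            | hypothetical schema, not from the article):
            | 
            |     -- leading column has only a handful of distinct values
            |     create index idx_region_cust on orders (region, customer_id);
            | 
            |     -- no predicate on region: a skip-scan probes the index
            |     -- once per distinct region value rather than scanning
            |     -- the whole index
            |     select * from orders where customer_id = 12345;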
        
             | winrid wrote:
             | Yes, I'm talking about skip-scan. I know pretty much all
             | DBs can use compound index predicates for querying :)
             | 
              | I can think of several cases where it would have been nice
              | to have; sometimes "good enough" for a particular read
              | pattern is better than additional write overhead, etc.
        
           | Sirupsen wrote:
           | MySQL can do skip-scans
           | 
            | https://dev.mysql.com/doc/refman/8.0/en/range-optimization.h...
        
             | winrid wrote:
             | Nice, I did not know MySQL supported this. Someday I'll
             | work with MySQL again, good to know.
        
           | magicalhippo wrote:
           | Good point, I reworded it to be more general. Can't recall
           | seeing our DB being that clever.
        
           | thom wrote:
           | Postgres is capable of doing this with some index types, but
           | generally more slowly than a B-tree index being asked about
           | its leftmost columns:
           | 
            | https://www.postgresql.org/docs/current/indexes-multicolumn....
           | 
           | Also worth noting that for very complex workloads that need
           | to support arbitrary subsets of equality matches over
           | columns, a Bloom filter might work best:
           | 
           | https://www.postgresql.org/docs/current/bloom.html
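            | 
            | A minimal sketch of the Bloom-filter option (table and
            | column names made up; note the bloom module only ships
            | operator classes for int4 and text):
            | 
            |     create extension if not exists bloom;
            |     create index t_bloom on t using bloom (a, b, c);
            | 
            |     -- equality filters on any subset of the indexed columns
            |     -- can use the single bloom index, with a recheck to
            |     -- discard false positives
            |     select * from t where c = 3 and a = 7;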
        
           | mattashii wrote:
            | This depends on the index type, but PostgreSQL also supports
           | queries based on arbitrary attributes of multi-attribute
           | indexes: both GIN and BRIN can be used for queries on any of
           | the indexed expressions.
           | 
            | Maybe eventually PG will also get index skip scans for
            | btrees, which could allow for arbitrary-column searches in
            | the btree; but we're not there yet by a large margin.
        
       | [deleted]
        
       | bawolff wrote:
        | Honestly I'm kind of surprised that they are even that close. I
        | wonder if this changes at scale when the intersection is larger.
        
         | Sirupsen wrote:
         | Ideally, I would add three graphs to the post:
         | 
         | (1) Table size on the x-axis, and time on the y-axis for index
         | merge vs composite index
         | 
         | (2) Number of columns on the x-axis, and time on the y-axis for
         | both
         | 
         | (3) Number of final matches on the x-axis, and time on the
         | y-axis for both
         | 
          | But I ran out of time and decided to test with a table size of
          | 10M rows and a ~100-row result set. In my experience that's a
          | decent representation of what you might be doing with a
          | relational database.
        
       | arynda wrote:
        | Comparison on ClickHouse [1]: it also runs in about 30-40ms,
        | however there's no indexing being used and this is a full-table
        | scan.
        | 
        |     create table if not exists test_table
        |     (
        |         id      UInt64,
        |         text1   String,
        |         text2   String,
        |         int1000 UInt64,
        |         int100  UInt64,
        |         int10   UInt64,
        |         int10_2 UInt64
        |     )
        |     engine = MergeTree()
        |     order by (id);
        | 
        |     insert into test_table
        |     with
        |         repeat('b', 1024) as one_kib,
        |         repeat('b', 255) as bytes_255
        |     select
        |         number as id,
        |         one_kib,
        |         bytes_255,
        |         rand() % 1000 as int1000,
        |         rand() % 100 as int100,
        |         rand() % 10 as int10,
        |         rand() % 10 as int10_2
        |     from numbers(10e6);
        | 
        |     > select count(*) from test_table
        |       where int1000 = 1 and int100 = 1;
        | 
        |     +-count()-+
        |     |    9949 |
        |     +---------+
        | 
        |     1 row in set. Elapsed: 0.034 sec. Processed 10.00 million
        |     rows, 160.00 MB (290.93 million rows/s., 4.65 GB/s.)
        | 
        | The same table but with 1B rows instead runs in ~1800ms:
        | 
        |     > select count(*) from test_table
        |       where int1000 = 1 and int100 = 1;
        | 
        |     +-count()-+
        |     |  999831 |
        |     +---------+
        | 
        |     1 row in set. Elapsed: 1.804 sec. Processed 1.00 billion
        |     rows, 16.00 GB (554.24 million rows/s., 8.87 GB/s.)
       | 
       | [1] Converted the table create and insert logic from here:
       | https://github.com/sirupsen/napkin-math/blob/master/newslett...
        
         | hodgesrm wrote:
         | > however there's no indexing being used and this is a full-
         | table scan.
         | 
          | That first statement about "no indexing being used" is not
          | quite correct if the query is run exactly as you show in your
          | nice example.
         | 
         | ClickHouse performs what is known as PREWHERE processing which
         | will effectively use the int1000 and int100 columns as indexes.
          | It scans those columns and knocks out any blocks (technically
          | granules, containing by default 8192 rows) that do not contain
          | values matching the filter conditions. It then performs a scan
          | on the remaining blocks to get the actual counts.
         | 
         | PREWHERE is effective because columns are compressed and scans
         | are fast. If there's any pattern to the filter columns (for
         | example monotonically increasing counters) or their values have
         | high cardinality PREWHERE processing will remove a large number
         | of blocks. This will make the rest of the scan far faster.
         | 
         | In your dataset it may not be especially efficient because you
         | use random values, which don't necessarily compress well, and
         | the values will appear in many blocks. It works much better in
         | real datasets where data are more correlated.
         | 
          | EDIT: PREWHERE is _much_ faster in cases where you are doing
          | more complex aggregation on many columns. Counts of course
          | don't need to scan any extra values, so it's not helpful in
          | this case.
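          | 
          | For illustration: ClickHouse moves cheap filter columns into
          | PREWHERE automatically, but it can also be written by hand
          | against the parent comment's table:
          | 
          |     select count(*)
          |     from test_table
          |     prewhere int1000 = 1 and int100 = 1;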
         | 
         | p.s. Scans are ridiculously fast.
        
           | arynda wrote:
            | > ClickHouse performs what is known as PREWHERE processing
            | 
            | > p.s. Scans are ridiculously fast.
            | 
            | Good point, I should have mentioned this is basically a
            | worst-case scenario for ClickHouse: the data layout is
            | totally random (the same approach as OP used in their
            | benchmark), so it can't utilize any granule pruning,
            | sorting, or skip indexing, yet it still achieves such
            | remarkable speeds.
        
             | hodgesrm wrote:
             | What's cool is that even in this case ClickHouse is still
             | stupid fast compared to most other databases. ;)
        
           | paulmd wrote:
           | > p.s. Scans are ridiculously fast.
           | 
           | this is really the lesson of SOLR. full-scan all the things,
           | aggregate as you go, broadcast disk IO to multiple listeners.
           | 
            | why do a bunch of 4K random IO when you could full-scan at
            | bus speed? yeah, you can make the 4K random IO super fast,
            | but that's not where hardware is going. Full scanning is
            | also scalable/clusterable, whereas an RDBMS caps out at one
            | machine and clustering is kinda ugly.
        
         | Sirupsen wrote:
         | Are you aware of a good write-up on how Clickhouse/other
         | columnar databases do the intersection?
        
           | twoodfin wrote:
           | One obvious way is to build a bitmap indexed by row position
           | for each filter. Both the "&" intersect and the final bit
           | count can be rocket fast on modern CPU vector units.
        
           | arynda wrote:
            | Not in particular, sorry. Most of the good content I've
            | found is on Altinity's blog [1] and Alibaba's technical
            | blogs [2][3]. These tend to be mostly focused on how the
            | data itself is stored and how to use ClickHouse, but don't
            | really dive into the specifics of how query processing is
            | performed.
           | 
           | [1] https://altinity.com/blog/
           | 
            | [2] https://www.alibabacloud.com/blog/clickhouse-kernel-analysis...
           | 
            | [3] https://www.alibabacloud.com/blog/clickhouse-analysis-of-the...
        
           | hodgesrm wrote:
           | ClickHouse uses a single primary key index, which matches the
           | sort order, plus skip indexes, which knock out blocks to
           | scan. Here's a writeup that explains skip indexes.
           | 
            | https://altinity.com/blog/clickhouse-black-magic-skipping-in...
           | 
           | You can also check out the following webinar, which explains
           | how ClickHouse indexes work in general. Here's a link to the
           | discussion of indexes.
           | 
           | https://youtu.be/1TGGCIr6dMY?t=1933
           | 
           | p.s. The blog article is missing some images that WordPress
           | seems to have lost but you'll still get the idea. (Should be
           | fixed shortly.)
           | 
           | Disclaimer: I work for Altinity
        
       | Lukas1994 wrote:
       | Good stuff! What's the size difference between the composite
       | index vs the two separate indices?
        
         | Sirupsen wrote:
         | I've added this to the article, thanks!
         | 
         | Composite index (int64, int64): ~70 MiB in Postgres, ~350 MiB
         | in MySQL
         | 
         | Single index (int64): ~70 MiB in Postgres, ~240 MiB in MySQL
         | 
          | If you assume the majority of an index consists of entries of
          | (int64, int64, int64), where the third number is some sort of
          | identifier for the record on the heap, you'd expect this to be
          | ~230 MiB. So Postgres does some compression very well here,
          | and MySQL seems to have a bit more overhead for its indexes.
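          | 
          | For anyone who wants to reproduce the measurement, one way to
          | read the sizes out of each database ('index_name' and
          | 'test_table' are placeholders):
          | 
          |     -- Postgres
          |     select pg_size_pretty(pg_relation_size('index_name'));
          | 
          |     -- MySQL/InnoDB: size is stored as a page count
          |     select index_name, stat_value * @@innodb_page_size
          |     from mysql.innodb_index_stats
          |     where table_name = 'test_table' and stat_name = 'size';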
        
           | libraryofbabel wrote:
           | As you mention in the article, this is small by modern
           | hardware standards and means the indexes can be entirely held
           | in memory. Curious how things look for much larger tables
           | where the index has to be read from disk?
        
           | fabian2k wrote:
           | Could be the deduplication newer Postgres versions have for
           | B-Tree indexes:
           | 
            | https://www.postgresql.org/docs/current/btree-implementation...
           | 
           | > 67.4.3. Deduplication
           | 
           | > A duplicate is a leaf page tuple (a tuple that points to a
           | table row) where all indexed key columns have values that
           | match corresponding column values from at least one other
           | leaf page tuple in the same index. Duplicate tuples are quite
           | common in practice. B-Tree indexes can use a special, space-
           | efficient representation for duplicates when an optional
           | technique is enabled: deduplication.
           | 
           | > Deduplication works by periodically merging groups of
           | duplicate tuples together, forming a single posting list
           | tuple for each group. The column key value(s) only appear
           | once in this representation. This is followed by a sorted
           | array of TIDs that point to rows in the table. This
           | significantly reduces the storage size of indexes where each
           | value (or each distinct combination of column values) appears
           | several times on average. The latency of queries can be
           | reduced significantly. Overall query throughput may increase
           | significantly. The overhead of routine index vacuuming may
           | also be reduced significantly.
        
             | Sirupsen wrote:
             | Excellent, thank you! I'll add that to the article.
        
               | fabian2k wrote:
               | This is a guess on my part, though I think a plausible
               | one. To verify you'd probably have to compare an index
               | with unique values to one with many identical ones.
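                | 
                | A rough way to test that guess in Postgres (made-up
                | table and index names):
                | 
                |     create table dedup_test as
                |     select g as uniq_val, g % 100 as dup_val
                |     from generate_series(1, 10000000) g;
                | 
                |     create index idx_uniq on dedup_test (uniq_val);
                |     create index idx_dup on dedup_test (dup_val);
                | 
                |     -- with deduplication, idx_dup should come out much
                |     -- smaller despite covering the same number of rows
                |     select relname, pg_size_pretty(pg_relation_size(oid))
                |     from pg_class
                |     where relname in ('idx_uniq', 'idx_dup');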
        
           | masklinn wrote:
            | Another thing which might be of interest: what if you use
            | covering indexes for the two single indexes? Is Postgres
            | able to do an index-only scan then, thanks to the coverage?
           | 
           | Or does it decide to use only one index and filter based on
           | the covered values?
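            | 
            | For reference, the covering variant would look something
            | like this (Postgres 11+ INCLUDE syntax, table and index
            | names made up):
            | 
            |     create index idx_int100 on t (int100) include (int1000);
            |     create index idx_int1000 on t (int1000) include (int100);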
        
       | pdhborges wrote:
        | For these 64-bit index entries we'd expect to have to scan
        | roughly:
        | 
        |     index_row_size × rows = 2 × 64 bit × 10^5 ≈ 1.5 MiB
        | 
        | Where do the 10^5 rows come from? With a composite index and a
        | point query, doesn't the database scan just the 100 returned
        | rows?
        
         | Sirupsen wrote:
         | You're absolutely right!
         | 
          | I forgot to move this around when I updated the article's
          | structure. This is only relevant when doing the index merge.
          | The article has been updated.
        
       ___________________________________________________________________
       (page generated 2022-11-28 05:00 UTC)