[HN Gopher] Citus 10 brings columnar compression to Postgres
___________________________________________________________________
 
Citus 10 brings columnar compression to Postgres
 
Author : whitepoplar
Score  : 133 points
Date   : 2021-03-06 18:16 UTC (4 hours ago)
 
(HTM) web link (www.citusdata.com)
(TXT) w3m dump (www.citusdata.com)
 
| nwatson wrote:
| Sensage/Addamark was too early to the columnar storage game in
| 2001-2003 ... https://en.wikipedia.org/wiki/Sensage.
| atombender wrote:
| Sybase/IQ (now SAP IQ, originally acquired from Expressway) was
| launched in 1995.
|
| There are other old column databases, such as kdb, but I
| believe Sybase/Expressway was the first commercial product.
| jacques_chester wrote:
| I think the claim here is that Citus added it to PostgreSQL.
|
| Depending on how one defines that claim, Greenplum may have
| been first.
|
| Disclosure: I work for VMware, which sponsors Greenplum
| development.
| clairegiordano wrote:
| Definitely not intending to take credit away from any other
| teams. There are so many good Postgres extensions and forks
| in this ecosystem.
|
| What we've done is add a Columnar storage feature into the
| Citus open source extension to Postgres, as part of Citus 10.
|
| One way to think of it: Citus 10 now gives Postgres users
| (those who are also using the Citus extension) a columnar
| compression option, for use on a single node and/or a
| distributed Citus cluster.
| jacques_chester wrote:
| Thanks. Regardless of who did what, columnar stores are
| non-trivial engineering and anyone who produces a
| production-ready one is worthy of admiration.
| asah wrote:
| Greenplum was founded in 2003, A/S was 2001.
| jacques_chester wrote:
| Again, my qualifier is "to PostgreSQL". I couldn't see such
| a connection on a brief skim of the A/S Wikipedia entry.
| gigatexal wrote:
| Great start! Keep in mind the limitations:
|
| What are the Limitations?
|
| These limitations are not set in stone, and we look forward to
| working on them in the future:
|
| * No UPDATE or DELETE support
| * No index support
| * No logical replication or logical decoding support
|
| See more limitations in the columnar README
| tpetry wrote:
| cstore_fdw had the same limitations, so even if they say these
| limitations may not persist forever, I am not very hopeful they
| really want to solve this problem.
|
| But they could solve it with just a little bit of work:
| * Create a hidden bitmap to store whether each "row" of the
| columnar table is still valid
| * When updating/deleting values, only set the bitmap entry to
| zero to indicate the value is no longer valid
| * Every updated "row" is added into a special chunk at the end
| of the table in uncompressed format
| * When vacuuming the table, the old rows with bitmap=0 are
| deleted and the values from the special chunk are merged
|
| So you would have the same performance if update/delete is never
| done, as the complete bitmap is filled with 1s. And every
| update/delete would make it just a little bit slower; since most
| often only old data is stored in compressed format, not many
| uncompressed values will accumulate. And a VACUUM FULL would
| optimize these old tables again.
| [deleted]
| jeff-davis wrote:
| Thank you for the suggestions! We are interested in
| UPDATE/DELETE.
|
| Can you describe your use case for columnar update/delete in a
| little more detail? Is it a few random updates, or bulk
| updates, or something else?
| BenoitP wrote:
| So are they using Parquet, ORC or Arrow under the hood; or do
| they have a custom format?
| mslot wrote:
| It is a custom format that was originally derived from ORC, but
| is very different at this point. For instance, all the metadata
| is kept in PostgreSQL catalog tables to make changes
| transactional.
| jfbaro wrote:
| I am glad to see the PostgreSQL ecosystem growing. It gets better
| by the day. Thanks to all the people involved!
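[The validity-bitmap scheme tpetry proposes for UPDATE/DELETE can be illustrated with a toy model. This is only a sketch of the idea as described in the comment, not how Citus or cstore_fdw is actually implemented; the class and method names are made up.]

```python
# Toy sketch of a deletion-bitmap scheme for an append-only columnar
# store: compressed "rows" are never rewritten in place. A delete only
# clears a validity bit; an update clears the bit and appends the new
# value to an uncompressed tail chunk; vacuum merges the survivors.

class ColumnarTable:
    def __init__(self, rows):
        self.compressed = list(rows)     # stand-in for compressed stripes
        self.valid = [True] * len(rows)  # hidden validity bitmap
        self.tail = []                   # uncompressed append-only chunk

    def delete(self, pred):
        # A delete only flips bits; no compressed data is rewritten.
        for i, row in enumerate(self.compressed):
            if self.valid[i] and pred(row):
                self.valid[i] = False

    def update(self, pred, fn):
        # An update is a delete plus an append to the tail chunk.
        for i, row in enumerate(self.compressed):
            if self.valid[i] and pred(row):
                self.valid[i] = False
                self.tail.append(fn(row))

    def scan(self):
        # Read path: compressed rows filtered by the bitmap, then the tail.
        live = [r for i, r in enumerate(self.compressed) if self.valid[i]]
        return live + self.tail

    def vacuum(self):
        # Drop bitmap=0 rows and fold the tail back in, restoring the
        # all-ones fast path tpetry describes.
        self.compressed = self.scan()
        self.valid = [True] * len(self.compressed)
        self.tail = []
```

[If no update/delete ever runs, `scan` pays only the cost of checking an all-ones bitmap, which matches the "same performance" claim above.]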
| znpy wrote:
| Since we're talking PostgreSQL:
|
| I recently started diving into PostgreSQL and it seems to me that
| there is a patchwork of HA solutions (with Patroni being the most
| feature-full) but no real multi-master solution for PostgreSQL
| released under an open source license.
|
| There's BDR (bi-directional replication) but apparently
| 2ndQuadrant pulled it back under a proprietary license, am I
| right?
|
| What's the current status of PostgreSQL multi-master HA?
| merb wrote:
| > i recently started diving into postgresql and it seems to me
| that there is a patchwork of HA solutions (with patroni being
| the most feature-full) but no real multi-master solution for
| postgresql released under an open source license.
|
| True multi-master is rarely needed, but there is Citus, which
| uses the GPL and can be run multi-master. And yes, Patroni is
| really, really awesome.
| znpy wrote:
| Thanks, I'll look into that!
| whitepoplar wrote:
| Not multi-master, but pg_auto_failover looks to be a very nice
| solution for HA, and one that is operationally simple to
| manage: https://github.com/citusdata/pg_auto_failover
| teej wrote:
| One of the gotchas of columnar storage (coming from Redshift) is
| that you lose all of the compression benefits if you have just
| one column that's fat or hard to compress.
|
| In Redshift, columns are stored in blocks. You want to fit
| roughly the same number of column values per block across all
| your columns. But if you have one column where only a small
| number of values fit in a block, the rest of the columns end up
| leaving most of the block space unused. The result is wasted disk
| space and poor query performance.
|
| This Postgres extension has similar-sounding storage ideas with
| stripes, but it's not clear to me if it suffers from the same
| issue.
|
| My first test to vet this would be a table with 50 columns of
| ints and one column of md5 hashes stored as varchar.
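[The "fat column" scenario teej describes can be put into rough numbers. This is a back-of-the-envelope sketch, not Redshift's actual storage logic: it assumes a 1 MiB block, 4-byte ints, roughly 33 bytes per stored md5 hex string (32 hex chars plus overhead, an illustrative figure), and that every column's block must cover the same row range.]

```python
# Back-of-the-envelope arithmetic for the block-packing concern: with
# a fixed-size block and per-row-range alignment across columns, the
# fat column caps how many rows each block can cover, leaving the
# narrow columns' blocks mostly empty.

BLOCK = 1 << 20  # assume a 1 MiB block

def rows_per_block(value_bytes):
    """How many fixed-width values of this size fit in one block."""
    return BLOCK // value_bytes

int_rows = rows_per_block(4)    # ~262k int values fit per block
md5_rows = rows_per_block(33)   # only ~32k md5-hex varchars per block

# If every block must cover the same row range, the md5 column
# dictates the range, so each int block is mostly unused:
utilization = md5_rows / int_rows  # fraction of an int block used
wasted = 1 - utilization           # ~88% of each int block is empty
```

[Under these assumptions the 50 int columns would each waste close to 88% of their block space, which is the effect teej's 50-ints-plus-md5 test table is designed to expose.]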
| mrits wrote:
| Several different compression algorithms would do very well for
| this, e.g. dictionary compression for low-cardinality wide
| columns. I'd recommend familiarizing yourself with a handful of
| different encodings for columnar data stores before you give up
| on the idea.
| georgewfraser wrote:
| Do you have a source for this, or a code sample that can
| demonstrate it? This would be an extremely naive implementation
| of columnar storage. There are some truly hard cases around
| long variable-length strings, but any halfway decent columnar
| storage engine should be able to handle columns with different
| widths.
| zten wrote:
| https://github.com/awslabs/amazon-redshift-
| utils/blob/master...
|
| The compression actually works as advertised, but that
| document outlines why you don't want the sortkey to skew
| from the other columns. The disk I/O balloons as it goes and
| fetches a larger number of blocks from the other columns that
| match what you're filtering in the sortkey.
| teej wrote:
| They call this issue out in the docs specifically when the
| fat column is a sort/distkey. rsbb in the Intermix Slack has
| studied this in the most detail I know of outside AWS, but
| they haven't published their book yet.
| jeff-davis wrote:
| In Citus columnar, stripes/chunks are variable in size. If you
| have one large column, it doesn't bloat the other columns.
| teej wrote:
| Cool! So if I select two columns with a predicate on the
| first, does it scan all stripes for the second column?
| jeff-davis wrote:
| Stripes aren't really "scanned". They are more of a logical
| concept that tracks where the physical data for each column
| is, and only fetches what it needs.
|
| If I understand what you are asking, let me restate: "can
| you apply a predicate first on column A before reading
| column B, so that you can avoid reading column B if the
| predicate on A doesn't match?".
|
| The answer is: "sometimes".
| Predicates match some rows and
| not others, so matching rows may be mixed in with non-
| matching rows, and it's not always possible to avoid the I/O
| to read column B. _However_, if the matching and non-
| matching rows happen to be separated (e.g. if your table is
| naturally in time order and you have a time-based
| predicate), then it's able to do this optimization. Please
| see the section on "Chunk Group Filtering".
| teej wrote:
| Thanks for the in-depth explanation! I look forward to
| exploring it more.
| brightball wrote:
| Awesome news! After seeing the benefits with the cstore
| extension, I assumed this was only a matter of time.
| georgewfraser wrote:
| Beware that simply adding a column-oriented storage engine to a
| row store like Postgres is not going to get you anywhere near the
| performance of a ground-up columnar system like Redshift or
| Snowflake. This paper explains why [1]. Short version: most of
| the benefits are in the column-oriented execution engine, which
| differs in every aspect of its implementation from a row-oriented
| execution engine.
|
| [1]
| https://stratos.seas.harvard.edu/files/stratos/files/columns...
| mslot wrote:
| Correct, though it depends on whether you are CPU-bound or
| I/O-bound. We see the latter a lot more often for large data
| sets.
|
| Columnar storage for PostgreSQL is especially relevant in cloud
| environments. Most database servers in the cloud use managed,
| network-attached disks because of durability, availability, and
| encryption-at-rest requirements. However, those do come with a
| performance penalty compared to local SSDs. The VMs also have
| IOPS and bandwidth limits, partly to manage capacity within the
| IaaS platform.
|
| If you can reduce the data size by 10x, then you are
| effectively increasing your disk bandwidth by that much as
| well. Moreover, you can keep more data in memory, so you will
| read much less data from disk, plus you'll only read the
| columns used by the query.
| Hence, you're likely to see speedups
| of more than 10x for some queries, even without column-
| oriented execution.
| glogla wrote:
| That's exactly what we've seen. I don't know exactly how Citus
| stores data, but we found a difference of 30x between
| gzipped Parquet and a "generic Oracle table".
|
| There's a huge difference between your analytical query
| scanning a full 30 GB or 1 GB (or maybe even half that or less
| if you need to scan just some columns).
| dang wrote:
| Some past threads on that paper:
|
| _The design and implementation of modern column-oriented
| database systems_ -
| https://news.ycombinator.com/item?id=18076547 - Sept 2018 (42
| comments)
|
| _Design and Implementation of Modern Column-Oriented Databases
| (2012) [pdf]_ - https://news.ycombinator.com/item?id=11803299 -
| May 2016 (9 comments)
| GordonS wrote:
| I agree that it's never going to be as effective as something
| built from the ground up as a column store. But the performance
| and storage improvements can still be immense over standard
| Postgres. And if you're already using Postgres, it means one
| less service to configure and maintain, and means you can
| easily combine standard relational data with columnar
| data. For a lot of folks, that's a huge win.
| wyck wrote:
| Amazon really missed the boat on Citus; a few more great
| acquisitions like this and Azure is going to look great (yes, I
| know it can still be self-installed on AWS).
| biggerfisch wrote:
| I'm curious to see how this compares in real life to TimescaleDB
| hypertables with compression - which, to me, reads as much the
| same thing. I'm wondering if Citus is bringing a lower-level
| implementation of the idea, possibly?
| LogicX wrote:
| Came here to say this - I was looking to see how this
| compression compares to TimescaleDB's stated 91% compression.
|
| https://docs.timescale.com/latest/using-timescaledb/compress...
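[The figures traded in this subthread (mslot's 10x size reduction, TimescaleDB's stated 91% compression) can be put on a common footing with two lines of arithmetic. A sketch only; the numbers are just the ones quoted in the thread, and the 100 MB/s disk figure is an assumed example.]

```python
# Making the I/O argument explicit: if compression shrinks the bytes
# scanned by some ratio, the effective (logical) scan bandwidth goes
# up by the same ratio, before any column-oriented execution wins.

def compression_ratio(percent_saved):
    """Convert a '91% compression' figure into an Nx size ratio."""
    return 1 / (1 - percent_saved / 100)

def effective_bandwidth(disk_mb_s, ratio):
    """Logical scan rate when the data on disk is `ratio`x smaller."""
    return disk_mb_s * ratio

ratio_91 = compression_ratio(91)           # ~11.1x smaller on disk
scan = effective_bandwidth(100, ratio_91)  # 100 MB/s disk -> ~1111 MB/s
```

[So "91% compression" and "10x reduction" are nearly the same claim: a 91% reduction is about an 11.1x ratio, and a 90% reduction is exactly 10x.]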
| atanasovskib wrote:
| There are a lot of differences that need to be taken into
| account before making a comparison:
|
| 1. TimescaleDB implements the compression at a higher level;
| the underlying table storage/access method remains the same
|
| 2. TimescaleDB doesn't compress the latest data, allowing you
| to keep fast writes and edits for recent data, while also
| letting you benefit from compression on row-based data
|
| 3. Although not currently available, it is possible to have a
| TimescaleDB hypertable with a column-based access method
|
| 4. Comparing would have to take into account the data model,
| access methods (types of queries), ingestion vs. query
| comparison (batch vs. real time), backfilling and editing, etc.
|
| I agree that this (Columnar) would be closer to Parquet.
| mslot wrote:
| It always depends on the data, but we've seen 92.5% and more:
| https://twitter.com/JeffMealo/status/1368030569557286915
| akulkarni wrote:
| (TimescaleDB person)
|
| TimescaleDB users have seen 98% (i.e., over 50x) compression
| rates in some real-world cases (e.g., for some IT
| monitoring datasets), but the compression ratio will definitely
| vary by dataset. (For example, a dataset of just 0s will
| compress even better! But that's probably not a realistic
| dataset :-) )
|
| The reality is that Citus and TimescaleDB [0][1] take very
| different approaches to columnar compression, which result
| in different usability and performance trade-offs. Ultimately,
| one should choose the right tool for their workload.
|
| (As an aside, if you have time-series data, no one has
| spent more time developing an awesome time-series
| experience on Postgres than the TimescaleDB team has :-) )
|
| Kudos to the Citus team for this launch! I love seeing how
| different members of the Postgres community keep pushing
| the state of the art.
|
| [0] Building columnar compression in a row-oriented
| database (https://blog.timescale.com/blog/building-
| columnar-compressio...)
|
| [1] Time-series compression algorithms, explained
| (https://blog.timescale.com/blog/time-series-compression-
| algo...)
| mslot wrote:
| The access method approach followed in Citus is indeed lower
| level and more generic, which means it can be used on both time
| series data and other types of data.
|
| For time series data, you can use built-in partitioning in
| PostgreSQL. It's not as easy to use as TimescaleDB, but
| pg_partman goes a long way, see:
| https://docs.citusdata.com/en/latest/use_cases/timeseries.ht...
|
| You can then use the columnar access method to compress old
| partitions (see the end of the doc), and use distributed tables
| to shard and parallelize queries and DML.
| bradleyjg wrote:
| This reads to me more like Parquet.
| simonw wrote:
| Reassuring to see big new features like this coming out after
| the Microsoft acquisition; a healthy sign that this open source
| product continues to see serious ongoing investment.
| ablekh wrote:
| Technically, the Citus _product_ is open core, not open source.
| Though open-sourcing the shard rebalancer is a recent, welcome
| step.
| merb wrote:
| Well, because they are the only cloud vendor with an HA
| master-master PostgreSQL database (managed CitusDB), it is
| probably also a selling point.
| pm90 wrote:
| Wait... they have 2 read/write masters? And you can
| read/write from either master at the same time?
| merb wrote:
| Yes, but it's more like Vitess, which is basically sharded
| masters; most of the time this works way better than
| something like Galera or BDR.
___________________________________________________________________
(page generated 2021-03-06 23:00 UTC)