[HN Gopher] PostgreSQL views and materialized views and how they...
       ___________________________________________________________________
        
       PostgreSQL views and materialized views and how they influenced
       TimescaleDB
        
       Author : od0
       Score  : 127 points
       Date   : 2022-07-14 15:52 UTC (7 hours ago)
        
 (HTM) web link (www.timescale.com)
 (TXT) w3m dump (www.timescale.com)
        
       | jbylund wrote:
        | Did you consider a type of continuous aggregate that serves
        | pre-aggregated partitions for time buckets whose data hasn't
        | been modified, and swaps in a live view for buckets whose data
        | has? I guess it would mean that performance would sort of
        | continually degrade as inserts/updates/deletes happened in more
        | and more time buckets, but as soon as the aggregation event
        | fired performance would recover. It seems like one could
        | provide a more strongly consistent view of aggregates with this
        | sort of approach.
        
         | djk447 wrote:
         | NB: Post author
         | 
         | I'm not 100% sure I understand what you're asking, but
         | essentially something that would look for data modifications
         | and at query time run the query over the older regions as well?
         | 
         | If that's what you're asking the answer is yes, we did consider
         | it, but basically decided that it was something that relatively
         | few people needed and the complexity and performance tradeoffs
         | were unlikely to be worth it for most folks.
         | 
         | Essentially, we could do something like this now by looking at
         | our invalidation log and running a join against it to get to a
         | more strongly consistent state (I haven't thought through the
         | full implications and whether it's truly strong consistency, I
         | think it might be, but it'd require a proof / some thinking
         | through of all of our locking logic to really get there). It's
         | interesting to consider though.
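          | 
          | Roughly the shape of it would be something like this (table
          | and column names are made up for illustration, not our actual
          | catalog):
          | 
          |     -- materialized buckets that haven't been invalidated...
          |     SELECT bucket, total
          |     FROM cagg_materialized m
          |     WHERE NOT EXISTS (
          |         SELECT 1 FROM invalidation_log i
          |         WHERE m.bucket >= i.lowest_time
          |           AND m.bucket <  i.greatest_time)
          |     UNION ALL
          |     -- ...plus the invalidated regions recomputed from the
          |     -- raw data (assumes the logged ranges don't overlap)
          |     SELECT time_bucket('1 hour', r.ts) AS bucket,
          |            sum(r.value) AS total
          |     FROM raw_hypertable r
          |     JOIN invalidation_log i
          |       ON r.ts >= i.lowest_time AND r.ts < i.greatest_time
          |     GROUP BY 1;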
        
       | gfody wrote:
       | > If I'd written out the query, I might have seen that I didn't
       | need the JOIN (or never written it in the first place). Whereas
       | the view hides that complexity. So they can make things easier,
       | but that can lead to performance pitfalls if we're not careful.
       | 
        | you can avoid this particular pitfall by using left joins for
        | views like this (that join stuff in for convenience that you
        | might not select) - postgres will eliminate a left join but not
        | an inner join, since the inner join could filter rows, eg
        | symbols in stocks_real_time that don't exist in company.
        | commercial engines will use the presence of fk constraints to
        | know that's impossible and go ahead and eliminate inner joins
        | as well
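        | 
        | eg, a sketch (made-up view definition; company.symbol needs a
        | unique index for the planner to prove the left join can't
        | duplicate rows):
        | 
        |     CREATE VIEW stocks_enriched AS
        |     SELECT s.*, c.name AS company_name
        |     FROM stocks_real_time s
        |     LEFT JOIN company c ON c.symbol = s.symbol;
        | 
        |     -- no company columns referenced, so the planner drops the
        |     -- join to company entirely
        |     SELECT time, price FROM stocks_enriched;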
        
         | gopalv wrote:
         | > commercial engines will use the presence of fk constraints
         | 
         | What, postgres doesn't do FK join removals?
         | 
         | Like I tried it right now and it didn't remove the hash-join
         | 
         | http://sqlfiddle.com/#!17/073747/2
         | 
         | Should've been just a count off the PK Employees table.
         | 
         | Apache Calcite in Hive had to bake in this specific
         | optimization because there were a ton of deep join views
         | (coming off Teradata) where someone would run something like a
         | "select count(*) + where" or just project 2-3 columns out of a
         | 60+ join view from a couple of tables.
         | 
         | And those just ran forever without the PK-FK removals.
        
           | michael1999 wrote:
           | How would that work? Your schema allows an employee to be
           | assigned to multiple departments, and the query must count
           | them.
        
             | doctor_eval wrote:
             | The schema only allows one employee per department, so
             | there is no need to look up the employee table.
        
             | jmalicki wrote:
             | Because of the foreign key, the query should reduce to
             | "explain select count(*) from Department d;"
        
               | [deleted]
        
         | djk447 wrote:
         | NB: post author here!
         | 
          | Thanks yes! Totally true, I was thinking about including some
          | of that, but it felt like it opened a can of worms about join
          | types and why certain things can be eliminated and others not
          | (ie the inner join needs a matching row on both sides whereas
          | the left join doesn't) etc., and the post was already getting
          | kinda long.
        
       | garyclarke27 wrote:
        | Postgres is an amazing database. Its only significant weakness
        | now is materialized views, with their lack of incremental
        | refresh. It was disappointing to see no progress towards this
        | in v15.
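        | 
        | Today the built-in option is a full rebuild only, e.g. (matview
        | name hypothetical):
        | 
        |     -- recomputes the entire view from scratch; CONCURRENTLY
        |     -- avoids blocking readers but requires a unique index on
        |     -- the matview
        |     REFRESH MATERIALIZED VIEW CONCURRENTLY account_balances;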
        
         | gunnarmorling wrote:
          | Work towards incrementally updated views is happening and
          | progressing. For now, though, it lives in a separate
          | extension: https://github.com/sraoss/pg_ivm.
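          | 
          | Rough usage, per the project's README (view and table names
          | here are just examples):
          | 
          |     CREATE EXTENSION pg_ivm;
          | 
          |     -- creates an incrementally maintainable materialized
          |     -- view
          |     SELECT create_immv('sales_by_product',
          |         'SELECT product_id, sum(amount)
          |          FROM sales GROUP BY product_id');
          | 
          |     -- later writes to sales update sales_by_product
          |     -- immediately, without a full REFRESH
          |     INSERT INTO sales VALUES (1, 42);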
        
         | edmundsauto wrote:
          | I also wish this were in core postgres, but if you use a
          | build tool like DBT you can enable this type of thing, eg
          | with an incremental model like the sketch below.
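          | 
          | A minimal sketch (model file; source and column names made
          | up):
          | 
          |     {{ config(materialized='incremental',
          |               unique_key='bucket') }}
          | 
          |     SELECT date_trunc('hour', ts) AS bucket,
          |            sum(value) AS total
          |     FROM {{ source('metrics', 'raw_events') }}
          |     {% if is_incremental() %}
          |       -- only rescan rows newer than what's already built
          |       WHERE ts > (SELECT max(bucket) FROM {{ this }})
          |     {% endif %}
          |     GROUP BY 1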
        
           | snicker7 wrote:
           | Interestingly, DBT does not support creating materialized
           | views.
        
             | mritchie712 wrote:
             | yeah, this issue has been open for years. We need them for
             | Clickhouse for our product.
             | 
             | https://github.com/dbt-labs/dbt-core/issues/1162
        
       | WXLCKNO wrote:
        | Enjoyed this post! Djk447, I would love to ask a question as
        | well.
       | 
        | We've started working with Timescale to process historical time
        | series data. However, there is so much of it that we choose
        | which parts to process.
       | 
       | It's possible that in the future we may need to go back and
       | reprocess the dataset to add something we decided we want after
       | all.
       | 
       | In your post it seems like this is handled automatically on a
       | smaller time scale, meaning you could insert into the past
       | without doing anything special.
       | 
        | What happens if you need to insert data at numerous points
        | across two years' worth of data instead? Do you have to use
        | backfilling
       | as described in your documentation? Or is it better to maybe
       | rebuild the entire hypertable?
        
         | djk447 wrote:
         | NB: Post author
         | 
          | Yes, this is generally handled automatically. There may be
          | times, though, where you want to essentially pause refreshing
          | the view while you do some backfilling and then eventually
          | let it catch up, especially if you're overwriting the same
          | time period multiple times in a row. If you can insert in
          | time order, then it just breaks re-calculation up into
          | smaller segments, which can be quite useful rather than
          | having to process the whole data set again.
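          | 
          | In practice that pause/refresh/resume cycle looks roughly
          | like this (job id and aggregate name are illustrative; the
          | job id comes from timescaledb_information.jobs):
          | 
          |     -- pause the continuous aggregate's refresh policy
          |     SELECT alter_job(1000, scheduled => false);
          | 
          |     -- ... do the backfill ...
          | 
          |     -- refresh just the backfilled window, then resume
          |     CALL refresh_continuous_aggregate('daily_summary',
          |          '2020-01-01', '2022-01-01');
          |     SELECT alter_job(1000, scheduled => true);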
         | 
         | This can be a little bit different if you're doing compression,
         | but with continuous aggregates I think it should work fine. I'm
         | not 100% sure that was what you were looking for, let me know
         | if it's not.
        
       | hodgesrm wrote:
       | This is an excellent article. I like the way the author builds up
       | in steps to eventual consistency between source tables and their
       | materialized views. It was fun to guess the next step.
       | 
       | I do have one question: how does the algorithm described in the
       | article work when the source table is spread across multiple
       | servers, say in multiple shards? Can TimescaleDB maintain
       | materialized views on each shard and then run a query that
       | reconciles them?
       | 
       | Edited: clarification
        
         | djk447 wrote:
         | NB: Post author
         | 
          | So we thought about doing something like that with multinode,
          | where each of the nodes would maintain their own
          | materialization, but abandoned it for that very reason: it's
          | very, very difficult to maintain any sort of consistency
          | guarantees in that case, or even to reason about it.
          | 
          | Instead we use the access node as a coordinator to do the
          | materialization. Right now the materialization only exists on
          | the access node, but there's no reason we couldn't send it
          | back out to the data nodes; you just need a coordination
          | point to start a distributed transaction to have some
          | semblance of a guarantee.
        
           | djk447 wrote:
           | And glad you liked the article!
        
       | gigatexal wrote:
       | I wonder how well these perform compared to the aggregate tables
       | of Clickhouse fame.
        
       ___________________________________________________________________
       (page generated 2022-07-14 23:00 UTC)