[HN Gopher] The many faces of DISTINCT in Postgres (2017)
       ___________________________________________________________________
        
       The many faces of DISTINCT in Postgres (2017)
        
       Author : rsecora
       Score  : 153 points
       Date   : 2023-05-22 09:46 UTC (13 hours ago)
        
 (HTM) web link (hakibenita.com)
 (TXT) w3m dump (hakibenita.com)
        
       | kubota wrote:
       | Oracle listaggs do support distinct, since 19c.
        
       | cpursley wrote:
       | I only learned about the ranked query approach last week thanks
       | to ChatGTP. Helped me solve a hairy query that rolled up activity
       | events grouped by time periods. Before that I was struggling with
       | distinct (and it was slow).
       | 
       | I've avoided ChatGTP until recently and at least for SQL
       | refactoring, it's great. The interesting part is the ranked
       | example ChatCTP gave me was almost identical to the one in this
       | post. I wonder if they're (ChatGTP) is training up on technical
       | blog posts.
        
         | swader999 wrote:
         | I throw it lines of logs of sql from Paper trail and ask it to
         | extract the SQL statements. Then I give it the analyse explain
         | output. Sometimes ChatGPT will give me great advice on
         | optimizing.
        
           | cpursley wrote:
           | Dang, that's a good idea (as long as not production data).
           | 
           | It seems like eventually database automation is going be
           | self-optimizing (more than the query planner already is).
        
       | xupybd wrote:
       | MSSQL does have something closer to array agg.
       | 
       | https://www.mssqltips.com/sqlservertip/5542/using-for-xml-pa...
        
         | veddan wrote:
         | There's also JSON_ARRAYAGG.
         | 
         | https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions....
        
       | pophenat wrote:
       | Microsoft SQL Server now also has IS [NOT] DISTINCT FROM.
       | https://learn.microsoft.com/en-us/sql/t-sql/queries/is-disti...
        
       | [deleted]
        
       | codeflo wrote:
       | > A classic job interview question is finding the employee with
       | the highest salary in each department.
       | 
       | Here's a cheat code, in case you need to write a database query
       | and don't remember all the fancy join tricks: Problems like this
       | often have a very straightforward solution with subselects. In
       | this case, the main select gets the departments, and a subselect
       | (with limit 1) fetches the top employee for each department.
       | 
       | That's a very natural, compositional way of thinking. Granted,
       | it's not the most optimized way to do it, but more often then
       | not, the resulting query plan is perfectly fine.
        
         | paol wrote:
         | > It's not the most optimized way to do it, but more often then
         | not, the query plan is perfectly fine.
         | 
         | Depending on the details, the query plan might be exactly the
         | same. I remember once (in SQL Server 2000, so long time ago)
         | writing the same query in 3 completely different ways, only to
         | see the DB generate the same plan for each one.
         | 
         | It would surprise most people how sophisticated the query
         | transformations databases can do.
        
           | [deleted]
        
           | dunham wrote:
           | I had one case, with the same SQL server, where rewriting to
           | a join made the query much faster. That was a data changing
           | statement though, and I think the semantics demanded it rerun
           | the subselect.
           | 
           | More recently, I've found that mysql sometimes behaves poorly
           | with subselects, so I tend to avoid them out of habit.
        
         | emptyfile wrote:
         | [dead]
        
         | paulddraper wrote:
         | > Granted, it's not the most optimized way to do it, but more
         | often then not, the resulting query plan is perfectly fine.
         | 
         | It usually is the exactly same. (MySQL has a potato for a query
         | optimizer tho.)
         | 
         | The biggest case it is suboptimal is when you need to produce
         | multiple fields from the sub-select. Because then you need
         | multiple sub-selects.
        
         | foldr wrote:
         | Also easy to do this with a lateral join (and a limit 1 in the
         | subquery).
        
           | giraffe_lady wrote:
           | If you can do a lateral join off the top of your head this
           | isn't the kind of question you need to have a strategy for
           | anyway.
        
             | foldr wrote:
             | I think everyone tends to learn their own subset of SQL. I
             | was unaware of some of the more 'obvious' solutions in the
             | article (e.g. I had only a very vague idea of how window
             | functions work), but I use lateral joins all the time.
             | 
             | The nice thing about lateral joins is that they're very
             | easy to understand conceptually once you get over the weird
             | syntax.
        
               | andrenotgiant wrote:
               | How would you explain them to someone who knows the
               | syntax but still finds them counterintuitive?
        
               | foldr wrote:
               | I just think of it as performing a subquery for every row
               | of the main query.
        
               | dagss wrote:
               | To me a lateral join is the simplest construct you could
               | have, like a for loop in a programming language.
               | 
               | For each row in my query so far, do this other thing.
               | 
               | Since I learned them, everything went from being a puzzle
               | on its own, "I know what I want to do but how to express
               | it in SQL" -- to simply being writing things out
               | naturally..
        
           | mwexler wrote:
           | Fwir, BigQuery doesn't yet support lateral joins, though they
           | do use the same concept in their UNNEST and array expansion
           | features.
           | 
           | It would be nice to have this feature; its a handy hammer for
           | the right type of nail.
        
       | neallindsay wrote:
       | The author mentions having to get over the lack of upsert when
       | moving from Oracle. But readers might like to know this isn't a
       | problem anymore since Postgres got "INSERT ... ON CONFLICT UPDATE
       | ...".
        
         | paulryanrogers wrote:
         | And MERGE
        
         | inferiorhuman wrote:
         | Oh I'll add to the pile: the Postgres ON CONFLICT/MERGE stuff
         | doesn't capture the " _select or insert if row doesn 't exist_"
         | behavior some folks might want to use an upsert type thing for.
         | 
         | Why? You cannot access the row id or any values within the row
         | as a return value unless the row changes. The top stackexchange
         | answer goes into some detail about why you really don't want to
         | simply make a dummy update in Postgres.
         | 
         | You can access these old row values in some scope for the no-
         | update case, and this allows you to leverage CTEs instead of
         | dummy updates. However... you can't use CTEs with prepared
         | statements (and it's a non-obvious fail for the uninitiated).
         | So yeah there are still lots of little sharp edges.
        
           | jadbox wrote:
           | > "select or insert if row doesn't exist" behavior
           | 
           | I've hit this many, many times. It's such a common need, and
           | yet I keep having to write it out as multiple queries to
           | "select row" and if no result, do then insert query.
        
           | paulddraper wrote:
           | It's true.
           | 
           | Fortunately the records are locked, so a second query will
           | pick them up. Still sad tho.
           | 
           | Prepared statements work fine with CTEs.
        
         | cdogl wrote:
         | I am a huge Postgres fan, but "ON CONFLICT UPDATE" does not
         | always cut the mustard. You can't target conflicts against
         | multiple constraints, so if there are multiple constraints that
         | could cause conflicts you're stuck either figuring out whether
         | you can drop a constraint or doing something clever in your
         | client.
         | 
         | This is often a smell, but it's a little inflexible.
        
           | paulddraper wrote:
           | > You can't target conflicts against multiple constraints,
           | 
           | I've never needed that, and it's not clear why I would.
           | 
           | Like, my new record is a duplicate of record A according to
           | one constraint and record B according to another constraint,
           | so it updates both A and B?
           | 
           | One "upsert" but two resulting records?
           | 
           | Can someone fill me in on when this would crop up?
        
             | richbell wrote:
             | In my experience this tends to be an annoyance when you
             | want to upsert against a table that has multiple unique
             | constraints.
             | 
             | For example, let's say you're tracking GitHub repositories
             | and have a table `repository(id, gh_id, gh_node_id, owner,
             | name, description)` where `gh_id` and `gh_node_id` are both
             | unique.
             | 
             | If you want to insert or update a repository you might want
             | to do something like below, however, this is not a valid
             | syntax and as you need to define a separate `DO UPDATE` for
             | `gh_id` and `gh_node_id`:                   INSERT INTO
             | repository (gh_id, gh_node_id, owner, name, description)
             | VALUES (:id, :node_id, :owner, :name, :description)
             | ON CONFLICT DO UPDATE            SET name = excluded.name,
             | description = excluded.description;
             | 
             | ------
             | 
             | To my knowledge there's no way to define a single
             | constraint `UNIQUE(gh_id || gh_node_id)` instead of
             | `UNIQUE(gh_id && gh_node_id)`.
        
               | paulddraper wrote:
               | So....what is gh_id and gh_node_id?
               | 
               | Like, you have                   gh_id | gh_node_id
               | ------------------           1   |     2               2
               | |     1
               | 
               | And then you want to "upsert" (1, 1).
               | 
               | You want both records to be updated?
               | 
               | I'm still struggling to see why you would want to do a
               | upsert relative to multiple unique constraints at once.
        
           | richbell wrote:
           | In addition to this, `ON CONFLICT` is still susceptible to
           | race-conditions if you have a highly concurrent application.
           | This may be obvious to some, however, a lot of people expect
           | upsert to be atomic and get bit by this.
           | 
           | I also dislike that `ON CONFLICT ... DO NOTHING` increments
           | numerical primary keys. I understand _why_ it happens but it
           | seems counter-intuitive given the name  "DO NOTHING". (And,
           | yes, relying on the values of primary keys to never change is
           | an anti-pattern. However, if you have a table of 10 values
           | and the primary keys have massive gaps like 1, 500_211,
           | 2_521_241, 15_631_121, etc., it _feels weird_ nonetheless.)
        
             | jack_squat wrote:
             | From the Postgres docs,
             | 
             | ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE
             | outcome; provided there is no independent error, one of
             | those two outcomes is guaranteed, even under high
             | concurrency. This is also known as UPSERT -- "UPDATE or
             | INSERT".
             | 
             | https://www.postgresql.org/docs/current/sql-insert.html
             | 
             | What are you referring to?
        
               | neallindsay wrote:
               | It increments the sequence associated with with a
               | "serial" or "bigserial" field--usually used for primary
               | keys. People are often surprised by this because they
               | expect their primary keys to be sequential and sequences
               | are designed to leave gaps in order to avoid a lot of
               | locking.
        
               | hn_throwaway_99 wrote:
               | That just seems like a fundamental misunderstanding of
               | sequences to me. They are guaranteed to be increasing but
               | not necessarily sequential.
        
               | richbell wrote:
               | I think it's more confusion of how `ON CONFLICT DO
               | NOTHING` works. Incrementing sequences is not _" doing
               | nothing"_, even if it's valid and sensible behaviour,
               | which leads to confusion.
        
               | richbell wrote:
               | If I recall correctly (and it has been a while, so I'm
               | not saying I am), the issue was with concurrent
               | transactions inserting a record into tableA and another
               | into tableB which has a foreign key constraint to tableA.
               | The issue was likely specific to `ON CONFLICT DO NOTHING`
               | and not `ON CONFLICT DO UPDATE`.
               | 
               | For example, let's saying you're building an index of
               | open source packages and have two tables:
               | package_type(id, name) and package(id, type_id,
               | namespace, name).
               | 
               | If you receive two concurrent requests for
               | `maven://log4j:log4j` and `maven://io.quarkus:quarkus`, a
               | naive implementation to insert both "maven" and the
               | packages if they don't exist might look something like
               | this:                  WITH type_id AS (          INSERT
               | INTO package_type(name)          VALUES (:type)
               | RETURNING id          ON CONFLICT DO NOTHING        )
               | INSERT INTO package (type_id, namespace, name)
               | SELECT type_id, :namespace, :name        FROM type
               | ON CONFLICT DO NOTHING;
               | 
               | However, one or both inserts can fail intermittently
               | because the primary key for `package_type` will be auto-
               | incremented and thus the foreign key won't be valid.
               | Also, as mentioned in another comment[0] this won't work
               | if `maven` already exists in the `package_type` table.
               | 
               | [0]: https://news.ycombinator.com/item?id=36031790
        
               | paulddraper wrote:
               | I don't think this has to with concurrency; this query is
               | fundamentally broken (besides typos), in that the CTE
               | won't return anything if the package_type already exists.
               | 
               | You have two options:
               | 
               | (1) ON CONFLICT DO UPDATE, with dummy update:
               | WITH           type AS (             INSERT INTO
               | package_type (name)             VALUES ($1)
               | ON CONFLICT (name) DO UPDATE SET name = excluded.name
               | RETURNING id           )         INSERT INTO package
               | (type_id, namespace, name)         SELECT id, $2, $3
               | FROM type         ON CONFLICT (type_id, namespace, name)
               | DO UPDATE SET name = excluded.name         RETURNING id;
               | 
               | (2) Separate statements with ON CONFLICT DO NOTHING
               | (could be in a UDF if desired):                   INSERT
               | INTO package_type (name)         VALUES ($1)         ON
               | CONFLICT DO NOTHING;              INSERT INTO package
               | (type_id, namespace, name)         SELECT type_id, $2, $3
               | FROM package_type         WHERE name = $1         ON
               | CONFLICT DO NOTHING;              SELECT id         FROM
               | package         WHERE (type_id, namespace, name) = ($1,
               | $2, $3);
        
               | jack_squat wrote:
               | Atomicity doesn't mean "doesn't fail", it means "either
               | fails or succeeds, but does not succeed halfway".
               | 
               | There is nothing about what you are describing that is
               | different from the behavior you'd get from a regular
               | insert or update. If two transactions conflict, a
               | rollback will occur. That isn't violating atomicity. In
               | fact, it is the way by which atomicity is guaranteed.
               | 
               | The behavior of sequence values getting incremented and
               | not committed, resulting in gaps in the sequence, is a
               | separate matter, not specific to Postgres or to upsert.
        
               | hn_throwaway_99 wrote:
               | I think that actually more to the root of the problem, as
               | other folks have noted, is that `ON CONFLICT DO NOTHING`
               | means the RETURNING clause returns no rows if there is a
               | conflict, which in my experience is rarely what people
               | want. So instead people do `ON CONFLICT DO UPDATE` with a
               | no-op update which has performance/locking implications,
               | otherwise they need to do a complicated query (search
               | stack overflow).
               | 
               | I wish that postgres would add some sort of backwards
               | compatible option like `ON CONFLICT DO NOOP` or `ON
               | CONFLICT DO RETURN` so that you got the semantics of `DO
               | NOTHING` except that the conflicted rows are returned.
        
             | paulddraper wrote:
             | It's as resilient to race conditions as it could be. Does
             | exactly what you'd expect, no?
             | 
             | That is annoying about sequences, I agree.
        
               | richbell wrote:
               | > It's as resilient to race conditions *as it could be*.
               | Does exactly what you'd expect, no?
               | 
               | While that that may be true, based on the hours I've
               | spent scouring Stack Overflow and GitHub issues it seems
               | that many people don't realize that it's only 99.999%
               | resilient.
        
               | vore wrote:
               | It is 100% resilient short of your hard drive going
               | through a microwave. It is almost definitely an
               | application logic error if you are observing non-
               | atomicity. What would non-atomicity even look like here?
               | Only some columns being written?
        
               | paulddraper wrote:
               | Could you describe what you mean? What phenomenon? For
               | the default transaction isolation level, or a different
               | one?
        
         | mattashii wrote:
         | Yes, that's been a thing since 9.5, which was released in early
         | 2016.
        
       | justinclift wrote:
       | 2017
        
         | cpursley wrote:
         | And still helpful as ever.
        
         | ayhanfuat wrote:
         | I was thinking to myself "there is no way Haki Benita has just
         | discovered `distinct on`".
        
       | zzzeek wrote:
       | I've never gotten into DISTINCT ON and it's always confused me,
       | I'd rather see the query with MAX and GROUP BY if I'm looking for
       | "the highest X in groups of Y". For the same reason I don't
       | prefer RSA-in-three-lines.
        
       | maweki wrote:
       | "(easy CS students, I know it's not normalized...)"
       | 
       | Sure it is. As long as you're not storing any other information
       | on the department in the employee table.
        
         | daigoba66 wrote:
         | And as long as you never rename the departments.
        
           | maweki wrote:
           | That has no bearing on normalization.
        
             | Izkata wrote:
             | It's an almost verbatim example of getting to 1NF, the
             | first and most basic normalization. The value (department
             | name) is repeating and should be extracted and given its
             | own ID.
        
               | jack_squat wrote:
               | 1NF bans relation-valued attributes, not repetition of
               | attribute value across tuples in a relation. Mainstream
               | SQL databases don't support relation-valued attributes,
               | so any table you make in a relational database is 1NF.
               | 
               | You can push back on this a little - for instance maybe
               | you consider an array-valued attribute to be enough like
               | a relation to argue array-valued attributes violate 1NF.
               | But if you do that you must also explain what makes
               | arrays different from strings, since strings are pretty
               | similar to arrays of characters and can be treated the
               | same way in most respects (for instance characters in a
               | string can be addressed by index or split into
               | substrings).
        
               | dymk wrote:
               | Only if there's a separate Departments table, which for
               | this very simple example, there isn't
        
               | maweki wrote:
               | Even if there were a separate Departments table, who is
               | to say that the Department Name is not its primary key?
               | The Department name certainly is A key.
        
               | [deleted]
        
               | maweki wrote:
               | > The value (department name) is repeating and should be
               | extracted
               | 
               | Then the id would be repeating. Furthermore, the
               | department name would make a fine primary or alternate
               | key for the new relation you're proposing.
               | 
               | Also, that's not what 1NF is. 1NF means there should be
               | no table-valued attributes. And neither is any column
               | list-valued nor does any subset of columns form a
               | subtable.
               | 
               | The other normal forms talk about functional dependencies
               | and there aren't any.
               | 
               | The only possible violation of 1NF could be not splitting
               | the name in given name and family name. Other than that,
               | the table is normalized.
        
               | Izkata wrote:
               | > Then the id would be repeating.
               | 
               | Yes, but it's an ID, not a value. No problems there.
               | 
               | > Furthermore, the department name would make a fine
               | primary or alternate key for the new relation you're
               | proposing.
               | 
               | They're called "natural keys" and there's a lot of
               | problems with them not actually mapping to identity. Like
               | for one example, if a department's name is changed, it
               | isn't just a change to the database column, you have to
               | update all associated code as well - which is why you
               | should have an ID and use that in the code anyway.
               | 
               | > Also, that's not what 1NF is. 1NF means there should be
               | no table-valued attributes. And neither is any column
               | list-valued nor does any subset of columns form a
               | subtable.
               | 
               | If there's only one such column and you switch
               | perspective to the inner table, the transformation is the
               | same. That's why I said "almost" - it's not exactly the
               | same, there are additional conditions, but if you hit
               | them it's the same thing and the result is extracting the
               | duplicate values into their own table, linking them with
               | an ID instead.
        
               | maweki wrote:
               | > Yes, but it's an ID, not a value.
               | 
               | In normalization theory and relational algebra an ID is
               | just a value. DBMSs make no difference between this
               | column and any other primary key column(s) and they make
               | no difference between PK indexes over strings or numbers
               | or any other supported data type.
               | 
               | You're just cargo culting here instead of applying
               | database theory or actually looking at implementations.
        
       ___________________________________________________________________
       (page generated 2023-05-22 23:00 UTC)