[HN Gopher] The many faces of DISTINCT in Postgres (2017) ___________________________________________________________________ The many faces of DISTINCT in Postgres (2017) Author : rsecora Score : 153 points Date : 2023-05-22 09:46 UTC (13 hours ago) (HTM) web link (hakibenita.com) (TXT) w3m dump (hakibenita.com) | kubota wrote: | Oracle listaggs do support distinct, since 19c. | cpursley wrote: | I only learned about the ranked query approach last week thanks | to ChatGTP. Helped me solve a hairy query that rolled up activity | events grouped by time periods. Before that I was struggling with | distinct (and it was slow). | | I've avoided ChatGTP until recently and at least for SQL | refactoring, it's great. The interesting part is the ranked | example ChatCTP gave me was almost identical to the one in this | post. I wonder if they're (ChatGTP) is training up on technical | blog posts. | swader999 wrote: | I throw it lines of logs of sql from Paper trail and ask it to | extract the SQL statements. Then I give it the analyse explain | output. Sometimes ChatGPT will give me great advice on | optimizing. | cpursley wrote: | Dang, that's a good idea (as long as not production data). | | It seems like eventually database automation is going be | self-optimizing (more than the query planner already is). | xupybd wrote: | MSSQL does have something closer to array agg. | | https://www.mssqltips.com/sqlservertip/5542/using-for-xml-pa... | veddan wrote: | There's also JSON_ARRAYAGG. | | https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.... | pophenat wrote: | Microsoft SQL Server now also has IS [NOT] DISTINCT FROM. | https://learn.microsoft.com/en-us/sql/t-sql/queries/is-disti... | [deleted] | codeflo wrote: | > A classic job interview question is finding the employee with | the highest salary in each department. | | Here's a cheat code, in case you need to write a database query | and don't remember all the fancy join tricks: Problems like this | often have a very straightforward solution with subselects. In | this case, the main select gets the departments, and a subselect | (with limit 1) fetches the top employee for each department. | | That's a very natural, compositional way of thinking. Granted, | it's not the most optimized way to do it, but more often then | not, the resulting query plan is perfectly fine. | paol wrote: | > It's not the most optimized way to do it, but more often then | not, the query plan is perfectly fine. | | Depending on the details, the query plan might be exactly the | same. I remember once (in SQL Server 2000, so long time ago) | writing the same query in 3 completely different ways, only to | see the DB generate the same plan for each one. | | It would surprise most people how sophisticated the query | transformations databases can do. | [deleted] | dunham wrote: | I had one case, with the same SQL server, where rewriting to | a join made the query much faster. That was a data changing | statement though, and I think the semantics demanded it rerun | the subselect. | | More recently, I've found that mysql sometimes behaves poorly | with subselects, so I tend to avoid them out of habit. | emptyfile wrote: | [dead] | paulddraper wrote: | > Granted, it's not the most optimized way to do it, but more | often then not, the resulting query plan is perfectly fine. | | It usually is the exactly same. (MySQL has a potato for a query | optimizer tho.) | | The biggest case it is suboptimal is when you need to produce | multiple fields from the sub-select. Because then you need | multiple sub-selects. | foldr wrote: | Also easy to do this with a lateral join (and a limit 1 in the | subquery). | giraffe_lady wrote: | If you can do a lateral join off the top of your head this | isn't the kind of question you need to have a strategy for | anyway. | foldr wrote: | I think everyone tends to learn their own subset of SQL. I | was unaware of some of the more 'obvious' solutions in the | article (e.g. I had only a very vague idea of how window | functions work), but I use lateral joins all the time. | | The nice thing about lateral joins is that they're very | easy to understand conceptually once you get over the weird | syntax. | andrenotgiant wrote: | How would you explain them to someone who knows the | syntax but still finds them counterintuitive? | foldr wrote: | I just think of it as performing a subquery for every row | of the main query. | dagss wrote: | To me a lateral join is the simplest construct you could | have, like a for loop in a programming language. | | For each row in my query so far, do this other thing. | | Since I learned them, everything went from being a puzzle | on its own, "I know what I want to do but how to express | it in SQL" -- to simply being writing things out | naturally.. | mwexler wrote: | Fwir, BigQuery doesn't yet support lateral joins, though they | do use the same concept in their UNNEST and array expansion | features. | | It would be nice to have this feature; its a handy hammer for | the right type of nail. | neallindsay wrote: | The author mentions having to get over the lack of upsert when | moving from Oracle. But readers might like to know this isn't a | problem anymore since Postgres got "INSERT ... ON CONFLICT UPDATE | ...". | paulryanrogers wrote: | And MERGE | inferiorhuman wrote: | Oh I'll add to the pile: the Postgres ON CONFLICT/MERGE stuff | doesn't capture the " _select or insert if row doesn 't exist_" | behavior some folks might want to use an upsert type thing for. | | Why? You cannot access the row id or any values within the row | as a return value unless the row changes. The top stackexchange | answer goes into some detail about why you really don't want to | simply make a dummy update in Postgres. | | You can access these old row values in some scope for the no- | update case, and this allows you to leverage CTEs instead of | dummy updates. However... you can't use CTEs with prepared | statements (and it's a non-obvious fail for the uninitiated). | So yeah there are still lots of little sharp edges. | jadbox wrote: | > "select or insert if row doesn't exist" behavior | | I've hit this many, many times. It's such a common need, and | yet I keep having to write it out as multiple queries to | "select row" and if no result, do then insert query. | paulddraper wrote: | It's true. | | Fortunately the records are locked, so a second query will | pick them up. Still sad tho. | | Prepared statements work fine with CTEs. | cdogl wrote: | I am a huge Postgres fan, but "ON CONFLICT UPDATE" does not | always cut the mustard. You can't target conflicts against | multiple constraints, so if there are multiple constraints that | could cause conflicts you're stuck either figuring out whether | you can drop a constraint or doing something clever in your | client. | | This is often a smell, but it's a little inflexible. | paulddraper wrote: | > You can't target conflicts against multiple constraints, | | I've never needed that, and it's not clear why I would. | | Like, my new record is a duplicate of record A according to | one constraint and record B according to another constraint, | so it updates both A and B? | | One "upsert" but two resulting records? | | Can someone fill me in on when this would crop up? | richbell wrote: | In my experience this tends to be an annoyance when you | want to upsert against a table that has multiple unique | constraints. | | For example, let's say you're tracking GitHub repositories | and have a table `repository(id, gh_id, gh_node_id, owner, | name, description)` where `gh_id` and `gh_node_id` are both | unique. | | If you want to insert or update a repository you might want | to do something like below, however, this is not a valid | syntax and as you need to define a separate `DO UPDATE` for | `gh_id` and `gh_node_id`: INSERT INTO | repository (gh_id, gh_node_id, owner, name, description) | VALUES (:id, :node_id, :owner, :name, :description) | ON CONFLICT DO UPDATE SET name = excluded.name, | description = excluded.description; | | ------ | | To my knowledge there's no way to define a single | constraint `UNIQUE(gh_id || gh_node_id)` instead of | `UNIQUE(gh_id && gh_node_id)`. | paulddraper wrote: | So....what is gh_id and gh_node_id? | | Like, you have gh_id | gh_node_id | ------------------ 1 | 2 2 | | 1 | | And then you want to "upsert" (1, 1). | | You want both records to be updated? | | I'm still struggling to see why you would want to do a | upsert relative to multiple unique constraints at once. | richbell wrote: | In addition to this, `ON CONFLICT` is still susceptible to | race-conditions if you have a highly concurrent application. | This may be obvious to some, however, a lot of people expect | upsert to be atomic and get bit by this. | | I also dislike that `ON CONFLICT ... DO NOTHING` increments | numerical primary keys. I understand _why_ it happens but it | seems counter-intuitive given the name "DO NOTHING". (And, | yes, relying on the values of primary keys to never change is | an anti-pattern. However, if you have a table of 10 values | and the primary keys have massive gaps like 1, 500_211, | 2_521_241, 15_631_121, etc., it _feels weird_ nonetheless.) | jack_squat wrote: | From the Postgres docs, | | ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE | outcome; provided there is no independent error, one of | those two outcomes is guaranteed, even under high | concurrency. This is also known as UPSERT -- "UPDATE or | INSERT". | | https://www.postgresql.org/docs/current/sql-insert.html | | What are you referring to? | neallindsay wrote: | It increments the sequence associated with with a | "serial" or "bigserial" field--usually used for primary | keys. People are often surprised by this because they | expect their primary keys to be sequential and sequences | are designed to leave gaps in order to avoid a lot of | locking. | hn_throwaway_99 wrote: | That just seems like a fundamental misunderstanding of | sequences to me. They are guaranteed to be increasing but | not necessarily sequential. | richbell wrote: | I think it's more confusion of how `ON CONFLICT DO | NOTHING` works. Incrementing sequences is not _" doing | nothing"_, even if it's valid and sensible behaviour, | which leads to confusion. | richbell wrote: | If I recall correctly (and it has been a while, so I'm | not saying I am), the issue was with concurrent | transactions inserting a record into tableA and another | into tableB which has a foreign key constraint to tableA. | The issue was likely specific to `ON CONFLICT DO NOTHING` | and not `ON CONFLICT DO UPDATE`. | | For example, let's saying you're building an index of | open source packages and have two tables: | package_type(id, name) and package(id, type_id, | namespace, name). | | If you receive two concurrent requests for | `maven://log4j:log4j` and `maven://io.quarkus:quarkus`, a | naive implementation to insert both "maven" and the | packages if they don't exist might look something like | this: WITH type_id AS ( INSERT | INTO package_type(name) VALUES (:type) | RETURNING id ON CONFLICT DO NOTHING ) | INSERT INTO package (type_id, namespace, name) | SELECT type_id, :namespace, :name FROM type | ON CONFLICT DO NOTHING; | | However, one or both inserts can fail intermittently | because the primary key for `package_type` will be auto- | incremented and thus the foreign key won't be valid. | Also, as mentioned in another comment[0] this won't work | if `maven` already exists in the `package_type` table. | | [0]: https://news.ycombinator.com/item?id=36031790 | paulddraper wrote: | I don't think this has to with concurrency; this query is | fundamentally broken (besides typos), in that the CTE | won't return anything if the package_type already exists. | | You have two options: | | (1) ON CONFLICT DO UPDATE, with dummy update: | WITH type AS ( INSERT INTO | package_type (name) VALUES ($1) | ON CONFLICT (name) DO UPDATE SET name = excluded.name | RETURNING id ) INSERT INTO package | (type_id, namespace, name) SELECT id, $2, $3 | FROM type ON CONFLICT (type_id, namespace, name) | DO UPDATE SET name = excluded.name RETURNING id; | | (2) Separate statements with ON CONFLICT DO NOTHING | (could be in a UDF if desired): INSERT | INTO package_type (name) VALUES ($1) ON | CONFLICT DO NOTHING; INSERT INTO package | (type_id, namespace, name) SELECT type_id, $2, $3 | FROM package_type WHERE name = $1 ON | CONFLICT DO NOTHING; SELECT id FROM | package WHERE (type_id, namespace, name) = ($1, | $2, $3); | jack_squat wrote: | Atomicity doesn't mean "doesn't fail", it means "either | fails or succeeds, but does not succeed halfway". | | There is nothing about what you are describing that is | different from the behavior you'd get from a regular | insert or update. If two transactions conflict, a | rollback will occur. That isn't violating atomicity. In | fact, it is the way by which atomicity is guaranteed. | | The behavior of sequence values getting incremented and | not committed, resulting in gaps in the sequence, is a | separate matter, not specific to Postgres or to upsert. | hn_throwaway_99 wrote: | I think that actually more to the root of the problem, as | other folks have noted, is that `ON CONFLICT DO NOTHING` | means the RETURNING clause returns no rows if there is a | conflict, which in my experience is rarely what people | want. So instead people do `ON CONFLICT DO UPDATE` with a | no-op update which has performance/locking implications, | otherwise they need to do a complicated query (search | stack overflow). | | I wish that postgres would add some sort of backwards | compatible option like `ON CONFLICT DO NOOP` or `ON | CONFLICT DO RETURN` so that you got the semantics of `DO | NOTHING` except that the conflicted rows are returned. | paulddraper wrote: | It's as resilient to race conditions as it could be. Does | exactly what you'd expect, no? | | That is annoying about sequences, I agree. | richbell wrote: | > It's as resilient to race conditions *as it could be*. | Does exactly what you'd expect, no? | | While that that may be true, based on the hours I've | spent scouring Stack Overflow and GitHub issues it seems | that many people don't realize that it's only 99.999% | resilient. | vore wrote: | It is 100% resilient short of your hard drive going | through a microwave. It is almost definitely an | application logic error if you are observing non- | atomicity. What would non-atomicity even look like here? | Only some columns being written? | paulddraper wrote: | Could you describe what you mean? What phenomenon? For | the default transaction isolation level, or a different | one? | mattashii wrote: | Yes, that's been a thing since 9.5, which was released in early | 2016. | justinclift wrote: | 2017 | cpursley wrote: | And still helpful as ever. | ayhanfuat wrote: | I was thinking to myself "there is no way Haki Benita has just | discovered `distinct on`". | zzzeek wrote: | I've never gotten into DISTINCT ON and it's always confused me, | I'd rather see the query with MAX and GROUP BY if I'm looking for | "the highest X in groups of Y". For the same reason I don't | prefer RSA-in-three-lines. | maweki wrote: | "(easy CS students, I know it's not normalized...)" | | Sure it is. As long as you're not storing any other information | on the department in the employee table. | daigoba66 wrote: | And as long as you never rename the departments. | maweki wrote: | That has no bearing on normalization. | Izkata wrote: | It's an almost verbatim example of getting to 1NF, the | first and most basic normalization. The value (department | name) is repeating and should be extracted and given its | own ID. | jack_squat wrote: | 1NF bans relation-valued attributes, not repetition of | attribute value across tuples in a relation. Mainstream | SQL databases don't support relation-valued attributes, | so any table you make in a relational database is 1NF. | | You can push back on this a little - for instance maybe | you consider an array-valued attribute to be enough like | a relation to argue array-valued attributes violate 1NF. | But if you do that you must also explain what makes | arrays different from strings, since strings are pretty | similar to arrays of characters and can be treated the | same way in most respects (for instance characters in a | string can be addressed by index or split into | substrings). | dymk wrote: | Only if there's a separate Departments table, which for | this very simple example, there isn't | maweki wrote: | Even if there were a separate Departments table, who is | to say that the Department Name is not its primary key? | The Department name certainly is A key. | [deleted] | maweki wrote: | > The value (department name) is repeating and should be | extracted | | Then the id would be repeating. Furthermore, the | department name would make a fine primary or alternate | key for the new relation you're proposing. | | Also, that's not what 1NF is. 1NF means there should be | no table-valued attributes. And neither is any column | list-valued nor does any subset of columns form a | subtable. | | The other normal forms talk about functional dependencies | and there aren't any. | | The only possible violation of 1NF could be not splitting | the name in given name and family name. Other than that, | the table is normalized. | Izkata wrote: | > Then the id would be repeating. | | Yes, but it's an ID, not a value. No problems there. | | > Furthermore, the department name would make a fine | primary or alternate key for the new relation you're | proposing. | | They're called "natural keys" and there's a lot of | problems with them not actually mapping to identity. Like | for one example, if a department's name is changed, it | isn't just a change to the database column, you have to | update all associated code as well - which is why you | should have an ID and use that in the code anyway. | | > Also, that's not what 1NF is. 1NF means there should be | no table-valued attributes. And neither is any column | list-valued nor does any subset of columns form a | subtable. | | If there's only one such column and you switch | perspective to the inner table, the transformation is the | same. That's why I said "almost" - it's not exactly the | same, there are additional conditions, but if you hit | them it's the same thing and the result is extracting the | duplicate values into their own table, linking them with | an ID instead. | maweki wrote: | > Yes, but it's an ID, not a value. | | In normalization theory and relational algebra an ID is | just a value. DBMSs make no difference between this | column and any other primary key column(s) and they make | no difference between PK indexes over strings or numbers | or any other supported data type. | | You're just cargo culting here instead of applying | database theory or actually looking at implementations. ___________________________________________________________________ (page generated 2023-05-22 23:00 UTC)