[HN Gopher] Distinct on: a confusing, unique and useful feature ... ___________________________________________________________________ Distinct on: a confusing, unique and useful feature in Postgres Author : yogiUSA Score : 76 points Date : 2020-03-19 08:40 UTC (1 days ago) (HTM) web link (www.yogeshchauhan.com) (TXT) w3m dump (www.yogeshchauhan.com) | Groxx wrote: | Seems like a saner option (in that behavior is defined) than | MySQL's ability to select columns in a group-by that are not | aggregates / grouped-on. | | e.g. this works in MySQL, and has for many years: | select a, b, c from table group by a | | and you'll get undefined (but valid) b and c items from the | grouped results per `a`. It's absurdly useful at times - if the | non-grouped columns are just denormalized data, it doesn't matter | which row they came from, and there's no need to spend time | sorting/maxing/etc. And sometimes you truly don't care which | entry you get, you just want _an_ entry, e.g. to show a sample of | what 's in the group. | bfoks wrote: | But not by default. It's related to ONLY_FULL_GROUP_BY. | | More here: https://dev.mysql.com/doc/refman/8.0/en/group-by- | handling.ht... | greiskul wrote: | Is this similar to BigQuery's ANY_VALUE aggregate function? I | found it to be extremely useful. | beckingz wrote: | This removes the self join needed to ask the query | "(latest/first/biggest/smallest) record per | (customer/user/entity)". | | Nice! | caseyohara wrote: | I've been struggling with a similar query recently perhaps | better suited for a time series DB. Can DISTINCT ON be used to | ask the query "latest record per customer _per day_ ". For | example, let's say we have a bunch of devices that send sensor | readings arbitrarily throughout each day. Could DISTINCT ON be | used to query the last reading of each day per device? | claytonjy wrote: | I've been using Timescale DB, a postgres extension, to store | our sensor-produced timeseries data and it handles this | pretty well. DISTINCT ON seems to ruin the query plan there, | by forcing sequential, non-index scans of each subtable | (timescale breaks tables into time-buckets behind the scenes, | so it can skip searching tables in time-filtered queries). | I've had good performance with their "time_bucket" and "last" | functions, though, and not having to use some NoSQL time | series database has been really nice. | eyelidlessness wrote: | Yes, you can `DISTINCT ON` with a `DATE_TRUNC` expression | which should get you what you want. | lenzm wrote: | Absolutely. You'll need an expression to get the date out of | the timestamp, then you want rows that are distinct on | customer_id & date order by timestamp desc. | dagss wrote: | Also consider expressin this with a lateral join. | | Scroll a bit down here for a similar example. | | https://stackoverflow.com/questions/11472790/postgres- | analog... | Zecc wrote: | SQLite takes it one step further. | | https://sqlite.org/quirks.html#aggregate_queries_can_contain... | | > if an aggregate query contains a single min() or max() | function, then the values of columns used in the output are | taken from the row where the min() or max() value was achieved. | If two or more rows have the same min() or max() value, then | the columns values will be chosen arbitrarily from one of those | rows. | | Also here: | | https://sqlite.org/lang_select.html#bareagg | | > When the min() or max() aggregate functions are used in an | aggregate query, all bare columns in the result set take values | from the input row which also contains the minimum or maximum. | [...] There is still an ambiguity if two or more of the input | rows have the same minimum or maximum value or if the query | contains more than one min() and/or max() aggregate function. | | Would have been nice if adding an ORDER BY would allow | disambiguating. | Tostino wrote: | Distinct on is for sure one of the most useful Postgres | extensions to SQL, as far as how concise it makes queries which | otherwise would suck to write. | | I would have liked to see equivalent queries written with GROUP | BY so people could understand and translate that knowledge even | if they don't have direct Postgres experience. | | Thanks for the article! | beckingz wrote: | This would be great, because getting those same results in | standard SQL or MariaDB is a pain. | claytonjy wrote: | I believe the demo query SELECT DISTINCT ON | (location) location, time, report FROM weather_reports | ORDER BY location, time DESC; | | could also be written with a self-join like so | SELECT location, time, report FROM weather_reports | JOIN ( SELECT location, max(time) AS time | FROM weather_reports GROUP BY location ) as | _ USING (location, time) ORDER BY location; | | though I'd be curious what the query-plan differences are. I | suspect the DISTINCT ON might translate more closely to a | lateral join SELECT FROM ( | SELECT DISTINCT(location) FROM weather_reports | ) AS locations JOIN LATERAL ( SELECT | location, time, report FROM weather_reports | WHERE location = locations.location ORDER BY ts | DESC LIMIT 1 ) AS reports ON TRUE | ORDER BY location; | | But I really can't say. | Twisell wrote: | What's nice about PostgreSQL is that you quite often discover | little useful trick you have missed so far. | | Guess that's however true for any well written language when you | master it enough to enjoy theses gems hidden in plain sight. | cosmie wrote: | One really useful call out is that it supports expressions, not | just columns. You can create derived entities via case statements | (think binning, entity normalization, etc) and leverage the super | handy functionality against those derived entities. | | It also allows you to use multiple entities/expressions in the | distinct on, so you can pivot the logic between granularity | levels. Which comes in handy when needing to pull rollups | latest/first/biggest/smallest style rollups from varying | perspectives. | rattray wrote: | This sounds pretty awesome. Anyone have any sample queries | demonstrating this lying around? | cosmie wrote: | I threw together a Fiddle that demonstrates it here[1]. It | shows the basic single column version, a multi-column | distinct on, and an example that uses a case statement for a | derived column. The only thing with the third example is that | I don't believe you can alias the derived column within the | distinct on field itself, so have to repeat the expression - | once within the distinct on and again in the column list of | the select statement. | | [1] https://www.db-fiddle.com/f/7CCYFf14UtkuGYfGZAt5PY/2 | cosmotic wrote: | If you want your application to be portable, don't use this | feature. | irrational wrote: | Do you mean portable as in able to run on different databases? | Your application would have to be fairly simple to achieve that | goal. We moved an application from Oracle to Postgres. It took | about 2 years to complete. | _jal wrote: | This advice grew out of the days when database platforms cost | enormous amounts of money. It is still as true as it ever was, | but since then the biggest pressure that lead to it no longer | applies. | | Back before PG and others were viable competitors to the big | commercial databases, about the only leverage customers had | against aggressive sales people was the threat to switch to a | competitor. So people would preserve that as a credible threat | by not using proprietary SQL extensions. | | No vampiric salesdroid from Postgres is going to knock on your | door demanding 8 figures. | | Now, there may be other reasons you want to switch away from | it, or you may want to support a variety of DBs in whatever | you're building, or some other reason. So it is good to be | aware of. But shying away from SQL extensions is usually no | longer a matter of budgetary apocalypse. ___________________________________________________________________ (page generated 2020-03-20 23:00 UTC)