[HN Gopher] Distinct on: a confusing, unique and useful feature ...
       ___________________________________________________________________
        
       Distinct on: a confusing, unique and useful feature in Postgres
        
       Author : yogiUSA
       Score  : 76 points
       Date   : 2020-03-19 08:40 UTC (1 days ago)
        
 (HTM) web link (www.yogeshchauhan.com)
 (TXT) w3m dump (www.yogeshchauhan.com)
        
       | Groxx wrote:
       | Seems like a saner option (in that behavior is defined) than
       | MySQL's ability to select columns in a group-by that are not
       | aggregates / grouped-on.
       | 
       | e.g. this works in MySQL, and has for many years:
       | select a, b, c from table group by a
       | 
       | and you'll get undefined (but valid) b and c items from the
       | grouped results per `a`. It's absurdly useful at times - if the
       | non-grouped columns are just denormalized data, it doesn't matter
       | which row they came from, and there's no need to spend time
       | sorting/maxing/etc. And sometimes you truly don't care which
       | entry you get, you just want _an_ entry, e.g. to show a sample of
       | what 's in the group.
        
         | bfoks wrote:
         | But not by default. It's related to ONLY_FULL_GROUP_BY.
         | 
         | More here: https://dev.mysql.com/doc/refman/8.0/en/group-by-
         | handling.ht...
        
         | greiskul wrote:
         | Is this similar to BigQuery's ANY_VALUE aggregate function? I
         | found it to be extremely useful.
        
       | beckingz wrote:
       | This removes the self join needed to ask the query
       | "(latest/first/biggest/smallest) record per
       | (customer/user/entity)".
       | 
       | Nice!
        
         | caseyohara wrote:
         | I've been struggling with a similar query recently perhaps
         | better suited for a time series DB. Can DISTINCT ON be used to
         | ask the query "latest record per customer _per day_ ". For
         | example, let's say we have a bunch of devices that send sensor
         | readings arbitrarily throughout each day. Could DISTINCT ON be
         | used to query the last reading of each day per device?
        
           | claytonjy wrote:
           | I've been using Timescale DB, a postgres extension, to store
           | our sensor-produced timeseries data and it handles this
           | pretty well. DISTINCT ON seems to ruin the query plan there,
           | by forcing sequential, non-index scans of each subtable
           | (timescale breaks tables into time-buckets behind the scenes,
           | so it can skip searching tables in time-filtered queries).
           | I've had good performance with their "time_bucket" and "last"
           | functions, though, and not having to use some NoSQL time
           | series database has been really nice.
        
           | eyelidlessness wrote:
           | Yes, you can `DISTINCT ON` with a `DATE_TRUNC` expression
           | which should get you what you want.
        
           | lenzm wrote:
           | Absolutely. You'll need an expression to get the date out of
           | the timestamp, then you want rows that are distinct on
           | customer_id & date order by timestamp desc.
        
           | dagss wrote:
           | Also consider expressin this with a lateral join.
           | 
           | Scroll a bit down here for a similar example.
           | 
           | https://stackoverflow.com/questions/11472790/postgres-
           | analog...
        
         | Zecc wrote:
         | SQLite takes it one step further.
         | 
         | https://sqlite.org/quirks.html#aggregate_queries_can_contain...
         | 
         | > if an aggregate query contains a single min() or max()
         | function, then the values of columns used in the output are
         | taken from the row where the min() or max() value was achieved.
         | If two or more rows have the same min() or max() value, then
         | the columns values will be chosen arbitrarily from one of those
         | rows.
         | 
         | Also here:
         | 
         | https://sqlite.org/lang_select.html#bareagg
         | 
         | > When the min() or max() aggregate functions are used in an
         | aggregate query, all bare columns in the result set take values
         | from the input row which also contains the minimum or maximum.
         | [...] There is still an ambiguity if two or more of the input
         | rows have the same minimum or maximum value or if the query
         | contains more than one min() and/or max() aggregate function.
         | 
         | Would have been nice if adding an ORDER BY would allow
         | disambiguating.
        
       | Tostino wrote:
       | Distinct on is for sure one of the most useful Postgres
       | extensions to SQL, as far as how concise it makes queries which
       | otherwise would suck to write.
       | 
       | I would have liked to see equivalent queries written with GROUP
       | BY so people could understand and translate that knowledge even
       | if they don't have direct Postgres experience.
       | 
       | Thanks for the article!
        
         | beckingz wrote:
         | This would be great, because getting those same results in
         | standard SQL or MariaDB is a pain.
        
         | claytonjy wrote:
         | I believe the demo query                   SELECT DISTINCT ON
         | (location) location, time, report         FROM weather_reports
         | ORDER BY location, time DESC;
         | 
         | could also be written with a self-join like so
         | SELECT location, time, report         FROM weather_reports
         | JOIN (             SELECT location, max(time) AS time
         | FROM weather_reports             GROUP BY location         ) as
         | _ USING (location, time)         ORDER BY location;
         | 
         | though I'd be curious what the query-plan differences are. I
         | suspect the DISTINCT ON might translate more closely to a
         | lateral join                   SELECT         FROM (
         | SELECT DISTINCT(location)             FROM weather_reports
         | ) AS locations         JOIN LATERAL (             SELECT
         | location, time, report             FROM weather_reports
         | WHERE location = locations.location             ORDER BY ts
         | DESC             LIMIT 1         ) AS reports ON TRUE
         | ORDER BY location;
         | 
         | But I really can't say.
        
       | Twisell wrote:
       | What's nice about PostgreSQL is that you quite often discover
       | little useful trick you have missed so far.
       | 
       | Guess that's however true for any well written language when you
       | master it enough to enjoy theses gems hidden in plain sight.
        
       | cosmie wrote:
       | One really useful call out is that it supports expressions, not
       | just columns. You can create derived entities via case statements
       | (think binning, entity normalization, etc) and leverage the super
       | handy functionality against those derived entities.
       | 
       | It also allows you to use multiple entities/expressions in the
       | distinct on, so you can pivot the logic between granularity
       | levels. Which comes in handy when needing to pull rollups
       | latest/first/biggest/smallest style rollups from varying
       | perspectives.
        
         | rattray wrote:
         | This sounds pretty awesome. Anyone have any sample queries
         | demonstrating this lying around?
        
           | cosmie wrote:
           | I threw together a Fiddle that demonstrates it here[1]. It
           | shows the basic single column version, a multi-column
           | distinct on, and an example that uses a case statement for a
           | derived column. The only thing with the third example is that
           | I don't believe you can alias the derived column within the
           | distinct on field itself, so have to repeat the expression -
           | once within the distinct on and again in the column list of
           | the select statement.
           | 
           | [1] https://www.db-fiddle.com/f/7CCYFf14UtkuGYfGZAt5PY/2
        
       | cosmotic wrote:
       | If you want your application to be portable, don't use this
       | feature.
        
         | irrational wrote:
         | Do you mean portable as in able to run on different databases?
         | Your application would have to be fairly simple to achieve that
         | goal. We moved an application from Oracle to Postgres. It took
         | about 2 years to complete.
        
         | _jal wrote:
         | This advice grew out of the days when database platforms cost
         | enormous amounts of money. It is still as true as it ever was,
         | but since then the biggest pressure that lead to it no longer
         | applies.
         | 
         | Back before PG and others were viable competitors to the big
         | commercial databases, about the only leverage customers had
         | against aggressive sales people was the threat to switch to a
         | competitor. So people would preserve that as a credible threat
         | by not using proprietary SQL extensions.
         | 
         | No vampiric salesdroid from Postgres is going to knock on your
         | door demanding 8 figures.
         | 
         | Now, there may be other reasons you want to switch away from
         | it, or you may want to support a variety of DBs in whatever
         | you're building, or some other reason. So it is good to be
         | aware of. But shying away from SQL extensions is usually no
         | longer a matter of budgetary apocalypse.
        
       ___________________________________________________________________
       (page generated 2020-03-20 23:00 UTC)