[HN Gopher] Let's talk about joins
       ___________________________________________________________________
        
       Let's talk about joins
        
       Author : r4um
       Score  : 105 points
       Date   : 2024-01-20 06:52 UTC (16 hours ago)
        
 (HTM) web link (cghlewis.com)
 (TXT) w3m dump (cghlewis.com)
        
       | adrianmonk wrote:
       | From the opening of the "Vertical Joins" section:
       | 
       | > _Similar to horizontal joins, there are many use cases for
       | joining data horizontally, also called appending data._
       | 
       | Shouldn't this read "joining data vertically"? This seems like a
       | typo.
        
         | cghlewis wrote:
         | This is definitely a typo! Thanks for catching this!
        
       | jethkl wrote:
       | The article omits the "natural join". A natural join
       | automatically joins tables on columns having the same name, and
       | the columns in the resulting output appear exactly once. Natural
       | joins often simplify queries, and they automatically adapt to
       | schema changes in source tables. Natural joins also encourage
       | standardized naming conventions, which has longer-term benefits.
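The behavior described above can be demonstrated in a few lines with an in-memory SQLite database; the `students`/`scores` tables here are hypothetical, chosen only to illustrate the semantics.

```python
import sqlite3

# A minimal sketch of NATURAL JOIN semantics, run against an in-memory
# SQLite database; the students/scores tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER, name TEXT);
    CREATE TABLE scores   (student_id INTEGER, score INTEGER);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO scores   VALUES (1, 90), (2, 75);
""")
# NATURAL JOIN matches on the shared column name (student_id),
# and that column appears exactly once in the output.
rows = conn.execute(
    "SELECT * FROM students NATURAL JOIN scores ORDER BY student_id"
).fetchall()
print(rows)  # [(1, 'Ana', 90), (2, 'Ben', 75)]
```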
        
         | youerbt wrote:
         | I feel like natural joins simplify writing queries, but not
         | exactly reading them (especially if you are not familiar with
         | the database). IMO a good compromise is the USING clause, which
         | acts like a natural join, but columns have to be named
         | explicitly.
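The USING compromise looks like this in SQLite (same hypothetical tables as above): the join column is named explicitly, but still appears only once in the result.

```python
import sqlite3

# A sketch of the USING clause: like a natural join, but the join
# columns are named explicitly. Table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER, name TEXT);
    CREATE TABLE scores   (student_id INTEGER, score INTEGER);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO scores   VALUES (1, 90), (2, 75);
""")
# JOIN ... USING (student_id) joins only on the named column, and the
# column appears once in the result, as with a natural join.
rows = conn.execute(
    "SELECT * FROM students JOIN scores USING (student_id) "
    "ORDER BY student_id"
).fetchall()
print(rows)  # [(1, 'Ana', 90), (2, 'Ben', 75)]
```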
        
           | amluto wrote:
           | Maybe some SQL server should require a token like
           | "USING_WILD_GUESS" when doing a natural join :)
        
           | chasil wrote:
           | The problem with USING is that it isn't supported everywhere.
        
         | hipadev23 wrote:
         | That's not a join type, that's syntactic sugar.
        
           | bradleybuda wrote:
           | A right join is also syntactic sugar for a left join
        
             | bakuninsbart wrote:
             | Exactly, that's why they shouldn't be used. If you feel the
             | need to use a right join, swap the direction around. And
             | only use inner joins if it is the only type of join you
             | use, otherwise specify conditions in the where clause. Both
             | serve to significantly lower mental overhead when dealing
             | with queries.
             | 
             | Natural joins are naturally more implicit, and while SQL
             | tends to be a little bit verbose, given the significance of
              | data integrity and the difficulty of testing SQL, the
             | trade-off goes very clearly towards being explicit.
        
         | cldellow wrote:
         | Natural joins also automatically break your queries when two
         | columns happen to share a name but not the same meaning.
         | 
         | Step 1: use natural join. Life is great. Step 2: someone adds a
         | `comment` field on table A. Life is great. Step 3: someone adds
         | a `comment` field on table B. Ruh roh.
         | 
         | I'll use them in short-lived personal projects, but not on
         | something where I'm collaborating with other people on software
         | that evolves over several years.
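The failure mode described above is easy to reproduce; in this sketch the tables `a` and `b` are hypothetical, standing in for the comment's step 3, where both tables have grown a `comment` column.

```python
import sqlite3

# A sketch of the breakage: once both tables have a `comment` column,
# NATURAL JOIN silently includes it in the join key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, comment TEXT);
    CREATE TABLE b (id INTEGER, comment TEXT);
    INSERT INTO a VALUES (1, 'note from A');
    INSERT INTO b VALUES (1, 'note from B');
""")
# The join key is now (id, comment); the comments differ, so the
# matching id=1 rows silently vanish from the result.
rows = conn.execute("SELECT * FROM a NATURAL JOIN b").fetchall()
print(rows)  # []
```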
        
           | jethkl wrote:
            | A defense against collisions like this is to use CTEs that
            | select a minimal set of columns, with column names suitably
            | chosen and standardized:
            | 
            |     CTE_A AS (SELECT ... comment AS comment_a FROM A ...),
            |     CTE_B AS (SELECT ... comment AS comment_b FROM B ...)
        
             | Cyberdog wrote:
             | Isn't this a lot more work both for the user and the RDBMS
             | than just using a relatively simple left join?
        
           | closeparen wrote:
           | It seems like the database should be able to figure this out
           | when a foreign key constraint is explicitly declared in the
           | DDL.
        
             | tqi wrote:
              | Should, but in practice I've rarely seen fk constraints used
             | in analytics data warehouses (mostly for etl performance
             | reasons)
        
         | civilized wrote:
         | > Natural joins also encourage standardized naming conventions,
         | which has longer-term benefits.
         | 
         | This is a very positive spin on "you have to manage column
         | names very rigorously for this strategy to be sustainable".
        
       | mgaunard wrote:
       | There are lots of other joins not mentioned there.
       | 
       | A popular one with time series is the asof join.
       | 
       | There are also literal joins, which are generalizations of
       | adjacent difference.
        
         | mgaunard wrote:
         | meant lAteral sorry
        
       | HermitX wrote:
       | Excellent learning material, thanks for sharing. I've noticed an
       | interesting trend: JOINS are crucial for data analytics, yet many
       | new open-source data analytics products, or open-source OLAP
       | products, offer limited support for JOINS. Examples include
       | ClickHouse, Apache Druid, and Apache Pinot. It seems that
       | currently, only Trino and StarRocks provide robust support for
       | JOINS. Commercial products tend to have better support in this
       | area. I hope the various open-source projects will also enhance
       | their JOIN capabilities.
        
         | minitoar wrote:
         | The reason those tools have more limited support for joins is
         | mainly because they are making intentional trade offs in favor
         | of other features, eg performance in a particular domain.
        
         | totalhack wrote:
          | I've also been frustrated when testing out tools that kinda
          | keep you locked into one predetermined view, table, or set of
          | tables at a time. I made a semantic data modeling library that
          | puts together queries (and of course joins) for you using a
          | drill-across querying technique, and can also join data across
          | different data sources in a secondary execution layer.
         | 
         | https://github.com/totalhack/zillion
         | 
         | Disclaimer: this project is currently a one man show, though I
         | use it in production at my own company.
        
         | closeparen wrote:
         | If you want arbitrarily powerful adhoc query support, you need
         | to wait for data to land in an offline warehouse or lake
         | environment where you have access to e.g. Presto/Trino and
         | Spark. If you want a near-real-time view then you're going to
         | need to design the data layout around your query pattern - do a
         | streaming join or other enrichment prior to OLAP ingestion.
        
           | necubi wrote:
           | Yep. There are always going to be constraints about how well
           | a system like clickhouse can support arbitrary joins. Queries
           | in clickhouse are fast because the data is laid out in such a
           | way that it can minimize how much it needs to read.
           | 
           | Part of this is the columnar layout that means it can avoid
           | reading columns that are not involved in the query. However
           | it's also able to push query predicates into the table scan,
           | using metadata (like bloom filters) that tell it what values
           | are in each chunk of data.
           | 
           | But for joins, you typically end up needing to read all of
           | the data and materialize it in memory.
           | 
            | For realtime joins the best option is to do it in a streaming
           | fashion on ingestion, for example in a system like Flink or
           | Arroyo [0], which I work on.
           | 
           | [0] https://github.com/ArroyoSystems/arroyo
        
             | closeparen wrote:
             | Something I have found pretty annoying is that Flink works
             | great for joining a stream against another stream where
             | messages to be joined are expected to arrive within a few
             | minutes of each other, but there is actually ~no platform
             | solution for joining a small, unchanging or slowly changing
             | table against a stream. We end up needing a service to
             | consume the messages, make RPC calls, and re-emit them with
             | new fields.
        
         | teunispeters wrote:
          | A related note: PostgreSQL is very good at joins, but MySQL -
          | which at the time had a much larger market share - was never
          | very good at them (I last explored this in 2016 and earlier).
          | But a lot of web interfaces for data exploration were then
          | based on MySQL and its quirks, and that colours perspectives a
          | lot.
        
       | tosh wrote:
       | asof join:
       | 
       | https://code.kx.com/q/ref/asof/
       | 
       | https://code.kx.com/q/learn/brief-introduction/#time-joins
       | 
       | https://clickhouse.com/docs/en/sql-reference/statements/sele...
       | 
       | https://duckdb.org/docs/guides/sql_features/asof_join.html
       | 
       | > Do you have time series data that you want to join, but the
       | timestamps don't quite match? Or do you want to look up a value
       | that changes over time using the times in another table?
       | 
       | DuckDB blog post on temporal joins:
       | 
       | https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look...
       | 
       | note: this idea is very useful for timestamps but can also be
       | used for other types of values
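The core of an asof lookup can be sketched in plain Python with a binary search: for each probe timestamp, find the latest value at or before it. The quote/trade names below are hypothetical.

```python
import bisect

# A minimal sketch of asof-join semantics: for each (hypothetical)
# trade timestamp, look up the latest quote at or before it.
def asof_lookup(times, values, t):
    """Return the value whose timestamp is the latest one <= t,
    or None if every timestamp is later. `times` must be sorted."""
    i = bisect.bisect_right(times, t) - 1
    return values[i] if i >= 0 else None

quote_times  = [1, 5, 9]
quote_prices = [100, 101, 99]
trades = [0, 5, 7, 12]
print([asof_lookup(quote_times, quote_prices, t) for t in trades])
# [None, 101, 101, 99]
```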
        
         | jamesblonde wrote:
         | The ASOF LEFT JOIN is the main join used to create training
         | data, where the labels are in the left table, and features are
         | in the tables on the RHS. Note, we use DuckDB for its left asof
         | join capabilities, but most vendors use Spark which is
         | hideously slow for left asof joins.
        
       | andy81 wrote:
       | If I was going to add anything -
       | 
       | "Vertical join" is normally called union.
       | 
       | "Right join" is just (bad) syntax sugar and should be avoided.
       | Left join with the tables reversed is the usual convention.
       | 
       | The join condition for inner is optional- if it's always true
       | then you get a "cross join". Can be useful to show all the
       | possible combinations of two fields.
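The cross-join use case mentioned above (all combinations of two fields) can be shown with SQLite; the `sizes`/`colors` tables are hypothetical.

```python
import sqlite3

# A sketch of a cross join producing all combinations of two fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sizes  (size TEXT);
    CREATE TABLE colors (color TEXT);
    INSERT INTO sizes  VALUES ('S'), ('M');
    INSERT INTO colors VALUES ('red'), ('blue');
""")
# CROSS JOIN (equivalently, an inner join whose condition is always
# true) yields every pairing: 2 sizes x 2 colors = 4 rows.
rows = conn.execute(
    "SELECT size, color FROM sizes CROSS JOIN colors ORDER BY size, color"
).fetchall()
print(rows)  # [('M', 'blue'), ('M', 'red'), ('S', 'blue'), ('S', 'red')]
```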
        
         | wood_spirit wrote:
         | All the other non-left joins are just syntactic sugar and can
         | be expressed using only left join...?
        
           | tzot wrote:
           | > All the other non-left joins are just syntactic sugar and
           | can be expressed using only left join...?
           | 
           | A "vertical join" (SQL UNION) is one of the "non-left joins".
           | How can you transform a "vertical join" to a "left join"?
        
             | cbreezyyall wrote:
             | This feels like an interesting interview question. I think
             | you could simulate this with a full outer join on the
             | entire select list and coalesce? For a UNION ALL you could
             | put some literal column in the selects from both tables
             | that you set to different values and include that in the
             | join so you'd get a result set that will have all nulls in
             | the right table columns for the rows in the left table and
              | vice versa. Something like:
              | 
              |     WITH top_t AS (
              |       SELECT a, b, c, 'top' AS nonexistent_col
              |       FROM table_1
              |     ), bottom_t AS (
              |       SELECT a, b, c, 'bottom' AS nonexistent_col
              |       FROM table_2
              |     )
              |     SELECT
              |       COALESCE(top_t.a, bottom_t.a) AS a
              |      ,COALESCE(top_t.b, bottom_t.b) AS b
              |      ,COALESCE(top_t.c, bottom_t.c) AS c
              |     FROM top_t FULL OUTER JOIN bottom_t
              |        ON top_t.a = bottom_t.a
              |       AND top_t.b = bottom_t.b
              |       AND top_t.c = bottom_t.c
              |       AND top_t.nonexistent_col = bottom_t.nonexistent_col
              |       -- remove this condition for a normal UNION
        
               | wood_spirit wrote:
               | Bravo!!
        
         | Little_Kitty wrote:
          | Of the tens of thousands of queries I've written, I've needed
          | right join exactly once. It's a feature which is neat in that
          | it exists, but the prevalence in teaching materials is
          | entirely unjustified. Cross joins are massively more practical
          | and enable some efficient transformations, but are usually
          | taught only as all-to-all, without a clear explanation of why
          | they are useful.
        
       | nikhilsimha wrote:
       | Never heard a union be called a "vertical join" before.
        
         | petalmind wrote:
         | It seems that lots of people independently coin this
         | expression. I did it too a couple of years ago. Just checked
         | Google, and there are lots of hits for this.
        
       | airstrike wrote:
       | _> Let's try this again using R._
       | 
       | This should actually be "Let's try this again using dplyr, one of
       | the most elegant pieces of software ever written".
       | 
       | Hadley Wickham is a treasure to humanity.
        
       | jarym wrote:
       | > "Here we typically expect that the combined dataset will have
       | the same number of rows as our original left side dataset."
       | 
        | For a left join this isn't entirely true. If there are multiple
        | matching rows in the right-hand table, you'll get an additional
        | row for each match - unless you take steps to ensure that at
        | most one row is matched per row on the left (eg using something
        | like DISTINCT ON in Postgres).
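The row-multiplication effect is easy to see with a small example; the `people`/`phones` tables here are hypothetical.

```python
import sqlite3

# A sketch of a left join with multiple matches on the right side:
# the output has more rows than the left table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER, name TEXT);
    CREATE TABLE phones (person_id INTEGER, phone TEXT);
    INSERT INTO people VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO phones VALUES (1, '555-0001'), (1, '555-0002');
""")
# Ana matches two phone rows and Ben matches none, so a 2-row left
# table produces 3 output rows.
rows = conn.execute("""
    SELECT name, phone FROM people
    LEFT JOIN phones ON people.id = phones.person_id
    ORDER BY name, phone
""").fetchall()
print(rows)  # [('Ana', '555-0001'), ('Ana', '555-0002'), ('Ben', None)]
```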
        
         | wood_spirit wrote:
         | Yes I picked that up and was tempted to comment. But then half
         | way down it addresses this with the section "Many
         | relationships":
         | 
         | > Until now we have discussed scenarios that are considered
         | one-to-one merges. In these cases, we only expect one
         | participant in a dataset to be joined to one instance of that
         | same participant in the other dataset.
         | 
         | > However, there are scenarios where this will not be the
         | case...
        
           | jarym wrote:
           | Got it, in my opinion the structure is likely to mislead
           | people because it doesn't make clear that it refers to 'one-
           | to-one' relationships at the outset (in fact not even, it
           | deals with the cases of one-to-one and one-to-none) - it only
           | refers to 'left join'.
        
             | wood_spirit wrote:
             | Yeap, the whole article is confusing to us who are already
             | very familiar with eg SQL joins and things. It's not using
             | mainstream terminology. But I guess we are not the
             | audience.
        
         | petalmind wrote:
         | This is my pet peeve. Top google search results and ChatGPT
         | suggest this "same number of rows" mental model which is
         | incomplete and breaks.
         | 
         | I wrote about this:
         | https://minimalmodeling.substack.com/p/many-explanations-of-...
        
       | zzzeek wrote:
       | when I saw the term "horizontal joins", I immediately went to,
       | what? what's a "vertical join?" must be a UNION, scrolled down
       | and sure enough.
       | 
        | Is there some reason to use non-standard terminology for posts
        | that are trying to be in-depth, authoritative tutorials?
        
       | genman wrote:
       | Good material.
       | 
       | Joins can be also categorized by the used join algorithm.
       | 
        | The simplest join is a nested loop join. In this case a larger
        | dataset is iterated over and a smaller dataset is combined using
        | a simple key lookup.
       | 
       | Then there is a merge join. In this case two larger datasets are
       | first sorted using merge sort and then aligned linearly.
       | 
       | Then there is a hash join. For the hash join a hash table is
       | generated first based on the smaller dataset and the larger
       | dataset is iterated and the join is made by making a hash lookup
       | to the generated hash table.
       | 
       | The difference between nested loop join and hash join might be
       | confusing.
       | 
        | In the case of a nested loop join, the second table is not
        | loaded from the storage medium first; instead, an index is used
        | to look up the location of the records. This has O(log n)
        | complexity for each lookup. For a hash join, the table is loaded
        | and a hash table is generated. In this case each lookup has O(1)
        | complexity, but creating the hash table is expensive (it has
        | O(n) complexity) and is only worth the cost when the dataset is
        | relatively large.
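The build/probe structure of a hash join can be sketched in a few lines of plain Python; the row shapes and key names are hypothetical.

```python
from collections import defaultdict

# A toy hash join illustrating the two phases described above.
def hash_join(small, large, small_key, large_key):
    table = defaultdict(list)
    for row in small:                    # build phase: O(n)
        table[row[small_key]].append(row)
    out = []
    for row in large:                    # probe phase: O(1) per lookup
        for match in table.get(row[large_key], []):
            out.append({**match, **row})
    return out

users  = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 1, "total": 12}]
print(hash_join(users, orders, "id", "user_id"))
# [{'id': 1, 'name': 'Ana', 'user_id': 1, 'total': 30},
#  {'id': 1, 'name': 'Ana', 'user_id': 1, 'total': 12}]
```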
        
         | tmoertel wrote:
         | In distributed systems, there's even a _broadcast_ hash join,
         | in which the hash table is not distributed across shards and
         | assigned to workers but copied in full to every worker. That
         | way, the other side of the join need not be sharded and
         | shuffled to align join keys across workers. This strategy can
         | save a lot of time and network bandwidth when one side of the
          | join is small enough to fit into a worker's RAM and the other
         | side is staggeringly massive (e.g., logs). It lets the massive
         | side be processed in place, as it's streamed from storage.
        
       | lakomen wrote:
       | Beginners talking about SQL joins. What is this newbienews or
       | hackernews? "But but ChatGPT says"...
       | 
       | Disgusting
        
       ___________________________________________________________________
       (page generated 2024-01-20 23:00 UTC)