[HN Gopher] Let's talk about joins ___________________________________________________________________ Let's talk about joins Author : r4um Score : 105 points Date : 2024-01-20 06:52 UTC (16 hours ago) (HTM) web link (cghlewis.com) (TXT) w3m dump (cghlewis.com) | adrianmonk wrote: | From the opening of the "Vertical Joins" section: | | > _Similar to horizontal joins, there are many use cases for | joining data horizontally, also called appending data._ | | Shouldn't this read "joining data vertically"? This seems like a | typo. | cghlewis wrote: | This is definitely a typo! Thanks for catching this! | jethkl wrote: | The article omits the "natural join". A natural join | automatically joins tables on columns having the same name, and | the columns in the resulting output appear exactly once. Natural | joins often simplify queries, and they automatically adapt to | schema changes in source tables. Natural joins also encourage | standardized naming conventions, which has longer-term benefits. | youerbt wrote: | I feel like natural joins simplify writing queries, but not | exactly reading them (especially if you are not familiar with | the database). IMO a good compromise is the USING clause, which | acts like a natural join, but columns have to be named | explicitly. | amluto wrote: | Maybe some SQL server should require a token like | "USING_WILD_GUESS" when doing a natural join :) | chasil wrote: | The problem with USING is that it isn't supported everywhere. | hipadev23 wrote: | That's not a join type, that's syntactic sugar. | bradleybuda wrote: | A right join is also syntactic sugar for a left join | bakuninsbart wrote: | Exactly, that's why they shouldn't be used. If you feel the | need to use a right join, swap the direction around. And | only use inner joins if it is the only type of join you | use, otherwise specify conditions in the where clause. Both | serve to significantly lower mental overhead when dealing | with queries. | | Natural joins are naturally more implicit, and while SQL | tends to be a little bit verbose, given the significance of | data integrity and the dififculty of testing SQL, the | trade-off goes very clearly towards being explicit. | cldellow wrote: | Natural joins also automatically break your queries when two | columns happen to share a name but not the same meaning. | | Step 1: use natural join. Life is great. Step 2: someone adds a | `comment` field on table A. Life is great. Step 3: someone adds | a `comment` field on table B. Ruh roh. | | I'll use them in short-lived personal projects, but not on | something where I'm collaborating with other people on software | that evolves over several years. | jethkl wrote: | a defense against collisions like this is through CTEs that | select a minimal set of columns, with column names suitably | selected and standardized: CTE_A AS (SELECT | ... comment as comment_a from A...), CTE_B AS (SELECT | ... comment as comment_b from B...) | Cyberdog wrote: | Isn't this a lot more work both for the user and the RDBMS | than just using a relatively simple left join? | closeparen wrote: | It seems like the database should be able to figure this out | when a foreign key constraint is explicitly declared in the | DDL. | tqi wrote: | Should, but in practice I've rarely seen fk contraints used | in analytics data warehouses (mostly for etl performance | reasons) | civilized wrote: | > Natural joins also encourage standardized naming conventions, | which has longer-term benefits. | | This is a very positive spin on "you have to manage column | names very rigorously for this strategy to be sustainable". | mgaunard wrote: | There are lots of other joins not mentioned there. | | A popular one with time series is the asof join. | | There are also literal joins, which are generalizations of | adjacent difference. | mgaunard wrote: | meant lAteral sorry | HermitX wrote: | Excellent learning material, thanks for sharing. I've noticed an | interesting trend: JOINS are crucial for data analytics, yet many | new open-source data analytics products, or open-source OLAP | products, offer limited support for JOINS. Examples include | ClickHouse, Apache Druid, and Apache Pinot. It seems that | currently, only Trino and StarRocks provide robust support for | JOINS. Commercial products tend to have better support in this | area. I hope the various open-source projects will also enhance | their JOIN capabilities. | minitoar wrote: | The reason those tools have more limited support for joins is | mainly because they are making intentional trade offs in favor | of other features, eg performance in a particular domain. | totalhack wrote: | I've also been frustrated when testing out tools that kinda | keep you locked into one predetermined view, table, or set of | tables at a time. I made a semantic data modeling library that | puts together queries (and of course joins) for you as it uses | a drill-across querying technique, and can also join data | across different data sources in a secondary execution layer. | | https://github.com/totalhack/zillion | | Disclaimer: this project is currently a one man show, though I | use it in production at my own company. | closeparen wrote: | If you want arbitrarily powerful adhoc query support, you need | to wait for data to land in an offline warehouse or lake | environment where you have access to e.g. Presto/Trino and | Spark. If you want a near-real-time view then you're going to | need to design the data layout around your query pattern - do a | streaming join or other enrichment prior to OLAP ingestion. | necubi wrote: | Yep. There are always going to be constraints about how well | a system like clickhouse can support arbitrary joins. Queries | in clickhouse are fast because the data is laid out in such a | way that it can minimize how much it needs to read. | | Part of this is the columnar layout that means it can avoid | reading columns that are not involved in the query. However | it's also able to push query predicates into the table scan, | using metadata (like bloom filters) that tell it what values | are in each chunk of data. | | But for joins, you typically end up needing to read all of | the data and materialize it in memory. | | For realtime joins the best option is to do it in a steaming | fashion on ingestion, for example in a system like Flink or | Arroyo [0], which I work on. | | [0] https://github.com/ArroyoSystems/arroyo | closeparen wrote: | Something I have found pretty annoying is that Flink works | great for joining a stream against another stream where | messages to be joined are expected to arrive within a few | minutes of each other, but there is actually ~no platform | solution for joining a small, unchanging or slowly changing | table against a stream. We end up needing a service to | consume the messages, make RPC calls, and re-emit them with | new fields. | teunispeters wrote: | A related note, PostgreSQL is very good at joins, but MySQL - | with at the time much larger share - was never very good at | them (at the time). (I last explored this 2016 and before). But | a lot of web interfaces to data exploration (then) were based | on MySQL and its quirks, and that colours perspectives a lot. | tosh wrote: | asof join: | | https://code.kx.com/q/ref/asof/ | | https://code.kx.com/q/learn/brief-introduction/#time-joins | | https://clickhouse.com/docs/en/sql-reference/statements/sele... | | https://duckdb.org/docs/guides/sql_features/asof_join.html | | > Do you have time series data that you want to join, but the | timestamps don't quite match? Or do you want to look up a value | that changes over time using the times in another table? | | DuckDB blog post on temporal joins: | | https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look... | | note: this idea is very useful for timestamps but can also be | used for other types of values | jamesblonde wrote: | The ASOF LEFT JOIN is the main join used to create training | data, where the labels are in the left table, and features are | in the tables on the RHS. Note, we use DuckDB for its left asof | join capabilities, but most vendors use Spark which is | hideously slow for left asof joins. | andy81 wrote: | If I was going to add anything - | | "Vertical join" is normally called union. | | "Right join" is just (bad) syntax sugar and should be avoided. | Left join with the tables reversed is the usual convention. | | The join condition for inner is optional- if it's always true | then you get a "cross join". Can be useful to show all the | possible combinations of two fields. | wood_spirit wrote: | All the other non-left joins are just syntactic sugar and can | be expressed using only left join...? | tzot wrote: | > All the other non-left joins are just syntactic sugar and | can be expressed using only left join...? | | A "vertical join" (SQL UNION) is one of the "non-left joins". | How can you transform a "vertical join" to a "left join"? | cbreezyyall wrote: | This feels like an interesting interview question. I think | you could simulate this with a full outer join on the | entire select list and coalesce? For a UNION ALL you could | put some literal column in the selects from both tables | that you set to different values and include that in the | join so you'd get a result set that will have all nulls in | the right table columns for the rows in the left table and | vice versa. Something like WITH top_t AS ( | SELECT a ,b ,c ,'top' as | nonexistent_col FROM table_1 ), bottom_t AS ( | SELECT a ,b ,c ,'bottom' | as nonexistent_col FROM table_2 ) SELECT | COALESCE(top_t.a, bottom_t.a) AS a | ,COALESCE(top_t.b, bottom_t.b) AS b | ,COALESCE(top_t.c, bottom_t.c) AS c FROM top_t FULL | OUTER JOIN bottom_t ON top_t.a = bottom_t.a | AND top_t.b = bottom_t.b AND top_t.c = bottom_t.c | AND top_t.nonexistent_col = bottom_t.nonexistent_col -- | remove this for a normal UNION | wood_spirit wrote: | Bravo!! | Little_Kitty wrote: | Of the tens of thousands of queries I've written I've needed | right join the exactly once. It's a feature which is neat in | that it exists, but the prevalence in teaching materials is | entirely unjustified. Cross joins are massively more practical | and enable some efficient transformations, but are usually | taught only as all to all without a clear position on why they | are useful. | nikhilsimha wrote: | Never heard a union be called a "vertical join" before. | petalmind wrote: | It seems that lots of people independently coin this | expression. I did it too a couple of years ago. Just checked | Google, and there are lots of hits for this. | airstrike wrote: | _> Let's try this again using R._ | | This should actually be "Let's try this again using dplyr, one of | the most elegant pieces of software ever written". | | Hadley Wickham is a treasure to humanity. | jarym wrote: | > "Here we typically expect that the combined dataset will have | the same number of rows as our original left side dataset." | | For left join this isn't entirely true. If there are more | matching cases on the right joined table then you'll get | additional rows for each match. That is unless you take steps to | ensure only at most one row is matched per row on the left (eg | using something like DISTINCT ON in Postgres) | wood_spirit wrote: | Yes I picked that up and was tempted to comment. But then half | way down it addresses this with the section "Many | relationships": | | > Until now we have discussed scenarios that are considered | one-to-one merges. In these cases, we only expect one | participant in a dataset to be joined to one instance of that | same participant in the other dataset. | | > However, there are scenarios where this will not be the | case... | jarym wrote: | Got it, in my opinion the structure is likely to mislead | people because it doesn't make clear that it refers to 'one- | to-one' relationships at the outset (in fact not even, it | deals with the cases of one-to-one and one-to-none) - it only | refers to 'left join'. | wood_spirit wrote: | Yeap, the whole article is confusing to us who are already | very familiar with eg SQL joins and things. It's not using | mainstream terminology. But I guess we are not the | audience. | petalmind wrote: | This is my pet peeve. Top google search results and ChatGPT | suggest this "same number of rows" mental model which is | incomplete and breaks. | | I wrote about this: | https://minimalmodeling.substack.com/p/many-explanations-of-... | zzzeek wrote: | when I saw the term "horizontal joins", I immediately went to, | what? what's a "vertical join?" must be a UNION, scrolled down | and sure enough. | | Is there some reason to use non-standard terminology for posts | that are trying to be in-depth, authoritative tutorials ? | genman wrote: | Good material. | | Joins can be also categorized by the used join algorithm. | | The simplest join is a nested loop join. In this case a larger | dataset is iterated and another small dataset is combined using a | simple key lookup. | | Then there is a merge join. In this case two larger datasets are | first sorted using merge sort and then aligned linearly. | | Then there is a hash join. For the hash join a hash table is | generated first based on the smaller dataset and the larger | dataset is iterated and the join is made by making a hash lookup | to the generated hash table. | | The difference between nested loop join and hash join might be | confusing. | | In case of a nested loop join the second table is not loaded from | the storage medium first, instead an index is used to lookup the | location of the records. This has O(log n) complexity for each | lookup. For hash join the table is loaded and hash table is | generated. In this case each lookup has O(1) complexity but | creation of hash table is expensive (it has O(n) complexity) and | is only worth the cost when the dataset is relatively large. | tmoertel wrote: | In distributed systems, there's even a _broadcast_ hash join, | in which the hash table is not distributed across shards and | assigned to workers but copied in full to every worker. That | way, the other side of the join need not be sharded and | shuffled to align join keys across workers. This strategy can | save a lot of time and network bandwidth when one side of the | join is small enough to fit into a worker 's RAM and the other | side is staggeringly massive (e.g., logs). It lets the massive | side be processed in place, as it's streamed from storage. | lakomen wrote: | Beginners talking about SQL joins. What is this newbienews or | hackernews? "But but ChatGPT says"... | | Disgusting ___________________________________________________________________ (page generated 2024-01-20 23:00 UTC)