Joins vs Unions

When I first started the data school (way back in April) I would refer to the “coming together” of any 2 data sources as a “join”. It wasn’t long before I was taught how inaccurate this was and, indeed, the very first page of my data school notebook has the following diagram:

This was my very first “ohhhh” moment at the data school. For anyone that can’t read my handwriting (i.e. most people), let me explain in words what these two diagrams are trying to say.

UNION – the data sources have the same columns and will be stacked on top of one another, creating a longer table

JOIN – the data sources have one or more columns in common that you can combine together, creating a wider table

The difference in terminology between this and my mathematical background used to confuse me. For instance, the “union-ing” of 2 sets in maths (i.e. A∪B) is also represented by the Full Outer Join diagram above. I blame this for how long it took me to remember the difference between the join and the union tool in Alteryx!

Cross Database Joins

CROSS DATABASE JOIN – Joins data sources from different locations/databases e.g. in the below example, Google Sheets is joining with an Excel File

These behave the same as joins, but there are limitations e.g. data sources stored on Tableau server cannot be joined with other data sources in Tableau Desktop.

So When Should I Blend?

BLEND – an aggregated left join

Let’s say you have two data sources that have one or more common columns, but the level of aggregation is different.

For example, if we think about the Orders table in Sample Superstore, each row represents a Product ID. Conversely, each row of the Returns table represents an Order ID – a higher level of aggregation. If we were interested in linking these returns to a Customer ID, then a join would create multiple rows for each Order ID, ballooning out our dataset. So blending would work better here.

Another time that blending can be useful is when the Cross database joins are not possible. In the Tableau server example, blending resolves this problem without having to download the data and store it locally.