My first application to the Data School was made of four different databases. If you are working with multiple databases you will have to tell Tableau how these databases will interact with each other, inevitably, these made me stumble upon joins and unions too soon for my own good.
When should I use a join? What type of join should I use? When should I use a union instead?
I must confess I didn’t really ask myself these questions as I didn’t have a clue what was this about. In my case this involved an awful lot of trial and error but I hope that after you read this post you will be able to answer the previous questions and approach joins and unions with confidence.
When you work with tabular data, it is shaped in a column-row structure (probably the most common example is Excel tables). In terms of joins and unions, the key point of this post is that joins work with columns (horizontally) and unions work with rows (vertically).
As mentioned before, joins work with columns. This means that when we join two data sources, what we are doing is basically deciding how the columns of these databases will interact with each other. For example, we could find a common column of data between the sources and just extract that, or we could just add all the different columns into a bigger dataset. In Tableau, this decision is fourfold: Inner Join, Left Join, Right Join, Full Outer Join. Whilst here we will explain these four joins, it is also possible to have the inverse of left and right joins (keep just columns that are not shared by the databases).
The best way to visualise this is with a Venn Diagram:
For demonstration let’s consider the following tables as our different databases:
Let’s pretend I am studying population in Spain, UK and Germany, and I have the first database, but now I also want to get some density information to further my analysis. This is a case when joins would come to our help. But which join?
The left join is taking everything I have in my left table (the original population table) and connecting it to the points in common within my right table (density table). Because the three countries in the original database are also in the density database, my left join (with “Country” as joining point) will display what I want, which is the three countries with their population and density information. Because my population table does not include any other row that is not shared with the density table (always considering “country” as joining point) it will not show any null values. That said, if I were to have “Uganda” in my country list, this would show “nulls” for the density columns (as you will see visually in the case for right join).
Whilst the left join works well for this case, the inner join would be the most correct join to use. Because we are focusing only on Spain, UK and Germany, doing an inner join would allow us to keep the information relevant only for the three selected countries. That said, if we were to have “Uganda” in the population table, this row would not be taken into account when doing the inner join and we would not have any “null” values.
The right join is the opposite of the left join. Therefore, this join takes everything from the density table and tries to match it with the population table. Because in this case the population table does not have the same number of countries as the density table we see there are a big number of nulls for the cells in which there aren’t values.
As I stated in the introduction, joins connect tabular data sources horizontally. Therefore, the outer join is the easiest to explain as it literally connects both tables side by side. Because in this case there aren’t any countries from the population table that are not in the density table, the outer join has the same result as the right join.
However, let’s pretend for this case that we are comparing population tables between Europe and Africa. If I do an outer join between these two tables, you can easily see that the result is the equivalent of putting the tables side by side in the opposite directions. Because we are trying to join two tables that do not have a single column in common, we get a number of nulls that corresponds to the number of cells from both tables.
If joins work with columns, unions work with rows (or appending values if you prefer that jargon). This means that when we apply a union into two or more databases, they will connect vertically. Going back into the examples, if we want to compare the population of our three countries in the first table with the African countries from the second table, the join would be a perfect match. Because they share the same columns, no data would appear as null.
However, if we try to do a union between our population table with our density table we will see that it works like doing an outer join but vertically, resulting in nulls for all those cells with missing values.
That’s all about introducing joins and unions, for any comments you can find me in Twitter @DiegoTParker