A beginner’s guide to combining multiple data sources

When it comes to prepping data for analysis, one of the most important tools to have in your wheelhouse is being able to combine different data sources. 

Union

The first method of combining two data sources is a union. This is when you want to stack two similar tables one on top of the other. This can only be done if both tables share the same columns and data types. Examples of when this could be useful include data sets which measure the same things, but are captured at different times or for different areas. 

Joins

If you think about unions as a vertical way to combine two tables, joins can be thought of as horizontal. Here, instead of stacking one table on top of another, you are combining rows based on a common field (or fields). 

When performing a join there are two things you must consider:

  • The join condition/clause: the field(s) that you are matching on (ID columns are usually useful for this).
  • The join type.

What do I mean by join type?

Depending on what you want the output of the join to look like, you have some different options. Let’s take a closer look:

Inner join

Usually the default option. This will return rows only where the join condition is met. 

Since you are only bringing back rows which satisfy the join clause, you would not expect this to produce any null values.

Left join and right join

These are essentially the same, it’s just a matter of perspective. To make this a bit clearer, let’s consider two tables, table A and table B: 

Scenario 1: Left join table B onto table A. 

This would return all the rows in table A, but would only return rows in B where the join condition has been met. 

Scenario 2: Right join table B onto A.

This would return all of the rows in B and only the matching ones in A. 

In either case, any rows which don’t meet the join condition, but are included in the output data set, will have null values for the fields from their ‘non-native’ table.

Full Join

Finally we have our outer joins. Here, all records will be returned, regardless of meeting the join condition or not. It’s worth noting that the use cases for this option are much more limited than the others. 

Any rows which are brought back and don’t meet the join clause will contain null values in their non-native columns.

Hopefully this has been a useful overview of your options when combining different data sources! 

Author:
Joseph Hughes
Powered by The Information Lab
1st Floor, 25 Watling Street, London, EC4M 9BR
Subscribe
to our Newsletter
Get the lastest news about The Data School and application tips
Subscribe now
© 2025 The Information Lab