In the first week of The Data School we have learned how we can reshape data during the data preparation stage. Five methods for this are listed below.
- Filtering
This is to ensure we are only looking at the most relevant and accurate data. Usually done by filtering out one or multiple values from a data field, which leads to the whole records being removed.
- Pivoting
Pivoting involves transforming rows to columns or columns to rows.
Columns to rows is useful when we want to update data sets with future records. Rows to columns converts each metric to its own columns.
- Aggregation
Aggregating data reduces the granularity. Granularity is the the unique combination of categorical fields that define each record. Removing at least one of these fields will reduce the granularity.
- Unions
Stacking data sets on top of each other to increase the number of records is forming a union. To do this we require matching or similar data fields.
- Joins
If we bring two data sets together to increase the number of data fields, this is forming a join. To do this we must have join condition - a matching data field in each data set to tie them together.
We can create inner or outer joins. Inner joins only keep records that are matched in both tables, and outer join keeps all records from both tables, whether matched or not.
