Data validation is the process of making sure your data is the process of making sure your data is accurate and of high quality. Poor validation can lead to wrong data which can result in poor data driven decisions being made. This can cause stakeholders (amongst other people) to lose trust in analytics.
The key themes of good data quality are:
- Completeness: there are no gaps in the data. The issues which can cause this include:
- Pipeline doesn’t load properly cause the data to not be up to date.
- Incorrect data types of fields e.g. a date field being set as a string.
- Schema issues.
- Incorrect joins can cause some data to not come through especially when completing inner joins.
- Timeliness: the data is up to date. Common issues that cause this include:
- Job scheduling issues e.g. when the source table changes which causes issues with the stored procedure.
- The issues which cause data not being up to date are similar to those of completeness. Those issues can cause data the most recent data to not come through.
- Uniqueness: there is no duplicate data in other words there is one row per record. Common issues include:
- Duplicate data, which results in more than one row for a unit record.
- Errors caused by manual entry of data.
- When joins are completed on incorrect fields or non unique fields, it can cause the data to duplicate which causes data which is not unique.
- Issues in stored procedures where only a specific part of a table is being updated and conditions are not aligned can cause data to be non unique.
- Consistency: there should be no contradiction in the data between different systems or datasets. Common issues which cause this include:
- Human errors when manually entering the data.
- Some data being out of date can cause inconsistent data.
- Organisational changes not being updated in all data sources.
- Validity: provides a true representation of business context.
- This can be caused by the absence of testing during the pipeline. Some common tests include non null and unique tests.
- Reference tables not being updated regularly can also cause data to not be valid.
- Integrity: structured to align with related data sources.
- Missing foreign keys or non unique primary keys can cause data to have issues with integrity.
- Disruptions song the pipeline.
- Poor data governance means there will be no accountability for data quality resulting in the integrity of data being compromised.
Some solutions to these issues include adding in check data constraints, hard coding data types further upstream. We can also use time stamp audits and comparing data with historical tables to check the data quality.
