I’ve recently come off working with a client in the big data space, where we were looking to verify over 9 billion records of data. This was a big undertaking, so we had to narrow down the areas of the data we looked at for verification.
It’s worth saying that I wouldn’t usually recommend verifying all 9 billion records, due to the time and complexity involved, and I’ll go into techniques that let you avoid doing this in a future blog. However, this client was very concerned about their data, and every individual record meant something very specific to them, so they were insistent on this.
Structure
The first area we concentrated on was the structure of the data. Our data was in JSON Lines (JSONL) format, where each line in a file is a fully formed JSON record. However, whatever structure or format your data is in, this will always be a big area to look at.
The main way we did this was by validating the JSON records against schemas. With JSON this is relatively easy to do, so we validated each record on the way into the system and again on the way out. By doing this, we knew that every single record we ingested had the correct structure, and that we did not alter or break that structure before the record left the system.
For instance, we may have wanted to verify the following JSON:
{
  "record": {
    "people": [
      {
        "first_name": "jeff",
        "surname": "peters"
      },
      {
        "first_name": "laura",
        "middle_name": "jane",
        "surname": "smith"
      }
    ],
    "action": "CREATE"
  }
}
In order to do this, we would create a JSON schema similar to the below:
{
  "type": "object",
  "properties": {
    "record": {
      "type": "object",
      "properties": {
        "people": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "first_name": {
                "type": "string"
              },
              "middle_name": {
                "type": "string"
              },
              "surname": {
                "type": "string"
              }
            },
            "required": [
              "first_name",
              "surname"
            ]
          }
        },
        "action": {
          "type": "string"
        }
      },
      "required": [
        "people",
        "action"
      ]
    }
  },
  "required": [
    "record"
  ]
}
This is just an example schema, and your situation and data may differ quite a lot from this JSON example. However, structure is the most fundamental thing you can assure about your data. You can also assure data structure very early on in the process, which is great: the earlier you can assure it, the sooner you can start building up confidence. We had excellent automation at all levels of our components to assure they would cope with the data structures they would be presented with, but nothing can prepare a system for the unexpected nature of real, live data, which is where this verification was so useful!
Another thing to bear in mind with data structure is what to do when your data does not meet the required structure. In our case, we were streaming data in, so we could send any record that failed validation to a dead letter queue (DLQ) and examine the issue when appropriate.
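To make this concrete, below is a minimal sketch of that ingest-time check, assuming a Python service and the jsonschema library; the schema file name and the send_to_dlq function are illustrative stand-ins rather than anything from the real system.

import json

from jsonschema import Draft7Validator

# Load the schema shown above (the file name is illustrative).
with open("record_schema.json") as f:
    validator = Draft7Validator(json.load(f))


def send_to_dlq(raw_line, reason):
    # Hypothetical stand-in for however your pipeline routes failures to a dead letter queue.
    print(f"DLQ ({reason}): {raw_line[:80]}")


def ingest(raw_line):
    # Validate a single JSONL line on the way into the system.
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError as exc:
        send_to_dlq(raw_line, f"not valid JSON: {exc}")
        return None

    errors = list(validator.iter_errors(record))
    if errors:
        send_to_dlq(raw_line, f"schema violation: {errors[0].message}")
        return None

    return record  # structurally valid, safe to process further

The same validator can be run again on the way out of the system, which is what gives you the before-and-after guarantee described above.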
Content
This is the hardest area of the data to validate, because it requires context about the data itself. Here we are talking about the actual values in the fields. At this point we are confident in the structure of the data, but not necessarily in the values within that structure. There are two parts to validating the content.
1. Basic content validation
Firstly, you can validate the contents of specific fields. This can be very useful, and following on from our example above, we chose to put some of this within the schema itself. For instance, in the JSON above you can see the following field:
"action": "CREATE"
In our schema, we could do something similar to the below, which means that when the record is schema validated we know the content of this field must be one of the allowed values.
"action": {
  "type": "string",
  "enum": [
    "CREATE",
    "EDIT",
    "DELETE"
  ]
},
There are many different types of validation that can be performed in this way, including:
- Length
- Special characters
- Any form of regex check on the value
- Case checks
This is not an exhaustive list by any means! However, this kind of basic validation only covers individual fields, not the relationships between fields and records.
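Before moving on, here is how a couple of those field-level checks could look in the schema; the length limits and the pattern below are illustrative assumptions rather than constraints from the real system.

"first_name": {
  "type": "string",
  "minLength": 1,
  "maxLength": 50,
  "pattern": "^[a-z]+$"
}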
2. Data meaning validation
To really examine the context, you need to understand what your incoming data means and how the records all relate to each other. A lot of data systems are designed to be dumb and not really understand the data they ingest, meaning this kind of validation is not possible. However, we had a good line of communication with the team who owned the source data, so we had a good understanding of the data itself and how records related to each other.
To give an example of what you can do here, let’s take the example we used earlier and consider the action field again, where you can see there are only three valid options:
"enum": [
  "CREATE",
  "EDIT",
  "DELETE"
]
There is a logical order for these records to come into the system:
CREATE
EDIT (0 to many)
DELETE
Knowing how these records are created is important, because it might be that some records never have a CREATE (for instance, transient or incomplete records), but let’s assume that the above is the only order that makes sense in this case.
The way we resolved this was to perform incoming validation checks, whereby we would check the type of each record as it came in. If it was a CREATE, we would check that no record with that id already existed; if it was an EDIT, we would check that there was one record and that it was a CREATE; and if it was a DELETE, we would simply check that a record existed. This allowed us to validate all of these scenarios on every single record.
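As a rough sketch of the shape of those checks: the seen_actions index below is a hypothetical stand-in for however your system looks up previously ingested records by id, and the EDIT rule is simplified here to "a CREATE must already exist for this id".

# Hypothetical in-memory index of previously ingested actions, keyed by record id.
# In the real system this lookup would be backed by whatever store holds the records.
seen_actions = {}


def validate_ordering(record_id, action):
    # Check a record's action against the CREATE -> EDIT (0 to many) -> DELETE ordering.
    existing = seen_actions.get(record_id, [])

    if action == "CREATE":
        ok = not existing              # no record with this id should exist yet
    elif action == "EDIT":
        ok = "CREATE" in existing      # an EDIT must follow a CREATE
    elif action == "DELETE":
        ok = bool(existing)            # a DELETE just needs some record to exist
    else:
        ok = False                     # unknown action; the schema enum should catch this

    if ok:
        seen_actions.setdefault(record_id, []).append(action)
    return ok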
It’s worth noting that these checks were time consuming on ingestion, so they were actually turned off after a period of live running, once we had built up enough confidence and ingested enough records that the client was happy. It’s another area where I would not normally recommend performing these kinds of checks on every record, but in this case it was very explicitly what the client wanted.
Volume
For big data especially, volume testing should form a very important aspect of your data testing. My client wanted to ensure that every single record that was ingested was also exported from the system, which is a big ask when you are dealing with billions of records.
Due to time and cost, it’s simply not possible (or certainly not cost effective) to check the entire contents of every record. However, we didn’t need to. Because we were building on top of the structure and content tests we had already done, we were already confident in the data within the records themselves; we just needed to ensure that all the records had been ingested, processed and exported.
To do this, we used import and export manifests. This means that as we processed records both into and out of the system, we created lightweight CSV files as a log of every record that was successfully ingested or exported. Each line in these files was |-delimited (we used a | because some of the records had commas in them) and contained the following information about a record:
- ID -> the id field of the record JSON
- Timestamp -> the timestamp within the record (rather than processing time)
- Type -> Imported record or exported record
This meant the resulting files were very small and gave us a much more manageable amount of data to compare.
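For illustration, a pair of manifest lines for one record might look like the below; the id, timestamp and IMPORT/EXPORT labels are made up for the example, and note that the two rows carry the same record timestamp rather than processing times.

4f8c2d1e|2023-06-01T09:15:42Z|IMPORT
4f8c2d1e|2023-06-01T09:15:42Z|EXPORT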
When you have the two sets of manifest files, there are many ways to compare them, and the right one will depend on your context. We were working in an AWS system, so it made sense to use an AWS Glue job in the form of a PySpark script to load all the files and compare the data inside. We then produced an output answering the following questions:
- Are there any IDs that are in the export but not in the import?
- Are there any IDs in the export and the import with different timestamps?
We also used this to retrieve statistics about the number of ids we imported and a few other small things, but the two questions above were the ones that told us about the quality of the data volume.
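To give a flavour of what that comparison looked like, here is a cut-down PySpark sketch, assuming the pipe-delimited manifest layout above; the S3 paths and column names are illustrative rather than the real ones.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("manifest-comparison").getOrCreate()

# Illustrative manifest locations and column names.
columns = ["id", "timestamp", "type"]
imports = (spark.read.option("delimiter", "|").csv("s3://bucket/manifests/import/")
           .toDF(*columns))
exports = (spark.read.option("delimiter", "|").csv("s3://bucket/manifests/export/")
           .toDF(*columns))

# 1. IDs that appear in the export but not in the import.
missing_from_import = exports.join(imports, on="id", how="left_anti")

# 2. IDs present in both manifests but with different record timestamps.
mismatched_timestamps = (
    imports.alias("i")
    .join(exports.alias("e"), on="id")
    .where(F.col("i.timestamp") != F.col("e.timestamp"))
    .select("id",
            F.col("i.timestamp").alias("import_ts"),
            F.col("e.timestamp").alias("export_ts"))
)

print("exported but never imported:", missing_from_import.count())
print("timestamp mismatches:", mismatched_timestamps.count())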
If the answer to both of the above questions is no, then we are at a stage where we are very confident that every single record we have been asked to ingest has been ingested correctly and is available to export from the system too.
Conclusion
I’ve looked at structure, content and volume here as the main areas to focus on with data validation, and this approach has served me well in previous data projects. The key is that each area of validation builds on the ones before it, and they all work together to give a level of confidence in the data. This is similar to the way you might create unit tests for an application to give a certain level of confidence, and then build on these, rather than repeat them, with component tests and so on up the layers. None of the layers gives us the confidence we need on its own, but combined they give us the high level of confidence we are looking for.