One of the most critical components of the platform that you are going to have to build is the set of data ingestion routines. Get these wrong and you’ll find delivery speed hit hard and data quality called into question. This article is the first in a series of posts on Building an Analytics Platform; make sure to read up on the other four articles for a comprehensive guide to your data architecture. So what are the do’s and don’ts of data ingestion?
Before you get building, there’s one important question you’re going to have to answer. Should you build your own code from scratch, or buy an ETL tool?
There’s no right answer to this one, but there are a few things to consider when making this decision. Personally, I’m a big advocate of building your own thing (or turning to the open source community for the building blocks of it).
I’d break down the main things to consider into five categories:
There really is no right answer on this one. Spend some time thinking about it and if you have the time and money, it’s worth considering doing a POC in order to judge exactly how well the tool you’re considering is suited to you. Regardless of what you choose, the do’s and don’ts below still apply.
Frameworks provide a consistent set of methodologies for dealing with different development tasks. Their aim is to make delivery fast and standardised and to make all code easy to support.
While the work you have to do to ingest data from a database is very different from ingesting data from a stream, there are always consistent functions both require. Error handling, logging and auditing are all examples of functions you want to perform consistently across all ingestion routines (more on these later).
A framework shouldn’t constrict, though; it should form the backbone of the tools you build. For example, every new file you have to ingest is likely to have something different about it that means you have to code something new. Being able to go to your source repository and pull down the code that handles logging and alerting will still save you a load of time and produce a consistent output for those who need to consume it.
Something that will save you a great deal of time is ensuring that standard use cases are easily configurable.
If you’ve built an ingestion routine that can ingest CSV files, then it should be reading parameters from a config file that determines how it interprets the data. For example: What character is being used as a delimiter? Is the data quoted and if so with what? What’s the end of line character? And so on.
This will mean that without writing any new code, you can deploy new data ingestion routines quickly and easily for common use cases.
This doesn’t just apply to files; it works particularly well with database extractions. Building a parent program that receives all the details it needs to extract a new table or database from a set of configs makes turning around new data ingestion routines straightforward.
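As a rough illustration of the idea, here is a minimal Python sketch of a config-driven CSV reader. The config keys (`delimiter`, `quotechar`, `has_header`), the function name `read_delimited` and the sample data are all assumptions made up for this example; your own framework will have its own conventions.

```python
import csv
import io
import json

# Hypothetical config for one feed. In practice this would live in a file
# alongside the routine rather than in a string.
CONFIG_JSON = """
{
  "delimiter": "|",
  "quotechar": "'",
  "has_header": true
}
"""


def read_delimited(stream, config):
    """Parse a delimited text stream according to the supplied config."""
    reader = csv.reader(
        stream,
        delimiter=config.get("delimiter", ","),
        quotechar=config.get("quotechar", '"'),
    )
    rows = list(reader)
    if config.get("has_header", True) and rows:
        header, data = rows[0], rows[1:]
        # Turn each data row into a dict keyed by the header names.
        return [dict(zip(header, row)) for row in data]
    return rows


config = json.loads(CONFIG_JSON)
# Sample pipe-delimited data with a quoted value that contains the delimiter.
sample = io.StringIO("id|name\n1|'Smith|Jones'\n2|Patel\n")
records = read_delimited(sample, config)
print(records)
```

Onboarding a new feed with a different delimiter or quote character then only means writing a new config, not new code.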
Something that is often overlooked is error handling and logging. Many people will put the bare minimum possible in for these features and while the code will run fine most of the time, it becomes a nightmare to unpick when there is an issue.
If the framework methodology isn’t being followed, this gets compounded by the fact that different developers do it in different ways, which makes the code very hard to support.
The best way I’ve seen this done is by having a parent executable consume the stdout and stderr feeds. This then scans the results, formats any errors or warnings in there and emails the relevant people with the details. It also stores the output into log files associated with each run so they can be reviewed at any time.
This way, all a developer has to do is pipe the relevant messages to the stdout and stderr feeds. When this is done really well, they will have accounted for the majority of the most common failures and warnings, assigned them error codes and documented the fix in a wiki for the support teams.
This makes a world of difference when it comes to the ongoing support of ingestion routines and minimises the amount of handover required each time. However you choose to achieve it, don’t underestimate the value of good error handling and logging.
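To make the parent executable idea concrete, here is a minimal Python sketch. It assumes a convention, invented for this example, that child routines prefix problem lines with `ERROR` or `WARNING`. The function name `run_and_log` and the error code `E042` are hypothetical. Emailing is left out; the function simply returns the flagged lines so the caller can notify whoever needs to know.

```python
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def run_and_log(command, log_dir):
    """Run a child ingestion routine, store its output, return flagged lines."""
    result = subprocess.run(command, capture_output=True, text=True)

    # Store stdout and stderr in log files tied to this run.
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    (log_dir / f"{run_id}.out.log").write_text(result.stdout)
    (log_dir / f"{run_id}.err.log").write_text(result.stderr)

    # Assumed convention: children prefix messages with ERROR or WARNING.
    lines = result.stdout.splitlines() + result.stderr.splitlines()
    issues = [line for line in lines if line.startswith(("ERROR", "WARNING"))]
    if result.returncode != 0:
        issues.append(f"ERROR exit code {result.returncode}")
    return issues


# A stand-in child process that writes one normal line and one warning.
child = [
    sys.executable, "-c",
    "import sys; print('loaded 10 rows'); "
    "print('WARNING E042 empty column', file=sys.stderr)",
]
with tempfile.TemporaryDirectory() as tmp:
    issues = run_and_log(child, tmp)
    log_files = sorted(p.name for p in Path(tmp).iterdir())
print(issues)
```

The developer of the child routine only has to write to stdout and stderr; storage, scanning and alerting live in one place.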
Similar to error handling and logging is the auditing of data going into the system. All kinds of issues can occur, whether as the data is being produced in the source system, in transit, or while you load it, and that makes validating the data a sensible idea.
In its simplest form, you should ask any source system supplying you with data (where applicable) to provide a set of control records you can validate the data against. This could be an MD5 checksum for a CSV file that ensures there was no truncation in transit. Or it could be a record count, the number of distinct values in a certain column or the min and max values in a number field.
Once you have this data, you can get a lot more confident that you have received and then loaded the data accurately. When the business inevitably raises a question about data quality, this is a handy piece of evidence to have at your disposal!
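A minimal Python sketch of this kind of check might look like the following. The manifest layout (`md5`, `record_count`, `amount_min`, `amount_max`) and the `amount` column are assumptions for illustration; the real control totals depend on what your source system can supply.

```python
import csv
import hashlib
import io


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of the received bytes."""
    return hashlib.md5(data).hexdigest()


def audit_csv(data: bytes, manifest: dict) -> list:
    """Compare a received CSV against the supplier's control totals."""
    failures = []
    if md5_of(data) != manifest["md5"]:
        failures.append("md5 mismatch")

    rows = list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
    if len(rows) != manifest["record_count"]:
        failures.append("record count mismatch")

    # Hypothetical numeric column used for the min/max check.
    amounts = [float(row["amount"]) for row in rows]
    expected_range = (manifest["amount_min"], manifest["amount_max"])
    if amounts and (min(amounts), max(amounts)) != expected_range:
        failures.append("amount range mismatch")
    return failures


payload = b"id,amount\n1,10.5\n2,3.0\n"
manifest = {
    "md5": md5_of(payload),  # in practice supplied by the source system
    "record_count": 2,
    "amount_min": 3.0,
    "amount_max": 10.5,
}
clean_result = audit_csv(payload, manifest)
# Simulate truncation in transit by dropping the last record.
truncated_result = audit_csv(payload[:-6], manifest)
print(clean_result, truncated_result)
```

For large files you would hash in chunks rather than holding the whole file in memory, but the comparison logic stays the same.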
When ingesting data, you should keep it like for like with the source system unless there’s a good reason to change it. I’ll be looking at this more in my next post.
The one thing your ingestion routine should definitely be doing though is enriching that raw data with a standard set of metadata. This includes things like the source of the data, the date it came in, the time it was written, the process id that wrote it and so on.
This is hugely valuable as your data grows and makes life a lot easier when it comes to troubleshooting issues and managing data lineage.
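As a small Python sketch of what that enrichment could look like: the underscore-prefixed field names, the `enrich` function and the `crm.customers` source label are all hypothetical choices for this example.

```python
import os
from datetime import datetime, timezone


def enrich(records, source):
    """Return copies of the raw records with standard ingestion metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    pid = os.getpid()
    return [
        # The raw fields are left untouched; metadata is added alongside them.
        {**record, "_source": source, "_loaded_at": loaded_at, "_process_id": pid}
        for record in records
    ]


raw = [{"id": "1", "name": "Patel"}]
enriched = enrich(raw, source="crm.customers")
print(enriched[0])
```

Prefixing the metadata fields keeps them clearly separate from the source columns, so the raw data still matches the source system like for like.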
It’s natural to want to try and get the frameworks and ingestion routines right the first time. But in reality, you will go through a number of iterations before you have something that works for you. And that’s fine.
No matter how much you read in advance, there will be things no one can tell you about your systems, your data, your processes and your edge cases, and they will mean something in what you’ve built is wrong.
This is where the frameworks are so effective. By breaking the tasks up into different reusable modules, it’s easy to keep iterating and improving because you don’t have to replace everything at once.
If you tried to build it all perfectly first time, you’d take so long that your users would move on to something that actually delivers for them.
Now this one is perhaps slightly contentious. I fully advocate the continual ingestion of data. By that, I mean going after new sources of data, not the daily ingestion of the same datasets you already have.
If you’ve got good leadership in place, you should be able to pre-empt the business needs and get that data in ahead of time.
The big ‘but’ here is that there needs to be a purpose, or at least an expected purpose, for the data that you’re ingesting. Data for data’s sake alone is worthless, and you’re going to end up with what commonly gets referred to as a data swamp.
As much as anything, I can guarantee you won’t have enough time to do everything you want, so focusing on and prioritising value-driving data is key and will help you avoid wasting time on ingestion that isn’t worth it.
People often get confused by the varying data sources out there. Is it structured or unstructured data? Is it coming in real-time as a stream or a nightly batch?
I would wager that, for the vast majority of people reading this, the vast majority of their data is well structured and stored in a database. More than that, you can be fairly sure that this majority isn’t getting the most value (or indeed any) out of it that they could.
This isn’t to say the best use case never involves a load of unstructured free text. If it does, that’s where you need to start. But don’t get misled by complicated terms implying increased value.
Assess what you want to do, what the use cases are and where you can get the data to answer those questions in the most effective way possible.
Finally, you need to make some decisions about the security and sensitivity of your data. Part of this is whether your data will be encrypted at rest and properly permissioned. That’s something that should have been considered as part of the setup of the platform.
A bigger issue when it comes to data ingestion is what you are allowed to load into the platform. Many analytic platforms will choose to not take on personally identifiable information (PII). More often than not, a lot of PII data has little bearing on any analytics you are doing. What does someone’s name have to do with their shopping behaviour for example?
Of course, things like age, gender, and location are valuable. Whatever you decide is the standard for your platform, you need to make sure that the data ingestion routines are complying with it. Whether that involves entirely removing or at least obfuscating sensitive data, this is a decision you should get in place early on, and continue to monitor for the lifetime of the platform. With GDPR now at the forefront of the industry, this is more relevant than ever.
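As one possible way to enforce such a standard inside an ingestion routine, here is a minimal Python sketch that drops some fields entirely and replaces others with a keyed hash. The field lists, the `apply_pii_policy` name and the sample record are assumptions for illustration. Note that keyed hashing is pseudonymisation rather than anonymisation, so whether it satisfies your policy is something to confirm with whoever owns data protection.

```python
import hashlib
import hmac

# Hypothetical policy: fields to remove outright, and fields to obfuscate.
DROP_FIELDS = {"name", "email"}
HASH_FIELDS = {"customer_id"}


def apply_pii_policy(record, secret_key: bytes):
    """Return a copy of the record with sensitive fields removed or hashed."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # never load this field into the platform
        if field in HASH_FIELDS:
            # A keyed hash keeps the value joinable across datasets
            # without exposing the original identifier.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()
        else:
            cleaned[field] = value
    return cleaned


record = {"customer_id": "C123", "name": "A. Patel", "email": "a@example.com", "age": 34}
# The key is hard-coded only for the example; keep a real one in a secret store.
safe = apply_pii_policy(record, secret_key=b"example-key")
print(safe)
```

Because the same input and key always give the same hash, you can still count distinct customers or join datasets on the obfuscated value.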
This article is the first in a series of posts on Building an Analytics Platform by Cynozure CTO James Lupton; make sure to read up on the other four articles for more insight.