Up until now, I’ve been looking at the softer side of building the platform: the people we need to build it, the processes they would use and how to select your technology. This article is part of a series of posts on Building an Analytics Platform; make sure to read the other four articles for a comprehensive guide to your data architecture.
I’m now going to start looking at the actual development work in building the platform. Below, I’ve laid out the primary components of the platform, followed by a quick run-through of each area, explaining how it fits into the bigger picture.
This is as much about how to manage ingestion as it is the type of sources you are using. We need to have consistent, well-managed approaches to ingestion that can be rapidly deployed, simply customised and easily supported.
Raw, base and analytics are terms you will hear again and again in the following posts. Raw is the layer where we store our untransformed data. Base is where we begin to pull together a sensible model and apply some of the cleanup necessary to make the data usable. Finally, analytics is where we provide the most user-friendly views of the data. All of these layers live in the factory, the centralised, supported data store for the platform.
Labs are areas that provide users with the ability to upload their own files to combine with data already in the factory and create their own tables as stepping stones in larger processes. Use cases can be trialled here before being moved over to the factory.
One consistent feature of a successful analytics platform is the demand to get data back out of it. There are two primary methods, batch and API, that can enable this. Much like ingestion, these need to be managed consistently for all implementations.
Storing data is great, but unless people can get at it there isn’t much value. Enabling a range of tools all the way from standard reporting tools to raw JDBC access for custom projects provides an array of mechanisms for people of all skill levels to get value out of the platform.
Surrounding all of this, the platform needs to be governed and administered, which is where the management tools come in. An essential part of any platform, these can’t be ignored despite the fact they often aren’t the most interesting part of the toolset.
Getting your data ingestion routines right is crucial, as they form the building blocks for everything else. As new projects and teams line up to get their data onto your platform, the faster you can churn through the requests, the happier your customers are going to be.
We’re going to look at how you build robust, repeatable ingestion routines that can handle a range of sources quickly and efficiently. They need to support data auditing and error handling, be simple to configure, and remain extensible. In essence, we want to build ourselves a library of modules that can be combined into unique patterns, but with common components.
Whether you choose a DIY route or pick an off-the-shelf ETL tool, many of the same considerations still apply, and we’ll take a look at them all further on.
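To make the "library of modules" idea concrete, here is a minimal sketch in Python of what a DIY version might look like. All the names here (`IngestionPipeline`, `parse_csv` and so on) are illustrative assumptions, not a prescribed design; the point is that each step is a small reusable component, and the pipeline adds the common wrap of auditing and error handling.

```python
import csv
import io
from datetime import datetime, timezone

def parse_csv(raw_text):
    """Reusable component: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def drop_empty_rows(rows):
    """Reusable component: discard rows where every field is blank."""
    return [r for r in rows if any(v.strip() for v in r.values())]

class IngestionPipeline:
    """Chains reusable steps for one source, recording an audit trail."""

    def __init__(self, source_name, steps):
        self.source_name = source_name
        self.steps = steps   # drawn from the shared library of components
        self.audit = []      # (step name, rows out, timestamp) per step

    def run(self, payload):
        data = payload
        for step in self.steps:
            try:
                data = step(data)
            except Exception as exc:
                # Error-handling hook: record the failure, then stop the load.
                self.audit.append((step.__name__, "FAILED", str(exc)))
                raise
            self.audit.append((step.__name__, len(data),
                               datetime.now(timezone.utc).isoformat()))
        return data

# Usage: combine common components into a source-specific pattern.
pipeline = IngestionPipeline("crm_extract", [parse_csv, drop_empty_rows])
rows = pipeline.run("id,name\n1,Alice\n,\n2,Bob\n")
print(len(rows))  # 2 rows survive the cleanup
```

An off-the-shelf ETL tool gives you the same shape (reusable transforms, per-step logging), just configured through its own interface rather than code.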
The “factory” is a banner term for a number of components and is the area that will be supported by our data team. It’s where use cases go to live once they are ready to become production functions.
From a data storage perspective, I split the factory into three logical layers, each with its own specific purpose.
The raw layer is all about storing all the data that’s been loaded in a near-unadulterated form (where possible), tagged with the appropriate metadata. This is essentially a data lake. However, where some people see a data lake as an end state, here it’s a stepping stone to building something far more valuable.
End users will be more than welcome in the raw layer, whether to access data that didn’t get pulled through to a later layer, or to validate that a transformation into base or analytics didn’t error (and generate the wrong result).
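As a sketch of what landing data in the raw layer could look like, the snippet below stores a payload byte-for-byte alongside a metadata sidecar. The directory layout (`raw/<source>/<load date>/`) and the fields in the sidecar are assumptions for illustration; the checksum is what lets users validate later layers against the original load.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def land_raw(base_dir, source, payload: bytes):
    """Store a payload untouched in the raw layer, plus a metadata sidecar."""
    load_ts = datetime.now(timezone.utc)
    target = Path(base_dir) / "raw" / source / load_ts.strftime("%Y-%m-%d")
    target.mkdir(parents=True, exist_ok=True)

    data_file = target / "data.bin"
    data_file.write_bytes(payload)  # stored byte-for-byte, no transforms

    metadata = {
        "source": source,
        "loaded_at": load_ts.isoformat(),
        "size_bytes": len(payload),
        # Checksum lets anyone verify that downstream layers match the load.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return data_file, metadata

path, meta = land_raw(tempfile.mkdtemp(), "crm_extract", b"id,name\n1,Alice\n")
print(meta["size_bytes"])  # 16
```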
The base layer is where we begin to clean up the data and make it presentable to users. In my next posts, I’ll talk through the modelling approach I’ve taken to base. There isn’t really a consensus yet on how to approach modelling in a Hadoop-based system or what good looks like, but a few competing methodologies are starting to emerge.
For me, the most important thing here is not to stray too far from the source data. Base is the opportunity to organise the data into sensible business models, apply certain business rules, exclude certain records, clean up data types and column names, and derive a few useful fields.
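A tiny sketch of those base-layer responsibilities, with column names and rules invented for the example: standardise names, fix types, exclude bad records, and derive a field users would otherwise compute themselves.

```python
from datetime import date

# Hypothetical source-to-base column renames.
RENAMES = {"CUST_ID": "customer_id", "DOB": "date_of_birth", "ACTV": "is_active"}

def to_base(raw_rows):
    """Apply base-layer cleanup: rename, type, filter, derive."""
    out = []
    for row in raw_rows:
        rec = {RENAMES.get(k, k.lower()): v for k, v in row.items()}
        # Example business rule: exclude records with no customer identifier.
        if not rec.get("customer_id"):
            continue
        # Clean up data types.
        rec["customer_id"] = int(rec["customer_id"])
        rec["is_active"] = rec["is_active"] == "Y"
        y, m, d = map(int, rec["date_of_birth"].split("-"))
        rec["date_of_birth"] = date(y, m, d)
        # Derive a useful field close to the source data.
        rec["birth_year"] = rec["date_of_birth"].year
        out.append(rec)
    return out

base = to_base([
    {"CUST_ID": "42", "DOB": "1990-05-01", "ACTV": "Y"},
    {"CUST_ID": "",   "DOB": "1985-01-01", "ACTV": "N"},  # excluded by rule
])
print(base[0]["birth_year"])  # 1990
```

Notice that every field in the output is traceable back to a source field, which is what "not straying too far" buys you.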
The analytics layer is all about provisioning data to users in the most usable, best-performing way possible. This takes one of two forms. The first is the big, wide, flattened tables that pre-join all the useful information about a topic. The second is the very specific tables structured to answer a more complicated question.
There are plenty of things to consider when building your analytics layer, but I’ll show how working closely with your key data customers to define something that everyone gets value from is the way forward.
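To illustrate the first form, here is a sketch of a "big, wide" table that pre-joins customer attributes onto order rows so analysts never have to repeat the join. The table contents and key names are made up for the example.

```python
# Hypothetical base-layer tables.
customers = {1: {"name": "Alice", "segment": "retail"},
             2: {"name": "Bob", "segment": "wholesale"}}
orders = [{"order_id": 10, "customer_id": 1, "amount": 25.0},
          {"order_id": 11, "customer_id": 1, "amount": 40.0},
          {"order_id": 12, "customer_id": 2, "amount": 99.0}]

def build_wide_orders(customers, orders):
    """Flatten customer attributes onto each order row (pre-joined table)."""
    wide = []
    for o in orders:
        cust = customers[o["customer_id"]]
        wide.append({**o,
                     "customer_name": cust["name"],
                     "customer_segment": cust["segment"]})
    return wide

wide = build_wide_orders(customers, orders)
print(wide[0]["customer_segment"])  # retail
```

In practice this would be a scheduled job writing a table, but the trade-off is the same: you pay the join once at build time so every downstream query is simpler and faster.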
Lab databases are simple. You provide an area for your customers to create their own tables or upload their own files. See, told you it was simple.
The reason this is so vital to the success of the platform is that it will give your users something that they often don’t get on a data platform – a place to call their own. I always advocate that these are largely self-managed areas and tend to designate one per team rather than one giant one or, at the other extreme, one per person.
This is crucial for killing off the need for lots of odd siloed databases across the business and the proliferation of shadow IT systems. By bringing everyone into one place but giving them some freedom, you can get a consistent set of standards and practices in place without stifling innovation.
The lab also serves a vital function as the proving ground for use cases. If an analyst has built something useful there that’s getting a lot of use, you know it’s a good candidate to move into the factory so that it runs regularly with a proper support model. Likewise, before you invest in building something as a project, your customers can test out its value first.
While one of the goals of our analytics platform is to bring all data together into one place, once we’ve done something awesome and valuable with it there’s a good chance we’re going to need to share it with some other systems.
This could range from sending batch CSV files to your marketing system so it can target customers based on the results of your segmentation model to near real-time interaction with a website personalisation engine.
I’ll look at how we establish data provisioning routines with the same kind of wrap as I’ve outlined for the ingestion part of the business. This is all about system-to-system provisioning, rather than system-to-user.
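As a hedged sketch of the batch side of provisioning, the snippet below produces a CSV extract together with a simple manifest, mirroring the kind of wrap used on ingestion. The function name and manifest fields are assumptions; a real routine would also handle delivery (SFTP, object storage, etc.).

```python
import csv
import io

def export_batch(rows, fieldnames):
    """Render rows as a CSV extract plus a manifest for the receiving system."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    # The manifest lets the downstream system validate the transfer
    # (row counts and expected columns) before loading it.
    manifest = {"row_count": len(rows), "columns": fieldnames}
    return buf.getvalue(), manifest

# Usage: a segmentation result headed for a marketing system.
payload, manifest = export_batch(
    [{"customer_id": 1, "segment": "high_value"},
     {"customer_id": 2, "segment": "lapsed"}],
    ["customer_id", "segment"],
)
print(manifest["row_count"])  # 2
```

The API route swaps the file for an endpoint, but the same consistency rules apply: stable schemas, counts you can reconcile, and one managed pattern per delivery type.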
There is a huge proliferation of different tools to work with data now. From standard BI reporting to modern visualisation tools, and SQL IDEs to data science workbenches, there’s a lot to choose from.
I’ll look at what kind of functionality you need available to you (rather than at the specific vendors and products out there) in order to cover the most pressing needs of your data customers.
I’ll also highlight some of the main watch-outs. Depending on the technology you’re using you might need to be very careful about how well integration works. Not everything does what it says on the tin!
All platforms, regardless of the technology choice you make, need to be administered and secured. Managing data quality, governance and lineage is just as important.
If you found this article valuable, you might be interested in our next Data Platform masterclass. This London-based session is led by James Lupton, coaching leaders in business, data and tech on how to build a data platform.