Cloud-based database services such as Amazon Redshift, Microsoft SQL Azure and Google BigQuery are now commonplace. But handling the ETL (Extract, Transform, Load) processing required to prepare their data for analytics presents some challenges.
It is possible to run ETL routines in the Cloud and only incur cost for as long as you need them, but you will have to rethink your architecture and unlearn past architectural patterns.
Traditionally, companies invested in physical database servers. These were specified and sized to allow capacity for running ETL routines. Later, these physical servers became virtualised, and then they were outsourced to data centres which provided failover and disaster recovery. Now databases have become services and exist in the cloud, with no direct access to the hardware that provides the service.
All this poses a conundrum: where do you run your ETL routines now?
One solution would be to create a dedicated ETL virtual machine hosted in a data centre or by a cloud compute provider such as Amazon, Microsoft or Google. However, this suffers from the same problems and constraints as the legacy model of running these routines on a traditional database server. In fact, these issues are exacerbated, as you now incur the cost of a permanent ETL server as well as that of the database service.
Because you are falling back on old hardware patterns, you are now paying for and maintaining a permanent virtual machine that has been provisioned and sized to cope with the largest ETL load. Most of the time it sits idle, since ETL routines typically run on a regular cycle, so you are paying for a resource that you are underutilising.
A better approach is to rethink how you deploy, schedule and run ETL routines. These are often self-contained packages which perform a set task, such as moving and transforming data from one place to another. One job might be dependent upon another for its input data, but this is a scheduling and orchestration issue rather than a physical connection between the two jobs.
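Treating inter-job dependencies as pure orchestration can be sketched as a small dependency resolver. The job names here (`extract_sales`, `transform_sales` and so on) are illustrative only; in practice a scheduling service would own this logic.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative ETL jobs and their input dependencies (hypothetical names):
# transform_sales needs extract_sales to have produced its data first.
dependencies = {
    "extract_sales": set(),
    "extract_costs": set(),
    "transform_sales": {"extract_sales"},
    "load_warehouse": {"transform_sales", "extract_costs"},
}

# The scheduler only needs this dependency graph, not a physical link
# between jobs: each job can still run on its own disposable machine.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Any ordering that respects the graph is valid, which is exactly why the jobs themselves can stay independent and separately packaged.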
That means you can package up each ETL routine separately. You can then spin up a cloud compute instance to undertake this (and only this) task, and dispose of it cleanly once the task is complete.
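The spin-up, run and dispose lifecycle can be sketched as follows. The `provision`, `execute` and `dispose` callables are stand-ins for real cloud API calls (launching and terminating a VM); the stubs below only demonstrate the shape of the pattern.

```python
from typing import Callable

def run_ephemeral_job(provision: Callable[[], str],
                      execute: Callable[[str], None],
                      dispose: Callable[[str], None]) -> None:
    """Spin up an instance, run one packaged ETL task, then dispose of it.

    The instance is always disposed of, even if the task itself fails,
    so no permanent server is left running between cycles.
    """
    instance_id = provision()
    try:
        execute(instance_id)
    finally:
        dispose(instance_id)

# Stub implementations so the lifecycle can be demonstrated locally:
log = []
run_ephemeral_job(
    provision=lambda: (log.append("provisioned"), "i-123")[1],
    execute=lambda i: log.append(f"ran ETL task on {i}"),
    dispose=lambda i: log.append(f"terminated {i}"),
)
print(log)
```

The `try`/`finally` is the important part: disposal happens unconditionally, which is what keeps the environment clean and the bill bounded.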
This way, you only pay for the computing power that you need and for the time that you need it. Smaller, lighter-weight processes can be run on smaller, lower-powered virtual machines, minimising cost.
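A back-of-envelope comparison makes the saving concrete. The hourly rate and run time below are assumed figures for illustration, not real cloud prices:

```python
# Assumed figures: an always-on ETL server vs an on-demand instance
# that runs for one hour per daily ETL cycle.
hourly_rate = 0.10            # assumed cost per instance-hour
hours_in_month = 24 * 30

always_on_cost = hourly_rate * hours_in_month   # permanent VM, mostly idle
on_demand_cost = hourly_rate * 1 * 30           # 1 hour/day for 30 days

print(f"always-on: ${always_on_cost:.2f}, on-demand: ${on_demand_cost:.2f}")
```

Under these assumptions the permanent server costs twenty-four times as much as the ephemeral one for the same monthly workload.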
Another advantage of this approach is that every ETL routine runs in a new, clean environment and is isolated from any other tasks. That means you will have no garbage accumulation issues, no interference from rogue tasks and no competition for local resources. It is altogether a much smoother and more reliable deployment environment.
The best way to achieve this is to deploy each ETL process as its own virtual machine using a tool such as Packer. This enables you to provision and size each individual machine according to its actual hardware requirements, and to spin each one up individually as it is needed and only for as long as it takes to complete its work. The only overhead you require is a permanently available, but lightweight, scheduling and orchestration service or server.
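A per-job Packer template might look something like the minimal sketch below. The AMI ID, region, instance size and file names are all placeholders chosen for illustration, not values from any real environment:

```hcl
# Minimal sketch of a Packer template for one ETL job's machine image.
# All identifiers below are placeholders.
source "amazon-ebs" "etl_job" {
  ami_name      = "etl-sales-transform-{{timestamp}}"
  instance_type = "t3.small"                 # sized for this job only
  region        = "eu-west-1"
  source_ami    = "ami-0123456789abcdef0"    # placeholder base image
  ssh_username  = "ubuntu"
}

build {
  sources = ["source.amazon-ebs.etl_job"]

  # Bake the packaged ETL routine and its dependencies into the image,
  # so each instance boots ready to run its one task.
  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y python3",
    ]
  }
  provisioner "file" {
    source      = "etl_sales_transform.py"   # hypothetical job package
    destination = "/tmp/etl_job.py"
  }
}
```

Because each job gets its own image, the instance type can be chosen per job, which is what makes the right-sizing described above possible.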
Mark Fulgoni is a Principal Consultant in Red Olive’s Data Practice.