Infinity Works proposed first carrying out an initial analysis of the pipelines, finding and removing redundant or obsolete pipelines to reduce the size of the problem space. This reduced the effort required to introduce a better scheduling system providing retries, dependency management, monitoring and alerting. With monitoring in place, it was then possible to examine the performance of each pipeline and target optimisation work at the most valuable pipelines.
To improve the scheduling and management of pipelines, Infinity Works proposed the use of Airflow, an open-source data orchestration tool for pipeline scheduling, dependency management, and execution of tasks. Airflow was selected because it was already in use at Photobox and required only a small amount of customisation to meet the requirements.
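The core value Airflow adds here is dependency ordering plus automatic retries. As a minimal sketch of those two concepts (pure Python, not Airflow's actual API; the task names are illustrative), tasks can be run in topological order, with each task retried a bounded number of times before the run is failed:

```python
from collections import deque

def run_pipeline(tasks, deps, max_retries=2):
    """Run tasks in dependency order, retrying failures.

    tasks: dict of task name -> callable
    deps:  dict of task name -> list of upstream task names
    Returns the order in which tasks completed successfully.
    """
    # Kahn's algorithm: track the number of unmet upstream dependencies.
    pending = {name: len(deps.get(name, [])) for name in tasks}
    downstream = {name: [] for name in tasks}
    for name, ups in deps.items():
        for up in ups:
            downstream[up].append(name)

    ready = deque(n for n, c in pending.items() if c == 0)
    completed = []
    while ready:
        name = ready.popleft()
        # Retry transient failures before giving up on the run.
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise
        completed.append(name)
        # A finished task may unblock its downstream tasks.
        for d in downstream[name]:
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)
    return completed
```

In Airflow proper, the same ideas are expressed declaratively: each task carries a `retries` setting, and the DAG definition encodes the upstream/downstream edges that drive scheduling.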
Airflow has built-in support for using a Postgres relational database to store its internal state. Infinity Works felt that this was a good fit for this project, using AWS RDS (Relational Database Service) to host the database engine in order to reduce management overhead (e.g. configuring encryption, backups, clustering etc.).
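Airflow reads the metadata database location as a single SQLAlchemy connection URI (its `sql_alchemy_conn` setting). A minimal sketch of assembling that URI for an RDS-hosted Postgres instance follows; the environment variable names and the endpoint placeholder are illustrative, not Photobox's actual configuration:

```python
import os

def metadata_db_uri():
    """Build the SQLAlchemy URI for Airflow's Postgres metadata database.

    Keeping credentials and the RDS endpoint in environment variables keeps
    them out of the container image. Variable names here are hypothetical.
    """
    user = os.environ.get("AIRFLOW_DB_USER", "airflow")
    password = os.environ.get("AIRFLOW_DB_PASSWORD", "change-me")
    # Placeholder RDS endpoint, not a real host.
    host = os.environ.get("AIRFLOW_DB_HOST", "example.eu-west-1.rds.amazonaws.com")
    port = os.environ.get("AIRFLOW_DB_PORT", "5432")
    name = os.environ.get("AIRFLOW_DB_NAME", "airflow")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"
```

RDS then takes care of the operational concerns mentioned above (encryption at rest, automated backups, clustering) independently of this connection string.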
Figure 1: Airflow UI showing a selection of scheduled data pipeline tasks (pipeline names hidden).
Docker containers are used to build and deploy the Airflow setup. These containers are built with the AWS build toolchain (CodePipeline, CodeBuild and CodeDeploy), which is well integrated with other AWS services and uses Identity and Access Management (IAM) roles to ensure that pipelines meet the principle of least privilege – i.e. have the minimum permissions required to do their job.
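To make the least-privilege point concrete, a build pipeline's role can be restricted to exactly the resources it touches. The sketch below constructs a hypothetical IAM policy document as a Python dict; the bucket and repository names are illustrative, not Photobox's real resources:

```python
def build_pipeline_policy(artifact_bucket, ecr_repo_arn):
    """Return a hypothetical least-privilege IAM policy for a build pipeline:
    read/write its own artifact bucket, pull from one ECR repository,
    and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Access limited to this pipeline's artifact bucket only.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{artifact_bucket}/*",
            },
            {
                # Pull permissions scoped to a single ECR repository.
                "Effect": "Allow",
                "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"],
                "Resource": ecr_repo_arn,
            },
        ],
    }
```

The key property is that every statement names a specific resource ARN rather than `*`, so a compromised build job cannot reach beyond its own artifacts and images.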
To run the Airflow cluster, AWS Elastic Container Service (ECS), using both Elastic Compute Cloud (EC2) and Fargate, was selected as the compute platform.
ECS was selected because it is a flexible platform that scales dynamic workloads up and down, allowing Photobox to pay only for what they use.
Fargate is a serverless container platform that runs Docker containers on demand. It reduces the time spent managing compute infrastructure, because the underlying operating system is managed by AWS.
For Photobox and Infinity Works, the benefits of Infrastructure as Code – audit, reuse, change control, and the ability to create test environments and recover from disasters – make it the default approach. In this case, Photobox were already using CloudFormation for their projects, and Infinity Works have extensive experience with the tool.
To monitor the platform and pipelines, Infinity Works proposed the use of CloudWatch. CloudWatch support is built into ECS, making the integration simple and cost-effective. Airflow publishes StatsD metrics, so Infinity Works proposed a sidecar container – a container running alongside the Airflow container specifically to monitor it – hosting the CloudWatch Agent to gather metrics from the running Airflow cluster and feed them back to CloudWatch.
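The StatsD wire format the sidecar consumes is a simple line protocol: `<metric name>:<value>|<type>`, where the type is for example `c` for counters, `g` for gauges, or `ms` for timers. As a minimal sketch (the metric names below are illustrative, not an exhaustive list of what Airflow emits), parsing one datagram looks like:

```python
def parse_statsd(line):
    """Parse one StatsD datagram line into (metric, value, type).

    Example input: 'airflow.ti_failures:1|c'
    An optional trailing sample rate ('|@0.5') is ignored here.
    """
    name, rest = line.split(":", 1)
    value, mtype = rest.split("|")[:2]
    return name, float(value), mtype
```

A real sidecar (here, the CloudWatch Agent) listens for these datagrams on a UDP port and forwards the parsed metrics to CloudWatch, where they can drive the dashboards and alarms described below.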
Dashboards were created in CloudWatch to provide a visual representation of system health.
CloudWatch alarms notify operators about resource usage on Fargate containers. This allows operators to resize container instances appropriately to the CPU and memory requirements of each data pipeline.
Infinity Works helped Photobox further improve their DevOps processes within the Data Engineering space by applying Infrastructure as Code and CI/CD best practices to the management of the Airflow cluster.
Infinity Works produced a report containing recommendations to the wider Data Engineering team on where the principle of least privilege, Role Based Access Control (RBAC), and AWS account separation could be used to improve the security of the solution without impacting its functionality.
To improve visibility of the health of operational systems, Infinity Works implemented structured logging using CloudWatch Logs. This enabled the use of CloudWatch Insights to query and analyse log data in real time. Processing and pipeline alerts were connected to Slack channels to prompt engineering teams to take action in the case of delays or failures of data pipelines.
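Structured logging here means emitting one JSON object per log line, so that CloudWatch Logs Insights can filter and aggregate on individual fields rather than grepping free text. A minimal sketch using only the Python standard library follows; the `pipeline` field name is an illustrative example of structured context, not Photobox's actual schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object, so downstream
    tooling (e.g. CloudWatch Logs Insights) can query individual fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context attached via the `extra=` argument.
        if hasattr(record, "pipeline"):
            payload["pipeline"] = record.pipeline
        return json.dumps(payload)

def make_logger(name):
    """Return a logger whose output is JSON lines on stdout/stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With this in place, an Insights query can do things like filter on `level = "ERROR"` and group by `pipeline` to see which data pipelines are failing most often.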
With legacy infrastructure, it is unfortunately not always possible to fully utilise Infrastructure as Code and CI/CD to tear down and spin up infrastructure on demand. Some older instances running in EC2 were highly customised and could not easily be rebuilt from a template. For these situations, AWS EC2 Lifecycle Manager was used to protect the instances, providing backups, rollback capabilities and the ability to easily recover from disasters, whilst causing minimal change to the instances themselves, thus reducing the risk of introducing unforeseen issues in unknown code.
Some data is loaded in bulk (e.g. initial ingests or backfills) using AWS Batch to perform transformations before writing output to S3 for later ingestion. In order to make this modular, extensible, and easy to maintain, Infinity Works implemented a CI/CD pipeline to produce new Docker images for use with AWS Batch.
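The shape of such a batch transform job is straightforward: read raw records, normalise them, and emit JSON lines for the downstream ingest to pick up from S3. The sketch below shows that shape with a hypothetical schema (`order_id`, `total_pence` are illustrative field names, not Photobox's real data model):

```python
import json

def transform_records(rows):
    """Hypothetical bulk transform for a batch job: normalise raw rows
    (e.g. parsed from CSV) into the JSON-lines format the downstream
    ingest expects, one JSON object per line."""
    out = []
    for row in rows:
        out.append({
            "order_id": row["id"],
            # Store money as integer pence to avoid float rounding issues.
            "total_pence": int(round(float(row["total"]) * 100)),
        })
    return "\n".join(json.dumps(r) for r in out)
```

Packaging this logic in a Docker image built by the CI/CD pipeline is what keeps the approach modular: each new transform is a new image version, and AWS Batch simply runs whichever image a job definition references.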
Figure 2: Extract of architecture for bulk loading data