Some batch jobs require substantial engineering and fine-tuning on serious hardware to become feasible. However, many batch jobs run on oversized infrastructure and incur far higher costs than necessary. Migrating these jobs to a serverless approach has two advantages: a simplified workflow and substantial savings in the long run. In this blog post, I will walk you through one example architecture on AWS. It will showcase how batch jobs can run both reliably and cost-effectively in a serverless environment.
Note: this post provides an overview of the architecture and the basic principles. If you want to reproduce the architecture, take a look at the repository here, which includes all the necessary code.
What are the requirements?
For the sake of our example, let’s consider three requirements:
- The interface for data in- and output of the service should be S3 buckets, i.e., general cloud storage.
- The service needs to run daily and should be as cheap as possible.
- The client’s developers want to focus on the processing logic, not the cloud infrastructure.
I will show you how these requirements translate to a serverless cloud architecture in a minute. But first, let’s talk about why serverless lends itself as a design approach.
Why choose a serverless approach?
There is a myriad of content about what serverless can and cannot do. I will, therefore, not spend time on generalized ruminations. Instead, let me justify serverless along the requirements defined above. One of the main reasons relates to the development workflow; the other is about costs.
From a workflow perspective, a serverless approach provides a level of abstraction that developers highly appreciate. It frees them from managing and maintaining infrastructure. Since the cloud providers shoulder most of the usual responsibilities, developers can focus on implementing new features and adding business value. Such focus accelerates deployment and builds strong ownership.
When it comes to costs, serverless is – as a rule of thumb – cheaper than self-managed alternatives. Even when self-managed infrastructure seems to be more affordable, the wages for specialized teams typically outweigh the savings. Given the current state of the industry, it is tough to be more effective than the workforce behind the cloud services of AWS, Azure, or Google.
How to build the architecture for a serverless batch job?
Let me walk you through the proposed architecture in three steps that cover the organizational context, the overall architecture, and the specific building blocks and workflows.
Who will work with the implemented architecture?
Frequently, people develop cloud architectures without thinking about their users. In our case, we have to account for two roles to make it work. First, developers implement the processing logic. They should not be bothered with infrastructure details. Second, cloud engineers keep the system running. They should not be bothered with the details of the implemented logic.
In other words, we want to build a system with distinct realms of responsibilities and unambiguous touchpoints between them. To be clear, I do not advocate an organizational or cultural border between two camps. Instead, such a design aims at streamlining communication to decrease frustration and delays.
How is the overall architecture structured?
The big picture of this architecture contains three parts: networking, IAM roles and policies, and cloud services. The networking layer enables the services to communicate and protects against unauthorized access. The IAM roles and policies configure which actions services are allowed to take. The cloud services provide the infrastructure to integrate the business logic and execute the batch job. Due to the complexity in each part, this blog post focuses on the general workflow to keep things straightforward. As mentioned above, details will be available in related blog posts. Here is a graphical overview of the proposed architecture:
Blue indicates parts of the architecture that belong to development. In contrast, the orange box shows which building blocks run the batch job during operations. As you can see, both workflows share only one building block.
By conceptualizing the service in this way, developers and cloud engineers are largely decoupled and can focus on their respective responsibilities. Whenever developers push code to the master branch, the system builds a new Docker image. The next time the batch job runs, it will use the updated container image, i.e., run the updated business logic.
Developers need cloud engineers’ direct support in just two scenarios:
- An update to the business logic requires access to additional cloud resources. In this case, a cloud engineer adjusts the IAM roles and policies.
- An update necessitates different computational resources to run than the previous version. In this case, a cloud engineer needs to change the configuration of the hardware assigned to the batch job.
Both of these scenarios are easy to communicate and to account for by developers and cloud engineers.
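To make the first scenario more tangible, here is a minimal sketch of what such a policy adjustment could look like, written as a Python dictionary in the IAM JSON policy format. The bucket names and helper functions are hypothetical and not taken from the repository; a real policy should be scoped to the resources the job actually needs.

```python
import json
from typing import Iterable


def s3_read_write_statement(bucket_name: str) -> dict:
    """Build one IAM statement that allows reading and writing objects.

    `bucket_name` is a hypothetical placeholder for a real bucket.
    """
    return {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        # Object-level actions target the objects inside the bucket.
        "Resource": f"arn:aws:s3:::{bucket_name}/*",
    }


def build_policy(bucket_names: Iterable[str]) -> dict:
    """Assemble a policy document with one statement per bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [s3_read_write_statement(name) for name in bucket_names],
    }


# Example bucket names are assumptions for illustration only.
policy = build_policy(["example-input-bucket", "example-output-bucket"])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

When the business logic needs an additional bucket, the change request from developers to cloud engineers boils down to one more entry in the list.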
Which role do the individual building blocks play?
The four development components
CodeCommit repositories are git repositories hosted on AWS with limited functionality. Yet, they are a simple solution for keeping all the code within the AWS ecosystem. It is also possible to connect other version control services, such as GitHub or Bitbucket, to the service via webhooks.
CodeBuild is the AWS flavor of a CI pipeline’s building part. It reads a specification file directly from the connected repository. CodeBuild then executes the instructions. Here, the CodeBuild project uses Docker to build a container image and pushes it to the registry for later use.
CodePipeline takes care of the orchestration and artifacts of CodeCommit and CodeBuild. In this case, it reacts to updates in the master branch of the code repository and triggers a new build.
The Elastic Container Registry (ECR) is the final building block of the developer workflow. The ECR is where CodeBuild pushes the new container images. The batch job later pulls the current image from here. The ECR is also where the realms of developers and engineers overlap.
The two batch job components
The computational heart of the service is the Elastic Container Service (ECS) in its Fargate flavor. Fargate is the serverless capacity provider of AWS. The user does not have to allocate VMs or other resources. Instead, a task definition specifies how much computational power and memory the task needs.
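As an illustration of such a task definition, the following sketch builds one as a plain Python dictionary whose keys follow the ECS task definition format. The family name, image URI, role ARN, and helper function are hypothetical placeholders, and the size table covers only a subset of the CPU and memory combinations that Fargate accepts.

```python
# CPU units -> allowed memory values in MiB. This is a subset of the
# combinations Fargate accepts; check the AWS documentation for the full table.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8193, 1024)),
    2048: list(range(4096, 16385, 1024)),
    4096: list(range(8192, 30721, 1024)),
}


def fargate_task_definition(family, image_uri, execution_role_arn,
                            cpu=256, memory=512):
    """Return a task definition dictionary for a single-container Fargate task."""
    if memory not in FARGATE_SIZES.get(cpu, []):
        raise ValueError(f"unsupported Fargate size: cpu={cpu}, memory={memory}")
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # Fargate tasks use the awsvpc network mode
        "cpu": str(cpu),          # task-level sizes are passed as strings
        "memory": str(memory),
        "executionRoleArn": execution_role_arn,
        "containerDefinitions": [
            {"name": family, "image": image_uri, "essential": True}
        ],
    }


# All identifiers below are hypothetical placeholders.
task_definition = fargate_task_definition(
    family="daily-batch-job",
    image_uri="123456789012.dkr.ecr.eu-central-1.amazonaws.com/batch-job:latest",
    execution_role_arn="arn:aws:iam::123456789012:role/example-task-execution-role",
    cpu=512,
    memory=1024,
)
```

A dictionary of this shape could, for example, be passed to boto3 via `register_task_definition(**task_definition)` on an ECS client. When an update needs more resources, only the `cpu` and `memory` values change.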
The second building block is a CloudWatch Trigger. Triggers can fire based on cron expressions or in intervals. Depending on whether the timing is essential, both can be viable options. When the trigger fires, a new instance of the task is sent to Fargate and executed.
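The two trigger styles can be sketched with a pair of small helpers that produce schedule expressions in the syntax used by CloudWatch Events (now Amazon EventBridge). The helper names are my own; only the resulting strings follow the AWS format, in which cron expressions have six fields and are evaluated in UTC.

```python
def daily_cron(hour_utc, minute=0):
    """Cron-style schedule: fire once per day at a fixed time (UTC).

    Field order: minutes, hours, day-of-month, month, day-of-week, year.
    Either day-of-month or day-of-week must be "?".
    """
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour_utc} * * ? *)"


def rate(value, unit):
    """Interval-style schedule: fire every `value` minutes, hours, or days."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


print(daily_cron(3))     # cron(0 3 * * ? *)
print(rate(1, "day"))    # rate(1 day)
print(rate(12, "hour"))  # rate(12 hours)
```

The cron form fits when the job has to start at a specific time, for instance after an upstream export has finished. The rate form is sufficient when only the interval matters.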
These are already the main building blocks of this architecture. In production, some more services are necessary. For instance, there are S3 buckets for data storage, connections to CloudWatch logs for monitoring and debugging, and network configurations, such as VPC, subnets, and routing tables.
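To show how the S3 interface and the processing logic can stay separated inside the container, here is a minimal sketch of an entry point. It is not the code from the repository: the `process` function, the bucket names, and the in-memory stand-in are assumptions for illustration. The client passed in is expected to behave like `boto3.client("s3")`, whose `get_object` and `put_object` calls are used here.

```python
import io


def process(raw: bytes) -> bytes:
    """Placeholder for the business logic; here it merely upper-cases text."""
    return raw.decode("utf-8").upper().encode("utf-8")


def run_batch_job(s3_client, input_bucket, output_bucket, key):
    """Read one object from the input bucket, process it, write the result."""
    response = s3_client.get_object(Bucket=input_bucket, Key=key)
    result = process(response["Body"].read())
    s3_client.put_object(Bucket=output_bucket, Key=key, Body=result)
    return len(result)


class InMemoryS3:
    """Tiny stand-in for an S3 client so the sketch runs without AWS access."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body
        return {}

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


# Local dry run with hypothetical bucket names.
s3 = InMemoryS3()
s3.put_object(Bucket="example-input-bucket", Key="data.txt", Body=b"hello")
written = run_batch_job(s3, "example-input-bucket", "example-output-bucket", "data.txt")
```

In the container, the stand-in would be replaced by a real client. Developers only touch `process`, while bucket names and permissions remain part of the infrastructure configuration.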
What are the alternatives?
Serverless is not a panacea; neither is this architecture. To leave you with an idea of when it is promising to follow this path, I want to sketch two alternative approaches.
The first alternative is to execute the workload on a dedicated virtual machine, i.e., an EC2 instance on AWS. There are two main ways to do this for the scenario described above.
First, one can keep an instance running and implement a cron job (or something comparable) on it. I saw this pattern in a previous project, but it is a terrible choice for (at least) two reasons. The costs are much higher than with the serverless approach because the VM costs money regardless of whether its load is high or low. In terms of flexibility, changing the logic only works by terminating the process, updating the code, and restarting it.
Second, one could boot an instance for the batch job and terminate it afterward. That is similar to the approach above, but with one significant caveat: there are way more moving parts that can break during the process.
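The cost argument against the always-on instance can be illustrated with a back-of-the-envelope calculation. The hourly rates and the job duration below are hypothetical numbers chosen for illustration, not actual AWS prices; the point is only that paying per run scales with the runtime, while an always-on VM is billed for every hour of the month.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_cost(hourly_rate):
    """Monthly cost of a VM that runs around the clock."""
    return hourly_rate * HOURS_PER_MONTH


def per_run_cost(hourly_rate, run_minutes, runs_per_month=30):
    """Monthly cost when compute is billed only while the job runs."""
    return hourly_rate * (run_minutes / 60) * runs_per_month


# Hypothetical rates: the serverless rate per hour is assumed to be higher
# than the VM rate, yet the monthly bill is far lower for a short daily job.
always_on = always_on_cost(hourly_rate=0.10)
per_run = per_run_cost(hourly_rate=0.15, run_minutes=20)

print(f"always-on VM: {always_on:.2f} per month")
print(f"pay per run:  {per_run:.2f} per month")
```

Under these assumptions, the always-on VM costs roughly 73 currency units per month, compared to about 1.5 for a 20-minute daily run. The gap narrows as the job's runtime approaches the full day.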
The second alternative sits on the other side of the spectrum: serverless functions, i.e., Lambda functions in the AWS ecosystem. They are lean and very cheap. However, you need to consider two things before using them:
- Compared to container images, Lambda functions are less flexible. For instance, using third-party Python libraries can be painful since you have to provide a .zip file with the dependencies. Compare this hassle with a pip command as part of a Dockerfile.
- Lambda functions are limited in computational power and runtime duration. These limitations can become problematic when the amount of data or the complexity of the processing increases.
Thanks for reading!
If you want to know more about how to save cloud costs in general, have a look at our Cloud Cost Cleanup offer! If you need further help or would like to discuss these topics in more depth, my colleagues and I are happy to be there for you!