Recently a customer (big in the music festival business) asked me for a cloud solution to support continuous reporting on administrative business workflows. The architecture had to be highly available and scale up and down daily by as much as 500%, due to the seasonal nature of the customer’s business. Serverless Lambda functions initially proved to be a suitable technique for these requirements. Later, we used AWS Step Functions to lower our VPC costs.
In this article I will take you on a tour from our initial naive Lambda setup to an improved one, and explain what went wrong along the way. Ultimately we chose AWS Step Functions to coordinate the Lambda workflow, with the added benefit that a workflow can cross VPC boundaries. This let us optimise our use of VPC components. Read about our journey below!
Initial Serverless VPC architecture
Our solution processes files that an external system puts on S3. A Lambda function listens to S3 events, takes each new file, transforms its content, and loads it into an RDS database so that QuickSight can report on it. The AWS architecture is designed as follows.
After a few weeks, our solution fell short. Time-outs caused S3 reads to fail, which left the processing hanging in an infinite loop. Even in our happy flow, the entire read, transform, and load cycle was unacceptably slow.
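To make the read, transform, and load flow concrete, here is a minimal sketch in Python. It is not our production code: `transform`, `process_event`, and the injected `read_object` and `load_rows` callables are hypothetical placeholders for the S3 read and the RDS insert, and the CSV-like file format is assumed purely for illustration. The only AWS-specific part is the shape of the S3 event notification, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        objects.append((s3["bucket"]["name"], key))
    return objects


def transform(raw_text):
    """Hypothetical transform: split CSV-like lines into lists of fields."""
    return [line.split(",") for line in raw_text.splitlines() if line.strip()]


def process_event(event, read_object, load_rows):
    """Read, transform, and load every file referenced by the event.

    read_object(bucket, key) -> str and load_rows(rows) are assumed
    stand-ins for the real S3 read and the real database insert.
    """
    rows_loaded = 0
    for bucket, key in extract_s3_objects(event):
        rows = transform(read_object(bucket, key))
        load_rows(rows)
        rows_loaded += len(rows)
    return {"rows_loaded": rows_loaded}
```

In a real Lambda, a `handler(event, context)` function would call `process_event` with an S3 client read and a database insert. Note that everything happens sequentially inside one invocation, which is exactly what hurt us later.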
Improved Lambda data processing architecture
To speed up the processing, we knew we had to break the steps down into parallel executions. RDS was under little stress, so there was room to shorten the ingestion time. After breaking up the execution, it would conceptually look like the following, where A, B, and C are different operations, each of which can be parallelised further.
One of the biggest challenges in splitting work into parallel executions is that it requires a lot more orchestration and coordination effort. At first we thought about using SQS to orchestrate this work, but that would require significant rework to keep state somewhere (in DynamoDB, for example) and make the workflow aware of its own progress. Furthermore, the added boilerplate was not appealing. AWS Step Functions was the second option we considered, and it supports exactly this kind of orchestration. The more complicated set-up introduced an additional requirement: posting a status notification back to the originating service to report a successful data transformation. However, because the complete set-up was running inside a VPC, this would mean adding a NAT gateway. That leads to pay-by-the-hour expenses, which in my view do not match the Serverless pricing philosophy. Having RDS is a necessity for our QuickSight reporting, but with our seasonal load, we stick to a pay-for-use model as much as possible.
Luckily, Step Functions places no restrictions on where Lambdas are located. You can place Lambdas in a VPC (even in private subnets with zero internet access) close to secure resources, and invoke non-VPC Lambda functions, which reach the internet at zero fixed cost, in other steps!
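The idea can be sketched as a state machine definition in Amazon States Language, built here as a plain Python dictionary. This is an illustration under assumptions, not our actual workflow: the state names (`OperationA`, `OperationB`, `OperationC`, `NotifyOrigin`), the region, and the account ID in the ARNs are all hypothetical. The point is that the state machine does not care which Lambdas are attached to a VPC: the three parallel branches could run in private subnets next to RDS, while the notification step runs outside the VPC.

```python
import json


def lambda_arn(name):
    """Build a hypothetical Lambda ARN (region and account are made up)."""
    return f"arn:aws:lambda:eu-west-1:123456789012:function:{name}"


def task(name, next_state=None):
    """A Task state that invokes the Lambda function with the same name."""
    state = {"Type": "Task", "Resource": lambda_arn(name)}
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def branch(name):
    """A single-step branch of a Parallel state."""
    return {"StartAt": name, "States": {name: task(name)}}


def build_definition():
    """Run A, B and C in parallel, then notify the origin service."""
    return {
        "Comment": "Sketch: VPC Lambdas in parallel, then a non-VPC Lambda",
        "StartAt": "ProcessInVpc",
        "States": {
            "ProcessInVpc": {
                "Type": "Parallel",
                # Assumed to be VPC-attached Lambdas close to RDS.
                "Branches": [
                    branch(n)
                    for n in ("OperationA", "OperationB", "OperationC")
                ],
                "Next": "NotifyOrigin",
            },
            # Assumed to be a non-VPC Lambda that can reach the internet.
            "NotifyOrigin": task("NotifyOrigin"),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

The printed JSON is the kind of document you would supply as the state machine definition. Whether a given Lambda sits inside the VPC is configured on the function itself, not in this definition.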
Our new setup looks like the following architecture.
Reducing VPC costs with AWS Step Functions
We split one huge Lambda invocation, which ran up to the maximum execution time (15 minutes), into parallel processing steps. Lambdas are billed per 100 ms, so splitting one Lambda into five can cost up to almost 400 ms of extra billed time per invocation. However, each workload can now be sized to exactly the right resource utilisation in terms of memory and time. Every smaller run is also a tad more predictable in duration (less variation), and memory use is quite consistent between runs, which makes for easier tuning. Our biggest payoff was that we could drop the NAT gateway, whose cost alone covers 500 million Lambda requests (100 ms, 512 MB).
In practice, some resources constrain where Lambdas can run, forcing them into private subnets. Security-wise I am happy that this is possible, but cost-wise it comes with a lot of added expense. As I have shown, you can avoid these VPC costs by using Step Functions. Other event-chaining services such as S3, SQS, and Kinesis share this quality. However, AWS Step Functions is the only one that actually helps you with orchestration, which makes it my tool of choice.
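The rounding overhead of splitting one Lambda into five can be checked with a small calculation. This sketch assumes the 100 ms billing granularity described above, and the durations are made-up numbers chosen to show the worst case. Each invocation is rounded up to the next 100 ms on its own, so five invocations round up five times where one invocation rounds up once.

```python
BILLING_INCREMENT_MS = 100  # billing granularity assumed in this article


def billed_ms(duration_ms):
    """Round a duration up to the next billing increment (integer maths)."""
    increments = -(-duration_ms // BILLING_INCREMENT_MS)  # ceiling division
    return increments * BILLING_INCREMENT_MS


def total_billed_ms(durations_ms):
    """Total billed time for a list of separate invocations."""
    return sum(billed_ms(d) for d in durations_ms)


# Illustrative durations: the same 5005 ms of work, run once or split in five.
single_run = [5005]
split_runs = [1001, 1001, 1001, 1001, 1001]

extra_billed_ms = total_billed_ms(split_runs) - total_billed_ms(single_run)
print(f"Extra billed time after splitting: {extra_billed_ms} ms")
```

For these durations the single run is billed as 5100 ms and the five split runs as 5500 ms in total, so the split costs 400 ms of extra billed time. Because both totals are multiples of 100 ms, a five-way split of the same work never adds more than that.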
AWS Step Functions: https://aws.amazon.com/step-functions/