How can we help a fictitious startup kickstart its software development process? Using Terraform and AWS services, we’ll build an IT infrastructure that is ready within minutes and ticks quite a few boxes on the technical DevOps capabilities list.
Just pour hot water (and run a few Terraform commands)!
Short on time? Here’s an ultra-short abstract:
- For an imaginary startup that wants to get into an iterative development loop as quickly as possible, an Infrastructure as Code solution combined with cloud services can greatly shorten the ramp-up time while remaining flexible for infrastructure growth.
- Going for a “minimum viable CD” approach, we can focus on implementing an initial set of technical DevOps capabilities that enable the team to get into the development/learning loop and take it from there.
- While we can address quite a few technical DevOps capabilities with the provided Terraform project, we should keep in mind that accelerating software delivery requires a multidimensional set of competencies. The cultural mindset of our organization and our team structures need to support the speed we aim for and provide a healthy and supportive environment for humans to create artifacts of value. If you’re interested in digging deeper into this, I published a blog post about this a while ago.
Want to try it out yourself? All code/configuration used in the upcoming examples is available on GitHub. All you need to get started is an AWS account.
A fictitious startup with a new product idea
Meet our imaginary startup customer, MyMountains. MyMountains belongs to a larger corporation, FarAway Fabrics, a sporting equipment and sustainable clothing manufacturer for outdoorsy people. FarAway Fabrics ran a few experiments to find out if customers would be willing to use a service that provides information on nearby mountains and their peaks. The experiments went well and a closed customer group provided positive feedback to the first MVP of the app. FarAway Fabrics decided that it is now time to move this early product to a new brand to allow it to grow into a sustainable source of income for the corporation. The business goals are set as follows:
- Have a usable product on the market before the competition
- Increase the customer base and evolve the product accordingly
To support quick learning, the team responsible for developing the product should be able to move quickly and be able to perform changes with a minimum lead time.
Software delivery goals
The MyMountains development team is eager to embark on the product development journey, and their product owner would like to have a look at the first potentially shippable product as soon as possible. That’s why we set our tech goals as follows:
- Get into the loop as quickly as possible: With little to no IT infrastructure, enable MyMountains to start product development as quickly as possible. How about in 30 minutes?
- Move fast inside the loop: Build an IT infrastructure that enables MyMountains to learn in short cycles and helps them keep the software delivery lead time low. The solution should be flexible enough to grow and change as more is learned about the market and its customers.
- Reduce complexity and distractions: Build the infrastructure so that it keeps developer cognitive load in check, so people can focus on the product and feel enabled to make changes if needed.
With these goals set, let’s decide on measures to fulfill these.
Getting into the loop
Assuming that all that MyMountains has at this point is a Kanban board, a Product Owner, and a number of motivated developers all fired up to get going, an idea to enable the team as quickly as possible would be to:
- Provision our infrastructure in the cloud: This allows us to pick the components we need off the shelf, with no wait time and provides the elasticity to grow dynamically, depending on how our product evolves. We’ll pick AWS, as there’s already some knowledge present in the team.
- Use an Infrastructure as Code solution: This allows us to automate infrastructure provisioning and creates a reproducible and extensible infrastructure blueprint. We’ll pick Terraform because it integrates well with AWS.
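As a minimal sketch of the starting point (provider version and region are assumptions here, not necessarily what the GitHub repository pins), a Terraform root module targeting AWS begins with a provider declaration:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"   # illustrative version constraint
    }
  }
}

# The region is an assumption; pick whatever is closest to your users.
provider "aws" {
  region = "eu-central-1"
}
```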
Moving fast inside the loop: Putting together our DevOps menu
As one of our goals is to keep the team’s software delivery lead time low, we’ll focus on making technical DevOps capabilities available to the MyMountains development team. As a starting point, we’ll define a minimum viable CD setup by picking the capabilities that have the most immediate value for the development team, using DORA’s technical DevOps capability list as our menu:
- Capabilities marked green will be included in the initial solution. They have an immediate impact in the early development stage of our product.
- Capabilities marked grey won’t be included in the initial solution. These will become more important in later stages of product development as our product/team/company grows.
From the capabilities we picked, the provided solution will deliver the backing services to check our code into version control, build and test our code, automatically deploy it to the target environment, and perform continuous integration and delivery. As we’re deploying our entire architecture in the AWS cloud, we can also tick off monitoring and observability, as we’ll be able to analyse our application through CloudWatch.
The remaining capabilities marked relevant for this sample implementation (Trunk-Based Development, Loosely Coupled Architecture) won’t be included in the technical solution, but we’d encourage the team to consider accounting for these when building and extending their product. While I don’t have a strong opinion for or against Trunk-Based Development, AWS encourages its implementation and made it the preferred mode of development when using CodePipeline (automated builds from the main branch are straightforward to configure, while building/deploying from feature branches is a lot more difficult to set up – AWS explained the reasons for this opinionated approach in their blog).
So much for the technical DevOps capabilities. I demonstrated in a previous blog post that the remaining three capability dimensions (measurement, process, cultural) aren’t any less important; however, focusing on these is out of scope for this article. The non-technical capability dimensions have less of a software focus and are thus harder to implement and measure. In a real-world scenario, we would staff our team accordingly to be able to tackle these dimensions as well.
Complexity is everywhere, but where should we accept it, and where can we actively seek to reduce it? In a previous blog post, I proposed an answer to this by defining the role of cloud service providers. Dividing the technical aspects of our product into a lower service layer and an upper business-facing layer, we can eliminate a great deal of complexity by relying on off-the-shelf “X as a Service” products at a cloud service provider of our choice.
In addition to that, we’ll limit ourselves to a single cloud provider (AWS), a single AWS account, using Terraform for the entire infrastructure, and using AWS managed services wherever possible. This should enable the MyMountains development team to focus on the product instead of managing and configuring infrastructure.
I’ll admit, all of these components and services have a complexity of their own and need to be mastered like every other bit of technology. The idea here is that, by using ready-to-use services, the administrative burden shifts from the development team to the cloud service provider. Using Terraform, we can further reduce complexity for developers by eliminating the need to manually set up and configure all the services.
With the capabilities that will have to be part of our solution mapped out, let’s sketch a basic architecture to get an idea of what our dev team’s infrastructure could look like.
The dev team decided on building a Spring Boot REST service to provide the first bit of functionality. It will serve as the first building block of the product’s application backend, and we’ll build our architecture around it.
We’ll split the infrastructure into 2 parts:
- Build infrastructure: Contains everything that is needed to host the source code of the application as well as the infrastructure for building, testing, packaging and deploying the application. This part of the infrastructure is only needed once and will be shared for all developers and environments.
- Compute infrastructure: Contains all the resources needed to execute the application. As continuous testing and continuous integration are capabilities we need to design the infrastructure for, we’ll provision this part of the infrastructure once per environment. For the sake of simplicity, we’ll start with 2 environments, DEV and PROD. Deployments to PROD will occur using a blue/green strategy.
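The “once per environment” part maps naturally to a reusable Terraform module. As a hedged sketch (module path and variable names are illustrative, not necessarily those used in the repository), instantiating the compute infrastructure per environment could look like this:

```hcl
# Illustrative module instantiations; one per environment.
module "compute_dev" {
  source      = "./modules/compute"
  environment = "dev"
}

module "compute_prod" {
  source      = "./modules/compute"
  environment = "prod"
}
```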
With the basic building blocks mapped out, our architecture would look something like this:
From a software developer’s perspective, here’s what working with the architecture would feel like:
- After pushing code to the Git Repository (provided by AWS CodeCommit), the code will automatically be built and tested using AWS CodePipeline and CodeBuild.
- The resulting container image will be pushed to Amazon’s ECR Container Registry.
- All changes to the main repository branch will automatically be deployed to the DEV compute environment, which is an AWS ECS Fargate Service.
- Upon manual approval, the change is also deployed to the production environment using a blue/green strategy (provided by CodeDeploy).
- After successful deployment, the backend services are accessible via their load balancer’s URL.
Building the infrastructure
With the architecture mapped out, we use Terraform’s HCL (HashiCorp Configuration Language) to express the architectural blueprint as code. We end up with a few modules, each representing the respective part of the architecture.
You’ll find all the code and infrastructure definitions on GitHub.
As Terraform uses a declarative approach to create and maintain infrastructure, it needs to store its current view of the infrastructure (state) somewhere. Without further configuration, that would be the current directory from which it is executed.
As we want to ensure that we have our infrastructure provisioned once for all developers and users of the product, it would be a good idea to store its state in a centralized location and ensure that no concurrent modifications occur. This can be achieved by implementing remote state (by the way, there’s another nice codecentric blog post on this topic). As we want to keep all our resources in AWS, we’ll use S3 and DynamoDB for this.
S3 will be our storage for Terraform’s state file, and DynamoDB will provide us with a centralized resource to set locks. The S3 bucket and DynamoDB table are set up in a separate module (Terraform definition) and are referenced in the main infrastructure definition file.
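As a sketch, the backend configuration referencing the previously created bucket and table could look like this (bucket name, state key, and table name are illustrative; check the repository for the actual values):

```hcl
terraform {
  backend "s3" {
    bucket         = "mymountains-terraform-state"  # illustrative bucket name
    key            = "infrastructure/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-locks"              # lock table from the remote-state module
    encrypt        = true
  }
}
```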
While there are various solutions and providers for hosting Git repositories and placing build and deployment pipelines on top of those (GitLab, GitHub Actions, Jenkins, etc.), I decided on using AWS services for our CI/CD infrastructure. Using the Terraform AWS provider, we define the following components:
- CodeCommit Repository (Terraform definition), which will host our Git Repo. For automated builds to work, we’ll also create the respective IAM roles and policies needed to be able to invoke CodePipeline whenever the Repository state changes. Also, we’ll make sure to output the repository’s URL so that it’ll be easy to clone.
- CodePipeline definition (Terraform definition), required for invoking builds and deploying our application into the desired environments. This contains the stage definitions, which define what needs to be done when, and which artifacts result from the respective steps.
- CodeBuild project definition (Terraform definition), which is linked to the CodeCommit repository as well as the container registry and contains information on which build servers to use for performing the actions defined in the CodePipeline definition. The build specification itself isn’t part of the project but will be included in the source code repository of the service we’re about to build. By default, CodeBuild expects a buildspec.yaml file in the source code repository it is linked to. This mechanism is very similar to placing a gitlab-ci.yaml or a Jenkinsfile in your source code repo.
First of all, in order to retain flexibility and to be able to introduce different deployment configurations, we’ll provision 2 separate compute environments (DEV & PROD). However, we want to keep the environments as identical as possible, so we’re using the same AWS services for the compute environment.
When running containers, many companies decide to use Kubernetes, and AWS provides its own “Kubernetes as a Service” (EKS), so we wouldn’t have to set up a cluster ourselves. While Kubernetes is widely adopted and comes with a lot of flexibility, it is also criticized by some as being overly complex and hard to grasp. For those who seek an alternative, AWS offers its own opinionated container service, ECS. It integrates well with other AWS services and comes with fewer configuration options. Having only worked with Kubernetes in the past, I find ECS worth a try, so we’ll give it a go for this solution.
Another decision to make is whether to use EC2 instances or rely on AWS Fargate to run the containers. We’ll stick with Fargate here as this reduces the administrative overhead for us while allowing us to remain flexible – with Fargate, we don’t have to worry about having the right amount of EC2 instances up and running to be able to keep up with the user load while keeping the cost in check.
The following components need to be set up:
- ECS cluster (Terraform definition): Required for running and scaling our backend service.
- ECS task (Terraform definition): The ECS task definition contains information on what container image to run, how many compute resources to allocate for it, and how to log information emitted by the container.
- ECS service (Terraform definition): Similar to a Kubernetes Service, this resource contains information on how to scale the task as well as the Load Balancer to link to.
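To make the task/service split concrete, here is a hedged sketch of a Fargate task and service in HCL. Resource names, sizing, the container port, and the referenced IAM role, ECR repository, cluster, and subnets are all illustrative, not the repository’s actual identifiers:

```hcl
# Sketch of an ECS Fargate task definition; sizes and names are assumptions.
resource "aws_ecs_task_definition" "backend" {
  family                   = "mymountains-backend"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.ecs_execution.arn  # illustrative reference

  container_definitions = jsonencode([{
    name         = "backend"
    image        = "${aws_ecr_repository.backend.repository_url}:latest"
    portMappings = [{ containerPort = 8080 }]
  }])
}

# The service keeps the desired number of tasks running behind the load balancer.
resource "aws_ecs_service" "backend" {
  name            = "mymountains-backend"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.backend.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets = aws_subnet.private[*].id  # illustrative subnet reference
  }
}
```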
For the DEV environment, the setup is pretty straightforward as we’ll simply keep the most recent image running. For the PROD environment, we want to perform blue/green deployment, so there’s a little more work to do here. Primarily, we need to add a bit of CodeDeploy configuration as well as another load balancer, so that we’re able to route traffic between the blue and green deployments.
In order for the compute environment to be accessible from the public internet, we’ll add the required network infrastructure. That means we add a VPC, a private and public subnet in one (DEV) or several (PROD) availability zones, an internet and NAT gateway, route table, and one (DEV) or two (PROD) application load balancers, along with their target groups and listeners. PROD requires two load balancers in order to allow for blue/green deployment.
Having all required infrastructure components specified in HCL, we’re now ready to have it provisioned by Terraform.
Provisioning the infrastructure
As the entire infrastructure has been specified in Hashicorp Configuration Language (HCL), all we need to do now is set up our AWS CLI and let Terraform do its magic. This could either be further automated by invoking Terraform as part of a build pipeline, or executed from any PC with a CLI. For the sake of simplicity, we’ll opt for the latter approach and carry out the following preparation steps:
- Create an AWS account
- Install git, the AWS CLI, and Terraform on our machine
- Configure the AWS CLI with our account details and configure the Git CLI
Once we’re set up, we’ll initialize the Terraform remote backend by navigating into the remote-state directory, and applying the configuration:
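Concretely, that boils down to the following commands (the directory name follows the project layout described above):

```sh
cd remote-state
terraform init    # initialise the working directory and download the AWS provider
terraform apply   # review the plan, then confirm to create the S3 bucket and DynamoDB table
```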
With the remote backend set up, we can provision the infrastructure the same way: by initialising and applying the Terraform configuration after navigating into the infrastructure directory.
Once Terraform is done provisioning the infrastructure (which can take a few minutes), it’ll output a few variables for us:
- The URL for cloning the created Git repository
- The ARNs of the execution roles required to run our application in the ECS cluster
- The load balancer’s addresses for accessing the DEV/PROD deployments from the public internet
We should note these for later – our developers will need these.
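These values come from output blocks in the Terraform configuration, roughly of this shape (output and resource names are illustrative):

```hcl
# Illustrative outputs; the repository's actual names may differ.
output "repository_clone_url_http" {
  value = aws_codecommit_repository.backend.clone_url_http
}

output "dev_load_balancer_dns" {
  value = aws_lb.dev.dns_name
}
```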
Working with the Infrastructure – Developer perspective
The infrastructure is now ready for our team to get started. One of the developers prepared a minimum viable Spring Boot service that provides the first bit of functionality, which will serve as our deployment artifact for testing automated builds and deployments. We start by cloning the CodeCommit repository and copying our Spring Boot code over.
Adding a build specification
In order to build our service skeleton remotely, we need to add the required build specification (buildspec.yml) to its source code repository. It is checked in alongside the service’s code and picked up by AWS’s CodePipeline service:
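A minimal buildspec for a containerized Spring Boot service could look roughly like this. This is a sketch, not the repository’s actual file: the REGISTRY and REPOSITORY_URI environment variables and the artifact file names are assumptions.

```yaml
version: 0.2
# Sketch only – REGISTRY and REPOSITORY_URI are assumed to be supplied
# as CodeBuild environment variables by the Terraform project.
phases:
  pre_build:
    commands:
      # Authenticate the Docker client against our ECR registry
      - aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
  build:
    commands:
      - ./mvnw package                            # build and test the Spring Boot service
      - docker build -t $REPOSITORY_URI:latest .
  post_build:
    commands:
      - docker push $REPOSITORY_URI:latest
artifacts:
  files:
    - appspec.yml      # deployment files handed on to CodeDeploy
    - taskdef.json     # illustrative name for the production task definition file
```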
Adding deployment and task specifications
For deployment to the DEV stage, we can rely on the ECS default, which automatically deploys the latest available container version. For the PROD stage we’d like to perform blue/green deployment. For this to work, we already prepared the infrastructure by creating a CodeDeploy configuration. In order for CodeDeploy to perform the deployment, it needs to find the ECS application and task information in the container image.
For this to work, we add another 2 files to the application’s source code directory. We start with the application specification (appspec.yml), which gives information about the target ECS service to deploy the container to:
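The shape of a CodeDeploy application specification for ECS is fixed; container name and port below are illustrative:

```yaml
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>   # placeholder, filled in by CodeDeploy
        LoadBalancerInfo:
          ContainerName: "backend"          # illustrative container name
          ContainerPort: 8080
```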
Additionally, we add the ECS production task definition:
Both files contain a number of placeholders. Values in angle brackets will be replaced by CodeDeploy during deployment. The ARN of the execution role is something we’ll have to supply ourselves; we’ll replace this value with the variable the Terraform stack supplied after creating the infrastructure.
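As a sketch, the production task definition could look like this: <IMAGE1_NAME> is the placeholder CodeDeploy substitutes with the freshly built image, while the executionRoleArn is the value we paste from the Terraform output. All other values (family, sizing, port, the example account ID) are illustrative:

```json
{
  "family": "mymountains-backend-prod",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecs-execution-role",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "backend",
      "image": "<IMAGE1_NAME>",
      "portMappings": [{ "containerPort": 8080 }]
    }
  ]
}
```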
With all required files added and modified according to our environment, it’s time to commit and push our repo for the first time. This should automatically trigger a build and deployment to the DEV environment.
Observing and testing the first service deployment
After pushing our walking service skeleton to CodeCommit, we can use the respective AWS service consoles to see what’s going on. We’ll start with CodePipeline to observe the build process:
After the DEV deployment has completed, we should be able to verify that the service is up and running by checking the task statuses in the ECS console (for this service, we configured 3 parallel task instances):
Additionally, we can confirm that the service is reachable by accessing it via the DEV load balancer’s address and verify that it returns a few mountains from the Munich area:
Once the deployment to the DEV environment is done, we have to manually approve deployment to production in the CodePipeline console. A few minutes after we do that, we can verify that the service is up and running by directing cURL against the production environment’s load balancer. This should return an identical list of mountains. If something went wrong, we could use CloudWatch to have a look at the service’s logs (all logged output will go to the respective dev/prod log group that we created with the infrastructure earlier).
Bringing changes to production
With the infrastructure set up and our walking service skeleton running, we can start extending it to evolve our product. Let’s walk through the process of bringing a change to production.
Adding a mountain
Right now the service returns only mountains from around Munich and we’d like to add a few more impressive peaks. We start with the famous K2 peak in Pakistan. Once the code change is pushed/merged to the main branch, the build pipeline is triggered once again, replacing the current deployment in the DEV environment. We’ve tested the change and think it’s ready for deployment to production.
We hit approve, and blue/green deployment to production begins:
Switching to the CodeDeploy Console, we can see that due to our configured blue/green deployment, another ECS task set was deployed. Currently, both versions of our service are running alongside each other. The new version currently receives 10% of all traffic:
If we repeatedly query our service via cURL, we should be able to verify that roughly 1 in 10 responses contains the peak we just added. After a predefined wait time defined by the selected canary deployment strategy (in our case, 5 minutes), the traffic shift is carried out in full and all subsequent requests are routed to the new version.
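That wait time is determined by the deployment configuration chosen in the CodeDeploy deployment group. In Terraform, it comes down to a single argument; the resource below is abbreviated to show just that line (the deployment config name is one of AWS’s predefined ECS configurations):

```hcl
resource "aws_codedeploy_deployment_group" "prod" {
  # ... other arguments omitted for brevity ...

  # Shift 10% of traffic to the new task set, wait 5 minutes, then shift the rest.
  deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"
}
```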
At this point in the deployment, the old version will still remain up and running. Should we realize that anything is wrong, we could still roll back to the previous deployment. After another 5 minutes, the old deployment is eventually taken out of service and torn down. Once this final step has completed, our deployment has succeeded.
From now on, all responses against the production load balancer will return the newly added mountain:
We now have everything in place to grow our service and infrastructure and sustain a software delivery process that supports quick learning. With the solution now fully built and tested, let’s recap.
Lessons learned: Have we accelerated MyMountains?
With our goals being defined as enabling a team to enter and walk through an iterative learning and development cycle as quickly as possible while keeping complexity in check, the proposed Terraform project could be a solid first building block. It is ready to grow and change as more is learned about the market and its customers. It simplifies and standardizes infrastructure setup, ticks off quite a few boxes on the technical DevOps capabilities list, and thus allows the team to shorten its delivery lead time. Working in small batches and deploying small, incremental changes from development to production looks quite doable with this setup.
Looking at the DORA metrics once again (lead time for changes, time to restore service, deployment frequency, and change failure rate), it would be fair to say that the environment we built sets us up for a decent score on all of them.
Acceleration is more than DORA metrics
As I pointed out in an earlier article series, the act of using software to build a product or solve a problem is more than just a technological challenge, and its success depends not only on how we fare in the machine room, but also on how well we do culturally as an organization. While I spend the majority of my working day optimizing things like “the time it takes to get from commit to production”, I’d like to stress that we must not forget about everything that happens before we commit our code, or before the first line of code is even typed.
We should think about how information flows through an organization, ask ourselves if people feel safe and encouraged to raise problems, and seek to reduce team cognitive load. Just like designing a system by defining its architecture, team structures need careful design, too. The humans building the system need a team setup that puts both the product and the people behind it in focus, creating an environment that helps them reduce waste and information loss.
There are great resources on mastering the DevOps skills to build software more effectively. Let’s also learn to become better at building effective team structures where people feel appreciated and empowered, and let’s broaden our view to see DevOps as a whole. Let’s understand better what acceleration means for our business, and what we can do to deliver value faster. If we can learn to see the whole picture, we might not only end up accelerating our business, but also become a place where people wholeheartedly like to work and collaborate. If we achieve that, then we win a lot more than market share or revenue.
I hope you enjoyed reading this article and found it useful. Thanks for reading!