Speed up your CI/CD jobs in Kubernetes


A performant, well-integrated CI/CD environment is one of the key factors for fast and agile software development. To achieve short feedback cycles and increase development speed, jobs need to run as fast as possible and, ideally, start instantly, so that the runtime of your pipeline stays as low as possible.
This blog post will explain how to speed up your Kubernetes-based CI/CD infrastructure.

CI/CD with GitLab and Kubernetes

We use GitLab as our code-management tool. GitLab ships with a fully integrated CI/CD solution that supports executing your jobs on a Kubernetes cluster with the Kubernetes executor. Using this executor on an auto-scaling Kubernetes cluster can be a great way to get a dynamic CI/CD environment. Such a setup automatically provides the resources your users need. At the same time, because auto-scaling is enabled, you only pay for resources while they are in use.
For each CI job that’s triggered via a GitLab pipeline, the runner creates a new pod in the cluster. Therefore, there is usually a highly varying workload on the cluster with peak times and low times, often depending on the time of day.
Auto-scaling can be added to such a cluster with the Cluster-Autoscaler. This tool scales your cluster down to an absolute minimum at times with few or no build jobs, and scales out to many nodes when a lot of jobs need to be processed.

How does the Cluster-Autoscaler work?

The Cluster-Autoscaler adds new nodes to the cluster if there are pods in the “unschedulable” state. With the default scan-interval, a scale-up is triggered up to 10 seconds after a pod was marked unschedulable.

unschedulable
A pod is considered unschedulable if there is no node suitable to host the workload. This might be the case if, for example, all resources are exhausted.

The Cluster-Autoscaler shuts down nodes if they have been unneeded for at least 10 minutes. Nodes are unneeded if they are empty or their workload can be shifted to the remaining nodes. Please refer to the documentation for more information on when pods are considered shiftable.
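
The two timings above correspond to command-line flags of the Cluster-Autoscaler. The following is a minimal sketch of the relevant part of a Cluster-Autoscaler Deployment, not a complete manifest. The cloud provider and the image tag are assumptions chosen for illustration; both timing flags are shown with their default values.

```yaml
# Fragment of the pod template in a Cluster-Autoscaler Deployment (not a complete manifest).
# Assumptions: the cloud provider and the image tag are placeholders for illustration.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # How often the cluster is re-evaluated for scale-up and scale-down (default: 10s)
      - --scan-interval=10s
      # How long a node must be unneeded before it is eligible for scale-down (default: 10m)
      - --scale-down-unneeded-time=10m
```

Lowering the scan interval makes the autoscaler react faster to unschedulable pods, but it cannot shorten the time the cloud provider needs to boot and register a new node.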

The problem with autoscaling in CI/CD environments

The runner will schedule a pod for every CI job from a GitLab pipeline. If there is free capacity in the cluster, this pod will start almost immediately and run your code. But what if the cluster’s resources are already fully allocated?
Depending on your setup and the chosen cloud provider, a scale-up may take up to 5 minutes (Kubernetes-related initialization included), even on major cloud providers like AWS, GCP or Azure. In the worst case, this adds 5 minutes to nearly every job, no matter whether the job itself takes 5 seconds or 20 minutes. That may lead to very unhappy users and a lot of inefficiency.

Use overprovisioning to reduce startup overhead

One way to solve the previously described problem is overprovisioning.
Overprovisioning means that the cluster always provides somewhat more resources than are actually needed. With overprovisioning in place, there are always some resources available, so your CI jobs don’t have to wait for new capacity to become available.

Unfortunately, this is not a built-in feature of the Cluster-Autoscaler. To achieve overprovisioning that depends on the cluster size, the team providing the Cluster-Autoscaler proposes a solution in their FAQ that uses the Cluster-Proportional Autoscaler (CPA for short) and a placeholder deployment.

How does the proposed solution work?

For basic overprovisioning, you only need the placeholder deployment. The FAQ proposes a deployment based on the pause image. The only purpose of these pods is to reserve the configured amount of resources.
To benefit from the additionally reserved resources, the pause pods need to be evicted immediately when a build job is scheduled. This can be achieved using the PriorityClass resource in Kubernetes. By assigning a PriorityClass with a low priority to the placeholders and a higher priority to everything else, Kubernetes will remove (preempt) the pause pods in favour of the CI job.
Because the placeholder pods are managed by a Deployment, the evicted pods are recreated and need to be scheduled again. If there are no resources left to allocate in the cluster, such a pod becomes unschedulable and the Cluster-Autoscaler triggers a scale-up of the nodes.
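
The placeholder setup described above can be sketched as follows. This is a minimal example modelled on the one in the Cluster-Autoscaler FAQ, not the exact manifest any chart deploys; the names, the namespace, the replica count, the pause image tag and the resource requests are assumptions for illustration.

```yaml
# Low-priority class for the placeholder pods.
# Pods without a priorityClassName get priority 0 (unless a globalDefault class
# exists), so regular CI job pods automatically outrank the placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Priority class used by overprovisioning placeholder pods."
---
# Placeholder deployment: pause pods that do nothing except reserve resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default
spec:
  replicas: 2                          # assumption: static count, no CPA yet
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0 # free the reserved space without delay
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"                 # assumption: size of one reserved slot
              memory: 1Gi
```

The priority value should not be lower than the Cluster-Autoscaler's expendable-pods cutoff (`--expendable-pods-priority-cutoff`, default -10). Pods below that cutoff are treated as expendable and do not trigger a scale-up, which would defeat the purpose of the placeholders.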

To improve on the very static approach above, the FAQ suggests using the Cluster-Proportional Autoscaler. The CPA is a tool that scales a target resource based on the actual cluster size. It constantly checks how many nodes are part of the cluster (or, alternatively, the total number of CPU cores) and adapts the number of replicas of the target resource as configured. With this component in place, you can control the number of placeholder pods based on the cluster size.

Examples

For example, you can configure the CPA in a way that it always scales the target to half as many replicas as there are CPU cores.
Alternatively, you can define a ladder function, for example: scale to 2 replicas if the cluster size is up to 5 nodes, and to 7 replicas if the cluster size is more than 5 nodes.
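
The two examples above might be expressed as CPA configurations roughly like this. The CPA reads its parameters from a ConfigMap whose data key selects the mode (`linear` or `ladder`). The ConfigMap names and the namespace below are hypothetical, and a single CPA instance would use only one of the two.

```yaml
# Two alternative ConfigMaps for the CPA; a CPA instance reads only one of them.
# Assumptions: the ConfigMap names and the namespace are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-linear
  namespace: default
data:
  # Linear mode: one replica per 2 CPU cores, i.e. half as many replicas as cores
  # (the result is rounded up)
  linear: |-
    {
      "coresPerReplica": 2
    }
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-ladder
  namespace: default
data:
  # Ladder mode: 2 replicas for 1 to 5 nodes, 7 replicas from 6 nodes upwards
  ladder: |-
    {
      "nodesToReplicas":
      [
        [ 1, 2 ],
        [ 6, 7 ]
      ]
    }
```

In ladder mode, each entry is a pair of lower bound and replica count, and the CPA applies the entry with the largest lower bound that does not exceed the current cluster size.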

A Helm chart to rule them all

At the time of writing this blog post, there was no Helm chart that installs all necessary components in your cluster. There is a fairly new Helm chart for the CPA, which can be found here. To deploy the placeholder deployment and the PriorityClass setup, one could use this Helm chart by Delivery Hero.

However, to make the installation as smooth and integrated as possible, we decided to create yet another Helm chart that combines both components and adds the option to switch between different overprovisioning configurations on a schedule.
You can find the new Helm chart, called cluster-overprovisioner, on GitHub.

Without much configuration, this Helm chart deploys the CPA together with its target, a placeholder deployment called overprovisioning (OP), including the PriorityClass setup for evicting the pause pods. The only things you should adapt to your needs are the defaultConfig and the op.resources block. Examples and explanations for the former can be found in the Readme or the examples folder in the repo.
The latter depends on your use case. In our case, we decided that each pause pod should reserve capacity for an average CI job.
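
As a rough sketch, a values override for the op.resources block might look like the following. Only the key op.resources is taken from the text above; that it accepts a standard Kubernetes resources block is an assumption, and the request sizes are placeholders. Please check the chart's Readme for the actual schema.

```yaml
# Values override for the cluster-overprovisioner chart (sketch, not taken from the chart).
# Assumption: op.resources takes a standard Kubernetes resources block.
op:
  resources:
    requests:
      cpu: "1"       # placeholder: CPU request of an average CI job
      memory: 2Gi    # placeholder: memory request of an average CI job
```

The total reserved capacity is the number of placeholder replicas multiplied by these requests. With the placeholder values above, 7 replicas would keep 7 CPU cores and 14Gi of memory free for incoming jobs.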

Example configuration with descending replicas

Currently, we use the following configuration:

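ladder:
  {
    "nodesToReplicas":
    [
      [ 0,7 ],   # fewer than 8 nodes: 7 placeholder pods
      [ 8,4 ],   # 8 to 11 nodes: 4 placeholder pods
      [ 12,0 ]   # 12 nodes or more: no overprovisioning
    ]
  }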

We have more overprovisioning for smaller cluster sizes and disable it completely once the cluster reaches 12 nodes.
Based on the typical runtime of our CI jobs, we assumed that the bigger the cluster is, the more likely it is that some pods are about to terminate and free up space for new build jobs. Therefore, we use the ladder mode with fewer replicas the bigger the cluster becomes.

Use schedules to keep your bill under control

Assuming most of your developers work in the same or similar time zones, you can most likely define time frames in which you forgo the start-up boost given by overprovisioning in favour of reducing your compute costs. Therefore, we introduced a scheduling feature into our chart. This feature is based on CronJobs. It enables you to provide different configurations for the CPA using cron expressions.

schedules:
  - name: night
    # disable overprovisioning Monday - Friday from 6pm
    cronTimeExpression: "0 18 * * 1-5"
    config:
      ladder:
        {
          "nodesToReplicas":
          [
            [ 0,0 ]
          ]
        }
  - name: day
    # enable overprovisioning Monday - Friday from 7am
    cronTimeExpression: "0 7 * * 1-5"
    config:
      ladder:
        {
          "nodesToReplicas":
          [
            [ 0,7 ],
            [ 8,4 ],
            [ 12,0 ]
          ]
        }

We have these schedules installed in our CI cluster. The night schedule completely disables overprovisioning after 6pm and on weekends. We do have scheduled jobs that run at night or even on weekends. For these, the longer startup time does not matter, as no one is waiting for the jobs to complete.
The other schedule, called day, restores the desired amount of overprovisioning at 7am from Monday to Friday.

As you can imagine, adding overprovisioning to your cluster increases your total costs. Instead of providing the same amount of overprovisioning 24/7, we strongly recommend making use of the schedules. This way, you achieve the best balance between low startup times and additional costs.

Conclusion

CI/CD infrastructures on Kubernetes benefit greatly from adding autoscaling to the cluster. It reduces compute costs to an absolute minimum at times without build jobs and is capable of handling the busiest times. Implementing overprovisioning in this setup reduces the startup times of your jobs. To minimize the additional costs of overprovisioning, we introduced a scheduling feature that lets you enable overprovisioning only when it’s needed and achieve a good balance between performance and costs.

Frederik Grieshaber started as a Software Developer at codecentric AG before switching to its subsidiary, cc cloud GmbH. This included a shift of his focus towards designing, building and maintaining cloud systems. Since then, he has specialized in working with Kubernetes and GitLab CI/CD.

Thilo Wobker works as a Site Reliability Engineer at cc cloud. His focus is on building infrastructure of any kind, but preferably something related to Kubernetes and with security in mind.
