Speed up your CI/CD jobs in Kubernetes

2.9.2021 | 7 minutes of reading time

A performant and well-integrated CI/CD environment is one of the key factors for fast and agile software development. To achieve short feedback cycles and increase development speed, jobs need to be as fast as possible and, ideally, should start instantly to keep the runtime of your pipeline as low as possible.
This blog post will explain how to speed up your Kubernetes-based CI/CD infrastructure.

CI/CD with GitLab and Kubernetes

We use GitLab as our code-management tool. GitLab ships with a fully integrated CI/CD solution that supports executing your jobs on a Kubernetes cluster with the Kubernetes executor. Using this executor on an auto-scaling Kubernetes cluster can be a great way to get a dynamic CI/CD environment. This setup automatically provides your users with the resources they need. At the same time, because auto-scaling is enabled, you only pay for resources while they are in use.
For each CI job that’s triggered via a GitLab pipeline, the runner creates a new pod in the cluster. Therefore, the workload on the cluster usually varies a lot, with peak times and low times that often depend on the time of day.
Auto-scaling can be added to such a cluster setup with the Cluster-Autoscaler. This tool scales your cluster down to an absolute minimum in times with only a few or no build jobs at all, and scales out to many nodes when a lot of jobs need to be processed.

How does the Cluster-Autoscaler work?

The Cluster-Autoscaler adds new nodes to the cluster if there are pods in the “unschedulable” state. With the default scan-interval, a scale-up is triggered up to 10 seconds after a pod was marked unschedulable.

unschedulable: A pod is considered unschedulable if there is no node suitable to host the workload. This might be the case if, for example, all resources are exhausted.

The Cluster-Autoscaler shuts down nodes if they have been unneeded for at least 10 minutes. Nodes are unneeded if they are empty or if their workload can be shifted to the remaining nodes. Please refer to the documentation for more information on when pods are considered shiftable.
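
As a rough sketch, the two default timings described above correspond to the following cluster-autoscaler flags. The fragment only shows the relevant part of a container spec; the container name is an assumption, and a real deployment needs further provider-specific flags.

```yaml
# Fragment of a cluster-autoscaler container spec (not a complete manifest).
containers:
  - name: cluster-autoscaler              # assumed container name
    command:
      - ./cluster-autoscaler
      - --scan-interval=10s               # how often the cluster is re-evaluated (default)
      - --scale-down-unneeded-time=10m    # how long a node must be unneeded before removal (default)
```

Lowering these values makes the autoscaler react faster, but it does not shorten the time a new node needs to boot and join the cluster.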

The problem with autoscaling in CI/CD environments

The runner will schedule a pod for every CI job from a GitLab pipeline. If there is free capacity in the cluster, this pod will start almost immediately and run your code. But what if the cluster’s resources are already fully allocated?
Depending on your setup and the chosen cloud provider, a scale-up may take some time: up to 5 minutes (Kubernetes-related initialization included), even on major cloud providers like AWS, GCP or Azure. In the worst case, this adds 5 minutes to nearly every job, no matter whether the job itself needs 5 seconds or 20 minutes. That may lead to very unhappy users and a lot of inefficiency.

Use overprovisioning to reduce startup overhead

One way to solve the previously described problem is overprovisioning.
Overprovisioning means that the cluster always provides somewhat more resources than are actually needed. With overprovisioning in place, we can make sure that some resources are always available, so that your CI jobs don’t have to wait for new capacity to become available.

Unfortunately, this is not a built-in feature of the Cluster-Autoscaler. To achieve overprovisioning that depends on the cluster size, the team behind the Cluster-Autoscaler proposes a solution in their FAQs, using the Cluster-Proportional Autoscaler (CPA for short) and a placeholder deployment.

How does the proposed solution work?

To achieve overprovisioning, you only need the placeholder deployment. The FAQs propose a deployment based on the pause image. The only purpose of these pods is to allocate the configured amount of resources.
To benefit from the additionally allocated resources, the pause pods need to be evicted immediately when a build job is scheduled. This can be achieved using the PriorityClass resource in Kubernetes. By assigning a PriorityClass with a low priority to the placeholders and a PriorityClass with a higher priority to the rest, Kubernetes will remove the pause pods in favour of the CI job.
Because the placeholder pods are managed by a deployment controller, the evicted pods will be rescheduled. If there are no resources left to allocate in the cluster, these pods become unschedulable and the Cluster-Autoscaler triggers a scale-up of the nodes.
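
A minimal sketch of this setup, closely following the example in the cluster-autoscaler FAQ, could look like the manifests below. The resource names, the namespace, the replica count, the pause image tag and the resource requests are assumptions and need to be adapted to your cluster.

```yaml
# Low-priority class for the placeholder pods (name is an assumption).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                 # lower than the default pod priority of 0
globalDefault: false
description: "Priority class for placeholder pods that may be evicted at any time."
---
# Placeholder deployment: pause pods that only reserve resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default      # assumed namespace
spec:
  replicas: 2             # static here; the CPA would manage this value
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause:3.9   # assumed tag
          resources:
            requests:
              cpu: 500m        # assumed size of one reserved slot
              memory: 512Mi
```

Pods without an explicit PriorityClass get priority 0 unless a global default is defined, so a value of -1 is enough to let regular CI pods preempt the placeholders. The value should stay above the cluster-autoscaler's expendable-pod cutoff (-10 by default), because pods below that cutoff do not trigger a scale-up.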

To improve on the very static approach above, the FAQs suggest using the Cluster-Proportional Autoscaler. The CPA is a tool that scales a target resource based on the actual cluster size. It constantly checks how many nodes are part of the cluster (or, alternatively, the total number of CPU cores) and adapts the number of replicas of the target resource as configured. With this component in place, you can control the number of placeholder pods based on the cluster size.

Examples
For example, you can configure the CPA in a way that it always scales the target to half as many replicas as there are CPU cores. Alternatively, you can define a ladder function, like: scale to 2 replicas if the cluster-size is up to 5 nodes, and to 7 replicas if the cluster size is more than 5 nodes.
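
The two examples above could be expressed roughly as the following CPA ConfigMaps. This is a sketch, not a tested setup: the ConfigMap names are hypothetical, only one of the two would be passed to a given CPA instance, and the ladder entries assume that each threshold is an inclusive lower bound (so the second step applies from 6 nodes on).

```yaml
# Option A: linear mode, one replica per two CPU cores.
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-linear    # hypothetical name
data:
  linear: |-
    {
      "coresPerReplica": 2
    }
---
# Option B: ladder mode, 2 replicas up to 5 nodes, 7 replicas above that.
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-ladder    # hypothetical name
data:
  ladder: |-
    {
      "nodesToReplicas":
      [
        [ 0, 2 ],
        [ 6, 7 ]
      ]
    }
```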

A Helm chart to rule them all

At the time of writing this blog post, there was no Helm chart that installs all necessary components in your cluster. There is a fairly new Helm chart for the CPA, and to deploy the placeholder deployment and the PriorityClass setup, one could use the Helm chart by Delivery Hero.

But to make the installation as smooth and integrated as possible, we decided to create yet another Helm chart that combines both components and adds the ability to use different overprovisioning configurations based on schedules.
You can find the new Helm chart, called cluster-overprovisioner, on GitHub.

Without much configuration, this Helm chart deploys the CPA and, as its target, a placeholder deployment called overprovisioning (OP), including the PriorityClass setup for evicting the pause pods. The only things you should adapt to your needs are the defaultConfig and the op.resources blocks. Examples and explanations for the former can be found in the Readme or the examples folder in the repo.
The latter needs to be adapted to your use case. In our case, we decided that each pause pod should reserve capacity for an average CI job.
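
As an illustration, an op.resources block sized for an average CI job might look like this in your values file. The nesting below the op.resources key is assumed to follow the standard Kubernetes resources layout, and the numbers are purely hypothetical; please check the chart's Readme for the exact structure.

```yaml
# Hypothetical values: one pause pod reserves roughly one average CI job.
op:
  resources:
    requests:
      cpu: "1"        # assumed average CPU request of a CI job
      memory: 2Gi     # assumed average memory request of a CI job
```

Sizing the placeholder like a typical job means that evicting one pause pod frees enough room for exactly one new job pod.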

Example configuration with descending replicas

Currently, we use the following configuration:

ladder:
  {
    "nodesToReplicas":
    [
      [ 0,7 ],
      [ 8,4 ],
      [ 12,0 ]
    ]
  }

We have more overprovisioning for smaller cluster sizes and disable it completely if the cluster grows bigger than 12 nodes.
We assumed, based on the usual runtime of our CI jobs, that the bigger the cluster is, the more likely it is that some pods are about to terminate and free up space for new build jobs. Therefore, we use the ladder mode with fewer replicas the bigger the cluster becomes.

Use schedules to keep your bill under control

Assuming most of your developers work in the same or similar time zones, you can most likely define time frames in which you give up the start-up boost provided by overprovisioning in favour of lower compute costs. Therefore, we introduced a scheduling feature into our chart. This feature is based on CronJobs and lets you provide different configurations for the CPA using cron expressions.

schedules:
  - name: night
    # disable overprovisioning Monday - Friday from 6pm
    cronTimeExpression: "0 18 * * 1-5"
    config:
      ladder:
        {
          "nodesToReplicas":
          [
            [ 0,0 ]
          ]
        }
  - name: day
    # enable overprovisioning Monday - Friday from 7am
    cronTimeExpression: "0 7 * * 1-5"
    config:
      ladder:
        {
          "nodesToReplicas":
          [
            [ 0,7 ],
            [ 8,4 ],
            [ 12,0 ]
          ]
        }

For example, we have these schedules installed in our CI cluster. The night schedule completely disables overprovisioning after 6pm and on weekends. We do have scheduled jobs that run at night or even on weekends. For these, the longer startup time does not matter, as no one is waiting for the jobs to complete.
Another schedule, called day, raises the overprovisioning back to the desired amount from 7am, Monday to Friday.

As you can imagine, adding overprovisioning to your cluster increases your total costs. Instead of providing the same amount of overprovisioning 24/7, we strongly recommend making use of the schedules. This way, you achieve the best balance between low startup times and additional costs.

Conclusion

CI/CD infrastructures on Kubernetes benefit very much from adding autoscaling to the cluster. It reduces compute costs to an absolute minimum in times without build jobs and is capable of handling the busiest times. Implementing overprovisioning in this setup reduces the startup-times of your jobs. To minimize the additional costs added by the overprovisioning, we introduced a scheduling feature, with which you can enable overprovisioning only in times when it’s needed and achieve a good balance between performance and costs.

Blog authors

Frederik Grieshaber

Cloud Consultant & Service Lead GitLab Professional Services

Thilo Wobker
