Remote training with GitLab-CI and DVC

27.1.2020 | 15 minutes of reading time

In many Data Science projects there is a point in time where the workstation under your desk is not the ideal machine to perform the model training anymore. More potent processors and GPUs are required, e.g. a suitable server in your company’s rack or a computing instance in the cloud. In this article, we show how you can build a custom remote training set up for your machine learning models. We aim for automation and team collaboration.

From a technology point of view, we use an EC2 instance on AWS to train the model. The automation is implemented as a GitLab CI pipeline that we trigger with special commit messages. We use DVC both to make the model training reproducible and to version data and model, with an S3 bucket on AWS serving as remote storage for DVC. However, the setup does not specifically require AWS and can be adapted, e.g. to on-premises hardware. For an introduction to DVC, we refer to here.

If you are interested in our work, you have likely read the very popular article CD4ML. In this blog post, we cover one particular topic of that article in depth, namely how to conduct the actual training, with a special focus on technical aspects as well as teamwork. We do not discuss the setup of data pipelines, how to deploy the application, or monitoring.

DVC remote training: The high-level idea

The following picture gives a high-level view of the general idea of the setup.

First, we build a Docker image which we use to train the machine learning model. The Docker image provides all runtime dependencies for the training, e.g. libraries or command line programs. However, it does not contain the training data; that data is stored at the training location and must be mounted into the container when executing the training.

The image is pulled to the preferred training location. The choice of the training location is very flexible: the container could be running on your laptop, on a powerful machine in your basement, or anywhere in the cloud. We use DVC to manage the training data. In particular, we use DVC’s functionality to permanently store and version data at the training location. This way we avoid transferring the entire training data to the training location for each training run. Instead, we use DVC to perform incremental updates. After the training, the training results as well as the associated training data are versioned and stored in the so-called DVC remote storage.

Finally, we trigger a GitLab release for any new version of the model. In this step we upload the training result to an S3 Bucket and use the GitLab releases API to generate a release page.

Our setup addresses teams where team members develop the ML pipeline simultaneously. Each team member uses a separate training location and therefore has access to exclusive compute power and a consistent training environment. However, they share a common remote storage location. The following picture visualizes the setup.

Key aspects

Before we dig into the details of our GitLab CI pipeline, we briefly discuss other key aspects of our setup. Afterwards, we discuss our project setup in more detail. Here the special focus is on the automated model training which consists of three stages in our GitLab CI pipeline.

The code repository

The complete code for the project can be found here. The code repository covers three main concerns.

A DVC project with an ML pipeline.
The runtime environment for the remote training.
Executing the training and releasing the newly trained model.

For the sake of conciseness, we decided to implement all three concerns in a single repository. However, they should in general be split into three different repositories.

The ML pipeline

As our focus is on remote training, we do not discuss details of the ML pipeline (such as model architecture, training configuration, etc.) and treat the ML pipeline as a black box. Therefore, the example code implements only a rudimentary ML pipeline for classifying the Fruits 360 image data set.

We use a simple Keras model and export the trained model in the ONNX format. Our colleague Nico Axtmann showcases the advantages of using the ONNX format in his blog post (German).

Remote training

As discussed in the section “The high-level idea”, the training does not take place in the GitLab runner. Instead, we execute the training on an EC2 instance. Dependencies of the training code, e.g. binary executables and libraries, are provided by a Docker container. After the training is completed, the container is destroyed.

However, in order to save time and bandwidth, we do not check out the DVC project at each and every container start. Instead, the project is checked out to persistent memory of the EC2 instance hosting the container and is mounted into the container. This way only incremental code and training data changes must be fetched before the training.
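The combination of a persistent checkout and a short-lived container can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual orchestration code: the host path /home/ec2-user/dvc_example, the mount point /workspace, the helper name training_commands, and the exact sequence of git and dvc calls are all hypothetical.

```python
import shlex


def training_commands(image, branch, host_project_dir="/home/ec2-user/dvc_example"):
    """Return the shell commands for one training run on the training machine.

    host_project_dir and the /workspace mount point are hypothetical values.
    """
    workdir = shlex.quote(host_project_dir)
    ref = shlex.quote(branch)
    return [
        # Incremental update of the code in the persistent checkout
        f"git -C {workdir} fetch origin",
        f"git -C {workdir} checkout {ref}",
        f"git -C {workdir} pull --ff-only origin {ref}",
        # Throwaway container (--rm) with the persistent checkout mounted;
        # dvc pull only transfers data that is not yet in the local cache
        f"docker run --rm -v {workdir}:/workspace -w /workspace {shlex.quote(image)} "
        "sh -c 'dvc pull && dvc repro train.dvc'",
    ]
```

Because the checkout and the DVC cache live on the host, removing the container after each run does not discard previously fetched data.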

Working in a team

Just like for software development, tooling does not eliminate the need to communicate with your team. (After all, tooling should help us establish reliable and efficient means of communication.) Good communication is of increased importance when developing software in the same (feature) branch. When training a model remotely, each ‘trainer’ prepares the repository by committing training data to it, then triggers the training process (via a special kind of commit), and after training has finished, the produced model is automatically committed to the repository as well.

Thus, remote training for the same feature branch is even more prone to race conditions in the commit history than common software development. In particular, in case of a feature-branch-based development process, merging to master must be coordinated carefully. Moreover, each training run relies on consistent data in the working directory. Consequently, two team members must not simultaneously trigger the training process in the same training location. Therefore, we utilize a different training location for each team member. This also allows everybody to independently choose the training branch. Plus, each training run has exclusive access to compute resources.

The CI pipeline

The GitLab CI pipeline definition is contained in the file .gitlab-ci.yml. In this section, the term pipeline refers to this CI pipeline, not to the DVC pipeline which we consider a given black box. The pipeline has three main concerns, namely building the Docker image that provides the training environment, the actual training, and the release of the trained model.

stages:
  - build_train_image
  - train
  - release

In our simple pipeline, each stage contains exactly one job, which for simplicity has the same name as the stage. On each commit, a selection of the stages is executed. We either execute the build_train_image stage alone, or the train stage followed by the release stage.

Each stage runs in a so-called GitLab runner somewhere in the cloud. In the train stage, the actual training is delegated away from the GitLab runner to a dedicated machine, as we discuss below.

Stage 1: Building the training image

Since the runtime environment for the training changes less frequently than the pipeline, we do not run the build_train_image stage on every commit. Instead, a special commit message is required to run this stage. In particular, the commit message has to start with build image.

The following snippet of .gitlab-ci.yml defines this trigger, where the variable $CI_COMMIT_MESSAGE is provided by the runner and contains the commit message.

.requires-build-image-commit-message:
  only:
    variables:
      - $CI_COMMIT_MESSAGE =~ /^build image/

This snippet is referenced in the build_train_image stage as follows.

build_train_image:
  stage: build_train_image
  extends:
    - .requires-build-image-commit-message

The training image definition is contained in the Dockerfile in the root of the GitLab repository. When the build_train_image stage runs, GitLab takes care of checking out the repository contents into the GitLab runner’s working directory. From here, the runner picks up the Dockerfile to build the training image.

We use kaniko to build the training image. Using kaniko does not require a Docker daemon in order to build the image. This increases security, since there is no need for privileges in the GitLab runner, and it usually speeds up the build.

We configure kaniko by using gcr.io/kaniko-project/executor:debug as the stage’s GitLab runner’s image. The first line of the stage script is required to configure kaniko correctly. The script uses some environment variables provided by GitLab. The custom variable $DOCKER_REGISTRY points to AWS’ Elastic Container Registry (ECR, for short), where the final image will be stored. The /kaniko/executor picks up the Dockerfile from the $CI_PROJECT_DIR variable, which is provided by default by the GitLab runner and refers to the checked out Git repository. The final image will be stored in the ECR under the name dvc_example:train_ followed by the name of the current branch, e.g. dvc_example:train_example_branch (the tag train is stored in the custom GitLab variable $DOCKER_TAG, the branch name is available in the default GitLab variable $CI_COMMIT_REF_NAME).

build_train_image:
  stage: build_train_image
  …
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # configure and run kaniko (ecr login creds come from env vars)
    - echo "{\"credHelpers\":{\"$DOCKER_REGISTRY\":\"ecr-login\"}}" > /kaniko/.docker/config.json
    - /kaniko/executor --context $CI_PROJECT_DIR \
      --dockerfile $CI_PROJECT_DIR/ \
      --destination $DOCKER_REGISTRY/dvc_example:${DOCKER_TAG}_${CI_COMMIT_REF_NAME}

Including the branch name in the image tag allows us to develop the training image without affecting team members working in other branches. (Note that, when creating a new branch, the training image must be built before the first training on this branch.)

Stage 2: Training the model

First, we present the base setup of the train stage. Training the model might be a time-consuming procedure. Therefore, as for building the training image, we do not train the model on each and every commit, but only if the committer specifically instructs the pipeline to execute the training. Again, a special commit message has to be provided that starts with trigger training followed by a descriptive tag (the tag marks the resulting training artifacts for release, see subsection below).

.requires-trigger-training-commit-message:
  only:
    variables:
      - $CI_COMMIT_MESSAGE =~ /^trigger training [a-zA-Z0-9_\-\.]+/
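To see which commit messages select which stages, the two trigger patterns can be tried out locally. The following is a minimal sketch that uses Python's re module as a stand-in for GitLab's pattern matching; the function name stages_for_commit is hypothetical, and the stage selection follows the description above (build_train_image alone, or train followed by release).

```python
import re

# Patterns copied from the .gitlab-ci.yml snippets in this article
BUILD_IMAGE = re.compile(r"^build image")
TRIGGER_TRAINING = re.compile(r"^trigger training [a-zA-Z0-9_\-\.]+")


def stages_for_commit(message):
    """Return the pipeline stages that a commit message would select."""
    if BUILD_IMAGE.search(message):
        return ["build_train_image"]
    if TRIGGER_TRAINING.search(message):
        return ["train", "release"]
    # Ordinary commits run none of the three stages
    return []
```

For example, a message such as "trigger training v1.2_baseline" selects the train and release stages, whereas "trigger training" without a tag selects nothing.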

In this stage, we use a python:3.6-alpine environment and supplement it with libraries and binaries needed to conduct the training. For example, we use the boto3 library to start and stop the EC2 instance. Credentials to communicate with AWS services are stored in custom GitLab runner environment variables, such that they are available to our calls of boto3.

train:
  stage: train
  extends:
    - .requires-trigger-training-commit-message
  image: python:3.6-alpine
  script:
    - pip3 install boto3 fire
    …

Next, we outline what is happening in the train stage. The actual training will not be executed in the GitLab runner, as, generally, the runner is located on an “all-purpose” machine, whereas training might require special hardware like GPUs. Therefore, the train stage delegates the training to another machine, namely an AWS EC2 instance. We do not discuss the delegation in detail, but we note that the script bin/orchestrate_ec2.py takes care of starting/stopping the EC2 instance for cost efficiency and monitors the running instance to detect when the training is concluded. For better inspection of the pipeline, we log the orchestration command with all its parameters before actually executing it.

train:
  …
  script:
    …
    - release_name=`bin/commit_message_to_release_name.sh …
      … "$CI_COMMIT_MESSAGE"`
    - cmd="python bin/orchestrate_ec2.py execute_orchestration …
      … $TRAIN_INSTANCE_FOR_USER $GITLAB_USER_EMAIL …
      … $CI_COMMIT_REF_NAME …
      … $DOCKER_REGISTRY/dvc_example ${DOCKER_TAG}_${CI_COMMIT_REF_NAME} $release_name"
    - echo $cmd
    - $cmd
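The start/stop logic mentioned above can be illustrated with a few boto3 calls. This is a minimal sketch and not the contents of bin/orchestrate_ec2.py: the function name run_on_instance and the run_training callback are hypothetical, and the ec2 argument is assumed to be a boto3 EC2 client such as boto3.client("ec2").

```python
def run_on_instance(ec2, instance_id, run_training):
    """Start the instance, run the training, and always stop the instance again.

    ec2          -- assumed to be a boto3 EC2 client, e.g. boto3.client("ec2")
    run_training -- hypothetical callable that launches the training on the
                    instance and blocks until the training has concluded
    """
    ec2.start_instances(InstanceIds=[instance_id])
    # Block until AWS reports the instance as running
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    try:
        return run_training(instance_id)
    finally:
        # Stop the instance even if the training failed, to avoid idle costs
        ec2.stop_instances(InstanceIds=[instance_id])
```

Stopping the instance in a finally block keeps a failed training run from leaving an expensive machine running.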

The variables given as arguments to orchestrate_ec2.py configure the training and release, as we discuss in the following subsections:

Training configuration

The variables $DOCKER_REGISTRY and $DOCKER_TAG determine the Docker training image that is pulled to the EC2 instance before starting the container. Both variables have the same values as in the build_train_image stage, i.e., we use the most recent build of the training image.

To allow for a flexible development workflow in teams, we support branch-based development. For example, a team member might develop a new DVC pipeline stage in a branch other than master before making it available for the rest of the team (by merging to master). Since team members might conduct training for different branches simultaneously, each member uses a separate “private” EC2 instance.

The variable $GITLAB_USER_EMAIL is provided by default by the GitLab runner and identifies the committer for the pipeline run in question. The mapping stored in the file $TRAIN_INSTANCE_FOR_USER lets the orchestration script determine the committer’s private instance to forward the training to. For security, the content of this file is supplied via a GitLab variable and is not committed to the repository. This is what a mapping might look like (EC2 instance IDs are fake):

{
  "marcel.mikl@codecentric.de": "i-0a9ec87b6ae9cf87b",
  "bert.besser@codecentric.de": "i-0b07acec0ef7a8fbc"
}
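Resolving the committer's instance from such a mapping takes only a few lines. The following is a minimal sketch of the lookup, assuming the mapping is a flat JSON object as shown above; the function name instance_for_user is hypothetical and not taken from the orchestration script.

```python
import json


def instance_for_user(mapping_path, user_email):
    """Look up the committer's private EC2 instance ID in the JSON mapping file."""
    with open(mapping_path, encoding="utf-8") as handle:
        mapping = json.load(handle)
    try:
        return mapping[user_email]
    except KeyError:
        # Fail early instead of training on somebody else's instance
        raise LookupError(f"no training instance configured for {user_email}") from None
```

In the pipeline, mapping_path would correspond to $TRAIN_INSTANCE_FOR_USER and user_email to $GITLAB_USER_EMAIL.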

The variable $CI_COMMIT_REF_NAME contains the branch name of the commit. The orchestration script instructs the EC2 instance to switch to the given branch before starting the training. Note that artifacts of all branches are pushed to the same DVC remote storage.

Release preparation

After training, as the final step on the EC2 instance, we push the newly generated binary artifacts to the DVC remote and commit and tag the DVC pipeline state in the Git repository. This is where the descriptive tag of the commit message comes into play: it serves as the Git tag for future reference to this training’s artifacts. We use the script bin/commit_message_to_release_name.sh to extract the third token of the commit message and store it in the variable $release_name. The orchestration script then forwards $release_name to the EC2 instance.
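Extracting the third token is simple enough to restate in Python. This is a minimal sketch of the same idea, not a port of bin/commit_message_to_release_name.sh; the function name and the validation of the first two tokens are assumptions.

```python
def release_name_from_commit_message(message):
    """Return the third whitespace-separated token of a trigger-training message."""
    tokens = message.split()
    if len(tokens) < 3 or tokens[:2] != ["trigger", "training"]:
        raise ValueError("expected a message of the form 'trigger training <tag> ...'")
    return tokens[2]
```

For the commit message "trigger training v1.0 retrain on new data", the release name is "v1.0".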

Stage 3: Releasing the model

After the train stage finishes successfully, the release stage takes care of making the training results publicly available, using the script bin/upload_and_release.sh. The script copies the file train.dvc to a public S3 bucket. It also creates a GitLab release page containing a link to the copied file (environment variables provide credentials and location information for the script). Again, the stage only runs if the committer requests a training with a commit message that starts with trigger training followed by a release tag; the release tag determines the name of the GitLab release.

release:
  stage: release
  image: python:3.6-alpine
  extends:
    - .requires-trigger-training-commit-message
  before_script:
    - apk add --no-cache curl
    - pip3 install awscli
  script:
    - release_name=`bin/commit_message_to_release_name.sh "$CI_COMMIT_MESSAGE"`
    - cmd="bin/upload_and_release.sh $release_name $BUCKET_NAME $CI_PROJECT_ID $GITLAB_TOKEN"
    - echo $cmd
    - $cmd
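The release page itself is created with a single call to the GitLab Releases API. The sketch below only builds that request and does not send it; it is not the contents of bin/upload_and_release.sh. The function name, the gitlab.com host, the S3 key layout, and the description text are assumptions, and the Git tag named by release_name is assumed to exist already (it is created after training, as described above).

```python
import json
import urllib.parse
import urllib.request


def build_release_request(release_name, bucket, project_id, token,
                          gitlab_url="https://gitlab.com", key="train.dvc"):
    """Build (but do not send) a POST request that creates a GitLab release."""
    # Hypothetical layout: one folder per release in a public S3 bucket
    asset_url = f"https://{bucket}.s3.amazonaws.com/{release_name}/{key}"
    payload = {
        "name": release_name,
        "tag_name": release_name,
        "description": f"Training result {release_name}",
        "assets": {"links": [{"name": key, "url": asset_url}]},
    }
    project = urllib.parse.quote(str(project_id), safe="")
    return urllib.request.Request(
        url=f"{gitlab_url}/api/v4/projects/{project}/releases",
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the release; the upload of train.dvc to the bucket is a separate step that is omitted here.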

Further Thoughts

Separation of concerns

We chose to trigger the training using a special commit message, since we want training results to be tagged properly. That is, we use tags for releases of training results exclusively. Separating the GitLab pipeline orchestrating the training into another code repository would allow us to also use tags for triggering the training. In our opinion, this approach allows for better inspection into the history of who/when/… triggered a training.

Moving the code that creates the training container out of the DVC project’s repository clearly improves separation of concerns. The beneficiaries of this procedure would be e.g. data scientists, since their work environments for the ML pipeline are not ‘polluted’ with cloud and container concerns. However, this separation introduces a dependency, since the runtime environment must be prepared with the required software for the actual training.

Using the trained model in an application

Typically, the trained model will be used in an application, e.g. a web server providing predictions via a REST API. In order to build an application using the model, we propose the following method to retrieve the model’s binary file referenced in train.dvc:

Initialize a DVC repository in the application repository and configure the same remote storage as for the model training repository.
Download the train.dvc file (or rather, any particular release of train.dvc) and place it in the application project.
Finally, execute dvc pull train.dvc to download all output files defined in the train.dvc file, e.g. the model.onnx binary.
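The steps above can be sketched as a small helper that an application build might call. This is a minimal sketch under stated assumptions: the remote name storage and both function names are hypothetical, the application project is assumed to be a Git repository, and train.dvc is assumed to have been placed in the project directory before the final command runs.

```python
import subprocess


def model_retrieval_commands(remote_url, remote_name="storage"):
    """Return the DVC commands that fetch the outputs listed in train.dvc."""
    return [
        ["dvc", "init"],
        # Same remote storage as the model training repository
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
        # train.dvc must already be present in the project directory
        ["dvc", "pull", "train.dvc"],
    ]


def retrieve_model(remote_url, project_dir="."):
    """Run the commands in project_dir and stop at the first failure."""
    for argv in model_retrieval_commands(remote_url):
        subprocess.run(argv, cwd=project_dir, check=True)
```

Keeping the command list separate from its execution makes the sequence easy to inspect or log before anything is run.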

In many cases, the model binary is not the only output file of the ML pipeline. For example, we also generate a model.config file which contains parameters to score data with the model. Depending on the context, various files and artifacts from the ML pipeline are required to build an application. By employing DVC ML pipelines, the .dvc file already defines the collection of the final output files of our ML pipeline and hence there is no need to additionally create a bundle (e.g. a zip file) with all relevant artifacts.

Note: The straightforward way to retrieve a particular binary file is to use the dvc get command, see the documentation. This command ‘downloads a file from a DVC project’, where the desired revision of the file is determined by the --rev parameter.

Reproducibility

Using DVC to version the data and the training results allows us to reproduce all results for specific releases. This is particularly relevant if the training results are used inside applications. If there is a problem with the model in production, the results can easily be reproduced for examination. In this case we use git checkout release-tag followed by dvc repro train.dvc, and DVC carries out all steps to reproduce the results automatically.

Automated testing

Our pipeline does not employ any kind of automated testing. Our only indication of failure is a process exiting with an error, which in turn fails the entire pipeline. For productively developing an ML pipeline, a smoke test is desirable: whenever a commit is pushed to the GitLab repository, the entire pipeline should run on reduced data. If this run succeeds, we can be more confident that the training process on the full data set succeeds as well (and does not fail ‘last-minute’ after days of computation).

Some setups might profit from automatic detection of performance degradation. For example, an additional pipeline stage fails if, say, accuracy of the newly trained model is worse than the highest achieved previously.
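Such a degradation check could look like the following. This is a minimal sketch, not part of our pipeline: the function name, the tolerance parameter, and the way previous accuracies are collected are assumptions.

```python
def passes_quality_gate(new_accuracy, previous_accuracies, tolerance=0.0):
    """Return True if the new model is at least as accurate as the best earlier one.

    tolerance allows small regressions, e.g. to absorb training noise.
    """
    if not previous_accuracies:
        # First model: there is nothing to compare against
        return True
    return new_accuracy >= max(previous_accuracies) - tolerance
```

An additional pipeline stage would call this function and exit with a non-zero status when it returns False, which fails the pipeline before the release stage runs.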

Conclusion

In this blog post we discussed a basic setup for automating the training of machine learning models in production environments. We also provided an initial implementation of the setup and gave some insights into our reasoning. The setup should be considered a possible starting point for building an automation setup for other projects. Typically, there is no single setup that is best for all projects, and appropriate adjustments and considerations are required. We have already built similar model training pipelines based on these ideas for our customers, and these pipelines are used successfully in production.

Besides customized solutions, there are also several existing cloud solutions, such as AWS SageMaker and Google Cloud AI Platform, which tackle model training (and even model serving) within the constraints of their respective frameworks. Depending on the use case and the data involved, it can make good sense to use such cloud services. However, this discussion is a topic for an additional blog post.

Blog authors: Marcel Mikl & Bert Besser


Cloud
Computer Vision
Data
Künstliche Intelligenz
Google Cloud
Machine Learning

17.5.2021 | 5 Minuten Lesezeit

Nils Bauroth

Sven Rediske

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Du stehst vor einer großen IT-Herausforderung? Wir sorgen für eine maßgeschneiderte Unterstützung. Informiere dich jetzt.

Hilf uns, noch besser zu werden.

Wir sind immer auf der Suche nach neuen Talenten. Auch für dich ist die passende Stelle dabei.

Contact

Send



Blog authors: Marcel, Bert
