Remote training with GitLab-CI and DVC

In many Data Science projects there is a point in time where the workstation under your desk is not the ideal machine to perform the model training anymore. More potent processors and GPUs are required, e.g. a suitable server in your company’s rack or a computing instance in the cloud. In this article, we show how you can build a custom remote training set up for your machine learning models. We aim for automation and team collaboration.

From a technology point of view, we use an EC2 instance on AWS for the training of the model. The automation is implemented via a GitLab-CI pipeline that we trigger with special commit messages. We use DVC both to make the model training reproducible and to version data and model, with an S3 bucket on AWS as remote storage for DVC. However, the setup does not specifically require AWS and can be adapted, e.g. to on-premises hardware. For an introduction to DVC, see here.

If you are interested in our work, you have likely read the very popular article CD4ML. In our blog post, we cover one particular topic of that article in depth, namely how to conduct the actual training, with special focus on technical aspects as well as teamwork. We do not discuss the setup of data pipelines, how to deploy the application, or monitoring.

DVC remote training: The high-level idea

The following picture gives a high-level view of the general idea of the setup.

[Figure: DVC remote training, high-level setup]

First, we build a Docker image which we use to train the machine learning model. The Docker image provides all runtime dependencies for the training, e.g. libraries or command line programs. However, it does not contain the training data; that data is stored at the training location and must be mounted into the container when executing the training.

The image is pulled to the preferred training location. The choice of the training location is very flexible: the container could be running on your laptop, on a powerful machine in your basement, or anywhere in the cloud. We use DVC to manage the training data. In particular, we use DVC’s functionality to permanently store and version data at the training location. This way we avoid transferring the entire training data to the training location for each training. Instead, we utilize DVC to perform incremental updates. After the training, the training results as well as the associated training data are versioned and stored in the so-called DVC remote storage.

Finally, we trigger a GitLab release for any new version of the model. In this step we upload the training result to an S3 Bucket and use the GitLab releases API to generate a release page.

Our setup addresses teams where team members develop the ML pipeline simultaneously. Each team member uses a separate training location and therefore has access to exclusive compute power and a consistent training environment. However, they share a common remote storage location. The following picture visualizes the setup.

[Figure: setup with multiple team members]

Key aspects

Before we dig into the details of our GitLab CI pipeline, we briefly discuss other key aspects of our setup. Afterwards, we discuss our project setup in more detail. Here the special focus is on the automated model training which consists of three stages in our GitLab CI pipeline.

The code repository

The complete code for the project can be found here. The code repository covers three main concerns.

  • A DVC project with an ML pipeline.
  • The runtime environment for the remote training.
  • Executing the training and releasing the newly trained model.

For the sake of conciseness, we decided to implement all three concerns in a single repository. However, they should in general be split into three different repositories.

The ML pipeline

As our focus is on remote training, we do not discuss details of the ML pipeline (such as model architecture, training configuration, etc.) and treat the ML pipeline as a black box. Therefore, the example code implements only a rudimentary ML pipeline for classifying the Fruits 360 image data set.

We use a simple Keras model and export the trained model in the ONNX format. Our colleague Nico Axtmann showcases the advantages of using the ONNX format in his blog post (German).

Remote training

As discussed in the section “DVC remote training: The high-level idea”, the training does not take place in the GitLab runner. Instead, we execute the training on an EC2 instance. Dependencies of the training code, e.g. binary executables and libraries, are provided by a Docker container. After the training is completed, the container is destroyed.

However, in order to save time and bandwidth, we do not check out the DVC project at each and every container start. Instead, the project is checked out to persistent memory of the EC2 instance hosting the container and is mounted into the container. This way only incremental code and training data changes must be fetched before the training.
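
The following is a minimal sketch of how such a container start could be assembled. It is not taken from the example repository: the host path, the mount point /workspace, the function names, and the assumption that the container runs dvc repro train.dvc are all ours, chosen for illustration only. The functions merely build the command lines, so they can be inspected before anything is executed.

```python
from typing import List


def update_commands(branch: str) -> List[List[str]]:
    """Commands that bring the persistent checkout up to date (hypothetical helper)."""
    return [
        ["git", "fetch", "origin"],
        ["git", "checkout", branch],
        ["git", "pull", "origin", branch],
        ["dvc", "pull"],  # downloads only data that is missing from the local DVC cache
    ]


def docker_run_command(image: str, host_project_dir: str,
                       container_dir: str = "/workspace") -> List[str]:
    """Build a `docker run` call that mounts the checkout into the container."""
    return [
        "docker", "run",
        "--rm",                                       # remove the container after the training
        "-v", f"{host_project_dir}:{container_dir}",  # mount the persistent checkout
        "-w", container_dir,                          # run inside the mounted project
        image,
        "dvc", "repro", "train.dvc",                  # assumed training entry point
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) on the training machine. Because the checkout lives on the host, the git and dvc calls only transfer what has changed since the previous run.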

Working in a team

Just like for software development, tooling does not eliminate the need to communicate with your team. (After all, tooling should help us establish reliable and efficient means of communication.) Good communication is of increased importance when developing software in the same (feature) branch. When training a model remotely, each ‘trainer’ prepares the repository by committing training data to it, then triggers the training process (via a special kind of commit), and after training has finished, the produced model is automatically committed to the repository as well.

Thus, remote training for the same feature branch is even more prone to race conditions in the commit history than common software development. In particular, in case of a feature-branch-based development process, merging to master must be coordinated carefully. Moreover, each training run relies on consistent data in the working directory. Consequently, two team members must not simultaneously trigger the training process in the same training location. Therefore, we utilize a different training location for each team member. This also allows everybody to independently choose the training branch. Plus, each training run has exclusive access to compute resources.

The CI pipeline

The GitLab CI pipeline definition is contained in the file .gitlab-ci.yml. In this section, the term pipeline refers to this CI pipeline, not to the DVC pipeline which we consider a given black box. The pipeline has three main concerns, namely building the Docker image that provides the training environment, the actual training, and the release of the trained model.

stages:
  - build_train_image
  - train
  - release

In our simple pipeline, each stage contains exactly one job, which for simplicity has the same name as the stage. On each commit, a selection of the stages is executed. We either execute the build_train_image stage alone, or the train stage followed by the release stage.

Each stage runs on a so-called GitLab runner somewhere in the cloud. In the train stage, the actual training is delegated away from the GitLab runner to a dedicated machine, as we discuss below.

Stage 1: Building the training image

Since the runtime environment for the training changes less frequently than the pipeline, we do not run the build_train_image stage on every commit. Instead, a special commit message is required to run this stage. In particular, the commit message has to start with build image.

The following snippet of .gitlab-ci.yml defines this trigger, where the variable $CI_COMMIT_MESSAGE is provided by the runner and contains the commit message.

.requires-build-image-commit-message:
  only:
    variables:
      - $CI_COMMIT_MESSAGE =~ /^build image/

This snippet is referenced in the build_train_image stage as follows.

build_train_image:
  stage: build_train_image
  extends:
    - .requires-build-image-commit-message

The training image definition is contained in the Dockerfile in the root of the GitLab repository. When the build_train_image stage runs, GitLab takes care of checking out the repository contents into the GitLab runner’s working directory. From here, the runner picks up the Dockerfile to build the training image.

We use kaniko to build the training image. Using kaniko does not require a Docker daemon in order to build the image. This increases security, since there is no need for privileges in the GitLab runner, and it usually speeds up the build.

We configure kaniko by using gcr.io/kaniko-project/executor:debug as the stage’s GitLab runner’s image. The first line of the stage script is required to configure kaniko correctly. The script uses some environment variables provided by GitLab. The custom variable $DOCKER_REGISTRY points to AWS’ Elastic Container Registry (ECR, for short), where the final image will be stored. The /kaniko/executor picks up the Dockerfile from the $CI_PROJECT_DIR variable, which is provided by default by the GitLab runner and refers to the checked out Git repository. The final image will be stored in the ECR under the name dvc_example:train_ followed by the name of the current branch, e.g. dvc_example:train_example_branch (the tag train is stored in the custom GitLab variable $DOCKER_TAG, the branch name is available in the default GitLab variable $CI_COMMIT_REF_NAME).

build_train_image:
  stage: build_train_image
  …
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # configure and run kaniko (ecr login creds come from env vars)
    - echo "{\"credHelpers\":{\"$DOCKER_REGISTRY\":\"ecr-login\"}}" > /kaniko/.docker/config.json
    - /kaniko/executor --context $CI_PROJECT_DIR \
      --dockerfile $CI_PROJECT_DIR/ \
      --destination $DOCKER_REGISTRY/dvc_example:${DOCKER_TAG}_${CI_COMMIT_REF_NAME}

Including the branch name in the image tag allows us to develop the training image without affecting team members working in other branches. (Note that, when creating a new branch, the training image must be built before the first training on this branch.)
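
To illustrate the naming scheme, here is a small sketch that composes the image reference in the same way as the --destination argument. The function name and the validation step are our additions and not part of the example repository. The check reflects Docker's rule that a tag may only contain letters, digits, underscores, periods and dashes, so a branch name containing a slash would not yield a usable tag.

```python
import re

# Docker tag rule: letters, digits, '_', '.', '-'; must not start with '.' or '-';
# at most 128 characters.
_VALID_TAG = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def training_image(registry: str, docker_tag: str, branch: str,
                   repository: str = "dvc_example") -> str:
    """Compose <registry>/<repository>:<docker_tag>_<branch> (hypothetical helper)."""
    tag = f"{docker_tag}_{branch}"
    if not _VALID_TAG.match(tag):
        raise ValueError(f"branch name yields an invalid Docker tag: {tag!r}")
    return f"{registry}/{repository}:{tag}"
```

For example, training_image(registry, "train", "example_branch") ends in dvc_example:train_example_branch, whereas a branch called feature/x raises a ValueError.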

Stage 2: Training the model

First, we present the base setup of the train stage. Training the model might be a time-consuming procedure. Therefore, as for building the training image, we do not train the model on each and every commit, but only if the committer specifically instructs the pipeline to execute the training. Again, a special commit message has to be provided that starts with trigger training followed by a descriptive tag (the tag marks the resulting training artifacts for release, see subsection below).

.requires-trigger-training-commit-message:
  only:
    variables:
      - $CI_COMMIT_MESSAGE =~ /^trigger training [a-zA-Z0-9_\-\.]+/
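
The two trigger patterns can be tried out locally before pushing a commit. The sketch below mirrors the expressions from .requires-build-image-commit-message and .requires-trigger-training-commit-message using Python's re module. The function name stages_for is ours, and since GitLab evaluates the patterns with its own regular expression engine, edge cases may behave slightly differently.

```python
import re
from typing import List

BUILD_IMAGE = re.compile(r"^build image")
TRIGGER_TRAINING = re.compile(r"^trigger training [a-zA-Z0-9_\-\.]+")


def stages_for(commit_message: str) -> List[str]:
    """Return the stages that a commit message would select (illustrative only)."""
    if BUILD_IMAGE.search(commit_message):
        return ["build_train_image"]
    if TRIGGER_TRAINING.search(commit_message):
        return ["train", "release"]
    return []  # ordinary commits run no stage at all
```

A message such as "trigger training v1.0" selects the train and release stages, while "trigger training" without a tag selects nothing.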

In this stage, we use a python:3.6-alpine environment and supplement it with libraries and binaries needed to conduct the training. For example, we use the boto3 library to start and stop the EC2 instance. Credentials to communicate with AWS services are stored in custom GitLab runner environment variables, such that they are available to our calls of boto3.

train:
  stage: train
  extends:
    - .requires-trigger-training-commit-message
  image: python:3.6-alpine
  script:
    - pip3 install boto3 fire
    …

Next, we outline what is happening in the train stage. The actual training will not be executed in the GitLab runner, as, generally, the runner is located on an “all-purpose” machine, whereas training might require special hardware like GPUs. Therefore, the train stage delegates the training to another machine, namely an AWS EC2 instance. We do not discuss the delegation in detail, but we note that the script bin/orchestrate_ec2.py takes care of starting/stopping the EC2 instance for cost efficiency and monitors the running instance to detect when the training is concluded. For better inspection of the pipeline, we log the orchestration command with all its parameters before actually executing it.
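
Since the orchestration script is not shown here, the following is only a rough sketch of the start/monitor/stop pattern it is described to follow, not the actual content of bin/orchestrate_ec2.py. The function name and the training_finished callback are hypothetical. The ec2 argument is expected to behave like a boto3 EC2 client, of which we use start_instances, stop_instances and the instance_running waiter.

```python
import time
from typing import Callable


def run_training_on_instance(ec2, instance_id: str,
                             training_finished: Callable[[], bool],
                             poll_seconds: float = 30.0) -> None:
    """Start the instance, wait for the training to finish, then always stop it."""
    ec2.start_instances(InstanceIds=[instance_id])
    try:
        # Block until EC2 reports the instance as running.
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        # Poll a caller-supplied check until the training has concluded.
        while not training_finished():
            time.sleep(poll_seconds)
    finally:
        # Stop the instance even if monitoring fails, to avoid idle costs.
        ec2.stop_instances(InstanceIds=[instance_id])
```

In a real run, the client would be created with boto3.client("ec2"). Passing the client in as an argument keeps the function easy to exercise with a stub, and the finally block ensures that a failed training does not leave a paid instance running.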

train:
  …
  script:
    …
    - release_name=`bin/commit_message_to_release_name.sh …
      … "$CI_COMMIT_MESSAGE"`
    - cmd="python bin/orchestrate_ec2.py execute_orchestration …
      … $TRAIN_INSTANCE_FOR_USER $GITLAB_USER_EMAIL …
      … $CI_COMMIT_REF_NAME …
      … $DOCKER_REGISTRY/dvc_example ${DOCKER_TAG}_${CI_COMMIT_REF_NAME} $release_name"
    - echo $cmd
    - $cmd

The variables given as arguments to orchestrate_ec2.py configure the training and release, as we discuss in the following subsections:

Training configuration

The variables $DOCKER_REGISTRY and $DOCKER_TAG determine the Docker training image that is pulled to the EC2 instance before starting the container. Both variables have the same values as in the build_train_image stage, i.e., we use the most recent build of the training image.

To allow for a flexible development workflow in teams, we support branch-based development. For example, a team member might develop a new DVC pipeline stage in a branch other than master before making it available for the rest of the team (by merging to master). Since team members might conduct training for different branches simultaneously, each member uses a separate “private” EC2 instance.

The variable $GITLAB_USER_EMAIL is provided by default by the GitLab runner and identifies the committer for the pipeline run in question. The mapping stored in the file $TRAIN_INSTANCE_FOR_USER lets the orchestration script determine the committer’s private instance to forward the training to. For security, the content of the file $TRAIN_INSTANCE_FOR_USER is also a GitLab variable and not committed to the repository. This is what a mapping might look like (EC2 instance IDs are fake):

{
  "marcel.mikl@codecentric.de": "i-0a9ec87b6ae9cf87b",
  "bert.besser@codecentric.de": "i-0b07acec0ef7a8fbc"
}
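
The lookup itself is simple. As a sketch (the function name is ours, and the orchestration script may well do this differently), the mapping can be parsed as JSON and indexed by the committer's email address:

```python
import json


def instance_for_user(mapping_json: str, user_email: str) -> str:
    """Return the EC2 instance ID assigned to the given committer."""
    mapping = json.loads(mapping_json)
    if user_email not in mapping:
        # Fail early with a clear message instead of starting the wrong machine.
        raise KeyError(f"no training instance configured for {user_email}")
    return mapping[user_email]
```

In the pipeline, mapping_json would be the content of the file that $TRAIN_INSTANCE_FOR_USER points to, and user_email the value of $GITLAB_USER_EMAIL.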

The variable $CI_COMMIT_REF_NAME contains the branch name of the commit. The orchestration script instructs the EC2 instance to switch to the given branch before starting the training. Note that artifacts of all branches are pushed to the same DVC remote storage.

Release preparation

After training, in the final step of the EC2 instance we push newly generated binary artifacts to the DVC remote and commit/tag the DVC pipeline state in the Git repository. This is where the descriptive tag of the commit message comes into play; it serves as the Git commit tag for future reference of this training’s artifacts. We use the script bin/commit_message_to_release_name.sh to extract the third token of the commit message and store it in the variable $release_name. The orchestration script then forwards $release_name to the EC2 instance.
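
The extraction of the third token is done by a shell script in our repository. Purely for illustration, an equivalent in Python could look as follows. The function name and the error handling are assumptions and not a transcription of bin/commit_message_to_release_name.sh.

```python
def release_name_from_commit_message(message: str) -> str:
    """Return the third whitespace-separated token of a 'trigger training' message."""
    tokens = message.split()
    if len(tokens) < 3 or tokens[:2] != ["trigger", "training"]:
        raise ValueError("expected a message of the form 'trigger training <tag> ...'")
    return tokens[2]
```

For the commit message "trigger training v1.2 more augmentation", the release name is v1.2. Everything after the tag is free text.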

Stage 3: Releasing the model

After the train stage finishes successfully, the release stage takes care of making the training results available publicly, using the script bin/upload_and_release.sh. The script copies the file train.dvc to a public S3 bucket. Also, it creates a GitLab release page containing a link to the copied file (environment variables provide credentials and location information for the script). Again, the stage only runs if the committer requests a training using a commit message of the form trigger training <release tag>, where the release tag determines the name of the GitLab release.

release:
  stage: release
  image: python:3.6-alpine
  extends:
    - .requires-trigger-training-commit-message
  before_script:
    - apk add --no-cache curl
    - pip3 install awscli
  script:
    - release_name=`bin/commit_message_to_release_name.sh "$CI_COMMIT_MESSAGE"`
    - cmd="bin/upload_and_release.sh $release_name $BUCKET_NAME $CI_PROJECT_ID $GITLAB_TOKEN"
    - echo $cmd
    - $cmd
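
The release script is a shell script that calls the GitLab releases API. To give an impression of what such a call contains, the sketch below builds (but does not send) a comparable request in Python. It assumes the GitLab endpoint POST /projects/:id/releases with an asset link. The S3 key layout, the description text, the default GitLab URL, and the function name are hypothetical and may differ from bin/upload_and_release.sh.

```python
import json
import urllib.request


def build_release_request(project_id: str, release_name: str, bucket_name: str,
                          token: str,
                          gitlab_url: str = "https://gitlab.com") -> urllib.request.Request:
    """Prepare a request that creates a GitLab release linking to train.dvc."""
    # Assumed location of the uploaded file in the public bucket.
    asset_url = f"https://{bucket_name}.s3.amazonaws.com/{release_name}/train.dvc"
    payload = {
        "name": release_name,
        "tag_name": release_name,
        "description": f"Training result {release_name}",
        "assets": {"links": [{"name": "train.dvc", "url": asset_url}]},
    }
    return urllib.request.Request(
        url=f"{gitlab_url}/api/v4/projects/{project_id}/releases",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "PRIVATE-TOKEN": token},
        method="POST",
    )
```

The request would be sent with urllib.request.urlopen(request). Building it separately makes it possible to log the target URL and payload first, in the same spirit as the echo $cmd line in the pipeline.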

Further Thoughts

Separation of concerns

We chose to trigger the training using a special commit message, since we want training results to be tagged properly. That is, we use tags for releases of training results exclusively. Separating the GitLab pipeline orchestrating the training into another code repository would allow us to also use tags for triggering the training. In our opinion, this approach allows for better inspection into the history of who/when/… triggered a training.

Moving the code that creates the training container out of the DVC project’s repository clearly improves separation of concerns. The beneficiaries of this procedure would be e.g. data scientists, since their work environments for the ML pipeline are not ‘polluted’ with cloud and container concerns. However, this separation introduces a dependency, since the runtime environment must be prepared with the required software for the actual training.

Using the trained model in an application

Typically, the trained model will be used in an application, e.g. a web server providing prediction with a REST API. In order to build an application using the model, our proposed method to retrieve the model’s binary file referenced in train.dvc is

  • Initialize a DVC repository in the application repository and configure the same remote storage as for the model training repository.
  • Download the train.dvc file (or rather, any particular release of train.dvc) and place it in the application project.
  • Finally, execute dvc pull train.dvc to download all output files, e.g. the model.onnx binary, defined in the train.dvc file.

In many cases, the model binary is not the only output file of the ML pipeline. For example, we also generate a model.config file which contains parameters to score data with the model. Depending on the context, various files and artifacts from the ML pipeline are required to build an application. By employing DVC ML pipelines, the .dvc file already defines the collection of the final output files of our ML pipeline and hence there is no need to additionally create a bundle (e.g. a zip file) with all relevant artifacts.

Note: The straightforward way to retrieve a particular binary file is to use the dvc get command, see the documentation. This command ‘downloads a file from a DVC project’, where the desired revision of the file is determined by the --rev parameter.
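
As a small illustration of the dvc get variant, the following sketch assembles the command for a given release tag. The helper name, the repository URL and the file names in the usage note are placeholders. The options used are --rev for the revision and -o for the output path.

```python
from typing import List, Optional


def dvc_get_command(repo_url: str, path: str, rev: str,
                    out: Optional[str] = None) -> List[str]:
    """Build a `dvc get` call that fetches one file at a given Git revision."""
    cmd = ["dvc", "get", repo_url, path, "--rev", rev]
    if out is not None:
        cmd += ["-o", out]  # optional target path in the application project
    return cmd
```

For example, dvc_get_command("https://gitlab.example.com/team/dvc_example.git", "model.onnx", "v1.2") yields a command that can be run via subprocess.run(cmd, check=True) during the application build.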

Reproducibility

The use of DVC to version the data and the training result allows for reproduction of all the results for specific releases. This is particularly relevant if the training results are used inside applications. In case there is a problem with the model in production, the results can easily be reproduced for examination. In this case, we use git checkout release-tag followed by dvc repro train.dvc, and DVC will carry out all steps to reproduce the results automatically.

Automated testing

Our pipeline does not employ any kind of automated testing. Our only indication of failure is a process exiting with an error, which in turn fails the entire pipeline. For productively developing an ML pipeline, a smoke test is desirable: whenever a commit is pushed to the GitLab repository, the entire pipeline should run on reduced data. If the pipeline succeeds, we can be more confident that the training process on the full data set succeeds as well (and does not fail ‘last-minute’ after days of computation).
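
One possible way to obtain such reduced data, sketched here under the assumption that the images are grouped by class label, is to keep a small, deterministic number of files per class. The function name and the default of five files are arbitrary choices for illustration and are not part of our pipeline.

```python
from typing import Dict, List


def smoke_test_subset(files_by_class: Dict[str, List[str]],
                      per_class: int = 5) -> Dict[str, List[str]]:
    """Keep the first `per_class` files (in sorted order) of every class."""
    return {
        label: sorted(files)[:per_class]
        for label, files in files_by_class.items()
    }
```

Sorting before slicing makes the selection independent of the order in which the files were listed, so repeated smoke tests run on the same subset.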

Some setups might profit from automatic detection of performance degradation. For example, an additional pipeline stage fails if, say, accuracy of the newly trained model is worse than the highest achieved previously.
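
Such a check could be as small as the following sketch. It is not part of our pipeline. The function name, the tolerance parameter, and the way previous accuracies are collected are assumptions. Raising SystemExit with a message makes the process exit with a non-zero status, which is what fails a pipeline job.

```python
from typing import Iterable


def check_no_degradation(new_accuracy: float,
                         previous_accuracies: Iterable[float],
                         tolerance: float = 0.0) -> None:
    """Exit with an error if the new model is worse than the best previous one."""
    best = max(previous_accuracies, default=None)
    if best is not None and new_accuracy < best - tolerance:
        raise SystemExit(
            f"accuracy {new_accuracy:.4f} is below best previous accuracy {best:.4f}"
        )
```

With no previous results, the check passes, so the very first training run is never blocked. A small tolerance can absorb run-to-run noise.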

Conclusion

In this blog post, we discussed a basic setup that automates the training of machine learning models in production environments. We also provided an initial implementation of the setup and gave some insights into our reasoning. The setup should be considered a possible starting point for building an automation setup for other projects. Typically, there is not one setup which is best for all projects, and appropriate adjustments and considerations are required. We have already built similar model training pipelines based on these ideas for our customers, and they are used successfully in production.

Besides customized solutions, there are also several existing cloud solutions such as AWS SageMaker and Google Cloud AI Platform, which tackle the model training (and even the model serving) under given framework conditions. Depending on the use case and the data involved, it can make good sense to use cloud services. However, this discussion is a topic for an additional blog post.

Marcel Mikl

Due to the mathematical influence during his doctorate, Marcel is accustomed to solving problems in a structured way. Currently, Marcel is particularly interested in technologies around the topic of data science/engineering and how added value can be realized.

Bert Besser

IT-Consultant, holding a PhD in computer science theory, with an interest in machine learning.
