A walkthrough of DVC

13.3.2019 | 12 minutes of reading time

This post is on how to systematially organize Machine Learning (ML) model development. A model’s performance improves when, e.g., you tune its parameters or when more training data becomes available. To measure improvement, you should track at least which data was used for training, the model’s current definition, model configuration (hyper parameters, etc.), and the achieved model performance. In particular, you should version these properties together.

Meet DVC (data version control), which supports you with this task, and more (e.g., check out this follow-up post on how to use DVC-managed data in other projects).

Implementing a DVC pipeline makes all of data loading, preprocessing, training, performance evaluation, etc. fully reproducible (and therefore also allows you to automate retraining). Training data, model configuration, the readily trained model, and performance metrics are versioned such that you can conveniently skip back to any given version and inspect all associated configuration and data. Also, DVC provides an overview of metrics for all versions of your pipeline, which helps identify your best work. Training data, trained models, performance metrics, etc. are shared with team members to allow for efficient collaboration.

A toy project

This post walks you through an example project (available in this GitHub repository ), in which a neural network is trained to classify images of handwritten digits from the MNIST data set . As the available image set of handwritten digits grows, we retrain the model to improve its accuracy. (Note that, for intuition, in the following figure we depict a much simpler neural network architecture than actually used in the project.)

To prepare the working environment, clone the above Git repository, change into the cloned directory, and run the start_environment.sh bash, see the following code block. The script start_environment.sh creates a Docker image and container, the bash parameter makes the script log you into the container as user dvc in the working folder /home/dvc/walkthrough. Commands given throughout this article can be found in the script /home/dvc/scripts/walkthrough.sh, also available in the container. (Note: Calling ./start_environment.sh with the additional parameter walkthrough executes the walkthrough.sh script before logging you in.)


# $ is the host prompt in the cloned folder
# $$ is the container prompt in the working folder /home/dvc/walkthrough

$ git clone https://github.com/bbesser/dvc-walkthrough
$ cd dvc-walkthrough
$ ./start_environment.sh bash
$$ cat /home/dvc/scripts/walkthrough.sh

Preparing the repository

The working folder /home/dvc/walkthrough already contains the subfolder code holding the required code. Let us turn /home/dvc/walkthrough into a "DVC-enabled" Git repository. DVC is built on top of Git. All DVC configuration is versioned in the same Git repository as your model code, in the subfolder .dvc, see the following code block. Note that tagging this freshly initialized repository is not a must—we create a tag only for the purpose of later parts of this walkthrough.


# Code is shortened, for brevity.
# See /home/dvc/scripts/walkthrough.sh for complete code.

$$ git init
$$ git add code
$$ git commit -m "initial import"
$$ dvc init
$$ git status
        new file: .dvc/.gitignore
        new file: .dvc/config
$$ git add .dvc
$$ git commit -m "init dvc"
$$ git tag -a 0.0 -m "freshly initialized with no pipeline defined, yet"

Defining the pipeline

Our pipeline consists of three stages, namely 1. loading data, 2. training, and 3. evaluation, and the pipeline also produces performance metrics of the trained model. Here is a schematic:

For simplicity, we implement a dummy loading stage, which just copies given raw image data into the repository. However, since our goal is to retrain the model as more and more data is available, our loading stage can be configured for the amount of data to be copied. This configuration is located in the file config/load.json. Similarly, training stage configuration is located in config/train.json (our neural network’s architecture allows you to alter the number of convolution filters). Let’s put this confiuration under version control.


$$ mkdir config
$$ echo '{ "num_images" : 1000 }' > config/load.json
$$ echo '{ "num_conv_filters" : 32 }' > config/train.json
$$ git add config/load.json config/train.json
$$ git commit -m "add config"

Stages of a DVC pipeline are connected by dependencies and outputs. Dependencies and outputs are simply files. E.g. our load stage depends on the configuration file config/load.json. The load stage outputs training image data. If upon execution of our load stage the set of training images changes, the training stage picks up these changes, since it depends on the training images. Similarly, a changed model will be evaluated to obtain its performance metrics. Once the pipeline definition is in place, DVC takes care of reproducing only those stages with changed dependencies, as we discuss in detail in the section Reproducing the pipeline .

The following dvc run command configures our load stage, where the stage’s definition is stored in the file given by the -f parameter, dependencies are provided using the -d parameter, and training image data is output into the folder data given by the -o parameter. DVC immediately executes the stage, already generating the desired training data.


$$ dvc run -f load.dvc -d config/load.json -o data python code/load.py
Running command:
        python code/load.py
Computing md5 for a large directory data/2. This is only done once.
...
$$ git status
        .gitignore
        load.dvc
$$ cat .gitignore
data
$$ git add .gitignore load.dvc
git commit -m "init load stage"

Recall that our load stage outputs image data into the folder data. DVC has added the data folder to Git’s ignore list. This is because large binary files are not to be versioned in Git repositories. See section DVC-cached files on how DVC manages such data.

After adding .gitignore and load.dvc to version control, we define the other two stages of our pipeline analogously, see the following code block. Note the dependency of our training stage to the training config file. Since training typically takes long times (not so in our toy project, though), we output the readily trained model into a file, namely model/model.h5. As DVC versions this binary file, we have easy access to this version of our model in the future.


$$ dvc run -f train.dvc -d data -d config/train.json -o model/model.h5 python code/train.py
...
$$ dvc run -f evaluate.dvc -d model/model.h5 -M model/metrics.json python code/evaluate.py
...

For the evaluation stage, observe the definition of the file model/metrics.json as a metric (-M parameter). Metrics can be inspected using the dvc metrics command, as we discuss in section Comparing versions . To wrap up our first version of the pipeline, we put all stage definitions (.dvc-files) under version control and add a tag.


$$ git add ...
$$ git commit ...
$$ git tag -a 0.1 -m "initial pipeline version 0.1"

Remark: DVC does not only support tags for organizing versions of your pipeline, it also allows you to utilize branch structures.

Finally, let’s have a brief look at how DVC renders our pipeline.


$$ dvc pipeline show --ascii evaluate.dvc

Remark: Observe that stage definitions call arbitrary commands, i.e., DVC is language-agnostic and not bound to Python. No one can stop you from implementing stages in Bash, C, or any other of your favorite languages and frameworks like R, Spark, PyTorch, etc.

DVC-cached files

For building up intuition on how DVC and Git work together, let us skip back to our initial Git repository version. Since no pipeline is defined yet, none of our training data, model, or metrics exist.

Recall that DVC uses Git to keep track of which output data belongs to the checked out version. Therefore, additionally to choosing the version via the git command, we have to instruct DVC to synchronize outputs using the dvc checkout command. I.e., when initializing the repository git and dvc have to be used in tandem.


$$ git checkout 0.0
$$ dvc checkout
$$ ls data
ls: cannot access 'data': No such file or directory # desired :-)

Back to the latest version, we find that DVC has restored all training data.


$$ git checkout 0.1
$$ dvc checkout
$$ ls data
0  1  2  3  4  5  6  7  8  9 # one folder of images for each digit

Similarly, you can skip to any of your versioned pipelines and inspect their configuration, training data, models, metrics, etc.

Remark: Recall that DVC configures Git to ignore output data. How is versioning of such data implemented? DVC manages output data in the repository’s subfolder .dvc/cache (which is also ignored by Git, as configured in .dvc/.gitignore). DVC-cached files are exposed to us as hardlinks from output files into DVC’s cache folder, where DVC takes care of managing the hardlinks.

Reproducing the pipeline

Pat yourself on the back. You have mastered building a pipeline, which is the hard part. Reproducing (parts of) it, i.e., re-executing stages with changed dependencies, is super-easy. First, note that if we do not change any dependencies, there is nothing to be reproduced.


$$ dvc repro evaluate.dvc
...
Stage 'load.dvc' didnt change.
Stage 'train.dvc' didnt change.
Stage 'evaluate.dvc' didnt change.
Pipeline is up to date. Nothing to reproduce.

When changing the amount of training data (see pen icon in the following figure) and calling the dvc repro-command with parameter evaluate.dvc for the last stage (red play icon), the entire pipeline will be reproduced (red arrows).


$$ echo '{ "num_images" : 2000 }' > config/load.json
$$ dvc repro evaluate.dvc
...
Warning: Dependency 'config/load.json' of 'load.dvc' changed.
Stage 'load.dvc' changed.
Reproducing 'load.dvc'
...
Warning: Dependency 'data' of 'train.dvc' changed.
Stage 'train.dvc' changed.
Reproducing 'train.dvc'
...
Warning: Dependency 'model/model.h5' of 'evaluate.dvc' changed.
Stage 'evaluate.dvc' changed.
Reproducing 'evaluate.dvc'

Observe that DVC tracks changes to dependencies and outputs through md5-sums stored in their corresponding stage’s .dvc-file:


$$ git status
        modified:   config/load.json
        modified:   evaluate.dvc
        modified:   model/metrics.json
        modified:   load.dvc
        modified:   train.dvc
$$ git diff load.dvc
...
deps:
-- md5: 44260f0cf26e82df91b23ab9a75bf4ae
+- md5: 2e13ea50a381a9be4809026b71210d5a
...
outs:
-  md5: b5136af809e828b0c51735f40c5d6db6.dir
+  md5: 8265c35e6eb17e79de9705cbbbd9a515.dir

Let us save this version of our pipeline and tag it.


$$ git add load.dvc train.dvc evaluate.dvc config/load.json model/metrics.json
$$ git commit -m "0.2 more training data"
$$ git tag -a 0.2 -m "0.2 more training data"

Reproducing partially

What if only training configuration changes, but training data remains the same? All stages but the load stage should be reproduced. We have control over which stages of the pipeline are reproduced. In a first step, we reproduce only the training stage by issuing the dvc repro command with parameter train.dvc, the stage in the middle of the pipeline (we increase the number of convolution filters in our neural network).


$$ echo '{ "num_conv_filters" : 64 }' > config/train.json
$$ dvc repro train.dvc
...
Stage 'load.dvc' didnt change.
Warning: Dependency 'config/train.json' of 'train.dvc' changed.
Stage 'train.dvc' changed.
Reproducing 'train.dvc'...

We can now reproduce the entire pipeline. Since we already performed re-training, the trained model was changed also, and only the evaluation stage will be executed.


$$ dvc repro evaluate.dvc
...
Stage 'load.dvc' didnt change.
Stage 'train.dvc' didnt change.
Warning: Dependency 'model/model.h5' of 'evaluate.dvc' changed.
Stage 'evaluate.dvc' changed.
Reproducing 'evaluate.dvc'...

Finally, let us increase the amount of available training data and trigger the entire pipeline by reproducing the evaluate.dvc stage.


$$ echo '{ "num_images" : 3000 }' > config/load.json
$$ dvc repro evaluate.dvc
...
Warning: Dependency 'config/load.json' of 'load.dvc' changed.
Stage 'load.dvc' changed.
Reproducing 'load.dvc'
...
Warning: Dependency 'data' of 'train.dvc' changed.
Stage 'train.dvc' changed.
Reproducing 'train.dvc'
...
Warning: Dependency 'model/model.h5' of 'evaluate.dvc' changed.
Stage 'evaluate.dvc' changed.
Reproducing 'evaluate.dvc'...

Again, we save this version of our pipeline and tag it.


$$ git add config/load.json config/train.json evaluate.dvc load.dvc train.dvc model/metrics.json
$$ git commit -m "0.3 more training data, more convolutions"
$$ git tag -a 0.3 -m "0.3 more training data, more convolutions"

Comparing versions

Recall that we have defined a metric for the evaluation stage in the file model/metrics.json. DVC can list metrics files for all tags in the entire Git repository, which allows us to compare model performances for various versions of our pipeline. Clearly, increasing the amount of training data and adding convolution filters to the neural network improves the model’s accuracy.


$$ dvc metrics show -T # -T for all tags
...
0.1:
        model/metrics.json: [0.896969696969697]
0.2:
        model/metrics.json: [0.9196969693357294]
0.3:
        model/metrics.json: [0.9565656557227626]

Actually, the file model/metrics.json stores not only the model’s accuracy, but also its loss. To display only the accuracy, we have configured DVC with an XPath expression as follows. This expression is stored in the corresponding stage’s .dvc file.


$$ dvc metrics modify model/metrics.json --type json --xpath acc
$$ cat evaluate.dvc
...
metric:
 type: json
 xpath: acc
...

Remark 1: DVC also supports metrics stored in CSV files or plain text files. In particular, DVC does not interpret metrics, and instead treats them as plain text.

Remark 2: For consistent metrics display over all pipeline versions, metrics should be configured in the very beginning of your project. In this case, configuration contained in .dvc-files is the same for all versions.

Sharing data

When developing models in teams, sharing training data, readily trained models, and performance metrics is crucial for efficient collaboration–each team member retraining the same model is a waste of time. Recall that stage output data is not stored in the Git repository. Instead, DVC manages these files in its .dvc/cache folder. DVC allows us to push cached files to remote storage (SSH, NAS, Amazon, S3, …). From there, each team member can pull that data to their individual workspace’s DVC cache and work with it as usual.

For the purpose of this walkthrough, we fake remote storage using a local folder called /remote. Here is how to configure the remote and push data to it.


$$ mkdir /remote/dvc-cache
$$ dvc remote add -d fake_remote /remote/dvc-cache # -d for making the remote default
$$ git add .dvc/config # save the remote's configuration
$$ git commit -m "configure remote"
$$ dvc push -T

The -T parameter pushes cached files for all tags. Note that dvc push intelligently pushes only new or changed data, and skips over data that has remained the same since the last push.

How would your team member access your pushed data? (If you followed along in your shell, exit the container and recreate it by calling ./start_environment.sh bash. The following steps are documented in /home/dvc/scripts/clone.sh and should be applied in the /home/dvc-folder.) Recall that cloning the Git repository will not check out training data etc., since such files are managed by DVC. We need to instruct DVC to pull that data from the remote storage. Thereafter, we can access the data as before.


$$ cd /home/dvc
$$ git clone /remote/git-repo walkthrough-cloned
$$ cd walkthrough-cloned
$$ ls data
ls: cannot access 'data': No such file or directory # no training data there :(
$$ dvc pull -T # -T to pull for all tags
$$ ls data
0  1  2  3  4  5  6  7  8  9 # theeere is our training data :)

Remark: Pushing/pulling DVC-managed data for all tags (-T parameter) is not advisable in general, since you will send/receive lots of data.

Conclusion

DVC allows you to define (language-agnostic) reproducible ML pipelines and version pipelines together with their associated training data, configuration, performance metrics, etc. Performance metrics can be evaluated for all versions of a pipeline. Training data, trained models, and other associated binary data can be shared (storage-agnostic) with team members for efficient collaboration.

If you like, share your thoughts in the comments !

In this follow-up post we discuss how to use DVC artifacts as dependencies in other projects.

Was this post helpful?

Likes

Blog author

Bert Besser

Do you still have questions? Just send me a message.

fromBert Besser

Remote training with GitLab-CI and DVC

In many Data Science projects there is a point in time where the workstation under your desk is not the ideal machine to perform the model training anymore. More potent processors and GPUs are required, e.g. a suitable server in your company’s rack or...

Git
Machine Learning
CI/CD
AI
GitLab

27.1.2020 | 15 Minuten Lesezeit

Marcel Mikl

Bert Besser

DVC dependency management – a guide

This post is a follow-up to A walkthrough of DVC that deals with managing dependencies between DVC projects. In particular, this follow-up is about importing specific versions of an artifact (e.g. a trained model or a dataset) from one DVC project into...

Data
AI
Machine Learning

26.8.2019 | 10 Minuten Lesezeit

Bert Besser

Veronika Schwan

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Green Cloud: Daten und Emissionen sparen

Das Internet produziert jährlich 900 Millionen Tonnen CO₂ – das ist deutlich mehr als Deutschland insgesamt emittiert. Hauptverantwortlich ist der immer weiter steigende Stromverbrauch beim Transport und der Speicherung von Daten. Wenn ihr kurz darüber...

Cloud
Green IT
Softwarearchitektur
Data

11.3.2024 | 5 Minuten Lesezeit

Dennis

Charge your APIs Volume 23: REST vs. gRPC

APIs dienen als Verbindungsstück zwischen Daten und Verarbeitung und erlauben uns damit, Daten im richtigen Kontext als Informationen zu interpretieren. Passende fachliche Themen sind dabei präsenter denn je und erreichen bald auch den Endverbraucher...

Java
Softwareentwicklung
Spring
Softwarearchitektur
API
Data

11.2.2024 | 7 Minuten Lesezeit

Sebastian Tiemann

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Im Bereich des maschinellen Lernens wurde eine lange Zeit angenommen, dass die Eingabedaten von Modellen und Gewichten sicher sei und nicht extrahiert werden könnten. In den letzten Jahren veröffentlichte Forschung hat diese Annahme in Frage gestellt...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 8 Minuten Lesezeit

Ihsan Kisi

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Mithilfe von Daten können Unternehmen fundiertere Entscheidungen treffen, ihre Arbeitsabläufe optimieren und mit der Kraft des maschinellen Lernens (ML) einen Vorteil in der wettbewerbsintensiven Geschäftswelt erlangen. Allerdings ist der Umgang mit ...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 7 Minuten Lesezeit

Ihsan Kisi

Große Sprachmodelle: Was ist ein LLM?

Große Sprachmodelle (Large Language Models oder LLM) haben in den letzten Jahren enorme Fortschritte gemacht und spielen eine entscheidende Rolle in verschiedenen Anwendungen. Aber was ist ein LLM? Es ist sinnvoll zu erklären, was ein „einfaches“ Sprachmodell...

Machine Learning

20.6.2023 | 4 Minuten Lesezeit

Elvira Siegel

Bessere SQL-Datenpipelines mit dbt

SQL ist weiterhin aus der Datenanalyse nicht wegzudenken – es ist vergleichsweise einfach zu lernen und Anwender können es ohne zusätzliche Werkzeuge auf einer Datenbank ausführen. Entsprechend ist es bei vielen Datenanalysten und Engineers beliebt. ...

Data

22.2.2023 | 2 Minuten Lesezeit

Matthias Niehoff

ChatGPT im Alltag eines Python-Entwicklers

Seit einigen Tagen spiele ich mit ChatGPT herum. Beruflich und privat konnte ich damit einige Fragen bearbeiten, bspw. welche Alternativen es zu bestimmten Tools gibt, was Vorteile von Teilzeit für den Arbeitgeber sind oder wer ich bin. Leider weiß ChatGPT...

NLP
Python
Künstliche Intelligenz

27.1.2023 | 7 Minuten Lesezeit

Robert Meißner

Manches gehört zusammen, manches besser nicht - Konnaszenz in Python

Wir alle kennen es. Wir bekommen neuen Code und irgendwie macht der merkwürdige Sachen. Teilweise müssen wir Reverse Engineering betreiben. Wir wundern uns, warum eine Umgebungsvariable nicht korrekt gesetzt wird oder der Login schief geht. Bis wir merken...

Python
Softwareentwicklung
Softwarearchitektur

30.11.2022 | 7 Minuten Lesezeit

Robert Meißner

Streaming Wikipedia mit Apache Kafka

Apache Kafka ist in aller Munde und entwickelt sich im Kontext von verteilten Systemen zum De-facto-Standard als Plattform für Event Streaming. Im Rahmen unserer OffProject Time (Weiterbildungszeit) haben wir uns die Plattform auch näher angeschaut und...

Kotlin
Data
Java
Messaging
Spring

15.8.2022 | 10 Minuten Lesezeit

Christoph Metzger

Felix Rieß

„Strawberry JSON Fields Forever“: Filtern nach JSON-Feldern mit GraphQL...

Schon die Beatles besangen ein uraltes Problem in ihrem Song „Strawberry JSON Fields Forever“ : Wie lässt sich mit der GraphQL Library Strawberry für Python nach Werten in JSON-Feldern einer PostgreSQL-Datenbank filtern?SetupUm das zu zeigen, braucht...

Frontend
API
Python

26.6.2022 | 4 Minuten Lesezeit

Michael Eichenseer

Einführung in die Welt der Tourenoptimierung – Echte Routen und realistischere...

In diesem Artikel möchte ich euch mit einem Python Jupyter Notebook zeigen, wie ihr Anwendungsfälle der Tourenoptimierung inklusive Nebenbedingungen lösen und visualisieren könnt. Außerdem zeige ich euch, wie ihr mit OpenStreetMaps die Route zwischen...

Data

21.6.2022 | 7 Minuten Lesezeit

Lukas Heidemann

Einführung in die Welt der Tourenoptimierung – Visualisierung und Lösungsverfahren...

In diesem Artikel möchte ich euch zeigen, wie ihr Probleme der Tourenoptimierung in einem Python Jupyter Notebook lösen und visualisieren könnt. Am Beispiel eines Fahrradkurierdienst zeige ich außerdem, wie das Grundproblem um gängige Nebenbedingungen...

Data

16.6.2022 | 9 Minuten Lesezeit

Lukas Heidemann

Einführung in die Welt der Tourenoptimierung (1/3)

In vielen Unternehmen fallen täglich verschiedene Transportprozesse an. Klassische Beispiele sind die Optimierung von Warenein- und ausgängen, die Einsatzplanung von Servicetechnikern oder die optimale Reihenfolge der Auslieferung bei Lieferdiensten....

Data

12.6.2022 | 8 Minuten Lesezeit

Lukas Heidemann

Smart DistancR – Perspektivisch korrekte Distanzmessung zwischen Personen

Die Corona-Krise ist weiterhin in aller Munde und wird uns mit hoher Wahrscheinlichkeit noch etwas länger begleiten. Wie man aus unterschiedlichen Statistiken erfährt, schwanken die Fallzahlen weiter und sorgen für zusätzliche Restriktionen. Diese werden...

Computer Vision
Künstliche Intelligenz
IoT
Machine Learning

13.12.2021 | 7 Minuten Lesezeit

Michel Ehmen

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Die Qualität bzw. Nützlichkeit von Machine-Learning-Modellen lässt sich mit Hilfe von Testdaten und Metriken bewerten. Allerdings in welchem Umfang? Manuell, automatisiert, einmalig, regelmäßig? Manuell lassen sich die ersten Modelle als Ergebnis eines...

Data
Machine Learning
Softwareentwicklung
CI/CD

7.12.2021 | 7 Minuten Lesezeit

Berthold Schulte

Wie man Java-Klassen in Python benutzt

Generell sollte man zwar für jedes Problem das passende Werkzeug nutzen. Aber oftmals wird man gezwungen, den Hammer Java zu nutzen, weil der Rest des Hauses mit diesem Hammer gebaut wurde. Eine moderne Lösung dieses Problems ist natürlich die Microservice...

Künstliche Intelligenz
Java
Python

15.11.2021 | 8 Minuten Lesezeit

Hendrik Schawe

Kürzere Time-to-Market für ML-Modelle durch Googles BigQuery ML

Machine Learning (ML) erzeugt erst dann realen Mehrwert, wenn es in Produktion benutzt wird. Allerdings kann die Zeitspanne zwischen der Entwicklung eines belastbaren Modells und dessen Einsatz frustrierend lange sein. Insbesondere in schnelllebigen ...

Agile Methoden
Cloud
Machine Learning

26.7.2021 | 5 Minuten Lesezeit

Timo Böhm

Niklas Haas

Schnelles Training eines Recommendation-Modells durch BigQuery ML

Machine Learning (ML) kann nur durch Modelle in der Produktion Business Value erzeugen. Allerdings kann die Zeitspanne zwischen der Entwicklung der nächsten Iteration eines Modells und dessen Einsatz in einer Produktionsumgebung massiv sein. Dies gilt...

Accelerate
Cloud
Data
Google Cloud
Machine Learning

26.7.2021 | 11 Minuten Lesezeit

Niklas Haas

Timo Böhm

Automatisch skaliertes Cloud Native Consent Management in der Google Cloud

Immer häufiger ersetzen unsere Kunden lokale Rechenzentren durch eine Cloud-Infrastruktur. Die Gründe sind Ausfallsicherheit, Wartbarkeit und vor allem Skalierbarkeit. Mit dem letzten dieser Aspekte befassen wir uns in diesem Blogartikel anhand eines...

APM
Python
Cloud
Google Cloud
Infrastructure
Softwarearchitektur
Serverless

28.6.2021 | 16 Minuten Lesezeit

Markus Lüger

Christopher

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Heutzutage steht fast alles, was mit den Labels „künstliche Intelligenz (KI)“ oder „Machine Learning (ML)“ versehen ist, für Fortschritt. Seltsamerweise schließt diese Assoziation jedoch häufig die Themen Daten und Dateninfrastruktur nicht ausreichend...

Kultur
Data
Machine Learning

21.6.2021 | 12 Minuten Lesezeit

Marcel Mikl

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Du stehst vor einer großen IT-Herausforderung? Wir sorgen für eine maßgeschneiderte Unterstützung. Informiere dich jetzt.

Hilf uns, noch besser zu werden.

Wir sind immer auf der Suche nach neuen Talenten. Auch für dich ist die passende Stelle dabei.

Contact

Send

A walkthrough of DVC

A toy project

Preparing the repository

Defining the pipeline

DVC-cached files

Reproducing the pipeline

Reproducing partially

Comparing versions

Sharing data

Conclusion

Was this post helpful?

Ja

Blog author

Get in contact

Get in contact

More articles

Remote training with GitLab-CI and DVC

DVC dependency management – a guide

Your job at codecentric?

Agile Developer und Consultant (w/d/m)

View Job

More articles in this subject area

Green Cloud: Daten und Emissionen sparen

Charge your APIs Volume 23: REST vs. gRPC

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Große Sprachmodelle: Was ist ein LLM?

Bessere SQL-Datenpipelines mit dbt

ChatGPT im Alltag eines Python-Entwicklers

Manches gehört zusammen, manches besser nicht - Konnaszenz in Python

Streaming Wikipedia mit Apache Kafka

„Strawberry JSON Fields Forever“: Filtern nach JSON-Feldern mit GraphQL...

Einführung in die Welt der Tourenoptimierung – Echte Routen und realistischere...

Einführung in die Welt der Tourenoptimierung – Visualisierung und Lösungsverfahren...

Einführung in die Welt der Tourenoptimierung (1/3)

Smart DistancR – Perspektivisch korrekte Distanzmessung zwischen Personen

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Wie man Java-Klassen in Python benutzt

Kürzere Time-to-Market für ML-Modelle durch Googles BigQuery ML

Schnelles Training eines Recommendation-Modells durch BigQuery ML

Automatisch skaliertes Cloud Native Consent Management in der Google Cloud

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Unsere Leistungen

Hilf uns, noch besser zu werden.

Zu den Jobangeboten