AWS SageMaker Machine Learning Data handling


Seven ways of handling image and machine learning data with AWS SageMaker and S3

If you start using AWS machine learning services, you will have to dive into data handling with AWS SageMaker and S3. We want to show you seven ways of handling image and machine learning data with AWS SageMaker and S3 in order to speed up your coding and make porting your code to AWS easier.

If you are working on computer vision and machine learning tasks, you are probably using the most common libraries such as OpenCV, matplotlib, pandas, and many more. As soon as you start working with or migrating to AWS SageMaker, you will be confronted with the challenge of loading, reading, and writing files from the recommended AWS storage solution, the Simple Storage Service (S3). If you want to migrate existing code that was not written for SageMaker, you need to know some techniques to get the job done fast. This article gives a short overview of how to handle computer vision and machine learning data with SageMaker, and of your options for porting notebooks that were not written for SageMaker.

If you are new around here, please take a look at our AI portfolio, YouTube channel, and our Deep Learning Bootcamp.

Storage Architecture of SageMaker and S3

In order to get a better understanding of the setup, we will take a short look at the storage architecture of SageMaker.

AWS SageMaker Data architecture

AWS SageMaker storage architecture

With your local machine learning setup you are used to managing your data locally on your disk and your code probably in a Git repository on GitHub. For coding you probably use a Jupyter notebook, at least for experimenting. In this setup you are able to access your data directly from your code. 

In contrast to that, machine learning in AWS relies on somewhat temporary SageMaker machine learning instances that can be started and stopped. As soon as you terminate (delete) your instance and load your notebook into a new instance, all the data on the instance is gone unless it was stored elsewhere. All data that should be permanent or needs to be shared between different instances, e.g. be available to a training instance, should be held outside of the instance storage. The place to put the ML input data is the Amazon Simple Storage Service – S3. You’ll find additional hints on proper access rights and cost considerations at the very end of the article.

The data on your instance resides either on the instance’s file system (Elastic Block Storage) or in memory. Additional sources of data and code can be public or private Git repositories, hosted either on GitHub or in AWS CodeCommit. The code can also reside in your instance’s storage as part of the inference/training job in the AWS training/inference images that are held in the Elastic Container Registry.

During the creation steps of the SageMaker ML instance you define how much instance storage you want to assign. It needs to be enough to handle all the ML data you want to work with. In our case we assign the standard 5 GB. At this step you should be aware that this instance is different from the training instance, which you will spawn from your notebook. The notebook instance might need much less storage and compute power than the training instance; often a small, non-accelerated instance is enough for the data preparation steps.

AWS SageMaker Instance Volume Size

Specifying the instance volume size

Seven ways to access your machine learning data and to reuse your existing code

Depending on what you want to do with your data and how often you need it during work, you have the following options:

  1. Using a Code Repository for data delivery 
  2. Code based data replication
  3. Copying data to the instance with the AWS client
  4. Streaming data from S3 to the instance-memory
  5. Using temporary files on the instance
  6. Making use of S3-compatible framework methods
  7. Replacing ML framework functions with AWS custom methods

1 – Using a code repository for data delivery 

One way to bring the original code and small ML datasets onto the SageMaker instance is to use your Git repository. The repository is cloned into your SageMaker instance when the instance is created. All the data will be available at the root directory of your Jupyter notebook. This method is, however, not suitable for every case.

Using Git repositories in AWS SageMaker for data delivery

Adding a repository

The issue is that source control management systems such as Git do not cope well with bigger chunks of data. In particular, they try to generate diffs for files, which does not work well with large binary files. A good article about the pros and cons of holding training data in Git repositories can be found here. An alternative to Git is DVC, which stands for Data Version Control. We have already published a walkthrough of DVC and an article about DVC dependency management. The idea of DVC is that the information about the ML binary data is placed in small text files in your Git repository, while the actual binary data is managed by DVC. After checking out your code base version from Git, you would use a command like ‘!dvc checkout’ to get the data from your binary storage, which could also be AWS S3. Shell commands are executed from your notebook by placing a ‘!’ in front of the command you want to execute.

2 – Code based data replication

Another easy way to work with your already existing scripts, without too much modification, is to make a full copy of your training data on the SageMaker instance. You can do this as part of your code or by using command line tools (see below). Basically, you traverse the whole ML data tree in the bucket, create all the directories you need locally, and download all the files.

# Download all S3 data to your instance
import os

import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('sagemaker-cc-people-counter-trainingsset')

for my_bucket_object in bucket.objects.all():
    key = my_bucket_object.key
    print(key)
    # Recreate the bucket's directory structure locally
    if os.path.dirname(key) and not os.path.exists(os.path.dirname(key)):
        os.makedirs(os.path.dirname(key))
    try:
        bucket.download_file(key, key)
    except ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("No object with this key.")
        else:
            raise
copying the bucket to your instance

3 – Copying data to the instance with the AWS client

Another easy way to work with your already existing scripts, without too much modification, is to make a full copy of your training data on the SageMaker instance. You can do this as part of your code (see above) or by using command line tools.

A very simple and easy way to copy data from your S3 bucket to your instance is to use the AWS command line tools. You can copy your data back and forth between s3:// and your instance storage, as well as between one s3:// bucket and another. It is important that you set your IAM policies correctly (see hints at the end of the article).

!aws s3 cp s3://$bucket/train/images train/images/ --recursive

The documentation can be found here.

4 – Streaming data from S3 to the SageMaker instance-memory

Streaming means reading the object directly into memory instead of writing it to a file. Also interesting, though not necessary for our current challenge, is the question of lazy reading with S3 resources – reading only the actually needed part of the file. You can find some more description here: https://alexwlchan.net/2017/09/lazy-reading-in-python/

import matplotlib.image as mpimage
…
image = mpimage.imread(img_fd)
original call
import boto3
import io
import matplotlib.image as mpimg

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
image = mpimg.imread(io.BytesIO(obj['Body'].read()), 'jp2')
call using streaming data
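For the lazy-reading idea mentioned above, S3’s GetObject API accepts a Range parameter, so you can download only the bytes you actually need instead of the whole object. A minimal sketch – the bucket and key are placeholders, and the in-memory buffer stands in for the returned body so the pattern can be tried without AWS credentials:

```python
import io

# With boto3 you would request only a byte range (bucket/key are placeholders):
# obj = s3.get_object(Bucket='bucket', Key='key', Range='bytes=0-1023')
# chunk = obj['Body'].read()

# The same partial-read pattern, demonstrated with an in-memory stand-in:
body = io.BytesIO(b'HEADERBYTES' + b'\x00' * 10000)  # pretend this is a large S3 object
header = body.read(11)  # read only the first 11 bytes instead of everything
print(header)  # b'HEADERBYTES'
```

This is useful when, for example, you only need to inspect a file header to decide whether to download the rest.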

5 – Using temporary files on the SageMaker instance

Another way to keep using your usual methods is to create temporary files on your SageMaker instance and feed them into the standard methods as a file path. The tempfile module provides automatic cleanup. For more information you can refer to the documentation.

from matplotlib import pyplot as plt
...
img = plt.imread(img_path)
original call
import boto3
import tempfile
from matplotlib import pyplot as plt
...
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('sagemaker-cc-people-counter-trainingsset')
obj = bucket.Object(img_path)
# Download the S3 object into a temporary file, then read it by path
tmp = tempfile.NamedTemporaryFile()
with open(tmp.name, 'wb') as f:
    obj.download_fileobj(f)
img = plt.imread(tmp.name)
print(img.shape)
new approach by using temporary files

6 – Making use of S3-compatible framework methods

Some of the popular frameworks implement more options to access data than file path strings or file descriptors. The pandas library, for example, uses URI schemes to identify the method of accessing the data: while file:// looks on the local file system, s3:// accesses the data through the AWS boto library. You will find additional information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. For pandas, any valid string path is acceptable; the string could also be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.

import pandas as pd
data = pd.read_csv('file://oilprices_data.csv')
original call accessing local files
import pandas as pd
data = pd.read_csv('s3://bucket....csv')
new call with S3 URI
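Since read_csv also accepts file-like objects, you can exercise the same code path locally before pointing it at S3 – useful for testing a port without credentials. A small sketch (the CSV content is made up):

```python
import io

import pandas as pd

# In-memory stand-in for a CSV that would otherwise live in an S3 bucket
csv_data = io.StringIO("date,price\n2020-01-01,61.2\n2020-01-02,59.8")
data = pd.read_csv(csv_data)
print(data.shape)  # (2, 2)
```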

7 – Replacing ML framework functions with AWS custom methods

Here are some further examples of using AWS-native methods instead of machine learning library calls.

plt.imshow(Image.open(img_paths[0]))
original call

can be replaced by

from PIL import Image
from s3fs.core import S3FileSystem

s3fs = S3FileSystem()
with s3fs.open('{}/{}'.format('sagemaker-cc-people-counter-trainingsset', img_paths[0])) as f:
    display(Image.open(f))
call using s3fs

Another example with scipy

import scipy.io as io
mat = io.loadmat(img_path.replace('.jpg','.mat'))
original call

can be replaced by

import scipy.io as io
from s3fs.core import S3FileSystem

s3fs = S3FileSystem()
mat = io.loadmat(s3fs.open('{}/{}'.format('sagemaker-cc-people-counter-trainingsset', img_path.replace('.jpg', '.mat'))))
call using s3fs

Conclusion

The task of porting Jupyter notebooks to AWS SageMaker can be a little tedious at first, but if you know which tricks to use it gets a lot easier. A key part of porting the notebooks is to get the data handling right and to decide which approach you want to take in order to enable or replace your usual ML framework calls. We have shown some options for approaching this task. If you have additional tricks or hints, please let me know @kherings. I recommend having a look at our AI portfolio, YouTube channel, and our Deep Learning Bootcamp.

Additional Hints

S3 access rights 

In order to access your data from your SageMaker instance you need proper access rights. More precisely, the running SageMaker instance needs the access rights to use the S3 service and to access the bucket where the data is held. Your SageMaker instance needs a proper AWS service role that contains an IAM policy with the rights to access the S3 bucket. There are two options: either let SageMaker generate an AmazonSageMakerFullAccess role for you, or create a custom one.

Proper IAM Policies and roles are necessary to access S3

The generated AmazonSageMaker-ExecutionRole lets the notebook access all S3 buckets that contain the string ‘sagemaker’ in their name. The other quick option is to attach an S3 full access policy to your custom role, which is not recommended.
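If you create a custom role, a scoped-down policy along the following lines grants only the S3 actions the notebook typically needs; the bucket name is a placeholder and the action list is a sketch, not an exhaustive recommendation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-sagemaker-data-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-sagemaker-data-bucket/*"
    }
  ]
}
```

Note that ListBucket applies to the bucket ARN itself, while GetObject/PutObject apply to the objects inside it (the /* suffix).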

Making an S3 bucket accessible to AWS SageMaker

Considering S3 storage cost for your image data

If you are working with data on the AWS cloud, you should keep an eye on the cost of your actions in order not to get an unpleasant surprise. Typically you save money with AWS in comparison to a local setup, but you should be aware of the cost drivers and use the AWS Cost Explorer and the AWS SageMaker pricing tables. Depending on the size of your ML data and how frequently you access it, you might want to change the storage class of your S3 bucket or activate S3 Intelligent-Tiering. Price comparisons can be found here.

S3 Storage tiers

Kai Herings

Kai Herings leads the Innovation Acceleration Team at codecentric. He is involved in projects in various areas, such as Deep Learning & AI, AR/VR, Industry 4.0, IoT, Blockchain, and more.
