AWS SageMaker Machine Learning Data handling


Seven ways of handling image and machine learning data with AWS SageMaker and S3

If you start using AWS machine learning services, you will have to dive into data handling with AWS SageMaker and S3. We want to show you seven ways of handling image and machine learning data with AWS SageMaker and S3 in order to speed up your coding and make porting your code to AWS easier.

If you are working on computer vision and machine learning tasks, you are probably using the most common libraries such as OpenCV, matplotlib, pandas, and many more. As soon as you start working with or migrating to AWS SageMaker, you will be confronted with the challenge of loading, reading, and writing files from the recommended AWS storage solution, the Simple Storage Service (S3). If you want to migrate existing code that was not written for SageMaker, you need to know some techniques to get the job done fast. This article gives a short overview of how to handle computer vision and machine learning data with SageMaker, and of your options for porting notebooks that were not written for SageMaker.

If you are new around here, please take a look at our AI portfolio, YouTube channel, and our Deep Learning Bootcamp.

Storage Architecture of SageMaker and S3

In order to get a better understanding of the setup, we will take a short look at the storage architecture of SageMaker.

AWS SageMaker Data architecture

AWS SageMaker storage architecture

With your local machine learning setup you are used to managing your data locally on your disk and your code probably in a Git repository on GitHub. For coding you probably use a Jupyter notebook, at least for experimenting. In this setup you are able to access your data directly from your code. 

In contrast to that, machine learning in AWS relies on somewhat temporary SageMaker machine learning instances that can be started and stopped. As soon as you terminate (delete) your instance and load your notebook into a new instance, all the data on the instance is gone unless it was stored elsewhere. All data that should be permanent or needs to be shared between different instances, e.g. be available to a training instance, should be held outside of the instance storage. The place to put the ML input data is the Amazon Simple Storage Service – S3. You’ll find additional hints on proper access rights and cost considerations at the very end of the article.

The data on your instance resides either on the instance’s file system (Elastic Block Storage) or in memory. Additional sources of data and code can be public or private Git repositories, hosted either on GitHub or in AWS CodeCommit. The code can also reside in your instance’s storage as part of the inference/training job in the AWS training/inference images that are held in the Elastic Container Registry.

During the creation steps of the SageMaker ML instance you define how much instance storage you want to assign. It needs to be enough to handle all the ML data you want to work with. In our case we assign the standard 5 GB. At this step you should be aware that this instance is different from the training instance, which you will spawn from your notebook. The notebook instance might need much less storage and compute power than the training instance; often a small, non-accelerated instance is enough for the data preparation steps.

AWS SageMaker Instance Volume Size

Specifying the instance volume size

Seven ways to access your machine learning data and to reuse your existing code

Depending on what you want to do with your data and how often you need it during work, you have the following options:

  1. Using a Code Repository for data delivery 
  2. Code based data replication
  3. Copying data to the instance with the AWS client
  4. Streaming data from S3 to the instance-memory
  5. Using temporary files on the instance
  6. Making use of S3-compatible framework methods
  7. Replacing ML framework functions with AWS custom methods

1 – Using a code repository for data delivery 

One way to bring the original code and small ML datasets onto the SageMaker instance is to use your Git repository. The repository is cloned into your SageMaker instance when the instance is created. All the data will be available at the root directory of your Jupyter notebook. This method is, however, not suitable for every case.

Using Git repositories in AWS SageMaker for data delivery

Adding a repository

The issue is that source control management systems such as Git do not cope well with bigger chunks of data. In particular, they try to generate diffs for files, which does not work well with large binary files. A good article about the pros and cons of holding training data in Git repositories can be found here. An alternative to Git is DVC, which stands for Data Version Control. We have already published a walkthrough of DVC and an article about DVC dependency management. The idea of DVC is that the information about the ML binary data is placed in small text files in your Git repository, while the actual binary data is managed by DVC. After checking out your code base version from Git, you would use a command like ‘!dvc checkout’ to get the data from your binary storage, which could also be AWS S3. Shell commands are executed from your notebook by placing a ‘!’ in front of the command you want to execute.

2 – Code based data replication

Another easy way to work with your already existing scripts, without too much modification, is to make a full copy of your training data on the SageMaker instance. You can do this as part of your code or by using command line tools (see below). Basically, you traverse the whole ML data tree in the bucket, create all the directories you need locally, and download all the files.

# Download all S3 data to your instance
import os

import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('sagemaker-cc-people-counter-trainingsset')

for my_bucket_object in bucket.objects.all():
    key = my_bucket_object.key
    print(key)
    # Recreate the bucket's directory structure locally
    if os.path.dirname(key) and not os.path.exists(os.path.dirname(key)):
        os.makedirs(os.path.dirname(key))
    try:
        bucket.download_file(key, key)
    except ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("No object with this key.")
        else:
            raise
copying the bucket to your instance

3 – Copying data to the instance with the AWS client

Another easy way to work with your already existing scripts, without too much modification, is to make a full copy of your training data on the SageMaker instance. You can do this as part of your code (see above) or by using command line tools.

A very simple and easy way to copy data from your S3 bucket to your instance is to use the AWS command line tools. You can copy your data back and forth between s3:// and your instance storage, as well as between one s3:// bucket and another. It is important that you set your IAM policies correctly (see hints at the end of the article).

!aws s3 cp s3://$bucket/train/images train/images/ --recursive

The documentation can be found here.

4 – Streaming data from S3 to the SageMaker instance-memory

Streaming means reading the object directly into memory instead of writing it to a file. Also interesting, though not necessary for our current challenge, is the question of lazy reading with S3 resources – reading only the actually needed part of the file. You can find some more description here: https://alexwlchan.net/2017/09/lazy-reading-in-python/

import matplotlib.image as mpimage
…
image = mpimage.imread(img_fd)
original call
import boto3
import io
import matplotlib.image as mpimg

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
image = mpimg.imread(io.BytesIO(obj['Body'].read()), 'jp2')
call using streaming data
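For the lazy-reading idea mentioned above, S3’s GetObject API accepts a Range parameter, so you can download only the bytes you actually need instead of the whole object. A minimal sketch – the bucket and key are placeholders, and the in-memory buffer stands in for the returned body so the pattern can be tried without AWS credentials:

```python
import io

# With boto3 you would request only a byte range (bucket/key are placeholders):
# obj = s3.get_object(Bucket='bucket', Key='key', Range='bytes=0-1023')
# chunk = obj['Body'].read()

# The same partial-read pattern, demonstrated with an in-memory stand-in:
body = io.BytesIO(b'HEADERBYTES' + b'\x00' * 10000)  # pretend this is a large S3 object
header = body.read(11)  # read only the first 11 bytes instead of everything
print(header)  # b'HEADERBYTES'
```

This is useful when, for example, you only need to inspect a file header to decide whether to download the rest.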

5 – Using temporary files on the SageMaker instance

Another way to keep using your usual methods is to create temporary files on your SageMaker instance and feed them into the standard methods as a file path. The tempfile module provides automatic cleanup. For more information you can refer to the documentation.

from matplotlib import pyplot as plt
...
img = plt.imread(img_path)
original call
import boto3
import tempfile
from matplotlib import pyplot as plt
...
s3 = boto3.resource('s3', region_name='us-east-2')
bucket = s3.Bucket('sagemaker-cc-people-counter-trainingsset')
obj = bucket.Object(img_path)
# Download the S3 object into a temporary file, then read it by path
tmp = tempfile.NamedTemporaryFile()
with open(tmp.name, 'wb') as f:
    obj.download_fileobj(f)
img = plt.imread(tmp.name)
print(img.shape)
new approach by using temporary files

6 – Making use of S3-compatible framework methods

Some of the popular frameworks implement more options to access data than file path strings or file descriptors. The pandas library, for example, uses URI schemes to identify the method of accessing the data: while file:// looks on the local file system, s3:// accesses the data through the AWS boto library. You will find additional information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. For pandas, any valid string path is acceptable; the string could also be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected.

import pandas as pd
data = pd.read_csv('file://oilprices_data.csv')
original call accessing local files
import pandas as pd
data = pd.read_csv('s3://bucket....csv')
new call with S3 URI
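Since read_csv also accepts file-like objects, you can exercise the same code path locally before pointing it at S3 – useful for testing a port without credentials. A small sketch (the CSV content is made up):

```python
import io

import pandas as pd

# In-memory stand-in for a CSV that would otherwise live in an S3 bucket
csv_data = io.StringIO("date,price\n2020-01-01,61.2\n2020-01-02,59.8")
data = pd.read_csv(csv_data)
print(data.shape)  # (2, 2)
```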

7 – Replacing ML framework functions with AWS custom methods

Here are some further examples of using AWS-native methods instead of machine learning library calls.

plt.imshow(Image.open(img_paths[0]))
original call

can be replaced by

from PIL import Image
from s3fs.core import S3FileSystem

s3fs = S3FileSystem()
with s3fs.open('{}/{}'.format('sagemaker-cc-people-counter-trainingsset', img_paths[0])) as f:
    display(Image.open(f))
call using s3fs

Another example with scipy

import scipy.io as io
mat = io.loadmat(img_path.replace('.jpg','.mat'))
original call

can be replaced by

import scipy.io as io
from s3fs.core import S3FileSystem

s3fs = S3FileSystem()
mat = io.loadmat(s3fs.open('{}/{}'.format('sagemaker-cc-people-counter-trainingsset', img_path.replace('.jpg', '.mat'))))
call using s3fs

Conclusion

The task of porting Jupyter notebooks to AWS SageMaker can be a little tedious at first, but if you know which tricks to use it gets a lot easier. A key part of porting the notebooks is to get the data handling right and to decide which approach you want to take in order to enable or replace your usual ML framework calls. We have shown some options for approaching this task. If you have additional tricks or hints, please let me know @kherings. I recommend having a look at our AI portfolio, YouTube channel, and our Deep Learning Bootcamp.

Additional Hints

S3 access rights 

In order to access your data from your SageMaker instance you need proper access rights. More precisely, the running SageMaker instance needs the access rights to use the S3 service and to access the bucket where the data is held. Your SageMaker instance needs a proper AWS service role that contains an IAM policy with the rights to access the S3 bucket. There are two options: either let SageMaker generate an AmazonSageMakerFullAccess role for you, or create a custom one.

Proper IAM Policies and roles are necessary to access S3

The generated AmazonSageMaker-ExecutionRole lets the notebook access all S3 buckets that contain the string ‘sagemaker’ in their name. The other quick option is to attach an S3 full access policy to your custom role, which is not recommended.
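If you create a custom role, a scoped-down policy along the following lines grants only the S3 actions the notebook typically needs; the bucket name is a placeholder and the action list is a sketch, not an exhaustive recommendation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-sagemaker-data-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-sagemaker-data-bucket/*"
    }
  ]
}
```

Note that ListBucket applies to the bucket ARN itself, while GetObject/PutObject apply to the objects inside it (the /* suffix).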

Making an S3 bucket accessible to AWS SageMaker

Considering S3 storage cost for your image data

If you are working with data on the AWS cloud, you should keep an eye on the cost of your actions in order not to get an unpleasant surprise. Typically you save money with AWS in comparison to a local setup, but you should be aware of the cost drivers and use the AWS Cost Explorer and the AWS SageMaker pricing tables. Depending on the size of your ML data and how frequently you access it, you might want to change the storage class of your S3 bucket or activate S3 Intelligent-Tiering. Price comparisons can be found here.

S3 Storage tiers

Kai Herings

Kai Herings leads the Innovation Acceleration Team at codecentric. He is involved in projects in various areas, such as Deep Learning & AI, AR/VR, Industry 4.0, IoT, Blockchain, and more.
