Going serverless: How to move files from on-prem SFTP to AWS S3

Motivation

It is not so rare that we as developers land in a project where the customer uses SFTP (SSH File Transfer Protocol) for exchanging data with their partners. Actually, I can hardly remember a project where SFTP wasn’t in the picture. In my last project, for example, the customer was using field data loggers that were logging and transferring measured values to an on-premises SFTP server. But what was different that time was that we were building a serverless solution in AWS. As you probably know, when an application operates in the AWS serverless world, it is absolutely essential to have your data in S3, so it can easily be used with other AWS services for all kinds of purposes: processing, archiving, analytics, and so on.

What we needed was a mechanism to poll the SFTP server for new files and move them into an S3 bucket. So we built a custom serverless solution from a combination of AWS managed services. It is reasonable to ask why we didn’t use AWS Transfer for SFTP. The answer is simple (it didn’t exist at that time), but I think a custom solution still has value for small businesses where traffic is not heavy and the SFTP server is already part of the existing platform. If this sounds interesting, keep on reading to find out more.

From SFTP to AWS S3: What you will read about in this post

  • Custom solution for moving files from SFTP to S3
  • In-depth description of the architecture
  • Solution constraints and limitations
  • Full source code
  • Infrastructure as Code
  • Detailed guide on how to run it in AWS
  • Video instructions

The architecture

Let’s briefly start by explaining what our solution will do. It will scan an SFTP folder and it will move (meaning both copy & delete) all files from it into an S3 bucket. Actually, it doesn’t have to be only one folder/bucket pair, you can configure as many source and destination pairs as you want. Another important thing to ask is: when does it get executed? It does so based on a schedule. You will use a Cron expression to schedule the execution, so it is pretty flexible there.

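The solution uses the following AWS services and tech stack: AWS Lambda with Node.js, CloudWatch Events, S3, Parameter Store, and Terraform for Infrastructure as Code.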

How it works

A CloudWatch Event is scheduled to trigger the Lambda function, and the Lambda function is responsible for connecting to the SFTP server and moving files to their S3 destination.
This approach requires only one Lambda function to be deployed, because the function is agnostic of both the source (SFTP folder) and the destination (S3 bucket). When a CloudWatch Event triggers the function, it passes the source and destination as parameters. You can deploy a single Lambda function and many CloudWatch Events that all trigger it, each with different source/destination parameters.

Node.js and Lambda: Connect to FTP and download files to AWS S3

The centerpiece is a Node.js Lambda function. It uses the ftp client module to communicate with the FTP server. Every time a CloudWatch Event triggers the Lambda function, it executes this method:

async execute(event: ImportFilesEvent): Promise<void> {
    // Connection parameters are read from Parameter Store on every invocation.
    const ftpConfig = await this.readFtpConfiguration();
    this.ftp.configure(ftpConfig);
    await this.ftp.connect();
    try {
        const files = await this.ftp.list(event.ftp_path);
        for (const ftpFile of files) {
            const remotePath = `${event.ftp_path}/${ftpFile.name}`;
            const fileStream = await this.ftp.get(remotePath);
            // Upload first; delete the source only after the upload has succeeded.
            await this.s3.put(fileStream, ftpFile.name, event.s3_bucket);
            await this.ftp.delete(remotePath);
        }
    } finally {
        // Close the connection even if listing or a transfer fails.
        this.ftp.disconnect();
    }
}

It iterates through the content of the given folder and moves each file to the S3 bucket. As soon as a file has been uploaded successfully, it is removed from its original location.
Notice event.ftp_path and event.s3_bucket in the code above. They come from the CloudWatch Event Rule definition, which is described in the following section.
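To make the copy-then-delete order easier to try out without a real server, here is a minimal sketch of the same loop running against in-memory stand-ins. The FtpLike and S3Like interfaces, moveFiles and both InMemory classes are my own hypothetical names for illustration; the wrappers in the actual project may look different.

```typescript
// Hypothetical minimal interfaces; the project's real ftp and s3 wrappers may differ.
interface FtpLike {
    list(path: string): Promise<{ name: string }[]>;
    get(path: string): Promise<string>;
    delete(path: string): Promise<void>;
}

interface S3Like {
    put(body: string, key: string, bucket: string): Promise<void>;
}

// Moves every file from ftpPath into bucket and returns the names that were moved.
// A file is deleted from the source only after its upload has resolved.
async function moveFiles(ftp: FtpLike, s3: S3Like, ftpPath: string, bucket: string): Promise<string[]> {
    const moved: string[] = [];
    for (const file of await ftp.list(ftpPath)) {
        const remotePath = `${ftpPath}/${file.name}`;
        const body = await ftp.get(remotePath);
        await s3.put(body, file.name, bucket); // rejects on failure, so the delete is skipped
        await ftp.delete(remotePath);
        moved.push(file.name);
    }
    return moved;
}

// In-memory stand-ins, for illustration only.
class InMemoryFtp implements FtpLike {
    files: Map<string, string>;
    constructor(files: Map<string, string>) {
        this.files = files;
    }
    async list(path: string): Promise<{ name: string }[]> {
        return Array.from(this.files.keys())
            .filter((p) => p.startsWith(`${path}/`))
            .map((p) => ({ name: p.slice(path.length + 1) }));
    }
    async get(path: string): Promise<string> {
        const content = this.files.get(path);
        if (content === undefined) {
            throw new Error(`no such file: ${path}`);
        }
        return content;
    }
    async delete(path: string): Promise<void> {
        this.files.delete(path);
    }
}

class InMemoryS3 implements S3Like {
    objects = new Map<string, string>();
    async put(body: string, key: string, bucket: string): Promise<void> {
        this.objects.set(`${bucket}/${key}`, body);
    }
}
```

With this shape, a failing upload leaves the source file where it is, so the next scheduled run can pick it up again.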

CloudWatch Event Rule

CloudWatch Events trigger the Lambda function on a schedule, which is set up by creating CloudWatch Event Rules. Every Rule consists of a Cron expression and an Input Constant. The Input Constant is exactly the mechanism we can use to pass the source and destination.

Now, when you take a look at the signature of the handler, you’ll see an ImportFilesEvent:

const handler: Handler<ImportFilesEvent, void> = async (event: ImportFilesEvent) => {
    console.log(`start execution for event ${JSON.stringify(event)}`);
    ...
};

This is exactly the value of the Input Constant and it is shown in the logged output as:

2019-01-16T07:30:11.430Z ... start execution for event
{
    "ftp_path": "source-one",
    "s3_bucket": "destination-one"
}
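Since the Input Constant is free-form JSON typed into the Rule, a typo there only shows up at runtime. As a sketch, the handler could check the payload before touching the server. The field names come from the logged event above; isImportFilesEvent is a hypothetical helper and not part of the original code.

```typescript
// Shape of the Input Constant, taken from the logged event above.
interface ImportFilesEvent {
    ftp_path: string;
    s3_bucket: string;
}

// Hypothetical helper (an assumption, not in the original project):
// returns true only if both fields are present and non-empty strings.
function isImportFilesEvent(value: unknown): value is ImportFilesEvent {
    if (typeof value !== "object" || value === null) {
        return false;
    }
    const candidate = value as Record<string, unknown>;
    return (
        typeof candidate.ftp_path === "string" && candidate.ftp_path.length > 0 &&
        typeof candidate.s3_bucket === "string" && candidate.s3_bucket.length > 0
    );
}

// Usage: reject a malformed Input Constant early with a readable message.
const incoming: unknown = JSON.parse('{"ftp_path": "source-one", "s3_bucket": "destination-one"}');
if (isImportFilesEvent(incoming)) {
    console.log(`moving ${incoming.ftp_path} to ${incoming.s3_bucket}`);
} else {
    throw new Error("Input Constant must contain ftp_path and s3_bucket");
}
```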

FTP connection parameters

FTP connection parameters are stored in another AWS service called Parameter Store. Parameter Store is a convenient place to store configuration and secret data. The value is stored as JSON:

{
  "host": "18x.xxx.xxx.xx",
  "port": 21,
  "user": "*********",
  "password": "********"
}

When Lambda executes readFtpConfiguration(), it reads the FTP Configuration from Parameter Store.
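The body of readFtpConfiguration() is not shown in this post, so the following is only a sketch of one step it might contain: turning the raw parameter value into a typed object and failing early if a field is missing. The field names mirror the JSON above; FtpConfiguration and parseFtpConfiguration are names I made up for this example.

```typescript
// Fields mirror the JSON document shown above.
interface FtpConfiguration {
    host: string;
    port: number;
    user: string;
    password: string;
}

// Hypothetical helper (an assumption, not the project's code): validates the raw
// parameter value. Error messages never include the password.
function parseFtpConfiguration(raw: string): FtpConfiguration {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) {
        throw new Error("FTP configuration must be a JSON object");
    }
    const { host, port, user, password } = parsed as Record<string, unknown>;
    if (typeof host !== "string" || host.length === 0) {
        throw new Error("FTP configuration: 'host' is missing");
    }
    if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error("FTP configuration: 'port' must be an integer between 1 and 65535");
    }
    if (typeof user !== "string" || typeof password !== "string") {
        throw new Error("FTP configuration: 'user' and 'password' must be strings");
    }
    return { host, port, user, password };
}
```

A check like this makes a broken parameter fail with a clear message in the logs instead of an obscure connection error later on.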

Limitations and constraints

Be aware that this solution does not synchronize SFTP and S3, and it does not work in real time. Don’t expect a file to appear in S3 as soon as it is uploaded to SFTP. The transfer runs on a schedule.
Another thing is how much data it can handle. AWS Lambda has its limitations. Since this solution scans an entire folder and transfers all files from it, Lambda can hit one of its limits if there are too many files or the files are very large. In our project it worked well with a dozen files per run, none of them larger than a few KB.
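One way to soften the execution time limit is to stop early and leave the rest for the next scheduled run, which works here because files stay on the server until they are moved. The following is a sketch under that assumption, not something the original solution does; processWithinBudget and its parameters are hypothetical names. In a real handler, remainingMillis could be backed by the Lambda context’s getRemainingTimeInMillis().

```typescript
// Hypothetical sketch: process items one by one and stop as soon as the
// remaining time drops below a safety margin.
async function processWithinBudget<T>(
    items: T[],
    processItem: (item: T) => Promise<void>,
    remainingMillis: () => number,
    safetyMarginMillis: number,
): Promise<{ processed: T[]; deferred: T[] }> {
    const processed: T[] = [];
    for (const item of items) {
        if (remainingMillis() < safetyMarginMillis) {
            break; // not enough time left; leave the rest for the next run
        }
        await processItem(item);
        processed.push(item);
    }
    // Items are handled in order, so everything after the processed prefix is deferred.
    return { processed, deferred: items.slice(processed.length) };
}
```

The safety margin has to be larger than the time a single file can take, so this only helps with many small files, not with one very large file.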

If there are network issues during a transfer, the Lambda invocation will fail, but since Amazon CloudWatch Events invokes Lambda functions asynchronously, the execution will be retried. Still, I encourage you to explore the limits on your own, and let me know in the comments section if you see a way to make the solution more resilient to failures.

Tests are missing. Testing Lambda is another big topic and I wanted to focus on the architecture instead. However, you can refer to another blogpost to find out more about this topic.

Run the code with Terraform

To use Lambda and other AWS services, you need an AWS account. If you don’t have an account, see Create and Activate an AWS Account.
Another thing you’ll need to install is Terraform, as well as Node.js.

When everything is set up, run git clone to get a copy of the repository, where the full source code is shared.

$ git clone git@gitlab.codecentric.de:milica.zivkov/ftp-to-s3-transfer.git

You will run this code in a second. But before that, you’ll need to make two changes. First, go to provision/credentials/ftp-configuration.json and enter real SFTP connection parameters. Yes, this means you will need an SFTP server, too. The code will try to download from folders named source-one and source-two, so make sure they exist.
Second, go to provision/variables.tf and change the value of the default attribute. AWS requires S3 bucket names to be globally unique, and this parameter is what you use to make your bucket names unique.

Next, build the Node.js Lambda package. This produces the Lambda-Deployment.zip required by Terraform.

$ cd move-ftp-files-to-s3
$ npm run build:for:deployment
$ cd dist
$ zip -r Lambda-Deployment.zip . ../node_modules/

When Lambda-Deployment.zip is ready, start creating the infrastructure.

$ cd ../../provision
$ terraform init
$ terraform apply

If you prefer video instructions, have a look here:

Now you should see the success message “Apply complete! Resources: 17 added, 0 changed, 0 destroyed.” At this point all AWS resources should be created, and you can check them out by logging in to the AWS Console. Navigate to the CloudWatch Event Rule section and look at the schedule to find out when Lambda will be triggered. In the end, you should see files moved from

1. source-one FTP folder –> destination-one-id S3 bucket and
2. source-two FTP folder –> destination-two-id S3 bucket

Summary: Going serverless by moving files from SFTP to AWS S3

This was a presentation of a lightweight and simple solution for moving files from more traditional services to the serverless world. It has its limitations with larger-scale data, but it has proved stable at the volumes typical for smaller businesses. I hope it will help you or serve as an idea when you encounter a similar task. Thank you for reading.
