SMACK Stack DC/OS Style

In the world of the Internet of Things (IoT) you work with a continuous flow of data. There are two ways to handle it: batch processing long after the data has been collected, or analysing the data while it is still being collected. How to implement the second, Fast Data approach is described in this blog post: IoT Analytics Platform

The previous article focused on the SACK part of the SMACK stack: Apache Spark, Akka, Apache Cassandra and Apache Kafka. This article completes the picture of a SMACK stack application and shows how to actually run it on DC/OS and Mesos, which is what the M in SMACK stands for.

Infrastructure

So which infrastructure is needed to run this fine SMACK application?
First of all, Mesos and the corresponding modules such as Marathon, Chronos and DC/OS are needed. As Mesos on its own is limited and only really shines in combination with DC/OS, we’ll focus on DC/OS. How best to install DC/OS with Terraform is a topic of its own, so we’ll skip it and rely on a pre-configured environment.
Now that the underlying “operating system” is available, let us take a closer look at what else is needed. As the S in SMACK stands for Spark, we need to install the Spark framework in DC/OS.

[Image: dcos_install_spark]

This Spark framework starts and keeps track of the Apache Spark master, which will take care of our upcoming Spark jobs.
So now we’re set with Apache Spark and ready to digest incoming data, but to do this we need some data to work with. Therefore we need an Apache Kafka server first. That’s another easy task: it can be installed in the same way as Spark.
Now the consuming side is satisfied: the Spark job can consume from Kafka. We still need a way to get our data into Kafka, and the job has no place to put its results yet. For the latter, another part of the SMACK stack needs to be installed: Apache Cassandra.

[Image: dcos_install_cassandra]
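The installation steps described above can also be scripted with the DC/OS CLI. The following is a minimal sketch, assuming the `dcos` CLI is already attached to the cluster and that the Universe packages are named `spark`, `kafka`, `cassandra` and `chronos`. The `DCOS_CLI` variable and the `install_smack_packages` function are hypothetical helpers introduced for this illustration, not part of the showcase.

```shell
#!/bin/sh
# Sketch: install the SMACK building blocks from the DC/OS Universe.
# DCOS_CLI is a hypothetical override hook; set DCOS_CLI=echo for a dry run.
DCOS_CLI="${DCOS_CLI:-dcos}"

# Package names are assumptions; verify them with `dcos package search`.
SMACK_PACKAGES="spark kafka cassandra chronos"

install_smack_packages() {
    for package in $SMACK_PACKAGES; do
        # --yes skips the interactive confirmation prompt.
        "$DCOS_CLI" package install --yes "$package" || return 1
    done
}

# Usage (needs a configured cluster):
#   install_smack_packages
```

Running it with `DCOS_CLI=echo` first prints the commands without touching the cluster, which is a cheap way to review what would be installed.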

Once that is done, Cassandra still doesn’t know how to store the data, so it’s crucial to initialize the database with a schema. Chronos comes to the rescue: we can add a job that does this. But since the job needs a Docker image, we can’t add it via the web UI; we’ll use the REST API instead.
To add the new job we need to issue the following curl command:

curl -L -H 'Content-Type: application/json' -X POST -d '{json hash}' master.mesos/service/chronos/scheduler/iso8601

The {json hash} needs to be replaced by the following JSON, which defines the job. Just make sure that the $CASSANDRA_HOST placeholder is replaced by the actual DNS name of the Cassandra host.

{
    "schedule": "R0/2014-03-08T20:00:00.000Z/PT1M",
    "name": "init_cassandra_schema_job",
    "container": {
        "type": "DOCKER",
        "image": "codecentric/bus-demo-schema",
        "network": "BRIDGE",
        "forcePullImage": true
    },
    "cpus": "0.5",
    "mem": "512",
    "command": "/opt/bus-demo/import_data.sh $CASSANDRA_HOST",
    "uris": []
}
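Replacing the placeholder by hand is easy to get wrong, so the substitution and the POST can be scripted. This is a minimal sketch, assuming a POSIX shell and `curl`. The function names and the `CHRONOS_URL` variable are hypothetical, and the host in the usage line is a placeholder you have to fill in yourself.

```shell
#!/bin/sh
# Endpoint taken from the curl command above.
CHRONOS_URL="${CHRONOS_URL:-master.mesos/service/chronos/scheduler/iso8601}"

# Print the job definition with the Cassandra host filled in.
render_schema_job() {
    cassandra_host="$1"
    cat <<EOF
{
    "schedule": "R0/2014-03-08T20:00:00.000Z/PT1M",
    "name": "init_cassandra_schema_job",
    "container": {
        "type": "DOCKER",
        "image": "codecentric/bus-demo-schema",
        "network": "BRIDGE",
        "forcePullImage": true
    },
    "cpus": "0.5",
    "mem": "512",
    "command": "/opt/bus-demo/import_data.sh ${cassandra_host}",
    "uris": []
}
EOF
}

# POST the rendered job to Chronos; "-d @-" makes curl read the body from stdin.
submit_schema_job() {
    render_schema_job "$1" |
        curl -L -H 'Content-Type: application/json' -X POST -d @- "$CHRONOS_URL"
}

# Usage (replace the placeholder with your cluster's DNS entry):
#   submit_schema_job <cassandra-host>
```

Keeping rendering and submission in separate functions makes it possible to inspect the JSON with `render_schema_job <cassandra-host>` before anything is sent.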

This creates a new job in Chronos, as can be seen in the Chronos overview:

[Image: dcos_chronos_overview]

Now make sure that this job is executed at least once to create the Cassandra keyspaces needed for the application.
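Besides using the Chronos UI, the job can be started from the command line. The sketch below assumes that the Chronos version in use exposes its manual-start endpoint as `PUT /scheduler/job/<name>` under the same base path as the curl command above; check the API documentation of your Chronos version for the exact path. `CHRONOS_BASE`, `job_start_url` and `run_job_now` are hypothetical names.

```shell
#!/bin/sh
# Base path of Chronos behind the DC/OS admin router (assumption, see lead-in).
CHRONOS_BASE="${CHRONOS_BASE:-master.mesos/service/chronos}"

# Build the URL that manually starts the given job.
job_start_url() {
    printf '%s/scheduler/job/%s\n' "$CHRONOS_BASE" "$1"
}

# Ask Chronos to run the job once, right now.
run_job_now() {
    curl -L -X PUT "$(job_start_url "$1")"
}

# Usage:
#   run_job_now init_cassandra_schema_job
```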

Ease of Installation

Now that the foundation for our application is running, we need to install our own application. The digesting part, consuming from Kafka and publishing to Cassandra, is now possible. But how do we get our own Docker image containing the Akka ingest part into our cluster? That is quite easy: an application packaged in a Docker image can be installed with Marathon. Let’s start with a new application:

[Image: dcos_install_custom_json]

In the new application dialog we switch to JSON mode and add the following JSON snippet:

{
  "id": "/ingest",
  "cpus": 1,
  "mem": 2048,
  "disk": 0,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "codecentric/bus-demo-ingest",
      "network": "HOST",
      "privileged": false,
      "parameters": [],
      "forcePullImage": true
    }
  },
  "env": {
    "CASSANDRA_HOST": "$CASSANDRA_HOST",
    "KAFKA_HOST": "$KAFKA_HOST",
    "KAFKA_PORT": "$KAFKA_PORT"
  }
}

But beware: the Cassandra and Kafka host and port settings have to be changed to the actual values used in our cluster. The same needs to be done for the Akka HTTP service Docker image, which also contains the front-end part.

{
    "id": "/dashboard",
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "codecentric/bus-demo-dashboard",
            "network": "HOST",
            "forcePullImage": true
        }
    },
    "acceptedResourceRoles": [
        "slave_public"
    ],
    "env": {
        "CASSANDRA_HOST": "$CASSANDRA_HOST",
        "CASSANDRA_PORT": "9042",
        "KAFKA_HOST": "$KAFKA_HOST",
        "KAFKA_PORT": "9092"
    },
    "healthChecks": [
        {
          "path": "/",
          "protocol": "HTTP",
          "gracePeriodSeconds": 300,
          "intervalSeconds": 60,
          "timeoutSeconds": 20,
          "maxConsecutiveFailures": 3,
          "ignoreHttp1xx": false,
          "port": 8000
        }
    ],
    "cpus": 1,
    "mem": 2048.0,
    "ports": [
        8000
    ]
}

Again, it is crucial to replace the Cassandra and Kafka host placeholders with the actual host names. Now the cluster contains the applications for accepting the incoming data and for visualizing it.
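Instead of editing both definitions by hand and pasting them into the web UI, the placeholders can be filled in by a script and the result handed to Marathon through the DC/OS CLI. This is a minimal sketch, assuming the two JSON definitions above are saved as `ingest.json` and `dashboard.json` (hypothetical file names) and that the `dcos` CLI is attached to the cluster. The helper names are made up for this example.

```shell
#!/bin/sh
# Replace the $CASSANDRA_HOST, $KAFKA_HOST and $KAFKA_PORT placeholders
# in an app definition read from stdin with the values of the shell
# variables of the same name.
fill_placeholders() {
    sed -e 's|\$CASSANDRA_HOST|'"$CASSANDRA_HOST"'|g' \
        -e 's|\$KAFKA_HOST|'"$KAFKA_HOST"'|g' \
        -e 's|\$KAFKA_PORT|'"$KAFKA_PORT"'|g'
}

# Render a template file and register the result as a Marathon app.
deploy_app() {
    template="$1"
    rendered="$(mktemp)" || return 1
    fill_placeholders < "$template" > "$rendered"
    dcos marathon app add "$rendered"
    status=$?
    rm -f "$rendered"
    return "$status"
}

# Usage (fill in the values of your cluster first):
#   CASSANDRA_HOST=<cassandra-host>; KAFKA_HOST=<kafka-host>; KAFKA_PORT=<kafka-port>
#   deploy_app ingest.json
#   deploy_app dashboard.json
```

Only the three placeholders are substituted; fixed values such as the Cassandra port in the dashboard definition pass through unchanged.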

[Image: dcos_chronos_installation_json]

[Image: dcos_marathon]

As our data is streaming into the system and we’re ready to show it, it’s crucial to have the Spark jobs running. The Spark jobs needed for this platform are installed via the DC/OS CLI.

dcos spark run --submit-args='--supervise -Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class de.nierbeck.floating.data.stream.spark.KafkaToCassandraSparkApp https://s3.eu-central-1.amazonaws.com/codecentric/big-data/bus-demo/bus-demo-digest-assembly-0.1.0.jar METRO-Vehicles $CASSANDRA_HOST $CASSANDRA_PORT $KAFKA_HOST $KAFKA_PORT'

Again, make sure you replace the placeholders with the actual IPs and ports.
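To avoid submitting the job with a forgotten placeholder, the argument string can be assembled from explicit parameters. The following sketch reuses the class name, JAR URL and options from the command above. `digest_submit_args` and `submit_digest_job` are hypothetical helper names, and the argument order (Cassandra host, Cassandra port, Kafka host, Kafka port) simply mirrors the placeholders in that command.

```shell
#!/bin/sh
DIGEST_JAR="https://s3.eu-central-1.amazonaws.com/codecentric/big-data/bus-demo/bus-demo-digest-assembly-0.1.0.jar"
DIGEST_CLASS="de.nierbeck.floating.data.stream.spark.KafkaToCassandraSparkApp"

# Assemble the --submit-args value from the four connection settings.
digest_submit_args() {
    printf '%s' "--supervise -Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class $DIGEST_CLASS $DIGEST_JAR METRO-Vehicles $1 $2 $3 $4"
}

# Submit the digest job; refuses to run unless all four values are given.
submit_digest_job() {
    if [ "$#" -ne 4 ]; then
        echo "usage: submit_digest_job CASSANDRA_HOST CASSANDRA_PORT KAFKA_HOST KAFKA_PORT" >&2
        return 2
    fi
    dcos spark run --submit-args="$(digest_submit_args "$@")"
}

# Usage:
#   submit_digest_job <cassandra-host> <cassandra-port> <kafka-host> <kafka-port>
```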

How do we know it is working?

Now that everything seems to be working, how do we know it actually does? One possibility is to check whether the Spark jobs are actually running. To do so, navigate to the DC/OS overview and select the Apache Spark master. It lists the running Spark jobs, and the one we just deployed should be shown there. After validating that this job is running, we might want to know how much data is streamed into the system. For that you need to find the IP address of the KafkaToCassandraSparkApp driver. Open that IP address on port 4040 and you’re able to see how the data is streamed through the system.

[Image: dcos_spark_runtime]
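The same check can be done without a browser. The sketch below assumes you have already looked up the driver's IP address as described above, and it queries the `/api/v1/applications` endpoint of Spark's monitoring REST API, which the driver serves on the same port as its UI. `spark_ui_url` and `list_spark_apps` are hypothetical helper names.

```shell
#!/bin/sh
# Build the address of the Spark driver UI from the driver's IP address.
spark_ui_url() {
    printf 'http://%s:4040\n' "$1"
}

# List the applications known to the driver as JSON.
# -f makes curl fail on HTTP errors, -sS hides progress but keeps error messages.
list_spark_apps() {
    curl -fsS "$(spark_ui_url "$1")/api/v1/applications"
}

# Usage:
#   list_spark_apps <driver-ip>
```

A non-zero exit status from `list_spark_apps` is a quick hint that the driver is not reachable on that address.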

Now that we know the data is actually streamed into the system, let’s take a look at the front end.
When navigating to the front end on port 8000, the port configured for the dashboard application above, you’ll see the OpenLayers front end where the latest bus data should be flowing in.

SMACK Stack DC/OS style …

… as this article has shown, doing SMACK the DC/OS way simply rocks. The ease of use and readiness of the DC/OS platform are of great value to the user.
Setting up this showcase takes only about half an hour, and then your IoT analytics platform is up and running. From both the application developer’s and the operations point of view, this platform has shown its readiness.
One part is missing in this article and might find its way into a future one: the automatic setup of such a platform. Within the sources of the showcase you’ll find hints on how to achieve that with Terraform, but it’s out of scope for this article.
DC/OS with an automated setup will give you this application in no time compared to creating a dedicated system on exclusive hardware. This leads us to the next extra you get from running DC/OS, either on-premises or cloud-hosted: your application scales as easily as this setup, just by adding more nodes.

Links:

  • https://blog.codecentric.de/en/2016/07/iot-analytics-platform/
  • https://github.com/zutherb/terraform-dcos
  • https://github.com/ANierbeck/BusFloatingData
Achim Nierbeck

Achim Nierbeck works as a senior IT consultant for codecentric AG in Karlsruhe.
He has more than 15 years of experience in the field of Java Enterprise. In recent years he has focused on working with the SMACK stack and Apache Cassandra.
In his private time the Apache member serves on the PMC of Apache Karaf and is the lead developer of the OPS4j Pax Web project.
