
The SMACK stack – hands on!


The SMACK stack is all the rage these days. Instead of just talking about it, this post guides you through the steps of setting up a simple SMACK stack so you can get hands-on experience with the tools.
In the first step, we will set up a DC/OS-managed Mesos datacenter on AWS. This provides the basic infrastructure on which we then install Kafka, Cassandra and Spark. Finally, we deploy simple ingestion and digestion jobs that pump data right across the toolchain.

Step 0: Prerequisites

If you want to try the examples yourself, you’ll need the following:

  • Vagrant and Virtualbox
  • Ansible
  • An AWS account. The AWS resources used in this example exceed the free tier; regular AWS fees apply. You have been warned!
  • An EC2 key pair called dcos-intro
  • A Twitter access key

The source code is available at https://github.com/ftrossbach/intro-to-dcos.

Step 1: sMack stands for Mesos

DC/OS is Mesosphere’s distributed operating system on top of Apache Mesos. Version 1.7 has just been released as open source. DC/OS provides AWS CloudFormation templates to set up the datacenter. The following diagram gives a high level overview of the resources that are created for the example in this blog post:

 


The CloudFormation template creates these AWS resources

In the public subnet, which is reachable from the internet, we can see a master node and a public Mesos slave. One master is sufficient here as there is no need for high availability (HA); DC/OS also provides templates that provision three masters and thus offer HA capability. The private subnet is only accessible from the public subnet. It contains five Mesos slaves that will run any application that does not need exposure to the outside world. If you want to know more, have a look at my colleague Bernd Zuther’s GitHub repository. Bernd gives a more detailed overview of DC/OS and has also transformed the CloudFormation template into a more easily readable Terraform template.

To get the stack running, we need to check out the project from GitHub. We also have to provide AWS credentials in the form of the following environment variables:

  • DCOS_INTRO_AWS_REGION
  • DCOS_INTRO_AWS_ACCESS_KEY
  • DCOS_INTRO_AWS_SECRET_KEY
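Assuming a Unix shell on the host machine, the variables can be exported like this (the values shown are placeholders taken from the AWS documentation – substitute your own region and credentials):

```shell
# Placeholder values - replace with your own AWS region and credentials
export DCOS_INTRO_AWS_REGION="eu-central-1"
export DCOS_INTRO_AWS_ACCESS_KEY="AKIAIOSFODNN7EXAMPLE"
export DCOS_INTRO_AWS_SECRET_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```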

Calling vagrant up then sets up an Ubuntu VM that will serve as our gateway to DC/OS. After the machine is created, the Ansible CloudFormation task takes the template and creates the stack in AWS. This happens synchronously, so it will take a couple of minutes. Once Ansible is done, we can take a look at our brand new instances:


The instances that have been created in AWS as shown in the AWS console

Ansible also prints the URL of the master load balancer:

TASK [dcos-cli : This is the URL of your master ELB] ****************
ok: [default] => {
    "msg": "http://dcos-test-ElasticL-***.elb.amazonaws.com"
}

The most important URLs for accessing your datacenter are:

  • DC/OS dashboard: http://<master_load_balancer>/
  • The task scheduling framework Marathon: http://<master_load_balancer>/marathon
  • Plain Mesos: http://<master_load_balancer>/mesos

Let’s take a look at our brand new DC/OS dashboard:


The DC/OS dashboard that is available at http://<master_load_balancer>

As we have not yet deployed any applications, all resources are still up for grabs.

Now that we have Mesos running in the form of DC/OS, we are going to install the remaining SMACK infrastructure components.

Step 2: smaCk stands for Cassandra

Ansible has already installed the DC/OS CLI, a command-line interface for managing applications in the datacenter, in our Vagrant box. From within the Vagrant box (use vagrant ssh), the Cassandra Mesos framework can be installed with one simple shell command:

dcos package install cassandra

This will start three Cassandra nodes with some defaults regarding memory and disk requirements. Configuration options are described at https://dcos.io/docs/1.7/usage/tutorials/cassandra/.

If you look at the DC/OS dashboard, you might see that Cassandra reports as unhealthy. It should turn to healthy once all Cassandra nodes are up and running. You can also now see that some resources have been allocated in DC/OS.

Setting up a Cassandra keyspace is unfortunately not yet possible via the DC/OS CLI. We need to log onto one of our master nodes using SSH and the private key of the key pair “dcos-intro”:

ssh core@<mesos_master_ip> -i dcos-intro.pem

We use Docker to start a cqlsh Cassandra command line:

docker run -ti cassandra:2.2.5 cqlsh node-0.cassandra.mesos

Once in cqlsh, we can create the required keyspace and table:

CREATE KEYSPACE IF NOT EXISTS dcos WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};
CREATE TABLE IF NOT EXISTS dcos.tweets(date timestamp, text text, PRIMARY KEY (date, text));

Yes, this is going to be yet another Twitter-based example.
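If you want to convince yourself that the schema is in place, you can run a quick smoke test from the same cqlsh session. The sample row here is, of course, made up and will simply sit alongside the real tweets later:

```sql
INSERT INTO dcos.tweets(date, text) VALUES (dateof(now()), 'hello SMACK');
SELECT * FROM dcos.tweets LIMIT 10;
```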

Step 3: smacK stands for Kafka

As with Cassandra, we are going to use command line tools to set up Kafka.
Issuing the command

dcos package install --package-version=0.9.4.0 kafka

installs the DC/OS Kafka service. We are using version 0.9.4.0 of this service because the most recent version that was released with DC/OS 1.7 caused some incompatibilities with the frameworks used in the example.

Unlike with Cassandra, the installation of the framework does not lead to an automatic startup of Kafka brokers. We are going to use the DC/OS Kafka command line to add and start them:

dcos kafka broker add 0..2 --port 9092

This command adds three brokers – broker-0, broker-1 and broker-2 – that are supposed to run on port 9092. Further configuration options are available at https://docs.mesosphere.com/usage/services/kafka/.
Adding brokers is not the same as starting them; issuing

dcos kafka broker start 0..2

does that job.

We also use the CLI to create a Kafka topic very creatively called “topic” with ten partitions and a replication factor of three:

dcos kafka topic add topic --partitions 10 --replicas 3

Step 4: Smack stands for Spark

You can probably guess by now how to install Spark, but here it is anyway:

dcos package install spark

This sets up Spark on Mesos in so-called cluster mode. Spark jobs are submitted to a MesosClusterDispatcher, which distributes them on demand to available cluster resources. There are no dedicated Spark executors!

This completes the installation of the SMACK infrastructure. A look at the Mesos Web UI reveals all running tasks:


The Mesos Web UI at http://<master_load_balancer>/mesos shows the running tasks

The example application

By now we have all the infrastructure available to run a SMACK application, so let’s just do that.
The example application streams tweets via Kafka and Spark into Cassandra. It is logically divided into two parts:

  • data ingestion
  • data digestion

We’ll see more about how these parts are set up in the following sections.

Step 5: smAck stands for Akka – data ingestion

The application reads tweets using the twitter4j library and pushes them onto Kafka using Akka reactive streams and the Reactive Kafka extension. The code is available in the GitHub repository, and we will not go into detail on it as it is a straightforward Akka Streams implementation. Much more interesting is the deployment aspect. The application is packaged as a Docker container and deployed to a public Docker registry at https://hub.docker.com/r/ftrossbach/dcos-intro/. To deploy the app in our datacenter, we use DC/OS’ scheduler Marathon and the DC/OS CLI. A new deployment in Marathon is defined in JSON. For the ingestion, we use the following configuration:

{
 "id": "ingestion",
 "container":{
   "docker":{
     "forcePullImage":true,
     "image":"ftrossbach/dcos-intro",
     "network":"BRIDGE"
     "privileged":false
   },
   "type":"DOCKER"
 },
 "cpus":1,
 "mem":2048,
 "env":{
   "KAFKA_HOSTS":"broker-0.kafka.mesos",
   "KAFKA_PORT":"9092",
   "twitter4j.debug":"true",
   "twitter4j.oauth.accessToken":"****",
   "twitter4j.oauth.accessTokenSecret":"****",
   "twitter4j.oauth.consumerKey":"****",
   "twitter4j.oauth.consumerSecret":"****"
 }
}

In this file, we specify the Docker container we want to deploy, some variables like the number of CPUs or memory, and most importantly some environment variables for the container.
We trigger the deployment by issuing

dcos marathon app add /vagrant/provisioning/ingestion.json

You will need to include valid Twitter credentials in that file.

If you want to check the status of the application, you can take a look into Marathon at http://<master_load_balancer>/marathon. The plain Mesos UI at http://<master_load_balancer>/mesos also has this information. You have access to log files in both UIs.


The Marathon page for the ingestion application is reachable at http://<master_load_balancer>/marathon/ui/#/apps/ingestion

If the application reports as “running”, you’re fine. If it is stuck in a deploy-wait cycle, check the task’s stderr log.

Step 6: Smack stands for Spark (yet again…) – data digestion

The final step is running the data digestion. With Kafka, we have a distributed, high-performance message broker, but we need to do something with the data, as Kafka itself does not offer any analytic capabilities whatsoever. We use a Spark Streaming job that is also included in the example (de.codecentric.dcos_intro.spark.SparkJob). To keep things uncomplicated, this example does not really make use of the true power of Spark but plainly writes the tweets it reads from Kafka into a Cassandra table.

To deploy a Spark job using the DC/OS CLI, the jar must be available in a place where DC/OS can access it (https://dcos.io/docs/1.7/usage/tutorials/spark/). A public file on AWS S3 will do the trick for us. Once uploaded, the job can be triggered by issuing

dcos spark run --submit-args='\
--class de.codecentric.dcos_intro.spark.SparkJob \
https://s3.eu-central-1.amazonaws.com/dcos-intro/dcos-intro-assembly-1.0.jar \
topic node-0.cassandra.mesos 9042 \
broker-0.kafka.mesos:9092'

The Spark Mesos dispatcher has a Web UI that lets us monitor the job status. It is accessible at
http://<master_load_balancer>/service/spark.

Once the job is running, we can log into Cassandra using the same docker command as above when we created the keyspace and query the tweets table (extract):


This shows the results of a Cassandra query

Conclusion and outlook

Congratulations, you have got the SMACK stack up and running and have deployed an application.
This post gave you the basic steps to set up the stack on AWS and to run simple ingestion and digestion jobs. This is great for exploring the features of DC/OS and how it enables you to run a fast data application, but of course many things were left out – our example application is neither fast data nor big data.

If you want to set up the stack productively, you might also want to:

  • check out the brand new DC/OS homepage
  • read up on Marathon
  • explore service discovery via Mesos DNS
  • think about the number of Kafka and Cassandra nodes as well as their memory requirements
  • restrict access to DC/OS – with this template, the master nodes are accessible to everyone on the internet
  • explore the options of the DC/OS CLI (dcos --help from within the Vagrant box)
  • have a look at other official DC/OS packages in the Mesosphere Universe

In order to clean up, you can delete the stack in the AWS CloudFormation console. This will destroy all created resources.
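Alternatively, if you prefer the command line and have the AWS CLI configured, the same cleanup can be triggered with a single command (the stack name is whatever you chose when the stack was created):

```shell
# Replace <stack-name> with the name of your CloudFormation stack
aws cloudformation delete-stack --stack-name <stack-name>
```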

Florian Troßbach

Florian Troßbach has his roots in classic Java Enterprise development. After a brief detour in the world of classic SAP, he joined codecentric as an IT Consultant and focusses on Fast Data and the SMACK stack.


Comments

  • 1. June 2016 von Anuj Kumar

    I followed your blog to get an initial setup of SMACK on AWS and I am able to get Mesos, Cassandra, Kafka and Spark up and running.

    I have a slightly different use case than yours, where I need to publish to Kafka from outside of the Mesos container using a NiFi workflow. This means I need a public broker URL for Kafka. I looked into the Kafka service’s configuration and also searched the internet, but I couldn’t find a public broker URL that I can use within my NiFi workflow.
    Could you please point me in some direction as to how I can achieve it?

    • Florian Troßbach

      16. June 2016 von Florian Troßbach

      This is gonna be tricky because Kafka is very peculiar with host names. You could put all your brokers in a public subnet, but even that is gonna be tricky with naming if you don’t want to route all inter-broker communication across the internet. Confluent’s REST-Proxy might be something for you.
