Apache Spark, Amazon S3 and Apache Mesos

Apache Spark, Amazon S3 and Apache Mesos

14 Kudos

The hardest problem with having nice technologies at your service is making them collaborate.

Amazon S3

Let’s begin with Amazon S3. As the name suggests, it is a Simple Storage Service which has only one purpose: to give you unlimited cloud-based storage. If you are accessing S3 from EC2 instances (or, in general, from Amazon services) then you don’t pay anything besides the cost for hosting the data; on the other hand, if you download the data outside the Amazon network, you will also pay for the outbound data transfer. This piece of tech is quite easy to understand: files on the cloud.

Apache Spark and Mesos

Apache Spark is an engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. In practical terms, it is an implementation of the MapReduce concept extended to support interactive queries and stream processing. At the moment I am writing this article, the reference version is Apache Spark 1.5.1.

Apache Mesos, instead, is a cluster manager, especially useful for running heterogeneous tasks. There are many discussions comparing Mesos to Yarn so I will not go in details here.

How is Spark running on Mesos?

There is already the official guide that thoroughly explains how to configure Spark to work with Mesos but I want here to linger on how a Spark application is executed on Mesos, in practical terms. So, the Spark configuration is essentially split in two files: conf/spark-defaults.conf and conf/spark-env.sh. The first file contains the defaults values that are only passed to spark-submit, the second one is sourced when running spark programs in general. Thereof, if we read the documentation we find that we have to set the spark.executor.uri in the following way

or just add that the following line the spark-defaults.conf file

So what happens when we run our Spark application on Mesos? Well, Mesos does not really know what Spark is (to a certain extent), thus it wants to “start” Spark in all the slaves and let Spark handle the rest. Mesos does not know where Spark is, therefore it will fetch the “Spark’s executor” from the given URI, then will uncompress the tarball and run Spark.

However, setting the spark.executor.uri this is not the only way to run Spark via Mesos. There is another option, which is setting the spark.mesos.executor.home parameter. In this case, we have to choose a local path (local to all the nodes), in which Spark has already been installed. Therefore, the same path must exist and be valid for all the nodes in the cluster. In this case, each slave will just execute Spark from the given path.

How can I configure Amazon S3?

Now that we have a working Mesos+Spark cluster and we got the basics, we have to configure Amazon S3. Around this topic, there is a lot of confusion on the Internet on the different protocol and which one to use and how. You can read the S3 wiki by Apache to understand the differences between the protocols. Bottom line is this one: use only s3a. At one condition, though: Use s3a protocol ONLY with Hadoop 2.7.1 or higher! I really mean this. If you do not have Hadoop 2.7.1 installed, then stop reading and upgrade it right away. I mean it. The long-story short is that the s3a protocol has been added to Hadoop 2.6.0 so, in practice, you can use it with such a version. However, it is far from being stable (and I wish they had never released it). It costed my company more than one week of investigations and slowdowns to figure this out. Anyway, let’s go back to the real topic. At this point you may wonder: let’s just use s3a://bucketname/mydata and I am set, right? No. Unfortunately, Spark with Hadoop-2.6.0 does not include the s3a filesystem nor the Amazon AWS SDK.

Nevertheless, do not lose hope! We can still inject the jar files we need during Spark’s execution. The best way to do that is to add the following two lines to spark-defaults.conf (as the use of SPARK_CLASSPATH is deprecated)

You have, of course to substitute /path/to/hadoop/ with the real path in which you installed Hadoop on your system. Again an important obervation: this path is local to all the nodes in the cluster. Therefore, it must exist in all the machines.

Ok, now you have the libraries, the filesystem, the protocols but you are still missing one key element: the access and secret keys. In order to access Amazon S3, you need both to authenticate to the server. On the Internet there are thousands of different comments, suggestions, methods on how to do that. Of course, you can export those in the execution scripts but this is rather stupid: you will have to adapt each and every script to export those keys before executing a job. The best way is the following: create a hdfs-site.xml and place it in the Spark’s conf directory. The file can look as the following:

This hdfs-site.xml must reside in all the nodes. Now the tricky part is the following: if you read the initial part of this post, you will be remembering that Mesos can run Spark in two ways: executing it from the Spark’s executor URI or executing it from the given local path. If you decided for the first way, then you will have to repack the Spark’s tarball and place the hdfs-site.xml in the conf folder. If, instead, you have chosen the second option, you will have to copy the hdfs-site.xml to all the nodes in the Spark’s conf folder.

Tiptop! We are good to go!

We are almost done

At this point we have the keys, protocols, libraries and everything we need. If you start your application now, it will most likely work. Anyway, this is not yet over. There are a couple of default settings which are slowing down your processing or even jeopardising the stability of your jobs. The most subtle one is spark.hadoop.fs.s3a.connection.maximum which is, by default, set to 15. This translates to the following: if you have more than 15 cores, you will only have 15 connections open at the same time. Therefore all the other N-15 cores are not fetching data. Even worse, if you are processing large unsplittable files (e.g., GZIP), then you will get timeout errors. Why? Well, because Spark wants to access those files but there will be only 15 actually working. If the GZIP files are big enough (let’s say a couple of GBs), then the other to-be-opened connections will timeout compromising the whole job itself. The best way is to set this value to the maximum number of cores in your cluster (or just to MAX_INT to cut to the chase).

Known bugs

There is one small bug which I really do not know how to fix: sometimes it happens that a job hangs with one task being “stale” forever. This means that the job does not crash nor proceeds. The only solution I have found is to set spark.speculation=true which is recovering from this stale situation. However, be careful: if you are writing on a database then you will end up having multiple insertions because speculation mode re-execute slow tasks and prefers the one that terminates first (you can tune the other speculation’s settings to try to mitigate this phenomenon).

October 29, 2015 0 comments Read More
Merge CSV files keeping one header

Merge CSV files keeping one header

10 Kudos

There are typical situations in life that require lots of engineering to be addressed. Other situations don’t. This is exactly one of those commands you may want to learn by heart but, for those who can’t, that’s why blogs exist.

Suppose you have collected statistics for several days and you have lots of files having the same structure, e.g.:


and you want to create one big file that keeps the header from just the first one and appends all the files so that you can quickly load it into your favourite tool for further analysis.
Here is the trick:

FNR is the number of records read by awk (per file), NR is the number of lines read overall. Therefore, the condition FNR==1 && NR>1 evaluates to true only if the line being evaluated is the first line of the file and we already read at least one line overall (so it is not the first line we read). What happens when it is true? {next;}, hence the line gets ignored.

I am pretty sure that some of you just came here, copied&paste the line above, tested and found out that the command never ends. Well, if this is the case the file you generated is probably taking all of your free disk space. Yummie. Why? Probably because the output file already existed before and now it matches the wildcard condition of awk. So, make sure that the output file doesn’t exist when executing the command or, at least, doesn’t end in your shell expansion.

May 19, 2015 1 comment Read More
OpenAFS and Mac OS X Yosemite, El Capitan

OpenAFS and Mac OS X Yosemite, El Capitan

90 Kudos

Long story short: Kerberos on the Mac, starting with Yosemite, does not support anymore weak ciphers (such as DES).

AFS, on the other hand, works with DES.. therefore.. no go. The best thing to do would be to try to migrate OpenAFS to support different ciphers but this requires maintenance on the AFS server, risking damage and data loss.

Therefore this (quite ugly, but still..) solution seems to overcome the limitation of the Kerberos installation provided by default with Mac OS X.

Step 1: System cleanup

If you have installed OpenAFS or already configured kerberos on your machine, uninstall everything and delete /Library/Preferences/edu.mit.Kerberos.

Please reboot.

Step 2: Install heimdal kerberos

Download and install this: http://www.h5l.org/dist/src/heimdal-1.5.3.dmg

This is the heimdal kerberos, the vanilla version. Therefore, this supports the aforementioned weak DES cipher.

Step 3: Install OpenAFS for Yosemite

As may already know, there is no official OpenAFS version for Yosemite. I compiled and uploaded one for you. You can download it here: https://dl.dropboxusercontent.com/u/355313/openafs/OpenAFS-1.6.10-2-gb9a15b-dirty-Yosemite.dmg

Download, open and install it.

Step 4: Configure ’em all

First of all, let’s configure Kerberos. You should already have the configuration for Kerberos. Make sure you have it in the correct path that, on Yosemite, is /etc/krb5.conf. Then make sure you add the allow_weak_crypto = true line to the libdefaults section.

Then configure OpenAFS.

  • Configure AFS by editing the ThisCell and CellServDB files accordingly. Don’t reboot now.
  • OpenAFS requires a kernel extensions to work properly. Unfortunately (yes, again), unsigned kernel extensions cannot be loaded on boot in Yosemite. However, this problem can be solved by using modifying the boot parameter of the kernel:
  • Now, reboot the mac
  • When everything is restored, make sure you apply the necessary settings and add AFS icon to the menu bar for quicker access:
    • Go to System Preferences > OpenAFS.
    • AFS Menu: checked
    • Backgrounder: checked
    • Use aklog: checked

Step 5: How to connect

Each time you want to use AFS, you must do the following:

  • open Terminal.app
  • issue

Everything should be working.

If you read this guide and something didn’t work, make sure you followed each step in the precise order they are written. If something is still not working properly, just drop a line in the comments and we will try to sort it out.


After upgrading Yosemite to 10.10.3 or, in general, after every system update I noticed that I have to reinstall OpenAFS or, at least, re-issue the nvram command to let unsigned kernel extensions to be loaded again.

Update (take-two)

Your File System® offers a Yosemite-compatible version of OpenAFS, shipped with Heimdal Kerberos version. You can access the download page clicking here. The benefit of using this version is that the kext file is signed, therefore no need to set the nvram parameters to allow unsigned extensions to be executed.
Please notice that in this case, you have to create the krb5.conf in /private/var/db/yfs/etc.

Update (El Capitan)

Starting with Mac OS X El Capitan (10.11), Your File System® published a new client, which you can download here. If you are upgrading from any other version, please mount the image and uninstall OpenAFS completely (you can find the scripts in the Extras folder, within the DMG). After rebooting the machine, install the AuriStor client and configure Kerberos in the same way as before (see bottom of this post for a sample). After that:

    • Open a Terminal and open with sudo /etc/yfs/cellservdb.conf. Remove the content of the file and add following lines

  • Edit /etc/yfs/thiscell.conf
  • Reboot your Mac
  • Open the Terminal.app and get a Kerberos ticket (refer to Step 5) and then “aklog”

A sample of a working /etc/krb5.conf file could be:

December 26, 2014 58 comments
Convert swiss national grid coordinates (CH1903) to WGS1984

Convert swiss national grid coordinates (CH1903) to WGS1984

13 Kudos

The World Geodetic System (or WGS) is a worldwide standard for the cartography. The well-known concepts (such as latitude and longitude, for example) are part of this standard.

Yet, in Switzerland, there is a different national standard that is called CH1903 (or Landesvermessung 1903, LV03).

Implementations, in the most widespread programming languages, to convert coordinates from the Swiss standard to WGS1984 are provided by swisstopo. However, the Python one is missing. You can grab it here.

Click here to get the code!

October 13, 2014 0 comments Read More
[HOW-TO] Homebrew: remove a formula and all its dependencies

[HOW-TO] Homebrew: remove a formula and all its dependencies

27 Kudos

Since homebrew does not officially support an automated way to do that, I created a small zsh function to remove a formula and all its dependencies.

I use zsh on the Mac so the function works with it. Feel free to adjust it if you are using bash or something else.

brew-remove-with-deps() {

  if [ "x$formula" = "x" ]
     echo "Invalid empty parameter"
     echo "Removing" "$formula" "and all its deps.."
     brew rm $formula
     brew rm $(join <(brew leaves) <(brew deps "$formula"))

And, just for testing purposes:

[email protected] ~ % brew-remove-with-deps sloccount
Removing sloccount and all its deps..
Uninstalling /usr/local/Cellar/sloccount/2.26...
Uninstalling /usr/local/Cellar/md5sha1sum/0.9.5...

Et voilà.

[1] – Stackoverflow – Uninstall / remove a Homebrew package including all its dependencies

April 16, 2013 0 comments Read More