Start using Spark with Ignition!

Apache Spark is a cluster computing framework designed to provide ease of use, fast processing and general-purpose pipelines compared to traditional systems like Apache Hadoop. Here at Chaordic we are using it to build scalable products, like Personalized Emails, replacing a stack comprising Hadoop jobs, databases and services like Amazon Dynamo, MySQL, SQS, Redis, and internal systems built on Jetty.

Even though Spark is easy to use and very powerful, we've learned better ways to work with it and condensed them into an open-source project called Ignition, consisting of:

  • Ignition-Core, a library with Spark and Scala goodies, including a command-line cluster management tool
  • Ignition-Template, an SBT template for Spark and Scala that includes Ignition-Core, gives some examples, and suggests some best practices for working with Spark

Starting the Ignition!

This tutorial will first show how to run a Spark job locally, and then how to run one for real in the cloud (AWS).

Downloading and running a local job setup

git clone --recursive https://github.com/chaordic/ignition-template
cd ignition-template
./sbt 'run WordCountSetup'

Note: it may take a while to download all dependencies the first time. Be patient.

This job will download files from an S3 bucket containing some Project Gutenberg books and output the top 1000 words in them. It's around 400 MB of data, which is small enough to run locally but large enough to make you bored if your internet connection or computer is slow =)
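
For the curious, the heart of a word count like this in plain Spark and Scala looks roughly like the sketch below. This is a simplified, hypothetical version for illustration only (the object name and the S3 path are made up); the actual WordCountSetup shipped with Ignition-Template is organized as a job setup and may differ in its details.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of a Spark word count, not the actual WordCountSetup code.
// The S3 path and object name below are illustrative only.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch"))

    // Read the book files (the master URL is supplied by the environment that submits the job)
    val books = sc.textFile("s3n://some-bucket/gutenberg/*.txt")

    // Split lines into words, count occurrences and take the 1000 most frequent words
    val top1000 = books
      .flatMap(_.toLowerCase.split("""\W+""").filter(_.nonEmpty))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .top(1000)(Ordering.by[(String, Long), Long](_._2))

    top1000.foreach { case (word, count) => println(s"$word\t$count") }
    sc.stop()
  }
}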

Running a job setup in an AWS cluster

Prerequisites

You will need:

  • An AWS account with an Access Key ID and Secret Access Key (see this doc)
  • An EC2 key pair in each region where you'll run the cluster (this tutorial assumes us-east-1, N. Virginia). We recommend following the convention of naming the key "ignition_key" and saving the key file in ~/.ssh/ignition_key.pem with the correct permissions.
  • Python and pip

For instance, on an Ubuntu system you can do this setup with the following commands in a shell:

# AWS credentials used by the cluster tool
export AWS_ACCESS_KEY_ID=<your key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
# Restrict permissions on the key file
chmod 400 ~/.ssh/ignition_key.pem
# Install pip and the cluster tool's Python dependencies
sudo apt-get install python-pip
sudo pip install -r ignition-template/core/tools/requirements.txt

Running it!

Disclaimer: you need to know what you are doing. Running this example will incur charges. If you cancel the setup during the spot request phase (e.g. your internet connection fails and the script aborts), spot requests may be left open and will need to be cancelled manually. The price you set below is the maximum Amazon may charge per spot machine, and you also pay for an on-demand master. If you change the Amazon region, you may also face data transfer charges. Make sure you can and want to afford it.

cd ignition-template
core/tools/cluster.py launch my-spark-cluster 2 --spot-price 0.1 --instance-type r3.xlarge

This will launch a cluster consisting of an on-demand master (m3.xlarge by default) and 2 slaves at a maximum spot price of USD 0.10 in the us-east-1b AZ. If you can't get machines, check the prices in the AWS Console and perhaps change the availability zone (e.g. -z us-east-1c).

Then you can run the Word Count example:

core/tools/cluster.py jobs run my-spark-cluster WordCountSetup 20G 

To check the job's progress, open the web interface at the master's IP address, port 8080.

To get the master's IP address, use:

core/tools/cluster.py get-master my-spark-cluster 

Running in the cluster should be much faster than running locally. The last parameter is the amount of memory for this job. As an r3.xlarge has about 30 GB of RAM, 20 GB for the job is a reasonable value. If you use bigger machines, an r3.2xlarge for instance, we recommend running more workers per machine (e.g. --worker-instances=2) so that each machine is split into multiple workers.

After playing with the cluster, you must destroy it:

core/tools/cluster.py destroy my-spark-cluster 

Note: the destroy command defaults to region us-east-1; if you launched your cluster in another region, you need to specify it with the --region parameter.

For more options and to see the defaults, use the -h option with each command or subcommand:

core/tools/cluster.py -h
core/tools/cluster.py launch -h
core/tools/cluster.py jobs -h
core/tools/cluster.py jobs run -h

Next steps

See the provided README file for more information and explore the source code!

There are still more examples and documentation to come. Stay tuned!