Tuesday, July 17, 2012

Running your Scalding jobs in Eclipse

In a nutshell, Scalding is a Cascading DSL in Scala. If you know what that sentence means, skip to the meat below; otherwise, read the next section for a bit of background. Note: If you are reading this again, I have updated the sections below to rely on Maven instead of SBT, included a link to my sample project to help you get started, and fixed some serious omissions that I found when I revisited my own blog post to set up a new Scalding project.

Introduction

The Hadoop ecosystem is full of wonderful libraries, frameworks and tools that make it easier to query your big data sets without having to write raw Map/Reduce jobs. Cascading (http://www.cascading.org) is one such framework that, simply put, provides APIs to define workflows that process your large datasets. Cascading lets you think about data in terms of pipes through which tuples flow, and those tuples can be filtered and processed with operations. Couple this with the fact that pipes can be joined together or split apart to produce new pipes, and that Flows (which connect data sources and data sinks) can be tied together, and you can create some pretty powerful data workflows.
Cascading is a wonderful API that makes all of these great things possible, but because it's Java and the language is verbose, it's always a bit hard to get started with Cascading from scratch. I've been using it for years, and I find that each new project requires me to go back to a previous one and copy/paste some boilerplate code. I think others had similar problems and hence came out with numerous DSLs (Domain Specific Languages) written in Ruby, Clojure (Cascalog), Scala (Scalding), etc. that wrap Cascading to make it easier to write these flows.

I won't pretend to be a Scalding expert, so I advise you to visit the project site (https://github.com/twitter/scalding/), but what I do know is that it's a Scala DSL around Cascading with some slight tweaks to make it easy to build big data processing flows in Scala. The API is designed to look like the Scala collections API, so the same code that works on a small list of data could also be used to work on a stream of billions of tuples. I wanted to play with Scalding, so I read the wiki page, downloaded it and copied the tutorial, but then I wondered: how can I run this in Eclipse? I wanted that mainly because Eclipse lets me write, debug and run (mostly locally) my jobs without having to drop to the command line for some tool. At the end of the day, Scala is a JVM language, so it must be able to run in Eclipse, right?
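
To make the collections analogy concrete, here is a minimal sketch of word count over a plain in-memory Scala list, using only the standard library (no Scalding or Cascading involved). The names LocalWordCount and countWords, and the sample lines, are hypothetical and made up for illustration. As I understand it, a Scalding job expresses the same split/group/count shape, just against pipes of tuples instead of a Seq.

```scala
// Hypothetical illustration: plain Scala collections only, no Scalding dependency.
object LocalWordCount {
  // Split each line on whitespace, drop empty tokens, and count occurrences per word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val sampleLines = Seq("hello world", "hello scalding")
    // Sort by word so the printed order is stable.
    countWords(sampleLines).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```

Running LocalWordCount from Eclipse as a plain Scala application is also a quick way to check that the Scala plugin is working before adding Scalding to the mix.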

Maven + Scalding + Eclipse, Oh My!

I don't have much of an opinion about SBT and can't really say much good or bad about it, but I do know Maven is popular, and I tend to like it for managing my project dependencies, assembly descriptors, etc. It also reduces the amount of stuff to install when setting up a new laptop or bringing new team members up to speed on this technology, and I wanted to get this working with as few moving parts as possible.

Pre-Requisites:

  1. Eclipse
  2. Maven
  3. Scala Plugin for Eclipse

Running Scalding in Eclipse

Perhaps the simplest way to get started is to clone my sample project from git and modify it as necessary. Once cloned, simply run
mvn eclipse:eclipse
to generate the Eclipse project files, and everything should build as expected. The sample job is the word count job from the Scalding tutorial.
Once you have a working Eclipse project, to run the Scalding job:
  1. Create a new run configuration:
    Main class: com.twitter.scalding.Tool
    Program Args: <Fully Qualified Name of Job Class> <Other CLI Args>
    Example: org.hokiesuns.scaldingtest.WordCountJob --local --input ./input/input.txt --output ./output.txt
    VM Args: -Xmx3072m
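
The Program Args line above follows a simple convention: the first token names the job class, and the remaining tokens are --option flags, each followed by zero or more values. The block below is a rough, self-contained sketch of that convention only. ArgsSketch and parse are hypothetical names of my own, and this is not Scalding's actual argument parser, so check the Scalding source for the real behavior.

```scala
// Hypothetical sketch of the argument convention only; this is not Scalding's own parser.
object ArgsSketch {
  // Splits program arguments into (job class name, option name -> option values).
  def parse(programArgs: Seq[String]): (String, Map[String, List[String]]) = {
    require(programArgs.nonEmpty, "the first argument must be the job class name")
    val jobClassName = programArgs.head

    // Walk the remaining tokens: "--name" opens a new option, and any other
    // token is appended to the most recently opened option.
    val options = programArgs.tail.foldLeft(List.empty[(String, List[String])]) {
      case (seen, token) if token.startsWith("--") =>
        (token.drop(2), List.empty[String]) :: seen
      case ((name, values) :: earlier, token) =>
        (name, values :+ token) :: earlier
      case (Nil, token) =>
        throw new IllegalArgumentException(s"value '$token' has no preceding --option")
    }
    (jobClassName, options.toMap)
  }

  def main(args: Array[String]): Unit = {
    // Same arguments as the example run configuration above.
    val (jobClassName, options) = parse(Seq(
      "org.hokiesuns.scaldingtest.WordCountJob",
      "--local", "--input", "./input/input.txt", "--output", "./output.txt"))
    println(s"job class: $jobClassName")
    options.toSeq.sortBy(_._1).foreach { case (name, values) =>
      println(s"--$name ${values.mkString(" ")}")
    }
  }
}
```

In this sketch a bare flag such as --local simply maps to an empty list of values, which is one way to read the difference between --local and --input ./input/input.txt in the example.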
    
To create a job jar that can be submitted to your Hadoop cluster, simply run
mvn package
which will generate a fat jar with all the dependencies. This job jar can be submitted to your cluster by executing
 hadoop jar scaldingsample-0.0.1-SNAPSHOT.jar org.hokiesuns.scaldingtest.WordCountJob --hdfs --input <some path> --output <some path> 
I just started using Scalding and got this working in Eclipse. If there are any problems or inaccuracies, please post a comment and I'll update my steps. Happy scalding-ing!