Tuesday, July 17, 2012

Running your Scalding jobs in Eclipse

In a nutshell, Scalding is a Cascading DSL in Scala. If you know what that line means, skip to the meat below else read the next section for a small bit of background. Note: If you are reading this again, I have updated the below sections to rely on Maven instead of SBT and I have included a link to my sample project to help get you started and fixed some serious omissions when I revisited my own blog post to setup a new scalding project.

Introduction

The Hadoop ecosystem is full of wonderful libraries/frameworks/tools to make it easier to query your big data sets without having to write raw Map/Reduce jobs. Cascading(http://www.cascading.org) is one such framework that, simply put, provides APIs to define work flows that process your large datasets. Cascading provides facilities to think about data in terms of pipes through which flows tuples that can be filtered and processed with operations. Couple this with the fact that pipes can be joined together or split apart to produce new pipes and that Flows (which connects data sources and data sinks) can be tied together, you can create some pretty powerful data work flows. 
Cascading is a wonderful API which provides the ability to do all these great things but because it's Java and the language is verbose, it's always a bit hard to get started with Cascading from scratch. I've been using it for years and I find that each new project requires me to go back to a previous one and copy/paste some boilerplate code. I think others had similar problems and hence came out with numerous DSL (Domain Specific Language) written in Ruby, Clojure (Cascalog), Scala (Scalding) etc to wrap Cascading to make it easier to write these flows.

I won't pretend to be a Scalding expert so I advise you to visit their site (https://github.com/twitter/scalding/) but what I do know is that it's a Scala DSL around Cascading with some slight tweaks to make it easy to build big data processing flows in Scala. The API is designed to look like the collections API so the same code that works on a small list of data could be used to also work on a stream of billions of tuples. I wanted to play with Scalding so I read the wiki page, downloaded it and copied the tutorial but then I wondered, how can I run this in Eclipse? Mainly because it provides me the ability to write, debug and run (locally mainly) my jobs without having to hit the command line for some tool. At the end of the day, it's a JVM language so it must be able to run in Eclipse right?

Maven + Scalding + Eclipse, Oh My!

I don't have much of an opinion about SBT and can't really say much good or bad about it but I do know Maven is popular and I tend to like it for managing my project dependencies and assembly descriptors etc. It also reduces the amount of stuff to install when setting up a new laptop or bringing new team members up to speed on this technology so I wanted to get this working with as few moving parts as possible.

Pre-Requisites:

  1. Eclipse
  2. Maven
  3. Scala Plugin for Eclipse

Running Scalding in Eclipse

Perhaps the simplest way to get started is to clone my sample project from git and modify as necessary. Once cloned, simply run
mvn eclipse:eclipse
to generate the eclipse project files and everything should build as expected. The sample job is the word count job found from the scalding tutorial.
Once you have a working eclipse project, to run the scalding job:
  1. Create a new runtime configuration:
    Main class: com.twitter.scalding.Tool
    Program Args: <Fully Qualified Path to Job Class> <Other CLI Args>
    Example: org.hokiesuns.scaldingtest.WordCountJob --local --input ./input/input.txt --output ./output.txt
    VM Args: -Xmx3072m
    
To create a job jar that can be submitted to your hadoop cluster, simply run
mvn package
which will generate a fat jar with all the dependencies. This job jar can be submitted to your cluster by executing
 hadoop jar scaldingsample-0.0.1-SNAPSHOT.jar org.hokiesuns.scaldingtest.WordCountJob --hdfs --input <some path> --output <some path> 
I just started using Scalding and got this working in Eclipse. If there are any problems or inaccuracies, please post a comment and I'll update my steps. Happy scalding-ing!

8 comments:

  1. Was there anything special you needed to do on your Scalding job to get it to run under Hadoop (pseudo-distributed)? I am trying to do the same thing but it looks like the --hdfs does not seem to affect (regardless of the --hdfs or --local the logs say flow started: local) and it cannot find the input file in HDFS (I have tried specifying hdfs://, absolute and relative). I did copy the following jars over to my hadoop/lib directory from my application classpath (scala-library.jar, scalding_2.9.2.jar, cascading-core-2.0.2.jar, cascading-hadoop-2.0.2.jar, maple-0.2.2.jar, cascading-local-2.0.2.jar, jgrapht-jdk1.6-0.8.1.jar and guava-10.0.1.jar). In my scalding class, I made it extend Job(args) and then just put in a linear sequence of commands. I've also tried a companion object with a main method in it. I am able to call it with the hadoop jar command like you showed, but I cannot get it to see the file. Thanks in advance for any help you can provide.

    ReplyDelete
  2. Hey Sujit,

    Unfortunately, I haven't tried to run this in any other mode but local. I'm surprised that this isn't working but will have to try and run this myself to see if there is anything I can glean but honestly my scalding knowledge is not very good. I was hoping to do more with it but didn't have the time :-(

    Thanks for reading through my post and posting this.. maybe someone else who stumbles across this will have some comment but I'll certainly give it a try myself.

    Cheers
    Amit

    ReplyDelete
  3. Thank you. I will also ask on the mailing list.

    ReplyDelete
  4. Can you help me in this error:



    hadoop jar scaldingsample-0.0.1-SNAPSHOT.jar org.hokiesuns.scaldingtest.WordCountJob --hdfs --input /NOTICE --output /out


    Warning: $HADOOP_HOME is deprecated.

    Exception in thread "main" java.lang.NoSuchMethodException: org.hokiesuns.scaldingtest.WordCountJob.main([Ljava.lang.String;)
    at java.lang.Class.getMethod(Class.java:1624)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:150)

    ReplyDelete
    Replies
    1. Oh no! I have an error in my blog. I will fix that.. my apologies. Not having tested immediately (I'll have to check it out and re-test), I think it's something like
      hadoop jar scaldingsample-0.0.1-SNAPSHOT.jar com.twitter.scalding.Tool org.hokiesuns.scaldingtest.WordCountJob --hdfs --input --output

      Delete
  5. Hello ,
    what do you mean by "Create a new runtime configuration" is that a maven or eclipse setting ?

    ReplyDelete
    Replies
    1. It's an Eclipse thing. Run menu open Run Configurations

      Delete
  6. I've just decided to create a blog, which I have been wanting to do for a while. Thanks for this post, it's really useful! Management Jobs in London

    ReplyDelete