Sunday, July 10, 2011

Using Hudson as a cron/workflow manager

Every organization has scheduled jobs, whether daily database backups or hourly log processing. Managing and running them is always a pain, and the de facto standard for scheduling them is cron/anacron because it's easy and available on any Unix/Linux system. I'm sure many of you are familiar with the problems of using cron (no job-flow control, logging, or failure reporting, among others). Having dealt with these problems at Zvents, where I am a lead search/analytics engineer, I installed Hudson specifically as a cron server to solve them. For months prior, I had used Hudson as a build server and it worked very well; however, it never crossed my mind to use it as a cron server until I talked to someone else about this problem.

David Strauss wrote an interesting blog post (http://fourkitchens.com/blog/2010/05/09/drop-cron-use-hudson-instead) that I stumbled across while looking for material for this post. It sums up what I was going to say beautifully, so I won't repeat the major points here. There are, however, a few other advantages to using Hudson as a cron replacement, which brings me to the second part of my title, "workflow manager". To me, Hudson is a "framework" within which I can simply execute jobs and use the pre-built components for things like archiving, notification on failure, and concurrent execution. One feature that I don't think is talked about very much is the workflow-like ability you get by having jobs invoked when another job completes.

I use Hadoop for processing large quantities of data, and the Hadoop community has been abuzz with workflow management tools; they have been a topic at various Hadoop meetups I have attended. People often ask who uses which workflow tool (Oozie, Azkaban, Cascading, etc.), and my somewhat unusual suggestion has been to use Hudson to invoke scripts that wrap the invocation of Hadoop jobs. Hudson lets you construct a simple workflow by invoking builds upon the completion of other builds. If you are looking for a non-transactional workflow manager to run your scheduled jobs, consider Hudson: it's battle-tested, full-featured, and has a robust plugin architecture for developing your own add-ons in case one doesn't exist for your needs.
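A minimal sketch of such a wrapper script, in Python, might look like the following. It is not the script used at Zvents; the jar name and the input/output paths are hypothetical placeholders. It relies on the fact that Hudson marks a build as failed when a shell build step exits with a non-zero status, so the wrapper's main job is to pass the exit code through.

```python
import subprocess
import sys

# Hypothetical Hadoop invocation; the jar name and paths are placeholders.
DEFAULT_COMMAND = ["hadoop", "jar", "log-processing.jar", "input/", "output/"]


def run_job(command):
    """Run a command and return its exit code.

    The child inherits stdout/stderr, so its output lands in the
    console log of whatever invoked this script.
    """
    try:
        return subprocess.call(command)
    except OSError as err:
        # The command could not be started (for example, it is not on PATH).
        print("could not start %r: %s" % (command, err), file=sys.stderr)
        return 127


def main(argv):
    """Run the command given in argv, or the default one if argv is empty."""
    return run_job(argv or DEFAULT_COMMAND)


# In a real script, finish with: sys.exit(main(sys.argv[1:]))
```

Configured as a shell build step, a script like this gives the job its failure reporting: any non-zero exit code from the wrapped command fails the build and triggers whatever notifications are set up for that job.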

At Zvents, I have Hudson kick off an hourly log processing job for revenue recognition purposes, which takes about 10-15 minutes to run. Upon completion, it kicks off a few other "builds" that operate on the results to produce some reports; these are less critical, so it doesn't matter if they take longer than an hour to complete. This allows the core revenue recognition job to always start and end within the hour, while the downstream jobs queue up and complete when they can. While using Hudson as a workflow manager isn't as simple as, say, Azkaban, which uses configuration files to represent the dependencies between jobs, it does the job fairly well for simple workflows. Also, given that it's a build server, its integration with version control systems (SVN, Git, etc.) makes it possible to version important job scripts and have Hudson download the latest versions from SVN before starting a build.
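The upstream/downstream pattern described above can be pictured with a small toy model in Python. This is only an illustration of the triggering logic, not Hudson's API or configuration format; the job names are hypothetical, and it assumes downstream jobs are triggered only when the upstream job succeeds.

```python
from collections import deque

# Hypothetical job names; each job maps to the jobs it triggers on success.
DOWNSTREAM = {
    "hourly-log-processing": ["traffic-report", "advertiser-report"],
    "traffic-report": [],
    "advertiser-report": [],
}


def run_workflow(root, run, downstream=DOWNSTREAM):
    """Run `root`, then the downstream jobs of every job that succeeded.

    `run` is a callable that takes a job name and returns True on success.
    Returns the list of jobs that were run, in execution order.
    """
    executed = []
    queue = deque([root])
    while queue:
        job = queue.popleft()
        executed.append(job)
        if run(job):
            # Only a successful job queues up its downstream jobs.
            queue.extend(downstream.get(job, []))
    return executed
```

Calling `run_workflow("hourly-log-processing", run)` with a `run` that always succeeds executes the core job first and then both reports; if the core job fails, nothing downstream runs.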

In short, Hudson is more than just a build system; it's a complete cron system with the ability to set up simple workflows.