Monday, September 26, 2011

Using JMX to modify the parameters of a running Solr instance

Apache's Solr is a great software package to build your company's search engine around because of its active community, great contributions, and solid foundation with Lucene at its core. If you are new to this package, check out my earlier post on setting up Solr with Eclipse.

Normally, a company using Solr to power its search engine will have a front end written in a web framework such as Ruby on Rails that makes calls to Solr through its REST interface. One example is Zvents (my employer), where our RoR front end makes multiple calls to Solr to power various portions of the site. One common problem we encountered at Zvents was that a particular search would yield weird results, so we'd execute the query with the "&debugQuery=true" flag to study why that result appeared where it did. Afterward, however, came the hard part: tuning the ranking functions and refreshing the page to see if the results looked better. There are two ways to do this:
  1. Pass the modified set of ranking function parameters (bf, pf and/or qf) to Solr via the URL string
  2. Modify solrconfig.xml and restart Solr to see what the results look like.
While #1 is certainly better than #2, both have their problems for the following reasons:
  • Restarting Solr is slow and does not scale. Each change requires a restart, which may take several minutes just to see the effect. It also forces the person doing the tuning (say, the search scientist/expert) to understand the operational side (i.e. where Solr resides, how to restart it, etc.), which may be problematic for people who are averse to dealing with such operational details.
  • While passing the ranking function parameters via the URL works, it makes the URL significantly longer (a personal annoyance, since most browsers have single-line URL boxes), and it doesn't let you see the results of your change in the UI built by your front end team (without a special interface, or without someone from that team changing the UI code).
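To make option #1 concrete, here is a small sketch of how such a per-request override URL gets assembled; the host, field names, and boost values are made up for illustration, and you can see why these URLs get unwieldy fast.

```java
import java.net.URLEncoder;

// Illustrative only: builds the kind of long query URL described above,
// overriding ranking parameters (qf, pf, bf) on a per-request basis.
// The host and field/boost values are hypothetical.
public class SolrQueryUrl {
    public static String build(String q, String qf, String pf, String bf) throws Exception {
        String base = "http://localhost:8983/solr/select";
        return base
                + "?q=" + URLEncoder.encode(q, "UTF-8")
                + "&debugQuery=true"
                + "&qf=" + URLEncoder.encode(qf, "UTF-8")
                + "&pf=" + URLEncoder.encode(pf, "UTF-8")
                + "&bf=" + URLEncoder.encode(bf, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("rock concerts",
                "name^2.0 description^0.5",
                "name^1.5",
                "recip(ms(NOW,startdate),3.16e-11,1,1)"));
    }
}
```

Every tuning iteration means re-editing and re-pasting a URL like this, which is exactly the annoyance the rest of this post works around.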
At Zvents, our initial solution was a special UI hosted within the Solr web application context: a JSP page plus corresponding servlet with text boxes for each major ranking function parameter (qf, pf, bf) per handler. Our search experts would access this page and make the necessary changes which, upon submission of the form, would update the respective Solr handler's in-memory configuration map, so the changes could be viewed in the UI built by our front end team. This solution certainly works; however, once Solr introduced JMX, it made more sense for such functionality to live inside an MBean that exposes the ability to change these parameters using JConsole or any other UI with similar functionality.

SOLR-2306: Modify default solrconfig parameters via JMX

When Solr added JMX support, it gave administrators a well-known set of tools for understanding what is going on inside the JVM running Solr. Not only can you use JConsole to see what the VM is doing, but also what is going on inside Solr itself.

This patch extends the built-in JMX support so that the default parameters (exposed via <lst name="defaults">) appear as modifiable attributes. When these parameters are modified, the changes take effect on the very next search query, without requiring any core reloads. To use this patch, follow the steps below and post any questions/comments on the JIRA issue.
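The mechanism the patch relies on can be sketched in isolation: a handler's default parameter becomes a writable attribute on a Standard MBean, and any JMX client (JConsole included) can flip it. The interface and class names below are hypothetical stand-ins, not the patch's actual code.

```java
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

// Sketch of the JMX mechanism the patch builds on: a handler default
// ("fl" here) exposed as a writable MBean attribute. Names are made up.
public class DefaultsMBeanDemo {
    public interface SearchDefaultsMBean {
        String getFl();
        void setFl(String fl);
    }

    public static class SearchDefaults implements SearchDefaultsMBean {
        private volatile String fl = "name,price,inStock";
        public String getFl() { return fl; }
        public void setFl(String fl) { this.fl = fl; } // next query sees this value
    }

    public static String demo() throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("solr/browse:type=SearchHandler,id=defaults");
        mbs.registerMBean(new SearchDefaults(), name);
        // A JMX client (JConsole, or code like this) writes the attribute...
        mbs.setAttribute(name, new Attribute("Fl", "score,name"));
        // ...and the handler reads the new value with no core reload.
        String current = (String) mbs.getAttribute(name, "Fl");
        mbs.unregisterMBean(name);
        return current;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints score,name
    }
}
```

The point is that the attribute write mutates live, in-memory state, which is why no restart is needed.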

Step One: Checkout Solr, apply patch and build
svn co <SOLR_TRUNK_URL> solr_trunk
cd solr_trunk
patch -p0 < <PATH_TO_PATCH>/tuning_patch2.patch
cd solr
ant dist
Step Two: Setup and run example Solr instance
cd example
cp ../dist/apache-solr-4.0-SNAPSHOT.war webapps/solr.war
java -jar start.jar
Step Three: Load sample data
cd exampledocs
./post.sh *.xml
Step Four: Launch jconsole!

In JConsole, choose the "MBeans" tab and, along the left side, expand "solr/", "browse", "org.apache.solr.handler.component.SearchHandler", "Attributes". From here, you can select an attribute and change its value.

Step Five: Modify parameters

To demo the functionality of this patch, let's modify the "fl" parameter, which describes the default set of fields to return. This can be overridden by passing &fl= on the URL; however, let's modify it via JConsole to see its effect. First, though, let's look at some data by accessing http://localhost:8983/solr/browse

Notice that at least the name, price, and in-stock fields are displayed. Using JConsole (make sure you followed the previous step), select the "fl" attribute and change its value to only show "score,name"

and refresh your web browser pointing at the browse handler.

Notice that only the name is displayed along the top; the other fields that were initially present are no longer there! Try this with other parameters, and if you have JConsole pointed at your own search engine, try modifying the qf, pf, and bf parameters to see their effects. Keep in mind that to persist the parameter changes, you still need to make the corresponding edits in your solrconfig.xml file.
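The JConsole steps above can also be scripted. The sketch below keeps itself self-contained by standing up an in-process connector server and a stand-in MBean in place of a live Solr; the connect/setAttribute calls on the client side are the same ones you would run against a real instance's JMX service URL.

```java
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

// Scripted version of the JConsole workflow. The MBean and connector
// server here are stand-ins for a running Solr; the names are made up.
public class RemoteTuningDemo {
    public interface HandlerMBean { String getFl(); void setFl(String v); }

    public static class Handler implements HandlerMBean {
        private volatile String fl = "name,price,inStock";
        public String getFl() { return fl; }
        public void setFl(String v) { fl = v; }
    }

    public static String demo() throws Exception {
        // "Server side": stand-in for the MBean server inside Solr's JVM.
        MBeanServer mbs = MBeanServerFactory.newMBeanServer();
        ObjectName name = new ObjectName("solr/browse:type=SearchHandler");
        mbs.registerMBean(new Handler(), name);
        JMXConnectorServer server = JMXConnectorServerFactory.newJMXConnectorServer(
                new JMXServiceURL("service:jmx:rmi://"), null, mbs);
        server.start();

        // "Client side": connect and flip the "Fl" attribute, as JConsole would.
        JMXConnector client = JMXConnectorFactory.connect(server.getAddress());
        MBeanServerConnection conn = client.getMBeanServerConnection();
        conn.setAttribute(name, new Attribute("Fl", "score,name"));
        String updated = (String) conn.getAttribute(name, "Fl");

        client.close();
        server.stop();
        return updated;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

A script like this could let a search expert batch-apply a whole set of candidate qf/pf/bf values at once instead of clicking through JConsole attribute by attribute.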


Through a fairly simple, contrived demo using the examples provided in the standard Solr distribution, you were able to change the default parameters of a particular search handler and immediately see the effects. Solr's JMX extensions provide a great way to peek inside a running Solr instance without any major performance impact. This patch takes that one step further and allows real-time tuning of a running Solr instance. If you like this patch and have login credentials on Apache's JIRA, please vote it up to help ensure that it gets rolled into a future release of Solr!

Sunday, July 10, 2011

Using Hudson as a cron/workflow manager

Every organization has scheduled jobs, whether daily database backups or hourly log processing. Managing and running scheduled jobs is always a pain, and the de-facto standard for scheduling them is cron/anacron because it's easy and available on any Unix/Linux system. I'm sure many of you are familiar with the problems of using cron (lack of any job-flow control, logging, failure reporting, etc.). Having dealt with such problems at Zvents, where I am a lead search/analytics engineer, I installed Hudson specifically as a cron server to solve them. For months prior, I had used Hudson as a build server and it worked very well; however, it never crossed my mind to use it as a cron server until I talked to someone else about this problem.

David Strauss wrote an interesting blog post that I stumbled across when looking for materials for this post, and it sums up what I was going to say beautifully, so I won't repeat the major points here. However, there are a few other advantages to using Hudson as a cron, which segues into the second part of my title, "workflow manager". To me, Hudson is a "framework" within which I can simply execute jobs and use the pre-built components for things like archiving, notification on failure, concurrent execution, etc. One feature that I don't think is talked about very much is the "workflow"-like ability you get by having jobs invoked when another job completes.

As a user of Hadoop for processing large quantities of data, I've seen the community abuzz with workflow management tools, and this has been a topic at various Hadoop meetups I have attended. People often ask who uses the various workflow tools (Oozie vs. Azkaban vs. Cascading, etc.), and my somewhat unorthodox suggestion has been to use Hudson to invoke scripts that wrap the invocation of Hadoop jobs. Hudson allows you to construct a simple workflow by invoking builds upon the completion of other builds. If you are looking for a non-transactional workflow manager to run your scheduled jobs, consider Hudson: it's battle-tested, feature-complete, and has a robust plugin architecture for developing your own add-ons in case one doesn't exist for your needs.

At Zvents, I have Hudson kick off an hourly log processing job for revenue recognition purposes, which takes about 10-15 minutes to run. Upon completion, it kicks off a few other "builds" that operate on the results to produce reports that aren't as critical should they take longer than an hour to complete. This allows the core revenue recognition jobs to always start and end within the hour, while the downstream jobs queue up and complete when they can. While using Hudson as a workflow manager isn't as simple as, say, Azkaban, which uses configuration files to represent the dependencies between jobs, it does the job fairly well for simple workflows. Also, given that it's a build server, the integration with version control systems (SVN, Git, etc.) makes it possible to version important job scripts and have Hudson check out the latest from SVN before starting a build.
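The chaining pattern described above boils down to a simple rule: when a job finishes successfully, trigger its downstream jobs. This is not Hudson's code or API, just a sketch of the pattern with made-up job names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Not Hudson code, just the chaining pattern it provides: each job's
// successful completion triggers its downstream jobs, so the critical
// job finishes on schedule and reports queue behind it.
public class BuildChainDemo {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final List<String> executed = new ArrayList<>();

    // Declare "run `job` after `upstream` completes" (Hudson's
    // build-trigger setting, expressed as data).
    public void triggerAfter(String upstream, String job) {
        downstream.computeIfAbsent(upstream, k -> new ArrayList<>()).add(job);
    }

    public List<String> run(String job) {
        executed.add(job);            // run the job itself (stubbed out here)
        boolean succeeded = true;     // Hudson would check the build result
        if (succeeded) {
            for (String next : downstream.getOrDefault(job, Collections.emptyList())) {
                run(next);            // kick off the downstream "builds"
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        BuildChainDemo chain = new BuildChainDemo();
        chain.triggerAfter("revenue-recognition", "daily-report");
        chain.triggerAfter("revenue-recognition", "partner-report");
        System.out.println(chain.run("revenue-recognition"));
    }
}
```

In Hudson itself this is just the "Build other projects" post-build action on the upstream job; the value is that failure of the upstream job naturally stops the chain.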

In short, Hudson is more than just a build system; it's a complete cron system with the ability to set up simple workflows.