Monday, August 12, 2013

Three Reasons to Break Your “Golden Handcuffs”




It's an astonishing statistic: two-thirds of American workers are dissatisfied with their current job. Since most people spend a third of their day working, that is an enormous share of life spent being unhappy. Golden handcuffs only make things worse, because leaving can carry serious financial consequences.

This is the situation I faced throughout 2012 and early 2013. The startup I worked for was acquired by a large company, and I was one of the few core people given a meaningful retention package (a.k.a. golden handcuffs). Truthfully, I had been dissatisfied for a while and was looking to leave, but knowing about the impending acquisition, along with the added financial incentive (a 40% bump in salary, a 20% bonus, and sizable stock grants, not to mention a rising stock price), made it extremely expensive to just walk away.

When our acquisition finally closed in early 2012, I thought I'd be part of something special and enjoy the perks of joining a team considered the innovative arm of a successful business organization. Unfortunately, within the first quarter, several key people and close colleagues of mine left. The most shocking announcement came a few months later, when the person who had led our acquisition resigned, leaving us without a leader and champion. The remainder of 2012 ended up being unproductive as my team and I dealt with increasing politics, fought production fires, and constantly trained a stream of new people on the complexities of our system.

Steve Jobs said in his inspiring 2005 Stanford commencement address, "If today were the last day of my life, would I want to do what I am about to do today? And whenever the answer has been 'No' for too many days in a row, I know I need to change something." The last time I answered "no" to that question was when I resigned in March 2013, and as I reflected on the decision, I identified three key reasons why I gave up the remainder of my retention package.


Culture Fit

While this section could be an entire post given its importance, I’ll break down the differences between the pre- and post-acquisition culture into three main points:
  1. Openness and transparency 
  2. Hiring philosophy 
  3. Trust / Freedom from hierarchy 

Openness and transparency

This is very important for a (small) startup as everyone is working towards the same goal and has a vested interest in the outcome. When we were a startup, we had regular meetings where the CEO and leadership team gave updates on the status of various aspects of the business. At the end of each sprint, we would gather in our common area to demo the different features being shipped. While we had different mailing lists for product, engineering, and operations, generally anyone was welcome to be a member of any mailing list.

After the acquisition, once the startup-minded leadership left, it was clear that the new management wanted to protect their territories and close off the flow of information between teams. For example, the new Director of Engineering (a long-time veteran of the acquiring company) removed product management from the engineering mailing lists and told our Director of Product (who had come over with the acquisition) to stay out of engineering's affairs, which hardly fostered the partnership that is supposed to exist between the two groups. A short time later, he removed client services/account management, and for a while most of engineering, from the operations mailing list.

What ensued was added stress and chaos as communication that had once flowed openly among teams broke down. Operations alerts normally visible to client services and engineering went unnoticed because our operations team was too busy fighting other fires. Customer-facing teams were clueless about decisions and changes engineering made to the website. Regular team-wide status meetings stopped, and because we were a small team inside a larger organization, it was never clear what our team's direction was or why we were even there.

Hiring philosophy

The startup hiring philosophy is "hire slowly, fire quickly." Personally, I abide by this, along with "hire for smarts, not for (particular) skills," and hire people smarter than me (it's too easy to do the opposite).

As a startup, we had high standards: we asked tough phone and in-person coding questions to ensure we only brought in smart people capable of delivering big things to move the company forward. After our acquisition, however, it was clear that we had to hire indiscriminately to spend the "use it or lose it" quarterly budget. This significantly lowered the hiring bar (despite management's claims to the contrary), and eventually those of us who stuck to our beliefs about conducting tough interviews were excluded from the interview panel.

The results were immediately obvious. While the team tripled in size, the increase in output came nowhere close. Rather than hiring curious individuals who strove to understand the existing software and our customers' needs, we hired people with backgrounds tailored to the needs of the moment, who were pigeon-holed into very specific roles without the ability to move around and learn new things. They did what their managers told them, somewhat robotically and without question. This created a poor atmosphere: those who blindly followed orders, regardless of the quality and impact on the product, were held in higher regard than those who sometimes challenged management but did the majority of the work visible to customers and business stakeholders.

Trust and freedom from (rigid) hierarchy

All organizations have some hierarchy (there is, after all, a CEO and a management team); however, that hierarchy should not prevent the free flow of ideas, and no idea should be discouraged simply because it didn't come from the top.

When I joined my startup as a new engineer, I was free to decide how to implement assigned features and even refactored significant portions of our codebase when necessary. My manager, also an individual contributor, was too busy to concern himself with the minutiae of what I was doing and trusted that I would ask for help when needed. Since all code commits were automatically emailed to a mailing list (openness and transparency again), he would let me know if anything looked off so I could address it. This made for a great working relationship: I was able to learn our complicated system quickly and gain the confidence to make bigger technical decisions.

Unfortunately, after our acquisition, this empowerment quickly went away, despite the significant experience I had with our product, customers, and technology stack. My manager felt that because he was my superior, titled Director of Engineering/Technology, he could dictate technical direction over my objections as a technical lead. While others perceived and treated me as the technical lead, I was not given the freedom to actually make technical decisions about the product I led. The breaking point came when I pushed to use a Scala-based Hadoop library that made writing map-reduce jobs easier. I urged a new colleague to learn it and to implement a simple task assigned to him with it; he clearly enjoyed this, since it gave him a chance to learn something new and resolve a nagging JIRA task at the same time. A win-win all around.

Within a few weeks, my manager told me that we were not to use Scala and that we had to use Java. He considered the Scala library "too new" (there were no books on the subject), a risk to our technology stack (despite its increasing popularity), and argued that hiring developers who knew Scala was hard. He did assure me, in the end, that Scala was being considered by the "god-like" architecture team for future projects. To me this translated to: "Don't take risks. Stick to Java, because there are tons of people we can hire, and we are unable to attract people who are smart and willing to work outside their comfort zone in Scala."

Not being challenged

The technology landscape changes too rapidly to be stuck doing the same things in the same technology stack, or worse (as can happen after an acquisition), to be valued only because you can maintain the current, and possibly dated, stack.

My biggest fear was becoming comfortable in an organization that did not promote innovation, working with colleagues who neither motivated nor challenged me. There were no brown-bag lunches on interesting computer science research or on cool open source projects that could be applied to the problems we, as a business unit, needed to solve. Introducing new frameworks, databases, and technologies was seen as a "risk" because they were too new and we couldn't hire people smart enough to work on them. I rarely saw any of my colleagues attend meetups on new technologies or present their work at a major technical conference. Furthermore, we didn't employ any core "seeds" (people like Doug Cutting, Guido van Rossum, or James Gosling) who could attract the kind of talent needed to really propel a company or group forward.

Startups (and the people who work at them) are risk takers, and they promote taking reasonable risks because the rewards can be huge and the pace of work can be challenging and gratifying. For example, in 2006, rather than using the then-traditional PHP to build the website, our founders used Rails 0.x, a huge risk at the time. They also used the newly released Apache Solr to power the site's search, and in 2007 our CTO said that in order to scale like Google we needed to use and build the kind of infrastructure Google had, so he introduced Hadoop (relatively new at the time) and hired engineers to build our own NoSQL database. When I joined a few months later, this project was maturing and slowly becoming core to our analytics/reporting stack. We also hired consultants to train our engineering team on Hadoop, which proved huge now that Big Data is everywhere and centered around Hadoop and its many related technologies. In four years at a startup, I learned more than some people learn in ten at a big company. In short, I needed to be with people who challenged me, in a place where I could constantly learn and grow.

Opportunity cost

Opportunity cost is the value of the best alternative choice (or path) not taken. Every decision has an opportunity cost (sometimes well known, sometimes only hypothesized), and you have to decide whether that cost is too high. I knew the tangible benefits of staying (salary, bonus, stock grants); however, the opportunity cost of staying was high because I was neither growing nor learning. Furthermore, my professional network wasn't expanding, and chances to work on things worth writing about or presenting were few and far between.

I also realized that the cost of being merely a senior/lead engineer (focused on search and big data) in a technically weak organization was rising quickly. In 2009, when I first attended the SF Hadoop meetups, I was generally ahead of others in experience and working knowledge of the related technologies, often the one explaining things to people new to Hadoop. Over time, I watched everyone else catch up, and soon I was the one listening to others explain things to me like a newbie.

I knew that remaining in such an organization would cost me more than my package was worth in the long run, both through a continued decline in my technical currency and through a decline in the quality of my professional network: my two most valuable professional assets.

Conclusion

Walking away from my retention package remains one of the hardest decisions of my life. At a relatively young age and early stage in my career, I didn't want to find myself among the 20% of Americans completely disengaged from their work. I knew the opportunity cost of staying was too high: a toxic environment, no exposure to the latest technology, no interesting problems to solve. I believe the path I am on now will lead to greater rewards, intellectually if not financially; however, I have to pave that path myself, which is both exciting and daunting.

* Image courtesy of http://flavorscientist.com/2012/10/13/a-one-year-birthday-the-story-of-flavorscientist-com/

Thursday, November 8, 2012

Using Solr’s PostFiltering to collect score statistics

Solr 4.0 (quietly?) introduced an interface called PostFilter which basically allows you to "post filter" the results that match the main query and all filters (i.e. "filter queries"). There are a few good articles about this already.

Still, I wanted to add to the writing on this topic and bring more attention to an awesome feature that doesn't seem to be heavily advertised; I think it's one of the more powerful features introduced in Solr 4.0. The best-known examples are filtering results by a user-specified distance criterion (i.e. "find me results within X miles of me") and implementing ACLs, as in both Yonik's and Erik's articles. My use of post filtering is to collect score statistics in order to calculate search score distributions, which I'll describe later in the article.

Background

If you’ve ever peeked under the covers to see how Solr and Lucene work, especially with respect to results generation and filtering, you’ll notice a few things:

  1. Solr filter queries (fq) are all generated independently up front. Each fq yields a BitSet (of size equal to the total number of documents, with each bit indicating whether that document passed the filter).
    • This makes filter query caching important: cached BitSets can be retrieved from memory rather than constantly regenerated, which improves search performance.
  2. All of the filter query BitSets are intersected to yield a final BitSet dictating which documents passed or failed the full set of filters. Multiple fq parameters are effectively ANDed together.
  3. When "executing" the query in Lucene, a game of "leap frog" is played, advancing both the query results "object" and the filter results "object" until both land on the same document ID, at which point the document is scored and added to the final "resultset" object.
This execution strategy works well for most filtered queries, especially those without per-user context (such as an access level or lat/long), and it works well because most filter queries are cheap to generate (e.g. include/exclude documents containing some term value or a small range of term values). When filters become expensive to generate, you need a special way to post-process results, which is precisely what this new mechanism provides.
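To make the mechanics above concrete, below is a purely illustrative Java sketch of steps 2 and 3, using java.util.BitSet rather than Solr's internal classes; it is not taken from the Solr/Lucene source and glosses over segment handling, but it shows the intersect-then-leapfrog idea:

import java.util.BitSet;

public class FilteredQuerySketch {

    // Step 2: intersect every filter query's BitSet into one combined filter.
    static BitSet combineFilters(int numDocs, BitSet... filters) {
        BitSet combined = new BitSet(numDocs);
        combined.set(0, numDocs);      // start with "every document passes"
        for (BitSet f : filters) {
            combined.and(f);           // AND each fq's BitSet into the result
        }
        return combined;
    }

    // Step 3: only documents matched by BOTH the query and the combined filter
    // get scored and collected.
    static void execute(BitSet queryMatches, BitSet combinedFilter) {
        for (int doc = queryMatches.nextSetBit(0); doc >= 0; doc = queryMatches.nextSetBit(doc + 1)) {
            if (combinedFilter.get(doc)) {
                collect(doc);          // score the document and add it to the result set
            }
        }
    }

    static void collect(int docId) {
        System.out.println("collected doc " + docId);
    }
}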

PostFilter overview

Simply put, a "post filter" is a filter query that implements the PostFilter interface (which in turn extends the ExtendedQuery interface). To get started:

  • Write a class that extends the ExtendedQueryBase class and implements the PostFilter interface.
  • Set the "cost"; higher-cost filters are executed last, and the cost must be 100 or greater for the filter to run as a post filter.
    • Optionally, when implementing the post filter, you can specify whether or not its results are cached, although most of the time caching the results of a post filter doesn't make sense.
  • Add an instance of this filter to the list of filters somehow (for example, in the prepare() method of a custom search component).
Most of the work happens in your implementation of DelegatingCollector, which extends Lucene's Collector class. Let's think about this for a second. If you go back to the Lucene source code, you'll see that query execution basically boils down to:
While there is a nextDocument
 collector.collect(docId) //This in turn calls the scorer.
End
This is where post filtering hooks in, because the aforementioned game of "leap frog" between the filter and the main query happens in that "while there is a next document" phase. In your delegating collector, if some condition is not met, you can choose to *skip* collecting a document simply by not invoking super.collect(). This is where the magic of "post filtering" happens!
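To tie the pieces together, here is a minimal sketch of a post filter against the Solr 4.x APIs. The class name MyPostFilter and the passesMyCondition() predicate are made up for illustration, and exact signatures can vary slightly between 4.x releases, so treat this as a sketch rather than a drop-in implementation:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class MyPostFilter extends ExtendedQueryBase implements PostFilter {

  public MyPostFilter() {
    setCache(false);  // post filter results are usually not worth caching
    setCost(200);     // a cost of 100 or more tells Solr to run this as a post filter
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        if (passesMyCondition(doc)) {
          super.collect(doc);  // forward to the delegate: the document is kept
        }
        // otherwise do nothing: the document is silently dropped ("post filtered")
      }
    };
  }

  // Hypothetical stand-in for whatever per-document check you need
  // (an ACL lookup, a distance check, etc.).
  private boolean passesMyCondition(int segmentLocalDocId) {
    return true;
  }
}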

ResultSet Score Statistics

So you may be wondering why any of this matters for computing score statistics. At Zvents.com (where I work), we have a federated search product: you search for something (say, "music") and the results are a blended mix from our events, venues, restaurants, performers, and movies categories. Each category is its own index (Solr core) and therefore must be searched independently, with the results blended together. The problem is that the scoring formulas for each category are vastly different, so the score ranges can differ wildly (the #1 venue could have a score of 30 while the #1 movie has a score of 10). The question becomes how to normalize these scores so they can be blended properly.

Our solution was to use the Z-score ((score - avg) / stddev), which tells you how many standard deviations from the mean a score lies (assuming a roughly normal distribution, which is fairly common). Once each score is expressed as a Z-score, the scores carry a comparable meaning and can be blended across categories. The challenge, then, is how to produce the simple statistics, the mean and the standard deviation, needed to normalize the scores.
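As a quick illustration of the math, here is a small, standalone Java example that turns the running sums the component reports (sumScores, sumSquaredScores, numDocs; see the sample output later in this post) into a mean, a standard deviation, and a Z-score:

public class ZScoreExample {

  public static void main(String[] args) {
    // Sample numbers taken from the example scoreStats output shown later in this post.
    long numDocs = 100;
    double sumScores = 1900.1766;
    double sumSquaredScores = 47799.43;

    double mean = sumScores / numDocs;
    // Var(X) = E[X^2] - E[X]^2, computed from the two running sums.
    double variance = (sumSquaredScores / numDocs) - (mean * mean);
    double stdDev = Math.sqrt(variance);

    // Normalize a raw score from this core so it can be blended with scores from other cores.
    double rawScore = 30.0;
    double zScore = (rawScore - mean) / stdDev;

    // Prints roughly mean=19.00, stdDev=10.81 (matching the avg/stdDev in the sample output).
    System.out.printf("mean=%.4f stdDev=%.4f z=%.4f%n", mean, stdDev, zScore);
  }
}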

Implementation

Solr's PostFilter to the rescue! Recall that the delegating collector is invoked for each document that matches the query and filters, and also recall that, since DelegatingCollector extends Collector, you can implement a custom scorer. If you implement a custom scorer that wraps the original scorer being set and captures the score of each collected document, you can calculate the statistics needed to produce a normalized score later. For us, "later" is our federated search component, which queries the multiple cores, blends the results together, and computes a normalized score from the score statistics stored in each results object.
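For illustration, here is a simplified, hypothetical sketch of a statistics-gathering DelegatingCollector. It is not the actual DistributionCalcCollector/DistributionCalcScorer from the repository linked below; it reads the score directly in collect() rather than wrapping the Scorer, which is enough to show the idea, and the Solr 4.x signatures may differ slightly:

import java.io.IOException;

import org.apache.lucene.search.Scorer;
import org.apache.solr.search.DelegatingCollector;

public class ScoreStatsCollector extends DelegatingCollector {

  private Scorer wrappedScorer;
  private long numDocs = 0;
  private double sumScores = 0, sumSquaredScores = 0;
  private float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    this.wrappedScorer = scorer;
    super.setScorer(scorer);  // the delegate collector still needs the scorer
  }

  @Override
  public void collect(int doc) throws IOException {
    float score = wrappedScorer.score();  // read the current document's score
    numDocs++;
    sumScores += score;
    sumSquaredScores += score * score;
    min = Math.min(min, score);
    max = Math.max(max, score);
    super.collect(doc);  // every document is still collected; nothing is filtered out
  }

  // Accessors a search component could use later to report the statistics.
  public long getNumDocs()            { return numDocs; }
  public double getSumScores()        { return sumScores; }
  public double getSumSquaredScores() { return sumSquaredScores; }
  public float getMin()               { return min; }
  public float getMax()               { return max; }
}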

If you visit http://www.github.com/Zvents/score_stats_component, you’ll find a completely usable implementation of what was described earlier. There are unit tests so you can understand what it’s doing. Basically what happens is:

  1. The ScoreStatsComponent creates an instance of ScoreStatsPostFilter and adds it to the list of existing filters (or creates a new list and adds it).
  2. The ScoreStatsPostFilter returns an instance of DistributionCalcCollector from its getCollector() method, and the DistributionCalcCollector in turn returns a delegate scorer, DistributionCalcScorer, which keeps track of the necessary statistics while invoking the scorer it wraps.
  3. In the process() method of the ScoreStatsComponent, the calculated statistics are extracted by searching the filter list for the ScoreStatsPostFilter instance and are returned in the results.
Below is an example of what gets returned when you enable this component:
 <lst name="scoreStats">
  <float name="min">0.45898002</float>
  <float name="max">37.544556</float>
  <float name="avg">19.001766</float>
  <float name="stdDev">10.813288</float>
  <long name="numDocs">100</long>
  <float name="sumSquaredScores">47799.43</float>
  <float name="sumScores">1900.1766</float>
 </lst>

Notes/Gotchas/Outstanding issues

Since these statistics are generated during query execution, there are a few situations when these statistics won’t be generated even if explicitly asked:

  1. When the query hits the queryResultCache and the results come straight from the cache. To remedy this, set up a parallel "scoreStatsCache" with at least the same size/parameters. The component checks whether the statistics are valid (i.e. not NaN); if they are invalid, it consults the "scoreStatsCache" (using the same cache key generation as core Solr) to fetch the statistics for the query. To define this user cache, simply add this to your solrConfig.xml:
        <cache name="scoreStatsCache" class="solr.LRUCache"
                         size="512"
                         initialSize="512"
                         autowarmCount="0"/>
    
  2. When the results are explicitly sorted by fields that don't include "score". Scoring is an expensive process, and if the user requests that documents be sorted only by fields other than the score, the scorer never gets called and hence the statistics won't be computed.
    • There are scores generated, but I am not sure how to get at them in a manner similar to what this component already does. If you have a solution, please submit a pull request!
  3. With the grouping functionality turned on, the statistics are reported for the unrolled/pre-grouped set of documents. I tried to get this to work on the post-grouped set, but due to how grouping is implemented, I had a tough time making it work. If there are suggestions on how to make this work, please submit a pull request.

Conclusion

Using Solr 4.0's new PostFilter interface, you can implement custom scoring and collecting that previously required modifying the core Solr/Lucene source. In my opinion, this is one of the more powerful yet underrated features of Solr 4, and it opens up new possibilities when implementing your search engine. This method of filtering is great when standard filters are expensive to compute, whether computationally expensive (as with distance filters) or because you have to consult an external data source (e.g. filtering results to show only users who are "online").

I want to thank the Solr community for helping me arrive at this use of Solr's PostFilter interface. I posted a question about computing score statistics on the mailing list and received an answer the next day pointing me to the PostFilter interface. This was a huge help, and I hope this article helps others in the same way. As always, if there are questions, comments, inaccuracies, or things that need clarifying, please drop a comment and I'll do my best to respond promptly!

Tuesday, July 17, 2012

Running your Scalding jobs in Eclipse

In a nutshell, Scalding is a Cascading DSL in Scala. If you know what that sentence means, skip to the meat below; otherwise, read the next section for a bit of background. Note: if you are reading this again, I have updated the sections below to rely on Maven instead of SBT, included a link to my sample project to help get you started, and fixed some serious omissions I discovered when I revisited my own blog post to set up a new Scalding project.

Introduction

The Hadoop ecosystem is full of wonderful libraries, frameworks, and tools that make it easier to query big data sets without writing raw map/reduce jobs. Cascading (http://www.cascading.org) is one such framework: simply put, it provides APIs to define workflows that process large datasets. Cascading lets you think about data in terms of pipes through which tuples flow and can be filtered and processed with operations. Couple this with the fact that pipes can be joined together or split apart to produce new pipes, and that Flows (which connect data sources and sinks) can be tied together, and you can create some pretty powerful data workflows.
Cascading is a wonderful API that lets you do all these great things, but because it's Java and therefore verbose, it has always been a bit hard to get started with from scratch. I've been using it for years, and each new project has me going back to a previous one to copy/paste some boilerplate code. Others evidently had similar problems, which is why numerous DSLs (domain-specific languages) have appeared in Ruby, Clojure (Cascalog), Scala (Scalding), and more, wrapping Cascading to make these flows easier to write.
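To give a feel for that verbosity, here is a rough sketch of the classic word count written directly against the Cascading Java API. It assumes Cascading 2.x running on Hadoop; the package names, paths, and class name are illustrative and may differ for your version, so treat it as a sketch. The Scalding equivalent is only a handful of lines.

import java.util.Properties;

import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {

  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];

    // Source and sink taps: where tuples flow from and to.
    Tap source = new Hfs(new TextLine(new Fields("offset", "line")), inputPath);
    Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

    // The pipe assembly: split each line into words, group by word, count each group.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, CascadingWordCount.class);

    // Connect source -> pipe assembly -> sink and run the flow.
    new HadoopFlowConnector(props).connect("wordcount", source, sink, assembly).complete();
  }
}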

I won't pretend to be a Scalding expert, so I advise you to visit their site (https://github.com/twitter/scalding/), but what I do know is that it's a Scala DSL around Cascading with some slight tweaks that make it easy to build big data processing flows in Scala. The API is designed to look like the Scala collections API, so the same code that works on a small list of data can also work on a stream of billions of tuples. I wanted to play with Scalding, so I read the wiki page, downloaded it, and copied the tutorial, but then I wondered: how can I run this in Eclipse? Mainly because Eclipse lets me write, debug, and run my jobs (mostly locally) without having to hit the command line for some tool. At the end of the day, it's a JVM language, so it must be able to run in Eclipse, right?

Maven + Scalding + Eclipse, Oh My!

I don't have much of an opinion about SBT and can't really say much good or bad about it, but I do know Maven is popular, and I like it for managing project dependencies, assembly descriptors, and so on. It also reduces the amount of software to install when setting up a new laptop or bringing new team members up to speed, so I wanted to get this working with as few moving parts as possible.

Pre-Requisites:

  1. Eclipse
  2. Maven
  3. Scala Plugin for Eclipse

Running Scalding in Eclipse

Perhaps the simplest way to get started is to clone my sample project from git and modify as necessary. Once cloned, simply run
mvn eclipse:eclipse
to generate the Eclipse project files, and everything should build as expected. The sample job is the word count job from the Scalding tutorial.
Once you have a working eclipse project, to run the scalding job:
  1. Create a new runtime configuration:
    Main class: com.twitter.scalding.Tool
    Program Args: <Fully Qualified Path to Job Class> <Other CLI Args>
    Example: org.hokiesuns.scaldingtest.WordCountJob --local --input ./input/input.txt --output ./output.txt
    VM Args: -Xmx3072m
    
To create a job jar that can be submitted to your hadoop cluster, simply run
mvn package
which will generate a fat jar with all the dependencies. This job jar can be submitted to your cluster by executing
 hadoop jar scaldingsample-0.0.1-SNAPSHOT.jar org.hokiesuns.scaldingtest.WordCountJob --hdfs --input <some path> --output <some path> 
I just started using Scalding and got this working in Eclipse. If there are any problems or inaccuracies, please post a comment and I'll update my steps. Happy scalding-ing!

Thursday, May 10, 2012

Manipulating URLs with long query strings using Chrome

Website URLs, especially those of search engines and REST APIs with lots of parameters, keep getting longer, while browsers haven't kept up and still offer short, single-line URL bars. When debugging lengthy URLs, such as those produced when using Apache Solr, I usually end up copying the URL into a text editor that supports line wrapping and making the changes there. As you can imagine, this is incredibly time consuming and error prone, which made me wonder, "Why isn't there a browser extension that makes it easier to edit URLs with long query strings?" After much searching, I couldn't find one, so I wrote one!

Chrome Extension

  1. Git clone the repository URL( https://github.com/ANithian/url_edit_extension )
  2. In Chrome, enter chrome://extensions into the URL bar
  3. Make sure the "Developer Mode" box is checked
  4. Click "Load Unpacked Extension" and navigate to where you cloned the repo in step 1. After installing the extension, you should see an icon:







Next, visit a URL that contains a long query string (http://www.zvents.com/search?swhat=music&swhen=tomorrow&swhere=Berkeley%2CCA&commit=Search&st_select=any&search=true&svt=text&srss=). Then click the icon to bring up the URL editor:

To edit the value of the "swhat" parameter, double-click on its value, change it to "comedy", and then click the "Update" button. You should see the URL change to http://www.zvents.com/search?swhat=comedy&swhen=tomorrow&swhere=Berkeley%2CCA&commit=Search&st_select=any&search=true&svt=text&srss=

That's pretty much it! The only thing you may notice is that the extension can flicker, which I believe is a Chrome bug that should be addressed in v19 (http://code.google.com/p/chromium/issues/detail?id=58816). If you have any suggestions or feedback on this extension, I'd love to hear it. Please submit an issue/feature/bug request at the GitHub repository (https://github.com/ANithian/url_edit_extension/issues). Happy editing!

Monday, February 13, 2012

Bundler + Maven for your JRuby projects!

I recently came across a blog post describing the first version of an integration between Bundler and Maven (http://jpkutner.blogspot.com/2011/04/ease-of-bundler-power-of-maven.html). Do take a look at his post for a complete understanding of what he did, but to summarize:

Gemfile Snippet:

gem "mvn:org.slf4j:slf4j-simple", "1.6.1"

Ruby File Snippet (outside a Rails console context):

require 'java'
require 'rubygems'
require 'bundler/setup'
require 'mvn:org.slf4j:slf4j-simple'
logger = org.slf4j.LoggerFactory.getLogger("world")
...

This looked really promising but there were a few things that didn't quite look right, namely this code line:
require 'mvn:org.slf4j:slf4j-simple'
It didn't look "ruby"-ish to me, and knowing how file systems work, it's hard to create a file or folder with ":" in the name, so the gem's file name would differ from the require line, which seems to break convention. Reading further and looking at his modified Bundler source, I saw that it used the maven_gemify library included in JRuby (try require 'rubygems/maven_gemify' at the jirb prompt). It was really cool that something already existed to integrate the two; however, on closer inspection, there were a few things I didn't like:
  • It used a custom-written, third-party Maven plugin to resolve and download dependencies, which is something Maven does out of the box.
  • It packaged the jars directly into the gem.
My second point may be somewhat moot, as I am relatively new to Maven and haven't deployed a Java project that relies on it. My issue with packaging the jars directly into the gem is that in my development environment I have both a local Maven repository and a local Ruby gems repository. Rather than re-downloading a jar that may already exist in my Maven repo, why not have the Ruby gem simply point to (i.e. require) the jar located in my Maven repo?

Upon reading the follow-up to Joe Kutner's post (http://jpkutner.blogspot.com/2011/09/bundlermaven-workaround.html), I learned he had stopped development of his integration in favor of a workaround that, while I am sure it works, didn't quite satisfy my desire for a clean integration between Maven and Bundler. While writing this post, I also stumbled across http://codingiscoding.wordpress.com/2012/02/08/ruby-bundler-maven-gemfile-maven-plugin/, which seems to reverse what I am proposing (i.e. calling Bundler from Maven).

Revamped maven_gemify library

The first thing I did was revamp the maven_gemify library to eliminate the third-party dependency and rely on the local Maven repository. This was done with a fairly brute-force approach: create a temporary pom.xml that defines the repositories and dependencies, then programmatically invoke the dependency plugin to obtain the locations of the jars. Finally, it writes a wrapper Ruby script that requires these jars and packages this file into a gem. When gemifying a Maven dependency, you specify the name as "mvn:<groupId>:<artifactId>" and pass the version as the second argument. The corresponding Ruby gem name replaces the "." and ":" characters with "_"; for example, "mvn:org.apache.lucene:lucene-core" becomes "org_apache_lucene_lucene-core".

Bundler Integration

The hardest part of this was the Bundler integration; however, after a few iterations (and thanks to what Joe Kutner did in his attempts), I was able to isolate the changes to a few files (namely lib/bundler/dsl.rb and lib/bundler/source.rb, plus the new lib/bundler/maven_gemify2.rb). There are two ways to pull in Maven dependencies:
  1. gem "mvn:org.apache.lucene:lucene-core","3.5.0", :mvn=>"default"
  2. mvn "default" do
    gem "mvn:org.apache.lucene:lucene-core","3.5.0"
    end
The only difference between the two approaches is that with #2 you can specify several Maven dependencies from the same repository. To avoid having to spell out the default Maven repository, the keyword "default" stands for the standard Maven repo URL; otherwise, you can specify the URL of the repository.

Note: Maven-based gems will be auto-required just as regular gems are upon Rails load or require 'bundler/setup'

Summary

Gemfile Snippet:

gem "mvn:org.apache.lucene:lucene-core", "3.5.0", :mvn=>"default"

Ruby File Snippet (outside a Rails console context):

require 'java'
require 'rubygems'
require 'bundler/setup'
require 'org_apache_lucene_lucene-core' #Notice the more "rubyish" way of requiring the gem. The directory name of the gem's contents is the same.
d=org.apache.lucene.store.SimpleFSDirectory.new(java.io.File.new("."))
...

Next Steps
  • Grab a copy of my modified Bundler (https://github.com/anithian/bundler) and give it a shot. Running rake install will generate the gem, and you can install it directly. I have some examples in the README file as well as the ones above. The Maven DSL will only work with JRuby, but the modified Bundler should still behave properly without JRuby.
  • Test the integration with more cases. I couldn't get the rspecs to run properly on my system and couldn't tell whether it was a JRuby issue or a Windows issue.
  • Validate what I have proposed in both the Bundler integration and the revamped maven_gemify plugin. Feedback on how others properly deploy Java applications using Maven would be helpful: does the application's classpath point to jars in the Maven repo?
  • Right now, when re-executing Bundler on your project, something is amiss in the use of the lock file: the internal cache isn't checked for Maven dependencies, so it goes through the gemification process each time. Since my maven_gemify2 library relies on Maven's dependency plugin directly, the jars won't be re-downloaded (they are already in your repo), but it would still be good to make the re-execution behavior consistent.

Monday, September 26, 2011

Using JMX to modify the parameters of a running Solr instance

Apache Solr is a great software package to build your company's search engine around because of its active community, great contributions, and the solid foundation of Lucene at its core. If you are new to Solr, check out my earlier post on setting it up with Eclipse: http://hokiesuns.blogspot.com/2010/01/setting-up-apache-solr-in-eclipse.html

Normally, a company using Solr to power its search engine has a front end written in a web framework such as Ruby on Rails that calls Solr through its REST interface. One example is Zvents.com (my employer), where our Rails front end makes multiple calls to Solr to power various portions of the site. One common problem we encountered at Zvents was that a particular search would yield weird results, and we'd re-execute the query with the "&debugQuery=true" flag to study why a result appeared where it did. Afterwards, however, came the hard part: tuning the ranking functions and refreshing the page to see if the results looked better. There are two ways to do this:
  1. Pass the modified set of ranking function parameters (bf, pf and/or qf) to Solr via the URL string
  2. Modify solrconfig.xml and restart Solr to see what the results look like.
While #1 is certainly better than #2, both have problems:
  • Restarting Solr is slow and doesn't scale. Each change requires a restart, which may take several minutes just to see the result. It also forces the person doing the tuning (say, the search scientist/expert) to understand the operational side (where Solr resides, how to restart it, etc.), which is problematic for people averse to dealing with such operational details.
  • Passing the ranking function parameters via the URL works, but it makes the URL significantly longer (a personal annoyance, since most browsers have single-line URL boxes), and it doesn't let you see the results of your change in the UI built by your front-end team (without a special interface or someone from that team changing the UI code).
At Zvents, our initial solution was to build a special UI hosted within the Solr web application context: a JSP page and corresponding servlet with text boxes for each major ranking function parameter (qf, pf, bf) per handler. Our search experts would access this page and make the necessary changes which, upon submission of the form, would update the respective Solr handler's in-memory configuration map, so the changes could be viewed in the UI built by our front-end team. This was a great solution and certainly worked; however, once Solr introduced JMX support, it made more sense for such functionality to live inside an MBean that exposes these parameters for modification via JConsole or any other UI with similar functionality.

SOLR-2306: Modify default solrconfig parameters via JMX

When Solr added JMX support, it gave administrators a well-known set of tools for understanding what is going on inside the JVM running Solr. Not only can you use JConsole to see what the VM is doing, but also what is happening inside Solr itself.

This patch extends the built-in JMX support so that the default request handler parameters (exposed via <lst name="defaults">) are also exposed as modifiable attributes. When these parameters are modified, the changes take effect on the next search query without requiring any core reloads. To use this patch, follow the steps below and post any questions/comments on the JIRA issue: https://issues.apache.org/jira/browse/SOLR-2306.

Step One: Checkout Solr, apply patch and build
svn co http://svn.apache.org/repos/asf/lucene/dev/trunk solr_trunk
cd solr_trunk
patch -p0 < <PATH_TO_PATCH>/tuning_patch2.patch
cd solr
ant dist
Step Two: Setup and run example Solr instance
cd example
cp ../dist/apache-solr-4.0-SNAPSHOT.war webapps/solr.war
java -jar start.jar
Step Three: Load sample data
cd exampledocs
./post.sh *.xml
Step Four: Launch jconsole!
jconsole


In JConsole, choose the "MBeans" tab and, along the left side, expand "/solr", "browse/", "org.apache.solr.component.SearchHandler", "Attributes". From here, you can select an attribute and change its value.

Step Five: Modify parameters

For the purposes of demoing this patch, let's modify the "fl" parameter, which describes the default set of fields to return. This can be overridden by passing &fl= on the URL; however, let's modify it via JConsole to see its effect. First, let's look at some data by accessing http://localhost:8983/solr/browse



Notice that at least the name, price, and in-stock fields are displayed. Using JConsole (make sure you followed the previous step), select the "fl" attribute and change its value to show only "score,name"


and refresh your web browser pointing at the browse handler.


Notice that only the name is displayed along the top, with the other fields that were initially present no longer shown! Try this with other parameters, and if you have JConsole pointed at your own search engine, try modifying the qf, pf, and bf parameters to see their effects. Keep in mind that to persist the parameter changes, you still need to make the corresponding edits in your solrconfig.xml file.
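For completeness, the same change can be scripted with the standard JMX client API instead of clicking through JConsole. The sketch below is hedged: it assumes remote JMX has been enabled on port 9999, and the ObjectName is hypothetical; copy the real object name from JConsole's MBean view for your core and handler before running it.

import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SolrJmxTuner {

  public static void main(String[] args) throws Exception {
    // Assumes Solr's JVM was started with remote JMX enabled on port 9999.
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection conn = connector.getMBeanServerConnection();

      // Hypothetical ObjectName for the /browse SearchHandler; check JConsole for
      // the exact name your Solr instance registers and adjust accordingly.
      ObjectName handler = new ObjectName(
          "solr:type=/browse,id=org.apache.solr.handler.component.SearchHandler");

      conn.setAttribute(handler, new Attribute("fl", "score,name"));
      System.out.println("fl is now: " + conn.getAttribute(handler, "fl"));
    } finally {
      connector.close();
    }
  }
}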

Conclusion

Through a fairly simple, contrived demo using the examples provided in the standard Solr distribution, you were able to change the default parameters of a particular search handler and immediately see the effects. Solr's JMX extensions provide a great way to peek inside a running Solr instance without any major performance impact. This patch takes that one step further and allows real-time tuning of a running instance. If you like this patch and have login credentials on Apache's JIRA, please vote it up to help ensure it gets rolled into a future release of Solr!

Sunday, July 10, 2011

Using Hudson as a cron/workflow manager

Every organization has scheduled jobs, whether daily database backups or hourly log processing. Managing and running scheduled jobs is always a pain, and the de facto standard for scheduling them is cron/anacron, because it's easy and available on any Unix/Linux system. I'm sure many of you are familiar with the problems of using cron (lack of job-flow control, logging, failure reporting, etc.). Having dealt with such problems at Zvents, where I am a lead search/analytics engineer, I installed Hudson specifically as a cron server to solve them. For months prior I had used Hudson as a build server and it worked very well; however, it never crossed my mind to use it as a cron server until I talked to someone else about this problem.

David Strauss wrote an interesting blog post (http://fourkitchens.com/blog/2010/05/09/drop-cron-use-hudson-instead) that I stumbled across while gathering material for this post; it sums up what I was going to say beautifully, so I won't repeat the major points here. However, there are a few other advantages to using Hudson as a cron, which segues into the second part of my title, "workflow manager". To me, Hudson is a "framework" within which I can simply execute jobs and use pre-built components for things like archiving, notification on failure, and concurrent execution. One feature that I don't think gets talked about much is the workflow-like ability you get by having jobs invoked when another job completes.

As a user of Hadoop for processing large quantities of data, I've noticed the community has been abuzz with workflow management tools, and this has been a topic at various Hadoop meetups I've attended. People often ask who uses which workflow tool (Oozie vs. Azkaban vs. Cascading, etc.), and my somewhat unorthodox suggestion has been to use Hudson to invoke scripts that wrap the invocation of Hadoop jobs. Hudson lets you construct a simple workflow by triggering builds upon the completion of other builds. If you are looking for a non-transactional workflow manager to run your scheduled jobs, consider Hudson: it's battle-tested, rich in features, and has a robust plugin architecture for developing your own add-ons if one doesn't already exist for your needs.

At Zvents, I have Hudson kick off an hourly log-processing job for revenue recognition purposes that takes about 10-15 minutes to run. Upon completion, it kicks off a few other "builds" that operate on the results to produce reports that aren't as critical if they take longer than an hour to complete. This allows the core revenue recognition job to always start and finish within the hour, while the downstream jobs queue up and complete when they can. Using Hudson as a workflow manager isn't as simple as, say, Azkaban, which uses configuration files to represent the dependencies between jobs, but it does the job fairly well for simple workflows. And since it's a build server, the integration with version control systems (SVN, Git, etc.) makes it possible to keep the important pieces under version control and have Hudson pull the latest from SVN before starting a build.

In short, Hudson is more than just a build system; it's a complete cron replacement with the ability to set up simple workflows.