Thursday, November 8, 2012

Using Solr’s PostFiltering to collect score statistics

Solr 4.0 (quietly?) introduced an interface called PostFilter which basically allows you to "post filter" the results that match the main query and all filters (i.e. "filter queries"). There are a few good articles about this topic already.

However, I wanted to contribute to the articles on this topic and bring more attention to this awesome feature, as it didn’t seem to be heavily advertised and I think it’s one of the more powerful features introduced in Solr 4.0. The best understood example of this in Solr 4 is filtering results based on a user-specified distance criterion (i.e. "find me results within X miles of me"), and it can also be used to implement ACLs, as described in both Yonik’s and Erik’s articles. My use of post filtering is to collect score statistics in order to calculate search score distributions, which I’ll describe later in the article.

Background

If you’ve ever peeked under the covers to see how Solr and Lucene work, especially with respect to results generation and filtering, you’ll notice a few things:

  1. Solr filter queries (fq) are all independently generated up front. Basically, each fq yields a BitSet (of size equal to the total number of documents, with each bit set to 0 or 1 to indicate whether that document passed the filter).
    • This makes filter query caching important: having these BitSets cached means they can be retrieved from memory rather than constantly re-generated, which improves search performance.
  2. All filter query results are set-intersected to yield a final BitSet dictating which documents passed/failed the set of filters. Multiple fq parameters are ANDed together (sketched below).
  3. When "executing" the query in Lucene, a game of "leap frog" is played, advancing both the query results "object" and the filter results "object" until both land on the same document ID, at which point the document is scored and added to the final "resultset" object.
This execution strategy works well for most filtered queries, especially those that don’t have any user context (such as an access level or lat/long), and it works well because most filter queries are cheap to generate (i.e. include/exclude documents containing some term value or a small range of term values). When filters become expensive to generate, you need a special way to post-process results, which is precisely what this new mechanism provides.
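
To make point 2 concrete, here is a purely illustrative sketch of that intersection using java.util.BitSet (Lucene uses its own bit set classes internally; the filter names and variables here are mine):

import java.util.BitSet;

public class FilterIntersectionSketch {
    public static void main(String[] args) {
        int numDocs = 8;

        // One BitSet per fq: bit i is set if document i passed that filter.
        BitSet fqInStock = new BitSet(numDocs);
        fqInStock.set(0); fqInStock.set(2); fqInStock.set(3); fqInStock.set(7);

        BitSet fqCategory = new BitSet(numDocs);
        fqCategory.set(2); fqCategory.set(3); fqCategory.set(5);

        // Multiple fq parameters are ANDed together into one final BitSet.
        BitSet finalFilter = (BitSet) fqInStock.clone();
        finalFilter.and(fqCategory);

        System.out.println("Documents passing all filters: " + finalFilter); // {2, 3}
    }
}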

PostFilter overview

Simply put, a "post filter" is a filter query implementation that implements the PostFilter interface (which in turn extends the ExtendedQuery interface). To get started:

  • Write a class that extends the ExtendedQueryBase class and implements the PostFilter interface.
  • Set the "cost"; higher-cost filters are executed last, and the cost must be > 100 for the filter to be run as a post filter.
    • Optionally, when implementing the post filter, you can specify whether or not its results are cached, although most of the time caching the results of a post filter doesn’t make sense.
  • Add an instance of this filter to the list of filters somehow (for example, in the prepare() method of a search component).
Most of the work happens in your implementation of the DelegatingCollector, which extends Lucene’s Collector class. Let’s think about this for a second. If you go back to the Lucene source code, you’ll see that it basically goes something like:
while there is a next document
    collector.collect(docId)  // this in turn calls the scorer
end
This is how the post filtering happens, because the aforementioned game of "leap frog" between the filter and the main query takes place in the "while there is a next document" phase. In the delegating collector, if some condition is not met, you can *skip* the collection process simply by not invoking super.collect()! This is where the magic of "post filtering" happens. A minimal skeleton is sketched below.
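
Here is a minimal, hedged skeleton of what such a class can look like, assuming the Solr 4.x PostFilter/DelegatingCollector API; the class name MyPostFilter and the someCondition() check are hypothetical placeholders, not part of Solr or of the component described later:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Scorer;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class MyPostFilter extends ExtendedQueryBase implements PostFilter {

    public MyPostFilter() {
        setCache(false); // caching a post filter's results usually doesn't make sense
        setCost(101);    // cost above 100 so it runs as a post filter, after the cheaper filters
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            private Scorer wrappedScorer;

            @Override
            public void setScorer(Scorer scorer) throws IOException {
                this.wrappedScorer = scorer;
                super.setScorer(scorer);
            }

            @Override
            public void collect(int doc) throws IOException {
                // Only documents passed on via super.collect() survive the post filter;
                // skipping the call is how documents get filtered out.
                if (someCondition(doc, wrappedScorer)) {
                    super.collect(doc);
                }
            }

            // Hypothetical placeholder for the real per-document check.
            private boolean someCondition(int doc, Scorer scorer) throws IOException {
                return true;
            }
        };
    }
}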

ResultSet Score Statistics

So you may be wondering why all this matters for computing score statistics. At Zvents.com (where I work), we have a federated search product where you can search for something (say "music") and the results are a blended mix of results from our events, venues, restaurants, performers, and movies categories. Each category is its own index (Solr core) and hence must be searched independently, with the results blended together. The problem is that the scoring formulas for each category are vastly different, so the score ranges themselves can differ wildly (the #1 venue could have a score of 30 while the #1 movie could have a score of 10), and the question becomes how to normalize these scores to properly blend them.

Our solution to this was to use the Z score ((score – avg) / stddev), which tells you how many standard deviations from the mean a score is (assuming a normal distribution, which is fairly common). Since each score now has the same meaning, you can blend the scores from different cores together. The challenge, though, is how to produce the simple statistics, such as the mean and standard deviation, needed to normalize the scores.
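
As a minimal sketch of the arithmetic (the ScoreStats class here is hypothetical; its field names simply mirror the scoreStats block shown later in this post), the mean, standard deviation, and Z score can all be derived from a document count, a sum of scores, and a sum of squared scores:

public class ScoreStats {
    public final double sumScores;
    public final double sumSquaredScores;
    public final long numDocs;

    public ScoreStats(double sumScores, double sumSquaredScores, long numDocs) {
        this.sumScores = sumScores;
        this.sumSquaredScores = sumSquaredScores;
        this.numDocs = numDocs;
    }

    public double avg() {
        return sumScores / numDocs;
    }

    public double stdDev() {
        // Population standard deviation from running sums: sqrt(E[x^2] - E[x]^2)
        double mean = avg();
        return Math.sqrt(sumSquaredScores / numDocs - mean * mean);
    }

    // Z score = (score - avg) / stdDev: standard deviations above/below the mean.
    public double zScore(double score) {
        return (score - avg()) / stdDev();
    }
}

Plugging the numDocs, sumScores, and sumSquaredScores values from the sample response later in this post into these formulas reproduces the avg (19.001766) and stdDev (10.813288) shown there.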

Implementation

Solr PostFilter to the rescue! Recall that the delegating collector is invoked for each document that matches the query and filters, and also recall that, since this DelegatingCollector extends Collector, it is handed the scorer via setScorer() and can wrap it with a custom scorer. So if you implement a custom scorer that wraps the original scorer being set and captures the score of each collected document, you can calculate the statistics necessary to produce a normalized score later. For us, "later" is in our federated search component, which queries the multiple cores and blends the results together, computing a normalized score based on these score statistics stored in the results object.
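
A hedged sketch of that wrapping idea, assuming the Lucene 4.0 Scorer API (the class name and accumulator fields below are mine, not the DistributionCalcScorer from the actual component), looks like this; the collector wraps the scorer it is handed in setScorer() with something along these lines before passing it on:

import java.io.IOException;

import org.apache.lucene.search.Scorer;

public class StatsCollectingScorer extends Scorer {
    private final Scorer in;
    private double sum, sumSquares;
    private float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
    private long count;

    public StatsCollectingScorer(Scorer in) {
        super(in.getWeight());
        this.in = in;
    }

    @Override
    public float score() throws IOException {
        // Delegate to the real scorer and record the score as it flows by.
        float s = in.score();
        sum += s;
        sumSquares += (double) s * s;
        min = Math.min(min, s);
        max = Math.max(max, s);
        count++;
        return s;
    }

    @Override public int freq() throws IOException { return in.freq(); }
    @Override public int docID() { return in.docID(); }
    @Override public int nextDoc() throws IOException { return in.nextDoc(); }
    @Override public int advance(int target) throws IOException { return in.advance(target); }
}

The mean and standard deviation then fall out of count, sum, and sumSquares exactly as in the ScoreStats sketch above.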

If you visit http://www.github.com/Zvents/score_stats_component, you’ll find a completely usable implementation of what was described earlier. There are unit tests so you can understand what it’s doing. Basically what happens is:

  1. The ScoreStatsComponent creates an instance of the ScoreStatsPostFilter and adds it to the list of already existing filters (or creates a new list and adds it).
  2. The ScoreStatsPostFilter returns an instance of DistributionCalcCollector from its getCollector() method, and the DistributionCalcCollector returns a delegate scorer, DistributionCalcScorer, which keeps track of the necessary statistics as the wrapped scorer is invoked.
  3. In the process() method of the ScoreStatsComponent, the calculated statistics are extracted by searching the filter list for the ScoreStatsPostFilter instance and are returned in the results.
Below is an example of what gets returned when you enable this component:
 <lst name="scoreStats">
  <float name="min">0.45898002</float>
  <float name="max">37.544556</float>
  <float name="avg">19.001766</float>
  <float name="stdDev">10.813288</float>
  <long name="numDocs">100</long>
  <float name="sumSquaredScores">47799.43</float>
  <float name="sumScores">1900.1766</float>
 </lst>

Notes/Gotchas/Outstanding issues

Since these statistics are generated during query execution, there are a few situations where they won’t be generated even if explicitly requested:

  1. When the query hits the queryResultCache and the results are returned straight from the cache. To remedy this, set up a parallel "scoreStatsCache" with at least the same size/parameters. The component checks the statistics to see whether they are valid (i.e. not NaN) and, if they are invalid, consults the "scoreStatsCache" (using the same cache key generation as core Solr) to fetch the statistics for the query.
    • To define this user cache, simply add this to your solrconfig.xml:
        <cache name="scoreStatsCache" class="solr.LRUCache"
                         size="512"
                         initialSize="512"
                         autowarmCount="0"/>
    
  2. When explicitly sorting the results by fields that don’t include "score". Scoring is an expensive process, so if the user requests that the documents be sorted by a field other than score, the scorer doesn’t get called and hence the statistics won’t be computed.
    • There are scores generated but I am not sure how to get at them in a manner similar to what this component does already. If you have a solution, please submit a pull request!
  3. With the grouping functionality turned on, the statistics reported are for the unrolled/pre-grouped set of documents. I tried to get this to work on the post-grouped set, but due to how grouping is implemented, I had a tough time making it work. If you have suggestions on how to make this work, please submit a pull request.

Conclusion

Using Solr 4.0’s new PostFilter interface, you can implement custom scoring and collecting that previously could only be done by modifying the core Solr/Lucene source. In my opinion, this is one of the more powerful yet underrated features of Solr 4, and it opens up more possibilities when implementing your search engine. This method of filtering is great when standard filters are expensive to compute, whether that means computationally expensive (as in the case of distance filters) or requiring a lookup against some external data source (e.g. filtering results to show only users who are "online").

I want to thank the Solr community for helping me arrive at this solution of using Solr’s PostFilter interface. I posted a question about computing score stats on the mailing list and received an answer the next day pointing me at the PostFilter interface. This was a huge help, and I am hoping this article helps others in the same way. As always, if there are questions/comments/inaccuracies or things that need clarifying, please drop a comment and I’ll do my best to respond promptly!

Tuesday, July 17, 2012

Running your Scalding jobs in Eclipse

In a nutshell, Scalding is a Cascading DSL in Scala. If you know what that line means, skip to the meat below; otherwise, read the next section for a small bit of background. Note: If you are reading this again, I have updated the sections below to rely on Maven instead of SBT, included a link to my sample project to help get you started, and fixed some serious omissions I found when I revisited my own blog post to set up a new Scalding project.

Introduction

The Hadoop ecosystem is full of wonderful libraries/frameworks/tools that make it easier to query your big data sets without having to write raw Map/Reduce jobs. Cascading (http://www.cascading.org) is one such framework that, simply put, provides APIs to define work flows that process your large datasets. Cascading provides facilities to think about data in terms of pipes through which flow tuples that can be filtered and processed with operations. Couple this with the fact that pipes can be joined together or split apart to produce new pipes, and that Flows (which connect data sources and data sinks) can be tied together, and you can create some pretty powerful data work flows.
Cascading is a wonderful API that provides the ability to do all these great things, but because it's Java and the language is verbose, it's always a bit hard to get started with Cascading from scratch. I've been using it for years and I find that each new project requires me to go back to a previous one and copy/paste some boilerplate code. I think others had similar problems, and hence numerous DSLs (Domain Specific Languages) written in Ruby, Clojure (Cascalog), Scala (Scalding), etc. came out to wrap Cascading and make it easier to write these flows.
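
To give a feel for that verbosity, here is a rough sketch of the classic Cascading word count in raw Java. Treat it as illustrative only: the class and package names are recalled from the Cascading 2.x API and may differ in your version, and the job/field names are my own.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class RawCascadingWordCount {
    public static void main(String[] args) {
        // Taps are the data sources and sinks that a Flow ties together.
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // A pipe assembly: split each line into words, group by word, count.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // Connect source, sink and assembly into a Flow and run it.
        FlowConnector connector = new HadoopFlowConnector(new Properties());
        Flow flow = connector.connect("wordcount", source, sink, assembly);
        flow.complete();
    }
}

Scalding collapses essentially all of this into a few lines of Scala that read like operations on a collection.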

I won't pretend to be a Scalding expert so I advise you to visit their site (https://github.com/twitter/scalding/) but what I do know is that it's a Scala DSL around Cascading with some slight tweaks to make it easy to build big data processing flows in Scala. The API is designed to look like the collections API so the same code that works on a small list of data could be used to also work on a stream of billions of tuples. I wanted to play with Scalding so I read the wiki page, downloaded it and copied the tutorial but then I wondered, how can I run this in Eclipse? Mainly because it provides me the ability to write, debug and run (locally mainly) my jobs without having to hit the command line for some tool. At the end of the day, it's a JVM language so it must be able to run in Eclipse right?

Maven + Scalding + Eclipse, Oh My!

I don't have much of an opinion about SBT and can't really say much good or bad about it but I do know Maven is popular and I tend to like it for managing my project dependencies and assembly descriptors etc. It also reduces the amount of stuff to install when setting up a new laptop or bringing new team members up to speed on this technology so I wanted to get this working with as few moving parts as possible.

Pre-Requisites:

  1. Eclipse
  2. Maven
  3. Scala Plugin for Eclipse

Running Scalding in Eclipse

Perhaps the simplest way to get started is to clone my sample project from git and modify as necessary. Once cloned, simply run
mvn eclipse:eclipse
to generate the Eclipse project files, and everything should build as expected. The sample job is the word count job from the Scalding tutorial.
Once you have a working Eclipse project, to run the Scalding job:
  1. Create a new runtime configuration:
    Main class: com.twitter.scalding.Tool
    Program Args: <Fully Qualified Path to Job Class> <Other CLI Args>
    Example: org.hokiesuns.scaldingtest.WordCountJob --local --input ./input/input.txt --output ./output.txt
    VM Args: -Xmx3072m
    
To create a job jar that can be submitted to your hadoop cluster, simply run
mvn package
which will generate a fat jar with all the dependencies. This job jar can be submitted to your cluster by executing
 hadoop jar scaldingsample-0.0.1-SNAPSHOT.jar org.hokiesuns.scaldingtest.WordCountJob --hdfs --input <some path> --output <some path> 
I just started using Scalding and got this working in Eclipse. If there are any problems or inaccuracies, please post a comment and I'll update my steps. Happy scalding-ing!

Thursday, May 10, 2012

Manipulating URLs with long query strings using Chrome

Website URLs, especially search engine ones as well as REST APIs with lots of parameters, are getting longer and longer, while browsers haven't kept up and keep their URL bars short. When debugging lengthy URLs such as those produced when using Apache Solr, most of the time I end up copying/pasting the URL into a text editor that supports line wrapping and then making the necessary changes. As you can imagine, this is incredibly time consuming and error prone, which made me wonder, "Why isn't there a browser extension to make it easier to edit URLs with long query strings?" After much searching, I couldn't find one so I wrote one!

Chrome Extension

  1. Git clone the repository URL( https://github.com/ANithian/url_edit_extension )
  2. In Chrome, enter chrome://extensions into the URL bar
  3. Make sure the "Developer Mode" box is checked
  4. Click "Load Unpacked Extension" and navigate to where you cloned the repo in step 1. After installing the extension, you should see an icon:







Next, visit a URL that contains a long query string (http://www.zvents.com/search?swhat=music&swhen=tomorrow&swhere=Berkeley%2CCA&commit=Search&st_select=any&search=true&svt=text&srss=). Then click the icon to bring up the URL editor:

To edit the parameter value for "swhat", double click on the value for "swhat", change it to "comedy", and then click the "Update" button. You should see the URL change to http://www.zvents.com/search?swhat=comedy&swhen=tomorrow&swhere=Berkeley%2CCA&commit=Search&st_select=any&search=true&svt=text&srss=

That's pretty much it! The only thing you may notice is that the extension may flicker, which I believe is due to a bug in Chrome that should be addressed in v19 (http://code.google.com/p/chromium/issues/detail?id=58816). If you have any suggestions/feedback on this extension, I'd love to hear it. Please submit an issue/feature/bug request at the github location (https://github.com/ANithian/url_edit_extension/issues). Happy editing!

Monday, February 13, 2012

Bundler + Maven for your JRuby projects!

I recently came across a blog post describing the first version of an integration between Bundler and Maven (http://jpkutner.blogspot.com/2011/04/ease-of-bundler-power-of-maven.html). Please do take a look at his post for a complete understanding of what he did, but to summarize:

Gemfile Snippet:

gem "mvn:org.slf4j:slf4j-simple", "1.6.1"

Ruby File Snippet (outside a Rails console context):

require 'java'
require 'rubygems'
require 'bundler/setup'
require 'mvn:org.slf4j:slf4j-simple'
logger = org.slf4j.LoggerFactory.getLogger("world")
...

This looked really promising but there were a few things that didn't quite look right, namely this code line:
require 'mvn:org.slf4j:slf4j-simple'
It didn't look "ruby"-ish to me, and knowing how file systems work, it's hard to create a file/folder with ":" in the name, so the gem file name would differ from the require line, which seems to break convention. Reading further and looking at his modified Bundler source, I saw that this was using the maven_gemify library that is included in JRuby (try require 'rubygems/maven_gemify' at the jirb prompt). This was really cool in that something was already written to integrate the two; however, inspecting further, there were a few things I didn't like:
  • It used a custom-written/third-party Maven plugin to resolve/download dependencies, which is something Maven does out of the box.
  • It packaged the jars into the gem directly
My second point could be somewhat moot as I am relatively new to Maven and I haven't worked with deploying a Java project that relies on Maven. My issue with packaging the jars directly into the gem is that in my development environment, I have both a local Maven repository and a local Ruby gems repository. I thought that rather than downloading again the same jar that may already exist in my Maven repo, why not have the ruby gem simply point to (i.e. require) the jar located in my Maven repo?

Upon reading the followup to Joe Kutner's blog post (http://jpkutner.blogspot.com/2011/09/bundlermaven-workaround.html), I learned he stopped development of his integration in favor of a workaround that, while I am sure it works, didn't quite satisfy my desire for a clean integration between Maven and Bundler. While writing this post, I also stumbled across http://codingiscoding.wordpress.com/2012/02/08/ruby-bundler-maven-gemfile-maven-plugin/ which seems to reverse what I am proposing (i.e. calling Bundler from Maven).

Revamped maven_gemify library

The first thing I did was revamp the maven_gemify library to eliminate this third-party dependency and rely on the local Maven repository. This was done using a fairly brute-force approach: creating a temp pom.xml that defines the repositories/dependencies and then programmatically invoking the dependency plugin to obtain the locations of the jars. Finally, it writes the wrapper ruby script to require these jars and packages this file into a gem. When gemifying a Maven dependency, you specify the name as "mvn:<groupId>:<artifactId>" and pass the version as the second argument. The corresponding ruby gem name replaces the characters "." and ":" with "_". For example, "mvn:org.apache.lucene:lucene-core" becomes "org_apache_lucene_lucene-core".

Bundler Integration

The hardest part of this was the Bundler integration; however, after a few iterations on it (thanks to the help of what Joe Kutner did in his attempts), I was able to isolate the changes to a few files (namely lib/bundler/dsl.rb and lib/bundler/source.rb, plus the creation of lib/bundler/maven_gemify2.rb). There are two ways to pull in Maven dependencies:
  1. gem "mvn:org.apache.lucene:lucene-core","3.5.0", :mvn=>"default"
  2. mvn "default" do
    gem "mvn:org.apache.lucene:lucene-core","3.5.0"
    end
The only difference between the two approaches is that in #2, you can specify several Maven dependencies from the same repository. To avoid having to spell out the default Maven repository, I allow the keyword "default" to represent the standard Maven repo URL; otherwise, you can specify the URL of the repository.

Note: Maven-based gems will be auto-required just as regular gems are upon Rails load or require 'bundler/setup'

Summary

Gemfile Snippet:

gem "mvn:org.apache.lucene:lucene-core", "3.5.0", :mvn=>"default"

Ruby File Snippet (outside a Rails console context):

require 'java'
require 'rubygems'
require 'bundler/setup'
require 'org_apache_lucene_lucene-core' #Notice the more "rubyish" way of requiring the gem. The directory name of the gem's contents is the same.
d=org.apache.lucene.store.SimpleFSDirectory.new(java.io.File.new("."))
...

Next Steps
  • Grab a copy of my modified Bundler (https://github.com/anithian/bundler) and give it a shot. Running rake install will generate the gem, and you can install it directly. I have some examples in the README file as well as those above. The Maven DSL will only work with JRuby, but the modified Bundler should still behave properly without JRuby.
  • Testing of the integration with more cases. I couldn't get the rspecs to run properly on my system and couldn't tell if it was a JRuby thing or a Windows thing.
  • Validation that what I have proposed in both the integration and the revamped maven_gemify library is sound. Some feedback on how others properly deploy Java applications using Maven would be helpful. Does the application's classpath point to jars in the Maven repo?
  • Right now, when re-executing Bundler on your project, something is amiss in the use of the lock file, and the internal cache isn't checked for Maven dependencies, so it'll go through the gemification process each time. Since my maven_gemify2 library relies on Maven's dependency plugin directly, the jars won't be re-downloaded (as they would have already been downloaded to your repo), but it would still be good to make the re-execution behavior consistent.