Scalable thoughts...: May 2013

Apache Solr is one of my favorite tools. This open-source enterprise search engine is powerful, scalable, resilient, and feature-rich. In this article, I show how to configure the 'did you mean' feature. I often tell coworkers that 'did you mean' has the best cost-benefit ratio of any Solr feature: as you will see, it is very easy to set up, and it gives users the feeling that your search mechanism is actually very smart.

Some background

Solr 'did you mean' support can be configured using SpellCheckComponent, which is built on top of Lucene implementations of the org.apache.lucene.search.spell.StringDistance interface. This interface provides the getDistance(String string1, String string2) method. The implementations simply compare the two strings passed in and return a float score between 0.0 and 1.0. The closer the score is to 1.0, the more similar the words are.

A quick test I usually do is to write a simple Java main program, take some sample words from the index, and run them through the Lucene implementations.

After this test, I have a rough idea of a reasonable value for the accuracy parameter that we will configure for SpellCheckComponent. This matters because if you are too strict (accuracy very close to 1.0), Solr may not suggest any word at all. A low accuracy value is not good either, because Solr may suggest completely different words.
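To make the effect of the accuracy threshold concrete, here is a minimal, self-contained sketch in plain Java. It does not call Lucene: the editDistance and similarity methods are a stand-in that scores two words as 1 minus the edit distance divided by the longer length, and the class name, the sample words, and the thresholds are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AccuracySketch {

    // Plain edit distance: number of single-character inserts, deletes,
    // and substitutions needed to turn a into b (two-row dynamic programming).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Stand-in similarity score in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) editDistance(a, b) / longest;
    }

    // Keep only dictionary words whose score reaches the accuracy threshold.
    static List<String> suggest(String typed, List<String> dictionary, double accuracy) {
        List<String> kept = new ArrayList<>();
        for (String word : dictionary) {
            if (!word.equals(typed) && similarity(typed, word) >= accuracy) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative words only; in practice, sample them from your own index.
        List<String> dictionary = Arrays.asList("samsung", "sandisk", "maxtor");
        for (double accuracy : new double[] {0.95, 0.8, 0.3}) {
            System.out.println(accuracy + " -> " + suggest("sansung", dictionary, accuracy));
        }
    }
}
```

With these sample words, a threshold of 0.95 keeps nothing, 0.8 keeps only samsung, and 0.3 also lets the unrelated sandisk through, which is the trade-off described above.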

The output of the code snippet is:

In the end, all we have to do is set up Solr to use one of these Lucene implementations behind the scenes, through the SpellCheckComponent configuration.

Example setup

Now, hands on! I used the default schema.xml from the Solr installation and indexed the sample documents located in the /example/exampledocs folder. The first snippet shows the solrconfig.xml setup:

Before Solr 4.x, Solr had to build a separate index for spellchecking. Now it can use the main index as the dictionary (the solr.DirectSolrSpellChecker configuration). This reduces index size, which is more than welcome. :)

Another point: the output of the test program above shows clearly that if we set the accuracy parameter higher than 0.75, only the JaroWinklerDistance implementation will return suggestions. So pay attention when choosing the accuracy value.
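As a minimal sketch of a solrconfig.xml setup along these lines, modeled on the stock Solr 4.x example configuration and not necessarily the exact file used for this post. The field name text, the field type text_general, the /spell handler name, and the numeric values are assumptions to adapt to your own schema:

```xml
<!-- Spellcheck component reading suggestions straight from the main index -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- Field used as the dictionary (assumed name) -->
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- "internal" is the built-in measure; a StringDistance class name can be given instead -->
    <str name="distanceMeasure">internal</str>
    <!-- Minimum similarity score for a word to be suggested -->
    <float name="accuracy">0.5</float>
  </lst>
</searchComponent>

<!-- Search handler that runs the spellcheck component after the query -->
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```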

After configuring solrconfig.xml, we can test the query. It looks like:

And response:

An important detail is the use of the spellcheck.collateExtendedResults and spellcheck.maxCollationTries parameters. If they are missing (spellcheck.maxCollationTries in particular), Solr may combine corrected words into a collation without checking that the resulting query actually returns results, and suggest it anyway. That is not desirable behavior. For example, if you query for ipod and sansung, Solr correctly suggests samsung, but the collation list will not include that combination, because the two words together in one query return no documents.
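A request carrying these parameters can be sketched as a small Java helper that only builds the URL. The host, the collection1 core, the /spell handler path, and the value 5 for spellcheck.maxCollationTries are assumptions for illustration; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class SpellcheckQuerySketch {

    // Builds a spellcheck request URL with collation checking enabled.
    static String buildUrl(String baseUrl, String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("spellcheck", "true");
        params.put("spellcheck.collate", "true");
        // Number of candidate collations Solr tests against the index (assumed value).
        params.put("spellcheck.maxCollationTries", "5");
        params.put("spellcheck.collateExtendedResults", "true");
        params.put("wt", "json");

        StringBuilder url = new StringBuilder(baseUrl).append('?');
        boolean first = true;
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
            first = false;
        }
        return url.toString();
    }

    // URL-encodes one key or value as UTF-8.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Assumed local Solr 4.x example instance.
        System.out.println(buildUrl("http://localhost:8983/solr/collection1/spell", "ipod sansung"));
    }
}
```

Passing the parameters on the request like this is only needed when they are not already set as defaults on the request handler.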

Complexity

Distance algorithms implemented by Lucene, such as the LevensteinDistance class (the Solr default choice), have complexity O(m·n), where m and n are the lengths of the compared strings. Given that the average word length in English is about 5 characters, this distance performs very acceptably. If your tests show poor spellcheck performance, run load and/or stress tests on your other queries as well, because the overall system performance is probably not acceptable either.
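For reference, the textbook dynamic-programming recurrence for edit distance (a standard formulation, not taken from Lucene's source) makes the O(m·n) cost visible. Here a and b are the two words, m and n their lengths, and D(m, n) is the distance:

```latex
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &= \min
\begin{cases}
D(i-1, j) + 1 \\
D(i, j-1) + 1 \\
D(i-1, j-1) + [a_i \neq b_j]
\end{cases}
\qquad 1 \le i \le m,\; 1 \le j \le n
\end{aligned}
```

The table has (m+1)(n+1) entries, each computed in constant time, so comparing two 5-letter words fills only 36 entries.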

Conclusions

Providing a 'did you mean' experience to users is easy to achieve, and it is a powerful way to reduce the "zero results" problem. You can learn more about Solr spellchecking on the Solr wiki: http://wiki.apache.org/solr/SpellCheckComponent.

That's all! See you in the next post.


Friday, May 31, 2013

Configure Solr Did You Mean
