
Sunday, June 9, 2013

Clustering documents with Solr and Carrot2

It has become very common to hear about machine learning techniques that try to infer similarities and labels among data. Carrot2 is an open-source clustering framework that does exactly that. My intent in this post is to configure Solr to use Carrot2 so that, given a search result, similar documents are clustered into groups and a label is suggested for each group. But first, we will go over some background that will help you understand the clustering mechanism.

Some background: Unsupervised learning

This is a kind of learning in which, for the input dataset, the corresponding class (or label) is unknown. Therefore, these techniques try to learn how similar documents are, considering their attributes, and then cluster them into similar groups. It's much like the way we human beings think: we are able to associate similar objects, people and other things by analysing their characteristics. That said, clustering algorithms work in an unsupervised way, which means there is no label in the training set to guide the algorithm down the right path (minimizing an error rate).
This is how Carrot2 algorithms work: in an unsupervised fashion, the implementations try to group similar documents and suggest a label for each group.

Carrot2

As we have already said, Carrot2 is a clustering framework. It contains several unsupervised learning algorithm implementations and can be easily integrated with Solr through the ClusteringComponent. This structure allows us to set up a Solr clustering searchComponent and then plug it into some requestHandlers. In the end, a request is made and, according to the search results, documents are clustered into groups based on their similarity.

Example setup

A test dataset is available right below. There are some sample documents, and we can play with the parameters:
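
The original embedded dataset is not preserved here, so the documents below are only an illustrative sketch in Solr's XML update format. The ids and title values are hypothetical, chosen so that the clusters discussed later (a "Pencil" group and an "Expensive" group) make sense:

    <!-- Hypothetical sample documents; the original post's dataset is not preserved. -->
    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">Blue pencil with eraser</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">Red pencil for drawing</field>
      </doc>
      <doc>
        <field name="id">3</field>
        <field name="title">Expensive fountain pen</field>
      </doc>
      <doc>
        <field name="id">4</field>
        <field name="title">Expensive leather notebook</field>
      </doc>
    </add>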



Now let's set up solrconfig.xml in order to configure the clustering component:
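
The original snippet is not preserved, so here is a sketch of what the clustering searchComponent configuration could look like. The clusterCount and maxIterations parameters are the ones discussed below; the exact Carrot2 attribute key names may vary between Solr/Carrot2 versions, so double-check against your release:

    <searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
      <lst name="engine">
        <str name="name">default</str>
        <!-- Carrot2 algorithm used to cluster the search results -->
        <str name="carrot.algorithm">org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm</str>
        <!-- Number of clusters to produce and iteration limit -->
        <str name="BisectingKMeansClusteringAlgorithm.clusterCount">2</str>
        <str name="BisectingKMeansClusteringAlgorithm.maxIterations">10</str>
      </lst>
    </searchComponent>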



We have chosen the BisectingKMeansClusteringAlgorithm class for this example, which is an implementation of the widely known k-means clustering algorithm. Basically, this algorithm randomly chooses n centroids (where n is the clusterCount). By centroid, we mean the instance (document) that best represents a certain group. As the algorithm iterates, it recalculates the centroids according to the documents assigned to each group, until no change in the centroids occurs or maxIterations is reached. Another characteristic of k-means is that it is a hard-clustering algorithm, which means that each document is assigned to exactly one group.
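
To make that iteration concrete, here is a toy, one-dimensional k-means sketch. This is not Carrot2's implementation (which clusters term-document vectors); it only illustrates the assign/recompute loop, the clusterCount and maxIterations parameters, and the hard assignment described above:

    import java.util.Random;

    /**
     * A toy, one-dimensional k-means sketch: assign each point to its nearest
     * centroid, recompute centroids, and stop when nothing changes or
     * maxIterations is reached.
     */
    public class KMeansSketch {

        public static int[] cluster(double[] points, int clusterCount, int maxIterations) {
            Random random = new Random();
            double[] centroids = new double[clusterCount];
            // 1. Randomly pick the initial centroids from the input points.
            for (int c = 0; c < clusterCount; c++) {
                centroids[c] = points[random.nextInt(points.length)];
            }

            int[] assignment = new int[points.length];
            for (int iteration = 0; iteration < maxIterations; iteration++) {
                boolean changed = false;
                // 2. Hard clustering: each point belongs to exactly one group,
                //    the one whose centroid is closest.
                for (int i = 0; i < points.length; i++) {
                    int nearest = 0;
                    for (int c = 1; c < clusterCount; c++) {
                        if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[nearest])) {
                            nearest = c;
                        }
                    }
                    if (assignment[i] != nearest) {
                        assignment[i] = nearest;
                        changed = true;
                    }
                }
                // 3. Stop when assignments no longer change; otherwise recompute
                //    each centroid as the mean of the points assigned to it.
                if (!changed && iteration > 0) {
                    break;
                }
                for (int c = 0; c < clusterCount; c++) {
                    double sum = 0;
                    int count = 0;
                    for (int i = 0; i < points.length; i++) {
                        if (assignment[i] == c) {
                            sum += points[i];
                            count++;
                        }
                    }
                    if (count > 0) {
                        centroids[c] = sum / count;
                    }
                }
            }
            return assignment;
        }
    }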

Once the clustering searchComponent has been set, let's declare it in a requestHandler definition:
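
A possible requestHandler declaration, again a sketch rather than the original snippet. The handler name /clustering and the title/snippet field mappings are assumptions matching the illustrative dataset above:

    <requestHandler name="/clustering" class="solr.SearchHandler">
      <lst name="defaults">
        <bool name="clustering">true</bool>
        <bool name="clustering.results">true</bool>
        <str name="clustering.engine">default</str>
        <!-- Fields Carrot2 uses as title/snippet when computing clusters and labels -->
        <str name="carrot.title">title</str>
        <str name="carrot.snippet">title</str>
        <str name="fl">id,title,score</str>
      </lst>
      <arr name="last-components">
        <str>clustering</str>
      </arr>
    </requestHandler>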



Assuming that you have already indexed the data sample, let's make a simple search request:
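
The request could be as simple as the following (the core name and the handler name are assumptions matching the configuration sketched above):

    http://localhost:8983/solr/collection1/clustering?q=*:*&rows=100&wt=xml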



Now, let's take a look at the response:
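
The original response listing is not preserved, but the clusters section of a Solr response has roughly the shape sketched below. The document ids are the illustrative ones from the sample dataset above, and the labels are the ones discussed next:

    <!-- Illustrative sketch of the clusters section of the response -->
    <arr name="clusters">
      <lst>
        <arr name="labels">
          <str>Pencil</str>
        </arr>
        <arr name="docs">
          <str>1</str>
          <str>2</str>
        </arr>
      </lst>
      <lst>
        <arr name="labels">
          <str>Expensive</str>
        </arr>
        <arr name="docs">
          <str>3</str>
          <str>4</str>
        </arr>
      </lst>
    </arr>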



Interesting! Solr and Carrot2 have grouped similar documents (represented here by their id fields) and suggested a label that should best represent each group. In the case of the "Pencil" label this is quite clear. The other group (which are not pencils, if you look at the title contents) was labeled "Expensive", which, according to the algorithm, is the most likely label given the input dataset. If you try tuning the clusterCount parameter in solrconfig.xml, you should notice that documents which were previously in different groups can end up in the same group.

Remarks

Carrot2 is a really simple clustering framework. It provides several open-source algorithm implementations, and commercial ones also exist. Apache Solr has built-in integration with Carrot2.

On the other hand, Carrot2 computes clusters over the documents in the search result. This means that, due to performance restrictions, you cannot retrieve the entire document dataset for each request; therefore, the fewer documents returned, the lower the clustering quality. If you try to increase the Solr rows parameter on a large dataset, you will find that response time can be compromised. At this point, the Carrot2 clustering.documents feature, which could avoid this situation, is not available yet. You should also take care with Solr field filters and analysers, which can add noise to the label suggestion.

Finally, if you need more accurate clustering suggestions with fast response times, you should consider doing the clustering at index time instead of query time. You may also want to check the clustering algorithms provided by Mahout, a scalable machine learning library, which we shall certainly discuss in future articles.

That's all. See you next post.

Tuesday, June 4, 2013

Automated tests in four J-steps

Hi there! In software development it is really important to also think about automated functional tests, for many reasons that we do not intend to discuss today. However, it is very common for QA teams (at least I have seen it a lot) to use complex automation tools in order to accomplish some simple tests. These tools can make the creation and evolution of tests very hard to follow.

Our purpose in this article is to build a simple example with the four J's: automate an integration test with Jersey as the REST client and JUnit as our main test framework, publish the test documentation with Javadoc and, at last, set up Jenkins to show the JUnit test reports and the Javadoc documentation in its dashboard. As you will see, this configuration is very easy to create and maintain, so you should think twice before adopting a more powerful test strategy. This simple configuration can cover most automated test needs.

Example setup

In order to show the power of the J's, we have implemented a simple webservice test and deployed it in a Jenkins environment. It is meant to be a very practical example, and that is why we have illustrated it in four J-steps.

J-step #1: Jersey

Jersey is a RESTful library, the reference implementation of the JAX-RS specification. With it, a webservice request client can be easily implemented, as in the code snippet below:
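
The original snippet is not preserved, so here is a minimal sketch of such a client using the Jersey 1.x client API. The requestWebService() method name is the one referenced later in the post; the WebServiceClient class name is an assumption:

    import javax.ws.rs.core.MediaType;

    import com.sun.jersey.api.client.Client;
    import com.sun.jersey.api.client.WebResource;

    public class WebServiceClient {

        /**
         * Makes a GET request to the given url and unmarshals the XML response
         * into an instance of the given class, using Java generics.
         */
        public static <T> T requestWebService(String url, Class<T> responseClass) {
            Client client = Client.create();
            WebResource resource = client.resource(url);
            // The webservice returns XML; JAXB unmarshals it behind the scenes.
            return resource.accept(MediaType.APPLICATION_XML).get(responseClass);
        }
    }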



The purpose of the code is to make a request to a url and save the response in an instance of a parameterized class, using Java's generics mechanism. We have also specified that the webservice returns an XML object, which will be unmarshalled behind the scenes into our Java object.

J-step #2: Javadoc

We expect Maven to generate the test javadoc (supposing we are required to write the test documentation using Javadoc). To make that happen, let's set up the maven-javadoc-plugin in pom.xml:
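
The original pom.xml snippet is not preserved; below is a sketch of what the plugin configuration could look like. The version number is illustrative, and the head texts match the output described later in the post:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-javadoc-plugin</artifactId>
      <version>2.9</version>
      <configuration>
        <!-- Custom javadoc tags used to document the automated tests -->
        <tags>
          <tag>
            <name>input</name>
            <placement>a</placement>
            <head>Test input:</head>
          </tag>
          <tag>
            <name>assert</name>
            <placement>a</placement>
            <head>Test assertion:</head>
          </tag>
        </tags>
      </configuration>
    </plugin>

With this in place, the test javadoc can be generated with mvn javadoc:test-javadoc (or bound to a build phase of your choice); by default the output goes to target/site/testapidocs.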



The Javadoc plugin for Maven allows us to customize tags, so we can write all the automated test documentation in javadoc. I think this is a great advantage, given that QA teams are always looking for tools that provide a structured way of documenting tests; javadoc in automated tests can reach this goal. In the configuration we have created two custom javadoc tags: @input and @assert. Whenever these tags appear in the code, they will be mapped to their respective head values in the generated javadoc.

J-step #3: JUnit

So, we have already configured the javadoc plugin; now let's create our first (and documented, for sure!) test class:
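
Again a sketch rather than the original class: the Summation class below is a hypothetical JAXB mapping of the webservice response, and the URL and operand values are made up for illustration:

    import static org.junit.Assert.assertEquals;

    import javax.xml.bind.annotation.XmlRootElement;

    import org.junit.Test;

    /** Hypothetical JAXB class that the webservice XML response is unmarshalled into. */
    @XmlRootElement
    class Summation {

        private int result;

        public int getResult() {
            return result;
        }

        public void setResult(int result) {
            this.result = result;
        }
    }

    public class SummationTest {

        /**
         * Requests the summation webservice and checks its result.
         *
         * @input two operands, a=2 and b=3, sent as query parameters
         * @assert the result returned by the webservice equals 5
         */
        @Test
        public void shouldSumTwoOperands() {
            // The URL below is hypothetical; point it at your deployed webservice.
            Summation summation = WebServiceClient.requestWebService(
                    "http://localhost:8080/calculator/sum?a=2&b=3", Summation.class);
            assertEquals(5, summation.getResult());
        }
    }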



This test is really, really, really simple. A request is made using the requestWebService() method and the response is put into a Summation instance. You can implement the Summation class using the classes inside the JAXB (another "J") javax.xml.bind package. The getResult() method holds the result of the sum operation provided by the webservice. Finally, it can be compared with the expected result using JUnit's assertEquals.

J-step #4: Jenkins

Let's deploy this structure in our continuous integration environment. The last part is setting up Jenkins. Here, the Javadoc and JUnit report views can be enabled in the dashboard by filling in the form below:
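
The original screenshot is not reproduced here. With the standard Jenkins JUnit and Javadoc plugins, the job configuration boils down to something like the following; the directories are the usual Maven defaults and may differ in your build:

    Post-build action "Publish JUnit test result report"
        Test report XMLs:   target/surefire-reports/*.xml

    Post-build action "Publish Javadoc"
        Javadoc directory:  target/site/testapidocs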



We just have to point it at the Maven-generated test report and javadoc directories. With this approach, we can easily visualize the JUnit reports and the test documentation in the Jenkins dashboard. The test javadoc written before should look like this:



As expected, the @assert and @input tags were converted to "Test assertion" and "Test input", respectively.

Remarks

Integration testing is an essential part of any software development cycle. However, it is sometimes neglected because of the inherent complexity of the activity. This approach gives you a simple way to write automated webservice tests without a "silver bullet" tool, which can be very hard to maintain. I mean, there are a lot of interesting and powerful test tools, but sometimes we are required to develop things really simply, under time and/or budget restrictions.

Another advantage is that, by deploying the test cases, test reports and javadoc to Jenkins, we concentrate these artifacts in the same environment. This scenario contributes to test maintainability.

That's acceptable for now. See you next post.