
Sunday, June 9, 2013

Clustering documents with Solr and Carrot2

Machine learning techniques that infer similarities and labels from data come up more and more often. Carrot2 is an open-source clustering framework that does exactly that. My goal in this post is to configure Solr to use Carrot2 so that, given a search result, similar documents are clustered into groups and a label is suggested for each group. But first, here is some background that will help you understand how clustering works.

Some background: Unsupervised learning

This is a kind of learning in which the class (or label) of each item in the input dataset is unknown. Techniques therefore learn how similar documents are by comparing their attributes, and then cluster them into groups of similar items. It resembles the way we humans think: we are able to associate similar objects, people and other things by analysing their characteristics. Clustering algorithms work in this unsupervised way, meaning there are no labels in the training set to guide the algorithm toward the right answer (by minimizing an error rate).
This is how Carrot2 algorithms work: in an unsupervised way, they try to group similar documents and suggest a label for each group.
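
To make "how similar documents are" a bit more concrete, here is a minimal sketch of one common similarity measure, cosine similarity over word counts. It is only an illustration: the titles are made up, the tokenization is a plain whitespace split, and nothing here claims to be what Carrot2 does internally.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical titles, not the dataset used in this post.
titles = ["red pencil", "blue pencil", "expensive leather wallet"]
vectors = [bag_of_words(t) for t in titles]

pencil_vs_pencil = cosine_similarity(vectors[0], vectors[1])
pencil_vs_wallet = cosine_similarity(vectors[0], vectors[2])
print(pencil_vs_pencil, pencil_vs_wallet)
```

The two pencil titles share a word and score higher than the pencil and the wallet, which share none. No labels were needed to find that out, and that is the whole idea of unsupervised learning.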

Carrot2

As already mentioned, Carrot2 is a clustering framework. It contains several implementations of unsupervised learning algorithms and can be easily integrated into Solr through the ClusteringComponent. This structure lets us set up a Solr clustering searchComponent and then plug it into one or more requestHandlers. In the end, when a request is made, the documents in the search results are clustered into groups based on their similarity.

Example setup

A test dataset is available right below. It contains some sample documents, and we can play with the parameters:



Now let's set up solrconfig.xml to configure the clustering component:



For this example we have chosen the BisectingKMeansClustering class, which is an implementation of the widely known k-means clustering algorithm. Basically, the algorithm starts by randomly choosing n centroids (where n is the clusterCount). By centroid, we mean the point that best represents a group, computed as the mean of the documents assigned to it. As the algorithm iterates, it assigns each document to its nearest centroid and recalculates the centroids from the documents assigned to them, until the centroids stop changing or maxIterations is reached. Another characteristic of k-means is that it is a hard-clustering algorithm, which means that a document belongs to exactly one group.
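
The loop described above can be sketched in a few lines of Python. Take this as a rough sketch of plain k-means on 2D points, under my own assumptions: it is not Carrot2's bisecting implementation, the points are invented, and the parameter names cluster_count and max_iterations only mirror the clusterCount and maxIterations settings mentioned in this post.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, cluster_count, max_iterations=100, seed=0):
    """Plain k-means: returns (cluster index per point, final centroids)."""
    rng = random.Random(seed)
    # Start from randomly chosen input points as centroids.
    centroids = rng.sample(points, cluster_count)
    assignment = []
    for _ in range(max_iterations):
        # Hard assignment: each point goes to exactly one nearest centroid.
        assignment = [
            min(range(cluster_count),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(cluster_count):
            members = [p for p, a in zip(points, assignment) if a == c]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped changing
        centroids = new_centroids
    return assignment, centroids


# Made-up points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, cluster_count=2)
print(labels)
print(centroids)
```

Each point ends up with a single cluster index, which is the hard-clustering behaviour: no point belongs to two groups at once.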

Once the clustering searchComponent has been set up, let's declare it in the requestHandler definition:



Assuming you have already indexed the sample data, let's run a simple search:
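
If you prefer to build the request from a script, something like the following might help. It is a sketch under assumptions: the base URL and the /clustering handler name are placeholders of mine (use whatever you declared in your requestHandler), while clustering, clustering.results, rows and wt are regular Solr request parameters.

```python
from urllib.parse import urlencode

# Assumptions for illustration: adjust both to your own Solr setup.
SOLR_BASE_URL = "http://localhost:8983/solr"
HANDLER = "/clustering"


def build_clustering_url(query, rows=10):
    """Build a search URL that asks Solr to cluster the returned documents."""
    params = {
        "q": query,
        "rows": rows,                  # only these documents get clustered
        "clustering": "true",          # enable the clustering component
        "clustering.results": "true",  # cluster the search results
        "wt": "json",
    }
    return SOLR_BASE_URL + HANDLER + "?" + urlencode(params)


url = build_clustering_url("*:*", rows=100)
print(url)
```

The script only builds the URL. You can paste it into a browser or fetch it with any HTTP client once Solr is running.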



So, let's take a look at the response:



Interesting! Solr and Carrot2 have grouped similar documents (represented by their id fields) and suggested a label that should best represent each group. For the label "Pencil" this is quite clear. The other group (whose documents are not pencils, as you can see from the title contents) was labeled "Expensive", which according to the algorithm is the most likely label given the input dataset. If you try tuning the clusterCount parameter in solrconfig.xml, you should notice that documents which were in different groups before can end up in the same group.
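
If you ask for the response in JSON, the clusters section is easy to walk through in code. Below is a small sketch that maps each document id to its suggested label. The sample response is hand-written for illustration: the labels "Pencil" and "Expensive" come from this post, but the document ids are hypothetical and the structure is reduced to the labels and docs entries.

```python
import json

# Hand-written sample, not real Solr output; ids are hypothetical.
sample_response = json.loads("""
{
  "response": {"numFound": 4, "docs": []},
  "clusters": [
    {"labels": ["Pencil"], "docs": ["1", "2"]},
    {"labels": ["Expensive"], "docs": ["3", "4"]}
  ]
}
""")


def label_by_document(response):
    """Map each document id to the label of the cluster that contains it."""
    mapping = {}
    for cluster in response.get("clusters", []):
        label = ", ".join(cluster.get("labels", []))
        for doc_id in cluster.get("docs", []):
            mapping[doc_id] = label
    return mapping


doc_labels = label_by_document(sample_response)
for doc_id, label in sorted(doc_labels.items()):
    print(doc_id, label)
```

Because k-means is a hard-clustering algorithm, each id shows up in one cluster only, so a plain dictionary is enough here.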

Remarks

Carrot2 is a really simple clustering framework. It offers several open-source algorithm implementations as well as commercial ones. Apache Solr has built-in integration with Carrot2.

On the other hand, Carrot2 computes clusters from the documents in the search result only. For performance reasons, you cannot retrieve the whole document dataset in a single request, so the fewer documents returned, the lower the quality of the clustering. If you increase the Solr rows parameter on a large dataset, you will find that response time can suffer. At this point, the Carrot2 clustering.documents feature, which could avoid this situation, is not available yet. You should also pay attention to Solr field filters and analysers, which can add noise to the label suggestions.

Finally, if you need more accurate clustering with fast response times, you should consider doing it at index time rather than at query time. You may also want to check the clustering algorithms provided by Mahout, a scalable machine learning library, which we will certainly discuss in future articles.

That's all. See you in the next post.

1 comment:

  1. Thank you very much for this post. I've followed your approach and got the results below:




    Services
    1.4112504570065116
    doc5, doc3, doc11

    Other Topics
    0.0
    true
    doc6, doc1, doc7, doc8, doc4, doc2, doc10, doc9, doc12, doc13




    However, when I try to view it in Carrot2 Workbench, I can only see one solid donut with "Other Topics". Can you help me troubleshoot this, please?

    Thank you,
    Apsara
