Recommendation is an important task for e-commerce sites, news sites and websites in general: the main goal is to provide users with relevant information related to what they are looking for. In this context, we will discuss how to achieve it by retrieving indexed data (a per-user "wishlist") from Solr and using it to build a recommender with Apache Mahout that infers related items from users' preferences.
Some background: Apache Mahout
Mahout is a scalable machine learning framework that supports recommendation, collaborative filtering, clustering and classification tasks by providing a well-known set of algorithm implementations. Because Mahout is scalable, it readily supports distributed computing, using Hadoop as its underlying map-reduce framework.
Through its DataModel interface, Mahout provides mechanisms to process data stored in files, databases or memory, which makes integration with different sources (not just Hadoop) straightforward. We have chosen Solr as our data source simply for simplicity.
Since we intend to perform a recommendation task, we will set up a user-based recommendation environment in Mahout for testing. For now, we are not worried about performance issues related to the size of the available data, so the next section uses a small toy sample to illustrate the framework's functionality.
By user-based recommendation we mean that the items recommended to a given user are inferred from the preferences of similar users.
Example setup: user-based recommendation
First of all, we have indexed some data in Solr. Each document can be seen as a user's wishlist, where the userid field is the unique user identifier and the itemlist field holds a list of product ids. Executing a Solr query returns a response like this:
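The original response listing is not reproduced here; assuming a JSON response format (wt=json) and the Mahout in Action sample data described just below (five users, items 101–107), it would look roughly like this:

```json
{
  "response": {
    "numFound": 5,
    "start": 0,
    "docs": [
      { "userid": 1, "itemlist": [101, 102, 103] },
      { "userid": 2, "itemlist": [101, 102, 103, 104] },
      { "userid": 3, "itemlist": [101, 104, 105, 107] },
      { "userid": 4, "itemlist": [101, 103, 104, 106] },
      { "userid": 5, "itemlist": [101, 102, 103, 104, 105, 106] }
    ]
  }
}
```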
This sample data is the same used in the early chapters of Mahout in Action. The main difference here is that we drop the preference values, because in this case we are not considering users' ratings of items. These are called boolean preferences: if an itemID is associated with a userID, then a preference exists and its value does not matter.
The code snippet below creates a model as a GenericBooleanPrefDataModel instance from the data indexed in Solr. This model is basically the data structure used by the recommendation algorithm in the next steps.
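The original snippet is not shown here; a minimal sketch of the idea, with the SolrJ fetch replaced by a hardcoded map mirroring the sample data (the class and method names other than Mahout's are our own):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class SolrBooleanModel {

    // Turns userid -> itemlist wishlists (as fetched from Solr) into a
    // boolean-preference DataModel: only the user/item association is kept.
    public static DataModel buildModel(Map<Long, List<Long>> wishlists) {
        FastByIDMap<FastIDSet> data = new FastByIDMap<FastIDSet>();
        for (Map.Entry<Long, List<Long>> entry : wishlists.entrySet()) {
            FastIDSet items = new FastIDSet();
            for (Long itemId : entry.getValue()) {
                items.add(itemId);
            }
            data.put(entry.getKey(), items);
        }
        return new GenericBooleanPrefDataModel(data);
    }

    public static void main(String[] args) throws Exception {
        // In the real setup these entries come from the Solr query response
        // (userid and itemlist fields); hardcoded here for illustration.
        Map<Long, List<Long>> wishlists = new LinkedHashMap<Long, List<Long>>();
        wishlists.put(1L, Arrays.asList(101L, 102L, 103L));
        wishlists.put(2L, Arrays.asList(101L, 102L, 103L, 104L));
        wishlists.put(3L, Arrays.asList(101L, 104L, 105L, 107L));
        wishlists.put(4L, Arrays.asList(101L, 103L, 104L, 106L));
        wishlists.put(5L, Arrays.asList(101L, 102L, 103L, 104L, 105L, 106L));

        DataModel model = buildModel(wishlists);
        System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");
        // → 5 users, 7 items
    }
}
```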
After creating a DataModel object, we are able to build a recommender:
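Again, the original snippet is reconstructed here as a sketch. The neighborhood size of 2 is our assumption (the article does not state the exact value used); everything else follows the classes named in the text:

```java
import java.util.List;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BooleanRecommenderExample {

    static FastIDSet items(long... ids) {
        FastIDSet set = new FastIDSet();
        for (long id : ids) {
            set.add(id);
        }
        return set;
    }

    public static DataModel sampleModel() {
        // Same boolean wishlists as before, hardcoded for illustration.
        FastByIDMap<FastIDSet> data = new FastByIDMap<FastIDSet>();
        data.put(1L, items(101, 102, 103));
        data.put(2L, items(101, 102, 103, 104));
        data.put(3L, items(101, 104, 105, 107));
        data.put(4L, items(101, 103, 104, 106));
        data.put(5L, items(101, 102, 103, 104, 105, 106));
        return new GenericBooleanPrefDataModel(data);
    }

    public static Recommender buildRecommender(DataModel model) throws Exception {
        // Log-likelihood works without preference values; 2 is an assumed
        // neighborhood size, not necessarily the one used in the article.
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    }

    public static void main(String[] args) throws Exception {
        Recommender recommender = buildRecommender(sampleModel());
        // The article reports item 102 being recommended to user #4.
        List<RecommendedItem> recs = recommender.recommend(4L, 1);
        for (RecommendedItem rec : recs) {
            System.out.println(rec.getItemID());
        }
    }
}
```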
The code snippet above does all the magic: nearest-neighborhood means that the items recommended to a user are computed according to the (log-likelihood) similarity between that user and the users contained in the model. The GenericBooleanPrefUserBasedRecommender class is the appropriate recommender for boolean preferences.
The LogLikelihoodSimilarity class does not need preference values. Other similarity metrics, such as Euclidean distance and Pearson correlation, throw an IllegalArgumentException for boolean preferences.
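This can be checked directly; a small sketch using Pearson correlation against a tiny boolean model (the class and method names here are ours, and the exact point where the exception is raised may vary between Mahout versions):

```java
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class BooleanPrefSimilarityCheck {

    public static DataModel tinyBooleanModel() {
        // One user, two items, no preference values.
        FastByIDMap<FastIDSet> data = new FastByIDMap<FastIDSet>();
        FastIDSet items = new FastIDSet();
        items.add(101L);
        items.add(102L);
        data.put(1L, items);
        return new GenericBooleanPrefDataModel(data);
    }

    // Returns true if Pearson rejects a model without preference values.
    public static boolean pearsonRejects(DataModel model) throws Exception {
        try {
            new PearsonCorrelationSimilarity(model);
            return false;
        } catch (IllegalArgumentException expected) {
            // Pearson (like Euclidean distance) requires preference values.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Pearson rejects boolean model: " + pearsonRejects(tinyBooleanModel()));
    }
}
```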
After executing this code, the recommender's answer is item 102: user #4 has already liked items 101, 103, 104 and 106, and item 102 is recommended to him according to our user-based recommendation setup.
Anonymous users
If the user is not indexed (i.e. a new, not-logged-in or unregistered user), you can change the code to use the PlusAnonymousUserDataModel class, which represents a user about whom we have no information. Here we assume that some anonymous user has marked items 102 and 105 in some kind of wishlist.
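The corresponding snippet can be sketched as follows, assuming the same sample model as before and rebuilding the recommender on top of the wrapping model (the neighborhood size of 2 and the helper names are our assumptions):

```java
import java.util.List;

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.BooleanUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class AnonymousUserExample {

    static FastIDSet items(long... ids) {
        FastIDSet set = new FastIDSet();
        for (long id : ids) {
            set.add(id);
        }
        return set;
    }

    public static DataModel sampleModel() {
        FastByIDMap<FastIDSet> data = new FastByIDMap<FastIDSet>();
        data.put(1L, items(101, 102, 103));
        data.put(2L, items(101, 102, 103, 104));
        data.put(3L, items(101, 104, 105, 107));
        data.put(4L, items(101, 103, 104, 106));
        data.put(5L, items(101, 102, 103, 104, 105, 106));
        return new GenericBooleanPrefDataModel(data);
    }

    public static List<RecommendedItem> recommendForAnonymous(DataModel model, long... likedItems)
            throws Exception {
        // Wrap the existing model; the anonymous user gets a temporary id.
        PlusAnonymousUserDataModel anonymousDataModel = new PlusAnonymousUserDataModel(model);
        PreferenceArray prefs = new BooleanUserPreferenceArray(likedItems.length);
        prefs.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
        for (int i = 0; i < likedItems.length; i++) {
            prefs.setItemID(i, likedItems[i]);
        }
        anonymousDataModel.setTempPrefs(prefs);
        try {
            UserSimilarity similarity = new LogLikelihoodSimilarity(anonymousDataModel);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, anonymousDataModel);
            Recommender recommender =
                new GenericBooleanPrefUserBasedRecommender(anonymousDataModel, neighborhood, similarity);
            return recommender.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 2);
        } finally {
            anonymousDataModel.clearTempPrefs();
        }
    }

    public static void main(String[] args) throws Exception {
        // The anonymous user has marked items 102 and 105.
        for (RecommendedItem rec : recommendForAnonymous(sampleModel(), 102L, 105L)) {
            System.out.println(rec.getItemID());
        }
    }
}
```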
The anonymousDataModel object wraps our previously loaded DataModel object, and a PreferenceArray holds the preferred items together with a temporary userID representing the anonymous user. The output is the list of items recommended to this new user, based on the items selected by previous users.
Remarks
We have illustrated how to retrieve Solr data and build a Mahout DataModel in order to make good user-based recommendations. Modeling the input and then building a recommender is quite simple. We have focused on items without preference values, which is why we applied the log-likelihood similarity implementation, the only similarity metric here that does not require these values in order to produce a result.
As you may have noticed, evaluating recommendation performance and response quality metrics (accuracy, precision, recall and so on) was out of our scope. These topics will certainly be explored later.