Evaluating recommender systems

This vignette introduces the R package recometrics, which evaluates recommender systems built with implicit-feedback data. It assumes that the recommendation models are based on low-rank matrix factorization (as produced by packages such as cmfrec, rsparse, or recosystem, among many others), or more generally that a user-item score can be computed as a dot product of user and item factors/components/attributes.

Implicit-feedback data

Historically, many models for recommender systems were designed by approaching the problem as regression or rating prediction, by taking as input a matrix \(\mathbf{X}_{ui}\) denoting user likes and dislikes of items in a scale (e.g. users giving a 1-to-5 star rating to different movies), and evaluating such models by seeing how well they predict these ratings on hold-out data.

In many cases, it is impossible or very expensive to obtain such data, but one has instead so called “implicit-feedback” records: that is, observed logs of user interactions with items (e.g. number of times that a user played each song in a music service), which do not signal dislikes in the same way as a 1-star rating would, but can still be used for building and evaluating recommender systems.

In the latter case, the problem is approached more as ranking or classification instead of regression, with the models being evaluated not by how well they perform at predicting ratings, but by how good they are at scoring the observed interactions higher than the non-observed interactions for each user, using metrics more typical of information retrieval.

Generating a ranked list of items for each user according to their predicted score and comparing such lists against hold-out data can nevertheless be very slow (might even be slower than fitting the model itself), and this is where recometrics comes in: it provides efficient routines for calculating many implicit-feedback recommendation quality metrics, which exploit multi-threading, SIMD instructions, and efficient sorted search procedures.

Matrix factorization models

The perhaps most common approach to building a recommendation model is to approximate the matrix \(\mathbf{X}_{mn}\) as the product of two lower-dimensional matrices \(\mathbf{A}_{mk}\) and \(\mathbf{B}_{nk}\) (with \(k \ll m\) and \(k \ll n\)), representing latent user and item factors/components, respectively, which are the model parameters to estimate - i.e. \[ \mathbf{X} \approx \mathbf{A} \mathbf{B}^T \] In the explicit-feedback setting (e.g. movie ratings), this is typically done by minimizing squared errors with respect to the observed entries in \(\mathbf{X}\). In implicit-feedback settings, the \(\mathbf{X}\) matrix is instead binarized - one if the interaction is observed, zero if not - with the actual values (e.g. number of times that a song was played) used as confidence weights for the positive entries, and with all entries, rather than just the observed (non-zero) ones, entering the objective - e.g.: \[ \min_{\mathbf{A}, \mathbf{B}} \sum_{u=1}^{m} \sum_{i=1}^{n} (1 + \alpha x_{ui}) (I_{x_{ui}>0} - \mathbf{a}_u \cdot \mathbf{b}_i)^2 \] where \(\alpha\) is a confidence multiplier.

The recommendations for a given user are then produced by calculating the full products between that user vector \(\mathbf{a}_u\) and the \(\mathbf{B}\) matrix, sorting these predicted scores in descending order.
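As a minimal sketch of this scoring step (the matrices here are randomly generated stand-ins for fitted factors, with one row per user/item):

```r
set.seed(1)
k <- 10; n_users <- 5; n_items <- 500
A <- matrix(rnorm(n_users * k), nrow = n_users)  # user factors (one row per user)
B <- matrix(rnorm(n_items * k), nrow = n_items)  # item factors (one row per item)

## Predicted scores for user 1 against all items: a_u . b_i for every item i
scores <- B %*% A[1, ]

## Indices of the top-10 recommended items, by descending predicted score
top_k <- order(scores, decreasing = TRUE)[1:10]
```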

For a better overview of implicit-feedback matrix factorization, see the paper: Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

Evaluating recommendation models

Such matrix factorization models are commonly evaluated by setting aside a small number of users as a hold-out for evaluation and fitting a model to all the remaining users and items. Then, for each evaluation user, a fraction of their interaction data is set aside as a hold-out test set, while their latent factors are computed from the rest of their data and the previously fitted model from the other users.

Then, top-K recommendations are produced for each evaluation user, discarding the non-hold-out items with which their latent factors were just computed, and these top-K lists are compared against the hold-out test items, measuring how well the model ranks them near the top versus the remainder of the items.


This package can be used to calculate many recommendation quality metrics given the user and item factors and the train-test data split that was used, including:

- \(P@K\) (precision-at-K) and its truncated variant \(TP@K\)
- \(R@K\) (recall-at-K)
- \(AP@K\) (average precision) and its truncated variant \(TAP@K\)
- \(NDCG@K\) (normalized discounted cumulative gain)
- \(Hit@K\) (hit rate)
- \(RR@K\) (reciprocal rank)
- ROC-AUC and PR-AUC

(For more details about the metrics, see the package documentation: ?calc.reco.metrics)

NOT covered by this package:


Now a practical example using the library cmfrec and the MovieLens100K data, taken from the recommenderlab package.

Note that this is an explicit-feedback dataset about movie ratings. Here it will be converted to implicit-feedback by setting movies with a rating of 4 and 5 stars as the positive (observed) data, while the others will be set as negative (unobserved).

Loading the data

This section will load the MovieLens100K data and filter out observations with a rating of less than 4 stars in order to have something that resembles implicit feedback.
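A sketch of this step, assuming recommenderlab's `MovieLense` dataset (a `realRatingMatrix` whose ratings are stored in a sparse `dgCMatrix`):

```r
library(Matrix)
library(recommenderlab)

data("MovieLense")
X <- MovieLense@data   # sparse user-by-movie matrix of 1-5 star ratings

## Keep only ratings of 4 or 5 stars as the "observed" implicit feedback
X@x[X@x < 4] <- 0
X <- drop0(X)          # drop the zeroed-out entries from the sparse structure
```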

Creating a train-test split

Now leaving aside a random sample of 100 users for model evaluation, for whom 30% of the data will be left as a hold-out test set.
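This split can be produced with recometrics itself - a sketch, with argument and element names taken from `?create.reco.train.test` (consult the documentation to confirm the exact interface):

```r
library(recometrics)

## Assuming `X` is the filtered sparse ratings matrix from the previous step
reco_split <- create.reco.train.test(
    X,
    users_test_fraction = NULL,
    max_test_users      = 100,  # 100 users held out for evaluation
    items_test_fraction = 0.3   # 30% of their interactions become the test set
)
X_train <- reco_split$X_train   # data for the remaining (training) users
X_test  <- reco_split$X_test    # hold-out interactions of the test users
X_rem   <- reco_split$X_rem     # non-hold-out interactions of the test users
```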

Establishing baselines

In order to determine whether a personalized recommendation model is bringing value or not, it's logical to compare such a model against the simplest possible ways of making recommendations, such as:

- Making random recommendations.
- Making non-personalized recommendations that rank items by overall popularity (the same list for every user).

This section creates such baselines to compare against.
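A sketch of two such baselines, assuming the split objects from the previous step: random factors with no relationship to the data, and a non-personalized popularity model fitted with cmfrec's `MostPopular` (check `?MostPopular` for where the item scores are stored):

```r
library(cmfrec)

k <- 10
set.seed(123)

## Random baseline: factors drawn from a normal distribution
UserFactors_random <- matrix(rnorm(nrow(X_test) * k), nrow = nrow(X_test))
ItemFactors_random <- matrix(rnorm(ncol(X_test) * k), nrow = ncol(X_test))

## Non-personalized baseline: items ranked by overall popularity
model_pop   <- cmfrec::MostPopular(X_train, implicit = TRUE)
item_biases <- model_pop$matrices$item_bias  # per-item popularity scores
```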

Fitting models

This section will fit a few models in order to have different ranked lists to evaluate: an explicit-feedback model with weighted-lambda regularization ("Weighted-Lambda" in the tables below), a hybrid explicit-feedback model ("Hybrid-Explicit"), and the implicit-feedback WRMF model (a.k.a. iALS).

All of these models are taken from the cmfrec package - see its documentation for more details about the models.
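A sketch of fitting the WRMF (iALS) model with cmfrec (the hyperparameters here are illustrative, not tuned):

```r
library(cmfrec)

## WRMF / iALS: weighted regression on the binarized matrix, with
## confidence proportional to the original values (alpha multiplier)
model_wrmf <- cmfrec::CMF_implicit(X_train, k = 10, lambda = 1, alpha = 1)

## Latent factors for the evaluation users, computed from their
## non-hold-out data only (X_rem) and the already-fitted item factors;
## check ?factors for the orientation of the returned matrix
UserFactors_wrmf <- cmfrec::factors(model_wrmf, X_rem)
```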

Important: for the explicit-feedback models, it’s not possible to use the same train-test split strategy as for the implicit-feedback variants, as the training data contains only 4 and 5 stars, which does not signal any dislikes and thus puts these models at a disadvantage. As such, here the user factors will be obtained from the full data (train+test), which gives them a quite unfair advantage compared to the other models.

In theory, one could also split the full ratings data, and filter out low-star ratings in the test set only, but that would still distort the metrics a bit for the implicit-feedback models. Alternatively, one could adjust the WRMF model to take low-star ratings as more negative entries with higher weight (e.g. giving them a value of -1 and a weight of 5 minus rating), which is supported by e.g. cmfrec. Note however that the only metric in this package that can accommodate such a scenario (implicit feedback plus dislikes) is the \(NDCG@K\) metric.

Other models

The MovieLens100K data used here comes with metadata/attributes about the users (gender, occupation, age, among others) and the items (genre and year of release), which so far have not been incorporated into these models.

One simple way of adding this secondary information into the same WRMF model is through the concept of “Collective Matrix Factorization”, which does so by also factorizing the side information matrices, but using the same user/item factors - see the documentation of cmfrec for more details about this approach.

As well, one typical trick in the explicit-feedback variant is to add a fixed bias/intercept for each user and item, which is also possible to do in the WRMF model by making some slight modifications to the optimization procedure.

This section will fit additional variations of the WRMF model to compare against:

First processing the data as required for the new models:
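A sketch of this preprocessing, using recommenderlab's companion datasets `MovieLenseUser` and `MovieLenseMeta` (the exact columns kept and encodings are illustrative choices):

```r
library(recommenderlab)

data("MovieLense")

## User side info: age, sex, occupation (one-hot encoded via model.matrix),
## dropping the free-text zip code
U <- model.matrix(~ age + sex + occupation - 1, data = MovieLenseUser)

## Item side info: year of release plus the genre indicator columns,
## dropping non-numeric fields such as the title and URL
I <- as.matrix(MovieLenseMeta[, setdiff(colnames(MovieLenseMeta),
                                        c("title", "url"))])
```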

Now fitting the models:
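A sketch of the collective variant (CWRMF), passing the side-information matrices through `CMF_implicit`'s `U` and `I` arguments; the bias-augmented variants (bWRMF, bCWRMF) are not shown here - see cmfrec's documentation for the available options:

```r
library(cmfrec)

## CWRMF: same implicit-feedback objective, but also factorizing the
## user and item side-information matrices with shared latent factors
model_cwrmf <- cmfrec::CMF_implicit(X_train, U = U, I = I, k = 10)

## Factors for the evaluation users, again from their non-hold-out data only
UserFactors_cwrmf <- cmfrec::factors(model_cwrmf, X_rem)
```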

Calculating metrics

Finally, calculating recommendation quality metrics for all these models:
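A sketch for one of the models (argument names follow `?calc.reco.metrics` - consult the documentation to confirm the interface and the expected orientation of the factor matrices):

```r
library(recometrics)

## Hold-out items (X_test) are ranked against the rest of the catalog,
## excluding the items in X_rem with which the factors were computed
metrics_wrmf <- calc.reco.metrics(
    X_rem, X_test,
    A = UserFactors_wrmf,          # factors of the test users
    B = t(model_wrmf$matrices$B),  # item factors (check orientation)
    k = 5,                         # cutoff for the @K metrics
    all_metrics = TRUE
)
```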

These metrics are by default returned as a data frame, with each user representing a row and each metric a column - example:

   p_at_5 tp_at_5    r_at_5   ap_at_5 tap_at_5 ndcg_at_5 hit_at_5 rr_at_5   roc_auc    pr_auc
31      0       0 0.0000000 0.0000000        0  0.000000        0       0 0.8600427 0.0203154
33      0       0 0.0000000 0.0000000        0  0.000000        0       0 0.8000000 0.0200586
38      0       0 0.0000000 0.0000000        0  0.000000        0       0 0.7723270 0.0684697
43      1       1 0.1136364 0.1136364        1  0.826241        1       1 0.9383715 0.5255828
61      0       0 0.0000000 0.0000000        0  0.000000        0       0 0.8702474 0.0067806

Comparing models

In order to compare models, one can instead summarize these metrics across users:
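One simple way is to average each metric across the evaluation users - a sketch, where `metrics_list` is assumed to be a named list holding the per-model data frames produced in the previous step:

```r
## Mean of each metric across users, one row per model
metrics_list <- list(
    "WRMF (a.k.a. iALS)" = metrics_wrmf
    ## ... the other models' metric data frames go here ...
)
summary_table <- t(sapply(metrics_list, function(df) colMeans(df, na.rm = TRUE)))
round(summary_table, 4)
```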

                   p_at_5   tp_at_5    r_at_5   ap_at_5  tap_at_5 ndcg_at_5 hit_at_5   rr_at_5   roc_auc    pr_auc
Random              0.012 0.0120000 0.0024653 0.0010162 0.0057333 0.0110063     0.06 0.0286667 0.4886724 0.0159867
Non-personalized    0.198 0.1985000 0.0538081 0.0363141 0.1458167 0.2026604     0.53 0.3813333 0.8780423 0.1264918
Weighted-Lambda     0.040 0.0400000 0.0075464 0.0045249 0.0245667 0.0425558     0.14 0.0881667 0.7608287 0.0557133
Hybrid-Explicit     0.264 0.2671667 0.0729474 0.0521933 0.2054917 0.2786540     0.62 0.4610000 0.7408111 0.1518263
WRMF (a.k.a. iALS)  0.360 0.3746667 0.1328329 0.1017514 0.3020083 0.3755885     0.72 0.5631667 0.9417446 0.2652038
bWRMF               0.334 0.3473333 0.1220779 0.0908624 0.2780500 0.3511758     0.70 0.5361667 0.9379822 0.2558722
CWRMF               0.362 0.3771667 0.1322509 0.0978266 0.2985861 0.3758180     0.76 0.5685000 0.9442536 0.2614919
bCWRMF              0.348 0.3613333 0.1211798 0.0913341 0.2876361 0.3594521     0.72 0.5543333 0.9415006 0.2565175

From these metrics, the best-performing model overall seems to be CWRMF (collective version of WRMF or iALS model, which incorporates side information about users and items), but it does not dominate across all metrics.

It is hard to conclude, for example, whether adding item biases to the WRMF or CWRMF model is an improvement, as some metrics improve while others deteriorate. This is where specific properties of the dataset and the desired recommendation goals have to come into play - e.g. one might decide that \(AP@K\) is simply the most informative metric and make a decision based on it, or perhaps look at more specialized metrics.


To keep in mind: