Comprehensive Comparison of Reference Managers: Mendeley vs. Zotero vs. Docear

Which one is the best reference management software? That’s a question any student or researcher should think about quite carefully, because choosing the best reference manager may save lots of time and increase the quality of your work significantly. So, which reference manager is best? Zotero? Mendeley? Docear? …? The answer is: “It depends”, because different people have different needs. Actually, there is no such thing as the ‘best’ reference manager but only the reference manager that is best for you (even though some developers seem to believe that their tool is the only truly perfect one).

In this blog post, we compare Zotero, Mendeley, and Docear, and we hope that the comparison helps you decide which of these reference managers is best for you. Of course, there are many other reference managers. Hopefully, we can include them in the comparison some day, but for now we only have time to compare these three. We really tried to do a fair comparison, based on a list of criteria that we consider important for reference management software. Of course, the criteria are subjectively selected, as are all criteria by all reviewers, and you might not agree with all of them. However, even if you disagree with our evaluation, you might find at least some new and interesting aspects to consider when evaluating reference management tools. You are very welcome to share your constructive criticism in the comments, as well as links to other reviews. In addition, it should be obvious that we – the developers of Docear – are somewhat biased. However, this comparison is most certainly more objective than those that Mendeley and other reference managers did ;-).

Please note that we only compared about 50 high-level features and used a simple rating scheme in the summary table. Of course, a more comprehensive list of features and a more sophisticated rating scheme would have been nice, but this would have been too time-consuming. So, consider this review a rough guideline. If you feel that one of the mentioned features is particularly important to you, install the tools yourself, compare the features, and share your insights in the comments! Most importantly, please let us know if something we wrote is not correct. All reviewed reference tools offer lots of functions, and we might have missed one during our review.

Please note that the developers of all three tools constantly improve their tools and add new features. Therefore, the table might not be perfectly up to date. In addition, it’s difficult to rate a particular functionality with only one out of three possible ratings (yes; no; partly). Therefore, we highly recommend reading the detailed review, which explains the rationale behind the ratings.

The table above provides an overview of how Zotero, Mendeley, and Docear support you in various tasks, how open and free they are, etc. Details on the features and ratings are provided in the following sections. As already mentioned, if you notice a mistake in the evaluation (e.g. we missed a key feature), please let us know in the comments.


If you don’t want to read a lot, just jump to the summary

We believe that a reference manager should offer more features than simple reference management. It should support you in (1) finding literature, (2) organizing and annotating literature, (3) drafting your papers, theses, books, assignments, etc., (4) managing your references (of course), and (5) writing your papers, theses, etc. Additionally, many – but not all – students and researchers might be interested in (6) socializing and collaboration, (7) note, task, and general information management, and (8) file management. Finally, we think it is important that a reference manager (9) is available for the major operating systems, (10) has an information management approach you like (tables, social tags, search, …), and (11) is open, free, and sustainable (see also What makes a bad reference manager).


What makes a bad reference manager?

Update 2013-11-11: For some statistical data read On the popularity of reference managers, and their rise and fall
Update 2014-01-15: For a detailed review of Docear and other tools, read Comprehensive Comparison of Reference Managers: Mendeley vs. Zotero vs. Docear

At the time of writing, there are 31 reference management tools listed on Wikipedia and there are many attempts to identify the best ones, or even the best one (e.g. here, here, here, here, here, here, here, here, … [1]). Typically, reviewers gather a list of features and analyze which reference managers offer most of these features, and hence are the best ones. Unfortunately, each reviewer has their own preferences about which features are important, and so do you: Are many export formats more important than a mobile version? Is it more important to have metadata extraction for PDF files than an import for bibliographic data from academic search engines? Would a thorough manual be more important than free support? How important is a large number of citation styles? Do you need a Search & Replace function? Do you want to create synonyms for term lists (whatever that means)? …?

Let’s face the truth: it’s impossible to determine which of the hundred potential features you really need.

So how can you find the best reference manager? Recently we took an ironic look at the question of what the best reference managers are. Today we want to offer a more serious analysis, and propose to first identify the bad reference managers, instead of looking for the very best ones. Once the bad reference managers are found, it should be easier to identify the best one(s) from the few remaining.

What makes a bad – or evil – reference manager? We believe that there are three no-go ‘features’ that make a reference manager so bad (i.e. so harmful in the long run) that you should not use it, even if it possesses all the other features you might need.

1. A “lock-in feature” that prevents you from ever switching to a competitor tool 

A reference manager might offer exactly the features you need, but how about in a few years? Maybe your needs change, other reference managers become better than your current tool, or your boss tells you that you have to use a specific tool. In this case it is crucial that your current reference manager doesn’t lock you in and allows you to switch to your new favorite reference manager. Otherwise, you will have a serious problem. You might have had the perfect reference manager for the past one or two years, but then you are bound to the now not-so-perfect tool for the rest of your academic life. To be able to switch to another reference manager, your reference manager should offer at least one of the following three functions (ideally the first one).

  1. Your data should be stored in a standard format that other reference managers can read
  2. Your reference manager should be able to export your data in a standard format
  3. Your reference manager should allow direct access to your data, so other developers can write import filters for it.


New paper: “A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation”

Yesterday, we published a pre-print on the shortcomings of current research-paper recommender system evaluations. One of the findings was that results of offline and online experiments sometimes contradict each other. We did a more detailed analysis on this issue and wrote a new paper about it. More specifically, we conducted a comprehensive evaluation of a set of recommendation algorithms using (a) an offline evaluation and (b) an online evaluation. Results of the two evaluation methods were compared to determine whether and when results of the two methods contradicted each other. Subsequently, we discuss differences and validity of evaluation methods focusing on research paper recommender systems. The goal was to identify which of the evaluation methods were most authoritative, or, if some methods are unsuitable in general. By ‘authoritative’, we mean which evaluation method one should trust when results of different methods contradict each other.

Bibliographic data: Beel, J., Langer, S., Genzmehr, M., Gipp, B. and Nürnberger, A. 2013. A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation. Proceedings of the Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at the ACM Recommender System Conference (RecSys) (2013), 7–14.

Our current results cast doubt on the meaningfulness of offline evaluations. We showed that offline evaluations could often not predict results of online experiments (measured by click-through rate – CTR) and we identified two possible reasons.

The first reason for the lack of predictive power of offline evaluations is that they ignore human factors. These factors may strongly influence whether users are satisfied with recommendations, regardless of the recommendation’s relevance. We argue that it probably will never be possible to determine when and how influential human factors are in practice. Thus, it is impossible to determine when offline evaluations have predictive power and when they do not. Assuming that the only purpose of offline evaluations is to predict results in real-world settings, the plausible consequence is to abandon offline evaluations entirely.


Which one is the best reference management software?

Update 2013-10-14: For a more serious analysis read What makes a bad reference manager?
Update 2013-11-11: For some statistical data read On the popularity of reference managers, and their rise and fall
Update 2014-01-15: For a detailed review, read Comprehensive Comparison of Reference Managers: Mendeley vs. Zotero vs. Docear

<irony>Have you ever wondered what the best reference management software is? Well, today I found the answer on RefWorks’ web site: The best reference manager is RefWorks! Look at the picture below. It might be a little bit confusing but we did the math: RefWorks is best and beats EndNote, EndNote Web, Reference Manager, Zotero, and Mendeley in virtually all categories.

Comparison of reference management software - Refworks is the best reference manager

Source: RefWorks


Three new research papers (for TPDL’13) about user demographics and recommender evaluations, sponsored recommendations, and recommender persistence

After three demo-papers were accepted for JCDL 2013, we just received notice that another three posters were accepted for presentation at TPDL 2013 in Malta in September 2013. They cover some novel aspects of recommender systems: showing the same recommendations multiple times, considering user demographics when evaluating recommender systems, and investigating the effect of labelling recommendations. You can read the papers yourself, as we are publishing them as pre-prints:

Paper 1: The Impact of Users’ Demographics (Age and Gender) and other Characteristics on Evaluating Recommender Systems (Download PDF | Doc)

In this paper we show the importance of considering demographics and other user characteristics when evaluating (research paper) recommender systems. We analyzed 37,572 recommendations delivered to 1,028 users and found that older users clicked more often on recommendations than younger ones. For instance, users aged between 20 and 24 achieved click-through rates (CTR) of 2.73% on average, while CTR for users between 50 and 54 was 9.26%. Gender only had a marginal impact (CTR males 6.88%; females 6.67%), but other user characteristics such as whether a user was registered (CTR: 6.95%) or not (4.97%) had a strong impact. Based on these results, we argue that future research articles on recommender systems should report demographic data to make results more comparable.


Evaluations in Information Retrieval: Click Through Rate (CTR) vs. Mean Absolute Error (MAE) vs. (Root) Mean Squared Error (MSE / RMSE) vs. Precision

As you may know, Docear offers literature recommendations and, as you may also know, it’s part of my PhD to find out how to make these recommendations as good as possible. To accomplish this I need to know what a ‘good’ recommendation is. So far we have been using Click Through Rates (CTR) to evaluate different recommendation algorithms. CTR is a common performance measure in online advertising. For instance, if a recommendation is shown 1000 times and clicked 12 times, then the CTR is 1.2% (12/1000). That means if an algorithm A has a CTR of 1% and algorithm B has a CTR of 2%, B is better.
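As a minimal sketch of the CTR calculation described above (the function name `click_through_rate` is ours, chosen for illustration, and is not part of Docear), the arithmetic looks like this in Python:

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return the fraction of shown recommendations that were clicked."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if not 0 <= clicks <= impressions:
        raise ValueError("clicks must be between 0 and impressions")
    return clicks / impressions


# The example from the text: shown 1000 times, clicked 12 times.
ctr = click_through_rate(12, 1000)
print(f"CTR: {ctr:.1%}")

# Comparing two algorithms by CTR: the higher value wins.
ctr_a, ctr_b = 0.01, 0.02
print("B is better" if ctr_b > ctr_a else "A is better or equal")
```
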

Recently, we submitted a paper to a conference. The paper summarized the results of some evaluations we did with different recommendation algorithms. The paper was rejected. Among other things, a reviewer criticized CTR as too simple an evaluation metric. We should rather use metrics that are common in information retrieval such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or Precision (i.e. Mean Average Precision, MAP).

The funny thing is, CTR, MAE, MSE, RMSE and Precision are basically all the same, at least in a binary classification problem (recommendation relevant / clicked vs. recommendation irrelevant / not clicked). The table shows an example. Assume you show ten recommendations to users (Rec1…Rec10). The ‘Estimate’ for each recommendation is then ‘1’, i.e. we predict that it’s clicked by a user. The ‘Actual’ value describes whether a user actually clicked on a recommendation (‘1’) or not (‘0’). The ‘Error’ is either 0 (if the recommendation actually was clicked) or 1 (if it was not clicked). The mean absolute error (MAE) is simply the sum of all errors (6 in the example) divided by the total number of recommendations (10 in the example). Since we have only zeros and ones, it makes no difference whether they are squared or not. Consequently, the mean squared error (MSE) is identical to MAE. In addition, precision and mean average precision (MAP) are identical to CTR; precision (and CTR) is exactly 1-MAE (or 1-MSE), and RMSE also perfectly correlates with the other values because it’s simply the square root of MSE (or MAE).

Click Through Rate (CTR) vs. Mean Absolute Error (MAE) vs Mean Squared Error (MSE) vs Root Mean Squared Error (RMSE) vs Precision

In a binary evaluation (relevant / not relevant) in information retrieval, there is no difference in the significance between Click Through Rate (CTR), Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Precision.
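The relationships above can be checked with a short Python sketch. The click pattern in `actual` is a made-up example with four clicks out of ten (matching the error sum of 6 in the text, but not necessarily the exact rows of the table), and the helper name `binary_metrics` is our own:

```python
import math


def binary_metrics(actual: list[int]) -> dict[str, float]:
    """Compute CTR, precision, MAE, MSE and RMSE when every estimate is 1."""
    n = len(actual)
    estimate = [1] * n  # every shown recommendation is predicted to be clicked
    errors = [abs(e - a) for e, a in zip(estimate, actual)]
    mae = sum(errors) / n
    mse = sum(err ** 2 for err in errors) / n  # squaring 0s and 1s changes nothing
    return {
        "ctr": sum(actual) / n,
        # precision: clicked items among all items predicted as clicked
        "precision": sum(actual) / sum(estimate),
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


# Hypothetical outcome for Rec1..Rec10: 1 = clicked, 0 = not clicked.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
m = binary_metrics(actual)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these assumed values, MAE and MSE coincide, CTR and precision coincide, and both equal 1-MAE, while RMSE is the square root of MSE.
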