We kindly ask you to participate in a brief study about Docear’s recommender system. Your participation will help us to improve the recommender system, and to secure long-term funding for the development of Docear in general! If you are willing to invest 15 minutes of your time, then please continue reading.
As in previous years, we offer Bachelor students from the US, UK, or Canada the opportunity to do a paid internship at Docear in summer 2014 (if you are from Germany, please read here). The internship should last 8-12 weeks, with the earliest start date being May…
New paper: “A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation”
Yesterday, we published a pre-print on the shortcomings of current research-paper recommender system evaluations. One of the findings was that the results of offline and online experiments sometimes contradict each other. We analyzed this issue in more detail and wrote a new paper about it. More specifically, we conducted a comprehensive evaluation of a set of recommendation algorithms using (a) an offline evaluation and (b) an online evaluation. We compared the results of the two evaluation methods to determine whether and when they contradicted each other. Subsequently, we discussed the differences between, and the validity of, evaluation methods for research-paper recommender systems. The goal was to identify which of the evaluation methods is most authoritative, or whether some methods are unsuitable in general. By ‘authoritative’, we mean which evaluation method one should trust when the results of different methods contradict each other.
Bibliographic data: Beel, J., Langer, S., Genzmehr, M., Gipp, B. and Nürnberger, A. 2013. A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation. Proceedings of the Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at the ACM Recommender System Conference (RecSys) (2013), 7–14.
Our current results cast doubt on the meaningfulness of offline evaluations. We showed that offline evaluations often could not predict the results of online experiments (measured by click-through rate, CTR), and we identified two possible reasons.
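To make such a contradiction concrete, here is a minimal, hypothetical sketch (all algorithm names and numbers are invented for illustration, not taken from the paper) of how an offline metric such as precision and online CTR can rank the same set of algorithms differently:

```python
# Hypothetical example: the same three algorithms, evaluated offline
# (precision on a test set) and online (clicks vs. impressions).
# All numbers are made up for illustration.

def ctr(clicks, impressions):
    """Click-through rate: fraction of delivered recommendations that were clicked."""
    return clicks / impressions if impressions else 0.0

# Invented per-algorithm results.
offline_precision = {"algoA": 0.42, "algoB": 0.35, "algoC": 0.28}
online_stats = {"algoA": (120, 10000), "algoB": (150, 10000), "algoC": (90, 10000)}

online_ctr = {name: ctr(c, i) for name, (c, i) in online_stats.items()}

def ranking(scores):
    """Algorithms ordered best-first by score."""
    return sorted(scores, key=scores.get, reverse=True)

offline_rank = ranking(offline_precision)  # algoA wins offline
online_rank = ranking(online_ctr)          # algoB wins online

# The two evaluations disagree on the best algorithm -- the kind of
# contradiction described above.
print(offline_rank == online_rank)  # False
```

In this toy setup the offline winner (algoA) is not the algorithm users actually clicked most (algoB), which is exactly why an offline result alone cannot be trusted to predict real-world performance.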
The first reason for the lack of predictive power of offline evaluations is that they ignore human factors. These factors may strongly influence whether users are satisfied with recommendations, regardless of a recommendation’s relevance. We argue that it will probably never be possible to determine when and how influential human factors are in practice. Thus, it is impossible to determine when offline evaluations have predictive power and when they do not. Assuming that the only purpose of offline evaluations is to predict results in real-world settings, the plausible consequence is to abandon offline evaluations entirely.
As you might know, Docear has a recommender system for research papers, and we are putting a lot of effort into improving it. In fact, the development of the recommender system is part of my PhD research. When I began working on the recommender system some years ago, I became quite frustrated: there were so many different approaches to recommending research papers, but I had no clue which one would be most promising for Docear. I read many papers (well over 100), and although they presented many interesting ideas, the evaluations… well, most of them were poor. Consequently, I simply did not know which approaches to use in Docear.
Since then, we have reviewed all these papers more carefully and analyzed exactly how the authors conducted their evaluations. More precisely, we analyzed the papers with respect to the following questions:
- To what extent do authors perform user studies, online evaluations, and offline evaluations?
- How many participants do user studies have?
- Against which baselines are approaches compared?
- Do authors provide information about their algorithms’ runtime and computational complexity?
- Which metrics are used for algorithm evaluation, and do different metrics provide similar rankings of the algorithms?
- Which datasets are used for offline evaluations?
- Are results comparable among different evaluations based on different datasets?
- How consistent are online and offline evaluations? Do they provide the same, or at least similar, rankings of the evaluated approaches?
- Do authors provide sufficient information to re-implement their algorithms or replicate their experiments?
Three of our submissions to the ACM/IEEE Joint Conference on Digital Libraries (JCDL) were accepted. They relate to recommender systems, reference management, and PDF metadata extraction:
In this demo paper we introduce Docear4Word. Docear4Word enables researchers to insert and format their references and bibliographies in Microsoft Word, based on BibTeX and the Citation Style Language (CSL). Docear4Word features over 1,700 citation styles (Harvard, IEEE, ACM, etc.), is published as an open-source tool on http://docear.org, and runs with Microsoft Word 2002 and later on Windows XP and later. Docear4Word is similar to the MS Word add-ons that reference managers such as EndNote, Zotero, or Citavi offer, with the difference that it builds on the de facto standard BibTeX and hence works with almost any reference manager.
We are experiencing a very high server load for several reasons (many people are using our services, we are running some extensive research analyses, etc.). We have therefore decided to deactivate metadata retrieval and recommendations for a while, hopefully only for a few days. We will let you know as…