As you might know, Docear has a recommender system for research papers, and we are putting a lot of effort into improving it. In fact, developing the recommender system is part of my PhD research. When I began working on it some years ago, I became quite frustrated: there were so many different approaches for recommending research papers, but I had no clue which one would be most promising for Docear. I read well over a hundred papers, and while many of them presented interesting ideas, most of their evaluations were poor. Consequently, I simply did not know which approaches to use in Docear.
Since then, we have reviewed all these papers more carefully and analyzed exactly how the authors conducted their evaluations. More precisely, we examined the papers with the following questions in mind:
- To what extent do authors perform user studies, online evaluations, and offline evaluations?
- How many participants do user studies have?
- Against which baselines are approaches compared?
- Do authors provide information about their algorithms’ runtime and computational complexity?
- Which metrics are used for algorithm evaluation, and do different metrics provide similar rankings of the algorithms?
- Which datasets are used for offline evaluations?
- Are results comparable among different evaluations based on different datasets?
- How consistent are online and offline evaluations? Do they provide the same, or at least similar, rankings of the evaluated approaches?
- Do authors provide sufficient information to re-implement their algorithms or replicate their experiments?
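To make the questions about metric and ranking consistency more concrete: one simple way to check whether two evaluation metrics agree on which algorithm is best is to compute a rank correlation between the rankings they produce. The sketch below is purely illustrative (it is not Docear's evaluation code, and the algorithm names and scores are made up); it compares two hypothetical metric scorings with Kendall's tau.

```python
# Illustrative sketch, NOT Docear's actual evaluation code: given the scores
# that two different metrics assign to the same set of algorithms, check how
# consistently the metrics rank those algorithms using Kendall's tau.

from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two score dicts over the same algorithms.
    +1.0 = identical rankings, -1.0 = completely reversed rankings."""
    algos = list(scores_a)
    concordant = discordant = 0
    for x, y in combinations(algos, 2):
        a = scores_a[x] - scores_a[y]
        b = scores_b[x] - scores_b[y]
        if a * b > 0:       # pair ordered the same way by both metrics
            concordant += 1
        elif a * b < 0:     # pair ordered differently
            discordant += 1
    n_pairs = len(algos) * (len(algos) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical results: three algorithms scored by two offline metrics.
precision_at_10 = {"cbf": 0.32, "cf": 0.28, "hybrid": 0.40}
ndcg_at_10 = {"cbf": 0.45, "cf": 0.51, "hybrid": 0.60}

tau = kendall_tau(precision_at_10, ndcg_at_10)
print(f"Kendall's tau between the two metric rankings: {tau:+.2f}")
```

Here the two metrics agree that the hybrid approach is best but disagree on second place, so the correlation is positive yet well below 1. The same comparison can be applied to online vs. offline results, or to the same approach evaluated on different datasets.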