Research

We research and apply machine learning, text mining, natural language processing, blockchain, and other technologies in areas including recommender systems, search engines, news analysis, information extraction, plagiarism detection, and machine translation. Domains we are particularly interested in include digital libraries, open science, digital humanities, eHealth, law, FinTech, tourism, and mobility.

On this page, you will find our publications sorted by research area. Many publications are listed in multiple sections.

Recommender Systems

Literature Surveys

Beel, Joeran, Bela Gipp, Stefan Langer, and Corinna Breitinger. “Research Paper Recommender Systems: A Literature Survey.” International Journal on Digital Libraries (2015):1–34. doi:10.1007/s00799-015-0156-0. 2016

Beel, Joeran, Stefan Langer, Marcel Genzmehr, Bela Gipp, Corinna Breitinger, and Andreas Nürnberger. “Research Paper Recommender System Evaluation: A Quantitative Literature Survey.” In Proceedings of the Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at the ACM Recommender System Conference (RecSys). ACM International Conference Proceedings Series (ICPS), 2013.

Recommendation Approaches

Beel, Joeran, Siddharth Dinesh, Philipp Mayr, Zeljko Carevic, and Jain Raghvendra. “Stereotype and Most-Popular Recommendations in the Digital Library Sowiport.” In Proceedings of the 15th International Symposium of Information Science. 2017.

Siebert, Sophie, Siddarth Dinesh, and Stefan Feyer. “Extending a Research Paper Recommendation System with Bibliometric Measures.” In 5th International Workshop on Bibliometric-enhanced Information Retrieval (BIR) at the 39th European Conference on Information Retrieval (ECIR), 2017.

Beel, Joeran, Corinna Breitinger, and Stefan Langer. “Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?” In Proceedings of the 12th iConference. 2017.

Beel, Joeran. “Virtual Citation Proximity (VCP): Calculating Co-Citation-Proximity-Based Document Relatedness for Uncited Documents with Machine Learning [Proposal],” ResearchGate (2017). DOI: 10.13140/RG.2.2.18759.39842

Bela Gipp and Joeran Beel. “Citation Proximity Analysis (CPA) – A new approach for identifying related work based on Co-Citation Analysis.” In Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09), volume 2, pages 571–575, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics.

Bela Gipp, Adriana Taylor, and Joeran Beel. “Link Proximity Analysis – Clustering Websites by Examining Link Proximity.” In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, editors, Proceedings of the 14th European Conference on Digital Libraries (ECDL’10): Research and Advanced Technology for Digital Libraries, volume 6273 of Lecture Notes of Computer Science (LNCS), pages 449–452, September 2010. Springer.

User Modeling

Beel, Joeran, Stefan Langer, and Bela Gipp. “TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections.” In Proceedings of the 12th iConference. 2017.

Beel, Joeran, Stefan Langer, Georgia M. Kapitsaki, Corinna Breitinger, and Bela Gipp. “Exploring the Potential of User Modeling based on Mind Maps.” In Proceedings of the 23rd Conference on User Modelling, Adaptation and Personalization (UMAP). Lecture Notes of Computer Science. Springer, 2015.

Beel, Joeran. “Towards Effective Research-Paper Recommender Systems and User Modeling based on Mind Maps.” PhD thesis. Otto-von-Guericke University Magdeburg, March 2015.

Beel, Joeran, Stefan Langer, Marcel Genzmehr, Bela Gipp. “Utilizing Mind-Maps for Information Retrieval and User Modelling.” In Proceedings of the 22nd Conference on User Modelling, Adaption, and Personalization (UMAP). Springer, 2014.

Evaluation

Joeran Beel. “It’s Time to Consider ‘Time’ when Evaluating Recommender-System Algorithms [Proposal].” arXiv. 2018.

Beel, Joeran, Corinna Breitinger, Stefan Langer, Andreas Lommatzsch, and Bela Gipp. “Towards Reproducibility in Recommender-Systems Research.” User Modeling and User-Adapted Interaction (UMUAI) 26, no. 1 (2016): 69–101. DOI: 10.1007/s11257-016-9174-x

Beel, Joeran, and Stefan Langer. “A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems.” In Proceedings of the 19th International Conference on Theory and Practice of Digital Libraries (TPDL), edited by Sarantos Kapidakis, Cezary Mazurek, and Marcin Werla, 9316:153–168. Lecture Notes in Computer Science, 2015.

Langer, Stefan, Joeran Beel. “The Comparability of Recommender System Evaluations and Characteristics of Docear’s Users.” In Proceedings of the Workshop on Recommender Systems Evaluation: Dimensions and Design (REDD), at the ACM Recommender Systems Conference (RecSys). 2014.

Beel, Joeran, Stefan Langer, Andreas Nürnberger, and Marcel Genzmehr. “The Impact of Demographics (Age and Gender) and Other User Characteristics on Evaluating Recommender Systems.” In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013), edited by Trond Aalberg, Milena Dobreva, Christos Papatheodorou, Giannis Tsakonas, and Charles Farrugia, 400–404. Valletta, Malta: Springer, 2013.

Beel, Joeran, Stefan Langer, Marcel Genzmehr, Bela Gipp, and Andreas Nürnberger. “A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation.” In Proceedings of the Workshop on Reproducibility and Replication in Recommender Systems Evaluation (RepSys) at the ACM Recommender System Conference (RecSys). ACM International Conference Proceedings Series (ICPS), 2013.

Novel Aspects

Persistence

Beel, Joeran, Stefan Langer, Marcel Genzmehr, and Andreas Nürnberger. “Persistence in Recommender Systems: Giving the Same Recommendations to the Same Users Multiple Times.” In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013), edited by Trond Aalberg, Milena Dobreva, Christos Papatheodorou, Giannis Tsakonas, and Charles Farrugia, 8092:390–394. Lecture Notes of Computer Science (LNCS). Valletta, Malta: Springer, 2013.

Choice Overload

Felix Beierle, Akiko Aizawa, and Joeran Beel. “Choice Overload in Research-Paper Recommender Systems.” Submitted to the International Journal of Digital Libraries. 2018.

Beierle, Felix, Akiko Aizawa, and Joeran Beel. “Exploring Choice Overload in Related-Article Recommendations in Digital Libraries.” In 5th International Workshop on Bibliometric-enhanced Information Retrieval (BIR) at the 39th European Conference on Information Retrieval (ECIR), 2017.

Position Bias

Andrew Collins, Dominika Tkaczyk, Akiko Aizawa, and Joeran Beel. “A Study of Position Bias in Digital Library Recommender Systems.” iConference 2018. To appear.

Labelling

Beel, Joeran, Stefan Langer, and Marcel Genzmehr. “Sponsored Vs. Organic (Research Paper) Recommendations and the Impact of Labeling.” In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013), edited by Trond Aalberg, Milena Dobreva, Christos Papatheodorou, Giannis Tsakonas, and Charles Farrugia, 395–399. Valletta, Malta, 2013.

Demographics

Beel, Joeran, Stefan Langer, Andreas Nürnberger, and Marcel Genzmehr. “The Impact of Demographics (Age and Gender) and Other User Characteristics on Evaluating Recommender Systems.” In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries (TPDL 2013), edited by Trond Aalberg, Milena Dobreva, Christos Papatheodorou, Giannis Tsakonas, and Charles Farrugia, 400–404. Valletta, Malta: Springer, 2013.

Langer, Stefan, Joeran Beel. “The Comparability of Recommender System Evaluations and Characteristics of Docear’s Users.” In Proceedings of the Workshop on Recommender Systems Evaluation: Dimensions and Design (REDD), at the ACM Recommender Systems Conference (RecSys). 2014.

Real-World Systems

Beel, Joeran, Bela Gipp, and Akiko Aizawa. “Mr. DLib: Recommendations-as-a-Service (RaaS) for Academia.” In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2017.

Beel, Joeran. “Real-World Recommender Systems for Academia: The Gain and Pain in Developing, Operating, and Researching them.” In 5th International Workshop on Bibliometric-enhanced Information Retrieval (BIR) at the 39th European Conference on Information Retrieval (ECIR), 2017.

Langer, Stefan, and Joeran Beel. “Apache Lucene as Content-Based-Filtering Recommender System: 3 Lessons Learned.” In 5th International Workshop on Bibliometric-enhanced Information Retrieval (BIR) at the 39th European Conference on Information Retrieval (ECIR), 2017.

Feyer, Stefan, Sophie Siebert, Bela Gipp, Akiko Aizawa, and Joeran Beel. “Integration of the Scientific Recommender System Mr. DLib into the Reference Manager JabRef”. In Proceedings of the 39th European Conference on Information Retrieval (ECIR). 2017.

Langer, Stefan, Joeran Beel. “The Comparability of Recommender System Evaluations and Characteristics of Docear’s Users.” In Proceedings of the Workshop on Recommender Systems Evaluation: Dimensions and Design (REDD), at the ACM Recommender Systems Conference (RecSys). 2014.

Beel, Joeran, Stefan Langer, Marcel Genzmehr, and Andreas Nürnberger. “Introducing Docear’s Research Paper Recommender System.” In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’13), 459–460. ACM, 2013.

Datasets

RARD

RARD — the Related-Article Recommendation Dataset — is based on the digital library Sowiport and the recommendation-as-a-service provider Mr. DLib. The dataset contains information about 57.4 million recommendations that were displayed to the users of Sowiport. Information includes details on which recommendation approaches were used (e.g. content-based filtering, stereotype, most popular), what types of features were used in content-based filtering (simple terms vs. keyphrases), where the features were extracted from (title or abstract), and the time when recommendations were delivered and clicked. In addition, the dataset contains an implicit item-item rating matrix that was created based on the recommendation click logs. RARD enables researchers to train machine learning algorithms for research-paper recommendations, perform offline evaluations, and do research on data from Mr. DLib’s recommender system, without implementing a recommender system themselves. In the field of scientific recommender systems, our dataset is unique. To the best of our knowledge, no other available dataset contains more (implicit) ratings or as many variations of recommendation algorithms. The dataset is published under the “Creative Commons Attribution 3.0 Unported (CC-BY)” license.
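To illustrate what an implicit item-item rating matrix derived from click logs can look like, here is a minimal sketch. The log layout (source document, recommended document, clicked flag) and the use of raw click counts as the implicit rating are assumptions made for illustration; they are not taken from RARD's actual schema or from the way its matrix was built.

```python
from collections import defaultdict


def build_item_item_matrix(click_log):
    """Count clicks per (source document, recommended document) pair.

    click_log: iterable of (source_doc, recommended_doc, was_clicked) tuples.
    This tuple layout is hypothetical and chosen only for the example.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for source_doc, recommended_doc, was_clicked in click_log:
        if was_clicked:
            # Each click is treated as one unit of implicit positive feedback.
            matrix[source_doc][recommended_doc] += 1
    # Convert to plain dicts so that lookups do not create new entries.
    return {source: dict(row) for source, row in matrix.items()}


# Toy log with made-up document ids.
click_log = [
    ("doc_a", "doc_b", True),
    ("doc_a", "doc_b", True),
    ("doc_a", "doc_c", False),
    ("doc_b", "doc_c", True),
]
matrix = build_item_item_matrix(click_log)
print(matrix)
```

Pairs that were displayed but never clicked do not appear in the result; a real analysis would likely also keep the number of impressions so that clicks can be normalized.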

Beel, Joeran, Zeljko Carevic, Johann Schaible, and Gabor Neusch. “RARD: The Related-Article Recommendation Dataset.” D-Lib Magazine 23, no. 7/8 (July 2017): 1–14.

Docear RecSys

We released a dataset based on the recommender system in our reference management software Docear. The dataset contains the following sub-datasets.

Research Papers — The research papers dataset contains information about the research papers that Docear’s PDF Spider crawled and their citations. This includes information about 9.4 million documents, including 7.95 million citations and 1.8 million URLs to freely available academic PDFs on the Web. The dataset also provides citation positions, i.e. where in a document a citation occurs. This leads to 19.3 million entries in the dataset.

Mind-Maps / User libraries — The mind maps dataset contains information on 390,613 revisions of 52,202 mind-maps created by 12,038 users. The mind-maps themselves are not included in the dataset for privacy reasons. Information includes the number of nodes, the documents that are linked, the id of the user who created the mind-map, and when mind-maps were created.

Users — The user dataset contains information about 8,059 of 21,439 registered users, namely about those who activated recommendations and agreed to have their data analyzed and published. Among others, the dataset includes information about the users’ date of registration, gender and age (if provided during registration), usage intensity of Docear, when Docear was last started, when recommendations were last received, the number of created mind-maps, number of papers, how recommendations were labeled, the number of received recommendations, and click-through rates (CTR) on recommendations.

Recommendations — The recommendation dataset contains information on 308,146 recommendations that were delivered to 3,470 users between March 2013 and March 2014. This includes the date of creation and delivery, the time required to generate recommendations and corresponding user models, and information on the algorithm that generated the recommendations. Information on the algorithms is manifold. We stored whether stop words were removed, which weighting scheme was applied, whether terms and/or citations were used for the user modelling process, and 28 other variables that are described in more detail in the dataset’s readme file.
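Because each recommendation is stored together with the algorithm settings that produced it, click-through rates can be compared across settings. The sketch below shows one way to do this, assuming each recommendation is available as a dictionary; the field names (`stop_words_removed`, `weighting`, `clicked`) are hypothetical and do not reflect the dataset's real column names, which are documented in its readme file.

```python
from collections import defaultdict


def ctr_by_setting(recommendations, setting):
    """Return click-through rate (clicks / delivered) per value of one setting."""
    delivered = defaultdict(int)
    clicks = defaultdict(int)
    for rec in recommendations:
        key = rec[setting]
        delivered[key] += 1
        clicks[key] += int(rec["clicked"])
    return {key: clicks[key] / delivered[key] for key in delivered}


# Toy records with made-up values.
recommendations = [
    {"stop_words_removed": True, "weighting": "tf-idf", "clicked": True},
    {"stop_words_removed": True, "weighting": "tf", "clicked": False},
    {"stop_words_removed": False, "weighting": "tf-idf", "clicked": False},
    {"stop_words_removed": False, "weighting": "tf", "clicked": False},
]

ctr_stop_words = ctr_by_setting(recommendations, "stop_words_removed")
ctr_weighting = ctr_by_setting(recommendations, "weighting")
print(ctr_stop_words)
print(ctr_weighting)
```

With real data, differences between settings should be checked for statistical significance before drawing conclusions.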

Beel, Joeran, Stefan Langer, Bela Gipp, Andreas Nürnberger. “The Architecture and Datasets of Docear’s Research Paper Recommender System.” D-Lib Magazine – The Magazine of Digital Library Research, vol. 20, 11/12, 2014.

Machine Translation

Machine Translation with Memory Augmented Neural Networks — work in progress

(Academic) Search

Joeran Beel, Bela Gipp, and Erik Wilde. “Academic Search Engine Optimization (ASEO): Optimizing Scholarly Literature for Google Scholar and Co.” Journal of Scholarly Publishing, 41 (2): 176–190, January 2010. University of Toronto Press.

Bela Gipp, Adriana Taylor, and Joeran Beel. “Link Proximity Analysis – Clustering Websites by Examining Link Proximity.” In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, editors, Proceedings of the 14th European Conference on Digital Libraries (ECDL’10): Research and Advanced Technology for Digital Libraries, volume 6273 of Lecture Notes of Computer Science (LNCS), pages 449–452, September 2010. Springer.

Joeran Beel and Bela Gipp. “On the Robustness of Google Scholar Against Spam.” In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT’10), pages 297–298, Toronto (CA), June 2010. ACM.

Joeran Beel and Bela Gipp. “Academic search engine spam and Google Scholar’s resilience against it.” Journal of Electronic Publishing, 13 (3), December 2010.

Joeran Beel and Bela Gipp. “Google Scholar’s Ranking Algorithm: The Impact of Citation Counts (An Empirical Study).” In Andre Flory and Martine Collard, editors, Proceedings of the 3rd IEEE International Conference on Research Challenges in Information Science (RCIS’09), pages 439–446, Fez (Morocco), April 2009. IEEE.

Joeran Beel and Bela Gipp. “Google Scholar’s Ranking Algorithm: An Introductory Overview.” In Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09), volume 1, pages 230–241, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics. ISSN 2175-1935.

Joeran Beel and Bela Gipp. “Google Scholar’s Ranking Algorithm: The Impact of Articles’ Age (An Empirical Study).” In Shahram Latifi, editor, Proceedings of the 6th International Conference on Information Technology: New Generations (ITNG’09), pages 160–164, Las Vegas (USA), April 2009. IEEE.

*-as-a-Service

Beel, Joeran, Bela Gipp, and Akiko Aizawa. “Mr. DLib: Recommendations-as-a-Service (RaaS) for Academia.” In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2017.

Joeran Beel, Bela Gipp, Stefan Langer, Marcel Genzmehr, Erik Wilde, Andreas Nürnberger, and Jim Pitman. “Introducing Mr. DLib, a Machine-readable Digital Library.” In Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries (JCDL’11), 2011. ACM.

Reference Management

Beel, Joeran, Stefan Langer, and Marcel Genzmehr. “Docear4Word: Reference Management for Microsoft Word based on BibTeX and the Citation Style Language (CSL).” In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’13), 445–446. ACM, 2013.

Joeran Beel, Bela Gipp, Stefan Langer, and Marcel Genzmehr. “Docear: An Academic Literature Suite for Searching, Organizing and Creating Academic Literature.” In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL ’11, pages 465–466, New York, NY, USA, 2011. ACM.

Joeran Beel, Bela Gipp, and Christoph Mueller. “’SciPlore MindMapping’ – A Tool for Creating Mind Maps Combined with PDF and Reference Management.” D-Lib Magazine, 15 (11), November 2009. Brief Online Article.

Event Detection

Weiler, Andreas, Joeran Beel, Bela Gipp, and Michael Grossniklaus. “Stability Evaluation of Event Detection Techniques for Twitter.” In Proceedings of the 15th International Symposium on Intelligent Data Analysis (IDA’16), Lecture Notes in Computer Science (LNCS), 368–380, 2016. doi:10.1007/978-3-319-46349-0.

Scholarly Communication

Gipp, Bela, Corinna Breitinger, Norman Meuschke, and Joeran Beel. “CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain.” In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2017.

Joeran Beel and Bela Gipp. “The Potential of Collaborative Document Evaluation for Science.” In George Buchanan, Masood Masoodian, and Sally Jo Cunningham, editors, 11th International Conference on Asia-Pacific Digital Libraries (ICADL’08) Proceedings, volume 5362 of Lecture Notes in Computer Science (LNCS), pages 375–378, Heidelberg (Germany), December 2008. Springer.

Joeran Beel and Bela Gipp. “Collaborative Document Evaluation: An Alternative Approach to Classic Peer Review.” In Proceedings of the 5th International Conference on Digital Libraries (ICDL’08), volume 31, pages 410–413, Vienna (Austria), August 2008. ISSN 1307-6884.

Document Engineering

Joeran Beel and Stefan Langer. “An Exploratory Analysis of Mind Maps.” In Proceedings of the 11th ACM Symposium on Document Engineering (DocEng’11), Mountain View, California, USA, pages 81–84, 2011. ACM.

Bibliometrics

Beel, Joeran. “Virtual Citation Proximity (VCP): Calculating Co-Citation-Proximity-Based Document Relatedness for Uncited Documents with Machine Learning [Proposal],” ResearchGate (2017). DOI: 10.13140/RG.2.2.18759.39842

Siebert, Sophie, Siddarth Dinesh, and Stefan Feyer. “Extending a Research Paper Recommendation System with Bibliometric Measures.” In 5th International Workshop on Bibliometric-enhanced Information Retrieval (BIR) at the 39th European Conference on Information Retrieval (ECIR), 2017.

Beel, Joeran, Corinna Breitinger, and Stefan Langer. “Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?” In Proceedings of the 12th iConference. 2017.

Bela Gipp and Joeran Beel. “Citation Based Plagiarism Detection – A New Approach to Identify Plagiarized Work Language Independently.” In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT’10), pages 273–274, New York, NY, USA, June 2010. ACM.

Bela Gipp, Norman Meuschke, and Joeran Beel. “Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag.” In Proceedings of 11th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’11), pages 255–258, Ottawa, Canada, June 2011. ACM.

Bela Gipp, Adriana Taylor, and Joeran Beel. “Link Proximity Analysis – Clustering Websites by Examining Link Proximity.” In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, editors, Proceedings of the 14th European Conference on Digital Libraries (ECDL’10): Research and Advanced Technology for Digital Libraries, volume 6273 of Lecture Notes of Computer Science (LNCS), pages 449–452, September 2010. Springer.

Bela Gipp and Joeran Beel. “Citation Proximity Analysis (CPA) – A new approach for identifying related work based on Co-Citation Analysis.” In Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI’09), volume 2, pages 571–575, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics.

Information Extraction

Dominika Tkaczyk, Andrew Collins, Paraic Sheridan and Joeran Beel. “Evaluation and Comparison of Open Source Bibliographic Reference Parsers: A Business Use Case.” ACM Joint Conference on Digital Libraries (JCDL). 2018.

Dominika Tkaczyk, Andrew Collins and Joeran Beel. “A Method for Discovering and Extracting Author Contributions Information from Scientific Biomedical Publications.” Preprint. 2018.

Beel, Joeran, Stefan Langer, Marcel Genzmehr, and Christoph Müller. “Docear’s PDF Inspector: Title Extraction from PDF files.” In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’13), 443–444. ACM, 2013.

Lipinski, Mario, Kevin Yao, Joeran Beel, and Bela Gipp. “Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents.” In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries (JCDL’13), 385–386, 2013.

Joeran Beel, Bela Gipp, Ammar Shaker, and Nick Friedrich. “SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size).” In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, editors, Research and Advanced Technology for Digital Libraries, Proceedings of the 14th European Conference on Digital Libraries (ECDL’10), volume 6273 of Lecture Notes of Computer Science (LNCS), pages 413–416, Glasgow (UK), September 2010. Springer.

Plagiarism Detection

Bela Gipp, Norman Meuschke, and Joeran Beel. “Comparative Evaluation of Text- and Citation-based Plagiarism Detection Approaches using GuttenPlag.” In Proceedings of 11th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’11), pages 255–258, Ottawa, Canada, June 2011. ACM.

Bela Gipp and Joeran Beel. “Citation Based Plagiarism Detection – A New Approach to Identify Plagiarized Work Language Independently.” In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT’10), pages 273–274, New York, NY, USA, June 2010. ACM.

Blockchain

Gipp, Bela, Corinna Breitinger, Norman Meuschke, and Joeran Beel. “CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain.” In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2017.

Gipp, Bela, Norman Meuschke, Joeran Beel, and Corinna Breitinger. “Using the Blockchain of Cryptocurrencies for Timestamping Digital Cultural Heritage”. Bulletin of IEEE Technical Committee on Digital Libraries (TCDL). 2017.

Media Bias

Hamborg, Felix, Norman Meuschke, Corinna Breitinger, and Bela Gipp. “news-please: A Generic News Crawler and Extractor.” In Proceedings of the 15th International Symposium of Information Science. Berlin, 2017.

Hamborg, Felix, Norman Meuschke, and Bela Gipp. “Matrix-based news aggregation: exploring different news perspectives.” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 1–10. IEEE, 2017.

MathIR

Meuschke, Norman, Moritz Schubotz, Felix Hamborg, Tomas Skopal, and Bela Gipp. “Analyzing Mathematical Content to Detect Academic Plagiarism.” In Proceedings of the International Conference on Information and Knowledge Management (CIKM). Singapore, 2017.

Schubotz, Moritz, Leonard Krämer, Norman Meuschke, Felix Hamborg, and Bela Gipp. “Evaluating and Improving the Extraction of Mathematical Identifier Definitions.” In Experimental IR Meets Multilinguality, Multimodality, and Interaction, edited by Gareth J.F. Jones, Séamus Lawless, Julio Gonzalo, Liadh Kelly, Lorraine Goeuriot, Thomas Mandl, Linda Cappellato, and Nicola Ferro, 82–94. Cham: Springer International Publishing, 2017.

Datasets

CITREC Citations

The CITREC dataset contains the data of two formerly separate collections for citation analysis and it provides the tools necessary for performing evaluations of similarity measures. The first collection is the PubMed Central Open Access Subset (PMC OAS), the second is the collection used for the Genomics Tracks at the Text REtrieval Conferences (TREC)’06 and ’07 (overview paper for the TREC Gen collection). CITREC extends the PMC OAS and TREC Genomics collections by providing:

  1. citation and reference information that includes the position of in-text citations for documents in both collections;
  2. code and pre-computed scores for 35 citation-based and text-based similarity measures;
  3. two gold standards based on Medical Subject Headings (MeSH) descriptors and the relevance feedback gathered for the TREC Genomics collection;
  4. a web-based system (Literature Recommendation Evaluator – LRE) that allows evaluating similarity measures on their ability to identify documents that are relevant to user-defined information needs;
  5. tools to statistically analyze and compare the scores that individual similarity measures yield.
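As an illustration of what a citation-based similarity measure computes, the sketch below implements two classic ones, bibliographic coupling and co-citation, on a toy reference graph. These two measures are chosen here for illustration only; the code is not taken from CITREC, and the data layout (a dictionary mapping each document id to the set of ids it cites) is an assumption.

```python
def bibliographic_coupling(references, doc_a, doc_b):
    """Number of references that doc_a and doc_b have in common."""
    return len(references[doc_a] & references[doc_b])


def co_citation(references, doc_a, doc_b):
    """Number of documents that cite both doc_a and doc_b."""
    return sum(
        1 for cited in references.values() if doc_a in cited and doc_b in cited
    )


# Toy citation graph with made-up ids: citing document -> set of cited documents.
references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r1", "r4"},
}

coupling_p1_p2 = bibliographic_coupling(references, "p1", "p2")
cocited_r2_r3 = co_citation(references, "r2", "r3")
cocited_r1_r4 = co_citation(references, "r1", "r4")
print(coupling_p1_p2, cocited_r2_r3, cocited_r1_r4)
```

Measures that also use the position of in-text citations, such as Citation Proximity Analysis, additionally weight each co-citation by how close the two citations appear in the citing text.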

B. Gipp, N. Meuschke, and M. Lipinski, “CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central,” in Proceedings of the iConference 2015, Newport Beach, California, 2015.

RARD — Related Articles

RARD — the Related-Article Recommendation Dataset — is based on the digital library Sowiport and the recommendation-as-a-service provider Mr. DLib. The dataset contains information about 57.4 million recommendations that were displayed to the users of Sowiport. Information includes details on which recommendation approaches were used (e.g. content-based filtering, stereotype, most popular), what types of features were used in content-based filtering (simple terms vs. keyphrases), where the features were extracted from (title or abstract), and the time when recommendations were delivered and clicked. In addition, the dataset contains an implicit item-item rating matrix that was created based on the recommendation click logs. RARD enables researchers to train machine learning algorithms for research-paper recommendations, perform offline evaluations, and do research on data from Mr. DLib’s recommender system, without implementing a recommender system themselves. In the field of scientific recommender systems, our dataset is unique. To the best of our knowledge, no other available dataset contains more (implicit) ratings or as many variations of recommendation algorithms. The dataset is published under the “Creative Commons Attribution 3.0 Unported (CC-BY)” license.

Beel, Joeran, Zeljko Carevic, Johann Schaible, and Gabor Neusch. “RARD: The Related-Article Recommendation Dataset.” D-Lib Magazine 23, no. 7/8 (July 2017): 1–14.

Docear RecSys

We released a dataset based on the recommender system in our reference management software Docear. The dataset contains the following sub-datasets.

Research Papers — The research papers dataset contains information about the research papers that Docear’s PDF Spider crawled and their citations. This includes information about 9.4 million documents, including 7.95 million citations and 1.8 million URLs to freely available academic PDFs on the Web. The dataset also provides citation positions, i.e. where in a document a citation occurs. This leads to 19.3 million entries in the dataset.

Mind-Maps / User libraries — The mind maps dataset contains information on 390,613 revisions of 52,202 mind-maps created by 12,038 users. The mind-maps themselves are not included in the dataset for privacy reasons. Information includes the number of nodes, the documents that are linked, the id of the user who created the mind-map, and when mind-maps were created.

Users — The user dataset contains information about 8,059 of 21,439 registered users, namely about those who activated recommendations and agreed to have their data analyzed and published. Among others, the dataset includes information about the users’ date of registration, gender and age (if provided during registration), usage intensity of Docear, when Docear was last started, when recommendations were last received, the number of created mind-maps, number of papers, how recommendations were labeled, the number of received recommendations, and click-through rates (CTR) on recommendations.

Recommendations — The recommendation dataset contains information on 308,146 recommendations that were delivered to 3,470 users between March 2013 and March 2014. This includes the date of creation and delivery, the time required to generate recommendations and corresponding user models, and information on the algorithm that generated the recommendations. Information on the algorithms is manifold. We stored whether stop words were removed, which weighting scheme was applied, whether terms and/or citations were used for the user modelling process, and 28 other variables that are described in more detail in the dataset’s readme file.

Beel, Joeran, Stefan Langer, Bela Gipp, Andreas Nürnberger. “The Architecture and Datasets of Docear’s Research Paper Recommender System.” D-Lib Magazine – The Magazine of Digital Library Research, vol. 20, 11/12, 2014.
