Preview of Docear 1.1 with PDF Metadata Retrieval from Google Scholar

Published by Joeran Beel on

Thanks to all the generous donors, our student Christoph could work on an improved PDF metadata retrieval for Docear, and today it’s time to present the first preview. The new Docear 1.1 (preview) is able to extract the title of a PDF and fetch appropriate metadata from Google Scholar. Whenever you select a PDF in your mind-map and chose “Create or Update reference”, the following new dialog appears.

The dialog shows the file name of your PDF file, and the extracted title. In the background, the extracted title is sent to Google Scholar and metadata for the first three search results are shown in the dialog.  If the title was extracted incorrectly, you can manually correct it. You may also chose to use the PDF’s file name for the search. For instance, when you named your PDF already according to the title, select the radio button with the file name, and the file name is sent as search query to Google Scholar (you may also manually correct the file name before it’s sent to Google Scholar). Of course, all other options you already know are still available, such as creating a blank entry, or importing the XMP data of PDFs. Btw. Docear remembers your choice, i.e. when you select to create a blank entry, the option will be pre-selected when open that dialog the next time. It might happen, that your IP will be blocked by Google Scholar when you use the service too frequently. If this happens, a captcha should appear, and after solving it, you should be able to proceed. We did not yet test this thoroughly. Please let us know your experiences.

The precision of our metadata tool depends on two factors, A) the precision of the title extraction and B) the coverage of Google Scholar. According to a recent experiment, title extraction of our tool is around 70%. However, the final result very much depends on the format of your research articles. In my research field (i.e. recommender systems), I would say that our tool extracts the title correctly for about 90% of the articles in my personal library. In addition, almost all articles that are relevant for my research are indexed by Google Scholar (i would estimate, more than 90%). This means, for around 80% of my PDFs the correct metadata is retrieved fully automatically. Given that I provide the title manually, for even more than 90% the metadata may be retrieved. Please let us know your experience (and your research field).

Finally, as promised, the functionality is directly integrated into Docear’s desktop software. This means, you do not depend any more on Docear’s Web Services, which, admittedly, was not always the most reliable. In addition, the entire source code is open-source, and other projects are welcome to use it for their own purposes. If Docear’s licence (GPL) is not suitable for your project, let us know. We have no problem to additionally release the code under another open source licence such as Apache or whatsoever.

There are a few minor issues but overall, the PDF metadata retrieval should work quite well. We sincerely invite you to test it, and let us know any bugs you find, and any ideas how the process could be improved. However, please be aware that Christoph will be on vacation for two weeks. When he returns, he will fix the existing bugs and maybe implement a very simple function to rename PDFs based on their metadata (we cannot promise this).

Apart from the metadata retrieval we have improved how Docear handles BibTeX files from Mendeley and Zotero. Both use other BibTeX dialects than JabRef (our integrated reference manager). Zotero for instance does not escape colons and semicolons in file descriptions or file names. But these characters carry a special meaning in BibTeX file fields which makes these fields very hard to read automatically. Other differences are how absolute paths are stored, if backslash characters or hash characters are escaped, etc. We hope that Docear now deals better with these BibTeX files. We have also fixed an issue which could freeze Docear after converting a BibTeX file. If you run into an error like that, please tell me using our bug tracking forum and I will tell you how to resolve it.

The Changelog

New features or feature improvements:

  • Better PDF metadata retrieval
  • make saving of Docear mind map files safer
  • “Show in reference manager” works with multiple PDF files

Among others the following bugs have been fixed:

  • fixed Mendeley and Zotero BibTeX file updater Bug
  • broken BibTeX file freezes Docear on startup Bug Blocker
  • removing line breaks during annotation import did not work
  • open filetype specific application in duplicated reference warning dialog
  • MultiLineActionLabel link mark (underline) is not wrapped correctly

Download Docear 1.1 Preview, but please note that is a preview. It may contain unknown bugs that might, in the very worst case, damage some of your mind-maps, BibTeX or PDF files. Personally, we use this preview for productive use but we also make daily backups of our data. Hence, do not use this version with your real data, or make regular backups!

  • Windows 
  • Linux
  • Mac OS X
  • All OS ZIP

If you like what Christoph did, and if you are using LibreOffice, or OpenOffice, please also read our call for donation to develop an add-on for these two text processing tools.


3 Comments

Frank · 28th March 2014 at 18:50

“If Docear’s licence (GPL) is not suitable for your project, let us know. We have no problem to release the code under another open source licence such as Apache or whatsoever. ”

Personally would like that the code is released under the GPL, because my small Donation was intended to support Docear under the GPL.

For future donation campaigns please make the form of intended Licence clearer, or even better stick to the GPL. Because the GPL makes the publication of the source code obligatory and so could the code be reused/reviewed… by the programming community.

    Joeran [Docear] · 29th March 2014 at 10:35

    The source code is released under GPL. However, if there is a project using e.g. the Apache Licence, why should we not additionally release the source code that we wrote under Apache Licence? Or another open source licence? Maybe, I should have made that clearer. I will modify my post…

gutzhr · 2nd July 2014 at 00:13

How to revise the website of Google Scholar?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.