We need your help (i.e. a server) to build a repository for academic PDF files
A while ago we started crawling the Web for academic PDFs to index them and use them for Docear’s research paper recommender system. Since then, we have collected quite a few PDFs. Unfortunately, in the foreseeable future our servers’ disks will be full, and the load on our servers is already too high (that’s why you sometimes won’t get recommendations in Docear: our servers are simply too busy).
Since our budget is tight and we don’t want to spend too much time on server administration either, we are asking for your help: Do you have a server that you could spare? What we need is the following:
- Storage space for backing up the indexed PDFs.
Right now we delete a PDF once it’s indexed. However, we would like to keep the PDFs in case we need to re-index them. Therefore, we need a server with lots of disk space (10 TB or more), and there will be about 20-30 GB of traffic a day to back up the PDFs to the server. Otherwise, there are no special requirements for that server.
- PDF-Caching Server for making the PDFs publicly available
We would love to cache the PDFs because many of them are deleted from the Web after a while. That means we would like to enable our users to download the PDFs from our/your server. So, the requirements for this server would be the same as for the backup server (10 TB+ storage), plus the option to run an Apache and maybe a Tomcat, too. In addition, there would be more traffic (the 20-30 GB of uploads plus the downloads by the users).
- PDF Indexing Server to download and index PDF files
Right now, the bottleneck of the entire system is the download and processing of the PDFs (this takes a few seconds per PDF). What we need is a really powerful server, especially in terms of CPU power. The server would have to download PDFs 24/7 (the URLs of the PDFs would be delivered by Docear’s main server), process them (convert to text, extract title, extract references), add each PDF to our Lucene index, and back up the PDF to the backup server. Accordingly, storage requirements are rather low (a few gigabytes should be enough; an SSD would be awesome), but there will be about 40-60 GB of traffic a day and the CPU load will be close to 1 most of the time.
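To give a feeling for what these numbers mean in practice, here is a rough back-of-the-envelope sketch. It is only an estimate under a few assumptions that are not part of our requirements: the disk starts empty, 1 TB is counted as 1000 GB, and the traffic is spread evenly over the day. The function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_TB = 1000  # assumption: decimal units (1 TB = 1000 GB)


def days_until_full(capacity_tb: float, daily_gb: float) -> float:
    """Days until a disk that starts empty is full at a constant daily volume."""
    return capacity_tb * GB_PER_TB / daily_gb


def sustained_mbit_per_s(daily_gb: float) -> float:
    """Average bandwidth in Mbit/s if the daily traffic is spread evenly."""
    bits_per_day = daily_gb * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    # Backup server: 10 TB of storage, 20-30 GB of new PDFs per day
    for daily_gb in (20, 30):
        print(f"{daily_gb} GB/day fills 10 TB in {days_until_full(10, daily_gb):.0f} days")

    # Average bandwidth for the backup (20-30 GB) and indexing (40-60 GB) traffic
    for daily_gb in (20, 30, 40, 60):
        print(f"{daily_gb} GB/day is about {sustained_mbit_per_s(daily_gb):.1f} Mbit/s on average")
```

Under these assumptions, 10 TB would last roughly 333 to 500 days for the backups, and even the 40-60 GB a day of the indexing server averages below 6 Mbit/s. Real traffic comes in bursts, of course, so peak bandwidth would be higher.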
Since we may use the facilities at our university, we would be able to host the server at our university’s data center. However, if you hosted the server yourself and just provided us with the login data, that would be even better :-).
If you think you could help, please send us an email at firstname.lastname@example.org and let us know how we could give something in return (e.g. your logo on our website).