During my studies on awareness support for researchers, I often found that the way researchers work with scientific publications is somewhat less than perfect. In fact, many of my interviewees said it is a mess (an assessment I can second from my own experience). In many cases you are not aware of all the related and relevant publications you should read for the paper you are working on. When doing literature research for your thesis or project, there is a good chance that you miss important literature, simply because you did not go through all the references of all the papers you read. There is evidence that today's researchers read far more papers than their 1970s counterparts (Renear and Palmer 2009) while spending considerably less time on each individual publication. Some people call this information overload; others claim it is filter failure. Recommender systems seem to be one way of overcoming this issue, but they need to be to the point.
When you organize scientific events or serve as a reviewer for such events or for journals, you have to make sure that the submissions you receive meet several criteria, such as correct usage of the provided template, sufficient references, and adherence to the page limit. You also want to make sure that the submissions are not plagiarized: authors should neither have copied from other researchers' publications nor from their own earlier ones without correct attribution. While there are several online services and software tools that check writings for plagiarism, most of them are unusable in a setting where you want to integrate their APIs into your own systems and tools, or where you deal with large corpora of text documents and are interested in extrinsic plagiarism detection. A similar problem exists when grading seminar, bachelor, and master theses in a university setting. Lately, crowdsourced projects like VroniPlag or GuttenPlag have also found severe cases of plagiarism in PhD theses of public figures.
Driven by the issues above, we brainstormed what an infrastructure could look like that supports the analysis of, and recommendations for, scientific publications on a large scale. We wanted to build on tools that allow for massive scalability and the processing of big data. After some research, we decided on Apache Hadoop and its ecosystem as the scaffolding for our tools and created the HCPA project (Hadoop Cluster for Large Scale Publication Analyses), which was awarded funding from the University of Paderborn's young researcher fund. We then investigated which hardware would be suitable for the cluster, ran some benchmarks, and finally bought the necessary hardware. We already had a small cluster of 4 nodes and bought 14 new nodes. We assembled all the hardware ourselves and set up the software. Currently, our cluster consists of 18 worker nodes with 102 cores (12x Core2 Quad 2.8 GHz, 6x Intel Xeon 2.9 GHz, 84x AMD FX 6100 3.3 GHz), 260 GB RAM (3x 8 GB, 1x 12 GB, 14x 16 GB), and 115 TB HDD (56x 2 TB, 1x 500 GB, 9x 330 GB). All worker nodes run Ubuntu Server as the operating system. The Hadoop framework, including its configuration, is distributed from the master node and runs on Java 1.6 (because of differences in string processing between Java versions).
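To give a rough idea of how jobs run on this setup, here is a minimal MapReduce sketch in the style we use on the cluster: it counts term frequencies across a corpus of plain-text publications stored in HDFS. The class and path names are illustrative only and not taken from our actual code base; it simply shows the kind of map/reduce pattern the worker nodes execute.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch: count term frequencies over a corpus of
// publication full texts stored as plain text files in HDFS.
public class PublicationTermCount {

    public static class TokenMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (term, 1) for every token in the current line of text.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                term.set(tokens.nextToken().toLowerCase());
                context.write(term, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum up all counts emitted for the same term.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "publication term count");
        job.setJarByClass(PublicationTermCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, such a job would be submitted to the cluster with something like `hadoop jar publication-term-count.jar PublicationTermCount /publications /term-counts` (the HDFS paths are, again, hypothetical).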
Tobias Varlemann is the lead developer in the HCPA project and is currently working on his master thesis, titled "Near Copy Detection in Large Text Corpora". He blogs about his progress (in German) in our student blog. Moreover, I am supervising three other theses that use our HCPA cluster: the first deals with author name disambiguation; the second calculates similarities between publications, implements an iterative clustering algorithm, and deals with classification; the third deals with faceted search interfaces (using Apache Solr) for the available data and provides recommendations based on the publications' content and on scientometric information such as bibliographic coupling or co-citation. Finally, my current project group PUSHPIN (Supporting Scholarly Awareness in Publications and Social Networks) uses the Hadoop infrastructure in combination with Twitter Storm for metadata and reference extraction, recommendations, and visual analytics of scientific data.
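To illustrate the scientometric side of those recommendations, here is a small sketch of bibliographic coupling: two publications are coupled through the references they share, and the size of that overlap (optionally Jaccard-normalized) can serve as a similarity signal. The class name, method names, and reference keys below are made up for illustration and are not taken from the theses' code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: bibliographic coupling between two publications,
// i.e. the number of references they have in common, optionally normalized
// (Jaccard) to a value between 0 and 1.
public class BibliographicCoupling {

    /** Absolute coupling strength: |refsA intersect refsB|. */
    public static int couplingStrength(Set<String> refsA, Set<String> refsB) {
        Set<String> shared = new HashSet<String>(refsA);
        shared.retainAll(refsB);
        return shared.size();
    }

    /** Jaccard-normalized coupling: |A intersect B| / |A union B|. */
    public static double normalizedCoupling(Set<String> refsA, Set<String> refsB) {
        if (refsA.isEmpty() && refsB.isEmpty()) {
            return 0.0;
        }
        Set<String> union = new HashSet<String>(refsA);
        union.addAll(refsB);
        return (double) couplingStrength(refsA, refsB) / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical reference keys for two papers.
        Set<String> paperA = new HashSet<String>();
        paperA.add("renear2009"); paperA.add("salton1975"); paperA.add("page1999");
        Set<String> paperB = new HashSet<String>();
        paperB.add("salton1975"); paperB.add("page1999"); paperB.add("deerwester1990");
        System.out.println("shared references: " + couplingStrength(paperA, paperB));
        System.out.println("normalized coupling: " + normalizedCoupling(paperA, paperB));
    }
}
```

Co-citation works analogously, except that the sets compared are the sets of papers that cite the two publications rather than the references they contain.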
A. H. Renear and C. L. Palmer. Strategic reading, ontologies, and the future of scientific publishing. Science, 325(5942), August 2009.