Abstract

Background: There is a huge diversity of microbial taxa, the majority of which have yet to be fully characterized or described. Plant, animal and fungal taxa are formally named and described in numerous vehicles. For prokaryotes, by contrast, all new validly described taxa appear in just one repository: the International Journal of Systematic and Evolutionary Microbiology (IJSEM). This is the official journal of record for bacterial names of the International Committee on Systematics of Prokaryotes (ICSP) of the International Union of Microbiological Societies (IUMS). It also covers the systematics of yeasts. This makes IJSEM an excellent candidate against which to test systems for the automated and semi-automated synthesis of published phylogenies.

New information: In this paper we apply computer vision techniques to automatically convert phylogenetic tree figure images from IJSEM back into re-usable, computable phylogenetic data in the form of Newick strings and NeXML. We then use the extracted phylogenetic data to compute a formal MRP (matrix representation with parsimony) supertree synthesis, and we compare this to previous hypotheses of taxon relationships given by NCBI’s standard taxonomy tree. This is the world’s first attempt at automated supertree construction using data exclusively extracted by machines from published figure images. Additionally, we reflect on how recent changes to UK copyright law have enabled this project to go ahead without requiring permission from copyright holders, and on the related challenges and limitations of doing research on copyright-restricted material.
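To make the supertree step concrete, the sketch below shows how a set of Newick strings can be recoded into an MRP character matrix using standard Baum-Ragan coding: each non-trivial clade of each source tree becomes one column, with `1` for taxa inside the clade, `0` for taxa in that tree but outside the clade, and `?` for taxa absent from that tree. This is an illustration only, not the authors' pipeline. The function names `newick_clades` and `mrp_matrix` are hypothetical, and the parser assumes simple parenthesized Newick without quoted labels or comments.

```python
import re


def newick_clades(newick):
    """Return (leaves, clades) for one parenthesized Newick string.

    leaves is the set of leaf labels; clades is a list of frozensets,
    one per internal node other than the root. Branch lengths and
    internal-node labels are ignored. Assumption: no quoted labels,
    no comments, and the tree is wrapped in parentheses.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick)
    stack = []    # one leaf set per currently open parenthesis
    leaves = set()
    clades = []
    prev = None
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            if stack:  # the outermost group is the root, not a clade
                clades.append(frozenset(done))
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")":
            # A leaf label, possibly followed by ":branch_length"
            name = tok.split(":")[0].strip()
            if name:
                leaves.add(name)
                stack[-1].add(name)
        # A token directly after ")" is a node label or branch length: skipped
        prev = tok
    return leaves, clades


def mrp_matrix(newicks):
    """Build a Baum-Ragan MRP matrix as {taxon: string of '0', '1', '?'}."""
    parsed = [newick_clades(n) for n in newicks]
    taxa = sorted(set().union(*(leaves for leaves, _ in parsed)))
    rows = {t: [] for t in taxa}
    for leaves, clades in parsed:
        for clade in clades:
            # Skip uninformative groupings (single leaf or the whole tree)
            if not 1 < len(clade) < len(leaves):
                continue
            for t in taxa:
                if t in clade:
                    rows[t].append("1")
                elif t in leaves:
                    rows[t].append("0")
                else:
                    rows[t].append("?")
    return {t: "".join(chars) for t, chars in rows.items()}


if __name__ == "__main__":
    for taxon, row in mrp_matrix(["((A,B),(C,D));", "((A,C),E);"]).items():
        print(taxon, row)
```

The resulting matrix is only the input to a supertree analysis. A parsimony program would still be needed to search for the tree or trees that best fit the matrix, typically with an all-zero outgroup row added for rooting.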

Highlights

  • A recent study estimated that there are more than 114,000,000 documents in the published scientific literature (Khabsa and Giles 2014)

  • We present the results of our efforts to extract phylogenetic data from images contained in the primary research literature

  • Due to copyright restrictions imposed by the publisher of the International Journal of Systematic and Evolutionary Microbiology (IJSEM), we do not feel that we can safely share all of the 5,816 source PDFs, or the 8,221 figure images we found in those PDFs, that are used or referred to in this study


Summary

Introduction

A recent study estimated that there are more than 114,000,000 documents in the published scientific literature (Khabsa and Giles 2014). For prokaryotes, however, all new validly described taxa appear in just one repository: the International Journal of Systematic and Evolutionary Microbiology (IJSEM). This is the official journal of record for bacterial names of the International Committee on Systematics of Prokaryotes (ICSP) of the International Union of Microbiological Societies (IUMS). This makes IJSEM an excellent candidate against which to test systems for the automated and semi-automated synthesis of published phylogenies.

