Abstract

This paper describes a unique two-step methodology used to construct six linked bibliometric datasets covering the sequencing of Saccharomyces cerevisiae, Homo sapiens, and Sus scrofa genomes. First, we retrieved all sequence submission data from the European Nucleotide Archive (ENA), including the accession numbers associated with each species. Second, we used these accession numbers to construct queries that retrieved the peer-reviewed scientific publications that first described these sequences in the scientific literature. For each species, this resulted in two associated datasets: 1) a .csv file documenting the PMID of each article describing new sequences, all paper authors, all institutional affiliations of each author, countries of institution, year of first submission to the ENA, and the year of article publication, and 2) a .csv file documenting all institutions submitting to the ENA, number of nucleotides sequenced, number of submissions per institution in a given year, and years of submission to the database. In several upcoming publications, we utilise these datasets to understand how institutional collaboration shaped sequencing efforts, and to systematically identify important institutions and changes in network structures over time. This paper should therefore aid researchers who would like to use these data in future analyses by making the methodology that underpins them transparent. Further, because we detail our methodology, researchers may be able to utilise our approach to construct similar datasets in the future.
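The second dataset described above (submissions and nucleotides per institution per year) can be illustrated with a minimal sketch. The record layout, field names, accession strings, and institution names below are hypothetical placeholders, not the ENA schema or the authors' actual code; the sketch only shows the kind of aggregation the abstract describes.

```python
from collections import defaultdict

# Hypothetical submission records. The keys are illustrative assumptions,
# not field names taken from the ENA.
submissions = [
    {"accession": "ACC0001", "institution": "Institute A", "year": 1992, "nucleotides": 1200},
    {"accession": "ACC0002", "institution": "Institute A", "year": 1992, "nucleotides": 800},
    {"accession": "ACC0003", "institution": "Institute B", "year": 1993, "nucleotides": 500},
]


def summarise_by_institution(records):
    """Count submissions and sum nucleotides for each (institution, year) pair."""
    summary = defaultdict(lambda: {"submissions": 0, "nucleotides": 0})
    for rec in records:
        key = (rec["institution"], rec["year"])
        summary[key]["submissions"] += 1
        summary[key]["nucleotides"] += rec["nucleotides"]
    return dict(summary)


summary = summarise_by_institution(submissions)
```

Each entry of `summary` corresponds to one row of a per-institution, per-year table, which could then be written out with the standard `csv` module.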

Highlights

  • This paper describes the methodology used to construct six novel datasets for the European Research Council funded project, Medical Translation in the History of Modern Genomics; a project exploring the history of scientific collaboration around DNA sequencing

  • The datasets contain information specific to the genomic sequencing of Saccharomyces cerevisiae, Homo sapiens, and Sus scrofa, and consist of data related to sequence submissions to public databases and co-authorship relations underpinning the description of those sequences in the scientific literature

  • We linked particular sequence submissions to the peer-reviewed publications that first described them in the literature via Application Programming Interface (API) queries, which used sequence accession numbers to mine Europe PubMed Central and SCOPUS
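The accession-to-publication linking step might look roughly like the following sketch. It only builds a search URL for the Europe PMC REST search endpoint and picks the earliest hit from an already-parsed result list; it makes no network call. The `ACCESSION_ID` search field, the `pageSize` value, and the `pmid` and `pubYear` result keys are assumptions about the Europe PMC service that should be checked against its documentation, and the accession strings are placeholders. This is not the authors' actual query code.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query_url(accessions, page_size=100):
    """Build one search URL that OR-joins a batch of accession numbers."""
    # Assumption: ACCESSION_ID is the search field for cited accession numbers.
    terms = " OR ".join(f'ACCESSION_ID:"{acc}"' for acc in accessions)
    params = {"query": terms, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def earliest_publication(hits):
    """Return the hit with the smallest publication year, or None if none is dated."""
    # Assumption: each hit is a dict carrying a 'pubYear' string.
    dated = [h for h in hits if h.get("pubYear")]
    return min(dated, key=lambda h: int(h["pubYear"])) if dated else None


url = build_query_url(["ACC0001", "ACC0002"])
first = earliest_publication([
    {"pmid": "1", "pubYear": "1996"},
    {"pmid": "2", "pubYear": "1994"},
])
```

Batching several accession numbers into one OR query keeps the number of requests down; the trade-off is that each hit then has to be matched back to the accession that produced it.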


Summary

Jake Lever, Stanford University, Stanford, USA

Any reports and responses or comments on the article can be found at the end of the article. We adopted the suggestion by reviewer 1 and cited a new reference (Leonelli, 2016) in the reflection on strengths and weaknesses section. We added this to the reference list.

Leonelli S