Petabyte-scale innovations at the European Nucleotide Archive

Guy Cochrane, Mikyung Jang, Steven Leonard, Lawrence Bower, Petra Ten Hoopen, James K Bonfield, Hamish Mcwilliam, Rasko Leinonen, Quan Lin, Ruth Akhtar, Nadeem Faruque, Rodrigo López, R W Vaughan, Siamak Sobhany, Gemma Hoad, Szilveszter Juhos, Gaurab Mukherjee, Fehmi Demiralp, Dariusz Lorenc, Rajesh Radhakrishnan, Vadim Zalunin, Chris Hunter, Stephen J Robinson, Tim Hubbard, Richard L Gibson, Sheila Plaister, Ewan Birney

doi:10.1093/nar/gkn765

Open Access

Abstract

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which feed these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next-generation sequence data and a brief summary of the contents of the ENA, and we provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.

Full Text