Abstract

It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories makes it difficult for users to choose a suitable data set from among the large number of search results. Repository users need to be able to set a threshold that reduces the number of results and yields a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1,171,313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information with the metadata of experiments and samples. We provide quality information for all of the archived sequencing data, which enables users to obtain sequencing data of sufficient quality for reanalysis. The calculated quality data are available to the public in various formats. Our data also provide an example of how a third party can enhance the reuse of public data by adding metadata to published research data.

Highlights

  • It is important for public data repositories to promote the reuse of archived data

  • To promote the reuse of combined sets of data from multiple projects, public repositories have to provide a filtering feature in data searches, so that users can control the number of experiments and quality of the data in their searches

  • By calculating quantitative variables of sequencing data and integrating them with information on experiments and sample organisms, we made it possible to obtain an appropriately sized subset of data from multiple projects archived in the repository
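
The threshold-based filtering described in the highlights can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the record fields (experiment_id, organism, mean_base_quality, total_reads), the example values, and the function name select_experiments are hypothetical and do not reflect the format of the published quality data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentQuality:
    """One experiment with quality values merged with sample metadata.

    All field names are hypothetical; the published data may differ.
    """
    experiment_id: str
    organism: str
    mean_base_quality: float  # assumed per-experiment mean base quality score
    total_reads: int          # assumed total number of reads


def select_experiments(records, organism, min_quality, min_reads=0):
    """Keep experiments for one organism that meet both thresholds."""
    return [
        r for r in records
        if r.organism == organism
        and r.mean_base_quality >= min_quality
        and r.total_reads >= min_reads
    ]


# Made-up records for illustration only (not real SRA entries).
records = [
    ExperimentQuality("EXP_A", "Homo sapiens", 35.2, 2_500_000),
    ExperimentQuality("EXP_B", "Homo sapiens", 18.7, 3_000_000),
    ExperimentQuality("EXP_C", "Mus musculus", 36.0, 1_200_000),
    ExperimentQuality("EXP_D", "Homo sapiens", 33.1, 400_000),
]

selected = select_experiments(
    records, organism="Homo sapiens", min_quality=30.0, min_reads=1_000_000
)
print([r.experiment_id for r in selected])
```

Raising or lowering min_quality and min_reads changes the size of the returned subset, which is the kind of control over search results that the highlights call for.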
