Abstract

Reusing the enormous amounts of biomedical data available on the Web requires good-quality metadata, which is essential to making data maximally Findable, Accessible, Interoperable and Reusable. The Gene Expression Omnibus (GEO) allows users to specify metadata in the form of textual key: value pairs (e.g. sex: female). However, since no structured vocabulary or format is available, the 44,000,000+ key: value pairs suffer from numerous quality issues. Curation by domain experts is both time-consuming and unscalable. In our approach, MetaCrowd, we therefore apply crowdsourcing as a means of assessing GEO metadata quality. Our results show that crowdsourcing is a reliable and feasible way to identify similar as well as erroneous metadata in GEO. This helps data consumers and producers curate and provide good-quality metadata.

Highlights

  • Advancements in molecular technologies have enabled extensive profiling of biological samples, resulting in massive amounts of data that can be analyzed to better understand living systems

  • We focused on experimental metadata from the Gene Expression Omnibus (GEO), in particular on sample records (Sample)

  • We evaluated our approach for 1,643 GEO metadata keys belonging to eight key categories: (i) cell line, (ii) disease, (iii) gender, (iv) genotype, (v) strain, (vi) time, (vii) tissue and (viii) treatment

Summary

Introduction

Advancements in molecular technologies have enabled extensive profiling of biological samples, resulting in massive amounts of data that can be analyzed to better understand living systems. Journals, funding agencies, and investigators all recognize the value of these data for reproducing published findings, validating their own results, and generating new and interesting hypotheses (Barrett et al, 2013b). Consider the work of Khatri and colleagues (Khatri et al, 2013), who used publicly available expression data from the Gene Expression Omnibus (GEO) (Edgar et al, 2002; Barrett et al, 2013a) to identify gene signatures predictive of tissue graft rejection. Their work required them to search for and tediously curate important sample characteristics (organism, tissue, protocol, etc.) for deposited samples. This documentation, otherwise known as metadata, helps investigators understand the meaning and provenance of the data (Borgman, 2012). Incomplete, imprecise metadata makes it difficult to find datasets
