Abstract
BackgroundThis study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.ResultsThere were 12,500 samples annotated with approximately 30 variables, in each of six disease categories – breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures.ConclusionWe show that it is possible to perform manual re-annotation of a large repository in a reliable manner.
Highlights
IntroductionLarge repositories of gene expression data are currently available and serve as online resources for researchers, including the Gene Expression Omnibus (GEO), the Center for Information Biology Gene Expression Database (CIBEX), the European Bioinformatics Institute’s ArrayExpress and the Stanford Tissue Microarray Database [1,2,3,4]
This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus
Repositories for gene expression data such as GEO allow for widespread distribution of gene expression measurements in order to: (1) validate experimental results, (2) enable progressive accumulation of data that may support, modify or further develop prior work, and (3) facilitate use of archived measurements to generate novel hypotheses that naturally develop from continuous updating of accumulated data
Summary
Large repositories of gene expression data are currently available and serve as online resources for researchers, including the Gene Expression Omnibus (GEO), the Center for Information Biology Gene Expression Database (CIBEX), the European Bioinformatics Institute’s ArrayExpress and the Stanford Tissue Microarray Database [1,2,3,4]. There are no standardized documentation rules, so phenotypic and/or protocol information resides in multiple documents and physical locations. Such information may be included as text describing the experiment or protocol, sample and sampling descriptions, or may be found only in the published journal article that may accompany the submission. The data have to contain sufficient detail to allow for appropriate combination of similar experimental subjects and protocols that may collectively facilitate the verification, support, or development of new hypotheses
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.