Predicting structured metadata from unstructured metadata.

Lisa Posch, Michel Dumontier, Olivier Gevaert, Maryam Panahiazar

doi:10.1093/database/baw080

Abstract

Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is an accurate, structured and complete description of the data, or data about the data, defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using Latent Dirichlet Allocation (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be most accurately predicted using the TF-IDF approach, followed by LDA, with both outperforming the majority vote baseline. While some accuracy is lost by the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction.

Database URL: http://www.yeastgenome.org/
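
The LDA step mentioned in the abstract reduces each free-text description to a short vector of topic proportions, which a classifier can then use as features. The following is a minimal sketch of that idea only, assuming a collapsed Gibbs sampler written in plain Python; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy sample descriptions are illustrative assumptions and are not taken from the paper, which does not state its LDA implementation here.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not the paper's code).

    docs is a list of token lists. Returns one smoothed topic distribution
    (a list of num_topics probabilities) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens of word w in topic k
    topic_total = [0] * num_topics                               # all tokens in topic k
    assign = []                                                  # topic of each token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, old = word_id[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][wid] -= 1
                topic_total[old] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][wid] += 1
                topic_total[new] += 1

    # Each document becomes a num_topics-dimensional feature vector.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical unstructured sample descriptions (not real GEO records).
samples = [
    "human liver biopsy total rna".split(),
    "mouse liver tissue total rna".split(),
    "human breast tumor biopsy".split(),
    "mouse embryonic fibroblast culture".split(),
]
topic_features = lda_gibbs(samples, num_topics=2)
for features in topic_features:
    print([round(p, 3) for p in features])
```

Each row of `topic_features` has only `num_topics` entries regardless of vocabulary size, which is the dimensionality reduction the abstract refers to; the accuracy trade-off it reports comes from classifying on these compact vectors instead of the full TF-IDF representation.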

Highlights

Enormous amounts of biomedical data have been and are being produced by investigators all over the world
Several databases were created in the process to house this data and make it available to the community at large, such as the NCBI databases for microarray data, Gene Expression Omnibus (GEO) [1], and for sequence data
The main contribution of this paper is to explore whether unstructured gene expression sample metadata contains information that can be exploited for predicting structured metadata using traditional text mining methods based on term frequency-inverse document frequency (TF-IDF)
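
The TF-IDF-based prediction described in the highlights can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the smoothed IDF variant, the nearest-centroid classifier, the helper names (`fit_idf`, `tfidf`, `train_centroids`, `predict`, `majority_label`) and the toy descriptions with their "organism" values are all hypothetical choices made to keep the example self-contained. It also includes the majority-vote baseline that the abstract uses as a point of comparison.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every training term."""
    n = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in doc_freq.items()}


def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(w for w in doc if w in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}


def train_centroids(docs, labels, idf):
    """Sum the TF-IDF vectors of each class (an assumed, simple classifier)."""
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for w, v in tfidf(doc, idf).items():
            sums[label][w] += v
    return {label: dict(vec) for label, vec in sums.items()}


def predict(doc, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(doc, idf)

    def score(label):
        centroid = centroids[label]
        norm = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
        return sum(v * centroid.get(w, 0.0) for w, v in vec.items()) / norm

    return max(sorted(centroids), key=score)


def majority_label(labels):
    """Majority-vote baseline: always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical free-text descriptions and structured values (not real GEO data).
train_docs = [
    "total rna extracted from human liver biopsy".split(),
    "human hepatocyte cell line hepg2 treated".split(),
    "patient blood sample human pbmc".split(),
    "mouse liver tissue c57bl/6 strain".split(),
    "mouse embryonic fibroblast wild type strain".split(),
]
train_labels = ["Homo sapiens"] * 3 + ["Mus musculus"] * 2

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
query = "c57bl/6 mouse spleen".split()
print("TF-IDF prediction:", predict(query, idf, centroids))
print("Majority baseline:", majority_label(train_labels))
```

The baseline ignores the text entirely, so any gain the text-based classifier shows over it indicates that the unstructured description carries information about the structured field.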

Summary

Introduction

Enormous amounts of biomedical data have been and are being produced by investigators all over the world. This is mainly due to advancements in molecular technologies that have enabled extensive profiling of biological samples and have unleashed a myriad of omics data such as gene expression, microRNA expression, DNA methylation and DNA mutation data. It was realized that this data should be stored and shared with other investigators.

Journal: Database	Publication Date: Jan 1, 2016
Citations: 14	License type: cc-by

Similar Papers

Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)
Maryam Panahiazar ... Olivier Gevaert
Journal of Biomedical Informatics | VOL. 72
16 Jun 2017

Sentiment Analysis Using Modified LDA
Jingyi Ye ... Xiaojun Jing
19 Dec 2017

Correlated Latent Semantic Model for Unsupervised LM Adaptation
Yik-Cheung Tam ... Tanja Schultz
01 Apr 2007

Point-cloud detection of buildings based on a latent Dirichlet allocation model with waveform data
Liu Zhiqing ... Zhou Yang
Remote Sensing Letters | VOL. 11
26 Dec 2019
