Non-linear correlation of content and metadata information extracted from biomedical article datasets

Theodosios Theodosiou, Lefteris Angelis, Athena Vakali

doi:10.1016/j.jbi.2007.06.004

Open Access

Journal: Journal of Biomedical Informatics
Publication Date: Jun 10, 2007
Citations: 28
License type: publisher-specific-oa

Affiliation: Aristotle University of Thessaloniki

Abstract

Biomedical literature databases are valuable repositories of up-to-date scientific knowledge. Developing efficient machine learning methods that facilitate the organization of these databases and the extraction of novel biomedical knowledge is becoming increasingly important. Several of these methods require documents to be represented as vectors of variables, which form large multivariate datasets. Because the information contained in the different datasets is voluminous, an open issue is how to combine information from various sources into a single concise dataset that efficiently represents the corpus of documents. This paper investigates the use of a multivariate statistical approach, Non-Linear Canonical Correlation Analysis (NLCCA), to exploit the correlation among the variables of different document representations and describe the documents with a single new dataset. Experiments with document datasets represented by text words, Medical Subject Headings (MeSH) terms, and Gene Ontology (GO) terms showed the effectiveness of NLCCA.
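
To make the idea of "documents as vectors of variables" concrete, the following is a minimal sketch of how two representations of the same corpus might be encoded as row-aligned matrices. It is not the authors' pipeline: the abstract does not state how terms are encoded, so the binary (present/absent) coding is an assumption, the helper names `build_vocabulary` and `to_indicator_matrix` are hypothetical, and the three toy documents are invented for illustration. NLCCA itself is not implemented here; the sketch only shows the kind of paired input such a method would correlate.

```python
from typing import List, Sequence


def build_vocabulary(docs: Sequence[Sequence[str]]) -> List[str]:
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in docs for term in doc})


def to_indicator_matrix(docs: Sequence[Sequence[str]],
                        vocab: Sequence[str]) -> List[List[int]]:
    """Encode each document as a 0/1 row with one column per vocabulary term.

    Binary coding is an assumption made for this sketch.
    """
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            row[index[term]] = 1
        matrix.append(row)
    return matrix


# Toy corpus (invented): the same three documents seen through two views.
text_words = [
    ["kinase", "binding", "protein"],
    ["protein", "expression"],
    ["kinase", "expression"],
]
mesh_terms = [
    ["Protein Kinases", "Humans"],
    ["Gene Expression", "Humans"],
    ["Protein Kinases", "Gene Expression"],
]

word_vocab = build_vocabulary(text_words)
mesh_vocab = build_vocabulary(mesh_terms)

# Row i of each matrix describes the same document i.
X_words = to_indicator_matrix(text_words, word_vocab)
X_mesh = to_indicator_matrix(mesh_terms, mesh_vocab)
```

Because row i of `X_words` and row i of `X_mesh` refer to the same document, a canonical-correlation-type method can look for low-dimensional document scores that agree across the two views, which is the role NLCCA plays in the paper.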

Full Text