Enhancing Text Categorization with Semantic-enriched Representation and Training Data Augmentation

Xinghua Lu, Bin Zheng, Atulya Velivelli, Chengxiang Zhai

doi:10.1197/jamia.m2051

Abstract

Objective: Acquiring and representing biomedical knowledge is an increasingly important component of contemporary bioinformatics. A critical step of the process is to efficiently identify and retrieve relevant documents from the vast volume of modern biomedical literature. In the real world, many information retrieval tasks are difficult because of high data dimensionality and the lack of annotated examples to train a retrieval algorithm. Under such a scenario, the performance of information retrieval algorithms is often unsatisfactory, and improvements are needed.

Design: We studied two approaches that enhance text categorization performance on sparse, high-dimensional data: (1) semantic-preserving dimension reduction by representing text with semantic-enriched features; and (2) augmenting training data with semi-supervised learning. A probabilistic topic model was applied to extract major semantic topics from a corpus of text of interest. The representation of documents was projected from the high-dimensional vocabulary space onto a semantic topic space with reduced dimensionality. A semi-supervised learning algorithm based on graph theory was applied to identify potential positive training cases, which were then used to augment the training data. The effects of data transformation and augmentation on text categorization by support vector machine (SVM) were evaluated.

Results and Conclusion: Semantic-enriched data transformation and training data augmented with pseudo-positive cases enhance the efficiency and performance of text categorization by SVM.
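The augmentation step in the Design can be illustrated with a small sketch. The abstract does not name the specific graph-based algorithm, so the code below uses plain iterative label propagation over a cosine-similarity graph as a stand-in, not the authors' method. It assumes documents have already been projected onto a topic space; the function names (`propagate_labels`, `pseudo_positives`), the toy topic vectors, and the 0.6 threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(vectors, labels, n_iter=50):
    """Spread labels over a fully connected similarity graph.

    labels: 1.0 = positive, 0.0 = negative, None = unlabeled.
    Returns one score in [0, 1] per document.
    """
    n = len(vectors)
    # Edge weights: cosine similarity, no self-loops.
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Unlabeled documents start at a neutral score.
    scores = [0.5 if y is None else y for y in labels]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(labels[i])  # clamp known labels
                continue
            total = sum(weights[i])
            if total:
                avg = sum(w * s for w, s in zip(weights[i], scores)) / total
            else:
                avg = scores[i]
            updated.append(avg)
        scores = updated
    return scores


def pseudo_positives(vectors, labels, threshold=0.6):
    """Indices of unlabeled documents whose propagated score meets the threshold."""
    scores = propagate_labels(vectors, labels)
    return [i for i, y in enumerate(labels) if y is None and scores[i] >= threshold]


# Toy data (assumed): each row is a document's proportions over three topics.
topic_vectors = [
    [0.90, 0.05, 0.05],  # labeled positive
    [0.05, 0.90, 0.05],  # labeled negative
    [0.85, 0.10, 0.05],  # unlabeled, close to the positive example
    [0.10, 0.85, 0.05],  # unlabeled, close to the negative example
]
labels = [1.0, 0.0, None, None]

extra = pseudo_positives(topic_vectors, labels)
# Augmented positive set: original positives plus pseudo-positive cases.
augmented_positive_ids = [i for i, y in enumerate(labels) if y == 1.0] + extra
```

In a full pipeline, the augmented positive set would then be passed together with the negatives to an SVM trainer; that step is omitted here to keep the sketch dependency-free.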

Journal: Journal of the American Medical Informatics Association	Publication Date: Aug 30, 2006
Citations: 53

Similar Papers

A Text Categorization Method Based on SVM and Improved K-Means
Rong Ze Xia ... Hu Li
Applied Mechanics and Materials | VOL. 427-429
01 Sep 2013

Exploiting probabilistic topic models to improve text categorization under class imbalance
Enhong Chen ... Haiping Ma
Information Processing and Management | VOL. 47
01 Sep 2010

A hybrid approach for text categorization by using χ² statistic, principal component analysis and particle swarm optimization
Scientific Research and Essays | VOL. 8
04 Oct 2013

Unraveling the performance of the benthic index AMBI in a subtropical bay: The effects of data transformations and exclusion of low-reliability sites
Helio H Checon ... A Cecília Z Amaral
Marine Pollution Bulletin | VOL. 126
01 Dec 2017
