Abstract

The Medical Subject Headings (MeSH) term search is a typical data-gathering method in biomedical text mining. However, it has two problems: delay in MeSH term allocation and missed valuable literature sources. Because MeSH terms are allocated manually by humans, the allocation process is delayed. In addition, even after MeSH terms have been allocated, valuable literature sources are still missed during data gathering: some sources are not indexed under the MeSH term of a keyword even though they contain valuable information related to that term, so a MeSH term search does not retrieve them. To resolve these problems, we propose a novel method to gather rich data using a one-class support vector machine (SVM) and a relevance rule. Term frequency-inverse document frequency (TF-IDF) and paragraph vectors are examined as text vectorization methods with various parameters and relevance factors. We apply our method to lung cancer, prostate cancer, breast cancer, and Alzheimer's disease. As a result, up to 26% of keyword data and 35% of target data are gathered with high quality (a C-score of at least 0.948).
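
The general idea of pairing TF-IDF vectorization with a one-class SVM can be sketched as follows. This is a minimal illustration, not the method evaluated in this work: it assumes scikit-learn is available, the example documents are invented placeholders, the kernel and `nu` values are arbitrary, and the relevance rule and paragraph vector variant are not shown because the abstract does not specify them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-ins for abstracts already indexed under a target MeSH term.
indexed_docs = [
    "lung cancer patients treated with targeted therapy",
    "non-small cell lung cancer tumor mutation analysis",
    "screening for lung cancer with low-dose computed tomography",
    "lung tumor growth and metastasis in cancer patients",
]

# Hypothetical candidates that are not (yet) indexed under that MeSH term.
candidate_docs = [
    "mutation analysis of tumor samples from lung patients",
    "traffic congestion modeling in urban road networks",
]

# Fit TF-IDF on the indexed documents only, then reuse the same vocabulary
# to vectorize the candidates.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(indexed_docs)
X_cand = vectorizer.transform(candidate_docs)

# Train the one-class SVM on positive (indexed) examples only.
# kernel and nu are illustrative choices, not values from this work.
model = OneClassSVM(kernel="linear", nu=0.5)
model.fit(X_train)

# Higher decision scores mean a candidate looks more like the indexed set;
# predict() returns +1 for inliers and -1 for outliers.
scores = model.decision_function(X_cand)
labels = model.predict(X_cand)

for doc, score, label in zip(candidate_docs, scores, labels):
    print(f"{label:+d}  {score:.3f}  {doc}")
```

Training on indexed documents alone is what makes a one-class model a natural fit here: only positive examples are reliably labeled, since the absence of a MeSH term does not imply that a document is irrelevant.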
