Abstract

Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.

Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; when combined, these three sources of information lead to an overall improved classification performance.

Availability and implementation: Source code and the list of PMIDs of the publications in our datasets are available upon request.

Highlights

  • Biomedical research findings are typically reported via publications

  • Both the meta-classification scheme, CombC, and the classifier CombV, which uses concatenated vectors for document representation, attained statistically significant improvements over classification based on the titles-and-abstracts, the captions or the Figure-words alone

  • We presented a new scheme for identifying biomedical documents that are relevant to a certain domain, by using information derived from both images and captions, as well as from titles-and-abstracts
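
The two integration strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the source does not specify the classifiers or feature dimensions, so the toy data, the plain logistic regression, and every function name here (`make_toy_docs`, `comb_v_features`, `comb_c_features`, `train_logreg`) are hypothetical stand-ins. The sketch only contrasts concatenating the three per-document vectors (the CombV idea) with feeding the scores of three per-view classifiers into a second-level classifier (the CombC idea).

```python
import math
import random


def sigmoid(z):
    # Clamp to keep math.exp from overflowing on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, x):
    """Probability that feature vector x belongs to the relevant class."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logreg(X, y, epochs=300, lr=0.5):
    """Fit a small logistic-regression model by batch gradient descent.

    Assumption: logistic regression is only a stand-in classifier here.
    """
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba((weights, bias), x) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def make_toy_docs(n=40, seed=0):
    """Synthetic documents with three views; NOT real Figure-words or embeddings."""
    rng = random.Random(seed)
    docs, labels = [], []
    for i in range(n):
        y = i % 2
        figure_words = [y + rng.gauss(0, 0.5), rng.gauss(0, 0.5)]
        caption_vec = [rng.gauss(0, 0.5), y + rng.gauss(0, 0.5), rng.gauss(0, 0.5)]
        title_abs_vec = [y + rng.gauss(0, 0.5), rng.gauss(0, 0.5)]
        docs.append((figure_words, caption_vec, title_abs_vec))
        labels.append(y)
    return docs, labels


def comb_v_features(doc):
    """Vector-level integration: concatenate the three views into one vector."""
    figure_words, caption_vec, title_abs_vec = doc
    return figure_words + caption_vec + title_abs_vec


def comb_c_features(doc, base_models):
    """Classifier-level integration: one score per view from its own base model."""
    return [predict_proba(model, view) for model, view in zip(base_models, doc)]


docs, labels = make_toy_docs()

# CombV-style: a single classifier over the concatenated representation.
comb_v_model = train_logreg([comb_v_features(d) for d in docs], labels)

# CombC-style: base classifiers on one half, meta-classifier on the other half,
# so the meta-classifier sees scores for documents the base models did not fit.
base_docs, base_y = docs[:20], labels[:20]
meta_docs, meta_y = docs[20:], labels[20:]
base_models = [train_logreg([d[v] for d in base_docs], base_y) for v in range(3)]
meta_model = train_logreg(
    [comb_c_features(d, base_models) for d in meta_docs], meta_y
)

doc = docs[0]
print("CombV-style score:", predict_proba(comb_v_model, comb_v_features(doc)))
print("CombC-style score:", predict_proba(meta_model, comb_c_features(doc, base_models)))
```

In this sketch the concatenated variant lets one model weigh all features jointly, while the meta-classification variant keeps a separate model per information source and learns only how much to trust each one.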

Introduction

Biomedical research findings are typically reported via publications. To simplify access to domain-specific knowledge, while supporting the research community, several biomedical databases [e.g. UniProt (Bateman et al., 2021), BioGRID (Chatr-Aryamontri et al., 2017), WormBase (Harris et al., 2020) and MGI (Blake et al., 2021)] invest significant effort in expert curation of the literature. The first step in the biocuration process is to identify articles that are relevant to the specific area on which a given database focuses. For example, biocurators at the Jackson Laboratory’s Gene Expression Database (GXD) identify publications relevant to gene expression during mouse development (Finger et al., 2017). Selecting biomedical publications in such focus areas by hand is often too labor-intensive and slow for effectively detecting all and only the relevant articles within a large volume of published literature. Automatically identifying publications relevant to a specific topic is therefore an important task toward expediting biocuration and, in turn, biomedical research.
