Abstract

Classifying scientific articles, patents, and other documents according to the relevant research topics is an important task, which enables a variety of functionalities, such as categorising documents in digital libraries, monitoring and predicting research trends, and recommending papers relevant to one or more topics. In this paper, we present the latest version of the CSO Classifier (v3.0), an unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive taxonomy of research areas in the field of Computer Science. The CSO Classifier takes as input the textual components of a research paper (usually title, abstract, and keywords) and returns a set of research topics drawn from the ontology. This new version includes a new component for discarding outlier topics and offers improved scalability. We evaluated the CSO Classifier on a gold standard of manually annotated articles, demonstrating a significant improvement over alternative methods. We also present an overview of applications adopting the CSO Classifier and describe how it can be adapted to other fields.
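The input/output contract described above (textual components in, ontology topics out) can be illustrated with a deliberately small sketch. This is not the CSO Classifier's algorithm or API: the ontology entries, the `classify` function, and the plain label-matching strategy are all assumptions made for illustration only.

```python
import re

# Hypothetical miniature ontology: a handful of topic labels.
# These entries are illustrative and are NOT taken from the real CSO.
TOY_ONTOLOGY = [
    "semantic web",
    "ontology",
    "topic model",
    "machine learning",
    "digital libraries",
]


def classify(paper):
    """Return the set of toy-ontology topics whose label occurs in the paper.

    `paper` is a dict with optional "title", "abstract" and "keywords"
    string fields, mirroring the textual components named in the abstract.
    Matching is a naive whole-word label search (with an optional plural
    "s"), standing in for the real classifier's far richer pipeline.
    """
    text = " ".join(
        paper.get(field, "") for field in ("title", "abstract", "keywords")
    ).lower()
    found = set()
    for topic in TOY_ONTOLOGY:
        pattern = r"\b" + re.escape(topic) + r"s?\b"
        if re.search(pattern, text):
            found.add(topic)
    return found


# Usage: a made-up paper record, not one from the gold standard.
paper = {
    "title": "An ontology for the semantic web",
    "abstract": "We compare our approach against topic models.",
    "keywords": "scholarly data",
}
topics = classify(paper)
print(sorted(topics))
```

The point of the sketch is only the shape of the interface: a record of text fields goes in, and a set of labels drawn from a controlled vocabulary comes out.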

Highlights

  • Characterising scholarly documents according to their relevant research topics enables a variety of functionalities, such as: (i) semantically enhancing the metadata of scientific publications, (ii) categorising proceedings in digital libraries, (iii) producing smart analytics, (iv) generating recommendations, and (v) detecting research trends [53]

  • We present the latest version of the CSO Classifier (v3.0), a scalable solution for automatically classifying research papers according to the Computer Science Ontology

  • The upper panel shows the results of the 14 approaches discussed in [54], while the lower panel summarises the results of the nine new versions of the CSO Classifier

Summary

Introduction

Characterising scholarly documents according to their relevant research topics enables a variety of functionalities, such as: (i) semantically enhancing the metadata of scientific publications, (ii) categorising proceedings in digital libraries, (iii) producing smart analytics, (iv) generating recommendations, and (v) detecting research trends [53]. State-of-the-art approaches either classify papers in a top-down fashion, taking advantage of pre-existent categories from domain vocabularies, such as MeSH, PhySH, and the STW Thesaurus for Economics, or instead proceed in a bottom-up fashion, by means of topic detection methods, such as probabilistic topic models [8,24]. The first solution has the advantage of relying on a set of formally defined research topics associated with human-readable labels; however, it requires such a controlled vocabulary to be available. The natural step was to develop a classifier that supports the annotation of research papers according to CSO [54].
