Evaluating semantometrics from computer science publications

Christin Katharina Kreutz, Ralf Schenkel, Premtim Sahitaj

doi:10.1007/s11192-020-03409-5


Open Access



Journal: Scientometrics	Publication Date: Mar 18, 2020
Citations: 8	License type: open-access

Affiliation: University of Trier

Abstract

Identification of important works and assessment of the importance of publications in vast scientific corpora are challenging yet common tasks in many research projects. While the influence of citations in finding seminal papers has been analysed thoroughly, citation-based approaches come with several problems. Their impracticality when confronted with new publications which have not yet received any citations, area-dependent citation practices and different reasons for citing are only a few of their drawbacks. Methods relying on more than citations, for example semantic features such as words or topics contained in publications of citation networks, have received less attention despite providing promising preliminary results. In this work we tackle the issue of classifying publications with their respective referenced and citing papers as either seminal, survey or uninfluential by utilising semantometrics. We use distance measures over words, semantics, topics and publication years of papers in their citation network to engineer features on which we predict the class of a publication. We present the SUSdblp dataset consisting of 1980 labelled entries to provide a means of evaluating this approach. A classification accuracy of up to .9247 was achieved when combining multiple types of features using semantometrics. This is an improvement of .1232 over the current state of the art (SOTA), which uses binary classification to identify papers from the classes seminal and survey. The utilisation of one-vector representations for the ternary classification task resulted in an accuracy of .949, which is an improvement of .1475 over the binary SOTA. Classification based on information available at publication time derived with semantometrics resulted in an accuracy of .8152, while an accuracy of .9323 could be achieved when using one-vector representations.
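The feature engineering step described in the abstract can be illustrated with a minimal sketch. It assumes each paper is already represented as a numeric vector (for example a bag-of-words count vector) and summarises pairwise cosine distances between the referenced and the citing papers of a publication. The function names, the choice of cosine distance, the min/max/mean summary and the toy vectors are all assumptions made for illustration; they are not the authors' feature set or implementation.

```python
from itertools import product
from math import sqrt
from statistics import mean


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_features(referenced, citing):
    """Summarise pairwise distances between referenced and citing papers.

    Both arguments are non-empty lists of vectors. The summary statistics
    (min, max, mean) are a hypothetical choice for this sketch.
    """
    dists = [cosine_distance(r, c) for r, c in product(referenced, citing)]
    return {"min": min(dists), "max": max(dists), "mean": mean(dists)}


# Toy vectors (hypothetical, not taken from SUSdblp).
referenced = [[1, 0, 0], [1, 1, 0]]
citing = [[1, 0, 0], [0, 0, 1]]
features = distance_features(referenced, citing)
```

A dictionary such as `features` could then serve as one row of input for any standard classifier that predicts the classes seminal, survey or uninfluential. Distances over topics or publication years would follow the same pattern with a different vector representation or distance function.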

Highlights

With the ever-growing amount of scientific publications, automatic methods for finding influential or seminal works are indispensable.
Our experiments show the significantly superior performance of one-vector representations compared to the usage of features derived from semantometrics for our dataset.
Two main tasks were examined: classifying a publication with its complete citation network as seminal, survey or uninfluential, and predicting the quality of new papers which have not yet received citations. We dissected the classification of publications in their citation network as seminal, survey or uninfluential papers based on semantometrics derived from our proposed SUSdblp dataset, which is publicly available.

Summary

Introduction

With the ever-growing amount of scientific publications, automatic methods for finding influential or seminal works are indispensable. A majority of research tackles the identification of important works (Gerrish and Blei 2010; Herrmannova et al 2018; Simkin and Roychowdhury 2005; Wade et al 2006; Whalen et al 2005; Zhu et al 2015). Common approaches are based on the number of citations which publications have received. As this indicator can be highly dependent on a specific dataset, it might be problematic to utilise as a measure of impact (Moed 2002; Seglen 1997). Citations need to be handled with care due to cases of self-citations (Jin et al 2007; Schreiber 2007, 2008), varying citation practices in different areas (Cronin and Meho 2006; Jin et al 2007; Seglen 1992, 1997; Shi et al 2010), diverging reasons for citing (Garfield 1964), the absence of citations for new papers (Wade et al 2006) and uncited influences (Garfield 1964; MacRoberts and MacRoberts 2010; Patton et al 2016).

Methods

Results

Discussion

Conclusion