Abstract

Quantile regression presents a complete picture of the effects on the location, scale, and shape of the dependent variable at all points, not just the mean. We focus on two challenges for citation count analysis by quantile regression: discontinuity and substantial mass points at lower counts. A Bayesian hurdle quantile regression model for count data with a substantial mass point at zero was proposed by King and Song (2019). It uses quantile regression to model the nonzero data and logistic regression to model the probability of zeros versus nonzeros. We show that substantial mass points for low citation counts will almost certainly also affect parameter estimation in the quantile regression part of the model, similar to a mass point at zero. We update the King and Song model by shifting the hurdle point past the main mass points. The updated model delivers more accurate quantile regression for moderately to highly cited articles, especially at quantiles corresponding to values just beyond the mass points, and enables estimates of the extent to which factors influence the chances that an article will be low cited. To illustrate the potential of this method, we apply it to simulated citation counts and data from Scopus.

Highlights

  • Citation analysis can help to estimate the relative importance or impact of articles by counting the number of times that they have been cited by other works

  • Various statistical models have been proposed for citation counts (e.g., Brzezinski, 2015; Eom and Fortunato, 2011; Garanina and Romanovsky, 2016; Low et al., 2016; Redner, 1998; Seglen, 1992; Shahmandi et al., 2020; Thelwall, 2016; Thelwall and Wilson, 2014), but most have sought to model the conditional mean of citation counts from independent variables

  • Quantile regression (QR) is a statistical method proposed by Koenker and Bassett (1978) to complement classical linear regression analysis (e.g., Coad and Rao, 2008; Koenker and Hallock, 2001)

Summary

INTRODUCTION

Citation analysis can help to estimate the relative importance or impact of articles by counting the number of times that they have been cited by other works. This paper uses simulations of log-normal continuous data with substantial mass points at zero, one, two, and three, which approximate a common distribution of citation counts. It assesses whether the QR part of the two-part model with a hurdle at three yields more accurate estimates than the other models, judged by the mean squared error of the estimated coefficients of the independent variables. The simulation results show that the model with a hurdle at three generally returns more accurate estimates at quantiles just beyond the hurdle, based on the mean squared errors of the parameter estimates (excluding the intercept).
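
The data setup described above can be sketched in a few lines of Python. This is only an illustration, not the authors' simulation design: the mass-point probabilities, the coefficients, the shift of the continuous part to lie above three, and the function names (simulate_counts, split_at_hurdle, check_loss) are all assumptions made for the example. It shows how a hurdle at three splits the data into a binary part (at or below the hurdle versus above it, the input to the logistic component) and a continuous part (values above the hurdle, the input to the QR component), together with the check loss of Koenker and Bassett (1978) that QR minimizes.

```python
import math
import random


def simulate_counts(n, beta0=1.0, beta1=0.5,
                    mass_probs=(0.15, 0.10, 0.08, 0.07), seed=0):
    """Draw (x, y) pairs with mass points at 0, 1, 2, 3.

    With probability mass_probs[k], y equals k exactly. Otherwise y is
    log-normal given x and shifted to lie above 3. All numeric settings
    here are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.random()
        cumulative = 0.0
        y = None
        for k, p in enumerate(mass_probs):
            cumulative += p
            if u < cumulative:
                y = float(k)
                break
        if y is None:
            # Continuous log-normal part, placed beyond the mass points.
            y = 3.0 + math.exp(beta0 + beta1 * x + rng.gauss(0.0, 1.0))
        data.append((x, y))
    return data


def split_at_hurdle(data, hurdle=3):
    """Return the two inputs of a two-part hurdle model.

    flags:   (x, 1 if y > hurdle else 0) for every observation,
             the response for the logistic component.
    qr_part: (x, y) only for observations above the hurdle,
             the sample used by the quantile regression component.
    """
    flags = [(x, 1 if y > hurdle else 0) for x, y in data]
    qr_part = [(x, y) for x, y in data if y > hurdle]
    return flags, qr_part


def check_loss(residual, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


data = simulate_counts(2000)
flags, qr_part = split_at_hurdle(data, hurdle=3)
share_low = 1.0 - sum(flag for _, flag in flags) / len(flags)
print(f"share at or below hurdle: {share_low:.3f}")
print(f"observations left for QR: {len(qr_part)}")
```

Moving the hurdle from zero to three changes only the threshold passed to split_at_hurdle: the mass points at one, two, and three then fall into the binary part instead of the sample used for quantile regression.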
