Abstract
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.
Highlights
Citations are an essential tool for scientific practice
Concerning reason (2), because unarXive is built from papers on the preprint server arxiv.org and the Microsoft Academic Graph (MAG) contains metadata on papers' preprint and published versions, we can analyze whether cross-lingual citations are affected by the peer review process
To assess the relative degree of self-citation when referring to publications in other languages, we compare the ratio of self-citations in (a) the cross-lingual citations within the documents of the cross-lingual set, and (b) the monolingual citations within the documents of the cross-lingual set
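The self-citation comparison described in the last highlight can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Citation` record, its field names, and the rule that a citation counts as a self-citation when the citing and cited papers share at least one author name are all assumptions made for illustration; the source does not state how self-citations are identified.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    """Hypothetical record for one reference in a citing paper."""
    citing_authors: frozenset  # author names of the citing paper
    cited_authors: frozenset   # author names of the cited paper
    cited_language: str        # language code of the cited paper, e.g. "en"


def is_self_citation(c: Citation) -> bool:
    # Assumption: a self-citation shares at least one author name.
    return bool(c.citing_authors & c.cited_authors)


def self_citation_ratio(citations: Iterable[Citation]) -> Optional[float]:
    """Share of self-citations, or None if there are no citations."""
    citations = list(citations)
    if not citations:
        return None
    return sum(is_self_citation(c) for c in citations) / len(citations)


def compare_self_citation(
    citations: Iterable[Citation], paper_language: str = "en"
) -> Tuple[Optional[float], Optional[float]]:
    """Return (cross-lingual ratio, monolingual ratio) for one set of papers."""
    citations = list(citations)
    cross = [c for c in citations if c.cited_language != paper_language]
    mono = [c for c in citations if c.cited_language == paper_language]
    return self_citation_ratio(cross), self_citation_ratio(mono)
```

Both ratios are computed over citations from the same documents (the cross-lingual set), so a difference between them reflects how authors cite other-language work rather than a difference between author groups. Matching authors by name alone is a simplification; real data would likely need author disambiguation.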
Summary
Citations are an essential tool for scientific practice. By allowing authors to refer to existing publications, citations make it possible to position one's work within the context of others', to critique and compare, and to point readers to supplementary reading material. Because English is currently the de facto academic lingua franca [37], citations from non-English languages to English are significantly more prevalent than the other way around. This dichotomy is reflected in existing literature, where usually either citations from English [24,29] or to English [20,21,41,44] are analyzed. We conduct an analysis of cross-lingual citations in English papers that is considerably more extensive than existing literature in terms of corpus size as well as covered languages, time, and disciplines. This makes the results more representative of the areas covered, and enables the use of our collected data for machine learning-based applications such as cross-lingual citation recommendation. Parts within the text of a paper that contain a marker connected to one of the reference section entries are called in-text citations.