Abstract

The number of scientific publications is growing exponentially. Research articles cite other work for various reasons, and citations have therefore been studied extensively as a way to associate documents. It is argued that not all references carry the same level of importance, so it is essential to understand the reason for a citation, called the citation intent or function. Text can contribute to this task when recent natural language processing techniques are applied to capture its context. In this paper, we use contextualized word embeddings to obtain a numerical representation of text features. We then investigate the performance of various machine-learning techniques on this numerical representation. Each classifier was evaluated on two state-of-the-art datasets containing the text features. On the unbalanced dataset, the linear Support Vector Machine (SVM) achieved 86% accuracy for the “background” class, for which training data was extensive. For the remaining classes, including “motivation,” “extension,” and “future,” the classifier was trained on fewer than 100 records, and accuracy was only 57 to 64%. On the balanced dataset, each class reached the same accuracy, as all classes were trained on the same amount of data. Overall, SVM performed best on both datasets, followed by the stochastic gradient descent classifier; SVM can therefore produce good results for text classification on top of contextual word embeddings.

Highlights

  • Features available in the Association for Computational Linguistics-Anthology Reference Corpus (ACL-ARC): citing paper title, cited paper title, citing author, and cited author. Citation intent classes listed: uses, comparison, motivation, extension, and future work. A balanced SciCite dataset is also considered

  • We further investigated the performance of various machine-learning techniques on the numerical representation of text. The performance of each of the classifiers was evaluated on two state-of-the-art datasets containing the text features

  • In the case of the unbalanced dataset, we observed that the linear Support Vector Machine (SVM) achieved 86% accuracy for the “background” class, where the training was extensive
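The per-class figures quoted above can be illustrated with a small sketch. It assumes that "accuracy for a class" means the fraction of that class's true examples that were predicted correctly (per-class recall); the source does not define the metric, and the function name and toy labels below are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of each true class that was predicted correctly.

    Assumption: this per-class recall is what "accuracy for a class"
    refers to; the source does not spell out the definition.
    """
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Toy, made-up labels (not results from the paper): one large class
# and two small ones, mimicking an unbalanced dataset.
y_true = ["background"] * 5 + ["motivation"] * 2 + ["future"]
y_pred = ["background"] * 4 + ["motivation", "motivation", "background", "future"]

scores = per_class_accuracy(y_true, y_pred)
```

With so few examples in the small classes, a single misclassification moves their score by a large amount, which is one reason per-class numbers on rare classes are less stable than on a well-populated class.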


Summary

Proposed Study Framework

We discuss the steps of the proposed study, as depicted in Figure 2. The flow of the proposed study starts with the data processing and cleaning step, followed by converting the text data to a numeric representation. The dataset includes, along with some other unimportant features, the name of the section in which the in-text citation is placed, the citing and cited paper IDs, the citation context, the citation intent class, and the confidence level of the annotated citation intent class. The second state-of-the-art dataset contains citation intent annotations in only three classes: background, method, and result. To keep the datasets consistent, and to compare and evaluate the results on both of them, we made a balanced version of SciCite that includes the missing features required for our study. This study is based on the features selected from both of the datasets discussed in the previous section.
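One way to build a balanced version of a labeled dataset can be sketched as follows. This is a minimal illustration that assumes balancing by random downsampling of every class to the size of the smallest class; the source does not say how the balanced SciCite was actually constructed, and the function name, the `intent` and `context` keys, and the toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(records, label_key="intent", seed=0):
    """Return a class-balanced copy of `records`.

    Assumption: balance is reached by randomly downsampling every class
    to the size of the smallest class. The source does not describe the
    authors' actual balancing procedure.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)
    if not by_class:
        return []
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], target))
    rng.shuffle(balanced)
    return balanced


# Toy records (made up) using the three classes named in the text.
toy = (
    [{"context": f"background sentence {i}", "intent": "background"} for i in range(6)]
    + [{"context": f"method sentence {i}", "intent": "method"} for i in range(3)]
    + [{"context": f"result sentence {i}", "intent": "result"} for i in range(2)]
)

balanced = balance_by_downsampling(toy)
counts = Counter(rec["intent"] for rec in balanced)
```

Downsampling discards data from the larger classes; it is shown here only because it is simple to state, and other strategies (such as oversampling) would serve the same purpose of giving every class the same amount of training data.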

