Abstract

The availability of large scholarly full‐text datasets with annotated in‐text citations opens the opportunity to investigate, at scale, how articles have been cited in scientific literature. However, duplicate documents may exist in a dataset, and these duplicates may affect downstream analyses such as calculating citation counts. Document conflation is the task of identifying documents that are nearly identical to each other. This study evaluates document conflation in the Semantic Scholar Open Research Corpus (S2ORC), a dataset containing over 12 million scholarly articles. The evaluation was based on 6,099,232 full‐text S2ORC documents with PubMed IDs (PMIDs) or PubMed Central IDs (PMCIDs). Our findings showed that a portion of S2ORC might contain duplicates. Of the 6,099,232 full‐text documents, 1,280,196 (20.99%) had the same PMIDs or PMCIDs as at least one other document. Pairwise comparisons of their full text found that at least 9.44% of the documents in S2ORC had duplicates.
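The two-step procedure described above (group documents that share a PMID or PMCID, then compare full text pairwise within each group) can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold the study used, so the Jaccard similarity over word shingles, the `threshold=0.9` default, and all function names below are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle; empty texts yield no shingles.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicate_pairs(docs, threshold=0.9):
    """Find near-duplicate pairs among documents sharing an external ID.

    docs: iterable of (doc_id, external_id, full_text) tuples, where
    external_id stands in for a PMID or PMCID (hypothetical input layout).
    """
    # Step 1: group documents by their shared external identifier.
    groups = defaultdict(list)
    for doc_id, external_id, text in docs:
        groups[external_id].append((doc_id, shingles(text)))

    # Step 2: compare full text pairwise, but only within each group.
    pairs = []
    for members in groups.values():
        for (id_a, sh_a), (id_b, sh_b) in combinations(members, 2):
            if jaccard(sh_a, sh_b) >= threshold:
                pairs.append((id_a, id_b))
    return pairs
```

Restricting comparisons to documents with the same identifier keeps the number of pairs manageable, which matters at the scale of millions of documents. It also means that duplicates without a shared identifier go undetected, consistent with the abstract reporting its duplicate rate as a lower bound ("at least 9.44%").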

