Abstract

Authors are often unaware that the PDF files they publish carry hidden information, and that a file can contain more information than its visible content. This work focuses on how PDF files are published in the scientific community. We analyzed a corpus of 555,865 PDF files to show that both direct and modified PDF authoring processes leak sensitive information about the researchers. Our analysis of the extracted metadata shows that at least 23% of the PDF files in our dataset contain valuable information about the authoring process. We were even able to solve the co-authorship (multiple authors) problem by cross-referencing information from multiple PDF files using linear algebra. We believe that PDF sanitization needs to be included in scientific publication processes to avoid the leakage of sensitive information. We explore and suggest strategies for the safer distribution of scientific work by researchers.
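To make the notion of authoring metadata concrete, the following is a minimal sketch of reading a few document information entries from raw PDF bytes. It is not the extraction pipeline used in this work. The function name `extract_info_fields`, the key list, and the sample bytes (including the author name) are hypothetical, and the sketch assumes the entries are stored as uncompressed literal strings without nested or escaped parentheses; real files often use compressed object streams, hex strings, or XMP packets, which need a full PDF parser.

```python
import re

# Keys commonly found in a PDF document information dictionary.
INFO_KEYS = ("Author", "Creator", "Producer", "Title", "CreationDate", "ModDate")


def extract_info_fields(pdf_bytes: bytes) -> dict:
    """Return literal-string Info entries found in raw PDF bytes.

    Assumption: each entry looks like /Producer (some text), is not
    compressed, and contains no nested or escaped parentheses.
    """
    fields = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()\\]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            # Latin-1 maps every byte to a character, so decoding cannot fail.
            fields[key] = match.group(1).decode("latin-1")
    return fields


# Hypothetical fragment of a PDF object, for illustration only.
sample = (
    b"1 0 obj\n"
    b"<< /Author (Jane Doe) /Creator (LaTeX with hyperref)\n"
    b"   /Producer (pdfTeX-1.40.21) >>\n"
    b"endobj\n"
)

print(extract_info_fields(sample))
```

Even this small fragment names a person and the software used to produce the file, which is the kind of information a sanitization step would aim to remove before distribution.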
