Abstract
Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.
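The four workflow steps described above can be illustrated with a minimal Python sketch. It is not PubRunner's actual interface: the function `run_update` and the stand-ins `fake_fetch`, `count_words`, and `fake_publish` are hypothetical names introduced only for illustration. The sketch writes small local files instead of contacting PubMed, FTP, or Zenodo, and a simple word counter stands in for the user-defined tool.

```python
from pathlib import Path
from collections import Counter
import tempfile


def run_update(workdir, fetch, tool, publish, announce):
    """Run one update cycle: fetch corpus, run tool, publish, announce.

    All four callables are hypothetical hooks supplied by the caller.
    Returns the public location of the results, or None if nothing was fetched.
    """
    corpus_dir = workdir / "corpus"
    output_dir = workdir / "output"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    files = fetch(corpus_dir)          # step 1: get the latest abstracts
    if not files:
        return None                    # nothing new, so skip the run
    result = tool(files, output_dir)   # step 2: run the user-defined tool
    location = publish(result)         # step 3: push results somewhere public
    announce(location)                 # step 4: publicize where the results are
    return location


def fake_fetch(corpus_dir):
    """Stand-in for a PubMed download: writes two made-up abstracts."""
    texts = ["BRCA1 mutations in breast cancer", "insulin signalling in cancer"]
    paths = []
    for i, text in enumerate(texts):
        path = corpus_dir / f"abstract_{i}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def count_words(files, output_dir):
    """Stand-in for a text mining tool: counts lowercase tokens."""
    counts = Counter()
    for path in files:
        counts.update(path.read_text(encoding="utf-8").lower().split())
    out = output_dir / "word_counts.txt"
    lines = [f"{word}\t{n}" for word, n in sorted(counts.items())]
    out.write_text("\n".join(lines), encoding="utf-8")
    return out


def fake_publish(result_path):
    """Stand-in for an FTP or Zenodo upload: returns a made-up location."""
    return "example://results/" + result_path.name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_update(Path(tmp), fake_fetch, count_words, fake_publish, print))
```

Passing the steps in as functions mirrors the idea that the framework wraps around an existing tool: only the `tool` argument would change when a different text mining tool is used.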
Highlights
The National Library of Medicine’s (NLM) PubMed database contains over 27 million citations and is growing exponentially (Lu, 2011)
To encourage biomedical text mining researchers to widely share their results and code, and keep analyses up-to-date, we present PubRunner
It wraps around a text mining tool and manages regular updates using the latest publications from PubMed
Summary
PubRunner can upload data to Zenodo, which is a data repository designed for very large datasets to encourage open science. This will allow the output of text mining tools to be kept publicly available permanently. This data can be used for interesting analysis on term similarity or as a useful input to other machine learning algorithms (Mehryary et al.). This resource is valuable to the biomedical community, requires substantial compute and storage to create (which may be outside the capability of smaller research groups), and is a good example of a resource that should be kept up-to-date. We hope this shows that PubRunner can be used with real text mining tools as well as with the test cases that we had previously shown.
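As an illustration of the term similarity analysis mentioned above, the following sketch ranks terms by cosine similarity between their word vectors. The three-dimensional vectors in `toy_vectors` are invented for the example and are not taken from any real word2vec output, and the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity with a zero vector is treated as 0 here
    return dot / (norm_u * norm_v)


def most_similar(term, vectors, top_n=3):
    """Return the top_n other terms ranked by cosine similarity to term."""
    query = vectors[term]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up vectors for illustration only.
toy_vectors = {
    "tumor": [0.9, 0.1, 0.0],
    "carcinoma": [0.8, 0.2, 0.1],
    "insulin": [0.0, 0.9, 0.4],
}

if __name__ == "__main__":
    for other, score in most_similar("tumor", toy_vectors):
        print(f"{other}\t{score:.3f}")
```

With real vectors, the same ranking would be computed over a vocabulary of many thousands of terms, where a vectorized library would likely be preferable to pure Python.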