Abstract
In this study, we attempt to assess the value of the term Big Data when used by researchers in their publications. For this purpose, we systematically collected a corpus of biomedical publications that either use or do not use the term Big Data. These documents were used as input to a machine learning classifier to determine how well they can be separated into two groups and to identify the most distinguishing classification features. We generated 100 classifiers that could correctly distinguish between Big Data and non-Big Data documents with an area under the Receiver Operating Characteristic (ROC) curve of 0.96. The differences between the two groups were characterized by terms specific to Big Data themes (such as 'computational', 'mining', and 'challenges') and also by terms that indicate the research field, such as 'genomics'. ROC curves plotted for various time intervals showed no difference over time. We conclude that there is a detectable and stable difference between publications that use the term Big Data and those that do not. Furthermore, the use of the term Big Data within a publication seems to indicate a distinct type of research in the biomedical field. Therefore, we conclude that value can be attributed to the term Big Data when used in a publication and that this value has not changed over time.
Highlights
With approximately 3700 documents mentioning Big Data in the PubMed library between 2011 and the time of writing, it can be said that the term Big Data is widely used in biomedical research
In this research, we investigated whether Big Data (BD) literature in the biomedical field can be distinguished from literature that does not use the term
We found no trends over time that indicate a change in the distinguishability between BD and non-Big Data (NBD) documents
Summary
With approximately 3700 documents mentioning Big Data in the PubMed library between 2011 and the time of writing, it can be said that the term Big Data is widely used in biomedical research. This does not mean that a clear-cut meaning of the term is being applied, as the many publications written on the subject, both formal and informal, attest. This view is supported by publications such as Tian et al. [1] and Mayer-Schonberger et al. [2], which state that there is no rigorous definition of Big Data and that it remains something of a work in progress. The degree to which BD documents can be separated from NBD documents gives insight into the value of the Big Data term, and inspecting the distinctive features tells us something about its meaning. The influence of a possible hype effect can be measured through the change in the value of the term over time.
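To make the idea of separability concrete, the following is a minimal sketch of how a set of BD and NBD documents could be scored and how the area under the ROC curve could be computed from those scores. It is an illustration only, not the classifier used in the study: the smoothed log-odds term weighting, the function names, and the toy documents are all assumptions introduced here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def train_log_odds(bd_docs, nbd_docs):
    """Smoothed log-odds that a term occurs in a BD rather than an NBD document."""
    bd_df = Counter(t for d in bd_docs for t in set(tokenize(d)))
    nbd_df = Counter(t for d in nbd_docs for t in set(tokenize(d)))
    weights = {}
    for term in set(bd_df) | set(nbd_df):
        # Add-one smoothing on document frequencies avoids division by zero.
        p_bd = (bd_df[term] + 1) / (len(bd_docs) + 2)
        p_nbd = (nbd_df[term] + 1) / (len(nbd_docs) + 2)
        weights[term] = math.log(p_bd / p_nbd)
    return weights


def score(doc, weights):
    """Sum of term weights; higher means more BD-like. Unknown terms count as 0."""
    return sum(weights.get(t, 0.0) for t in set(tokenize(doc)))


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of the AUC).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy corpus, invented for illustration (not data from the study).
bd_train = ["computational mining of genomics data",
            "challenges in mining health records"]
nbd_train = ["randomized trial of a new drug",
             "case report of a rare disease"]
held_out = ["mining genomics challenges", "drug trial report"]
held_out_labels = [1, 0]  # 1 = BD, 0 = NBD

weights = train_log_odds(bd_train, nbd_train)
auc = roc_auc(held_out_labels, [score(d, weights) for d in held_out])
print("AUC on held-out toy documents:", auc)
```

In this sketch, terms with large positive weights play the role of the distinguishing features, and recomputing the AUC separately for documents from different publication periods would be one way to look for a change in separability over time.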