Abstract
Research publications related to the novel coronavirus disease COVID-19 are rapidly growing in number. However, current online literature hubs, even with artificial intelligence, are inadequate for identifying the relative strength of research topics. Hence, we aimed to develop a comprehensive Latent Dirichlet Allocation (LDA) topic model using natural language processing (NLP) techniques, provide visualisations for temporal trends, and apply our methodology to improve existing online literature hubs.
Using the search term “COVID”, abstracts were extracted from PubMed®, from January to July 2020 (N = 16,346). An LDA topic model was trained on 81% of abstracts. Weekly temporal trends were visualised as a heatmap on all abstracts. Then, we tested our methodology on over 23,000 abstracts gathered from January 2020 to September 2020 from LitCovid, a literature hub from the National Center for Biotechnology Information. We used our topic model to subdivide LitCovid’s eight categories into corresponding LDA topics.
The optimised LDA topic model, created using PubMed® data, produced 25 comprehensive topics with no significant overlap. Topic prominence changed over time: “Mental Health” and “Socioeconomic Impact” increased, “Genome Sequence” decreased, and “Epidemiology” remained relatively constant. We identified inadequate representation of “Airborne Transmission Protection”. Importantly, research on masks and PPE is skewed towards clinical applications with a lack of population-based epidemiological research. Our methodology, when applied to LitCovid, identified important topics within each LitCovid category. For example, “Case Report” was split into topics such as “Pulmonary” and “Oncology” as well as the under-represented topics “Haematology” and “Gastroenterology”.
Our work allows for comprehensive topic identification and intuitive visualisation of temporal trends in COVID-19 research. Implementation of the methodology complements existing online literature hubs and identifies underrepresented topics, such as population-based studies on masks, that may be of significant public interest.
Funding Statement: None to declare.
Declaration of Interests: There are no conflicts of interest.
Highlights
The COVID-19 outbreak was officially declared a pandemic by the World Health Organization in March 2020 [1]
To evaluate the temporal trends, we propose a novel method, which is applied to both PubMed® and LitCovid abstracts to produce an intuitive visualisation of the weekly temporal evolution of topic proportions
We provide a generalisable natural language processing (NLP) methodology to extract abstracts from PubMed®, create an optimised Latent Dirichlet Allocation (LDA) topic model, and visualise temporal trends
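The weekly evolution of topic proportions mentioned in these highlights can be illustrated with a minimal sketch. It assumes each abstract already has a publication date and a vector of topic probabilities from a fitted topic model; the function name `weekly_topic_proportions`, the grouping by ISO week, and the toy records are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from datetime import date


def weekly_topic_proportions(records, n_topics):
    """Aggregate per-abstract topic probabilities into weekly proportions.

    records: iterable of (publication_date, topic_probs) pairs, one per abstract.
    Returns a dict mapping (ISO year, ISO week) to a list of topic proportions
    that sums to 1; each list is one row (or column) of a heatmap.
    """
    sums = defaultdict(lambda: [0.0] * n_topics)
    for pub_date, probs in records:
        iso = pub_date.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        for k, p in enumerate(probs):
            sums[key][k] += p

    table = {}
    for key in sorted(sums):
        total = sum(sums[key])
        if total > 0:  # skip weeks with no topic mass to avoid dividing by zero
            table[key] = [s / total for s in sums[key]]
    return table


# Toy data (hypothetical): three abstracts, three topics.
records = [
    (date(2020, 3, 2), [0.7, 0.2, 0.1]),
    (date(2020, 3, 4), [0.5, 0.4, 0.1]),
    (date(2020, 3, 10), [0.1, 0.3, 0.6]),
]
heatmap_rows = weekly_topic_proportions(records, n_topics=3)
for week, row in heatmap_rows.items():
    print(week, [round(p, 2) for p in row])
```

Normalising within each week means the heatmap shows the relative prominence of topics rather than raw publication counts, so a topic can appear to decline even while its absolute number of abstracts grows.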
Summary
The COVID-19 outbreak was officially declared a pandemic by the World Health Organization in March 2020 [1]. Latent Dirichlet Allocation (LDA) is an unsupervised topic modelling technique used to learn hidden topics within a corpus [2]. It assumes topics are a soft clustering of words and outputs two kinds of probability distribution: a distribution of topics in the corpus, and a distribution of words for each topic. Current online literature hubs, even with artificial intelligence, are limited in identifying the complexity of COVID-19 research topics. We developed a comprehensive LDA model with 25 topics using natural language processing (NLP) techniques on PubMed® research articles about “COVID”. We propose a novel methodology to develop and visualise temporal trends, and improve existing online literature hubs. Our topic model demonstrates that research on “masks” and “Personal Protective Equipment (PPE)” is skewed toward clinical applications with a lack of population-based epidemiological research.
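To make the two kinds of LDA output concrete, the following is a toy sketch of LDA fitted by collapsed Gibbs sampling in pure Python. The summary does not say which LDA implementation or inference method was used, so the sampler, the hyperparameters `alpha` and `beta`, the function name `fit_lda`, and the four-document corpus are all assumptions chosen for illustration.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of tokenised documents (lists of words).
    Returns (vocab, theta, phi): theta[d] is the topic distribution of
    document d; phi[k] is the word distribution of topic k over vocab.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                           # total tokens assigned to topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = assign[d][i], index[w]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates of the two families of distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi


# Hypothetical mini-corpus standing in for tokenised abstracts.
docs = [
    "mask ppe mask respirator ppe".split(),
    "genome sequence genome mutation sequence".split(),
    "ppe mask respirator mask".split(),
    "mutation genome sequence genome".split(),
]
vocab, theta, phi = fit_lda(docs, n_topics=2)
for k, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print("topic", k, [w for _, w in top])
```

Here `theta` holds one topic distribution per document; averaging its rows gives a corpus-level topic mix, while each row of `phi` is the word distribution for one topic. A real corpus would also need preprocessing (tokenisation, stop-word removal) and a tuned number of topics, which this sketch omits.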