Text Processing Pipeline Research Articles

Background: The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. Here, we adopt a triage and diagnosis approach to analyzing social media posts using machine learning techniques for the purpose of disease detection and surveillance. We thus obtain useful prevalence and incidence statistics to identify disease symptoms and their severities, motivated by public health concerns.

Objective: This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts in order to provide researchers and public health practitioners with additional information on the symptoms, severity, and prevalence of the disease rather than to provide an actionable decision at the individual level.

Methods: The text processing pipeline first extracted COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients’ posts using conditional random fields. An unsupervised rule-based algorithm was then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations were subsequently used to construct 2 different vector representations of each post. These vectors were separately applied to build support vector machine learning models to triage patients into 3 categories and diagnose them for COVID-19.

Results: We reported macro- and microaveraged F1 scores in the range of 71%-96% and 61%-87%, respectively, for the triage and diagnosis of COVID-19 when the models were trained on human-labeled data. Our experimental results indicated that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. In addition, we highlighted important features uncovered by our diagnostic machine learning models and compared them with the most frequent symptoms revealed in another COVID-19 data set. In particular, we found that the most important features are not always the most frequent ones.

Conclusions: Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from social media natural language narratives, using a machine learning pipeline in order to provide information on the severity and prevalence of the disease for use within health surveillance systems.
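The abstract above reports both macro- and microaveraged F1 scores. As a minimal sketch of how the two averages differ, the following computes both from scratch for a single-label, multiclass task. The function name `f1_scores` and the class labels in the usage note are hypothetical and chosen only for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, so every post counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

For example, with the made-up labels `y_true = ["mild", "mild", "mild", "severe"]` and `y_pred = ["mild", "mild", "severe", "severe"]`, the per-class F1 scores are 0.8 and about 0.667, so the macro average is about 0.733, while the micro average is 0.75. In the single-label setting the micro average equals plain accuracy, which is why macro averaging is often the more informative figure when classes are imbalanced.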


Automatic curation of consumer-generated, opioid-related social media big data may enable real-time monitoring of the opioid epidemic in the United States. The objective of this study was to develop and validate an automatic text-processing pipeline for geospatial and temporal analysis of opioid-mentioning social media chatter.

This cross-sectional, population-based study was conducted from December 1, 2017, to August 31, 2019, and used more than 3 years of publicly available social media posts on Twitter, dated from January 1, 2012, to October 31, 2015, that were geolocated in Pennsylvania. Opioid-mentioning tweets were extracted using prescription and illicit opioid names, including street names and misspellings. Social media posts (tweets) (n = 9006) were manually categorized into 4 classes, and several machine learning algorithms were trained and evaluated. Temporal and geospatial patterns were analyzed with the best-performing classifier on unlabeled data. Pearson and Spearman correlations of county- and substate-level abuse-indicating tweet rates with opioid overdose death rates from the Centers for Disease Control and Prevention WONDER database and with 4 metrics from the National Survey on Drug Use and Health (NSDUH) for 3 years were calculated. Classifier performance was measured through microaveraged F1 scores (harmonic mean of precision and recall) or accuracies and 95% CIs.

A total of 9006 social media posts were annotated, of which 1748 (19.4%) were related to abuse, 2001 (22.2%) were related to information, 4830 (53.6%) were unrelated, and 427 (4.7%) were not in the English language. Yearly rates of abuse-indicating social media posts showed statistically significant correlation with county-level opioid-related overdose death rates (n = 75) for 3 years (Pearson r = 0.451, P < .001; Spearman r = 0.331, P = .004). Abuse-indicating tweet rates showed consistent correlations with 4 NSDUH metrics (n = 13) associated with nonmedical prescription opioid use (Pearson r = 0.683, P = .01; Spearman r = 0.346, P = .25), illicit drug use (Pearson r = 0.850, P < .001; Spearman r = 0.341, P = .25), illicit drug dependence (Pearson r = 0.937, P < .001; Spearman r = 0.495, P = .09), and illicit drug dependence or abuse (Pearson r = 0.935, P < .001; Spearman r = 0.401, P = .17) over the same 3-year period, although the tests lacked power to demonstrate statistical significance. A classification approach involving an ensemble of classifiers produced the best performance in accuracy or microaveraged F1 score (0.726; 95% CI, 0.708-0.743).

The correlations obtained in this study suggest that a social media-based approach reliant on supervised machine learning may be suitable for geolocation-centric monitoring of the US opioid epidemic in near real time.
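The abstract above reports Pearson and Spearman coefficients side by side, and the two can differ noticeably on the same data. The following is a minimal sketch of both coefficients in plain Python, assuming equal-length numeric sequences with nonzero variance; the function names `pearson`, `ranks`, and `spearman` are hypothetical, and this is not the study's analysis code.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))
```

On made-up data such as `x = [1, 2, 3, 4, 5]` and `y = [1, 4, 9, 16, 25]`, the relationship is perfectly monotonic but not linear, so `spearman(x, y)` is 1 while `pearson(x, y)` is slightly below 1. The reverse pattern, a high Pearson value with a lower Spearman value as in several of the figures reported above, can arise when a few extreme points dominate the linear fit.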


Articles published on Text Processing Pipeline

Development and validation of a pulmonary function test data extraction tool for the US department of veterans affairs electronic health record

How Could Semantic Processing and Other NLP Tools Improve Online Legal Databases?

Automated Matching of Patients to Clinical Trials: A Patient-Centric Natural Language Processing Approach for Pediatric Leukemia.

Optimizing healthcare system by amalgamation of text processing and deep learning: a systematic review.

Discovering Content through Text Mining for a Synthetic Biology Knowledge System.

Monitoring COVID-19 on Social Media: Development of an End-to-End Natural Language Processing Pipeline Using a Novel Triage and Diagnosis Approach.

Using Machine Learning to Collect and Facilitate Remote Access to Biomedical Databases: Development of the Biomedical Database Inventory.

Machine Learning and Natural Language Processing for Geolocation-Centric Monitoring and Characterization of Opioid-Related Social Media Chatter

Data-driven method to enhance craniofacial and oral phenotype vocabularies

Automated extraction of ophthalmic surgery outcomes from the electronic health record

Automatic Language Identification in Texts: A Survey

Automatic Processing of User-Generated Content for the Description of Energy-Consuming Activities at Individual and Group Level

Beyond accuracy: creating interoperable and scalable text-mining web services.

ADT Linked to Increased Risk of Alzheimer's Disease

Androgen Deprivation Therapy and Future Alzheimer's Disease Risk.

Mining texts to efficiently generate global data on political regime types

Detection of sentence boundaries and abbreviations in clinical narratives.

Generalising semantic category disambiguation with large lexical resources for fun and profit.

Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records

BeCAS: biomedical concept recognition services and visualization
