Abstract

Objective: The objective is to develop an ensemble of machine learning algorithms to identify multilingual, online articles that are relevant to biosurveillance. Language morphology varies widely across languages and must be accounted for when designing algorithms. Here, we compare the performance of a word embedding-based approach and a topic modeling approach, each combined with machine learning algorithms, to determine the best method for the Chinese, Arabic, and French languages.

Introduction: Global biosurveillance is an extremely important yet challenging task. One form of global biosurveillance comes from harvesting open-source online data (e.g., news, blogs, reports, RSS feeds). The information derived from these data can be used for timely detection and identification of biological threats all over the world. However, the more inclusive the data harvesting procedure is made, to ensure that all potentially relevant articles are collected, the more irrelevant data is also harvested. This issue becomes even more complex when the online data are in a non-native language. Foreign-language articles not only create language-specific issues for Natural Language Processing (NLP) but also add significant translation costs. Previous work shows success in the use of combinatory monolingual classifiers in specific applications, e.g., the legal domain [1]. A critical component of a comprehensive, online-harvesting biosurveillance system is the capability to separate relevant foreign-language articles from irrelevant ones based on the initial article information collected, without the additional cost of full-text retrieval and translation.

Methods: The analysis text dataset contains the title and brief description of 3506 online articles in the Chinese, Arabic, and French languages from the date range of August 17, 2016 to July 5, 2017. The NLP article pre-processing steps are language-specific tokenization and stop word removal. We compare two different approaches: word embeddings and topic modeling (Fig. 1). For word embeddings, we first generate word vectors for the data using a pretrained Word2Vec (W2V) model [2]. Subsequently, the word vectors within a document are averaged to produce a single feature vector for the document. Then, we fit a machine learning algorithm (a random forest classifier or a Support Vector Machine (SVM)) to the training vectors and obtain predictions for the test documents. For topic modeling, we use a Latent Dirichlet Allocation (LDA) model to generate five topics from all relevant documents [3]. For each new document, the output is the probability of the document belonging to each of these five topics. We classify the new document by comparing this probability measure with a relevancy threshold. Illustrative sketches of both pipelines follow the abstract.

Results: The Word2Vec model combined with a random forest classifier outperformed the other approaches across the three languages (Fig. 2); the Chinese model has an 89% F1-score, the Arabic model 86%, and the French model 94%. To decrease the chance of labeling a potentially relevant article irrelevant, high recall was more important than high precision. In the Chinese model, the Word2Vec with random forest approach had the highest recall at 98% (Table 1).

Conclusions: We present research findings on different approaches to identifying biosurveillance-relevant content in non-English texts and identify the best-performing methods for implementation into a biosurveillance online article harvesting system.
Our initial results suggest that the word-embedding model has an advantage over topic modeling and that the random forest classifier outperforms the SVM. Future work will further expand the list of languages and methods to be compared, e.g., n-grams and non-negative matrix factorization. In addition, we will fine-tune the Arabic and French models for better accuracy.
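
The following is a minimal sketch of the word-embedding pipeline described in the Methods section, assuming a gensim-format pretrained Word2Vec model and documents that have already been tokenized and stripped of stop words. The file path, toy documents, and the doc_vector helper are illustrative placeholders, not part of the original system.

```python
# Sketch: average pretrained Word2Vec vectors per document, then fit a
# random forest classifier on the resulting document vectors.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.ensemble import RandomForestClassifier

w2v = KeyedVectors.load_word2vec_format("pretrained_zh.vec")  # hypothetical path

def doc_vector(tokens, kv):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

# Toy training data: token lists with labels (1 = relevant, 0 = irrelevant).
train_docs = [["疫情", "爆发"], ["足球", "比赛"]]
train_labels = [1, 0]
X_train = np.vstack([doc_vector(d, w2v) for d in train_docs])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, train_labels)

# Predict relevance for a new (toy) document.
X_test = np.vstack([doc_vector(["病毒", "传播"], w2v)])
print(clf.predict(X_test))  # 1 = relevant to biosurveillance
```

The same document vectors can be passed to an SVM (e.g., sklearn.svm.SVC) in place of the random forest to reproduce the second classifier compared in the abstract.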

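The topic-modeling pipeline can be sketched as below. The abstract does not state which probability measure is compared against the relevancy threshold, so using the maximum probability across the five LDA topics is an assumption; the threshold value and toy documents are likewise illustrative only.

```python
# Sketch: fit a 5-topic LDA model on relevant training documents, then score
# a new document by its topic probabilities and apply a relevancy threshold.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy relevant training documents (already tokenized, stop words removed).
relevant_docs = [["epidemic", "outbreak", "virus"],
                 ["influenza", "cases", "hospital"],
                 ["cholera", "outbreak", "region"]]

dictionary = Dictionary(relevant_docs)
corpus = [dictionary.doc2bow(doc) for doc in relevant_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, random_state=0)

RELEVANCY_THRESHOLD = 0.5  # illustrative value, not reported in the abstract

def is_relevant(tokens):
    """Classify a new document as relevant if any topic probability exceeds the threshold."""
    bow = dictionary.doc2bow(tokens)
    topic_probs = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(prob for _, prob in topic_probs) >= RELEVANCY_THRESHOLD

print(is_relevant(["new", "virus", "outbreak"]))
```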