Abstract

We aimed to develop machine learning classifiers as a risk-prevention mechanism for medical professionals with little or no knowledge of their patients’ languages. The classifiers predict the likelihood of clinically significant mistakes or incomprehensible machine translation (MT) outputs from features of the English source text supplied to the MT system. A multinomial Naïve Bayes (MNB) classifier was developed to provide intuitive probabilistic predictions of erroneous health translation outputs, based on computational modelling of a small number of optimised features of the original English source texts. The best-performing MNB classifier, which used a small set of optimised features (8), achieved a statistically significantly higher AUC (M = 0.760, SD = 0.03) than the classifier using high-dimensional natural features (135) (M = 0.631, SD = 0.006, p < 0.0001, SE = 0.004) and the automatically optimised classifier (22) (M = 0.7231, SD = 0.0084, p < 0.0001, SE = 0.004). Furthermore, MNB (8) had statistically significantly higher sensitivity (M = 0.885, SD = 0.100) than the full-feature classifier (135) (M = 0.577, SD = 0.155, p < 0.0001, SE = 0.005) and the automatically optimised classifier (22) (M = 0.731, SD = 0.139, p < 0.0001, SE = 0.0023). Finally, MNB (8) reached statistically significantly higher specificity (M = 0.667, SD = 0.138) than the full-feature classifier (135) (M = 0.567, SD = 0.139, p = 0.0002, SE = 0.026) and the automatically optimised classifier (22) (M = 0.633, SD = 0.141, p = 0.0133, SE = 0.026).
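The three metrics reported above (AUC, sensitivity, specificity) can be illustrated with a minimal Python sketch. It is not the study's evaluation code. The labels and scores are invented toy values, the 0.5 decision threshold is an assumption, and treating "erroneous translation" as the positive class (label 1) is likewise an assumption made for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # errors correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # errors missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # safe texts correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # safe texts wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = translation judged erroneous, 0 = acceptable.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical predicted probabilities of an erroneous translation.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

THRESHOLD = 0.5  # assumed cut-off, not taken from the study
y_pred = [1 if s >= THRESHOLD else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.4f}")
```

In a screening setting such as this one, sensitivity is the share of genuinely erroneous translations that the classifier flags, which is why a gain in sensitivity matters when missed errors carry clinical risk.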

Highlights

  • Public health discourse is a heterogeneous system which encompasses clinical, research, and educational resources

  • In order to develop interpretable machine learning classifiers to predict the likelihood of machine translation errors, we reviewed the jointly optimised features (CFJO 22) and further reduced the number of features in this automatically selected feature set on the basis of the interpretability of semantic features

  • The main purpose of our study was twofold: first, to develop machine learning classifiers as decision aids and help vulnerable non-English speaking people to appreciate the risks of using machine translation (MT) tools to seek and acquire health information online


Summary

Introduction

Public health discourse is a heterogeneous system which encompasses clinical, research, and educational resources. Research resources are known for their linguistic and content complexity and their formality [3,4,5]. Both clinical and research genres pose significant challenges for machine translation (MT) technologies, which are susceptible to language difficulty and irregularity [6,7,8]. Compared with these specialised discourses, public health educational resources developed by health authorities are designed to have high information accessibility, as these resources are intended to inform, guide, and support the general public in acquiring an essential understanding of health risks and diseases.

