Abstract
As the collection of mobile health data becomes pervasive, missing data can make large portions of datasets inaccessible for analysis. Missing data has proven particularly problematic for remotely diagnosing and monitoring Parkinson's disease (PD) using smartphones. This contribution presents multi-source ensemble learning, a methodology that combines dataset deconstruction with ensemble learning. It enables participants with incomplete data (i.e., where not all sensor data is available) to be included in the training of machine learning models, and it achieves a 100% participant retention rate. We demonstrate the proposed method on a cohort of 1513 participants, 91.2% of whom contributed incomplete data in tapping, gait, voice, and/or memory tests. The use of multi-source ensemble learning, alongside convolutional neural networks (CNNs) that capitalize on the amount of available data, increases PD classification accuracy from 73.1% to 82.0% compared to traditional techniques. The increase in accuracy is found to be caused partly by the use of multi-channel CNNs and partly by developing models using the large cohort of participants. Furthermore, through bootstrap sampling we reveal that feature selection is better performed on a large cohort of participants with incomplete data than on a small number of participants with complete data. The proposed method is applicable to a wide range of wearable/remote monitoring datasets that suffer from missing data, and it contributes to improving the ability to remotely monitor PD by revealing novel methods of accounting for symptom heterogeneity.
Highlights
Parkinson’s disease (PD) is the second most common neurodegenerative disease after Alzheimer’s disease, and its prevalence is estimated to double over the next two decades [22]
In this research we have presented a novel method for compensating for source-wise missing data through the combined use of dataset deconstruction and ensemble learning
Due to the inclusion of a high number of participants and the robust fusion of multiple classification models, we find that our method yields higher disease classification accuracy when used for remote detection of PD and is better suited to feature selection than traditional methods
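The combination of dataset deconstruction and ensemble learning can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier stands in for the CNNs used in the study, the fusion rule (a plain average of per-source probabilities) is an assumption, and the record layout, function names, and toy values are all hypothetical. The idea shown is only that one model is trained per source on every participant who has that source, and that a participant is scored using whichever sources they contributed.

```python
import numpy as np

SOURCES = ["tapping", "gait", "voice", "memory"]  # test types named in the abstract


def deconstruct(records):
    """Split the cohort into one dataset per source, keeping every
    participant who contributed that source (hypothetical record layout)."""
    per_source = {s: ([], []) for s in SOURCES}
    for rec in records:
        for s in SOURCES:
            feats = rec["features"].get(s)
            if feats is not None:
                per_source[s][0].append(feats)
                per_source[s][1].append(rec["label"])
    return {s: (np.array(X, dtype=float), np.array(y))
            for s, (X, y) in per_source.items() if X}


def fit_centroids(X, y):
    """Stand-in classifier: one mean feature vector per class.
    Assumes both classes (0 and 1) are present for this source."""
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}


def predict_source(model, x):
    """Pseudo-probability of PD from distances to the two class centroids."""
    d0 = np.linalg.norm(x - model[0])
    d1 = np.linalg.norm(x - model[1])
    return 1.0 / (1.0 + np.exp(d1 - d0))


def train(records):
    """Train one model per source that has any data."""
    return {s: fit_centroids(X, y) for s, (X, y) in deconstruct(records).items()}


def predict_ensemble(models, features):
    """Average the predictions of the models whose source is available
    (assumed fusion rule, chosen for simplicity)."""
    probs = [predict_source(models[s], np.array(f, dtype=float))
             for s, f in features.items() if f is not None and s in models]
    if not probs:
        raise ValueError("participant has no usable source")
    return float(np.mean(probs))


# Toy cohort: label 1 = PD, 0 = control; most participants miss some sources.
records = [
    {"label": 1, "features": {"tapping": [1.0, 1.0], "voice": [2.0]}},
    {"label": 0, "features": {"tapping": [0.0, 0.0]}},
    {"label": 1, "features": {"voice": [2.2]}},
    {"label": 0, "features": {"voice": [0.1], "gait": [0.0, 0.1, 0.0]}},
    {"label": 1, "features": {"gait": [1.0, 0.9, 1.1]}},
]
models = train(records)
scores = [predict_ensemble(models, r["features"]) for r in records]
```

In this toy cohort no participant is discarded for missing a source, which mirrors the retention property described above; no model is trained for a source that nobody contributed.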
Summary
Parkinson’s disease (PD) is the second most common neurodegenerative disease after Alzheimer’s disease, and its prevalence is estimated to double over the next two decades [22]. The current gold standard for diagnosing and monitoring PD is the Unified Parkinson’s Disease Rating Scale (UPDRS), which is performed in-clinic by movement disorder specialists [9]. The diagnosis procedure is further complicated because symptom prevalence is highly heterogeneous in the PD population: two people with similar UPDRS scores may exhibit different motor symptoms [21], [29]. Many studies have shown the ability to identify disease-differentiating digital biomarkers in gait, dexterity, tremor, and voice tests [6], [16], [24], [36]. The vast majority of these studies have been performed in-clinic, have used different experimental protocols and different sensors, and have had small cohorts. The specific biomarkers from each study lack scalability, as they are yet to be validated on a large cohort.