Abstract

Multiple additive regression trees (MART) have been widely used in the literature for various classification tasks. However, the overfitting effects of MART across heterogeneous and highly imbalanced big data structures within distributed environments have not yet been investigated. In this work, we use distributed MART with a hybrid loss to resolve overfitting effects during the training of disease classification models in a case study with 10 heterogeneous and distributed clinical datasets. Lexical and semantic analysis methods were used to match heterogeneous terminologies with 80% overlap. Data augmentation was used to resolve class imbalance, yielding virtual data with a goodness of fit of 0.01 and a correlation difference of 0.02. Our results highlight the favorable performance of the proposed distributed MART on the augmented data, with average increases of 7.3% in accuracy, 6.8% in sensitivity, and 10.4% in specificity for a specific loss function topology.
