Abstract

This study aimed to develop a semi-automated process for converting legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning. Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to a distributed representation to make them usable as machine learning features. For this purpose, we used the following methods: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining the other four) to identify the one that generated the best prediction model. The accuracy rate was highest for the neural network, and its prediction probabilities separated into distinct distributions for correct and incorrect predictions. By combining human verification with the three methods, we developed a semi-automated process that converts legacy data into the CDISC SDTM format and is more efficient than the conventional fully manual process.
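As an illustration of the Gestalt pattern matching step, the sketch below computes string-similarity features between a legacy variable name and a few candidate SDTM variable names. It uses `difflib.SequenceMatcher` from the Python standard library, which implements a Ratcliff/Obershelp-style (Gestalt) matcher. The candidate list, the function names, and the case normalization are assumptions made for this example; they are not taken from the study.

```python
from difflib import SequenceMatcher

# Hypothetical SDTM target variable names, chosen for illustration only.
SDTM_CANDIDATES = ["USUBJID", "AESTDTC", "AETERM", "VSORRES"]


def gestalt_similarity(a: str, b: str) -> float:
    """Return a Gestalt (Ratcliff/Obershelp) similarity ratio in [0, 1].

    Both strings are upper-cased first; this normalization is an assumption.
    """
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def similarity_features(legacy_name: str) -> dict:
    """Map each candidate SDTM name to its similarity with the legacy name."""
    return {c: gestalt_similarity(legacy_name, c) for c in SDTM_CANDIDATES}


# Similarity scores like these could serve as numeric features for a classifier.
features = similarity_features("ae_term")
best_match = max(features, key=features.get)
print(best_match, round(features[best_match], 3))
```

In a pipeline of the kind the abstract describes, scores like these would be one group of features alongside the Doc2vec-based ones, and the prediction would still be checked by a human.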
