Abstract

The key to effective cancer treatment is early detection. Risk models built from routinely collected clinical data have the potential to improve early detection by identifying high-risk patients. In this study, we explored various machine learning techniques for building a melanoma skin cancer risk model. The dataset contains records of routine dermatology office visits from 9,531,408 patients spread throughout the United States. Of these patients, 17,246 (0.18%) developed melanoma. We conducted extensive experiments to learn effectively from this dataset despite its small number of positive samples. We derived datasets with more severe class imbalance and tested several classifiers with different data sampling techniques to build the best possible model. Additionally, we explored various properties of the datasets to determine relationships between class distributions and model performance. We found that randomly removing negative cases from the training datasets significantly improved model performance. K-means clustering of different groups of instances shows greater homogeneity among negative samples, which is consistent with the finding that removing some of these samples increases overall model performance. This experiment provides a reference framework for future risk models, since most datasets will contain a large number of healthy patients but only a few patients who are at high risk of developing a disease.
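The random removal of negative cases described above can be illustrated with a minimal sketch. The abstract does not give the sampling ratio or the implementation used in the study, so the function name `undersample_negatives`, the `keep_ratio` parameter, and the toy labels below are all hypothetical and chosen only for illustration.

```python
import random


def undersample_negatives(labels, keep_ratio, seed=0):
    """Return sorted indices of the training rows to keep.

    Every positive row (label 1) is kept. Of the negative rows (label 0),
    a random subset of about keep_ratio * n_negatives is kept.
    Hypothetical helper, not the study's implementation.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = max(1, round(keep_ratio * len(neg))) if neg else 0
    kept_neg = rng.sample(neg, n_keep)  # sampling without replacement
    return sorted(pos + kept_neg)


# Toy labels only (not the study's data): 2 positives among 1,000 rows.
labels = [0] * 998 + [1] * 2
kept = undersample_negatives(labels, keep_ratio=0.1)
positive_rate = sum(labels[i] for i in kept) / len(kept)
```

In a sketch like this, undersampling would be applied to the training split only, so that validation and test sets keep the original class distribution and reported metrics reflect the real prevalence.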
