Abstract

Liver cancer is a threat to human health and life worldwide. The key to reducing liver cancer incidence is to identify high-risk populations and carry out individualized interventions before cancer occurs. Building predictive models based on machine learning algorithms is an effective and economical way to forecast potential liver cancers. However, because the datasets are usually extremely skewed (negative samples far outnumber positive samples), machine learning models suffer from severe bias and make unreliable predictions. In this paper, we systematically evaluate existing approaches to the class-imbalance problem and introduce two undersampling methods. The first is based on K-means++, where robust cluster centers are used as negative samples. The second is based on learning vector quantization, which takes diagnostic labels into account during clustering and uses the resulting prototypes as negative data. In this way, positive and negative samples are rebalanced. The approach is applied to five-year liver cancer prediction in the Early Diagnosis and Treatment of Urban Cancer project in China. We achieve an AUC of 0.76 using only epidemiological information, with no clinical measurements. Experimental results show the advantage of our approach over existing oversampling, undersampling, and ensemble algorithms, as well as state-of-the-art outlier detection algorithms. This work explores a feasible and practical roadmap for handling skewed medical data in cancer prediction and benefits applications targeting human health and well-being.
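The K-means++ undersampling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of one cluster center per positive sample, the iteration count, and the plain Euclidean distance are all assumptions made for illustration, and the abstract does not specify how "robust" centers are selected. The sketch clusters the negative samples with K-means++ seeding and keeps only the cluster centers as the new negative class.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Pick k initial centers with K-means++ seeding."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a center; fall back to a uniform draw.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def kmeans_centers(points, k, n_iter=50, seed=0):
    """Run Lloyd's iterations from a K-means++ start and return the k centers."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers


def undersample_negatives(positives, negatives, seed=0):
    """Replace the negative class with as many cluster centers as positives.

    Assumption for this sketch: a 1:1 class ratio after rebalancing.
    """
    centers = kmeans_centers(negatives, len(positives), seed=seed)
    X = [list(p) for p in positives] + centers
    y = [1] * len(positives) + [0] * len(centers)
    return X, y
```

Calling `undersample_negatives(positives, negatives)` on two lists of feature vectors returns a balanced feature list `X` and label list `y` that can be passed to any classifier. The learning vector quantization variant would differ in that prototypes are updated using the diagnostic labels rather than by unsupervised averaging.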
