Estimating Classification Error to Identify Biomarkers in Time Series Expression Data

John H Phan, May D Wang

doi:10.1109/bibe.2007.4375561

Abstract

One of the primary objectives in the study of human diseases is the development of accurate and early diagnostic tests using molecular profiling technology. These investigations usually focus on feature selection, with the goal of building a classifier using only the most clinically relevant features. With time series molecular profiles, each patient's assay contains observations measured at several time points. Using traditional time series classification methods, we can only determine a patient's diagnosis after obtaining all time points, which eliminates the possibility of early diagnosis. This problem can be alleviated by dividing the time series into smaller overlapping sub-series. Unfortunately, these sub-series are not independent and identically distributed (iid). Consequently, when we estimate classification error for feature selection using traditional methods, the estimates may be biased. In response, we have developed a novel method that ranks time series biomarkers using specialized blocked error estimation methods designed to reduce estimation bias. Our investigation applies special cross-validation and bootstrap methods, including h-block cross-validation, hv-block cross-validation, and the blocked bootstrap, to synthetic and clinical time series data. Results indicate a clear decrease in estimation bias using these methods on synthetic time series data. Similar results for a drug treatment dataset provide further evidence that these blocked algorithms can improve biomarker identification.
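To make the dependence problem concrete, the following is a minimal sketch, not the authors' implementation, of how overlapping sub-series can be formed and how an h-block cross-validation split keeps a held-out sub-series away from its overlapping neighbors in the training set. The function names `sliding_windows` and `h_block_splits`, the window width, and the choice h = width - 1 are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sliding_windows(series, width, step=1):
    """Split a 1-D series into overlapping sub-series of length `width`."""
    series = np.asarray(series)
    starts = range(0, len(series) - width + 1, step)
    return np.stack([series[s:s + width] for s in starts])

def h_block_splits(n, h):
    """Yield (train_idx, test_idx) pairs for h-block cross-validation.

    For each held-out index i, the h indices on either side of i are
    also removed from the training set, so that training samples are
    separated from the test sample by a gap of h positions.
    """
    idx = np.arange(n)
    for i in range(n):
        train = idx[np.abs(idx - i) > h]
        yield train, np.array([i])

# Hypothetical toy data: the values are simply the time indices 0..19,
# which makes it easy to see which time points each window covers.
width = 4
windows = sliding_windows(np.arange(20), width=width)

# With step 1, two windows share time points when their start indices
# differ by less than `width`, so h = width - 1 removes every window
# that overlaps the held-out one (an assumption made for illustration).
splits = list(h_block_splits(len(windows), h=width - 1))
```

In ordinary leave-one-out cross-validation the neighbors of the held-out sub-series would remain in the training set even though they share most of its time points, which is one way the estimation bias mentioned above can arise.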

Full Text