We consider three predictive biomarker scenarios (strong, moderate and mild) with varying prevalence. In all three, there is no treatment effect in the biomarker-negative (g−) patient subpopulation. Relative to g−, there is a profound four-fold treatment effect in the biomarker-positive (g+) patient subpopulation in the strongly predictive scenario, a large three-fold g+ treatment effect in the moderately predictive scenario, and a modest two-fold g+ treatment effect in the mildly predictive scenario. In this paper, we focus on a binary endpoint in specifying the treatment effect. A predictive biomarker may be developed using Breiman's (Mach. Learn. 24:123–140, 1996) machine learning voting algorithm via the k-fold cross-validated approach applied by Freidlin et al. (Clin. Cancer Res. 16:691–698, 2010). We consider the development or discovery of a genomic biomarker using microarray gene expression data in randomized controlled trials, and we validate the biomarker's predictive performance in an independent data set. We investigate the classification performance characteristics of a binary genomic composite biomarker (expected to be predictive of treatment effects), including sensitivity, specificity, accuracy, positive predictive value and negative predictive value, as a function of the true prevalence of sensitive patients. We report findings based on three representative tuning parameter sets whose parameter choices range from highly rigorous through moderately rigorous to mildly rigorous, and we explain the rationale for the choice of these sets. We also study the impact of misclassification by genomic biomarker classifiers on the assessment of treatment effects in the positive and negative patient subpopulations and in all-comer patients.
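As a minimal illustration of why the predictive values depend on prevalence while sensitivity and specificity do not, the sketch below applies Bayes' rule to a binary classifier. The function name `classification_metrics` and the numerical inputs are hypothetical and chosen for illustration only; they are not the tuning parameters or results of this paper.

```python
def classification_metrics(sensitivity, specificity, prevalence):
    """Accuracy, PPV and NPV of a binary classifier at a given true prevalence.

    All inputs are proportions in [0, 1]. Illustrative helper, not the
    paper's simulation code.
    """
    # Joint probabilities of the four cells of the 2x2 classification table.
    tp = sensitivity * prevalence                # true g+, classified positive
    fn = (1.0 - sensitivity) * prevalence        # true g+, classified negative
    tn = specificity * (1.0 - prevalence)        # true g-, classified negative
    fp = (1.0 - specificity) * (1.0 - prevalence)  # true g-, classified positive

    accuracy = tp + tn
    # Predictive values are undefined when nobody falls in the predicted class.
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    npv = tn / (tn + fn) if (tn + fn) > 0 else float("nan")
    return {"accuracy": accuracy, "ppv": ppv, "npv": npv}


if __name__ == "__main__":
    # Assumed example: a highly specific but poorly sensitive classifier.
    for prevalence in (0.10, 0.25, 0.50):
        m = classification_metrics(0.40, 0.95, prevalence)
        print(f"prevalence={prevalence:.2f} "
              f"accuracy={m['accuracy']:.3f} ppv={m['ppv']:.3f} npv={m['npv']:.3f}")
```

With sensitivity and specificity held fixed, running the loop shows how the positive and negative predictive values move in opposite directions as the prevalence of g+ patients changes.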
Using simulation studies, we examine approaches to improving sensitivity when a biomarker is highly specific but poorly sensitive, the scenario most likely to lead to an incorrect test conclusion about whether a significant treatment effect applies to a specific patient subpopulation or to both the positive and negative subpopulations. We explore when it is beneficial to develop a binary predictive biomarker, and conclude that hypothesis test inferences for the g+ subpopulation treatment effect in the dual hypotheses setting (all-comer and g+ alone) cannot be relied upon if the biomarker classifier is highly specific but poorly sensitive, or if it has poor negative predictive value. The converse dual hypotheses setting (all-comer and g− alone) raises the same concern when the classifier is highly sensitive but poorly specific, or has poor positive predictive value. In addition, we compare the predictive performance of a biomarker classifier built by direct selection with that of one built by selection from a candidate pool; the comparison favors direct selection where biological or mechanistic plausibility can be relied upon. Further research is needed if an accurate classifier is required irrespective of prevalence level.
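The effect of misclassification on subgroup treatment effects can be sketched with a simple expected-value calculation. The sketch assumes a binary response, no treatment effect in true g− patients, and a classifier whose errors do not depend on treatment arm or outcome. The function name `expected_risk_differences`, the use of a risk difference as the effect measure, and the response rates are illustrative assumptions, not the simulation design of this paper.

```python
def expected_risk_differences(sens, spec, prev, p_control, p_treated_pos):
    """Expected treated-minus-control response-rate differences by classifier label.

    Assumes true g- patients respond at rate p_control in both arms and
    true g+ patients respond at rate p_treated_pos when treated.
    Hypothetical helper for illustration only.
    """
    # Probability that a randomly chosen patient is classified positive.
    p_classified_pos = sens * prev + (1.0 - spec) * (1.0 - prev)
    if not 0.0 < p_classified_pos < 1.0:
        raise ValueError("classifier labels every patient the same way")

    ppv = sens * prev / p_classified_pos
    npv = spec * (1.0 - prev) / (1.0 - p_classified_pos)

    true_effect = p_treated_pos - p_control  # effect among true g+ patients
    return {
        # Diluted by true g- patients who were classified positive.
        "classifier_positive": ppv * true_effect,
        # Non-zero because true g+ patients were classified negative.
        "classifier_negative": (1.0 - npv) * true_effect,
        # Only the true g+ fraction contributes to the all-comer effect.
        "all_comers": prev * true_effect,
    }


if __name__ == "__main__":
    # Assumed example: highly specific, poorly sensitive classifier.
    result = expected_risk_differences(
        sens=0.40, spec=0.95, prev=0.50, p_control=0.20, p_treated_pos=0.60
    )
    for group, diff in result.items():
        print(f"{group}: expected risk difference = {diff:.3f}")
```

Under these assumptions, a poor negative predictive value leaves a non-zero expected treatment effect in the classifier-negative group, which is one way to see why a conclusion restricted to the classifier-positive subpopulation can mislead.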