Let's consider the growing popularity of phenotypic screens within the drug development community. Biochemical assays once dominated the field of high-throughput chemical screening, but phenotypic experiments are more adept at modeling real physiological behavior, and can simultaneously mimic drug delivery factors including cell membrane permeability, intracellular localization, and aspects of transport and metabolism. The efficiency and cost of phenotypic screens have also improved substantially over the years. For all of these reasons, the technology is well suited to fostering lead optimization. The technology has disadvantages, however, in that data complexity hinders de novo SAR rationalization. For preliminary screens over diverse chemotypes, the simpler biochemical construct affords a good platform for classifying hits according to target-specific modulation, thus facilitating systematic pharmacophore perception. With phenotypic screens, activity trends across different chemotypes may actually reflect modulation of different biochemical targets, or modulation via distinct interaction modes. Even within the same chemotype family, pharmacophore perception may be obscured by imprecise partitioning of observed bioactivity measurements between fundamental biochemical modulation and variation in ADME-like factors. So the question arises: can phenotypic screens ultimately support systematic SAR-driven pharmacophore perception and thus fully supplant biochemical screens? The answer may be computational in nature: given a large, accurate phenotypic data set, one should theoretically be able to sort through the various influences to distill a reliable chemotype-specific SAR that distinguishes target-specific trends from deliverability issues. In principle, this can be achieved via data mining. Unfortunately, many people who realize that data mining is designed for such challenges may be missing background insight that is key to exploiting such methods.
Most often overlooked is the fact that excellent calculations will rarely rescue weak data. Grasping pharmacophore effects from a screening study requires data sets that are strategically sensitive to variations in the molecular effects that dictate physiology. In order for a given chemical to exert specific biochemical activity, it must have the right solubility profile to be available to those biomolecules that must collectively admit, transport and bind the modulator in order to effect the relevant bioactivity. Ligand solubility is determined by chemical substructures. Furthermore, every relevant intermolecular interaction is directly influenced by ligand chemical composition. Thus, if one knows which chemical substructures balance appropriate solubility with the interactions required to reach and bind the biomolecular target, one should be able to predict ligand activity. This is the essence of rational drug design and is, furthermore, the type of insight that careful mining of a well-crafted data set should yield. A thorough discourse on 'careful' data mining would take up much more than the space available to an editorial like this, but of more immediate interest to non-informaticians who design chemical screens would be a quick synopsis of what well-crafted data sets might look like. To a significant extent, the data set will depend on whether one is trying to refine one's knowledge of the target-specific SAR in a system, or searching de novo for promising new chemotypes with activity toward a given phenotype. Ultimately, the strategy lies in giving the informatician the right spread of data to partition the global bioactivity measurement into contributing factors.
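To make the idea of partitioning concrete, consider a minimal sketch in Python. It assumes, purely for illustration, that measured activity is an additive function of one target-binding descriptor and one availability descriptor (solubility, permeability or the like). The descriptor values, the coefficients and the function names are all hypothetical and are not drawn from any real screen; ordinary least squares simply stands in here for whatever data mining method an informatician might actually choose.

```python
# Illustrative only: the additive model, descriptor values and coefficients
# below are assumptions made for this sketch, not data from a real screen.

TRUE_INTERCEPT, TRUE_BINDING, TRUE_AVAILABILITY = 0.2, 1.5, 0.8


def solve_normal_equations(rows, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns the list of weights, or None when X^T X is singular, i.e. when
    the data cannot separate the contributions of the columns.
    """
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-9:
            return None
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(n):
            if k != col:
                f = a[k][col] / a[col][col]
                a[k] = [x - f * p for x, p in zip(a[k], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def simulate_activity(binding, availability):
    """Noise-free synthetic 'phenotypic' readout under the assumed model."""
    return [TRUE_INTERCEPT + TRUE_BINDING * b + TRUE_AVAILABILITY * v
            for b, v in zip(binding, availability)]


def fit(binding, availability):
    """Fit intercept, binding weight and availability weight."""
    rows = [[1.0, b, v] for b, v in zip(binding, availability)]
    return solve_normal_equations(rows, simulate_activity(binding, availability))


binding = [0.0, 1.0, 2.0, 3.0, 1.0, 2.0]

# Screening set whose availability descriptor varies across compounds.
varied_fit = fit(binding, [1.0, 1.0, 0.0, 1.0, 0.0, 0.5])

# Screening set in which every compound has the same availability.
flat_fit = fit(binding, [1.0] * 6)

print(varied_fit)
print(flat_fit)
```

Under these assumptions the first fit recovers both contributions, whereas the second returns None, because an availability descriptor that never varies cannot be told apart from the intercept. Real phenotypic data are noisy and rarely additive, so this is a toy picture of why the spread of the data matters, not a recipe for analysis.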
In the simpler case where a chemotype family of interest has already been identified, the set of compounds screened should focus on chemicals with the desired core scaffold (or close variants thereof) so that subsequent data analysis will emphasize SAR within that family and avoid informational contamination from other mechanisms of action. However, it can help to retain a small population of compounds derived from chemical families that likely act via distinct mechanisms; this enables not only computational differentiation between chemical properties that influence whether a given compound is active, but also perception of attributes that push modulators toward one mechanism over another. In addition, although it may seem counter-intuitive to populate your screen with compounds known to have marginal solubility, membrane permeability or transport efficacy, a data set that includes them will support analyses that can distinguish molecular properties that favor target-specific interactions from those that amplify bioactivity simply by ensuring greater compound availability to the target.
For speculative preliminary studies, systematic screening set selection reflects other considerations. Unlike targeted screening, general preliminary screens yield more useful information when they embrace broad chemical functionality, which reduces the number of potentially relevant chemotypes that are overlooked. Preliminary screens should also rigorously eschew compounds with poor ADME properties, since at this early stage there is little benefit in degrading chemotype-specific assessments with inactivity arising from factors other than target compatibility. The resulting SAR analysis may blend factors that reflect target-specific effects with deliverability, but it will at least lay a solid foundation for subsequent targeted lead discovery and refinement studies.