Many molecular docking programs are now available, so evaluating and comparing their performance is of great practical value. We have conducted an extensive evaluation of four popular commercial molecular docking programs: Glide, GOLD, LigandFit, and Surflex. Our test set consists of 195 protein-ligand complexes with high-resolution crystal structures (resolution ≤ 2.5 Å) and reliable binding data [dissociation constant (Kd) or inhibition constant (Ki)], selected from the PDBbind database with an emphasis on diversity. The top-ranked solutions produced by these programs are compared to the native ligand binding poses observed in the crystal structures. Glide and GOLD demonstrate better accuracy than the other two programs on the entire test set. Their results are also less sensitive to the starting structures used for docking. Comparison of the results produced by these programs at three different computation levels reveals that their accuracy is not always proportional to CPU cost, contrary to what one might expect. The binding scores of the top-ranked solutions correlate only weakly to moderately with experimentally measured binding data. Further analyses of the outcomes of these programs on three suites of subsets of protein-ligand complexes indicate that the programs are less capable of handling highly flexible ligands and relatively flat binding sites, and that they differ in their preferences for hydrophilic or hydrophobic binding sites. Our evaluation can help other researchers make reasonable choices among the available molecular docking programs. It should also help program developers improve their methods further.
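
The two kinds of comparison described above, top-ranked pose against the native pose and binding score against measured affinity, can be illustrated with a minimal Python sketch. It is not the protocol used in this evaluation. It assumes that poses are given as lists of (x, y, z) heavy-atom coordinates in the same atom order and the same reference frame, that a pose counts as a success when its RMSD is within a 2.0 Å cutoff (a common convention, assumed here rather than taken from the text), and that Kd or Ki values in mol/L are converted to negative base-10 logarithms before computing a Pearson correlation. All function names are hypothetical, and symmetry-equivalent atoms are not handled.

```python
import math


def rmsd(pose_a, pose_b):
    """In-place RMSD between two poses with identical atom ordering.

    No superposition is performed: both poses are assumed to share the
    protein's coordinate frame. Symmetry-equivalent atoms are ignored.
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def success_rate(rmsd_values, cutoff=2.0):
    """Fraction of complexes whose top-ranked pose lies within the cutoff.

    The 2.0 A default is an assumed convention, not a value from the text.
    """
    if not rmsd_values:
        raise ValueError("no RMSD values given")
    return sum(1 for r in rmsd_values if r <= cutoff) / len(rmsd_values)


def to_pk(constant_molar):
    """Convert a Kd or Ki value in mol/L to -log10 units."""
    return -math.log10(constant_molar)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In such a sketch, `success_rate` would be applied to the per-complex RMSD values of one program, and `pearson_r` to that program's binding scores paired with the `to_pk` values of the same complexes. Because scoring functions differ in sign convention, the sign of the resulting coefficient would need to be interpreted per program.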