Molecular docking is the process of posing, scoring, and ranking small molecules in the binding sites of proteins to prioritize compounds for experimental testing. It is a widely used computational method in drug discovery. However, it is highly time-consuming, since favorable ligand orientations for a single receptor may have to be searched for across billions of ligands. UCSF DOCK3.7 is one of the most widely used molecular docking applications. In this paper, we port and optimize UCSF DOCK3.7 on the Sunway TaihuLight supercomputer. To mitigate load imbalance, we employ a producer-consumer strategy that overlaps I/O with computation. Furthermore, we present a new binary file format to replace the mol2db2 file format for ligand storage and adopt xzip rather than gzip to compress ligand files. We show that our file format significantly reduces I/O time, while xzip significantly reduces storage requirements. For the routines that determine the orientation of a ligand relative to the receptor, we present an improved algorithm that discards geometrically similar orientations. We also fuse loops and reduce the memory footprint so that data can be kept in the fast Local Device Memory (LDM), which allows ligand orientations to be scored efficiently. In addition, we propose a number of architecture-specific optimizations: asynchronous data transfer and vectorization of computation are implemented to take full advantage of the SW26010 processor. Our experiments show that the proposed strategies achieve a speedup of 167. Compared to a single core of an Intel(R) Core(TM) i9-10900K CPU, our approach achieves a speedup of 15 on one SW26010 core group. Moreover, our implementation achieves strong scalability to hundreds of thousands of heterogeneous cores on the next-generation Sunway supercomputer.
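
The producer-consumer strategy mentioned above can be illustrated with a minimal sketch. This is not the implementation used on Sunway TaihuLight. It is a generic Python illustration that assumes one reader thread and a bounded task queue, and the names `run_pipeline`, `score_batch`, `num_consumers`, and `max_pending` are hypothetical. A single producer hands ligand batches to whichever consumer is idle, so reading overlaps with scoring and slow batches do not stall the other workers.

```python
import queue
import threading


def run_pipeline(ligand_batches, score_batch, num_consumers=4, max_pending=8):
    """Score batches with one producer thread and several consumer threads.

    ligand_batches: iterable of batches (stands in for reading ligand files).
    score_batch:    hypothetical scoring callable applied to each batch.
    Returns the scores in the original batch order.
    """
    # Bounded queue: the producer blocks when consumers fall behind.
    tasks = queue.Queue(maxsize=max_pending)
    results = {}
    lock = threading.Lock()
    sentinel = None  # marks the end of the input stream

    def producer():
        # "I/O" side: fetch batches and enqueue them with their index.
        for idx, batch in enumerate(ligand_batches):
            tasks.put((idx, batch))
        # One sentinel per consumer so every consumer terminates.
        for _ in range(num_consumers):
            tasks.put(sentinel)

    def consumer():
        # Compute side: an idle consumer takes the next available batch,
        # which balances load dynamically across consumers.
        while True:
            item = tasks.get()
            if item is sentinel:
                break
            idx, batch = item
            score = score_batch(batch)
            with lock:
                results[idx] = score

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]


if __name__ == "__main__":
    # Toy data: each "ligand batch" is a list of numbers, scored by its sum.
    print(run_pipeline([[1, 2], [3], [4, 5, 6]], sum))
```

The bounded queue is a design choice of this sketch: it caps how many batches are held in memory at once, at the cost of stalling the reader when scoring is the bottleneck.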