Massively parallel computers are becoming common in computational fluid dynamics. In this study, a parallel algorithm for a three-dimensional, primitive-equation coastal ocean circulation model is designed for the hypercube MIMD computer architecture. The grid is partitioned using one-dimensional domain decomposition, and the sub-grids are generated with a parallel I/O code. The sub-grids are mapped onto $P$ successive nodes of the hypercube topology. The resulting parallel code is scalable: allocating more nodes does not affect its architecture.

To test the code on a uniform rectangular grid, the model domain in each node is a cube. The one-dimensional partition is in the y-direction, so the area of communication is $NX \times NZ$, where $NX$ and $NZ$ are the numbers of grid points in the x- and z-directions, respectively.

For the problem where the grain size ($n_y$) is fixed, the speedup is linear and close to ideal for $P \ge 8$ processors. The speedup grows from 3.9 on 4 processors to 29.3 on 32 processors, so the efficiency decreases from 0.97 on 4 processors to 0.91 on 32 processors. The speedup and efficiency are also unaffected by the increase in the number of bytes sent for $P \ge 8$. The overhead ($F_C$) increases with the number of processors, and the background overhead is inversely proportional to the grain size. The slope of $F_C$ versus $P$ is a measure of the fraction of non-parallel code. The calculation time is 0.70 and the communication time is 0.37.

For the problem where the domain is fixed, the grain size is inversely proportional to the number of nodes. The speedup grows from 7.7 on 8 processors to 29.5 on 32 processors, and the efficiency decreases from 0.97 on 8 processors to 0.92 on 32 processors. The overhead increases linearly with $P$, and the slope of $F_C$ is a measure of the communication cost. The calculation time is 0.58 and the communication time is 0.27.

The load-balancing problem is addressed by examining a problem with an irregular domain. In an irregular domain, parallel efficiency is not much affected by the partition: the ratio between the 16-node run and the 8-node run yields an efficiency of 0.9, so the effect of load balancing on the overall efficiency is small. This analysis can be used to predict the performance of the parallel scheme as the hardware technology of parallel computers changes.
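The communication pattern implied by the one-dimensional y-decomposition can be made concrete with a short sketch. The original code predates MPI and used the hypercube's native message-passing primitives, so the following is only a minimal modern analogue: the grid dimensions, the one-plane halo width, and all variable names are illustrative assumptions, not details of the original implementation.

/* Minimal sketch (assumed, not the original code) of the halo exchange
 * for a 1-D decomposition in y.  Each node owns ny interior y-planes
 * plus one halo plane on each side; each plane holds NX*NZ points, the
 * communication area quoted in the abstract. */
#include <mpi.h>
#include <stdlib.h>

#define NX 32                          /* grid points in x (assumed) */
#define NZ 16                          /* grid points in z (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ny = 8;                        /* grain size n_y per node (assumed) */
    int plane = NX * NZ;               /* points exchanged per message */
    double *u = calloc((size_t)(ny + 2) * plane, sizeof *u);

    /* Neighbours along y; MPI_PROC_NULL makes the ends no-ops. */
    int lo = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int hi = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send first interior plane down / receive upper halo, and vice versa. */
    MPI_Sendrecv(u + 1 * plane,        plane, MPI_DOUBLE, lo, 0,
                 u + (ny + 1) * plane, plane, MPI_DOUBLE, hi, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u + ny * plane,       plane, MPI_DOUBLE, hi, 1,
                 u + 0 * plane,        plane, MPI_DOUBLE, lo, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(u);
    MPI_Finalize();
    return 0;
}

Because the partition is one-dimensional, each node exchanges exactly one $NX \times NZ$ plane with each of at most two neighbours regardless of $P$, which is consistent with the observation above that speedup and efficiency are insensitive to the number of bytes sent for $P \ge 8$.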
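For reference, the performance figures quoted above are consistent with the standard definitions of speedup, efficiency, and overhead; the abstract does not state the definitions explicitly, so the following conventional forms are assumed:

\[
S_P = \frac{T_1}{T_P}, \qquad
E_P = \frac{S_P}{P}, \qquad
F_C = \frac{1}{E_P} - 1 ,
\]

where $T_1$ and $T_P$ are the run times on one and $P$ processors. For the fixed-grain runs, $E_4 = 3.9/4 \approx 0.97$ and $E_{32} = 29.3/32 = 0.916$, agreeing with the reported efficiencies to within rounding; likewise $E_{32} = 29.5/32 \approx 0.92$ for the fixed-domain runs. The exact definition of $F_C$ used in the study may differ (e.g., the ratio of communication to calculation time), so the expression above is only one common choice.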