In many computational tasks, especially in signal processing, it is the throughput that is important, rather than the latency, or delay. If a special-purpose VLSI chip is designed for a particular signal processing task, such as FIR filtering, for example, the maximum clock rate, and hence throughput, is determined by the depth of the combinational logic between registers and the time required for the distribution and operation of the clock. If the combinational logic is sufficiently deep (in bit-parallel circuits, for example), the throughput can be increased by inserting intermediate stages of clocked latches. This is at the expense of increased area and delay to operate and clock the intermediate registers. Roughly speaking, the strategy amounts to using more of the chip area to store information useful for pipelining. This paper investigates the optimal tradeoff between the degree of intermediate latching and cost, using the measure AP, where A is the chip area and P is the period (the reciprocal of throughput). We derive expressions for the time and area before and after intermediate latching, using the Mead-Conway model, both for the cases of on-chip and off-chip clock drivers. The results show that significant reductions in AP product (reciprocal of throughput per unit area) can be achieved by intermediate latching in many typical signal processing applications, for a wide range of circuit parameters. The array multiplier is used as an example.