Multipliers are essential computation units in virtually all computing systems, including processors and numerous AI accelerator architectures. This paper presents an optimized Booth multiplier architecture that targets high performance while minimizing energy consumption and area. The optimization covers all three multiplier stages: partial product generation, reduction, and summation. In the partial product generation stage, we reduce delay and energy by replacing complex binary-to-two's-complement circuitry with a simplified configuration comprising inverters and a sign selection unit. To further reduce delay and area at this stage, we apply logic optimization to the partial product generation circuitry, designing the Booth encoders so that redundant logic in the multiplexer circuitry is removed. In the reduction stage, we introduce specialized sign compressors tailored for carry-save compression. Compared with their conventional counterparts, these compressors consume less power and have a shorter critical path of only two XOR gates. Finally, in the summation stage, we propose an optimized Carry Look-Ahead Adder segment designed to deliver high speed using logic gates with minimal fan-in, even at high bit widths. This segment is cascaded to form the 13-bit final adder of the proposed design. The proposed architecture is synthesized for an ASIC target in Cadence Genus using the FreePDK 45 nm CMOS process technology. Synthesis results, together with a theoretical design complexity comparison, show that the proposed design outperforms state-of-the-art 8 × 8 multiplier designs on key metrics, including delay, power consumption, area, power-delay product, and area-delay product.
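To make the partial product generation idea concrete, the following is a minimal behavioral sketch in Python, not the proposed circuit. It assumes radix-4 Booth recoding and signed 8-bit operands, neither of which is stated above, and the function names `booth_radix4_digits` and `booth_multiply` are hypothetical. It illustrates how a negative partial product can be formed by bitwise inversion plus a separately added sign bit, rather than by a full two's-complement conversion.

```python
def booth_radix4_digits(y, width=8):
    """Recode a signed `width`-bit multiplier into radix-4 Booth digits in {-2..2}.

    Assumption: radix-4 recoding; the source does not state the radix.
    """
    y &= (1 << width) - 1              # two's-complement bit pattern of y
    bits = y << 1                      # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, width, 2):
        triplet = (bits >> i) & 0b111  # overlapping group y[i+1], y[i], y[i-1]
        b2 = (triplet >> 2) & 1
        b1 = (triplet >> 1) & 1
        b0 = triplet & 1
        digits.append(-2 * b2 + b1 + b0)
    return digits


def booth_multiply(x, y, width=8):
    """Multiply two signed `width`-bit integers using Booth partial products."""
    out_bits = 2 * width
    mask = (1 << out_bits) - 1
    acc = 0
    for k, d in enumerate(booth_radix4_digits(y, width)):
        magnitude = (x * abs(d)) & mask       # 0, x, or 2x as an out_bits pattern
        neg = 1 if d < 0 else 0
        # "Inverters": complement every bit when the Booth digit is negative.
        pp = magnitude ^ (mask if neg else 0)
        # "Sign selection": the +1 that completes the negation is added as a
        # separate bit at the partial product's weight.
        acc += (pp << (2 * k)) + (neg << (2 * k))
    acc &= mask
    # Reinterpret the out_bits-wide result as a signed integer.
    return acc - (1 << out_bits) if acc >> (out_bits - 1) else acc
```

In this model, calling `booth_multiply(x, y)` for any pair of signed 8-bit values returns the same result as `x * y`. The reduction and final addition are collapsed into ordinary integer addition here, so the sketch says nothing about the sign compressors or the Carry Look-Ahead Adder segment.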