Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by $2.18\times$ for addition/subtraction, $1.29\times$ for multiplication, 3.07–$4.05\times$ for division, and 3.11–$3.81\times$ for square root, and use 90.7–94.6% less area than dedicated fused multiply-add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69–$7.28\times$ for addition/subtraction, 1.22–$2.03\times$ for multiplication, $14.4\times$ for division, and $31.9\times$ for square root, and use 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11–$15.9\times$ and use 38.2–95.3% less area than dedicated FMA hardware.
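
To illustrate why leading zero detection can help a software FP kernel, the following minimal sketch contrasts two ways of normalizing a significand after a subtraction. It is not the instruction set or algorithm of the proposed processor: the function names, the 24-bit significand width, and the `steps` counter are assumptions introduced here, and `steps` counts shift operations rather than measured cycles.

```python
def normalize_loop(mant: int, exp: int, width: int = 24) -> tuple[int, int, int]:
    """Normalize by shifting one bit per iteration (no leading-zero support).

    Assumes 0 <= mant < 2**width. Returns (mantissa, exponent, steps).
    """
    if mant == 0:
        return 0, 0, 0
    top = 1 << (width - 1)  # position of the leading significand bit
    steps = 0
    while not (mant & top):
        mant <<= 1
        exp -= 1
        steps += 1
    return mant, exp, steps


def normalize_lzd(mant: int, exp: int, width: int = 24) -> tuple[int, int, int]:
    """Normalize with a single leading-zero count followed by one shift.

    int.bit_length() stands in for a hypothetical leading-zero-detect
    instruction. Assumes 0 <= mant < 2**width.
    """
    if mant == 0:
        return 0, 0, 0
    lz = width - mant.bit_length()  # number of leading zeros in the field
    return mant << lz, exp - lz, 1


if __name__ == "__main__":
    # Worst case for the loop: only the least significant bit is set.
    print(normalize_loop(0x000001, 0))
    print(normalize_lzd(0x000001, 0))
```

Both functions produce the same normalized significand and exponent; the loop version needs up to `width - 1` single-bit shifts in this model, whereas the leading-zero version needs one count and one shift regardless of the input.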