This article presents the design and implementation of active messaging engines (AMEs) on an IBM Power10 prototype chip. AMEs are tiny, simple, yet fully programmable 64-bit processors for offloading operations related to data movement. AMEs can offload the execution flow of the Message Passing Interface (MPI) and other messaging stacks from the host central processing unit, enabling truly asynchronous progress so that computation and communication overlap. The AMEs are implemented as onboard OpenCAPI-compliant accelerators, leveraging existing OpenCAPI infrastructure. As realized in a 7-nm technology, each AME takes 0.034 mm² of silicon area and 4.1 mW of power. AME performance is evaluated across several contiguous and noncontiguous memory copy scenarios. AMEs can sustain throughput up to the bandwidth limit of their access path to main memory (32 GB/s) and incur a per-request overhead of about 600 ns. These results indicate that AMEs will confer advantages to general messaging libraries for processing, sending, and receiving on-node and off-node messages.
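
To give a feel for how the two headline figures interact, the following is a minimal sketch that assumes a simple linear cost model (fixed per-request overhead plus transfer time at peak bandwidth). The model itself, the function names, and the reading of GB as 10^9 bytes are illustrative assumptions, not measurements or methods taken from the article; only the 32 GB/s and 600 ns values come from the text above.

```python
# Illustrative sketch only: a linear cost model, time = overhead + size / bandwidth.
# The model and all names here are assumptions, not the article's methodology.

PEAK_BANDWIDTH_BYTES_PER_S = 32e9  # 32 GB/s from the abstract (decimal GB assumed)
PER_REQUEST_OVERHEAD_S = 600e-9    # about 600 ns from the abstract


def copy_time_s(size_bytes: float) -> float:
    """Estimated time for one copy request under the assumed linear model."""
    return PER_REQUEST_OVERHEAD_S + size_bytes / PEAK_BANDWIDTH_BYTES_PER_S


def effective_bandwidth_bytes_per_s(size_bytes: float) -> float:
    """Payload bytes delivered per second once the fixed overhead is included."""
    return size_bytes / copy_time_s(size_bytes)


def half_bandwidth_size_bytes() -> float:
    """Request size at which effective bandwidth is half of peak.

    Solving size / (overhead + size / peak) = peak / 2 gives size = overhead * peak.
    """
    return PER_REQUEST_OVERHEAD_S * PEAK_BANDWIDTH_BYTES_PER_S


if __name__ == "__main__":
    for size in (1_024, 19_200, 65_536, 1_048_576):
        gbps = effective_bandwidth_bytes_per_s(size) / 1e9
        print(f"{size:>9} B -> {gbps:6.2f} GB/s (assumed model)")
```

Under this assumed model, a request of roughly 19 KB (600 ns × 32 GB/s = 19,200 bytes) would reach about half of the peak bandwidth, and much larger requests would approach the 32 GB/s limit. The actual behavior reported in the article may differ from this simplification.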