Partitioned point-to-point communication primitives provide a performance-oriented mechanism to support a hybrid parallel programming model and have been included in the upcoming MPI-4.0 standard. These primitives enable an MPI library to transfer parts of a data buffer while the application is still providing partial contributions, whether from multiple threads or tasks or by simply pipelining the buffer sequentially. The focus of this paper is the design and implementation of a layered library that provides the functionality of these new APIs and supports application development with them. The library also provides an opportunity to explore potential optimizations and identify further enhancements to the APIs. We present our initial experience in designing this library along with preliminary performance results. In addition, we compare the library to recently available prototype libraries that have been updated to the standard-compliant interface. These prototype libraries were built on remote-memory-access (RMA) primitives, which offers insight into different implementation strategies. In general, we observe an interesting trade-off space. For send-side partitioning, the RMA-based implementation is more performant, increasing perceived bandwidth by 8.9x on average over a single send, whereas the persistent-based implementation improves it by 4.0x on average. Comparing the two implementations directly, we find that the persistent-based implementation enables more overlap for receive-side partitioning, up to 5.37x the RMA library’s overlap, while the RMA-based implementation provides up to 70% better send-side performance.