Heterogeneous systems that combine multiple different compute devices have recently come into common use; device heterogeneity is mainly reflected in three aspects: hardware architecture, instruction set architecture, and processing capability. Heterogeneous CPU-accelerator systems in particular have attracted increasing attention. To make full use of multiple CPUs and accelerators when executing data-parallel applications, programmers may need to manually map computation and data to all available compute devices, which is tedious, error-prone, and difficult. For many data-parallel applications, inter-device communication can easily become the performance bottleneck of multi-device co-execution. Therefore, first, a runtime system is designed to support heterogeneous cooperative execution (HCE) of data-parallel applications, helping programmers automatically and efficiently map computation and data to multiple compute devices. Second, an incremental data transfer method is designed to avoid redundant data transfers between devices, and a three-way overlapping communication optimization based on software pipelining is designed to effectively hide inter-device communication overhead. Building on our previously proposed feedback-based dynamic and elastic task scheduling (FDETS) scheme and asynchronous-based dynamic and elastic task scheduling (ADETS) scheme, we propose a modified FDETS that supports incremental data transfer and a modified ADETS that supports the three-way overlapping communication optimization; these schemes not only effectively partition and balance the workload among multiple compute devices but also significantly reduce inter-device data transfer overhead. Third, a prototype of the proposed runtime system is implemented, providing a set of runtime APIs for task scheduling, device management, memory management, and transfer optimization. Experimental results show that the proposed inter-device communication optimizations greatly reduce communication overhead between devices and that multi-device co-execution significantly outperforms the best single-device execution.
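To give a rough sense of the general technique behind three-way overlap via software pipelining (this is a minimal, self-contained CUDA sketch of the standard chunked-streams pattern, not the paper's runtime implementation; the names `process_chunk`, `pipelined_run`, `NSTREAMS`, and the chunking scheme are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

// Placeholder per-chunk kernel; the real computation is application-specific.
__global__ void process_chunk(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Software-pipelined execution over chunks: in each stream, the host-to-device
// copy, the kernel, and the device-to-host copy of one chunk can overlap with
// the corresponding stages of another chunk in the other stream.
// h_in and h_out should be pinned (cudaHostAlloc) for copies to run asynchronously.
void pipelined_run(const float *h_in, float *h_out, int total, int chunk) {
    const int NSTREAMS = 2;
    cudaStream_t streams[NSTREAMS];
    float *d_in[NSTREAMS], *d_out[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_in[s],  chunk * sizeof(float));
        cudaMalloc(&d_out[s], chunk * sizeof(float));
    }
    for (int off = 0, s = 0; off < total; off += chunk, s = (s + 1) % NSTREAMS) {
        int n = (off + chunk <= total) ? chunk : total - off;
        // Stage 1: asynchronous host-to-device copy of this chunk.
        cudaMemcpyAsync(d_in[s], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        // Stage 2: kernel on this chunk, queued in the same stream.
        process_chunk<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
        // Stage 3: asynchronous device-to-host copy of the result.
        cudaMemcpyAsync(h_out + off, d_out[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

On devices with separate copy engines for each direction, this pattern lets an upload, a kernel, and a download proceed concurrently, which is the kind of communication hiding the three-way overlapping optimization targets; the paper's runtime additionally folds this into its ADETS scheduling rather than leaving it to the application.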