Ultrasound open scanners have recently accelerated the development and validation of novel imaging techniques. They are usually classified as hardware- or software-oriented systems, depending on whether they process the echo data using embedded FPGAs/DSPs or a GPU on a host PC. The goal of this work was to realize a high-performance heterogeneous open scanner that leverages the strengths of both hardware- and software-oriented systems. The processing power of the 256-channel ultrasound advanced open platform (ULA-OP 256) was further enhanced by embedding a compact co-processing GPU system-on-module (SoM). Low-level optimization minimized latencies and overheads, establishing an efficient PCIe communication interface between the GPU and the processing devices onboard the ULA-OP 256. As a proof of concept of the enhanced system, the high-frame-rate color flow mapping technique was implemented on the GPU SoM and tested. Compared to a previous DSP-based implementation, higher real-time frame rates were achieved together with unprecedented flexibility in setting crucial parameters such as the ensemble length (EL). For example, by setting EL = 64 and using a continuous-time high-pass filter, the flow was investigated with high temporal and spatial resolution in the femoral vein bifurcation (frame rate = 1.1 kHz) and the carotid artery bulb (4.3 kHz), highlighting the flow disturbances due to valve opening and secondary velocity components, respectively. The results of this work promote the real-time implementation of other computationally expensive processing algorithms and may inspire the next generation of high-performance heterogeneous ultrasound scanners.
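
To illustrate the role of the ensemble length, color flow mapping is commonly computed per pixel with a lag-one autocorrelation (Kasai) estimator over EL slow-time samples. The following is a minimal Python sketch of that general estimator, not the GPU implementation described in this work. The function name `kasai_velocity`, its parameters, the default sound speed of 1540 m/s, the sign convention, and the mean-subtraction clutter filter (a simple stand-in for the high-pass filter used by the authors) are all assumptions made for illustration.

```python
import cmath
import math


def kasai_velocity(iq, prf_hz, f0_hz, c_m_s=1540.0):
    """Estimate the axial velocity (m/s) at one pixel.

    iq     : sequence of complex baseband samples, one per transmit
             event (its length is the ensemble length, EL)
    prf_hz : pulse repetition frequency in Hz
    f0_hz  : transmit center frequency in Hz
    c_m_s  : assumed speed of sound in m/s
    """
    el = len(iq)
    if el < 2:
        raise ValueError("ensemble length must be at least 2")

    # Crude clutter rejection (assumption): subtracting the ensemble
    # mean removes only the zero-frequency (stationary) component.
    mean = sum(iq) / el
    filtered = [s - mean for s in iq]

    # Lag-one autocorrelation along slow time.
    r1 = sum(filtered[n + 1] * filtered[n].conjugate() for n in range(el - 1))

    # Mean phase shift per pulse -> Doppler frequency -> velocity.
    phase = cmath.phase(r1)
    doppler_hz = phase * prf_hz / (2.0 * math.pi)
    return c_m_s * doppler_hz / (2.0 * f0_hz)
```

In this sketch a longer ensemble averages more lag-one products, which lowers the variance of the estimate at the cost of more computation per pixel; that trade-off is why a freely adjustable EL is useful. The unambiguous velocity range is limited to Doppler shifts within ±PRF/2.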