Neural network accelerators (e.g., TPUs) have become mainstream devices in computing systems. Unfortunately, existing accelerator-based systems for neural networks fail to fully leverage acceleration opportunities due to their <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">limited flexibility</italic>. Specifically, most accelerators focus only on the compute-intensive operations of neural networks (e.g., convolution and fully-connected layers). However, we identify that sub-optimal handling of <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">auxiliary operations</italic> such as embedding and compression can incur non-trivial losses in accuracy, training speed, and adaptability to new domains. The problem is likely to persist, as recent advances in neural networks often come from auxiliary operations. To effectively handle rapidly evolving auxiliary operations and maximize acceleration opportunities, we propose DLS, a holistic neural network acceleration system built on heterogeneous computing devices. The key idea is to run compute-intensive operations on highly specialized ASICs for maximum performance, and auxiliary operations on flexible devices (e.g., FPGAs, GPUs) for better adaptability. We emphasize that a naïve integration of different devices fails to deliver high performance due to high communication overheads. To address this inefficiency, we propose an efficient FPGA-based device orchestration scheme that utilizes direct device-to-device communication and fine-grained operation scheduling. In this way, our system alleviates the communication overhead between heterogeneous devices by removing expensive kernel-stack traversals and exploiting computation units and communication links in parallel.
Our evaluation on popular neural networks with emerging auxiliary operations shows that DLS achieves both flexibility and high performance across a range of settings, from single-accelerator training to distributed training (2.6–8.9× speedup).