This paper presents a 3-D integrated circuit (3-D IC) for heterogeneous domain-specific streaming architectures. In such architectures, an array of fine-grained accelerators executes kernels, and applications are mapped by configuring the accelerators into a desired computation pipeline. The two-layer 3-D IC addresses architectures for different application domains through a generic routing-and-memory (RM) layer and a separate compute-accelerator (CA) layer, which could ultimately be selected at assembly time for different application domains. The RM layer provides a configurable routing network, as well as memory for pipeline buffering and computation scratchpad. The routing network is based on a 2-D mesh with low-swing signaling. The memory is organized as 32 fine-grained (1-kB) SRAM tiles for increased interface parallelism, reduced access energy, and modularity in interfacing with different accelerators in the CA layer. Memory-driver and sensing circuits are reused by the low-swing routing network, both for repeaters and for directly loading pipeline data into accelerator input buffers. For the prototype, the CA layer is implemented as an array of multiplexers, providing off-chip interfacing to any memory tile, thereby enabling different accelerators to be emulated by an off-chip field-programmable gate array (FPGA). The 3-D interconnection is achieved by 8-$\mu\text{m}$-pitch face-to-face (F2F) vias and wafer-level assembly. For the $2.47 \times 3.38$ mm$^2$ two-layer die, implemented in 130-nm CMOS, the total peak memory bandwidth is 9.2 GB/s/mm$^2$. A compute pipeline for computational photography is demonstrated, with the total energy of the accelerators reduced by over $2\times$ by exploiting the parallelism enabled by interfaces to fine-grained RM-layer memory.