Some pathologies, such as cancer and dementia, require multiple imaging modalities to be fully diagnosed and to assess the extent of the disease. Magnetic resonance imaging offers this kind of versatility, but examinations take time and can require contrast agent injection. Flexibly synthesizing the missing imaging sequences from those available for a given patient could help reduce scan times or circumvent the need for contrast agent injection. In this work, we propose a deep learning architecture that can synthesize all missing imaging sequences from any subset of available images. The network is trained adversarially, with a generator consisting of parallel 3D U-Net encoders and decoders whose multi-resolution representations are combined by a fusion operation learned by an attention network trained jointly with the generator. We compare our synthesis performance with that of 3D networks using other fusion schemes and a comparable number of trainable parameters, such as mean/variance fusion. In all synthesis scenarios except one, the network using attention-guided fusion outperformed the other fusion schemes. We also inspect the encoded representations and the attention network outputs to gain insight into the synthesis process, and uncover desirable behaviors such as the prioritization of specific modalities, the flexible construction of the representation when important modalities are missing, and modalities being selected in regions where they carry sequence-specific information. This work suggests that a better construction of the latent representation space in hetero-modal networks can be achieved with an attention network.
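To make the fusion step concrete: the abstract does not specify implementation details, so the following is a minimal sketch, assuming per-voxel attention weights normalized with a softmax over the modality axis, contrasted with the mean/variance baseline mentioned above. The function names (attention_fusion, mean_variance_fusion), array shapes, and the softmax normalization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def attention_fusion(features, attention_logits):
    """Fuse per-modality encoder features with attention weights.

    features:         (n_modalities, C, D, H, W) feature maps, one per
                      available input modality
    attention_logits: (n_modalities, 1, D, H, W) scores produced by an
                      attention network (assumed per-voxel here)
    """
    # Softmax over the modality axis yields per-voxel fusion weights
    # that sum to one across whichever modalities are available,
    # so the same operation works for any subset of inputs.
    w = np.exp(attention_logits - attention_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    return (w * features).sum(axis=0)            # (C, D, H, W)

def mean_variance_fusion(features):
    """Baseline: concatenate the mean and variance of the per-modality
    feature maps along the channel axis (fixed, not learned)."""
    return np.concatenate([features.mean(axis=0),
                           features.var(axis=0)], axis=0)  # (2C, D, H, W)

# Example: three available modalities, 8-channel features on a 4^3 grid.
feats = np.random.randn(3, 8, 4, 4, 4)
logits = np.random.randn(3, 1, 4, 4, 4)
fused = attention_fusion(feats, logits)          # (8, 4, 4, 4)
```

Because the attention weights are renormalized over the available modalities at each voxel, the fused representation keeps a fixed shape regardless of how many inputs are present, which is what allows a single network to handle any subset of imaging sequences.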