In recent years, gait recognition has emerged as a promising solution for human identification. Gait recognition systems typically rely on a single type of sensor, such as a camera or a radar. However, a single modality may capture only incomplete gait features: camera data lack intuitive micro-motion information, while radar data lack gait appearance information. This makes gait-based human identification systems vulnerable to complex covariate conditions, e.g., cross-view and cross-walking-condition settings. To build a robust and reliable gait-based human identification system, in this study we propose a multi-sensor gait recognition framework with deep convolutional neural networks (CNNs) that fuses camera gait energy images (GEIs) and radar time-Doppler spectrograms. To learn fine-grained gait appearance features, we propose a body-part spatial attention (BPSA) module that extracts discriminative body-part representations from GEIs. To learn gait micro-motion patterns, we propose a long-short temporal relation modeling (LSTRM) module that extracts local and global micro-motion representations from time-Doppler spectrograms. Finally, we fuse the discriminative body-part representations and the micro-motion patterns in a multi-scale feature space to obtain richer and more robust gait features for human identification. We provide an extensive empirical evaluation under various complex covariate conditions, namely cross-view and cross-walking-condition settings. Experiments on camera and radar data from 121 subjects, covering eight views and three walking conditions, show that the proposed method is more robust and accurate than single-modality approaches.
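
For concreteness, the following is a minimal PyTorch sketch of the two-branch design described above. The abstract does not specify the exact BPSA and LSTRM architectures, layer sizes, or the multi-scale fusion scheme, so every module below (horizontal-strip attention over GEI feature maps, 1-D temporal convolutions over the spectrogram, single-scale concatenation) is an illustrative assumption rather than the paper's implementation; in particular, the sketch fuses the two branches at a single scale for brevity, whereas the paper describes multi-scale fusion.

```python
# Illustrative two-branch fusion sketch. All module designs and sizes are
# assumptions for exposition, not the paper's actual architecture.
import torch
import torch.nn as nn


class BodyPartSpatialAttention(nn.Module):
    """Assumed BPSA design: re-weight horizontal body-part strips of a GEI
    feature map with learned attention scores."""
    def __init__(self, channels, num_parts=4):
        super().__init__()
        self.num_parts = num_parts
        self.score = nn.Linear(channels, 1)  # one score per body-part strip

    def forward(self, x):                                  # x: (B, C, H, W)
        strips = x.chunk(self.num_parts, dim=2)            # split along height
        pooled = torch.stack([s.mean(dim=(2, 3)) for s in strips], dim=1)  # (B, P, C)
        attn = torch.softmax(self.score(pooled), dim=1)    # (B, P, 1)
        return (pooled * attn).flatten(1)                  # (B, P*C)


class LongShortTemporalRelation(nn.Module):
    """Assumed LSTRM design: local (short-range) temporal relations via 1-D
    convolution plus a global (long-range) pooled temporal context."""
    def __init__(self, in_ch=1, channels=64):
        super().__init__()
        self.doppler = nn.Conv2d(in_ch, channels, kernel_size=3, padding=1)
        self.local = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.glob = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):                                  # x: (B, 1, doppler, time)
        f = torch.relu(self.doppler(x)).mean(dim=2)        # collapse Doppler: (B, C, T)
        short = torch.relu(self.local(f)).mean(dim=2)      # pooled local relations
        global_ctx = self.glob(f).squeeze(2)               # global temporal context
        return torch.cat([short, global_ctx], dim=1)       # (B, 2C)


class FusionGaitNet(nn.Module):
    """Fuses GEI body-part features with spectrogram micro-motion features
    (single-scale concatenation; the paper fuses at multiple scales)."""
    def __init__(self, num_ids=121, channels=64, num_parts=4):
        super().__init__()
        self.gei_backbone = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.bpsa = BodyPartSpatialAttention(channels, num_parts)
        self.lstrm = LongShortTemporalRelation(1, channels)
        self.classifier = nn.Linear(num_parts * channels + 2 * channels, num_ids)

    def forward(self, gei, spec):
        appearance = self.bpsa(self.gei_backbone(gei))     # body-part features
        motion = self.lstrm(spec)                          # micro-motion features
        return self.classifier(torch.cat([appearance, motion], dim=1))


if __name__ == "__main__":
    model = FusionGaitNet()
    gei = torch.randn(2, 1, 64, 64)        # batch of GEIs
    spec = torch.randn(2, 1, 128, 96)      # batch of time-Doppler spectrograms
    print(model(gei, spec).shape)          # torch.Size([2, 121])
```

The design choice worth noting is that each branch is reduced to a fixed-length vector before fusion, so the two sensors can be combined by simple concatenation regardless of their differing input geometries; richer fusion at several backbone depths, as the abstract indicates, would follow the same pattern applied at multiple feature scales.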