Using machine learning to monitor the health condition of industrial assets can improve the effectiveness of maintenance activities. However, accurately recognizing an asset’s current degradation stage typically requires time-consuming manual feature extraction, because of the wide range of observable measures (e.g., temperature, vibration) and behaviors that characterize degradation. To address this issue, feature learning can transform minimally processed time series into informative features, i.e., features that simplify the classification task (e.g., recognizing degradation stages) regardless of the specific machine learning classifier employed. In this work, an autoencoder-based architecture processes minimally preprocessed vibration and temperature time series of industrial bearings to extract degradation-representative features, which are then used to recognize the bearings’ degradation stages. Different autoencoder architectures are employed to compare their data fusion strategies. The approach is evaluated in terms of recognition performance and quality of the learned features on a publicly available real-world dataset, and is compared against a state-of-the-art feature learning technology. We tested three multimodal autoencoder-based feature learning approaches: shared-input autoencoder (SAE), multimodal autoencoder (MMAE), and partition-based autoencoder (PAE). All three AE-based architectures achieve classification performance greater than or comparable to that of the state-of-the-art feature learning technology, despite being trained in an unsupervised fashion. Moreover, the features provided by PAE yield the best performance in recognizing the bearings’ degradation stage and are of high quality from both a classification and a clustering perspective.
Unsupervised feature learning methodologies based on multimodal autoencoders are capable of learning high-quality features. These features yield better degradation stage recognition performance than a supervised state-of-the-art feature learning technology. They also correctly represent the expected progressive degradation of the bearing.
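The general idea of learning features from fused vibration and temperature signals without labels can be illustrated with a minimal sketch. It is not the SAE, MMAE, or PAE architecture evaluated in this work: it assumes a tiny linear autoencoder, fusion by simple concatenation, and synthetic data. All names, window sizes, and hyperparameters (`make_sample`, `VIB_LEN`, `LATENT`, `lr`, and so on) are hypothetical and chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical window lengths and latent size (not taken from the paper).
VIB_LEN, TEMP_LEN, LATENT = 4, 2, 2
INPUT_LEN = VIB_LEN + TEMP_LEN


def make_sample(wear):
    """Synthetic stand-in for one bearing snapshot; wear lies in [0, 1]."""
    vibration = [wear + random.uniform(-0.05, 0.05) for _ in range(VIB_LEN)]
    temperature = [0.5 * wear + random.uniform(-0.05, 0.05) for _ in range(TEMP_LEN)]
    return vibration + temperature  # assumed fusion: plain concatenation


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_weights():
    enc = [[random.uniform(-0.1, 0.1) for _ in range(INPUT_LEN)] for _ in range(LATENT)]
    dec = [[random.uniform(-0.1, 0.1) for _ in range(LATENT)] for _ in range(INPUT_LEN)]
    return enc, dec


def mean_loss(enc, dec, data):
    """Mean squared reconstruction error per sample."""
    total = 0.0
    for x in data:
        recon = matvec(dec, matvec(enc, x))
        total += sum((r - v) ** 2 for r, v in zip(recon, x))
    return total / len(data)


def train(enc, dec, data, epochs=200, lr=0.01):
    """Unsupervised training: only reconstruction error is used, no labels."""
    for _ in range(epochs):
        for x in data:
            z = matvec(enc, x)  # latent features
            err = [r - v for r, v in zip(matvec(dec, z), x)]
            # Gradient w.r.t. the latent code, computed before dec is updated.
            grad_z = [2 * sum(err[i] * dec[i][k] for i in range(INPUT_LEN))
                      for k in range(LATENT)]
            for i in range(INPUT_LEN):
                for k in range(LATENT):
                    dec[i][k] -= lr * 2 * err[i] * z[k]
            for k in range(LATENT):
                for j in range(INPUT_LEN):
                    enc[k][j] -= lr * grad_z[k] * x[j]


# Synthetic run-to-failure trajectory with steadily increasing wear.
data = [make_sample(i / 39) for i in range(40)]
enc, dec = init_weights()
initial_loss = mean_loss(enc, dec, data)
train(enc, dec, data)
final_loss = mean_loss(enc, dec, data)

# Learned features that any downstream classifier could consume.
features = [matvec(enc, x) for x in data]
```

In this sketch the encoder output `features` plays the role of the learned representation: it is produced without degradation-stage labels and could be passed to any classifier. A faithful reproduction of the compared architectures would need the network definitions and fusion strategies given in the full paper.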