Accurately mapping tree species is crucial for forest management and conservation. Most previous studies have relied on features derived from optical imagery and digital elevation data, while the potential of synthetic aperture radar (SAR) imagery and other environmental factors has generally been underexplored. Therefore, the aim of this study is to evaluate the potential of fusing freely available multi-modal data for accurately mapping tree species. Sentinel-2, Sentinel-1, and various environmental datasets covering a large mountainous forest in Southwest China were obtained and analyzed using Google Earth Engine (GEE). Seven data cases considering the individual or joint performance of different feature sets, and four additional cases based on a novel clustering-based feature selection method, were analyzed. All 11 cases were assessed using three machine learning algorithms: random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). The best performance, with an overall accuracy of 77.98%, was attained with the case combining all features and the RF classifier. Sentinel-2 data alone exhibited performance similar to that of the environmental data in terms of overall accuracy. Spectrally similar species, such as oak and birch, could not be discriminated using Sentinel-2 features alone. The addition of SAR features improved discrimination, especially between some coniferous and deciduous species, but decreased the accuracy for oak. The analysis of the different data cases and feature importance rankings indicated that environmental features are important. The RF classifier outperformed the other models, and better predictions were achieved for planted tree species than for natural forests. These results suggest that accurately mapping tree species over large mountainous areas is feasible with freely accessible multi-modal data, especially when environmental factors are considered.
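To make the multi-modal fusion workflow concrete, the sketch below shows how Sentinel-2 reflectance, Sentinel-1 backscatter, and terrain features could be stacked and classified with a random forest in the GEE Python API. This is a minimal illustration rather than the authors' code: the collection IDs, date range, band selection, study-area geometry, and the labelled-point asset with a `species` property are all assumptions.

```python
# Minimal sketch of multi-modal feature fusion and RF classification in GEE.
# Collection IDs, bands, dates, the study-area rectangle, and the
# 'users/example/training_points' asset (with a 'species' label) are illustrative.
import ee

ee.Initialize()

region = ee.Geometry.Rectangle([99.0, 26.0, 100.0, 27.0])  # placeholder study area

# Sentinel-2 surface reflectance: annual median composite of selected bands
s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterBounds(region)
      .filterDate('2020-01-01', '2020-12-31')
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))
      .median()
      .select(['B2', 'B3', 'B4', 'B8', 'B11', 'B12']))

# Sentinel-1 C-band SAR: annual median VV/VH backscatter composite
s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(region)
      .filterDate('2020-01-01', '2020-12-31')
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH'))
      .median()
      .select(['VV', 'VH']))

# Environmental features: elevation, slope, and aspect derived from SRTM
dem = ee.Image('USGS/SRTMGL1_003')
terrain = ee.Terrain.products(dem).select(['elevation', 'slope', 'aspect'])

# Stack all modalities into one multi-band feature image
stack = s2.addBands(s1).addBands(terrain)
bands = stack.bandNames()

# Sample the stack at labelled points (hypothetical asset with a 'species' column)
points = ee.FeatureCollection('users/example/training_points')
training = stack.sampleRegions(collection=points, properties=['species'], scale=10)

# Train a random forest and classify the feature stack into a species map
rf = ee.Classifier.smileRandomForest(numberOfTrees=200).train(
    features=training, classProperty='species', inputProperties=bands)
species_map = stack.classify(rf)
```

The same sampled feature table could be exported and fed to SVM or XGBoost implementations for the classifier comparison described above; only the RF case is sketched here.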