Geographical object classification and information extraction are important for constructing 3D virtual reality and digital twin cities in urban areas. However, most current multi-target classification of urban scenes relies on a single data source (optical remote sensing images or airborne laser scanning (ALS) point clouds) and is therefore limited by the information that source carries. Moreover, making full use of the information carried by multiple data sources often requires setting more parameters and adding algorithmic steps. To address these issues, we systematically compared and analyzed object classification methods based on the fusion of airborne LiDAR point clouds and optical remote sensing images. First, features were extracted from the airborne LiDAR point clouds and high-resolution optical images. Second, key feature sets were selected, comprising the median absolute deviation of elevation, normalized elevation values, texture features, normal vectors, etc., and were fed into various classifiers, such as random forest (RF), decision tree (DT), and support vector machine (SVM). Third, suitable feature sets of appropriate dimensionality were composed, and the point clouds were classified into four categories: trees (Tr), houses and buildings (Ho), low-growing vegetation (Gr), and impervious surfaces (Is). Finally, single and multiple data sources, the crucial feature sets and their roles, and the resulting accuracy of the different classifier models were compared and analyzed.
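As a rough illustration of feature-level fusion, the sketch below computes two of the LiDAR features named above for a single point and concatenates them with image-derived features into one vector. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names (`mad`, `fuse_features`), the definition of normalized elevation as height above a local ground estimate, the neighborhood values, and the two image features are all hypothetical.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median(|v - median(values)|)."""
    m = median(values)
    return median(abs(v - m) for v in values)


def fuse_features(neighborhood_z, ground_z, point_z, image_features):
    """Build one fused feature vector for a single LiDAR point.

    Assumption: normalized elevation is taken as the point height above a
    local ground estimate; the study may define it differently.
    """
    normalized_elevation = point_z - ground_z
    lidar_features = [mad(neighborhood_z), normalized_elevation]
    # Feature-level fusion: concatenate LiDAR and image features per point.
    return lidar_features + list(image_features)


# Made-up elevations (metres) of a point's neighbors, for illustration only.
neighborhood_z = [12.0, 12.5, 13.0, 18.0, 12.2]
# Hypothetical image-derived values (e.g. two texture measures).
vector = fuse_features(neighborhood_z, ground_z=10.0, point_z=12.5,
                       image_features=[0.42, 0.17])
print(vector)
```

A vector of this kind, built for every point, is what a classifier such as RF, DT, or SVM would take as input.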
Across different experimental regions, sampling proportions, and machine learning models, the results showed that: (1) the feature-level data fusion method achieved an overall classification accuracy of 76.2%, an improvement of more than 2% over the results obtained from a single data source; (2) the accuracy of the four classes in the urban scenes can reach 88.5% (Is), 76.7% (Gr), 87.2% (Tr), and 88.3% (Ho), respectively, while the overall classification accuracy can reach 87.6% with optimal sampling parameters and the random forest classifier; (3) the RF classifier outperformed DT and SVM under the same sample conditions. The proposed method, based on the fusion of ALS point clouds and image data, can accurately classify multiple targets in urban scenes and can provide technical support for 3D scene reconstruction and digital twin cities in complex geospatial environments.
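For readers unfamiliar with the reported metrics, the sketch below shows one common way to derive overall and per-class accuracy from a confusion matrix. It is an illustrative sketch only: the counts are made up and are not results from this study, the function names are hypothetical, and per-class accuracy is assumed here to mean the fraction of reference samples of each class that were labeled correctly, which may differ from the definition used in the paper.

```python
def overall_accuracy(cm):
    """Correctly classified samples divided by all samples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_accuracy(cm, labels):
    """Row-wise accuracy: correct samples of a class / reference samples of it.

    Assumption: rows are reference classes, columns are predicted classes.
    """
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


labels = ["Is", "Gr", "Tr", "Ho"]
# Made-up counts for illustration only (rows: reference, columns: predicted).
cm = [
    [90, 5, 2, 3],
    [10, 80, 8, 2],
    [3, 7, 85, 5],
    [4, 1, 5, 90],
]

print(overall_accuracy(cm))
print(per_class_accuracy(cm, labels))
```

Computing the same two quantities for each classifier and each feature set gives the kind of side-by-side comparison summarized in points (1) to (3).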