Diagnostic methods for cardiovascular disease based on heart sound classification have been widely investigated for their noninvasiveness, low cost, and high efficiency. Most current studies extract features from heart sound signals with either manually designed functions or deep learning-based methods, but heart sound signals exhibit highly nonstationary and complex patterns owing to environmental noise and differences between stethoscopes, so a single feature extraction method does not yield a good feature representation. Moreover, deep learning-based feature extractors for heart sounds usually rely on 1D convolution or 2D convolution alone, which limits the network's ability to extract discriminative features. In addition, many studies ignore feature redundancy and the interpretability of decisions, which hurts both the performance and the efficiency of the models. To address these problems, this paper first proposes a new convolutional neural network, the 1D + 2D convolutional neural network (1D + 2D-CNN), as a deep learning feature extractor that combines 1D and 2D convolution. The 1D + 2D-CNN contains two branches whose feature maps are concatenated along the channel dimension, and a 10-layer convolutional network with an attention mechanism is then introduced to enhance the network's feature extraction capability. Second, the advantages and disadvantages of combining deep learning features with manual features in different scenarios are explored. In addition, the per-class mean and variance of each feature dimension are computed, and feature selection is performed by scoring the importance of each dimension with a simple statistical formula. Finally, an evolving fuzzy system is used to classify the heart sound signals, as it provides interpretability for its decisions.
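The class-wise statistical feature selection described above can be sketched as follows. The abstract does not give the exact formula, so this uses a Fisher-style score (squared gap between class means over the sum of class variances) as one plausible instance; the function names and the toy data are illustrative, not from the paper.

```python
import numpy as np

def feature_scores(X, y, eps=1e-8):
    """Score each feature dimension from per-class means and variances.

    A Fisher-style criterion is assumed here: dimensions whose class means
    are far apart relative to their within-class spread score highly.
    """
    pos, neg = X[y == 1], X[y == 0]
    mean_gap = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    spread = pos.var(axis=0) + neg.var(axis=0) + eps
    return mean_gap / spread

def select_top_k(X, y, k):
    """Keep the k highest-scoring feature dimensions."""
    idx = np.argsort(feature_scores(X, y))[::-1][:k]
    return np.sort(idx)

# Toy data: dimension 0 separates the classes, dimension 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 5.0            # shift class 1 along dimension 0
kept = select_top_k(X, y, k=1)  # the discriminative dimension survives
```

A score of this form is cheap to compute (one pass over the data per class) and, unlike wrapper methods, needs no repeated model retraining.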
For the experiments, the 2016 PhysioNet/CinC Challenge dataset (PCCD) and our collected, publicly available pediatric heart sound dataset (PHSD) are used to evaluate the model with 10-fold cross-validation. The proposed model achieves accuracies of 96.3% and 99.1% on the two datasets, respectively, reaching the state-of-the-art level. We also release the code of our algorithms.
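The 10-fold cross-validation protocol can be sketched as follows; the fold construction shown here is a minimal, generic version (shuffled indices split into ten disjoint folds), not the paper's exact partitioning code.

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

# Each fold serves once as the test set; the rest form the training set.
folds = kfold_indices(100, k=10)
test_idx = folds[0]                      # held-out fold for round 0
train_idx = np.concatenate(folds[1:])    # remaining nine folds
```

Per-fold accuracies are then averaged over the ten rounds, so every recording is used exactly once for testing.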
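As an illustration of the two-branch fusion at the core of the 1D + 2D-CNN, the sketch below concatenates the feature maps of a 1D branch and a 2D branch along the channel axis. The channel counts, spatial sizes, and the reshape of the 1D map onto the 2D grid are assumptions for illustration only; the paper's actual layer dimensions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D branch: feature map from convolving the raw heart sound waveform,
# reshaped onto the spatial grid of the 2D branch so channels can stack.
feat_1d = rng.standard_normal((32, 512))       # (channels, time) -- assumed sizes
feat_1d = feat_1d.reshape(32, 16, 32)          # assumed reshape to (C1, H, W)

# 2D branch: feature map from convolving a time-frequency image
# (e.g., a spectrogram) of the same recording.
feat_2d = rng.standard_normal((64, 16, 32))    # (C2, H, W) -- assumed sizes

# Channel-wise concatenation, as the text describes: (C1 + C2, H, W).
fused = np.concatenate([feat_1d, feat_2d], axis=0)
```

The fused map then feeds the subsequent attention-equipped convolutional layers, so both waveform-domain and time-frequency features are available to every later filter.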