Lung disease classification using machine learning algorithms

Murat Aykanat, Bahar Kurt, Özkan Kiliç, Sevgi Behiye Saryal

doi:10.18100/ijamec.799363

Abstract

In this study, we compared support vector machine (SVM), k-nearest neighbor (k-NN), and Gaussian Bayes (GB) algorithms for classifying respiratory diseases from text and audio data. An electronic stethoscope and its software were used to record patient information and 17,930 lung sounds from 1,630 subjects. The SVM, k-NN, and GB algorithms were run on six datasets to classify patients as: (1) sick or healthy with text data, (2) sick or healthy with audio MFCC features, (3) sick or healthy with text data and audio MFCC features, (4) one of 12 diseases with text data, (5) one of 12 diseases with audio MFCC features, and (6) one of 12 diseases with text data and audio MFCC features. The respective accuracies were 75%, 88%, 64%, 73%, 63%, and 70% for SVM; 95%, 92%, 92%, 67%, 64%, and 66% for k-NN; and 98%, 91%, 97%, 58%, 48%, and 58% for GB. In the 12-class classification of lung diseases, SVM was the most accurate with text data, k-NN was the most accurate with audio data, and SVM was the most accurate with combined audio and text data. In classifying healthy versus sick, GB was the most accurate with text data and with combined data, closely followed by k-NN, and was within one percentage point of k-NN with audio data (91% versus 92%). We infer that, with a large number of features but a limited number of samples, SVM and k-NN are best suited to classifying a dataset into more than two classes, whereas GB is best for classifying into two classes.
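The comparison above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the study itself: it tabulates the accuracies exactly as reported in the abstract and picks the top-scoring algorithm for each of the six datasets. The dataset labels and the helper name `best_per_dataset` are hypothetical, introduced only for illustration.

```python
# Accuracies (%) as reported in the abstract, in dataset order (1) to (6).
reported = {
    "SVM": [75, 88, 64, 73, 63, 70],
    "k-NN": [95, 92, 92, 67, 64, 66],
    "GB": [98, 91, 97, 58, 48, 58],
}

# Hypothetical short labels for the six datasets described in the abstract.
datasets = [
    "2-class, text",
    "2-class, audio MFCC",
    "2-class, text + audio MFCC",
    "12-class, text",
    "12-class, audio MFCC",
    "12-class, text + audio MFCC",
]


def best_per_dataset(scores, names):
    """Return {dataset label: (best algorithm, its accuracy)}."""
    best = {}
    for i, name in enumerate(names):
        # Pick the algorithm whose i-th reported accuracy is highest.
        algo = max(scores, key=lambda a: scores[a][i])
        best[name] = (algo, scores[algo][i])
    return best


winners = best_per_dataset(reported, datasets)
for name, (algo, acc) in winners.items():
    print(f"{name:28s} {algo:5s} {acc}%")
```

On these figures, the audio-only two-class task is the one case where the margin is a single percentage point (k-NN at 92% versus GB at 91%), so the ranking there is sensitive to small changes in the reported accuracy.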
