Abstract

Machine learning algorithms are used to discover complex nonlinear relationships in biomedical data. However, sophisticated learning models become computationally infeasible as the dimensionality of the data increases. One solution to this problem is feature selection. Feature selection methods search for an optimal feature subset and evaluate each candidate subset against an evaluation criterion; they are categorized as filter, wrapper, embedded, and hybrid approaches. Although these methods reduce the dimensionality of the data, training time still grows with the size of the dataset. In addition, data is now commonly stored in the cloud, so the first step before applying machine learning algorithms is to copy the data to a local machine, which can take a long time when the dataset is large. To overcome these problems, we propose a pipeline that runs on an AWS cloud-based distributed architecture and performs feature selection, training, and classification. We define an evaluation criterion that measures the performance of a feature subset based on its classification accuracy and its size. The experiments were carried out on two chest X-ray datasets (Shenzhen and NIH) in which each image has been clinically assessed as normal or abnormal. Using the hybrid feature selection approach with the defined evaluation criterion, we achieved a classification accuracy of 84.24% on the Shenzhen dataset and 79.55% on the NIH dataset for classifying chest X-ray images as normal or abnormal, while reducing the feature subset size by more than 50%.
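
The abstract does not state the exact form of the evaluation criterion, only that it depends on classification accuracy and on the size of the feature subset. As a minimal sketch of what such a criterion might look like, the Python snippet below uses a weighted sum of accuracy and the fraction of features removed. The function name `subset_score`, the weight `alpha`, and the candidate values are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
def subset_score(accuracy: float, n_selected: int, n_total: int,
                 alpha: float = 0.9) -> float:
    """Score a feature subset by accuracy and by how much it shrinks the feature set.

    Assumed form (not the paper's definition):
        score = alpha * accuracy + (1 - alpha) * (1 - n_selected / n_total)
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if not 0 < n_selected <= n_total:
        raise ValueError("n_selected must lie in (0, n_total]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Fraction of the original features that the subset discards.
    reduction = 1.0 - n_selected / n_total
    return alpha * accuracy + (1.0 - alpha) * reduction


# Made-up candidates: (classification accuracy, number of selected features).
N_TOTAL = 1000
candidates = {
    "all_features": (0.85, 1000),
    "half_features": (0.84, 450),
}

# Pick the candidate with the highest combined score.
best = max(
    candidates,
    key=lambda name: subset_score(candidates[name][0], candidates[name][1], N_TOTAL),
)
print(best)
```

Under this assumed weighting, a subset that gives up a small amount of accuracy but discards more than half of the features can outscore the full feature set. Raising `alpha` toward 1 makes the criterion favor accuracy alone.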

