Abstract

Big data validation and system verification are crucial for ensuring the quality of big data applications. However, a rigorous technique for such tasks is yet to emerge. During the past decade, we have developed a big data system called CMA for investigating the classification of biological cells based on cell morphology which is captured in diffraction images. CMA includes a collection of scientific software tools, machine learning algorithms, and a large-scale cell image repository. In order to ensure the quality of big data system CMA, we developed a framework for rigorously validating the massive scale image data as well as adequately verifying both the software tools and machine learning algorithms. The validation of big data is conducted by iteratively selecting the data using a machine learning approach. An experimental approach guided by a feature selection algorithm is introduced in the framework to select an optimal feature set for improving the machine learning performance. The verification of software and algorithms is developed on the iterative metamorphic testing approach due to the non-testable property of the software and algorithms. A machine learning approach is introduced for developing test oracles iteratively to ensure the adequacy of the test coverage criteria. Performance of the machine learning algorithm is evaluated with a stratified N-fold cross validation and confusion matrix. We describe the design of the proposed big data verification and validation framework with CMA as the case study, and demonstrate its effectiveness through verifying and validating the dataset, the software and the algorithms in CMA.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call