Abstract

Accuracy assessment is an important step of classification and has become increasingly relevant with the rise of machine and deep learning techniques. We provided a method for quick model evaluation with several options: calculating class-level accuracy metrics for as many models and classes as needed, and calculating model stability using random subsets of the testing data. The outputs are single calculations, summaries of the repetitions, and/or all accuracy results per repetition. Using the application, we demonstrated the possibilities of the function and analyzed the accuracies of three experiments. We found that some popular metrics, the binary Overall Accuracy, Sensitivity, Precision, and Specificity, as well as the ROC curve, can give misleading results when true negative cases dominate. The F1-score, Intersection over Union, and the Matthews correlation coefficient were reliable in all experiments. Medians and interquartile ranges (IQR) of repeated sampling from the testing dataset showed that the IQR was small when a model was either almost perfect or completely unacceptable; thus, the IQR reflected model stability and reproducibility. We found no general, statistically justified relationship between the median and the IQR; furthermore, the correlations among accuracy metrics varied by experiment. Accordingly, a multi-metric evaluation is suggested instead of relying on a single metric.
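
The following is a minimal sketch, not the published function, of the evaluation idea described above: class-level metrics (here F1-score, Intersection over Union, and the Matthews correlation coefficient) are computed on repeated random subsets of the testing data and then summarized by their median and IQR as an indicator of model stability. Function and parameter names (`repeated_subset_metrics`, `subset_frac`, `n_repeats`) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score, matthews_corrcoef

def repeated_subset_metrics(y_true, y_pred, n_repeats=100, subset_frac=0.5, seed=0):
    """Evaluate binary metrics on random subsets of the test set and summarize stability."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    k = max(1, int(subset_frac * n))
    scores = {"F1": [], "IoU": [], "MCC": []}
    for _ in range(n_repeats):
        # draw a random subset of the testing data (without replacement)
        idx = rng.choice(n, size=k, replace=False)
        scores["F1"].append(f1_score(y_true[idx], y_pred[idx]))
        scores["IoU"].append(jaccard_score(y_true[idx], y_pred[idx]))
        scores["MCC"].append(matthews_corrcoef(y_true[idx], y_pred[idx]))
    # median = typical accuracy over repetitions; IQR = stability/reproducibility
    return {m: {"median": float(np.median(v)),
                "IQR": float(np.subtract(*np.percentile(v, [75, 25])))}
            for m, v in scores.items()}
```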
