Abstract

Supervised machine learning generally requires pre-labelled data. Although several open-access, pre-annotated datasets are available for training machine learning algorithms, most contain a limited number of object classes and may not suit specific tasks. Because previously available pre-annotated data are rarely sufficient for custom models, most real-world applications require collecting and preparing training data. There is an obvious trade-off between annotation quality and quantity: time and resources can be allocated to ensuring superior data quality or to increasing the quantity of annotated data. We test the degree to which annotation errors degrade model performance. We conclude that while results deteriorate when annotations are erroneous, the effect, at least with relatively homogeneous sequential video data, is limited. The benefit of a larger annotated dataset (created with imperfect auto-annotation methods) outweighs the deterioration caused by annotation errors.
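The quality-versus-quantity experiment described above can be illustrated with a minimal sketch. The code below is not the paper's method; it is an assumed toy setup: two Gaussian clusters stand in for the training frames, `flip_labels` injects synthetic annotation errors at a given rate, and a nearest-centroid classifier measures how accuracy degrades as the error rate rises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for labelled video frames: two Gaussian clusters in 2-D.
n = 500
X = np.concatenate([rng.normal(0, 1, (n, 2)), rng.normal(3, 1, (n, 2))])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

def flip_labels(y, error_rate, rng):
    """Simulate annotation errors by flipping a fraction of binary labels."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(error_rate * len(y)), replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy

def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Fit a nearest-centroid classifier and return test accuracy."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((X_test[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    return (pred == y_test).mean()

# Fresh test set drawn from the same two clusters, with clean labels.
X_test = np.concatenate([rng.normal(0, 1, (n, 2)), rng.normal(3, 1, (n, 2))])
y_test = y.copy()

for rate in (0.0, 0.1, 0.3):
    acc = centroid_accuracy(X, flip_labels(y, rate, rng), X_test, y_test)
    print(f"label-error rate {rate:.0%}: test accuracy {acc:.3f}")
```

In this toy setting symmetric label flips shift both class centroids toward each other by roughly the same amount, so the decision boundary barely moves and accuracy degrades only mildly, mirroring the abstract's observation that the impact of annotation errors is limited.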
