Abstract

Generative models, such as autoencoders, generative adversarial networks, and diffusion models, have become an integral part of innovation in many fields in recent years, including art, design, and medicine. Because they can create new data samples, they open broad opportunities for automation and process improvement. However, assessing the quality of generated data remains a challenging task, as traditional methods do not always adequately reflect the diversity and realism of the generated samples. This is particularly true for partial data generation, where changes are applied only to specific parts of an image, which significantly complicates quality assessment. This work examines various approaches to evaluating generative models, including automatic metrics such as Inception Score and Fréchet Inception Distance, precision, recall, density, and coverage, as well as human-in-the-loop methods such as HYPE. While these metrics have proven effective for evaluating the results of traditional generation, their limitations can make them unsuitable for partially generated data. To address this issue, the paper proposes a new human-in-the-loop method for evaluating partially generated data. Users inspect transformed images and mark the areas they believe have been altered; the quality of the generation is then quantified with precision, recall, and F1-score, computed by matching the actual altered regions against the user-selected regions via intersection over union (IoU). The proposed approach provides a more objective assessment of the realism and quality of generated image fragments produced by such transformations. A practical example of the developed method is presented on a dataset of panoramic dental images, where three models were evaluated: 1) a GAN based on a U-generator; 2) the same model with post-processing of the output image and segmentation mask; and 3) a self-validated GAN. The evaluation was performed by 30 participants. The average F1-scores for these models were 0.78, 0.27, and 0.20, respectively. Since lower F1-scores indicate better results in this setting (the more accurately users identified the transformations, the worse the model performed), the best model by this metric is the self-validated GAN, which is also supported by the subjective assessments reported in the authors' work.
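
As a rough illustration of the scoring scheme described above, the following Python sketch (not the authors' implementation) matches user-marked regions to ground-truth altered regions by IoU and derives precision, recall, and F1-score. The function names, the 0.5 IoU threshold, and the greedy matching strategy are assumptions made for illustration only.

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union between two boolean region masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

def evaluate_user_annotations(true_regions, user_regions, iou_threshold=0.5):
    """
    true_regions: list of boolean masks of actually altered areas.
    user_regions: list of boolean masks of areas marked by one user.
    A user-marked region counts as a true positive if its IoU with an
    unmatched ground-truth region meets the threshold (assumed 0.5 here).
    Returns (precision, recall, f1) for a single image and user.
    """
    matched_true = set()
    true_positives = 0
    for user_mask in user_regions:
        # Greedily match this user region to the best remaining true region.
        best_idx, best_iou = None, 0.0
        for i, true_mask in enumerate(true_regions):
            if i in matched_true:
                continue
            score = iou(user_mask, true_mask)
            if score > best_iou:
                best_idx, best_iou = i, score
        if best_idx is not None and best_iou >= iou_threshold:
            matched_true.add(best_idx)
            true_positives += 1

    precision = true_positives / len(user_regions) if user_regions else 0.0
    recall = true_positives / len(true_regions) if true_regions else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1
```

Per-image, per-user scores of this kind would then be averaged over all users and images to obtain model-level F1-scores such as those reported above.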
