Round-robin exercises have traditionally been laborious to arrange in non-destructive testing (NDT). The exercises have involved manufacturing costly, large mock-ups and then distributing them around the world to facilitate testing by numerous laboratories. This has limited both the number of such round robins and their scope. Often the round robins have contained only a small number of flaws, and the representativeness of these flaws has been limited. Nevertheless, the few round robins that have been completed have yielded significant additional understanding of the capability of the NDT methods and procedures used.

Recently, the increased use of automated inspections, together with the development of virtual flaws (independently by Trueflaw and EPRI), has enabled a new type of round robin, where instead of moving samples around the world, the round robin focuses on data analysis and only pre-acquired data files are distributed. In 2019–2020, a first-of-a-kind virtual round robin (VRR) was completed. The round robin allowed, for the first time, comparison of inspection performance between teams around the world with a statistically significant number of flaws and with ultrasonic data representative of nuclear dissimilar metal weld inspection. The study resulted in important new insight into NDE reliability for nuclear applications.

However, as a first-of-a-kind study, the first virtual round robin also contained some significant limitations. In particular, the distributed data sets were limited in size in order to reduce the effort required from each participating inspector. The reduced amount of acquired data was compensated for by pre-optimized data gathering, which was possible only with prior knowledge of the flaws present. While these choices were well justified for the first round robin, they also made direct comparison between VRR results and real-life inspector performance problematic. In addition, the first VRR focused primarily on flaw detection, and the data was insufficient for sizing.

To address these shortcomings of the first round robin, a second round robin was completed in 2021–2022. In this second round robin, more representative data was used for evaluation. In addition, increased emphasis was placed on hard-to-detect small flaws to gain improved insight into detectability, especially at the low end.

The more representative data required significantly more effort from the inspectors, which reduced participation compared to the first round robin. Furthermore, the emphasis on difficult-to-detect cracks may have further deterred participation, as the exercise may have been seen as too challenging. While the number of downloaded data sets (23) was similar to that of the previous exercise, the number of returned sets fell to 5, compared to the previous 18. Despite the smaller-than-expected participation, the results revealed several interesting features. The results displayed marked variation. Also, the false call rate was significantly reduced compared to the previous study. This could be attributed to the richer data set, which allowed more comprehensive evaluation and exclusion of potential false calls.

Recent advances in machine learning (ML) for ultrasonics also introduced an interesting opportunity to compare machine learning results with those of human inspectors. Developing an optimized machine learning model for the present data was outside the scope of this study. Instead, an independently developed, if somewhat sub-optimal, model was used.
Thus, the results should not be taken as a measure of ML performance as such. Nevertheless, the comparison between the human results and the ML model is informative and illustrates the potential benefits of automated data evaluation.