In the forensic comparison sciences, experts are required to compare samples of known and unknown origin to evaluate the strength of the evidence under the competing propositions that they came from the same source and from different sources. The application of valid (whether the method measures what it is intended to measure) and reliable (whether the method produces consistent results) forensic methods is required in many jurisdictions, for example under the England & Wales Criminal Practice Directions 19A and by the UK Crown Prosecution Service, and was highlighted in the 2009 National Academy of Sciences report and the 2016 report of the President's Council of Advisors on Science and Technology. The current study uses simulation to examine the effect of the number of speakers and of sampling variability on the evaluation of validity and reliability using different generations of automatic speaker recognition (ASR) systems in forensic voice comparison (FVC). The results show that the state-of-the-art system had better overall validity than less advanced systems. However, better validity does not necessarily lead to higher reliability, and very often the opposite is true. Better system validity and higher discriminability can lead to a higher degree of uncertainty and inconsistency in the output (i.e. poorer reliability). This is particularly the case with a small number of speakers, as is common in FVC casework, where the observed data do not adequately support density estimation and extrapolation is required.
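To make the validity/reliability distinction concrete, the sketch below is a minimal Monte Carlo illustration of the kind of resampling study described above, not the authors' actual pipeline: the synthetic score distributions, the Gaussian score-to-log-LR calibration, and all sample sizes are assumptions introduced purely for illustration. It resamples calibration speakers at several population sizes and reports the mean Cllr across replicates (a proxy for validity) alongside its spread (a proxy for reliability).

```python
# Illustrative sketch only: synthetic scores and a simple Gaussian calibration
# stand in for a real ASR system and casework data.
import numpy as np

rng = np.random.default_rng(0)

def simulate_scores(n_speakers, n_per_speaker=10):
    """Simulate same- and different-speaker comparison scores.

    Each speaker gets a random offset; same-speaker scores sit higher on the
    score scale than different-speaker scores (purely synthetic)."""
    offsets = rng.normal(0.0, 1.0, n_speakers)
    same = np.concatenate([rng.normal(2.0 + o, 1.0, n_per_speaker) for o in offsets])
    diff = np.concatenate([rng.normal(-2.0 + o, 1.0, n_per_speaker) for o in offsets])
    return same, diff

def gaussian_llr(scores, same_cal, diff_cal):
    """Score-to-log-LR calibration using Gaussian densities fitted to the
    calibration scores (a simple stand-in for e.g. logistic-regression calibration)."""
    def logpdf(x, mu, sd):
        return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)
    return (logpdf(scores, same_cal.mean(), same_cal.std(ddof=1))
            - logpdf(scores, diff_cal.mean(), diff_cal.std(ddof=1)))

def cllr(llr_same, llr_diff):
    """Log-likelihood-ratio cost (Cllr), a standard validity metric in FVC."""
    return 0.5 * (np.mean(np.log2(1 + np.exp(-llr_same)))
                  + np.mean(np.log2(1 + np.exp(llr_diff))))

# Fixed test scores; vary the size of the calibration speaker sample.
test_same, test_diff = simulate_scores(n_speakers=200)
for n_cal_speakers in (10, 30, 100):
    cllrs = []
    for _ in range(200):                      # Monte Carlo replicates
        cal_same, cal_diff = simulate_scores(n_cal_speakers)
        llr_s = gaussian_llr(test_same, cal_same, cal_diff)
        llr_d = gaussian_llr(test_diff, cal_same, cal_diff)
        cllrs.append(cllr(llr_s, llr_d))
    cllrs = np.array(cllrs)
    # Mean Cllr ~ validity; spread across replicates ~ reliability.
    lo, hi = np.quantile(cllrs, [0.025, 0.975])
    print(f"{n_cal_speakers:4d} calibration speakers: "
          f"Cllr mean={cllrs.mean():.3f}, 95% range=[{lo:.3f}, {hi:.3f}]")
```

Under these toy assumptions, the mean Cllr changes little with the number of calibration speakers, while the spread of Cllr across replicates widens noticeably as the speaker sample shrinks, which is the pattern of good validity coexisting with poor reliability that the study highlights.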