Computing free energy differences between metastable states characterized by nonoverlapping Boltzmann distributions is often computationally intensive, usually requiring chains of intermediate states to connect them. Targeted free energy perturbation (TFEP) can significantly lower the computational cost of FEP calculations by using invertible maps that directly connect the distributions of interest, achieving the necessary statistically significant overlap without sampling any intermediate states. Probabilistic generative models (PGMs) based on normalizing flow architectures make it much easier to learn the invertible maps needed for TFEP. However, the accuracy and applicability of approaches based on empirically learned maps depend crucially on the reweighting method adopted to estimate the free energy differences. In this work, we assess the accuracy, rate of convergence, and data efficiency of different free energy estimators, including exponential averaging, Bennett acceptance ratio (BAR), and multistate Bennett acceptance ratio (MBAR), in reweighting PGMs trained by maximum likelihood on limited amounts of molecular dynamics data sampled only from the end states of interest. We carry out the comparisons on a set of simple but representative case studies, including conformational ensembles of alanine dipeptide and ibuprofen. Our results indicate that BAR and MBAR are both data efficient and robust, even in the presence of significant model overfitting in the generation of invertible maps. This analysis can serve as a stepping stone for the deployment of efficient and quantitatively accurate machine-learning-based free energy calculation methods in complex systems.
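To make the role of the estimator concrete, the following is a minimal sketch of targeted exponential averaging and targeted BAR on a toy problem. It is not the setup used in this work: the two end states are hypothetical one-dimensional harmonic wells in reduced units (kT = 1), and a hand-picked affine map `M(x) = s*x + b` stands in for a learned normalizing flow. All names (`K_A`, `K_B`, `X0`, `forward_work`, `reverse_work`, `estimate`) are illustrative assumptions. The generalized work of the map includes its log-Jacobian, and the exact answer for this toy is `0.5 * ln(K_B / K_A)`.

```python
import math
import random

# Hypothetical toy end states in reduced units (kT = 1); values are illustrative.
K_A, K_B, X0 = 1.0, 4.0, 3.0
EXACT_DF = 0.5 * math.log(K_B / K_A)  # F_B - F_A for two harmonic wells


def u_a(x):
    return 0.5 * K_A * x * x


def u_b(x):
    return 0.5 * K_B * (x - X0) ** 2


def forward_work(x, s, b):
    # Generalized work of M(x) = s*x + b: u_B(M(x)) - u_A(x) - ln|dM/dx|
    return u_b(s * x + b) - u_a(x) - math.log(abs(s))


def reverse_work(y, s, b):
    # Same quantity for the inverse map; -ln|dM^-1/dy| = +ln|s|
    return u_a((y - b) / s) - u_b(y) + math.log(abs(s))


def exp_estimate(w_f):
    # Exponential averaging, -ln <exp(-w)>, shifted by min(w) for stability
    m = min(w_f)
    return m - math.log(sum(math.exp(-(w - m)) for w in w_f) / len(w_f))


def fermi(x):
    # Numerically stable 1 / (1 + exp(x))
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar_estimate(w_f, w_r, tol=1e-10):
    # Solve Bennett's self-consistent equation by bisection.
    # g is monotonically increasing in df, so the root is unique.
    m = math.log(len(w_f) / len(w_r))

    def g(df):
        return (sum(fermi(m + w - df) for w in w_f)
                - sum(fermi(-m + w + df) for w in w_r))

    lo, hi = -1.0, 1.0
    while g(lo) > 0:  # widen the bracket until it contains the root
        lo *= 2.0
    while g(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimate(s, b, n=5000, seed=0):
    # Draw samples from both end states only, then reweight through the map
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0 / math.sqrt(K_A)) for _ in range(n)]
    ys = [rng.gauss(X0, 1.0 / math.sqrt(K_B)) for _ in range(n)]
    w_f = [forward_work(x, s, b) for x in xs]
    w_r = [reverse_work(y, s, b) for y in ys]
    return exp_estimate(w_f), bar_estimate(w_f, w_r)


if __name__ == "__main__":
    print("exact        :", EXACT_DF)
    print("perfect map  :", estimate(s=math.sqrt(K_A / K_B), b=X0))
    print("imperfect map:", estimate(s=0.6, b=2.8))
```

With the exact map (`s = sqrt(K_A / K_B)`, `b = X0`) every work value equals the free energy difference, so both estimators have zero variance. With a slightly wrong map the work values spread out and the two estimators start to differ. In this sketch, exponential averaging uses only samples from state A, whereas BAR combines samples from both end states.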