Inferring nonlinear and asymmetric causal relationships from multivariate longitudinal data is a challenging task with applications in clinical medicine, mathematical biology, economics, and environmental research. A number of methods for inferring causal relationships within complex dynamic and stochastic systems have been proposed, but there is no unified, consistent definition of causality in the context of time series data. We evaluate the performance of ten prominent causality indices for bivariate time series across four simulated model systems with different coupling schemes and characteristics. Pairwise correlations between methods, averaged across all simulations, show generally strong agreement, with minimum, median, and maximum Pearson correlations between any pair (excluding two similarity indices) of 0.298, 0.719, and 0.955, respectively. In further experiments, we show that these methods are not always invariant to transformations that arise in real-world data (data availability, standardization and scaling, rounding errors, missing data, and noisy data). We recommend transfer entropy and nonlinear Granger causality as particularly strong approaches for estimating bivariate causal relationships in real-world applications. Both successfully identify causal relationships, and the lack thereof, across multiple simulations, while remaining robust to rounding errors, at least 20% missing data, and small-variance Gaussian noise. Finally, we provide flexible open-access Python code for computing these indices and for running the model simulations.
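The agreement summary quoted above (minimum, median, and maximum Pearson correlation over pairs of methods) can be illustrated with a minimal sketch. It is not the accompanying code: the function name `pairwise_correlation_summary`, the array layout (one row per simulation, one column per causality index), and the randomly generated placeholder scores are all assumptions made for illustration.

```python
import numpy as np


def pairwise_correlation_summary(scores):
    """Summarise agreement between methods.

    scores: array-like of shape (n_simulations, n_methods), where each
    column holds one method's index values (hypothetical layout).
    Returns (min, median, max) of the Pearson correlations taken over
    all unique pairs of methods.
    """
    scores = np.asarray(scores, dtype=float)
    # Correlate columns (methods) with each other.
    corr = np.corrcoef(scores, rowvar=False)
    # Keep each unordered pair once by taking the strict upper triangle.
    upper = np.triu_indices_from(corr, k=1)
    pairs = corr[upper]
    return float(pairs.min()), float(np.median(pairs)), float(pairs.max())


# Placeholder data only: eight hypothetical methods that share a common
# signal plus independent noise. These are not results from the study.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
scores = shared + 0.5 * rng.normal(size=(200, 8))
lowest, median, highest = pairwise_correlation_summary(scores)
```

Using only the strict upper triangle of the correlation matrix avoids counting each pair twice and excludes the trivial self-correlations of 1 on the diagonal.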