In the correlation analysis of experimentally recorded parallel spike trains, the statistical features of the data must be considered carefully to prevent false positive results [1]. Typically, the complexity of the data prevents the use of analytical expressions for evaluating the significance of observed correlations. Similarly, parametric tests presuppose models that are usually simplifications of the real neuronal data and may thus ignore important features. An alternative to these approaches is to use surrogate data, i.e. modified versions of the original data, to assess significance [2]. The goal of this study is to develop selection criteria for suitable surrogate types.

To study the applicability of surrogates, we defined data sets exhibiting different statistical features found in typical experimental data (non-stationary firing rates, non-stationarity of rates across trials, deviations from Poisson statistics) in combinations of increasing complexity. To demonstrate the impact of the surrogate scheme on correlation analysis, we analyze these data sets with different surrogate generation methods commonly used in the literature [1]. Common to all these methods is that they destroy, in one way or another, the precise temporal relation of the spiking activity between neurons, e.g. by shuffling the trial ids (tr-shu) [3], randomizing the spike times (sp-rnd), randomly dithering one whole spike train against the other (tr-di) [4,5], dithering individual spike times (sp-di) [6,7], dithering spike times while conserving the joint-ISI distribution (jisi-di) [8], or exchanging spikes across trials under local preservation of spike counts (sp-exg) [9,10]. To quantify the applicability of the various surrogates for estimating the significance of spike correlation, we concentrate on spike coincidences (allowed temporal precision: ±1 ms) and use their empirical count n_emp as the test statistic.
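As a rough illustration of two of the surrogate types named above, the following Python sketch implements spike dithering (sp-di), train dithering (tr-di), and a coincidence count at ±1 ms precision. It is not the implementation used in the cited studies. The function names, the uniform dither distribution, the 20 ms dither width, and the choice to count every spike pair within the precision window are assumptions made for this example, and boundary handling at the edges of the recording window is omitted.

```python
import numpy as np


def count_coincidences(train_a, train_b, precision=1.0):
    """Count spike pairs (one spike from each train) that lie within
    +/- precision of each other. Times are in ms.

    Assumption: every such pair counts once; the source does not
    specify how multiple matches are handled.
    """
    a = np.sort(np.asarray(train_a, dtype=float))
    b = np.sort(np.asarray(train_b, dtype=float))
    if a.size == 0 or b.size == 0:
        return 0
    # For each spike in a, find the index range of spikes in b that
    # fall into [t - precision, t + precision].
    lo = np.searchsorted(b, a - precision, side="left")
    hi = np.searchsorted(b, a + precision, side="right")
    return int(np.sum(hi - lo))


def dither_spikes(train, dither, rng):
    """sp-di sketch: displace every spike independently by a uniform
    random offset in [-dither, dither). Spike count is preserved,
    the inter-spike intervals are not."""
    train = np.asarray(train, dtype=float)
    return np.sort(train + rng.uniform(-dither, dither, size=train.size))


def dither_train(train, dither, rng):
    """tr-di sketch: shift the whole train by one uniform random offset
    in [-dither, dither). The inter-spike intervals are preserved."""
    train = np.asarray(train, dtype=float)
    return train + rng.uniform(-dither, dither)


rng = np.random.default_rng(seed=1)
neuron_1 = np.array([12.0, 48.5, 103.2, 250.0])   # hypothetical spike times (ms)
neuron_2 = np.array([12.6, 80.0, 103.9, 400.0])
n_emp = count_coincidences(neuron_1, neuron_2)
n_sur = count_coincidences(neuron_1, dither_train(neuron_2, 20.0, rng))
print("empirical count:", n_emp, "one surrogate count:", n_sur)
```

The contrast between the two dithering functions mirrors the distinction drawn in the text: dithering individual spikes alters the interval structure of a train, whereas shifting the whole train leaves it intact and only changes its timing relative to the other neuron.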
The p-value of n_emp is obtained by comparing it to the distribution of coincidence counts in the surrogates. To evaluate the true performance of the surrogates, we study the false positive (FP) and false negative (FN) rates for different parameter configurations implemented in simulated data (rate modulation, regularity, non-stationarity across trials, co-variation of rates).

Figure 1. False positive (a) and false negative (b) percentages for all tested surrogate methods across five different data types. Colors code FP and FN percentages. White squares mark the position of bars of 100% FP.

Based on the FN and FP performance, we find spike train dithering (tr-di) to be the most robust detector of excess coincidences among the selected surrogate methods. Its detection accuracy is seemingly unaffected by the level of complexity of the data, and its sensitivity remains at acceptable levels. Still, tr-di smooths the firing rate profile on the time scale of the dither width and is therefore expected to produce false positives in the case of abrupt transients in firing rate. To deal with this issue, further work is under way on novel methods that take the observed firing rate profile into account. Doing so enables an approximate mapping of non-stationary processes to stationary ones, through which more accurate surrogates can be generated.

This study illustrates the serious need to select appropriate surrogate methods when evaluating the significance of correlations observed in a given data set. Not doing so can lead to false conclusions and misinterpretation of the data. We therefore strongly recommend testing the chosen method on synthetic data that are as similar as possible to the experimental data at hand but do not contain the feature being tested for, in order to control for false positive results before proceeding with the analysis [11].
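The recommended check on synthetic data can be sketched as follows, assuming the simplest possible case: pairs of independent stationary Poisson spike trains, which contain no excess coincidences by construction, so every significant outcome is a false positive. The sketch is illustrative only and does not reproduce the data sets or parameters of this study. All function names, the rates, the number of surrogates, the dither width, and the "+1" correction in the p-value estimate are assumptions.

```python
import numpy as np


def count_coincidences(train_a, train_b, precision=1.0):
    """Number of spike pairs within +/- precision ms (assumed definition)."""
    a = np.sort(np.asarray(train_a, dtype=float))
    b = np.sort(np.asarray(train_b, dtype=float))
    if a.size == 0 or b.size == 0:
        return 0
    lo = np.searchsorted(b, a - precision, side="left")
    hi = np.searchsorted(b, a + precision, side="right")
    return int(np.sum(hi - lo))


def surrogate_p_value(train_a, train_b, n_surr, dither, rng):
    """One-sided p-value of the empirical coincidence count, using
    train dithering (one random shift of train_b per surrogate)."""
    b = np.asarray(train_b, dtype=float)
    n_emp = count_coincidences(train_a, b)
    n_sur = np.array([
        count_coincidences(train_a, b + rng.uniform(-dither, dither))
        for _ in range(n_surr)
    ])
    # Fraction of surrogates with a count at least as large as n_emp.
    # The +1 terms (an assumed convention) keep the estimate above zero.
    p = (np.sum(n_sur >= n_emp) + 1) / (n_surr + 1)
    return n_emp, float(p)


def poisson_train(rate_hz, t_stop_ms, rng):
    """Stationary Poisson spike train on [0, t_stop_ms)."""
    n_spikes = rng.poisson(rate_hz * t_stop_ms / 1000.0)
    return np.sort(rng.uniform(0.0, t_stop_ms, size=n_spikes))


def false_positive_rate(n_datasets, alpha, rng, rate_hz=20.0,
                        t_stop_ms=1000.0, n_surr=200, dither=20.0):
    """Fraction of independent (uncorrelated) data sets in which the
    surrogate test reports significance at level alpha."""
    n_significant = 0
    for _ in range(n_datasets):
        a = poisson_train(rate_hz, t_stop_ms, rng)
        b = poisson_train(rate_hz, t_stop_ms, rng)
        _, p = surrogate_p_value(a, b, n_surr, dither, rng)
        n_significant += int(p < alpha)
    return n_significant / n_datasets


rng = np.random.default_rng(seed=7)
fp = false_positive_rate(n_datasets=100, alpha=0.05, rng=rng)
print(f"estimated false positive rate: {fp:.2f}")
```

For a well-behaved surrogate the estimated rate should stay near or below the chosen significance level. In practice, the synthetic data would be given the same rate profile, regularity, and trial-to-trial variability as the recordings under study, since those are the features that caused the surrogate methods to differ here.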