Abstract

Funding Acknowledgements
Type of funding sources: None.

Background
Deep learning (DL) has received much attention as a solution for automatically diagnosing atrial fibrillation (AF) from raw ECG signals. However, few studies have investigated how DL approaches can be optimally configured or whether their diagnostic performance holds up under external validation.

Purpose
To explore how signal-related parameter tuning affects the ability of DL approaches to diagnose AF, and to validate the optimal approach internally and externally.

Methods
We applied two dedicated DL models (InceptionTime and MINIROCKET) to a set of 7,966 AF and non-AF (normal or with other abnormalities) ambulatory ECG samples, originating from the MIT-BIH AF, MIT-BIH Normal Sinus Rhythm and Long Term AF databases. We tested the effect of different sample lengths (30 s, 10 s, and 30/10 s, i.e. 30 s samples extracted with a "sliding window" of 10 s), sampling frequencies (200, 100, 50 Hz), numbers of leads (two-lead, single-lead) and denoising (Discrete Wavelet Transformation vs. no denoising) on the ability to diagnose AF, by measuring ROC AUC and sensitivity (SEN) after repeated model training and testing. Under the optimal configuration, we trained 10 replicas of both models on 90% of the data and tested their performance on the remaining 10% (internal validation). Finally, we applied both pre-trained models to a separate dataset (MIT-BIH Arrhythmia) to determine their external validity.

Results
Although diagnostic performance did not differ between 30 s and 10 s signals, the 30/10 s setting displayed significantly higher median AUC (0.98) and sensitivity (97.3%; p<0.05 for all comparisons). Signals sampled at 50 Hz performed worse (AUC=0.88, SEN=79.9%) than those at 100 Hz (AUC=0.92, SEN=88.7%) and 200 Hz (AUC=0.93, SEN=89.2%), although this difference narrowly failed to reach statistical significance. Denoised signals showed a higher median AUC (0.95 vs. 0.92) and sensitivity (92.8% vs. 88.7%), but the difference was not statistically significant. Similarly, two-lead signals performed better than single-lead ones (AUC=0.92 vs. 0.9 and SEN=88.7% vs. 84.1%, respectively), but without crossing the significance threshold. The internal validation with denoised, 30/10 s, two-lead signals at 100 Hz yielded similarly high performance metrics for InceptionTime and MINIROCKET (AUC=0.98, SEN=96.9% and AUC=0.98, SEN=97.4%, respectively). In contrast, performance on the external set dropped significantly (AUC=0.79, SEN=81.4% and AUC=0.72, SEN=83.7%, respectively; p<0.001 for all comparisons).

Conclusions
Both DL approaches can effectively detect AF in ambulatory ECG signals, with only 3 out of 100 cases missed on internal validation, indicating their promise as screening tools for automated AF detection. While optimising tunable parameters can enhance the internal performance of such efforts, external validation is necessary to establish their robustness "in the wild", since performance on "unseen" data can be notably lower, as in our case.
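
The 30/10 s setting in the Methods can be illustrated with a minimal sketch. The abstract does not describe how the windows were implemented, labelled, or handled at recording boundaries, so this is only one plausible reading: 30 s windows that advance in 10 s steps, with any trailing partial window discarded. The function name `window_bounds` and the placeholder two-lead signal are hypothetical and are not taken from the study.

```python
def window_bounds(n_samples, fs, window_s=30.0, step_s=10.0):
    """Return (start, end) sample indices of overlapping windows.

    Assumption: windows of `window_s` seconds advance by `step_s` seconds,
    and a trailing partial window is dropped.
    """
    win = int(round(window_s * fs))
    step = int(round(step_s * fs))
    if win <= 0 or step <= 0:
        raise ValueError("window and step must span at least one sample")
    # An empty list is returned when the recording is shorter than one window.
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]


fs = 100  # Hz, one of the sampling frequencies tested in the abstract
# Placeholder 60 s two-lead recording (all zeros), standing in for real ECG data.
ecg = [[0.0, 0.0] for _ in range(60 * fs)]

bounds = window_bounds(len(ecg), fs)
segments = [ecg[start:end] for start, end in bounds]
print(bounds)  # [(0, 3000), (1000, 4000), (2000, 5000), (3000, 6000)]
```

Under these assumptions, a 60 s recording yields four overlapping 30 s segments where non-overlapping 30 s segmentation would yield two. Overlapping windows taken from the same recording are strongly correlated, so how they are distributed across training and test sets is a design choice the abstract does not specify.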