Recently, deep-learning-based video-manipulation techniques, so-called deepfakes, have gained considerable traction in popular media. Deepfakes are predominantly used to photo-realistically alter the identity of actors recorded on video, and misuse of this technology has the potential to harm individuals and companies, or to inflict damage at a societal level. The accelerated pace at which deepfake technology improves challenges contemporary research to provide urgently needed detection methods. A major obstacle to reliable deepfake detection in the field is impaired generalizability across specific instances of deepfake models, i.e., the detection performance on manipulations produced by previously unseen manipulation models. The present work aims to establish a better understanding of the critical factors that drive cross-manipulation detection performance. Recent studies have indicated that the properties of video-based modeling may be leveraged to enhance the detection of unknown deepfake manipulations. We therefore compare multiple image- and video-based detection models and evaluate their performance on deepfakes generated by both known and unknown manipulation models. Furthermore, we attempt to replicate research that successfully improved cross-manipulation detection by degrading training data with image-perturbation methods. Using multiple data sources to emulate in-sample and out-of-sample detection performance, we demonstrate that neither of the two model types shows universally superior performance. Our results also confirm that cross-manipulation evaluation yields problematic detection accuracy for all models, regardless of whether image-perturbation techniques are applied during training. Additionally, the detection performance on a set of manually selected, high-quality deepfake videos indicates that current state-of-the-art detection models are not yet fully equipped for real-world applications.