Unveiling the source of an image is one of the most effective ways to validate its originality, authenticity, and reliability in digital forensics. Source camera device identification aims to identify the specific camera device used to capture a photo under investigation. While photo-response non-uniformity (PRNU)-based methods have made great progress over the past decade, instance-level source camera device linking, which verifies whether two images in question were captured by the same camera device, remains challenging. The difficulty stems mainly from the absence of auxiliary images with which to construct a clean camera fingerprint for each device, particularly when dealing with small image sizes. To overcome this limitation, in this paper, we formulate source device linking as a binary classification problem and propose a simple yet effective framework based on a context-aware deep Siamese network. We take advantage of a Siamese architecture to extract the intrinsic device-related noise patterns from a pair of image patches in parallel for comparison, without any auxiliary images. Moreover, a recurrent criss-cross group is utilized to aggregate contextual information in the noise residual maps, alleviating the problem that PRNU noise maps are easily contaminated by additive noise from image content. For reliable device linking, we employ a patch-selection strategy that adaptively chooses suitable image patch pairs from a pair of test images according to their content; the final decision for the pair is obtained from the average similarity score of the selected patch pairs. Compared with existing state-of-the-art methods, the proposed framework achieves better performance on both source camera identification and source device linking without any prior knowledge, i.e., reliable camera fingerprints, regardless of whether the camera devices are “seen” or “unseen” during training.
Experimental results on two standard image forensic datasets demonstrate that the proposed method is not only robust to different image patch sizes and image quality degradations, but also generalizes across digital camera and smartphone devices.
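The decision rule described above (select content-suitable patch pairs, score each pair, threshold the average) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the learned Siamese similarity is replaced here by a simple normalized cross-correlation of aligned patches, and the patch-selection strategy is approximated by ranking patches by local variance; the function names, the patch size of 64, and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np

def select_patches(img, patch=64, k=4):
    # Stand-in for the content-aware patch-selection strategy:
    # rank non-overlapping patches by local variance (hypothetical criterion).
    h, w = img.shape
    coords = [(y, x) for y in range(0, h - patch + 1, patch)
                     for x in range(0, w - patch + 1, patch)]
    coords.sort(key=lambda c: -img[c[0]:c[0]+patch, c[1]:c[1]+patch].var())
    return coords[:k]

def similarity(p1, p2):
    # Normalized cross-correlation as a hypothetical stand-in for the
    # similarity score produced by the Siamese network.
    a = p1 - p1.mean()
    b = p2 - p2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)

def same_device(img1, img2, patch=64, k=4, thresh=0.5):
    # Score the selected patch pairs and average; the pair of test
    # images is linked to the same device if the mean score is high.
    coords = select_patches(img1, patch, k)
    scores = [similarity(img1[y:y+patch, x:x+patch],
                         img2[y:y+patch, x:x+patch]) for y, x in coords]
    mean_score = float(np.mean(scores))
    return mean_score > thresh, mean_score
```

For example, two images sharing a common noise fingerprint yield a high average correlation at matching patch locations, whereas two independent images do not, which is the intuition behind averaging patch-pair similarities before thresholding.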