Abstract

Facial feature tracking is a key component of imaging ballistocardiography (BCG), where accurate quantification of the displacement of facial keypoints is needed for good heart rate estimation. Skin feature tracking also enables video-based quantification of motor degradation in Parkinson's disease. While traditional computer vision algorithms such as the Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and the Lucas-Kanade method (LK) have been benchmarks due to their efficiency and accuracy, they often struggle with challenges such as affine transformations and changes in illumination. In response, we propose a pipeline for feature tracking that applies a convolutional stacked autoencoder to identify the crop in an image that is most similar to a reference crop containing the feature of interest. The autoencoder learns to represent image crops as deep feature encodings specific to the object category it is trained on. We train the autoencoder on facial images and validate its ability to track skin features in general using manually labelled face and hand videos of small and large motion recorded in our lab. Our evaluation protocol is comprehensive and includes quantification of errors in human annotation. The tracking errors of distinctive skin features (moles) are so small that, based on a $\chi^2$-test, we cannot exclude the possibility that they stem from the manual labelling itself. With a mean error of 0.6–3.3 pixels, our method outperformed the other methods in all but one scenario. More importantly, our method was the only one that did not diverge. We also compare our method with Omnimotion, the latest state-of-the-art transformer for feature matching by Google. Our results indicate that our method is superior at tracking different skin features under large motion conditions and that it creates better feature descriptors for tracking, matching, and image registration than both the traditional algorithms and Omnimotion.
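
To make the matching step concrete, the following is a minimal sketch (not the authors' implementation) of the core idea described above: a convolutional autoencoder encodes small image crops, and tracking selects the candidate crop in a search window whose encoding is closest to that of the reference crop. The architecture, crop size (32x32 pixels), and search radius are illustrative assumptions.

```python
# Illustrative sketch only: encode crops with a convolutional autoencoder and
# track a feature by finding the candidate crop with the nearest encoding.
import torch
import torch.nn as nn


class ConvAutoencoder(nn.Module):
    """Small convolutional autoencoder for 1x32x32 grayscale crops (assumed size)."""

    def __init__(self, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 16 -> 32
        )

    def forward(self, x):
        # Reconstruction objective used during (self-supervised) training.
        return self.decoder(self.encoder(x))


def track_feature(model, frame, ref_crop, prev_xy, crop=32, search=8):
    """Return the top-left corner of the crop in `frame` whose encoding is
    closest to the encoding of `ref_crop`, searching around `prev_xy`.

    frame:    tensor of shape (1, H, W), values in [0, 1]
    ref_crop: tensor of shape (1, crop, crop)
    prev_xy:  (x, y) top-left corner of the feature in the previous frame
    """
    model.eval()
    with torch.no_grad():
        ref_code = model.encoder(ref_crop.unsqueeze(0))
        x0, y0 = prev_xy
        best_xy, best_dist = prev_xy, float("inf")
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                x, y = x0 + dx, y0 + dy
                cand = frame[:, y:y + crop, x:x + crop].unsqueeze(0)
                dist = torch.norm(model.encoder(cand) - ref_code).item()
                if dist < best_dist:
                    best_dist, best_xy = dist, (x, y)
    return best_xy


if __name__ == "__main__":
    # Toy usage on random data, purely to show the intended call pattern.
    model = ConvAutoencoder()
    frame = torch.rand(1, 128, 128)
    ref_crop = frame[:, 40:72, 40:72].clone()
    print(track_feature(model, frame, ref_crop, prev_xy=(40, 40)))
```

In the paper's setting the encoder would be trained on facial image crops; the distance in encoding space then serves as the similarity measure for matching and registration.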