Accurate and fast extraction of step parameters from video recordings of gait allows richer information to be obtained from clinical tests such as Timed Up and Go. Current deep-learning methods are promising but lack the accuracy required for many clinical use cases. Extraction of step parameters often depends on landmarks (keypoints) detected on the feet. We hypothesize that such keypoints can be determined from video recordings with an accuracy relevant for clinical practice by combining an existing general-purpose pose estimation method (OpenPose) with custom convolutional neural networks (convnets) trained specifically to identify keypoints on the heel. The combined method finds keypoints on the posterior and lateral aspects of the heel in side-view and frontal-view images, from which step length and step width can be determined when the cameras are calibrated. Six candidate convnets were evaluated, combining three standard architectures for feature extraction (backbone networks) with two networks for predicting keypoints on the heel (head networks). Using transfer learning, the backbone networks were pre-trained on the ImageNet dataset, and the combined networks (backbone + head) were fine-tuned on data from 184 trials of older, unimpaired adults. The data were recorded at three locations and consisted of 193 k side-view images and 110 k frontal-view images. We evaluated the six models using the absolute distance on the floor between predicted and manually labelled keypoints. For the best-performing convnet, using data from the side-view camera, the median error was 0.55 cm and the 75th percentile was below 1.26 cm. The predictions are accurate overall but show some outliers. The results indicate potential for future clinical use by automating a key step in marker-less gait parameter extraction.
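
The evaluation metric described above (absolute distance on the floor between predicted and manually labelled keypoints, summarized by the median and the 75th percentile) can be illustrated with a minimal sketch. It assumes that both sets of keypoints have already been projected onto floor-plane coordinates in centimetres using the camera calibration. The function names `floor_errors` and `summarize`, the example coordinates, and the choice of linear interpolation for the percentile are illustrative assumptions, not the implementation used in the study.

```python
import math
import statistics


def floor_errors(predicted, labelled):
    """Euclidean distance (cm) between paired floor-plane keypoints.

    Both arguments are sequences of (x, y) tuples in centimetres,
    assumed to be already projected onto the floor plane.
    """
    if len(predicted) != len(labelled):
        raise ValueError("predicted and labelled must have the same length")
    return [math.dist(p, q) for p, q in zip(predicted, labelled)]


def summarize(errors):
    """Median and 75th percentile of the errors.

    Uses linear interpolation between data points (method="inclusive"),
    which is an assumption; other percentile definitions give slightly
    different values on small samples.
    """
    _, median, p75 = statistics.quantiles(errors, n=4, method="inclusive")
    return {"median": median, "p75": p75}


# Hypothetical floor coordinates (cm), for illustration only.
predicted = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (6.0, 8.0)]
labelled = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0), (0.0, 0.0)]

errors = floor_errors(predicted, labelled)
print(summarize(errors))
```

Reporting the median and an upper percentile rather than the mean keeps the summary robust to the occasional outlier predictions noted above.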