Reliable detection and tracking of animals in wildlife videos is an essential basis for analyzing animal behavior and recognizing individual animals. Bounding boxes are not sufficient to correctly distinguish individual animals that stand close to each other. Instead, an exact contour of each animal is needed: an instance mask, which is the output of instance segmentation. In this paper, we present SWIFT, a novel multi-object tracking and segmentation (MOTS) pipeline that addresses this task. We evaluate our approach on a self-created wildlife video dataset containing red deer and fallow deer. Our dataset is one of the very few in wildlife monitoring that are annotated with both instance masks and tracking IDs. Compared to a state-of-the-art instance segmentation approach, SWIFT significantly improves the quality of the instance masks, from 0.432 to 0.495 average precision. Our tracking algorithm uses multiple filtering steps that either delete incorrectly detected tracks or merge tracks that have not yet been connected. Compared to a state-of-the-art tracking approach, this increases the multi-object tracking accuracy from 57.2% to 63.8%, meaning that our tracking results contain fewer errors.
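
To make the reported tracking score easier to interpret, the following minimal Python sketch computes multi-object tracking accuracy under the commonly used CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, where FN, FP, and IDSW are the numbers of missed objects, false detections, and identity switches summed over all frames, and GT is the total number of ground-truth objects. This is an assumption: the text above does not state which exact variant of the score (for example, a mask-based one) is used. The function name `mota` and the example counts are hypothetical and are not taken from the evaluation.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """Multi-object tracking accuracy under the CLEAR MOT definition (assumed).

    All counts are summed over every frame of the evaluated videos.
    The score is at most 1.0 and can become negative when the number
    of errors exceeds the number of ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Hypothetical counts, chosen only for illustration (not from the paper).
score = mota(false_negatives=20, false_positives=10,
             id_switches=5, num_ground_truth=100)
print(f"MOTA = {score:.1%}")
```

Under this definition, deleting incorrectly detected tracks lowers the false-positive count and merging fragmented tracks lowers the number of identity switches, so both kinds of filtering step can raise the score.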