In this paper, we propose a novel system for capsule endoscopy (CE) video summarisation that has two main components. The first component is the Semi-Supervised Clustering and Local Scale Learning (SS-LSL) algorithm, which groups video frames into prototypical clusters that summarise the CE video. The clustering is guided by constraints, deduced from the training frames, that specify pairs of frames which should not be placed in the same cluster. The second component is a novel relational motion histogram descriptor designed to represent the local motion distribution between two contiguous frames. The main idea is to identify “highlight” frames which contain typical variations within the frame collection. These variations arise from pathologies, small tumours and other subtle abnormalities of the small intestine. The SS-LSL algorithm is assessed on synthetic data sets and is shown to outperform similar clustering algorithms because of its ability to discover clusters of different sizes and densities. The proposed video summarisation system is trained, field-tested, evaluated and compared in a large-scale cross-validation experiment that uses videos from the Video Surveillance Online Repository and four CE videos acquired from four patients. This collection includes more than 150,000 video frames.
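
To illustrate the pairwise constraints mentioned above, the following minimal Python sketch checks a given cluster assignment against a set of pairs of frames that should not share a cluster. It is not the SS-LSL algorithm itself: the function name `cannot_link_violations`, the label encoding and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def cannot_link_violations(
    labels: Sequence[int],
    cannot_link: Iterable[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return the constrained pairs whose two frames share a cluster label.

    labels[i] is the (assumed) cluster index assigned to frame i, and each
    pair (i, j) in cannot_link says frames i and j should not be grouped.
    """
    return [(i, j) for i, j in cannot_link if labels[i] == labels[j]]


# Hypothetical toy data: five frames assigned to two clusters.
labels = [0, 0, 1, 1, 0]
# Hypothetical constraints: frames (0, 2) and (1, 4) should be separated.
constraints = [(0, 2), (1, 4)]

print(cannot_link_violations(labels, constraints))  # → [(1, 4)]
```

A semi-supervised clustering method could use such a check, for instance, as a penalty term or as a validity test on candidate assignments; how SS-LSL actually incorporates the constraints is not specified here.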