Digital video is subject to numerous distortions during processing, compression, storage, and transmission, which can reduce perceived video quality. Developing adaptive video transmission methods that lower bandwidth and storage requirements while preserving visual quality requires quality metrics that accurately describe how people perceive distortion. A major obstacle to developing new video quality metrics is the scarcity of data on how the early human visual system simultaneously processes spatial and temporal information. The problem is exacerbated by the fact that the limited data collected in the middle of the last century do not reflect current display equipment and were gathered with the aid of medical intervention, so they cannot be guaranteed to describe the conditions under which media content is consumed today. In this paper, 27,840 visibility thresholds for spatio-temporal sinusoidal variations, which are necessary for determining the artefacts a human perceives, were measured with a new method across different spatial sizes and temporal modulation rates. Based on the new large-scale data obtained in this experiment, we propose a multidimensional model of human contrast sensitivity under modern conditions of video content presentation. We demonstrate that the proposed visibility model has a distinct advantage in predicting subjective video quality by incorporating it, alongside other visibility models, into video quality metrics and evaluating them on three publicly available video datasets.