Introducing action quality assessment technology into sports events to provide precise, intelligent evaluation can greatly improve the objectivity and reliability of competition results. Taking diving as the application background, this study proposes MEAT, a novel Multi-granularity Extraction Approach for Temporal-spatial features in judge scoring prediction, for action quality assessment. On the one hand, it uses a dual-modal inflated 3D ConvNet to extract the temporal and spatial features of each modality of the diving video in parallel at the video granularity, and merges them into a global feature. On the other hand, the human body pose is modeled, and the simulated three-dimensional state of the athlete's splash is taken as the local feature at the object granularity. Finally, the global and local features are concatenated and fed into a fully connected layer, and a heuristic method inspired by competition rules and based on label distribution learning is employed to output the probability distribution of the average score of all referees. The score with the maximum probability is selected and multiplied by the difficulty coefficient to obtain the final diving score. In comprehensive experiments on the UNLV-Dive dataset, the framework achieves a higher Spearman's rank correlation (SRC) than existing methods, which lays a foundation for the practical deployment of the technology.
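The final scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The score bins (0 to 10 in steps of 0.5), the Gaussian-shaped soft label, the `sigma` value, and all function names are assumptions introduced here for illustration only. The sketch shows how a probability distribution over candidate average referee scores might be decoded into a final diving score.

```python
import math

# Hypothetical discretisation of the average referee score: 0.0, 0.5, ..., 10.0.
SCORE_BINS = [i * 0.5 for i in range(21)]


def softmax(logits):
    """Turn raw network outputs into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label(avg_score, bins=SCORE_BINS, sigma=0.5):
    """Assumed training target for label distribution learning:
    a normalised Gaussian centred on the ground-truth average score."""
    weights = [math.exp(-((b - avg_score) ** 2) / (2 * sigma ** 2)) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]


def final_score(probs, difficulty, bins=SCORE_BINS):
    """Select the score bin with maximum probability and multiply it
    by the difficulty coefficient, as the text describes."""
    best = max(range(len(bins)), key=lambda i: probs[i])
    return bins[best] * difficulty


if __name__ == "__main__":
    # Stand-in for a predicted distribution: the soft label of a 7.5 average.
    predicted = soft_label(7.5)
    print(final_score(predicted, difficulty=3.2))
```

In this sketch the soft label stands in for a network prediction. In a real pipeline, `softmax` would be applied to the output of the fully connected layer, and the result would be passed to `final_score`.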