Accurate, unbiased, and reproducible assessment of skill is a vital resource for surgeons throughout their careers. The objective of this research is to develop and validate algorithms for video-based assessment of intraoperative surgical skill. Algorithms that classify surgical video into expert or novice categories provide a summative assessment of skill, which is useful for evaluating surgeons at discrete time points in their training or for certification. Using a spatial-temporal neural network architecture, we tested the hypothesis that explicit supervision of spatial attention, guided by instrument tip locations, improves the algorithm's generalizability to an unseen dataset. The best-performing model had an area under the receiver operating characteristic curve (AUC) of 0.88. Augmenting the network with supervision of spatial attention improved the specificity of its predictions (with small changes in sensitivity and AUC) and led to improved measures of discrimination when tested on an unseen dataset. Our findings show that explicit supervision of attention learned from images using instrument tip locations can improve the performance of algorithms for objective video-based assessment of surgical skill.
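To make the core idea concrete, the following is a minimal PyTorch-style sketch of a spatial-temporal classifier whose spatial attention map is explicitly supervised with instrument tip locations, as described above. It is not the authors' implementation: the backbone, module names, the construction of tip-location target maps, and the loss weighting (lambda_attn) are all illustrative assumptions.

```python
# Sketch only: a video skill classifier with attention supervised by
# instrument tip locations. All architectural choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSupervisedClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Hypothetical per-frame spatial backbone producing feature maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn_head = nn.Conv2d(128, 1, 1)          # spatial attention logits
        self.temporal = nn.LSTM(128, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)  # expert vs. novice

    def forward(self, video):                          # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))     # (B*T, 128, h, w)
        attn_logits = self.attn_head(feats)            # (B*T, 1, h, w)
        attn = torch.softmax(attn_logits.flatten(2), dim=-1).view_as(attn_logits)
        pooled = (feats * attn).sum(dim=(2, 3))        # attention-weighted pooling
        out, _ = self.temporal(pooled.view(b, t, -1))  # temporal aggregation
        logits = self.classifier(out[:, -1])
        return logits, attn.view(b, t, *attn.shape[2:])

def attention_supervision_loss(attn, tip_maps, eps=1e-8):
    """KL divergence between predicted attention and a target distribution
    derived from instrument tip locations (e.g., Gaussians centred on tips)."""
    p = tip_maps.flatten(2) / (tip_maps.flatten(2).sum(-1, keepdim=True) + eps)
    q = attn.flatten(2).clamp_min(eps)
    return (p * (p.clamp_min(eps).log() - q.log())).sum(-1).mean()

def total_loss(logits, labels, attn, tip_maps, lambda_attn=1.0):
    # Skill classification loss plus the assumed attention supervision term.
    return F.cross_entropy(logits, labels) + lambda_attn * attention_supervision_loss(attn, tip_maps)
```

In this sketch, removing the attention supervision term recovers a standard weakly supervised attention model; the hypothesis tested in the paper corresponds to comparing these two training regimes on held-out data.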