Objectives To compare user satisfaction and acceptability, reliability and validity of three different methods of assessing the surgical skills of trainees by direct observation in the operating theatre across a range of different surgical specialties and index procedures. Design and setting A 2-year prospective, observational study in the operating theatres of three teaching hospitals in Sheffield. Methods The assessment methods were procedure-based assessment (PBA), Objective Structured Assessment of Technical Skills (OSATS) and Non-technical Skills for Surgeons (NOTSS). The specialties were obstetrics and gynaecology (O&G) and upper gastrointestinal, colorectal, cardiac, vascular and orthopaedic surgery. Two to four typical index procedures were selected from each specialty. Surgical trainees were directly observed performing typical index procedures and assessed using a combination of two of the three methods (OSATS or PBA and NOTSS for O&G, PBA and NOTSS for the other specialties) by the consultant clinical supervisor for the case and the anaesthetist and/or scrub nurse, as well as one or more independent assessors from the research team. Outcome measures Information on user satisfaction and acceptability of each assessment method from both assessor and trainee perspectives was obtained from structured questionnaires. The reliability of each method was measured using generalisability theory. Aspects of validity included the internal structure of each tool and correlation between tools, construct validity, predictive validity, interprocedural differences, the effect of assessor designation and the effect of assessment on performance. Results Of the 558 patients who were consented, a total of 437 (78%) cases were included in the study: 51 consultant clinical supervisors, 56 anaesthetists, 39 nurses, 2 surgical care practitioners and 4 independent assessors provided 1635 assessments on 85 trainees undertaking the 437 cases. A total of 749 PBAs, 695 NOTSS and 191 OSATSs were performed. Non-O&G clinical supervisors and trainees provided mixed, but predominantly positive, responses about a range of applications of PBA. Most felt that PBA was important in surgical education, and would use it again in the future and did not feel that it added time to the operating list. The overall satisfaction of O&G clinical supervisors and trainees with OSATS was not as high, and a majority of those who used both preferred PBA. A majority of anaesthetists and nurses felt that NOTSS allowed them to rate interpersonal skills (communication, teamwork and leadership) more easily than cognitive skills (situation awareness and decision-making), that it had formative value and that it was a valuable adjunct to the assessment of technical skills. PBA demonstrated high reliability (G > 0.8 for only three assessor judgements on the same index procedure). OSATS had lower reliability (G > 0.8 for five assessor judgements on the same index procedure). Both were less reliable on a mix of procedures because of strong procedure-specific factors. A direct comparison of PBA between O&G and non-O&G cases showed a striking difference in reliability. Within O&G, a good level of reliability (G > 0.8) could not be obtained using a feasible number of assessments. Conversely, the reliability within non-O&G cases was exceptionally high, with only two assessor judgements being required. The reasons for this difference probably include the more summative purpose of assessment in O&G and the much higher proportion of O&G trainees in this study with training concerns (42% vs 4%). The reliability of NOTSS was lower than that for PBA. Reliability for the same procedure (G > 0.8) required six assessor judgements. However, as procedure-specific factors exerted a lesser influence on NOTSS, reliability on a mix of procedures could be achieved using only eight assessor judgements. NOTSS also demonstrated a valid internal structure. The strongest correlations between NOTSS and PBA or OSATS were in the ‘decision-making’ domain. PBA and NOTSS showed better construct validity than OSATS, the year of training and the number of recent index procedures performed being significant independent predictors of performance. There was little variation in scoring between different procedures or different designations of assessor. Conclusions The results suggest that PBA is a reliable and acceptable method of assessing surgical skills, with good construct validity. Specialties that use OSATS may wish to consider changing the design or switching to PBA. Whatever workplace-based assessment method is used, the purpose, timing and frequency of assessment require detailed guidance. NOTSS is a promising tool for the assessment of non-technical skills, and surgical specialties may wish to consider its inclusion in their assessment framework. Further research is required into the use of health-care professionals other than consultant surgeons to assess trainees, the relationship between performance and experience, the educational impact of assessment and the additional value of video recording. Funding The National Institute for Health Research Health Technology Assessment programme.