Abstract

Automatically recording surgical procedures and generating surgical reports are crucial for alleviating surgeons' workload and enabling them to concentrate more on the operations. Despite some achievements, previous works still suffer from two issues: 1) failure to model the interactive relationship between surgical instruments and tissue; and 2) neglect of fine-grained differences among surgical images within the same surgery. To address these two issues, we propose an improved scene graph-guided Transformer, termed SGT++, to generate more accurate surgical reports, in which the complex interactions between surgical instruments and tissue are learnt from both explicit and implicit perspectives. Specifically, to facilitate the understanding of the surgical scene graph under a graph learning framework, a simple yet effective approach is proposed for homogenizing the input heterogeneous scene graph. For the homogeneous scene graph, which contains explicit structured and fine-grained semantic relationships, we design an attention-induced graph transformer that aggregates nodes via an explicit relation-aware encoder. In addition, to characterize the implicit relationships among the instrument, the tissue, and the interaction between them, an implicit relational attention is proposed to take full advantage of the prior knowledge stored in an interactional prototype memory. The learnt explicit and implicit relation-aware representations are then coalesced into fused relation-aware representations that drive report generation. Comprehensive experiments on two surgical datasets show that the proposed SGT++ model achieves state-of-the-art results.
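To make the fusion described above concrete, the following is a minimal PyTorch sketch of how explicit relation-aware features (from a graph encoder over the homogenized scene graph) might be combined with implicit features obtained by attending over a learned interactional prototype memory. All module and variable names (ImplicitRelationalAttention, prototype_memory, RelationAwareFusion, the gating scheme, and the prototype count) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: fuse explicit (graph-encoded) and implicit
# (prototype-attended) relation-aware features before report decoding.
# Names and design choices here are assumptions, not the SGT++ source code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitRelationalAttention(nn.Module):
    """Attend over a bank of learned interaction prototypes (assumed design)."""

    def __init__(self, dim: int, num_prototypes: int = 32):
        super().__init__()
        self.prototype_memory = nn.Parameter(torch.randn(num_prototypes, dim))
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, dim) features of instrument/tissue nodes.
        q = self.query_proj(node_feats)                                   # (N, D)
        attn = F.softmax(q @ self.prototype_memory.t() / q.size(-1) ** 0.5, dim=-1)
        return attn @ self.prototype_memory                               # (N, D)


class RelationAwareFusion(nn.Module):
    """Gated fusion of explicit and implicit relation-aware representations."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, explicit_feats: torch.Tensor, implicit_feats: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([explicit_feats, implicit_feats], dim=-1)))
        return g * explicit_feats + (1.0 - g) * implicit_feats


if __name__ == "__main__":
    dim, num_nodes = 256, 6                          # e.g. instrument + tissue nodes
    explicit = torch.randn(num_nodes, dim)           # stand-in for graph-encoder output
    implicit = ImplicitRelationalAttention(dim)(explicit)
    fused = RelationAwareFusion(dim)(explicit, implicit)
    print(fused.shape)                               # torch.Size([6, 256])
```

The fused representations would then be passed to a Transformer decoder to generate the report tokens; the gating here is simply one plausible way to balance the two feature streams.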
