Abstract

Automatic mitochondrial segmentation has gained great popularity with the development of deep learning. However, the coarse predictions caused by the regular 3D grids used in previous methods, whether 3D CNNs or vision transformers, suggest a possibly sub-optimal feature arrangement. To mitigate this limitation, we interpret 3D EM image stacks as sets of interrelated 3D fragments for a better solution. However, it is non-trivial to model such 3D fragments without introducing excessive computational overhead. In this paper, we design a coherent fragment vision transformer (FragViT) combined with affinity learning, which manipulates features on 3D fragments while exploring their mutual relationships to model fragment-wise context, enjoying a locality prior without sacrificing global receptive fields. The proposed FragViT includes a fragment encoder and a hierarchical fragment aggregation module. The fragment encoder is equipped with affinity heads that transform tokens into fragments with homogeneous semantics, and multi-layer self-attention explicitly learns inter-fragment relations with long-range dependencies. The hierarchical fragment aggregation module progressively aggregates fragment-wise predictions back into the final voxel-wise prediction. Extensive experimental results on the challenging MitoEM, Lucchi, and AC3/AC4 benchmarks demonstrate the effectiveness of the proposed method.
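
To make the described pipeline concrete, the following is a minimal, schematic PyTorch sketch of the token-to-fragment-to-voxel flow the abstract outlines. It is an assumption-laden simplification, not the paper's implementation: the affinity heads are stood in for by a soft assignment of patch tokens to a fixed number of fragment slots (`n_frags`, a hypothetical hyperparameter), and the hierarchical fragment aggregation module is reduced to a single assignment-weighted scatter followed by trilinear upsampling.

```python
# Schematic sketch of a FragViT-style forward pass, under the simplifying
# assumptions stated above. Not the authors' released code.
import torch
import torch.nn as nn

class FragViTSketch(nn.Module):
    def __init__(self, in_ch=1, dim=96, patch=8, n_frags=64, depth=4, heads=4):
        super().__init__()
        # Voxel stack -> non-overlapping 3D patch tokens.
        self.tokenize = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        # Stand-in for the affinity heads: scores how strongly each token
        # belongs to each fragment slot (soft grouping, an assumption here).
        self.affinity = nn.Linear(dim, n_frags)
        # Multi-layer self-attention over fragments models inter-fragment
        # relations with long-range dependencies.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.frag_encoder = nn.TransformerEncoder(layer, depth)
        # Fragment-wise prediction head (one foreground logit per fragment).
        self.frag_head = nn.Linear(dim, 1)
        self.patch = patch

    def forward(self, vol):                      # vol: (B, C, D, H, W)
        tok = self.tokenize(vol)                 # (B, dim, d, h, w)
        B, dim, d, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)     # (B, N, dim), N = d*h*w
        assign = self.affinity(tok).softmax(-1)  # (B, N, F) token->fragment
        # Soft-pool tokens into fragment features (normalized weighted mean).
        frags = assign.transpose(1, 2) @ tok     # (B, F, dim)
        frags = frags / assign.sum(1).clamp_min(1e-6).unsqueeze(-1)
        frags = self.frag_encoder(frags)         # inter-fragment attention
        frag_logits = self.frag_head(frags)      # (B, F, 1)
        # Aggregate fragment predictions back to tokens, then to voxels
        # (a one-step stand-in for the hierarchical aggregation module).
        tok_logits = assign @ frag_logits        # (B, N, 1)
        tok_logits = tok_logits.transpose(1, 2).reshape(B, 1, d, h, w)
        return nn.functional.interpolate(
            tok_logits, scale_factor=self.patch, mode="trilinear",
            align_corners=False)

# Example: a 64^3 EM sub-volume -> per-voxel foreground logits.
model = FragViTSketch()
pred = model(torch.randn(2, 1, 64, 64, 64))
print(pred.shape)  # torch.Size([2, 1, 64, 64, 64])
```

The design point the sketch illustrates is the one the abstract argues for: attention is computed over a small set of fragment tokens rather than the full voxel grid, so global context comes cheap while locality is kept by the grouping step.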
