With the renaissance of deep learning, automatic diagnostic algorithms for computed tomography (CT) have achieved many successful applications. However, they heavily rely on lesion-level annotations, which are often scarce due to the high cost of collecting pathological labels. On the other hand, the annotated CT data, especially the 3-D spatial information, may be underutilized by approaches that model a 3-D lesion with its 2-D slices, although such approaches have proven effective and computationally efficient. This study presents a multiview contrastive network (MVCNet), which enhances the representation of each 2-D view by contrasting it against views from other spatial orientations. Specifically, MVCNet views each 3-D lesion from different orientations to collect multiple 2-D views; it learns to minimize a contrastive loss so that the 2-D views of the same 3-D lesion are aggregated, whereas those of different lesions are separated. To alleviate the issue of false negatives, uninformative negative samples are filtered out, which yields more discriminative features for downstream tasks. Under linear evaluation, MVCNet achieves state-of-the-art accuracies for unsupervised representation learning on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) (88.62%), Lung Nodule Database (LNDb) (76.69%), and TianChi (84.33%) datasets. When fine-tuned on 10% of the labeled data, its accuracies are comparable to those of supervised learning models (89.46% versus 85.03%, 73.85% versus 73.44%, and 83.56% versus 83.34% on the three datasets, respectively), indicating the superiority of MVCNet in learning representations with limited annotations. Our findings suggest that contrasting multiple 2-D views is an effective approach to capturing the original 3-D information, which notably improves the utilization of the scarce and valuable annotated CT data.
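To make the contrastive objective concrete, the following is a minimal sketch of a multi-view contrastive loss in the spirit described above, not the authors' implementation. The function and argument names (`multiview_contrastive_loss`, `lesion_ids`, `temperature`, `neg_sim_floor`) and the threshold-based rule for discarding uninformative negatives are assumptions made for illustration.

```python
# Illustrative multi-view contrastive loss (InfoNCE-style); a sketch only,
# not MVCNet's actual code. The negative-filtering rule is an assumption.
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(embeddings, lesion_ids, temperature=0.1,
                               neg_sim_floor=None):
    """embeddings: (N, D) features of 2-D views from several 3-D lesions.
    lesion_ids: (N,) integer id of the lesion each view was rendered from.
    Views of the same lesion are pulled together; views of different lesions
    are pushed apart; optionally, low-similarity ("uninformative") negatives
    are masked out of the denominator."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # pairwise cosine similarities / T
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    pos_mask = lesion_ids.unsqueeze(0) == lesion_ids.unsqueeze(1)
    pos_mask &= ~self_mask                             # positives: other views of the same lesion
    neg_mask = ~pos_mask & ~self_mask                  # negatives: views of other lesions

    if neg_sim_floor is not None:
        # Assumed filtering rule: drop negatives whose raw cosine similarity
        # is too low to contribute a useful contrast.
        neg_mask &= (sim * temperature) > neg_sim_floor

    exp_sim = torch.exp(sim)
    denom = (exp_sim * (pos_mask | neg_mask)).sum(dim=1)
    log_prob = sim - torch.log(denom + 1e-12).unsqueeze(1)
    # Average the positive-pair log-likelihoods per anchor view.
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```

In use, the `embeddings` would come from a shared 2-D encoder applied to axial, coronal, sagittal, or other reprojected views of each lesion, so that minimizing the loss aggregates views of the same lesion and separates those of different lesions.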