Accurate lymph node size estimation is critical for staging cancer patients, initial therapeutic management, and assessing response to therapy. Current standard practice for quantifying lymph node size is based on a variety of criteria that use uni-directional or bi-directional measurements. Segmentation in 3D can provide more accurate evaluations of lymph node size. Fully convolutional neural networks (FCNs) have achieved state-of-the-art results in segmentation for numerous medical imaging applications, including lymph node segmentation. However, adoption of deep learning segmentation models in clinical trials often faces numerous challenges. These include the lack of pixel-level ground truth annotations for training and limited generalizability of the models to unseen test domains, caused by the heterogeneity of test cases and variation in imaging parameters. In this paper, we studied and evaluated the performance of lymph node segmentation models on a dataset that was completely independent of the one used to create the models. We analyzed the generalizability of the models in the face of a heterogeneous dataset and assessed the potential effects of different disease conditions and imaging parameters. Furthermore, we systematically compared fully supervised and weakly supervised methods in this context. We evaluated the proposed methods using an independent dataset comprising 806 mediastinal lymph nodes from 540 unique patients. The results show that performance achieved on the independent test set is comparable to that on the training set. Furthermore, neither the underlying disease nor the heterogeneous imaging parameters impacted the performance of the models. Finally, the results indicate that our weakly supervised method attains 90% to 91% of the performance achieved by fully supervised training.