Abstract T cell receptor (TCR) repertoires are formed by random recombination of genomic precursor elements, and some of the resulting receptors mediate recognition of cancer-associated antigens. Identifying these cancer-associated TCRs (caTCRs) could facilitate the early detection of cancers and help trace them back to their origins. Because the T cell immune response is a complicated process, a practical strategy is to develop computational methods that identify caTCRs from population and/or individual TCR repertoires. Several machine learning-based methods have been proposed in recent years and have produced a number of interesting findings. However, these methods can be improved by overcoming two important weaknesses in the design of the machine learning framework. First, to simplify computation, existing frameworks decompose the original TCR sequences into fixed-length amino acid fragments, whereas cancer-associated motifs vary in length from 3 to 10 amino acids. Second, the correlations among TCRs in the same repertoire should be considered, but existing frameworks do not support this. We therefore developed TEtor, a multi-instance learning framework incorporating a convolutional neural network (CNN) model, to address both issues. TEtor improves the prediction of caTCRs in pan-cancer TCR repertoires in two respects. First, the CNN model uses convolutional filters of varying widths together with 1-max pooling operations. This design can handle amino acid fragments of different lengths and can therefore identify cancer-associated motifs of various lengths. Second, the multi-instance learning framework calculates the correlations among TCRs and assigns an adjusted weight to each TCR according to the prediction and training process.
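To illustrate why filters of several widths combined with 1-max pooling can accommodate motifs of different lengths, the following is a minimal sketch in plain Python. It is not the TEtor implementation: the one-hot encoding, the randomly initialised filters, the ReLU activation, and all function names (`one_hot`, `make_filter`, `conv_1max`, `featurize`) are assumptions made for illustration. Only the motif length range of 3 to 10 amino acids comes from the abstract.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(seq):
    """Encode an amino-acid string as a list of 20-dimensional one-hot vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [[1.0 if index[aa] == j else 0.0 for j in range(20)] for aa in seq]


def make_filter(width, rng):
    """Hypothetical filter of shape (width, 20); a trained model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(20)] for _ in range(width)]


def conv_1max(encoded, filt):
    """Slide one filter along the sequence and keep only its largest response."""
    width = len(filt)
    if len(encoded) < width:
        return 0.0  # assumption: a filter longer than the sequence contributes nothing
    responses = []
    for start in range(len(encoded) - width + 1):
        s = sum(
            encoded[start + k][j] * filt[k][j]
            for k in range(width)
            for j in range(20)
        )
        responses.append(max(s, 0.0))  # ReLU activation (an assumption)
    return max(responses)  # 1-max pooling over all window positions


def featurize(seq, filters):
    """One pooled value per filter, so the output length is independent of len(seq)."""
    encoded = one_hot(seq)
    return [conv_1max(encoded, f) for f in filters]


rng = random.Random(0)
filters = [make_filter(w, rng) for w in range(3, 11)]  # one filter per width 3..10

short_vec = featurize("CASSLGQYF", filters)          # 9 residues
long_vec = featurize("CASSLAGGRETQYFGPG", filters)   # 17 residues
print(len(short_vec), len(long_vec))
```

Because 1-max pooling keeps a single value per filter, sequences of different lengths map to feature vectors of the same size, and each filter width can respond to motifs of its own length.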
To speed up convergence during training, we adopted a filtering step that retains only the 500 most abundant TCR sequences in each repertoire. To validate the performance of TEtor, we conducted a series of experiments on several patient cohorts encompassing 11 different cancers. Compared with existing machine learning methods, TEtor achieved better accuracy, sensitivity, specificity and area under the curve (AUC) on most of the cohorts. For all of the tested cancer types, TEtor obtained an AUC above 0.7 (e.g. thyroid cancer, 0.97; lung cancer, 0.90; gastrointestinal cancer, 0.9). In conclusion, TEtor is a practical framework that improves caTCR prediction and could further support the accurate detection of early cancers from individual TCR repertoire data.
Citation Format: Ying Xu, Xinyang Qian, Xuanping Zhang, Jiayin Wang. TEtor: Improved machine learning approach unveils the clues of cancer lurking in individual TCR repertoire [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr LB533.
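The abundance filter described in the abstract (keeping the 500 most abundant TCR sequences per repertoire) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the input format of `(sequence, count)` pairs, the function name `top_abundant`, and the example sequences and counts are all hypothetical.

```python
from collections import Counter


def top_abundant(clonotypes, k=500):
    """Return the k most abundant TCR sequences in one repertoire.

    clonotypes: iterable of (sequence, count) pairs (hypothetical input format).
    Counts for repeated sequences are summed before ranking.
    """
    totals = Counter()
    for seq, count in clonotypes:
        totals[seq] += count
    return [seq for seq, _ in totals.most_common(k)]


# Toy repertoire with made-up sequences and counts, filtered with k=2 for brevity.
repertoire = [
    ("CASSLGQYF", 40),
    ("CASRDNEQFF", 3),
    ("CASSPGTEAFF", 12),
    ("CASSLGQYF", 5),
]
kept = top_abundant(repertoire, k=2)
print(kept)
```

With the default `k=500`, a repertoire containing fewer than 500 distinct sequences is returned in full, ordered by abundance.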