Single Training Set Research Articles

Abstract

Introduction: Cancer type is determined through tumor morphology, aided by immunohistochemical staining. The development of machine learning (ML) models using histology slides has powered the image-based prediction of the site of origin in cancer of unknown primary (CUP). Here, we used ML on proteomic data to predict cancer types and tissue of origin from a sample cohort consisting of 1,277 human tissue samples spanning 44 cancer types. The training proteome datasets included two independent sets of proteomes acquired from a pan-cancer cell line collection and a subset of the tissue cohort for online ML.

Methods: All samples were processed using data-independent acquisition mass spectrometry (DIA-MS). Two proteomic profiles from the pan-cancer cell line cohort were generated using two independent sample preparation methods. These were normalized with ComBat and merged by averaging the protein abundance, yielding a single training set (D1) with 975 cell lines and 9,688 proteins. Similarly, 1,277 tissue samples were processed by DIA-MS, quantifying 9,501 proteins. Celligner was used to align the cell lines (D1) with the tissue cohort. Half of the tissue proteomes were used as a second training set (D2) for online ML, and the other half of the tissue cohort formed a hold-out test set (T1). A multinomial logistic regression was used to predict cancer and tissue types. Top-k accuracy, the evaluation metric, measures how often the correct cancer or tissue type class is among the top k predicted classes.

Results: As a proof of concept, we defined six cancer types (adenocarcinoma, sarcoma, squamous carcinoma, lymphoma, melanoma and small cell carcinoma) and seven adenocarcinoma tissues of origin (breast, colorectal, liver, lung, ovary, stomach/esophagus and pancreas) for an ML experiment. We trained a classifier using the cell lines (D1) as the baseline training set, and consecutively added 10% of D2 to D1 for online ML. We tested the baseline model and each subsequent model on the test set T1. Top-1 accuracy increased monotonically from 0.89 (baseline) to 0.97 (all of D2 used) when predicting the six cancer types. We observed an analogous trend when predicting the seven tissue types (from 0.64 to 0.84). These results suggest that cancer cell lines can be used to predict cancer type and adenocarcinoma tissue of origin.

Conclusion: Our proteomics-based ML model can predict cancer type and adenocarcinoma tissue of origin in concordance with existing histopathological classification. It can also assign a probability to each candidate tumor type and tissue of origin, potentially enabling the classification of CUP in future work. Adding tissue samples stepwise to the existing model further improves its predictive performance. This reflects a real-world knowledgebase that will continue to increase in predictive power as additional data are added.

Citation Format: Zhaoxiang Cai, Zainab Noor, Adel T. Aref, Emma L. Boys, Dylan Xavier, Natasha Lucas, Steven G. Williams, Jennifer M. Koh, Rebecca C. Poulos, Peter G. Hains, Phillip J. Robinson, Rosemary Balleine, Roger R. Reddel, Qing Zhong. Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5391.
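The top-k accuracy metric used in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `top_k_accuracy`, the toy probabilities and the toy labels are all assumptions made up for illustration, and the sketch uses only the Python standard library.

```python
from typing import Sequence


def top_k_accuracy(probs: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, true_idx in zip(probs, labels):
        # Rank class indices by descending predicted probability.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical probabilities over three classes (not data from the abstract).
toy_probs = [
    [0.7, 0.2, 0.1],  # true class 0: correct at k=1
    [0.3, 0.5, 0.2],  # true class 0: correct only from k=2
    [0.1, 0.3, 0.6],  # true class 0: correct only at k=3
    [0.2, 0.2, 0.6],  # true class 2: correct at k=1
]
toy_labels = [0, 0, 0, 2]

top1 = top_k_accuracy(toy_probs, toy_labels, k=1)
top2 = top_k_accuracy(toy_probs, toy_labels, k=2)
print(top1, top2)  # 0.5 0.75
```

With k = 1 the metric reduces to ordinary accuracy, which is the Top-1 figure the abstract reports; larger k credits a prediction whenever the correct class appears anywhere in the k most probable candidates.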


This study developed and evaluated a fully convolutional network (FCN) for pediatric CT organ segmentation and investigated the generalizability of the FCN across image heterogeneities such as CT scanner model protocols and patient age. We also evaluated the autosegmentation models as part of a software tool for patient-specific CT dose estimation. A collection of 359 pediatric CT datasets with expert organ contours was used for model development and evaluation. Autosegmentation models were trained for each organ using a modified FCN 3D V-Net. An independent test set of 60 patients was withheld for testing. To evaluate the impact of CT scanner model protocol and patient age heterogeneities, separate models were trained using a subset of scanner model protocols and pediatric age groups. Train and test sets were split to answer questions about the generalizability of pediatric FCN autosegmentation models to unseen age groups and scanner model protocols, as well as the merit of scanner-model-protocol-specific or age-group-specific models. Finally, the organ contours resulting from the autosegmentation models were applied to patient-specific dose maps to evaluate the impact of segmentation errors on organ dose estimation. Results demonstrate that the autosegmentation models generalize to CT scanner acquisition and reconstruction methods that were not present in the training dataset. While models are not equally generalizable across age groups, age-group-specific models do not hold any advantage over combining heterogeneous age groups into a single training set. Dice similarity coefficient (DSC) and mean surface distance results are presented for 19 organ structures, for example, median DSC of 0.52 (duodenum), 0.74 (pancreas), 0.92 (stomach), and 0.96 (heart). The FCN models achieve a mean dose error within 5% of expert segmentations for all 19 organs except for the spinal canal, where the mean error was 6.31%. Overall, these results are promising for the adoption of FCN autosegmentation models for pediatric CT, including applications for patient-specific CT dose estimation.
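For readers unfamiliar with the Dice similarity coefficient reported above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), applied to two voxel sets. It is not taken from the study: the function name `dice_coefficient`, the set-of-voxels representation, the convention of returning 1.0 for two empty masks, and the toy cubes are all assumptions chosen for illustration.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice_coefficient(pred: Set[Voxel], ref: Set[Voxel]) -> float:
    """Dice similarity coefficient between a predicted and a reference voxel set."""
    if not pred and not ref:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    overlap = len(pred & ref)
    return 2.0 * overlap / (len(pred) + len(ref))


def cube(x0: int, size: int) -> Set[Voxel]:
    """Hypothetical helper: a size^3 block of voxels whose x-range starts at x0."""
    return {(x, y, z)
            for x in range(x0, x0 + size)
            for y in range(size)
            for z in range(size)}


# Toy masks: a 2x2x2 reference and a prediction shifted by one voxel along x.
reference = cube(0, 2)
predicted = cube(1, 2)

dsc = dice_coefficient(predicted, reference)
print(dsc)  # 0.5 (4 shared voxels, 8 voxels in each mask)
```

A one-voxel shift halves the score for this small structure, which is one reason small or thin organs tend to score lower on DSC than large ones at a similar boundary error.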


Articles published on Single Training Set

Robust deep learning for eye fundus images: Bridging real and synthetic data for enhancing generalization

Improving generalization capability of deep learning-based nuclei instance segmentation by non-deterministic train time and deterministic test time stain normalization

Abstract 5391: Machine learning of cancer type and tissue of origin from proteomes of 1,277 human tissue samples and 975 cancer cell lines

Instance Selection-Based Surrogate-Assisted Genetic Programming for Feature Learning in Image Classification.

Technical note: Evaluation of a V-Net autosegmentation algorithm for pediatric CT scans: Performance, generalizability, and application to patient-specific CT dosimetry.

Integrating Multiple Datasets and Machine Learning Algorithms for Satellite-Based Bathymetry in Seaports

WITHDRAWN: Contactless attendance system using Siamese neural network based face recognition

Boosting Deep Open World Recognition by Clustering

Repeated holdout validation for weighted quantile sum regression

Multi-domain adversarial training of neural network acoustic models for distant speech recognition

Gene expression based cancer classification

A Novel Eigenface based Species Recognition System

A human platelet calcium calculator trained by pairwise agonist scanning.

A Localization Algorithm of Nodes Based on Hypersphere Granular Computing in Wireless Sensor Networks

Detection of static groups and crowds gathered in open spaces by texture classification

Design and Analysis of Classifier Learning Experiments in Bioinformatics: Survey and Case Studies

Window consensus PCA for multiblock statistical process control: adaption to small and time‐dependent normal operating condition regions, illustrated by online high performance liquid chromatography of a three‐stage continuous process

Reinforcement learning design for cancer clinical trials

Combining multiple positive training sets to generate confidence scores for protein–protein interactions

Application of a hybrid model on short‐term load forecasting based on support vector machines (SVM)
