Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification.

Chakravarthi Kanduri, Milena Pavlović, Lonneke Scheffer, Keshav Motwani, Maria Chernigovskaya, Victor Greiff, Geir K Sandve

doi:10.1093/gigascience/giac046

Open Access

https://doi.org/10.1093/gigascience/giac046

Journal: GigaScience
Publication Date: May 25, 2022
Citations: 14
License type: CC BY 4.0

Affiliation: University of Oslo, University of Florida

Abstract

Background: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where method development of more sophisticated ML approaches may be required.

Results: To identify those scenarios where a baseline ML method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark data sets encompassing a wide range of data set architecture-associated and immune state–associated sequence patterns (signal) complexity. We trained ≈1,700 ML models with varying assumptions regarding immune signal on ≈1,000 data sets with a total of ≈250,000 AIRRs containing ≈46 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR-ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50,000 AIR sequences.

Conclusions: We provide a reference benchmark to guide new AIRR-ML classification methodology by (i) identifying those scenarios characterized by immune signal and data set complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR-ML models given training data set properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark data sets for comprehensive benchmarking of AIRR-ML methods.
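
To give a feel for the scale of the figures quoted in the abstract, the following is a rough back-of-the-envelope sketch and not a calculation from the paper. It assumes, purely for illustration, that the ≈46 billion sequences are spread evenly over the ≈250,000 repertoires, and that the lowest quoted signal frequency (1 in 50,000) applies to a repertoire of that average size. Actual repertoire sizes in the benchmark vary, and the function names are hypothetical.

```python
# Approximate figures as quoted in the abstract.
TOTAL_SEQUENCES = 46e9        # ≈46 billion TCRβ CDR3 amino acid sequences
TOTAL_REPERTOIRES = 250_000   # ≈250,000 AIRRs
SIGNAL_RATE = 1 / 50_000      # signal present in 1 out of 50,000 sequences


def mean_repertoire_size(total_sequences: float, total_repertoires: int) -> float:
    """Average sequences per repertoire, assuming an even split (illustrative assumption)."""
    return total_sequences / total_repertoires


def expected_signal_sequences(repertoire_size: float, signal_rate: float) -> float:
    """Expected number of signal-carrying sequences in one repertoire of the given size."""
    return repertoire_size * signal_rate


avg_size = mean_repertoire_size(TOTAL_SEQUENCES, TOTAL_REPERTOIRES)
expected_hits = expected_signal_sequences(avg_size, SIGNAL_RATE)

print(f"mean sequences per repertoire: {avg_size:,.0f}")                  # 184,000
print(f"expected signal sequences at 1 in 50,000: {expected_hits:.2f}")   # 3.68
```

Under these simplifying assumptions, a positive repertoire of average size would carry only a handful of signal-bearing sequences, which illustrates how sparse the signal is at the lowest frequency the abstract reports.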

Similar Papers

Methodological progress note: Machine learning methods in healthcare research.
Colin Rogerson ... Matt Hall
Journal of Hospital Medicine | VOL. 18
13 Mar 2023

SimAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods.
Chakravarthi Kanduri ... Geir K Sandve
GigaScience | VOL. 12
28 Dec 2022

Pushing the limits of solubility prediction via quality-oriented data selection.
Murat Cihan Sorkun ... Süleyman Er
iScience | VOL. 24
17 Dec 2020

Comparison of machine learning and logistic regression models in predicting acute kidney injury: A systematic review and meta-analysis
Xuan Song ... Chunting Wang
International Journal of Medical Informatics | VOL. 151
08 May 2021
