High-dimensional Data Types Research Articles

Background: Currently, hematologic neoplasms are diagnosed using a combination of methods that require complex equipment and highly skilled clinical laboratory scientists and technicians, both scarce resources. Whole genome sequencing (WGS) and whole transcriptome sequencing (WTS) could streamline this process and become a single method. Interpretation of WGS and WTS data in a diagnostic setting is extremely challenging due to the breadth of the data and its high-dimensional data types. AI will be mandatory to identify clinically meaningful genetic patterns and produce an unbiased diagnosis.

Aim: Compute leukemia diagnoses using AI methods with WGS and WTS data only, depicting the features relevant to each decision and thus making the results comprehensible and transparent to humans.

Methods: To train the model we used our cohort of 4,689 samples, each with both WGS (100x coverage, 2x151bp) and WTS (50 million reads/sample, 2x101bp), along with our independent final routine diagnosis based on gold standard techniques (GST) and following WHO guidelines. Single nucleotide variants (SNV), structural variants (SV) and copy number alterations (CNA) were extracted from WGS data using a tumor-without-normal pipeline, and gene fusions (GF) and gene expression (GE) were extracted from WTS data. The cohort comprised 30 different neoplasms and was severely imbalanced (n = 20 to 773 per entity). To test performance, another independent cohort that was not used during model creation (n=202, 22 entities) was selected.

Results: We trained an ensemble of multi-class classifiers using SageMaker (AWS, Seattle, WA), based on the LGBM implementation of gradient boosted decision trees (Ke et al, 2017) in a one vs. rest architecture (1vRA). Overall model accuracy reached 85% in a 5-fold cross-validation (Fig 1a). Since neighboring disease types such as MGUS/MM and MDS-EB-2/AML/CMML are in some cases difficult to classify correctly using GST, we trained entity-specific classifiers operating independently.
Rather than forcing a single class to be predicted as overwhelmingly likely, this architecture accounts for ambiguous entities. In addition to reflecting biological similarity, the 1vRA resulted in improved probability calibration, so that cases with ambiguous leukemias are more easily identified from the distribution of predicted class probabilities and flagged for human review. Expected calibration error was only 3.8% for the 1vRA with entity-specific components, compared to 8.7% for a single LGBM model.

Typically, AI methods are black boxes, making it hard for a medical professional to understand predictions, which results in low confidence in and acceptance of such systems. We therefore focused particularly on the transparency of the model. We employed the SHAP library (Lundberg et al, 2017) to retrace the model's output and gain insight into the features (i.e. which SNV, SV, GE etc.) predominantly driving classification results (e.g. LPL case, Fig 1b). Fig 1c illustrates the application of SHAP at the global cohort level: two individuals wrongly diagnosed with CML are compared with the cohort of correctly predicted CML cases. Using a decision plot, we can observe which features contribute most to the model's prediction. Fig 1c shows that predictions for CML are primarily driven by the BCR-ABL1 features, as expected.

In our independent test cohort the following entities reached very high concordance: AML (16/21), AUL (11/12), BCP-ALL (10/10), CML (13/13), HZL (8/8), MGUS (7/7), Multiple Myeloma (9/11), PNH (10/10), T-ALL (6/7). Other clear-cut entities with correct high-level predictions include BPDCN, FL, LPL, PPBL, NK-cell, HCL-variant and HGBL. In other entities such as T-NHL, results were more heterogeneous, but this was also reflected in the probability scores given by the model: the first choice had a probability score of ~50%, while the correct diagnosis appeared as the second most likely at ~40%.
The test cohort included cases with mixed diagnostic characteristics, e.g. MDS/MPN-RS-T (4/11 correct, 4 predicted as MDS, and 3 as MPN).

Conclusion: We present an AI tool that interprets WGS and WTS data to predict the final diagnosis without any human input and with high concordance to today's WHO classification. Given the high dimensionality of WGS and WTS data, this is an impossible feat for a human. The tool is exposed via a web application, and visualizations make the automated decisions transparent and verifiable by humans, paving the way for better adoption of WGS and WTS in a clinical routine setting.

[Display omitted]

Disclosures: Kern: MLL Munich Leukemia Laboratory: Other: Part ownership. Haferlach: MLL Munich Leukemia Laboratory: Other: Part ownership. Haferlach: MLL Munich Leukemia Laboratory: Other: Part ownership.
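
The abstract above reports expected calibration error (ECE) for its classifiers without stating how it was computed. As a minimal sketch, assuming the common equal-width binning definition (the average gap between accuracy and mean confidence per bin, weighted by bin size), ECE could be computed as follows. The function name, the choice of 10 bins, and the toy inputs are illustrative assumptions, not details from the study.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width confidence bins (an assumed, commonly used definition)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Per-bin accumulators: sample count, summed confidence, correct predictions.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    total = len(confidences)
    ece = 0.0
    for n, conf_sum, hit in zip(counts, conf_sums, hits):
        if n == 0:
            continue  # empty bins contribute nothing
        # Gap between observed accuracy and mean confidence in this bin.
        gap = abs(hit / n - conf_sum / n)
        ece += (n / total) * gap
    return ece


# Toy inputs (hypothetical): top-class probability and whether the call was right.
toy_conf = [0.95, 0.95, 0.55, 0.55]
toy_correct = [True, True, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

A lower value means the predicted probabilities track observed accuracy more closely, which is what allows ambiguous cases to be identified from the probability distribution alone.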

Abstract: One of the biggest challenges that breeders face is developing improved cultivars under changing climate conditions, which pose additional challenges to their work. On the other hand, the availability of data generated with automated systems offers an opportunity to characterize genotypes genetically and phenotypically in great detail. Modern sequencing technologies, delivering hundreds of thousands of molecular markers, have made it possible to select genotypes without observing them in the field; this methodology was termed Genomic Selection (GS). More recently, sophisticated automated phenotyping platforms that rely on sensors able to measure a large number of plant features have also been developed and have shown potential in plant breeding applications. These modern phenotyping systems, which attempt to deliver phenotypic information on secondary traits efficiently, are also known as high-throughput phenotyping platforms (HTPPs). The integration of HTPPs with GS models has opened a new research front for improving the efficiency of selection methods based on genomic data only, especially for traits that depend on a large number of genes with small effects (complex traits). However, some issues remain to be solved before a robust methodology can combine these two high-dimensional data types in an efficient and informed way. In this document, we provide an overview of the statistical analysis of data derived from HTPPs for improving the predictive ability of conventional GS models. We provide a brief introduction showing the utility of genomic data in plant breeding applications. Next, we provide an overview of field-based HTPPs, considering light detection and ranging (LiDAR) and unmanned aerial vehicles (UAVs), and of how the image data derived from these platforms can be used to accelerate genetic gains.
After that, we discuss the extension of conventional GS models to incorporate data derived from HTPPs as main effects and also in interaction with environmental factors. The availability of several sources of information has opened an avenue to investigate, besides the univariate or single-trait model, models based on multiple traits and models that consider repeated measurements over time, allowing longitudinal GS studies. Finally, we provide some conclusions and outline some of the current issues that prevent the potential of HTPPs from being fully exploited in plant breeding applications.
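
To make the idea of adding HTPP data to a GS model as a main effect more concrete, here is a minimal sketch on simulated data. It is not the authors' model: kernel ridge regression stands in for the GBLUP-type models typically used in GS, and the equal kernel weights, the ridge penalty, and all variable names and data dimensions are assumptions chosen for illustration.

```python
import numpy as np


def linear_kernel(features: np.ndarray) -> np.ndarray:
    """Relationship matrix between lines from column-standardized features."""
    centered = features - features.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # guard against constant columns
    z = centered / scale
    return z @ z.T / features.shape[1]


def kernel_ridge_predict(k_train, y_train, k_new, ridge=1.0):
    """Solve (K + ridge * I) alpha = y - mean, then predict new lines."""
    mu = y_train.mean()
    alpha = np.linalg.solve(k_train + ridge * np.eye(len(y_train)), y_train - mu)
    return mu + k_new @ alpha


# Simulated data (hypothetical sizes): marker genotypes coded 0/1/2 and
# sensor-derived secondary traits that partly reflect the genetic value.
rng = np.random.default_rng(0)
n_lines, n_markers, n_sensor = 120, 300, 20
markers = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)
genetic_value = markers @ rng.normal(0.0, 0.1, size=n_markers)
sensor_traits = (
    np.outer(genetic_value, rng.normal(size=n_sensor))
    + rng.normal(size=(n_lines, n_sensor))
)
yield_obs = genetic_value + rng.normal(0.0, 1.0, size=n_lines)

# One kernel per data type; the 0.5/0.5 weighting is an arbitrary assumption.
g_kernel = linear_kernel(markers)
p_kernel = linear_kernel(sensor_traits)
combined = 0.5 * g_kernel + 0.5 * p_kernel

# Train on the first 90 lines, predict the target trait for the remaining 30.
train = np.arange(90)
test = np.arange(90, n_lines)
pred = kernel_ridge_predict(
    combined[np.ix_(train, train)], yield_obs[train], combined[np.ix_(test, train)]
)
```

Setting the weight of `p_kernel` to zero recovers a genomic-only model, so the two variants can be compared on the same split. Interaction with environmental factors, multi-trait models, and longitudinal models are not covered by this sketch.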

Articles published on High-dimensional Data Types

Fast and universal single-molecule localization using multi-dimensional point spread functions

Temporal Autoregressive Matrix Factorization for High-Dimensional Time Series Prediction of OSS.

Learning spiking neuronal networks with artificial neural networks: neural oscillations.

A review on the microgrid sizing and performance optimization by metaheuristic algorithms for energy management strategies

Adaptive Sparse Multi-Block PLS Discriminant Analysis: An Integrative Method for Identifying Key Biomarkers from Multi-Omics Data.

An advanced variable selection method based on information gain and Fisher criterion reselection iteration for multivariate calibration

CausNet: generational orderings based search for optimal Bayesian networks via dynamic programming with parent set constraints

Interactive Trajectory Star Coordinates i-tStar and Its Extension i-tStar (3D)

TMIC-30. EFFICIENT, QUANTITATIVE MAPPING OF TUMOR-IMMUNE NEIGHBORHOOD COMPOSITION IN GLIOBLASTOMA USING CASSATT

Characterizing node-negative non-small cell lung cancer patients with similarity networks: A CancerLinQ Discovery analysis.

A regularization method for linking brain and behavior.

Tightly integrated multiomics-based deep tensor survival model for time-to-event prediction.

Genetic Algorithm for Variable Selection and Parameter Optimization in SVM and Fuzzy SVM for Colon Cancer Microarray Classification

Automated Disease Classification Using Whole Genome Sequencing (WGS) and Whole Transcriptome Sequencing (WTS) Data with Transparent Artificial Intelligence (AI)

Knowledge discovery from gene expression dataset using bagging lasso decision tree

The use of high-throughput phenotyping in genomic selection context

Directionally dependent multi-view clustering using copula model.

Efficient nearest neighbors methods for support vector machines in high dimensional feature spaces

A wide dataset of ear shapes and pinna-related transfer functions generated by random ear drawings.

A Novel Separating Hyperplane Classification Framework to Unify Nearest-Class-Model Methods for High-Dimensional Data.
