Abstract

High-throughput methods in biological research produce a continuously growing volume of input data, which in turn yields output with an ever larger number of features. While growth in the volume and diversity of input data can be highly valuable for studying biological systems, it presents the challenge of managing enormous numbers of features, many of which are irrelevant to the specific research question. This excess input burdens the storage and computation of downstream clustering and machine learning tasks. A common way to manage it is to filter features by their variance across the sample set, using arbitrary cutoffs. Our proprietary algorithm ("MADVAR") prioritizes variable features in high-throughput continuous data by automatically finding an optimal cutoff for the distribution of the data. Exploiting the right-skewed nature of biological data distributions, MADVAR finds and excludes the "0 variance peak" using the median of the distribution and the median absolute deviation (MAD). MADVAR enables faster analysis with a reduced memory requirement and dramatically improves clustering results with minimal loss of relevant features.

Citation Format: Gilad Silberberg, Michael Ritchie. MADVAR: An algorithm that improves the relevance of computational biology output, while reducing compute time and space requirements [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 2327.
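The abstract does not spell out the cutoff rule, and MADVAR itself is proprietary, so the following is only a minimal sketch of the general idea of a median/MAD-based variance filter. It assumes the cutoff is placed at the median of the per-feature variances plus `k` times their MAD; the function name `mad_variance_filter`, the parameter `k`, and this exact rule are hypothetical illustrations, not the authors' method.

```python
import numpy as np

def mad_variance_filter(X, k=1.0):
    """Illustrative median/MAD variance filter (NOT the MADVAR algorithm).

    X : array of shape (n_samples, n_features), continuous data.
    k : hypothetical tuning factor scaling the MAD (an assumption).
    Returns a boolean mask of features to keep and the cutoff used.
    """
    X = np.asarray(X, dtype=float)
    # Per-feature sample variance across the sample set.
    variances = np.var(X, axis=0, ddof=1)
    # Robust centre and spread of the variance distribution.
    med = np.median(variances)
    mad = np.median(np.abs(variances - med))
    # Assumed rule: features at or below median + k * MAD are treated as
    # part of the low-variance peak and dropped.
    cutoff = med + k * mad
    keep = variances > cutoff
    return keep, cutoff

# Synthetic example: 200 near-constant features and 20 variable ones.
rng = np.random.default_rng(0)
low = rng.normal(0.0, 0.1, size=(50, 200))
high = rng.normal(0.0, 2.0, size=(50, 20))
X = np.hstack([low, high])
keep, cutoff = mad_variance_filter(X, k=1.0)
X_filtered = X[:, keep]
```

Because the median and MAD are robust to the long right tail, the cutoff in this sketch is driven by the bulk of low-variance features rather than by the few highly variable ones. A larger `k` would remove more of the low-variance peak at the risk of discarding informative features.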

