Abstract

Motivation: Large-scale population omics data can provide insight into associations between gene–environment interactions and disease. However, existing dimension reduction modelling techniques are often inefficient for extracting detailed information from these complex datasets.

Results: Here, we present an interactive software pipeline for exploratory analyses of population-based nuclear magnetic resonance spectral data using a COmbined Multi-block Principal components Analysis with Statistical Spectroscopy (COMPASS) within the R-library hastaLaVista framework. Principal component analysis models are generated for a sequential series of spectral regions (blocks) to provide more granular detail defining sub-populations within the dataset. Molecular identification of key differentiating signals is subsequently achieved by implementing Statistical TOtal Correlation SpectroscopY on the full spectral data to define feature patterns. Finally, the distributions of cross-correlation of the reference patterns across the spectral dataset are used to provide population statistics for identifying underlying features arising from drug intake, latent diseases and diet. The COMPASS method thus provides an efficient semi-automated approach for screening population datasets.

Availability and implementation: Source code is available at https://github.com/cheminfo/COMPASS.

Supplementary information: Supplementary data are available at Bioinformatics online.
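The statistical spectroscopy steps named in the abstract can be illustrated with a minimal Python sketch on synthetic data. It is not the COMPASS or hastaLaVista code, which is written in R. The function names `stocsy` and `pattern_match`, the 0.8 correlation threshold and the simulated two-peak metabolite are all assumptions made for illustration. The first function correlates a chosen driver variable with every other spectral variable across samples; the second scores each spectrum against the resulting reference pattern, which is one plausible reading of the cross-correlation step rather than a confirmed description of it.

```python
import numpy as np

def stocsy(X, driver_index):
    """Correlate one driver variable with every spectral variable across samples.

    X is a (samples x spectral points) matrix. Returns Pearson correlations
    and covariances, one value per spectral point.
    """
    Xc = X - X.mean(axis=0)
    driver = Xc[:, driver_index]
    cov = Xc.T @ driver / (X.shape[0] - 1)
    std = Xc.std(axis=0, ddof=1)
    corr = cov / (std * std[driver_index])
    return corr, cov

def pattern_match(X, reference):
    """Pearson correlation between each spectrum (row of X) and a reference pattern."""
    Xc = X - X.mean(axis=1, keepdims=True)
    r = reference - reference.mean()
    return (Xc @ r) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(r))

# Synthetic population (assumption): 80 spectra of 120 points with baseline noise.
rng = np.random.default_rng(1)
n_samples, n_points = 80, 120
X = rng.normal(scale=0.02, size=(n_samples, n_points))

# Hypothetical metabolite with two peaks (points 30 and 90, ratio 1 : 0.5),
# present only in the first ten samples.
conc = np.zeros(n_samples)
conc[:10] = rng.uniform(1.0, 2.0, size=10)
X[:, 30] += conc
X[:, 90] += 0.5 * conc

# Step 1: take point 30 as the driver and find the variables that co-vary with it.
corr, cov = stocsy(X, driver_index=30)
pattern_points = np.flatnonzero(corr > 0.8)  # threshold is an illustrative choice

# Step 2: build a reference pattern and score every spectrum against it.
reference = np.where(corr > 0.8, cov, 0.0)
match = pattern_match(X, reference)
```

In this toy example the driver recovers its partner peak, and the distribution of `match` separates the ten samples carrying the metabolite from the rest of the population.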

Highlights

  • Systems biology approaches using multi-omic platforms can inform on biological pathways and mechanisms underlying disease risk and identify potential targets for new treatments or preventive measures (Elliott et al, 2015; Holmes et al, 2008)

  • The principal component analysis (PCA) model generated from the whole dataset showed that the first two PCs are dominated by variance in glucose concentration (Fig. 2A, B), whereas PC3 is dominated by metabolites of the commonly used drugs ibuprofen and acetaminophen (Fig. 2C)

  • Because the ‘PCA blocking’ approach investigates the dataset at a much finer level of detail than the ‘default’ whole-dataset approach, phenomena such as latent disease, genetic disorders and consumption of drugs or specific foods are more readily detected by interpreting the calculated sub-models, whilst the risk of misinterpretation is reduced since, in many cases, the model is no longer skewed by one or two high-intensity variables
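The contrast between a whole-dataset model and block-wise sub-models can be sketched in Python with synthetic data. This is an illustration under stated assumptions, not the authors' R implementation: the helper names `pca_scores` and `block_pca`, the fixed block width of 50 points and the simulated signals (an intense glucose-like resonance and a weak signal in five samples) are all hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Mean-centre X (samples x variables) and return PCA scores via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

def block_pca(X, block_width, n_components=2):
    """Fit one PCA model per consecutive block of spectral variables."""
    results = []
    for start in range(0, X.shape[1], block_width):
        stop = min(start + block_width, X.shape[1])
        scores, explained = pca_scores(X[:, start:stop], n_components)
        results.append({"start": start, "stop": stop,
                        "scores": scores, "explained": explained})
    return results

# Synthetic population (assumption): 60 spectra of 200 points with baseline noise.
rng = np.random.default_rng(0)
n_samples, n_points = 60, 200
X = rng.normal(scale=0.01, size=(n_samples, n_points))

# Intense signal that varies across all samples (points 20-29).
X[:, 20:30] += rng.normal(loc=10.0, scale=3.0, size=(n_samples, 1))

# Weak signal present only in a five-sample sub-population (points 150-154).
subgroup = np.arange(5)
X[subgroup, 150:155] += 0.5

# Whole-dataset model: PC1 follows the intense signal.
global_scores, _ = pca_scores(X)

# Block-wise models: the block covering points 150-199 is free of that signal.
blocks = block_pca(X, block_width=50)
block3_pc1 = blocks[3]["scores"][:, 0]
```

Here PC1 of the whole-dataset model tracks the intense signal, while PC1 of the fourth block separates the five-sample sub-population directly, since that block contains no high-intensity variable to dominate the variance.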

Introduction

Systems biology approaches using multi-omic platforms can inform on biological pathways and mechanisms underlying disease risk and identify potential targets for new treatments or preventive measures (Elliott et al, 2015; Holmes et al, 2008). Principal component analysis (PCA) (Jackson, 1991; Wold et al, 1987) and partial least squares projections to latent structures (PLS) (Tenenhaus, 1998; Wold et al, 2001a,b) are two of the most widely implemented statistical tools because, compared to other methods, they offer data compression, good visualization properties, high interpretability, robustness and inherent transparency with respect to feature (metabolite) weighting.
