Modern environmental epidemiology benefits from a new generation of technologies that enable comprehensive profiling of biomarkers, including environmental chemical exposure and omic datasets. The integration and analysis of large, structured datasets to identify functional associations is constrained by computational challenges that cannot be overcome using conventional regression methods. Several extensions of Partial Least Squares (PLS) regression have been developed to efficiently integrate multiple datasets, including Multiblock PLS (MB-PLS) and Sequential and Orthogonalized PLS; however, these approaches remain seldom applied in environmental epidemiology. To address this research gap, this study aimed to assess and compare the applicability of PLS-based multiblock models in an observational case study, in which biomarkers of exposure to environmental chemicals and endogenous biomarkers of effect were integrated simultaneously to highlight biological links related to a health outcome. The methods were compared with and without sparsity, using two metrics to support variable selection: Variable Importance in Projection (VIP) and Selectivity Ratio (SR). The framework was applied to a case-study dataset mimicking the structure of 36 environmental exposure biomarkers (E-block), 61 inflammation biomarkers (M-block), and their relationships with the gestational age at delivery of 161 mother-infant pairs. The results showed overall consistency in the selected variables across models, although some specific selection patterns were identified. The block-scaled concatenation-based approaches (e.g., MB-PLS) tended to select more variables from the E-block, but were unable to identify certain variables in the M-block. Overall, the SR criterion selected more variables than the VIP criterion and yielded lower predictive performance.
The multiblock models coupled with VIP appeared to be the methods of choice for identifying relevant variables, with similar statistical performance. Overall, the use of multiblock PLS-based methods appears to be a good strategy to efficiently support the variable selection process in modern environmental epidemiology.
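
As an illustration only, the block-scaled concatenation idea and the VIP criterion mentioned above can be sketched as follows. This is a minimal sketch, not the study's implementation: the data are synthetic, only the block dimensions (161 samples, 36 E-block and 61 M-block variables) are taken from the abstract, and the function names `block_scale`, `pls1_nipals` and `vip`, the choice of two components, and the VIP > 1 cut-off are assumptions made for the example. The scaling shown (autoscaling, then dividing each block by the square root of its number of variables) is one common convention and may differ from the one used in the study.

```python
import numpy as np


def block_scale(blocks):
    """Autoscale each block, then divide by sqrt(n_vars) so that every
    block contributes the same total variance to the concatenated matrix."""
    scaled = []
    for X in blocks:
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        scaled.append(Z / np.sqrt(X.shape[1]))
    return np.hstack(scaled)


def pls1_nipals(X, y, n_components):
    """Single-response PLS fitted with NIPALS; X is assumed column-centred."""
    X = X.copy()
    y = y - y.mean()
    n, p = X.shape
    W = np.zeros((p, n_components))  # unit-norm weight vectors
    T = np.zeros((n, n_components))  # score vectors
    q = np.zeros(n_components)       # y-loadings
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        tt = t @ t
        p_load = X.T @ t / tt
        q[a] = (y @ t) / tt
        X -= np.outer(t, p_load)     # deflate X
        y = y - q[a] * t             # deflate y
        W[:, a] = w
        T[:, a] = t
    return W, T, q


def vip(W, T, q):
    """Variable Importance in Projection from weights, scores and y-loadings."""
    p = W.shape[0]
    ssy = q ** 2 * np.sum(T ** 2, axis=0)  # y-variance explained per component
    return np.sqrt(p * (W ** 2 @ ssy) / ssy.sum())


# Synthetic data with the block sizes reported in the abstract (values are made up).
rng = np.random.default_rng(0)
E = rng.normal(size=(161, 36))  # stand-in for the exposure block
M = rng.normal(size=(161, 61))  # stand-in for the inflammation block
y = E[:, 0] + M[:, 0] + rng.normal(scale=0.5, size=161)

X = block_scale([E, M])
W, T, q = pls1_nipals(X, y, n_components=2)
scores = vip(W, T, q)
selected = np.flatnonzero(scores > 1.0)  # VIP > 1 is a common rule of thumb
print(selected)
```

Because the weight vectors have unit norm, the squared VIP values average to one across variables, which is why a threshold of 1 is often used. Under this particular scaling, each variable in the smaller block receives a larger individual scale factor than a variable in the larger block.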