Identifying relevant features in ultra-high-dimensional survival data is a fundamental objective in biological discovery and statistical learning. Conventional survival regression algorithms are challenged by the exponential growth of raw data. In real-world settings, ultra-high dimensionality complicates data analysis, particularly for two-component systems such as the kidneys, lungs, and eyes. Gene interactions between the two components affect both the future stability of the system and the frequency of illness. Traditional statistical procedures for survival analysis are restricted to a single component. To date, no feature selection method is available for ultra-high-dimensional survival data with two compartments. To identify the best-performing methods in this setting, this study proposed and compared ten variable selection approaches for ultra-high-dimensional Renal Cell Carcinoma (RCC) survival data with two compartments. The study used Freund’s baseline hazard function as the baseline hazard of the Cox model (Lasso Freund, Robust Lasso Freund, Elastic Net Freund) and integrated these models with sure independence screening (SIS) and iterative sure independence screening (ISIS), yielding LF-SIS, RLF-SIS, ENF-SIS, LF-ISIS, RLF-ISIS, and ENF-ISIS. Additionally, two baseline approaches, LASSO and Elastic Net (EN), were considered, and EN was also combined with SIS and ISIS (EN-SIS, EN-ISIS). Based on model validation measures, including MSE (340.000), SSE (25300.0), and RMSE (16.490), the Robust Lasso Freund-Iterative Sure Independence Screening (RLF-ISIS) and Robust Lasso Freund-Sure Independence Screening (RLF-SIS) strategies outperform the other proposed approaches, selecting variables with greater precision.
However, both methods showed a relatively low R² (0.71), which suggests the presence of outliers in the dataset. Box plots of selected predictive genes confirm this. Furthermore, RLF-ISIS and RLF-SIS identified 49 and 68 genes, respectively, that have direct and indirect effects on patients with RCC. In conclusion, although RLF-SIS and RLF-ISIS outperform the other proposed approaches and can serve as variable selection strategies, they might not be the optimal choice for ultra-high-dimensional survival data with outliers. The study can be extended in the future by applying competing risks theory to sequential and parallel structures, which form the basis of most complex mechanical systems found in manufacturing facilities. Notably, no feature selection method is available for ultra-high-dimensional survival data with both outliers and two compartments. To address this gap, further research should focus on developing an advanced hybrid feature selection approach, with particular emphasis on deep learning strategies.
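The screening step that SIS contributes to the hybrid methods above can be illustrated with a minimal sketch. It is not the study's implementation: it assumes an uncensored continuous response and ranks features by absolute marginal correlation, whereas a survival analysis would use a marginal utility that accounts for censoring. The function name `sis_screen`, the default cutoff of n / log(n) retained features, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Keep the d features with the largest absolute marginal correlation with y.

    Simplified illustration of sure independence screening (SIS);
    a continuous, uncensored response is assumed here.
    """
    n, p = X.shape
    if d is None:
        # Assumed default cutoff: n / log(n) features, capped at p.
        d = max(1, min(p, int(n / np.log(n))))
    Xc = X - X.mean(axis=0)          # center each feature
    yc = y - y.mean()                # center the response
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom = np.where(denom == 0, np.inf, denom)  # constant columns score 0
    score = np.abs(Xc.T @ yc) / denom            # |marginal correlation|
    keep = np.argsort(score)[::-1][:d]           # indices of the top-d scores
    return np.sort(keep), score

# Simulated example (hypothetical data): only features 0 and 1 carry signal.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.5 * X[:, 1] + rng.standard_normal(n)

kept, score = sis_screen(X, y)
print(len(kept), "features retained out of", p)
```

After screening, a penalized model such as LASSO or Elastic Net would be fitted on the retained columns only. An iterative variant (ISIS) repeats the screening on what the current fit leaves unexplained, so that features missed by marginal ranking get another chance to enter.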