This work is based on the PHM North America 2024 Conference Data Challenge's datasets of helicopter turbine engine performance measurements. These datasets were large and moderately imbalanced. To address these challenges, we demonstrate a set of tools covering feature engineering, augmentation and selection, model exploration, visualization, model explainability, and confidence margin estimation. This work was performed entirely in MATLAB, and the tools are generally applicable to data-driven health modeling and prediction in real-life applications. Initially, we explored the 742k observations in the training set, noting a 60-40 split between healthy and faulty labels, and identified two major operational clusters within the data. We enhanced the dataset by removing duplicates and engineered new features based on domain knowledge, expanding the feature set to 242 dimensions. For the torque margin estimation, however, we trained a regression model on a limited subset of 18 features, which included domain-knowledge engineered features, quadratic terms, and linear interactions between all terms. For the final submission, we used a stepwise linear regression model to optimize feature selection. This approach achieved a perfect regression score on the test data, validated by a consistent torque margin residual range of ±0.5%. The model's RMSE and MAE metrics supported employing a normal-distribution probability density function for confidence estimation. For the classification task, we reduced the feature set to 58 using dimensionality reduction techniques and balanced the data by upsampling and down-weighting the minority class. We employed ASHA (Asynchronous Successive Halving Algorithm) in conjunction with AutoML to efficiently determine the most suitable model family, significantly saving compute time. Subsequently, we trained ensemble models, including bagged trees and AdaBoost (Adaptive Boosting), which minimized false negatives and false positives, achieving robust classification performance. This was particularly critical given the high penalty for false negatives in the data challenge. The MathWorks team's score on the testing data was 0.9686 at the close of the competition and was subsequently improved to 0.9867. Our approach demonstrates the effectiveness of combining strategic data processing, feature engineering, and model selection to enhance predictive accuracy in complex operational datasets.
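As one illustration of the regression workflow summarized above, the following is a minimal MATLAB sketch of stepwise linear regression for torque margin estimation with a normal-distribution residual model. The table and variable names (trainTbl, testTbl, TorqueMargin) are placeholders for illustration, not identifiers from the challenge data, and the computation of probability mass within the ±0.5% band is an assumed usage rather than the submitted pipeline.

    % Minimal sketch: stepwise linear regression for torque margin, followed by
    % a normal-PDF residual model for confidence estimation. Names are placeholders.
    mdl = stepwiselm(trainTbl, 'ResponseVar', 'TorqueMargin', ...
        'Upper', 'quadratic', 'Verbose', 0);   % adds/removes linear, interaction, and squared terms

    pred  = predict(mdl, testTbl);             % predictions on held-out data
    resid = testTbl.TorqueMargin - pred;       % torque margin residuals

    % Treat residuals as approximately zero-mean normal and report the
    % probability mass within a +/-0.5% band (illustrative confidence measure).
    sigma   = std(resid);
    pInBand = normcdf(0.5, 0, sigma) - normcdf(-0.5, 0, sigma);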
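Similarly, a hedged sketch of the classification stage: an ASHA-driven AutoML search with fitcauto to pick a model family, followed by ensemble training (bagged trees or AdaBoost) with observation weights to address class imbalance. Xtrain, y, Xval, yVal, and the weight value are assumptions for illustration; the exact rebalancing scheme and tuned hyperparameters of the submitted model are not reproduced here.

    % Minimal sketch: ASHA-based AutoML model-family search, then an ensemble
    % classifier trained with observation weights. Names and the weight value
    % are illustrative placeholders; y is assumed to be a categorical label vector.
    opts = struct('Optimizer', 'asha', 'MaxObjectiveEvaluations', 100, ...
                  'ShowPlots', false);
    autoMdl = fitcauto(Xtrain, y, 'HyperparameterOptimizationOptions', opts);

    w = ones(size(y));                 % per-observation weights
    w(y == 'faulty') = 2;              % rebalance class influence (illustrative value)
    ens = fitcensemble(Xtrain, y, 'Method', 'Bag', ...   % or 'AdaBoostM1'
                       'NumLearningCycles', 300, 'Weights', w);

    % Confusion matrix on a validation split to track false negatives and positives.
    cm = confusionmat(yVal, predict(ens, Xval));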