The era of big data has brought large-scale data of many types, and extracting the latent value and rules from such data remains a challenge. Owing to various external and internal factors, large-scale data commonly have a limited number of missing labels. A large-scale mixed information system with limited missing labels (LSMDISLML) is typically handled with the local neighborhood rough set model (LNRS-model). However, this model usually applies an identical neighborhood radius to all numerical attributes, which can weaken the classification capability of the data. The local fuzzy rough set model (LFRS-model) can overcome this limitation. This paper studies local fuzzy rough attribute reduction for large-scale mixed data with limited missing labels, based on the LFRS-model, via local fuzzy self information and an overlap degree function. First, fuzzy relations on the entire sample set are established from the statistical distribution of the data, so that different fuzzy similarity radii can be used to calculate similarity, thereby adapting to different data distributions. Next, the samples with missing labels are discarded, since they constitute a small proportion of the entire sample set and have little impact on the overall performance of the dataset; the limited computing resources and storage space are instead focused on the samples with complete labels (denoted as the target set). Then, based on the target set, local fuzzy λ-upper and lower approximations are defined and the LFRS-model is constructed. This model not only reduces processing time and sources of error in large-scale data but also improves data quality and enhances the reliability of the experimental results. Local fuzzy λ-self information is then introduced and used to design a local fuzzy rough attribute reduction algorithm for an LSMDISLML. 
Furthermore, an overlap degree function is introduced to evaluate and reorder the attributes by importance, so that redundant attributes with high overlap and low importance are eliminated first from the preordered attribute set. This strategy improves the efficiency of obtaining the optimal subset. Finally, a series of experiments is carried out. The experimental results demonstrate that the designed algorithm exhibits excellent performance in classification and outlier detection tasks, surpassing four existing algorithms.
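The first steps described above (distribution-aware fuzzy relations on the whole sample set, followed by restriction to the labeled target set) can be illustrated with a minimal sketch. It is not the paper's definition: the use of each numerical attribute's standard deviation as its similarity radius, the averaging over attributes, and the classical min/max fuzzy lower approximation (in place of the λ-approximations, which the abstract does not define) are all assumptions made for illustration, as are the function names `fuzzy_similarity` and `local_lower_approximation`. Categorical attributes are assumed to be integer-coded, and missing labels are assumed to be stored as NaN.

```python
import numpy as np


def fuzzy_similarity(X, numeric_mask):
    """Fuzzy similarity matrix over ALL samples (labeled or not).

    Assumption: a numerical attribute uses its own standard deviation as
    the similarity radius; a categorical attribute uses exact matching.
    The per-attribute similarities are averaged.
    """
    n, m = X.shape
    sim = np.zeros((n, n))
    for a in range(m):
        col = X[:, a]
        if numeric_mask[a]:
            radius = col.std()  # attribute-specific radius (assumption)
            diff = np.abs(col[:, None] - col[None, :])
            if radius > 0:
                s = np.clip(1.0 - diff / radius, 0.0, 1.0)
            else:
                s = (diff == 0).astype(float)
        else:
            s = (col[:, None] == col[None, :]).astype(float)
        sim += s
    return sim / m


def local_lower_approximation(sim, labels):
    """Classical fuzzy lower approximation restricted to labeled samples.

    For each labeled sample x, returns min over labeled samples y with a
    different label of (1 - sim[x, y]); 1.0 if no such y exists.
    Samples whose label is NaN are discarded beforehand.
    """
    idx = np.flatnonzero(~np.isnan(labels))  # the target set
    R = sim[np.ix_(idx, idx)]
    y = labels[idx]
    lower = np.ones(len(idx))
    for i in range(len(idx)):
        other = y != y[i]
        if other.any():
            lower[i] = np.min(1.0 - R[i, other])
    return idx, lower


# Toy mixed data: column 0 numerical, column 1 categorical (integer-coded).
X = np.array([[0.0, 0], [0.1, 0], [1.0, 1], [0.9, 1], [0.5, 0]])
labels = np.array([0.0, 0.0, 1.0, 1.0, np.nan])  # last label is missing
sim = fuzzy_similarity(X, numeric_mask=[True, False])
target_idx, lower = local_lower_approximation(sim, labels)
```

In this sketch the radii are computed from every sample, including unlabeled ones, while the approximation itself only touches the target set, which mirrors the order of steps in the abstract. A lower-approximation value close to 1 means the sample is well separated from differently labeled samples under the chosen attributes.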