Abstract

To address the problem of incomplete data in machine learning, such as outliers and missing values, this paper proposes an incomplete data processing method based on measuring the missing rate and the degree of abnormality. The method gives full consideration to the easily overlooked outlier problem and applies the bisection method from the data structure field to narrow the interval length and find the data distribution law. On this basis, an incomplete data processing model is constructed for static-structure or fixed-structure data sets in a given field or research direction. In the model establishment stage, the rules for outlier processing are explored first, including the boundary condition for outlier processing and the relative applicable condition for using the direct-discarding method on outliers. The rules for missing value processing are explored next, including the boundary condition for missing value processing, the relative applicable condition for using the direct-discarding method on missing values, the division into missing rate intervals, and the numerical filling method applicable to each interval. In this research, a loose particle localization data set was taken as the test object, and an applicable incomplete data processing model was established for it. Specifically, the boundary condition under which the model can process outliers is 23%, and the relative applicable condition for using the direct-discarding method on outliers is 2%. The boundary condition under which the model can process missing values is 67%, and the relative applicable condition for using the direct-discarding method on missing values is 3%. Within the range of 3% to 67%, missing rates from 3% to 7% are processed by statistical filling methods, and missing rates from 7% to 67% are processed by a kNN prediction model. The resulting model was verified on multiple loose particle localization data sets.
Test results show that the prediction accuracy of a classification learner based on a parameter-optimized Random Forest is significantly higher on the processed data sets than on the unprocessed ones, with an average improvement of 5.04%, which demonstrates the feasibility and practicability of the proposed method. In principle, the method can be extended to the processing of incomplete data sets in similar fields or with the same structure, and it provides a useful reference for research on incomplete data processing in machine learning. It also has practical value for improving the completeness and regularity of feature data sets in specific application fields.
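
The threshold rules reported above can be read as a simple decision procedure. The following is a minimal Python sketch of that reading, not the authors' implementation: the names `Thresholds`, `outlier_strategy`, and `missing_strategy` are hypothetical, the threshold values are the ones reported for the loose particle localization data set, and treating each boundary as inclusive on the lower interval is an assumption, since the abstract does not say which side a boundary value falls on. The abstract also does not name the method used for outlier rates between 2% and 23%, so the sketch only labels that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Rates (as fractions) reported for the loose particle localization data set."""
    outlier_discard: float = 0.02    # up to here: discard outliers directly
    outlier_limit: float = 0.23      # above here: outliers cannot be processed
    missing_discard: float = 0.03    # up to here: discard missing values directly
    missing_stat_fill: float = 0.07  # up to here: statistical filling
    missing_limit: float = 0.67      # above here: missing values cannot be processed


def _check_rate(rate: float) -> None:
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"rate must be in [0, 1], got {rate}")


def outlier_strategy(rate: float, t: Thresholds = Thresholds()) -> str:
    """Pick a handling strategy for a given outlier rate."""
    _check_rate(rate)
    if rate <= t.outlier_discard:    # assumption: boundary is inclusive
        return "discard"
    if rate <= t.outlier_limit:
        return "process"             # method not specified in the abstract
    return "unprocessable"


def missing_strategy(rate: float, t: Thresholds = Thresholds()) -> str:
    """Pick a handling strategy for a given missing rate."""
    _check_rate(rate)
    if rate <= t.missing_discard:    # assumption: boundary is inclusive
        return "discard"
    if rate <= t.missing_stat_fill:
        return "statistical_fill"
    if rate <= t.missing_limit:
        return "knn_fill"
    return "unprocessable"
```

For example, under these assumptions `missing_strategy(0.05)` selects statistical filling and `missing_strategy(0.30)` selects kNN-based filling. For another data set, the thresholds would need to be re-derived rather than reused.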
