Abstract
Real-world datasets contain both numerical and categorical attributes, and missing values are often present in both. Missing data must be imputed because inferences made from complete data are often more accurate and reliable than those made from incomplete data [15]. Moreover, most data mining algorithms cannot work with incomplete datasets. This paper proposes a novel soft computing architecture for categorical data imputation. The proposed technique employs a Probabilistic Neural Network (PNN) preceded by mode imputation to impute the missing categorical data. Its effectiveness is tested on 4 benchmark datasets under the 10-fold cross-validation framework. All datasets except Mushroom are complete; in these, some values are randomly removed and treated as missing. The performance of the proposed technique is compared with that of 3 statistical and 3 machine learning methods for data imputation. The comparison of the mode+PNN technique with the mode, K-Nearest Neighbor (K-NN), Hot Deck (HD), Naive Bayes, Random Forest (RF) and J48 (Decision Tree) imputation techniques demonstrates that the proposed method is efficient, especially when the percentage of missing values is high, for records having more than one missing value, and for categorical variables having a large number of categories.
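The abstract does not spell out how the mode and PNN stages are combined, so the following is only a minimal sketch under stated assumptions: mode imputation first fills every gap so that the PNN receives complete inputs, and a PNN (a Gaussian Parzen-window classifier over one-hot encoded attributes) then re-estimates each originally missing value from the other attributes. The names `MISSING`, `mode_fill`, `one_hot`, `pnn_predict` and `impute`, and the smoothing parameter `sigma`, are hypothetical and not taken from the paper.

```python
from collections import Counter

import numpy as np

MISSING = None  # hypothetical marker for a missing categorical value


def mode_fill(column):
    """Stage 1: replace missing entries with the most frequent observed category."""
    observed = [v for v in column if v is not MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is MISSING else v for v in column]


def one_hot(rows, categories):
    """One-hot encode each row, given the category list of every column."""
    return np.array([
        [1.0 if v == c else 0.0 for v, cats in zip(row, categories) for c in cats]
        for row in rows
    ])


def pnn_predict(X_train, y_train, X_query, sigma=0.5):
    """Stage 2: average a Gaussian kernel per class and pick the largest score."""
    classes = sorted(set(y_train))
    y_train = np.array(y_train)
    # Squared Euclidean distance between every query and every training pattern.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))
    scores = np.stack(
        [kernel[:, y_train == c].mean(axis=1) for c in classes], axis=1
    )
    return [classes[i] for i in scores.argmax(axis=1)]


def impute(rows, sigma=0.5):
    """Assumed mode+PNN pipeline: mode-fill all columns, then refine with a PNN."""
    n_rows, n_cols = len(rows), len(rows[0])
    columns = [[r[j] for r in rows] for j in range(n_cols)]
    filled = [mode_fill(col) for col in columns]
    result = [list(col) for col in filled]
    for t in range(n_cols):
        missing = [i for i, v in enumerate(columns[t]) if v is MISSING]
        if not missing:
            continue
        known = [i for i, v in enumerate(columns[t]) if v is not MISSING]
        predictors = [j for j in range(n_cols) if j != t]
        cats = [sorted(set(filled[j])) for j in predictors]
        X = one_hot([[filled[j][i] for j in predictors] for i in range(n_rows)], cats)
        y = [columns[t][i] for i in known]
        for i, guess in zip(missing, pnn_predict(X[known], y, X[missing], sigma)):
            result[t][i] = guess
    return [[result[j][i] for j in range(n_cols)] for i in range(n_rows)]
```

Calling `impute(rows)` on a list of equal-length records returns a copy with every `MISSING` entry replaced. In this sketch, a smaller `sigma` makes the PNN behave more like a nearest-neighbour vote, while a larger one pulls its estimates toward the class frequencies.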