Abstract
Random sample partition (RSP) is a recently developed data management and processing model for Big Data analysis. Applying the RSP model to Big Data computation tasks requires measuring the distribution consistency of different datasets. Existing measurement methods for continuous-attribute and discrete-attribute datasets cannot directly handle mixed-attribute datasets. In this article, we design a hybrid method, abbreviated as MLELM-GMMD, that measures the distribution consistency among mixed-attribute datasets by combining a multilayer extreme learning machine (MLELM) with the generalized maximum mean discrepancy (GMMD) criterion. MLELM first transforms the original mixed-attribute datasets into corresponding deep encoding datasets. The GMMD criterion is then applied to check the distribution consistency of the deep encoding datasets. Four experiments were conducted to validate the feasibility and effectiveness of MLELM-GMMD: the impact of MLELM on the amount of information during mixed-attribute data transformation, the impact of MLELM on the distributions of mixed-attribute data, the distribution consistencies of RSP and non-RSP data blocks, and a comparison with other measurement methods. Experimental results show that the proposed MLELM-GMMD method measures the distribution consistency of mixed-attribute datasets more accurately than one-hot encoding-based methods.
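To make the idea of checking distribution consistency between data blocks concrete, the following is a minimal sketch and not the method proposed here. It uses the standard (biased) squared maximum mean discrepancy with a Gaussian kernel rather than the GMMD criterion, and it skips the MLELM encoding step by assuming the data are already numeric. The function names `rbf_kernel` and `mmd2`, the kernel width `gamma`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    # Pairwise squared Euclidean distances, clipped at 0 to absorb
    # floating-point error, then mapped through a Gaussian kernel.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma=1.0):
    # Biased estimate of the squared MMD between samples x and y
    # (standard MMD, used here as a stand-in for GMMD).
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))  # synthetic stand-in for an encoded dataset

# RSP-style blocks: records are assigned to blocks at random.
perm = rng.permutation(len(data))
rsp_a, rsp_b = data[perm[:250]], data[perm[250:500]]

# Non-random blocks: consecutive chunks of data sorted by the first attribute.
ordered = data[np.argsort(data[:, 0])]
seq_a, seq_b = ordered[:250], ordered[-250:]

mmd_rsp = mmd2(rsp_a, rsp_b)
mmd_seq = mmd2(seq_a, seq_b)
print(f"random blocks:     {mmd_rsp:.4f}")
print(f"sequential blocks: {mmd_seq:.4f}")
```

In this sketch the randomly assigned blocks should yield a much smaller discrepancy than the sequential blocks, which is the kind of contrast a distribution-consistency measure is meant to expose between RSP and non-RSP data blocks.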