Abstract

Random sample partition (RSP) is a recently developed data management and processing model for Big Data analysis. Applying the RSP model to Big Data computation tasks requires measuring the distribution consistency of different datasets. Existing measurement methods for continuous-attribute and discrete-attribute datasets cannot directly handle mixed-attribute datasets. In this article, we design a hybrid method, abbreviated as MLELM-GMMD, that measures the distribution consistency among mixed-attribute datasets by combining a multilayer extreme learning machine (MLELM) with the generalized maximum mean discrepancy (GMMD) criterion. MLELM first transforms the original mixed-attribute datasets into corresponding deep encoding datasets; the GMMD criterion is then applied to check the distribution consistency of the deep encoding datasets. Four experiments validate the feasibility and effectiveness of MLELM-GMMD: the impact of MLELM on the amount of information during mixed-attribute data transformation, the impact of MLELM on the distributions of mixed-attribute data, the distribution consistency of RSP and non-RSP data blocks, and a comparison with other measurement methods. Experimental results show that the proposed MLELM-GMMD method measures the distribution consistency of mixed-attribute datasets more accurately than one-hot encoding-based methods.
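
To make the consistency check concrete, the following is a minimal sketch of a two-sample discrepancy test on already-encoded numeric data. It is not the GMMD criterion of this article: it uses the standard biased estimator of the squared maximum mean discrepancy (MMD) with a Gaussian kernel as a stand-in, and it skips the MLELM encoding step entirely. The function names, the bandwidth `sigma`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y (standard MMD, not GMMD)."""
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Synthetic stand-in for a deep-encoded dataset (hypothetical data).
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 5))

# Two blocks drawn by random permutation, mimicking RSP-style random blocks.
perm = rng.permutation(len(data))
block_a = data[perm[:500]]
block_b = data[perm[500:1000]]

# A block from a shifted distribution, mimicking an inconsistent block.
shifted = rng.normal(loc=1.0, size=(500, 5))

same = mmd2_biased(block_a, block_b)
diff = mmd2_biased(block_a, shifted)
print(f"random blocks:  MMD^2 = {same:.4f}")
print(f"shifted block:  MMD^2 = {diff:.4f}")
```

Under these assumptions, two randomly drawn blocks of the same dataset should give a value close to zero, while the shifted block should give a clearly larger one. The kernel bandwidth strongly affects the estimate and would need to be chosen for real data.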
