To address the poor generalization ability and high computational complexity of the random vector functional link (RVFL) network on large-scale data classification, we design and implement a distributed RVFL network with subspace-based local connections on the Spark framework (DRVFL-SLC). First, to exploit the partition-level parallelism of the resilient distributed dataset (RDD), the large-scale dataset stored in the Hadoop Distributed File System (HDFS) is randomly divided into random sample partition (RSP) data blocks, each of which corresponds to one partition of the RDD. An RSP data block is a subset of the data whose probability distribution is consistent with that of the full dataset at a given significance level. Next, the mapPartitions transformation is invoked on the multi-partition RDD in the distributed environment, so that the optimal RVFL-SLC for each partition is trained in parallel. The collect action is then used to gather and fuse the optimal RVFL-SLC models of all RDD partitions into DRVFL-SLC, which performs the classification of the big data. Finally, the feasibility and effectiveness of DRVFL-SLC are verified on several large-scale datasets, each with at least one million records, on a Spark cluster of 6 computing nodes. The results show that DRVFL-SLC achieves good speedup ratio, scalability and scale growth, and generalizes better than an RVFL-SLC trained on a single machine with the full data.
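The partition, train, and fuse pipeline described above can be illustrated with a minimal single-machine sketch. It is not the authors' implementation. Plain Python lists stand in for the RDD partitions, a nearest-centroid classifier stands in for RVFL-SLC, and majority voting is assumed as the fusion rule, since the abstract does not state how the local models are combined. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_sample_partitions(records, num_blocks, seed=0):
    """Shuffle the records and deal them round-robin into num_blocks blocks.

    A simple stand-in for RSP block generation: each block is a random
    subset of the full dataset.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_centroid_model(block):
    """Stand-in local learner (NOT RVFL-SLC): one mean vector per class."""
    sums, counts = {}, Counter()
    for features, label in block:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(model, features):
    """Return the label of the closest class centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def fuse_predict(models, features):
    """Assumed fusion rule: majority vote over the per-block models."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]


def make_toy_data(n, seed=1):
    """Two Gaussian classes in 2-D, centred at (0, 0) and (3, 3)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 3.0 * label
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data


data = make_toy_data(600)
blocks = random_sample_partitions(data, num_blocks=6)
# On Spark, this step would correspond to mapPartitions followed by collect:
# one model is trained per partition, and the models are gathered on the driver.
models = [train_centroid_model(block) for block in blocks]
accuracy = sum(fuse_predict(models, x) == y for x, y in data) / len(data)
print(f"blocks={len(blocks)} models={len(models)} accuracy={accuracy:.3f}")
```

In this sketch each local model sees only one sixth of the data, yet the fused vote is evaluated on all of it, which mirrors the idea that blocks sharing the distribution of the full dataset yield local models that can be combined into a single classifier.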