To address the problem of unbalanced load in large-scale all-to-all comparison of data files in a distributed system environment, the differences among the files themselves are fully considered. This paper aims to exploit the advantages of distributed systems to improve file allocation for all-to-all comparison between the data files in a large dataset. For this purpose, the all-to-all comparison problem is formally described, and a data allocation model is constructed via mixed integer linear programming (MILP). A data allocation algorithm is then developed in MATLAB using the intlinprog function, which applies the branch-and-bound method. Finally, the model and algorithm are verified through several experiments. The results show that the proposed file allocation strategy achieves basic load balance across the nodes of the distributed system without exceeding the storage capacity of any node, and completely localizes the data files. The research findings can be applied to fields such as bioinformatics, biometrics and data mining.
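
The allocation problem summarized above can be illustrated on a toy instance. The following is a minimal sketch, not the paper's MILP model or its intlinprog-based algorithm: it replaces the solver with an exhaustive search that is only practical for a handful of files. The function name `allocate`, the cost model (a comparison costs the sum of the two file sizes), and the objective (minimize the largest node load) are assumptions made for illustration.

```python
from itertools import combinations, product


def allocate(sizes, num_nodes, capacity):
    """Exhaustively search task-to-node assignments for a tiny instance.

    sizes      -- file sizes; file i has size sizes[i]
    num_nodes  -- number of worker nodes
    capacity   -- storage capacity of each node (same for all nodes)

    Returns (max_load, task_to_node, files_per_node), or None if no
    assignment respects the storage capacity.
    """
    # Every unordered pair of files is one comparison task.
    pairs = list(combinations(range(len(sizes)), 2))
    best = None
    for assignment in product(range(num_nodes), repeat=len(pairs)):
        stored = [set() for _ in range(num_nodes)]
        load = [0] * num_nodes
        for (i, j), node in zip(pairs, assignment):
            # Data locality: the node running a task stores both files.
            stored[node].update((i, j))
            # Assumed cost model: comparison cost = sum of the file sizes.
            load[node] += sizes[i] + sizes[j]
        # Storage constraint: the files kept on a node must fit on it.
        if any(sum(sizes[f] for f in files) > capacity for files in stored):
            continue
        max_load = max(load)
        if best is None or max_load < best[0]:
            best = (max_load, dict(zip(pairs, assignment)), stored)
    return best


if __name__ == "__main__":
    result = allocate(sizes=[4, 3, 2, 1], num_nodes=2, capacity=10)
    if result is not None:
        max_load, task_to_node, files_per_node = result
        print("largest node load:", max_load)
        print("task -> node:", task_to_node)
        print("files per node:", files_per_node)
```

The search enumerates `num_nodes ** len(pairs)` assignments, so it only serves to make the constraints concrete; an MILP formulation solved by branch and bound, as the abstract describes, is what scales beyond toy sizes.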