Graph Convolutional Networks (GCNs) are an important method for learning node representations on graphs. On large-scale graphs, however, GCNs suffer from the neighborhood expansion phenomenon, which leads to high model complexity and long training times. An efficient remedy is graph sampling, such as node sampling and random walk sampling. However, existing sampling methods still aggregate too many neighbor nodes and ignore node feature information. In this paper, we therefore propose a new subgraph sampling method, Similarity-Aware Random Walk (SARW), for GCNs on large-scale graphs. We first propose a novel similarity index between two adjacent nodes that describes the relationship of nodes with their neighbors. We then design a sampling probability expression between adjacent nodes using node feature information, degree information, neighbor set information, etc. Moreover, we prove the unbiasedness of the SARW-based GCN model for node representations. The simplified version of SARW (SSARW) has a much smaller variance, which indicates the effectiveness of our subgraph sampling method for GCN learning on large-scale graphs. Experiments on six datasets show that our method achieves superior performance over state-of-the-art graph sampling approaches on the large-scale graph node classification task.
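To make the idea of similarity-biased subgraph sampling concrete, the following is a minimal sketch only. The abstract does not state the SARW similarity index or its sampling probability expression, so the weighting below (cosine feature similarity plus common-neighbor count, damped by neighbor degree) is a hypothetical stand-in, and the function names `transition_probs` and `sample_subgraph` are illustrative. The sketch also omits the normalization on which an unbiasedness argument would rely.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def transition_probs(adj, feats, node):
    """Hypothetical transition rule, NOT the paper's formula.

    Each neighbor is weighted by feature similarity and the number of
    shared neighbors, then damped by the neighbor's degree.
    """
    neighbors = sorted(adj[node])
    weights = []
    for nb in neighbors:
        sim = max(cosine(feats[node], feats[nb]), 0.0)
        common = len(adj[node] & adj[nb])
        weights.append((1.0 + sim + common) / len(adj[nb]))
    total = sum(weights)
    return neighbors, [w / total for w in weights]


def sample_subgraph(adj, feats, roots, walk_length, seed=0):
    """Run one biased walk per root and return the induced subgraph."""
    rng = random.Random(seed)
    visited = set(roots)
    for root in roots:
        current = root
        for _ in range(walk_length):
            if not adj[current]:  # isolated node: the walk cannot continue
                break
            neighbors, probs = transition_probs(adj, feats, current)
            current = rng.choices(neighbors, weights=probs, k=1)[0]
            visited.add(current)
    # Keep only edges whose endpoints were both visited.
    return {v: adj[v] & visited for v in visited}


# Toy undirected graph (adjacency sets) with 2-dimensional node features.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
feats = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
sub = sample_subgraph(adj, feats, roots=[0], walk_length=3, seed=0)
```

A GCN mini-batch would then be trained on `sub` rather than on the full graph, which is how subgraph sampling limits the number of neighbors aggregated per node.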