Community detection is one of the most representative graph mining applications. It is often formulated as a concurrent graph partitioning problem that seeks to maximize the modularity (or modularity gain) of each community. However, the frequent branch divergence incurred when dividing vertices into communities is a significant obstacle to exploiting the high throughput and memory bandwidth of GPUs, both of which community detection needs. In this paper, we present Lugger, a GPU-based overlapping community detection algorithm that reduces branch divergence on the GPU through a custom-designed, cache-aware parallel searching technique. In Lugger, we first design a cache-aware parallel searching policy based on the B-Tree structure. We then size each B-Tree node to match the GPU cache line, so that memory accesses are coalesced and branch divergence within warps is avoided. Moreover, we design a positive node splitting scheme that reduces lock operations and idle threads while the B-Tree is being built. In addition, we implement a warp-centric thread assignment strategy to balance the workload across threads. We implement the proposed algorithm on NVIDIA GPUs and evaluate its performance on eight large graphs (up to <inline-formula><tex-math notation="LaTeX">$3M$</tex-math></inline-formula> vertices and <inline-formula><tex-math notation="LaTeX">$117M$</tex-math></inline-formula> edges) with ground-truth communities. The experimental results show that Lugger outperforms state-of-the-art approaches in both scalability and detection quality.
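
To make the cache-line-sized B-Tree node idea concrete, the following is a minimal host-side C++ sketch, not Lugger's actual implementation. It assumes a 128-byte cache line, 32 lanes per warp, and 4-byte keys; the names `BTreeNode`, `make_node`, and `lower_bound_in_node` are hypothetical and chosen only for illustration. Each loop iteration stands in for one warp lane performing the same comparison on its own key, so no lane takes a different branch from its neighbors. On a GPU, the per-lane results would typically be combined with a warp vote (`__ballot_sync`) and a population count (`__popc`).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <limits>

// Assumptions for this sketch: 128-byte cache line, 32-lane warp, 4-byte keys.
constexpr std::size_t kCacheLineBytes = 128;
constexpr std::size_t kWarpSize = 32;

// Hypothetical node layout: exactly one cache line, one key slot per warp lane.
struct alignas(kCacheLineBytes) BTreeNode {
    std::array<std::uint32_t, kWarpSize> keys;  // sorted; unused slots hold UINT32_MAX
};
static_assert(sizeof(BTreeNode) == kCacheLineBytes,
              "a node must occupy exactly one cache line");

// Builds a node from already-sorted keys; keys beyond 32 are ignored.
inline BTreeNode make_node(std::initializer_list<std::uint32_t> sorted_keys) {
    BTreeNode node;
    node.keys.fill(std::numeric_limits<std::uint32_t>::max());
    std::size_t i = 0;
    for (std::uint32_t k : sorted_keys) {
        if (i == kWarpSize) break;
        node.keys[i++] = k;
    }
    return node;
}

// Returns the index of the first key >= target (the child slot to descend into).
// Every "lane" evaluates the same comparison on its own key; because the keys
// are sorted, the number of keys smaller than target is the answer.
inline std::size_t lower_bound_in_node(const BTreeNode& node, std::uint32_t target) {
    std::size_t smaller = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        smaller += static_cast<std::size_t>(node.keys[lane] < target);
    }
    return smaller;
}
```

For example, a node built with `make_node({10, 20, 30, 40})` yields slot 2 for `lower_bound_in_node(node, 25)`, obtained with a fixed 32 comparisons regardless of the target. The design trades a data-dependent binary search for uniform work per lane, which is what allows a whole warp to read one node in a single coalesced access.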