Understanding Node Change Bugs for Distributed Systems

Jie Lu, Xiaobing Feng, Lian Li, Liu Chen

doi:10.1109/saner.2019.8668027

Abstract

Distributed systems are the fundamental infrastructure for modern cloud applications, and the reliability of these systems directly impacts service availability. Distributed systems run on clusters of nodes. While the system is running, nodes can join or leave the cluster at any time, due to unexpected failure or system maintenance. It is essential for distributed systems to tolerate such node changes. However, handling node changes correctly is notoriously difficult. Node change bugs are widespread and can lead to catastrophic failures. We believe that a comprehensive study is necessary to better prevent and diagnose node change bugs. In this paper, we perform an extensive empirical study on node change bugs. We manually went through 6,660 bug issues of 5 representative distributed systems, of which 620 issues were identified as node change bugs. We studied 120 bug examples in detail to understand the root causes, impacts, trigger conditions, and fixing strategies of node change bugs. Our findings shed light on new detection and diagnosis techniques for node change bugs. As part of our empirical study, we developed two useful tools, NCTrigger and NPEDetector. NCTrigger helps users automatically reproduce a node change bug by injecting node change events based on a user specification. It largely reduces the manual effort needed to reproduce a bug (from 2 days to less than half a day). NPEDetector is a static analysis tool that detects null pointer exception errors. We developed this tool based on our finding that node operations often lead to null pointer exception errors, and that these errors share a simple common pattern. Experimental results show that this tool can detect 60 new null pointer errors, including 7 node change bugs. 23 bugs have already been patched and fixed.

Full Text