How are distributed bugs diagnosed and fixed through system logs?

Wei Yuan, Shan Lu, Xudong Liu, Hailong Sun

doi:10.1016/j.infsof.2019.106234

Abstract

Context: Distributed systems are the backbone of today’s computing ecosystems. Debugging distributed bugs is crucial and challenging. There are still many unknowns about debugging real-world distributed bugs, especially through system logs.

Objective: This paper aims to provide a comprehensive study of how system logs can help diagnose and fix distributed bugs in practice.

Method: The study was carried out with three core research questions (RQs): (RQ1) How can failures in distributed bugs be identified through logs? (RQ2) How can bug-related log entries be found and used to figure out the root causes? (RQ3) How are distributed bugs fixed, and how are logs and patches related? To answer these questions, we studied 106 real-world distributed bugs randomly sampled from five widely used distributed systems, and manually checked the bug report, the log, the patch, the source code, and other related information for each of these bugs.

Results: We report seven findings. The main ones are: (1) For only about half of the distributed bugs, the failures are indicated by FATAL or ERROR log entries. FATAL entries are not always fatal, and INFO entries can be fatal. (2) For more than half of the studied bugs, root-cause diagnosis relies on log entries that are not part of the failure symptoms. (3) One third of the studied bugs are fixed by eliminating end symptoms instead of root causes. Finally, a distributed bug dataset with the in-depth analysis has been released to the research community.

Conclusion: The findings in our study reveal the characteristics of distributed bugs, the differences from debugging single-machine system bugs, and the uses and limitations of existing logs. Our study also provides guidance and opportunities for future research on distributed bug diagnosis, fixing, and log analysis and enhancement.

Full Text