Debugging Nondeterministic Failures in Linux Programs through Replay Analysis

Shakaiba Majeed, Minsoo Ryu

doi:10.1155/2018/8939027

Abstract

Reproducing a failure is the first and most important step in debugging because it enables us to understand the failure and track down its source. However, many programs are susceptible to nondeterministic failures that are hard to reproduce, which makes debugging extremely difficult. We first address the reproducibility problem by proposing an OS-level replay system for a uniprocessor environment that can capture and replay the nondeterministic events needed to reproduce a failure in Linux interactive and event-based programs. We then present an analysis method, called replay analysis, based on the proposed record and replay system to diagnose concurrency bugs in such programs. The replay analysis method uses a combination of static analysis, dynamic tracing during replay, and delta debugging to identify the failure-inducing memory access patterns that lead to a concurrency failure. The experimental results show that the presented record and replay system has low recording overhead and hence can be safely used in production systems to catch rarely occurring bugs. We also present a few concurrency bug case studies from real-world applications to demonstrate the effectiveness of the proposed bug diagnosis framework.

Highlights

Debugging is the hardest part of software development
We evaluated the performance of our record and replay system regarding recording overhead for various real applications
Since we aim to trace memory access for a limited subset of global variables obtained during the static analysis phase, we want to avoid the inherent expense of instrumenting memory access to every shared memory location as it is redundant to the bug diagnosis process

Summary

Introduction

Debugging is the hardest part of software development. The process of debugging consists of reproducing a failure, locating its root cause, and fixing it. The ability to reproduce a failure is indispensable because, in most cases, it is the only way to give developers clues for tracking down the source of the failure. For some nondeterministic failures, such as concurrency bugs, it is not always possible to reproduce the failure even when the same set of inputs and environmental configurations is provided. Without the ability to reproduce, debugging becomes an inefficient and time-consuming process of trial and error. Some software practitioners report that it takes them weeks to diagnose such hard-to-reproduce failures [1].
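To make the reproducibility problem concrete, the following minimal sketch simulates two threads that each perform a non-atomic increment of a shared counter. It is an illustration only and not the record and replay system proposed in the paper: the function `run`, the explicit `schedule` list, and the two example schedules are hypothetical names introduced here. The point it shows is that the same program with the same input can produce different results under different interleavings, and that re-executing a recorded interleaving reproduces the failure deterministically.

```python
def run(schedule):
    """Execute two simulated threads under an explicit interleaving.

    Each simulated thread increments a shared counter in two steps:
      step 0: load the counter into a thread-local register
      step 1: store register + 1 back to the counter
    `schedule` lists which thread (0 or 1) executes its next step.
    This is a toy model (an assumption for illustration), not real threading.
    """
    counter = 0
    registers = [0, 0]   # per-thread local copy of the counter
    next_step = [0, 0]   # per-thread program position
    for tid in schedule:
        if next_step[tid] == 0:
            registers[tid] = counter          # load
        elif next_step[tid] == 1:
            counter = registers[tid] + 1      # store
        else:
            raise ValueError("thread %d has no steps left" % tid)
        next_step[tid] += 1
    return counter


# Thread 0 finishes before thread 1 starts: both increments take effect.
good_schedule = [0, 0, 1, 1]

# Both threads load before either stores: one increment is lost.
bad_schedule = [0, 1, 0, 1]

print(run(good_schedule))  # 2
print(run(bad_schedule))   # 1

# Replaying the recorded failing schedule yields the same faulty result again.
print(run(bad_schedule) == run(bad_schedule))  # True
```

In a real program the schedule is chosen by the operating system and is not visible to the developer, which is why a failure such as the lost update above may appear only rarely. Recording the scheduling decisions, as in this toy `schedule` list, is what makes the failing run repeatable.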

Objectives

Results

Discussion

Conclusion