Debugging in Distributed Systems

Thomas J. Leblanc

doi:10.1002/0471028959.sof085

Abstract

Debugging sequential programs on a uniprocessor is a fairly well‐understood task. A good interactive debugger supports breakpoints and single‐step execution for a line‐by‐line analysis of the effect of procedures and instructions on program state, all in the context of the original source code. At any moment, the user can halt execution and examine any aspect of the program's state, tracing the relationship between source code and error symptoms at whatever level of detail is desired. Distributed systems, on the other hand, introduce enormous complications for debugging. The physical separation of processors and the communication delays between them make it impossible to halt all processors instantaneously. These properties add a dimension to the problem of debugging that is not present with sequential programs. Perhaps the greatest difficulty stems from the uncertainty present in any system with unpredictable communication delays. Any debugging technique that uses the processors or the communication medium can itself introduce such delays and obscure the execution being observed. Distributed systems therefore require debugging techniques that explicitly address this inherent uncertainty. Not all queries are possible, especially queries that require the debugger to know the state of two different processors at precisely the same moment. In fact, distributed systems require a new notion of time, in which time is relative to the processor making the observation. In addition, since a distributed program is made up of many sequential programs, distributed systems need an expanded notion of program state and of the events that cause changes in that state. Distributed systems also need new techniques for monitoring program executions and new debugging tools that allow the user to examine the execution state and interact with a program in execution.
In a distributed system, processes communicate by exchanging messages. Therefore, message operations are the primary events of interest during debugging. Several techniques can be used to monitor execution and recognize the occurrence of message events. During the debugging cycle, the debugger and user interact to find the source of errors. The debugger captures execution information and presents it to the user; the user analyzes the information, forms hypotheses, initiates experiments, and requests more information. Thus, the way in which the debugger presents information to the user is crucial, and, as with any software system, there are a variety of ways to do so. Despite the many advances made since about 1985, distributed program debugging remains a difficult problem. The remaining problems will be the focus of research efforts, while users continue to gain valuable experience with the techniques already developed.
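
As an illustration of monitoring message events under a notion of time that is relative to each processor, the following minimal sketch records send and receive events and stamps them with a Lamport-style logical clock. The abstract does not name a particular clock scheme, so the logical clock is an assumption chosen for illustration, and the names `Event`, `Process`, and `merged_trace` are hypothetical. Messages are passed as in-memory tuples rather than over a real network.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One recorded message event (hypothetical trace record)."""
    process: str    # process on which the event occurred
    kind: str       # "send" or "receive"
    peer: str       # the other process involved in the message
    timestamp: int  # logical time local to `process`


@dataclass
class Process:
    """A simulated process with a logical clock and an event trace."""
    name: str
    clock: int = 0
    trace: list = field(default_factory=list)

    def send(self, dest):
        # Advance the local clock, record the event, and return a
        # message that carries the sender's timestamp.
        self.clock += 1
        self.trace.append(Event(self.name, "send", dest.name, self.clock))
        return (self.name, self.clock)

    def receive(self, message):
        # Move the local clock past the sender's timestamp so that a
        # receive is always ordered after the matching send.
        sender, sent_at = message
        self.clock = max(self.clock, sent_at) + 1
        self.trace.append(Event(self.name, "receive", sender, self.clock))


def merged_trace(*processes):
    """Combine per-process traces into one order consistent with causality."""
    events = [e for p in processes for e in p.trace]
    return sorted(events, key=lambda e: (e.timestamp, e.process))


# Two processes exchange a request and a reply.
p, q = Process("P"), Process("Q")
request = p.send(q)
q.receive(request)
reply = q.send(p)
p.receive(reply)

for event in merged_trace(p, q):
    print(event.timestamp, event.process, event.kind, event.peer)
```

The merged trace places every receive after its matching send without assuming that the two processes share a physical clock. Events that are not causally related may still appear in an arbitrary relative order, which reflects the limits on queries about simultaneous state described above.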
