Operational softwarized networks reliability management

Sejun Song, Henry Zhu

doi:10.1109/issrew.2015.7392036

Abstract

Operational network health management against security, reliability, and performance problems is a critical part of building highly dependable network services. In traditional networks, management is practiced mainly remotely, through provisioning and configuration, to cope with a network-centric infrastructure in which the main network functionalities, such as the data, control, and management planes, are distributed and embedded within vendor-specific networking devices. However, because problems that occur within the network must be inferred by remote network management systems (NMS) at the network edge, they often accumulate and grow, and diagnosis is delayed, inaccurate, unreliable, and not scalable. A few embedded approaches, such as Cisco's Embedded Event Manager (EEM) [1] and Cisco's Component Outage On-Line (COOL) Measurement [2], have become available. However, they are costly and limited to vendor-specific devices. The recently proposed Software-Defined Networking (SDN) architecture [3] decomposes networks using distribution, forwarding, and configuration abstractions to enable flexible, centralized network control. In particular, these paradigm changes toward virtualization and the softwarization of network functions, controls, and applications are expected to improve cost efficiency, control accuracy, and deployment flexibility. However, because layers of virtualization, policy application, and service chaining rely on network visibility, software reliability becomes a more critical issue and hardware reliability has an impact in more complex ways (as illustrated in Figure 1). There are several operational health management approaches for maintaining high SDN reliability. The strategies include: report every event to the health manager (OM), which then takes care of it (Event Driven); reply about an event only if an OM asks (Polling); and report only the registered events (Callback).
However, in SDN, it is unclear which approach works better under which conditions. There is also ample chance of control message redundancy because these approaches are not synchronized among applications. We develop a comprehensive solution for creating, handling, and managing network control messages, together with a facility that identifies and eliminates redundant information and synchronizes control and management messages among different applications. Our approach uses information units to identify control message redundancy. It collects the redundant information and delivers it either to a specific application via callback functions or to a shared DB so that other applications can infer it later. The approach is effective for control and management applications because a large part of their control messages is periodic. The correlation schemes include eliminating control message redundancy across different protocols as well as suppressing temporal redundancy within the same protocol. We also design a proactive algorithm that can actively control the related network objects before their status changes. However, because the tradeoff is a continuum, it is critical to find the right level at which to apply the algorithm. We also provide classification approaches for the different protocols and interfaces to quantify the type and level of each control message. We design an intelligent multicast facility to synchronize control and management messages among different controllers and user applications. Although each object is registered with a different service in mind, we can derive many common factors by analyzing both offline and online information. For example, we can find a failure that impacts the services of a couple of registered applications; such a failure can be registered so that the event is announced via a multicast message. We prototype these functions in Cisco's OpenDayLight High Availability component in support of a few practical case scenarios.
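The redundancy handling described in the abstract can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation: the names `ControlMessage`, `RedundancyFilter`, `register`, and `handle` are hypothetical, an "information unit" is assumed to be an (object, attribute) pair that is independent of the carrying protocol, and the shared DB is stood in for by an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumed "information unit": what a message says about one network object,
# regardless of which protocol carried it, e.g. ("link-1", "status").
Unit = Tuple[str, str]


@dataclass
class ControlMessage:
    protocol: str    # carrying protocol; the name is illustrative only
    object_id: str
    attribute: str
    value: str


@dataclass
class RedundancyFilter:
    """Hypothetical filter that suppresses repeated information units."""
    last_value: Dict[Unit, str] = field(default_factory=dict)
    callbacks: Dict[Unit, List[Callable[[ControlMessage], None]]] = field(
        default_factory=dict
    )
    shared_db: Dict[Unit, str] = field(default_factory=dict)  # stand-in for a shared DB
    suppressed: int = 0

    def register(self, unit: Unit, callback: Callable[[ControlMessage], None]) -> None:
        """Register an application callback for one information unit."""
        self.callbacks.setdefault(unit, []).append(callback)

    def handle(self, msg: ControlMessage) -> bool:
        """Return True if the message carried new information and was delivered."""
        unit = (msg.object_id, msg.attribute)
        # The protocol is not part of the key, so a periodic repeat from the
        # same protocol and a duplicate from another protocol are both dropped.
        if self.last_value.get(unit) == msg.value:
            self.suppressed += 1
            return False
        self.last_value[unit] = msg.value
        listeners = self.callbacks.get(unit)
        if listeners:
            for callback in listeners:
                callback(msg)  # delivery to registered applications
        else:
            self.shared_db[unit] = msg.value  # kept for applications that ask later
        return True
```

In this sketch, a second "up" report for the same link is suppressed whether it arrives through the same protocol or a different one, while a change to "down" is delivered again. Units with no registered callback fall through to the shared store, which mirrors the callback-or-shared-DB delivery choice mentioned above.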

Full Text