This paper presents a new monitoring tool and event management method, based on node event processing, for data centre compute, network and storage infrastructure. A highly classified data centre must not only sustain the highest levels of operational reliability and availability; it also needs fast, specific event identification and rectification, which together improve the availability of its resources. The new method uses a tree node for each element of the data centre resources, and each node provides information about the compute, network and storage file system configuration of that element. Its major advantage, in our case where a large number of heterogeneous computers are present, is that it helps monitor all elements of the computing resources and provides the information needed to alert the associated work centres before an identified event occurs. By monitoring the state of the systems and informing the concerned work centres a priori, it lowers errors and operating costs in the data centre physical infrastructure while improving operational efficiency. The results show that the use of tree nodes significantly reduces the number of unexpected events, the time needed to identify the main event, and the maintenance response time to events. Through event entity processing, multilayer nodes have a significant impact on the efficient operation of data centre physical infrastructure. This paper also presents the design and development of two customised dashboards that monitor the compute, storage and network elements of the heterogeneous data centre for uptime maintenance and optimal performance. The dashboards are designed with a view to the nature of the tasks carried out and the resource requirements of the various work centres in the data centre. One dashboard displays dynamically created icons for each of the compute resources in the data centre. 
Clicking any icon fetches the complete details of the corresponding server, showing its status, usage, configuration and available resources. Furthermore, a unique colouring scheme is followed in which the icon is displayed green if the server is healthy, orange if the server is facing a resource crunch (disk, memory, etc.) and red if the server is not reachable. The dashboard GUI refreshes every 5 min (configurable), displaying the latest status details of the servers in the data centre. The second dashboard is developed with the capability to monitor the storage, cloud and network infrastructure components. It collects data from different elements of the storage infrastructure, for example Meta Data Servers, storage, core and edge switches, and processes the collected data into a customised format for display. It delivers details such as the availability of storage Meta Data Servers, switches and file systems; disk space capacity; file system backup status; the state of the hierarchical storage, including the tape library; and the availability of the production ESXi host cluster. The GUI is updated as new requirements arise, to further fine-tune monitoring operations and reduce the manual intervention they require.
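As a minimal sketch of the colouring scheme described for the compute dashboard, the mapping from server state to icon colour could look like the following. The class name, field names and the 90 % threshold are assumptions made for illustration; the abstract does not state how a "resource crunch" is detected or which metrics are collected.

```python
from dataclasses import dataclass

# Hypothetical threshold: the abstract does not say when a server counts as
# facing a "resource crunch", so 90 % utilisation is assumed here.
CRUNCH_THRESHOLD_PERCENT = 90.0


@dataclass
class ServerStatus:
    """Illustrative snapshot of one server; all field names are assumptions."""
    reachable: bool
    disk_used_percent: float = 0.0
    memory_used_percent: float = 0.0


def icon_colour(status: ServerStatus) -> str:
    """Return the icon colour for a server: red, orange or green."""
    # An unreachable server is red regardless of its last known usage.
    if not status.reachable:
        return "red"
    # Any monitored resource at or above the assumed threshold is a crunch.
    worst = max(status.disk_used_percent, status.memory_used_percent)
    if worst >= CRUNCH_THRESHOLD_PERCENT:
        return "orange"
    return "green"
```

Checking reachability first reflects the ordering implied by the scheme: usage figures for a server that cannot be contacted would be stale, so they are not consulted.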