Automation of Error Detection and Analysis in Hyperconverged Systems

D.V. Silakov

doi:10.15514/ispras-2019-31(4)-2

Abstract

The paper addresses the problem of early error detection and analysis in hyperconverged systems. One approach to organizing such systems is to install on each physical server a separate instance of an operating system (OS) that carries virtualization tools as well as tools for administering and using distributed data storage. Errors can occur both at the level of a single OS instance and at the level of the entire cluster. For example, incorrect commands from a control component on one infrastructure node can cause a software failure on another node. In addition, errors in cluster subsystems can provoke abnormal situations inside virtual machines. The architectural complexity of hyperconverged systems makes the errors that occur in them difficult to analyze. To simplify such analysis and make it more effective, it is necessary to automate the detection of problems and the collection of the data needed to study and fix them. Existing approaches to automating error detection are described, and several improvements are suggested to adapt them to systems that make heavy use of distributed storage and virtualization technologies. The improvements include collecting logs from the whole cluster right after an error occurs, additional analysis of guest operating system behaviour inside virtual machines, and the use of a knowledge base for automated crash recovery and duplicate detection. Finally, a real-life error handling scenario in Virtuozzo products is described, from error detection to fix deployment.

Highlights

The paper is devoted to the problem of early error detection and analysis
One approach to organizing hyperconverged systems is to install on each physical server a separate instance of an operating system
It is necessary to automate the process of detecting problems

Summary

Prompt problem detection

Applying the first approach makes it possible to analyze the system and collect information potentially useful for investigating the problem while the trail is still fresh.

ISP RAS, vol 31, issue 4, 2019. pp. 29-38

[...] real-time log monitoring), in which case an alternative approach can be used.
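The idea of reacting to a problem as soon as it shows up in the logs can be illustrated with a minimal sketch. It is not the tooling described in the paper: the `watch` function, the `ERROR_PATTERNS` list, the `on_error` callback and the sample log lines are all hypothetical and chosen only for illustration. In a real setup the lines would come from a live source, such as a log file being followed, and the callback would start collecting diagnostic data.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical patterns; a real deployment would tune these to its own logs.
ERROR_PATTERNS = [
    re.compile(r"kernel BUG at"),
    re.compile(r"segfault at"),
    re.compile(r"Out of memory: Kill(ed)? process"),
]


def watch(lines: Iterable[str], on_error: Callable[[str], None]) -> int:
    """Call on_error for every line that matches an error pattern.

    Returns the number of matching lines. `lines` may be any iterable,
    including a generator that yields lines as they are written.
    """
    hits = 0
    for line in lines:
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            on_error(line.rstrip("\n"))
            hits += 1
    return hits


if __name__ == "__main__":
    # Made-up sample lines standing in for a live log stream.
    sample = [
        "Jan 01 00:00:01 node1 systemd[1]: Started Session 1.\n",
        "Jan 01 00:00:02 node1 kernel: app[123]: segfault at 0 ip 0 sp 0 error 4\n",
    ]
    collected: List[str] = []
    watch(sample, collected.append)
    print(collected)
```

Passing the reaction as a callback keeps detection separate from data collection, so the same loop could trigger anything from a local snapshot to a cluster-wide log gathering step.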

Error detection based on log analysis

Traversing the cluster to collect information

Detecting identical errors

Feedback

Error life cycle in Virtuozzo systems

Conclusion