Cyberthreat risk identification based on constructing entity-event ontologies from publicly available texts

M. K. Ridley

doi:10.21683/1729-2646-2020-20-3-53-60

Abstract

Aim. Among the cyber security methods currently in use, the most productive are traffic analysis, malware detection, denial of unauthorized access to internal networks, incident analysis and other methods of corporate perimeter protection. The efficiency of such methods, however, depends on the timeliness and quality of threat data. The aim of the paper is to study ways of improving cyber threat awareness and the capability to analyze texts from open sources in order to predict cyberattacks, identify and monitor new threats, and detect zero-day vulnerabilities before they are made public and before leaks are discovered.

Methods. Publicly available knowledge on cyber security is acquired through continuous collection of data from the Internet (including fragments of its non-indexed part and specialized sources) and other public data networks (including a large number of specialized resources and sites in the Tor network). The collected texts, in various languages, are analyzed using natural language processing methods to extract entities and events, which are then grouped into canonical entities and events. All of this information is used to continuously update a subject-matter event-entity ontology. The ontology includes general forms of entities and events required for context, as well as specialized forms for cyber security purposes (technical identifiers, attack vectors, attack surfaces, hashes, etc.). Such an ontology can function as a knowledge base and support structured queries by cyber security analysts.

Results. The proposed method and the system based on it can be used for analyzing computer security information, for monitoring, and for detecting zero-day vulnerabilities before they are made public and before leaks are discovered. The information retrieved by the system can be used as highly informative features in statistical models. Such features served as the basis for a classifier that estimates the risk of an exploit being created for a specific vulnerability, as well as for an IP address scoring system that can be used for automatic blocking. Additionally, a method was developed for risk-based ranking of events and entities associated with cyber threats. It makes it possible to identify, within the abundance of available information, the entities and events that require special attention, and to take timely and appropriate preventive measures.

Conclusion. The proposed method is of direct practical value for the analytics, risk-based ranking and monitoring of cyber threats. It can be used to analyze large volumes of text-based information and to create informative features that improve the quality of machine learning models used in computer security.
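To make the idea of an event-entity ontology with canonical entities more concrete, the following is a minimal Python sketch under stated assumptions. It is not the system described in the paper. The class names (`Entity`, `Event`, `Ontology`), the identifier scheme ("malware:examplebot") and the sample mentions are all hypothetical and chosen only for illustration. The sketch shows two things named in the abstract: grouping different surface mentions under one canonical entity, and answering a structured query over the stored events.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    canonical_id: str  # hypothetical id scheme, e.g. "malware:examplebot"
    kind: str          # e.g. "malware", "vulnerability", "ip_address"


@dataclass(frozen=True)
class Event:
    kind: str            # e.g. "exploit_published"
    participants: tuple  # canonical ids of the entities involved
    source: str          # where the mention was collected


class Ontology:
    """Toy event-entity store; an illustration, not the paper's system."""

    def __init__(self):
        self.entities = {}                         # canonical id -> Entity
        self.aliases = {}                          # lowercased mention -> canonical id
        self.events_by_entity = defaultdict(list)  # canonical id -> events

    def add_entity(self, entity, aliases=()):
        self.entities[entity.canonical_id] = entity
        for alias in (entity.canonical_id, *aliases):
            self.aliases[alias.lower()] = entity.canonical_id

    def resolve(self, mention):
        """Map a surface mention to its canonical id, or None if unknown."""
        return self.aliases.get(mention.lower())

    def add_event(self, kind, mentions, source):
        resolved = (self.resolve(m) for m in mentions)
        # dict.fromkeys drops repeated ids while keeping first-seen order.
        ids = tuple(dict.fromkeys(cid for cid in resolved if cid is not None))
        event = Event(kind, ids, source)
        for cid in ids:
            self.events_by_entity[cid].append(event)
        return event

    def query(self, canonical_id, event_kind=None):
        """Structured query: events for an entity, optionally filtered by kind."""
        events = self.events_by_entity.get(canonical_id, [])
        return [e for e in events if event_kind is None or e.kind == event_kind]


# Hypothetical data: the malware name and the CVE id are placeholders.
onto = Ontology()
onto.add_entity(Entity("malware:examplebot", "malware"),
                aliases=["ExampleBot", "EB-Loader"])
onto.add_entity(Entity("vuln:CVE-2099-12345", "vulnerability"),
                aliases=["CVE-2099-12345"])
onto.add_event("exploit_published", ["EB-Loader", "cve-2099-12345"], source="forum post")
onto.add_event("sample_observed", ["examplebot", "ExampleBot"], source="blog")

hits = onto.query("malware:examplebot", event_kind="exploit_published")
print(hits)
```

Keeping an alias table separate from the entity table means that new spellings of a known name can be added without touching the stored events, which is one simple way to support continuous updating.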

Highlights

Over the past decade, cybercrime has made a leap in its development and has become a large, competitive market
blocking attackers' access to the internal network, incident analysis and other methods of corporate perimeter protection
Author's contribution. The author, M. K. Ridley, analyzed the subject area; proposed a method for extracting information from open sources based on event-entity ontologies; adapted an information and analytics system, previously developed by the author for extracting and storing knowledge in the form of event-entity ontologies, to the cyber security domain; and solved five applied problems (cyber security monitoring and analytics, early detection of zero-day vulnerabilities, estimation of the risk that an exploit will be created for a vulnerability, IP address scoring, and risk-based ranking of cyber security events and entities)

Summary

Output methods recommended by the developers

XML, graphical [...] extraction uses rules based on regular expressions or the conditional random fields (CRF) method. CRF does not require modeling the probabilistic dependencies between observed variables, and it avoids the label bias problem of the maximum entropy Markov model. Extracted entities and facts are matched and resolved against the ontology to refine their meaning and to resolve coreference. [...] structured relations between entities are used to control the filtering process and as a gazetteer to improve extraction. Cultural and regional categories are extracted from the document to account for the hemisphere, the first day of the week and the date format. Events may be either past or expected. The architecture of the linguistic processing subsystem is shown schematically in Fig. 5. It is an independent system integrated with the main system, which is discussed in the next section of the paper.

6. High-level functional diagram of the system

Fig. 7. Technical architecture of the system from the data-flow perspective
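The rule-based option mentioned above (regular-expression rules combined with a gazetteer) can be sketched in Python as follows. This is an illustration under stated assumptions, not the rule set of the described system. The pattern names, the gazetteer entries ("examplebot", "eb-loader") and the sample sentence are hypothetical. The CRF alternative is not shown here.

```python
import re

# Hypothetical regular-expression rules for a few technical identifiers.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "ipv4": re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}

# Hypothetical gazetteer: known surface forms mapped to a canonical entity id.
GAZETTEER = {
    "examplebot": "malware:examplebot",
    "eb-loader": "malware:examplebot",
}


def extract(text):
    """Return (kind, value, offset) tuples sorted by position in the text."""
    found = []
    # Rule-based extraction of technical identifiers.
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(0), match.start()))
    # Gazetteer lookup: case-insensitive whole-word match on known names.
    for surface, canonical_id in GAZETTEER.items():
        name = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for match in name.finditer(text):
            found.append(("gazetteer", canonical_id, match.start()))
    return sorted(found, key=lambda item: item[2])


# Placeholder sentence: the CVE id is a dummy and 192.0.2.10 is a documentation address.
sample = "EB-Loader exploits CVE-2099-12345 from 192.0.2.10"
for kind, value, offset in extract(sample):
    print(offset, kind, value)
```

Returning the canonical id for gazetteer hits, rather than the raw mention, is one simple way to resolve different spellings to the same ontology entity at extraction time.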

Architecture and implementation of the ontology-based data collection and analysis system

Results of applying the method to applied problems

References

Author's contribution