Abstract Protocol reverse engineering is crucial for normative verification, malware behavior analysis, and vulnerability discovery. However, uncovering the structural features of binary protocols, which are concealed within dense data representations, remains a significant challenge. Accurately identifying the keyword segments associated with message types is a prerequisite for meaningful semantic analysis and protocol state machine reduction. In this work, we introduce a novel approach for inferring keywords from binary protocols based on probabilistic statistics. Operating at byte granularity, our method employs heuristic rules to filter out offset positions that are clearly unrelated to message types. We further filter the candidate byte offsets using constraint relations and rank each remaining offset by its probability of being the keyword segment. To enhance the reliability of keyword segment inference, we use a Monte Carlo method to assess the difference between message clustering by a candidate byte offset and random message clustering, and we reorder the candidate offsets according to the results. Finally, we take the optimal values from both orderings to produce the final inference result. Experimental results demonstrate that our method identifies keyword segments more accurately than previous techniques.
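The Monte Carlo comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`candidate_offsets`, `cohesion`, `monte_carlo_gap`, `rank_offsets`), the `max_values` threshold, and the majority-byte cohesion score are all assumptions chosen for illustration, since the abstract does not specify the heuristic rules, the constraint relations, or the clustering quality measure. The sketch clusters messages by the byte value at a candidate offset, scores how uniform each cluster is at the other offsets, and compares that score with the average score obtained when the same labels are shuffled at random.

```python
import random
from collections import Counter, defaultdict


def candidate_offsets(messages, max_values=16):
    """Assumed heuristic: keep offsets whose byte varies but takes few distinct values."""
    shortest = min(len(m) for m in messages)
    return [
        off for off in range(shortest)
        if 1 < len({m[off] for m in messages}) <= max_values
    ]


def cohesion(messages, labels, skip):
    """Share of bytes that match their cluster's most common byte,
    averaged over every offset except `skip` (an assumed quality score)."""
    shortest = min(len(m) for m in messages)
    positions = [p for p in range(shortest) if p != skip]
    if not positions:
        return 0.0
    clusters = defaultdict(list)
    for msg, label in zip(messages, labels):
        clusters[label].append(msg)
    matched = 0
    for members in clusters.values():
        for p in positions:
            matched += Counter(m[p] for m in members).most_common(1)[0][1]
    return matched / (len(messages) * len(positions))


def monte_carlo_gap(messages, offset, trials=200, seed=0):
    """Observed cohesion minus the mean cohesion of randomly shuffled labels."""
    rng = random.Random(seed)
    labels = [m[offset] for m in messages]
    observed = cohesion(messages, labels, skip=offset)
    shuffled = labels[:]
    baseline = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)  # same cluster sizes, random membership
        baseline += cohesion(messages, shuffled, skip=offset)
    return observed - baseline / trials


def rank_offsets(messages, trials=200, seed=0):
    """Candidate offsets sorted by descending gap over random clustering."""
    gaps = {
        off: monte_carlo_gap(messages, off, trials=trials, seed=seed)
        for off in candidate_offsets(messages)
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_offsets(messages)` on a list of `bytes` objects returns `(offset, gap)` pairs. Under the assumptions of this sketch, an offset whose clusters are much more uniform than random clusters of the same sizes receives a large gap, whereas an offset that is unrelated to the rest of the message receives a gap close to zero.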