Image recognition of Chinese, Japanese, and Korean (CJK) Hanzi still faces significant challenges, particularly with respect to pixel resolution, stroke count, character frequency, and character structure. This paper introduces the concepts of character and component thresholds to establish foundational parameters for Hanzi recognition, taking into account Unevenly Distributed Composite Characters (UDCCs), which account for up to 88% of Chinese characters. We compiled corpora of Simplified Chinese characters, Traditional Chinese characters, and Japanese Kanji, analyzed 3395 images with stroke counts ranging from 1 to 64, and developed the ZH-TC-IM965858 database, which contains 33,950 images across 10 pixel resolutions. Using Euclidean distance and ResNet50 similarity analysis, we identified a character threshold of 26 strokes for 24×24-pixel images, a component threshold of 14 strokes, and a comprehensive threshold of 16 strokes. We found that 7.61% of 8105 common Chinese characters, 27.41% of 96,585 Traditional characters, and 5.99% of 2163 Japanese Kanji exceed this comprehensive threshold. Character Length and Character Frequency (CLCF) models were employed to explore the relationships among stroke count, frequency, and thresholds. Additionally, the Scale-Invariant Feature Transform (SIFT) was applied to match radicals and components, providing insights for improving recognition accuracy. This research advances Hanzi image recognition and can help improve multimodal Large Language Models (LLMs) for ideographic languages.
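As a rough illustration of two quantities mentioned in the abstract, the sketch below computes a pixel-wise Euclidean distance between two equally sized grayscale glyph images and the share of characters whose stroke count exceeds a given threshold. It is not the authors' pipeline: the function names, the nested-list image representation, and the toy data are all hypothetical, and only the default threshold of 16 strokes is taken from the text.

```python
import math


def euclidean_distance(img_a, img_b):
    """Euclidean distance between two grayscale images given as lists of rows.

    Hypothetical helper: the images must have identical shapes.
    """
    same_rows = len(img_a) == len(img_b)
    same_cols = all(len(ra) == len(rb) for ra, rb in zip(img_a, img_b))
    if not (same_rows and same_cols):
        raise ValueError("images must have the same shape")
    squared = sum(
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    )
    return math.sqrt(squared)


def share_above_threshold(stroke_counts, threshold=16):
    """Fraction of characters with strictly more strokes than the threshold.

    The default of 16 is the comprehensive threshold reported in the abstract.
    """
    if not stroke_counts:
        return 0.0
    above = sum(1 for strokes in stroke_counts if strokes > threshold)
    return above / len(stroke_counts)


# Toy data, invented purely for illustration.
glyph_a = [[0, 255], [255, 0]]
glyph_b = [[0, 255], [0, 0]]
toy_stroke_counts = [1, 8, 16, 17, 30]

distance = euclidean_distance(glyph_a, glyph_b)
share = share_above_threshold(toy_stroke_counts)
```

With real data, the two images would typically be the same glyph rendered at different resolutions and rescaled to a common size, and the stroke counts would come from one of the character corpora; both steps are outside the scope of this sketch.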