Matching abbreviated names to their full names (full-abbr matching) plays a key role in data integration, address matching, information retrieval, and other fields. Traditional full-abbr matching techniques often fail on near homophones and near homoglyphs. First, a near-homophone full-abbr matching model based on SimBERT and VGG is proposed. It integrates character and speech features by leveraging a speech recognition model together with a brain-inspired dual-process cognitive learning mechanism that combines linguistic knowledge with neural networks. Second, to address near-homoglyph full-abbr matching in Chinese, a DenseNet-based model that fuses glyph structure and image features is proposed, in which statistical feature extractors separately produce feature vectors for the stroke, Wubi, and structural features of each glyph. Lastly, the near-homophone and near-homoglyph models are coupled to work together on the full-abbr matching task, with expert knowledge serving as a component of the feature optimizer. Experimental results show that the integrated model raised matching accuracy to 87.5%, a 12.3% improvement.
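The coupling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the score container, the weighted average, the subsequence rule standing in for expert knowledge, and all names and default values (`PairScores`, `fuse`, `w_phonetic`, `rule_bonus`) are assumptions made for illustration. The two input scores are taken to be the outputs of a near-homophone model and a near-homoglyph model, each in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class PairScores:
    """Hypothetical similarity scores for one (abbreviation, full name) pair."""
    phonetic: float  # assumed output of a near-homophone model, in [0, 1]
    glyph: float     # assumed output of a near-homoglyph model, in [0, 1]


def is_subsequence(abbr: str, full: str) -> bool:
    """Assumed expert rule: the abbreviation's characters appear, in order, in the full name."""
    remaining = iter(full)
    # `ch in remaining` advances the iterator, so order is enforced.
    return all(ch in remaining for ch in abbr)


def fuse(scores: PairScores, abbr: str, full: str,
         w_phonetic: float = 0.5, rule_bonus: float = 0.1) -> float:
    """Combine both model scores, then apply the rule-based adjustment."""
    base = w_phonetic * scores.phonetic + (1.0 - w_phonetic) * scores.glyph
    if is_subsequence(abbr, full):
        base += rule_bonus
    # Clamp so the fused score stays in [0, 1].
    return min(1.0, max(0.0, base))


if __name__ == "__main__":
    pair = PairScores(phonetic=0.8, glyph=0.6)
    print(fuse(pair, "北大", "北京大学"))
```

A weighted average with a rule-based bonus is only one simple way to couple two scorers; the abstract does not specify how the feature optimizer actually combines them.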