This paper introduces an unsupervised framework for learning Chinese syntactic structure based on sentence similarity. First, all sentence pairs in a Chinese sentence corpus are aligned, and each pair is partitioned into similar segments and different segments, which occur alternately. Then, the aligned similar segments or the aligned different segments are selected as potential constituent candidates, following a similarity-priority or a difference-priority strategy respectively. Because candidate selection may introduce boundary friction, a disambiguation step follows. Finally, syntactic structures are learned by inducing sentence constituents. To reduce word sparseness, some words are replaced by classes in advance. Three forms of sentence unit are examined: sequences of words, sequences of POS (part-of-speech) tags, and sequences of words with POS tags. The syntactic structures learned from each form are evaluated separately. The results show that the difference-priority strategy performs better than the similarity-priority strategy, and that the F-scores exceed 46% for all three forms, with the best reaching 49.52%, which is higher than previously reported results.
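
The alignment and candidate-selection steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the alignment method, so Python's standard `difflib.SequenceMatcher` stands in for it, and the function names `partition_pair` and `select_candidates`, the `priority` parameter, and the example sentences are all hypothetical. Boundary-friction disambiguation and word-class replacement are omitted.

```python
from difflib import SequenceMatcher


def partition_pair(a, b):
    """Split two token sequences into alternating similar/different segments.

    Assumption: difflib's longest-matching-block alignment is used here only
    as a stand-in for the paper's (unspecified) alignment procedure.
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' spans are shared by both sentences; every other opcode
        # ('replace', 'delete', 'insert') marks a span where they differ.
        kind = "similar" if tag == "equal" else "different"
        segments.append((kind, a[i1:i2], b[j1:j2]))
    return segments


def select_candidates(segments, priority="difference"):
    """Pick constituent candidates under a hypothetical priority switch."""
    wanted = "different" if priority == "difference" else "similar"
    return [(x, y) for kind, x, y in segments if kind == wanted]


# Two segmented sentences: "I like eating apples" / "I like eating bananas".
s1 = ["我", "喜欢", "吃", "苹果"]
s2 = ["我", "喜欢", "吃", "香蕉"]

segments = partition_pair(s1, s2)
print(select_candidates(segments, priority="difference"))
print(select_candidates(segments, priority="similarity"))
```

Under the difference-priority setting the sketch proposes the differing spans (here, the two object nouns) as candidates, while the similarity-priority setting proposes the shared span. The same functions accept POS-tag sequences or word/POS pairs, since `SequenceMatcher` only requires hashable items.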