Abstract

A word segmentation method based on inductive learning for non-segmented languages uses only the surface information of character strings, so it has the advantage of not depending on any specific language. The method recursively extracts character strings that occur frequently in a text as word candidates, and extracts segmentation rules with context information to resolve segmentation ambiguity. It assigns the extracted word candidates to different ranks according to the circumstances of their extraction, and segments a text into words using those candidates. Through a proofreading process, erroneous segmentations are corrected and the ranks of the word candidates and segmentation rules are updated. Evaluation experiments showed that the method is effective for Japanese and Chinese word segmentation.
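The general idea of using frequently occurring character strings as word candidates can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no implementation details, so the function names `extract_candidates` and `segment`, the thresholds `min_count` and `max_len`, and the greedy longest-match strategy are all assumptions made for illustration. The recursive extraction is replaced by a single pass of n-gram counting, and candidate ranking, context-based segmentation rules, and proofreading feedback are omitted.

```python
from collections import Counter


def extract_candidates(text, min_count=2, max_len=4):
    """Return substrings of length 2..max_len seen at least min_count times.

    Assumption: plain n-gram counting stands in for the recursive
    extraction described in the abstract.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def segment(text, candidates):
    """Greedy longest-match segmentation (an assumed strategy).

    Characters that match no candidate become single-character tokens.
    """
    max_len = max(map(len, candidates), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest possible piece first, falling back to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in candidates:
                words.append(piece)
                i += n
                break
    return words


sample = "abcxabcyabc"
print(segment(sample, extract_candidates(sample)))
# prints ['abc', 'x', 'abc', 'y', 'abc']
```

Because the sketch relies only on character frequencies, the same code runs unchanged on Japanese or Chinese text, which mirrors the language independence claimed above. A greedy longest match cannot resolve ambiguous boundaries, which is the gap the context-based segmentation rules are meant to address.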
