Abstract

This work proposed a unified view of several unsupervised feature selection based on frequent strings that improve conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS), term-contributed frequency (TCF), and term-contributed boundary (TCB), with a specific manner of boundary overlapping. For the experiment, the baseline is the 6-tag, a state-of-the-art labeling scheme of CRF-based CWS; and the data set is acquired from SIGHAN CWS bakeoff 2005 and SIGHAN CWS 2010. The experiment results show that all of those features improve the performance of the baseline system in terms of recall, precision, and their harmonic average as F <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sub> measure score, on both accuracy (F) and out-of-vocabulary recognition (F <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">OOV</sub> ). In particular, this work presents a novel feature selection approach of the compound feature “AVS+TCB” that outperforms other types of features for CRF-based CSW in terms of F and F <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">OOV</sub> .

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.