Integrated multi-strategic Web document pre-processing for sentence and word boundary detection

Junhyeok Shim, Dongseok Kim, Jeongwon Cha, Gary Geunbae Lee, Jungyun Seo

doi:10.1016/s0306-4573(01)00044-9

Abstract

Most work in NLP requires that texts be segmented into sentences and words beforehand. Segmenting a text into sentences and words, however, is a complex task because many punctuation marks and spaces are ambiguous. Furthermore, Web texts such as HTML documents are harder to turn into well-refined, segmented texts because they are written in a freer style and contain many sentence boundary and spacing errors. This paper introduces a multi-strategic integrated text preprocessing method for the difficult problems of sentence boundary disambiguation and word boundary disambiguation in Web texts. We apply a hybrid method (regular expression rules, heuristic rules, and inductive learning of statistical decision trees using a C4.5 learner) synergistically to the task of raw corpus preprocessing. This work contributes to more accurate morphological analysis and guarantees more stable operation of application systems. We tackle easily definable problems with automatically acquired constraints, and we use inductively learned decision trees to solve ill-defined ambiguity problems by incorporating multiple features (n-grams, relative frequency, entropy, tri-dictionary index). The multi-strategy approach was thoroughly tested on Korean news script Web documents: it achieved approximately 99.12% (with punctuation marks) and 98.04% (without any punctuation marks) accuracy in sentence boundary disambiguation, 95.39% accuracy in word spacing correction, and 94.61% accuracy on the whole intermixed text preprocessing problem.
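As a rough illustration of how features such as n-grams, relative frequency, and entropy might be computed for a candidate boundary position before being handed to a decision tree learner, consider the minimal sketch below. The abstract does not give the exact feature definitions, so everything here is an assumption: the function names (`successor_entropy`, `relative_frequency`, `boundary_features`), the use of character-level n-grams, and the choice of successor-character entropy are hypothetical stand-ins, not the authors' implementation.

```python
import math
from collections import Counter


def successor_entropy(corpus, context):
    """Shannon entropy (in bits) of the character following `context` in `corpus`.

    Assumption: a high value suggests that many different characters can
    follow `context`, which may hint at a word or sentence boundary.
    """
    n = len(context)
    followers = Counter(
        corpus[i + n]
        for i in range(len(corpus) - n)
        if corpus[i:i + n] == context
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


def relative_frequency(corpus, ngram):
    """Fraction of length-len(ngram) windows in `corpus` that equal `ngram`."""
    n = len(ngram)
    windows = len(corpus) - n + 1
    if n == 0 or windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if corpus[i:i + n] == ngram)
    return hits / windows


def boundary_features(corpus, text, pos, n=2):
    """Feature dictionary for the candidate boundary just before text[pos].

    The features are hypothetical examples of what a decision tree
    learner could consume; they are not taken from the paper.
    """
    left = text[max(0, pos - n):pos]
    right = text[pos:pos + n]
    return {
        "left_ngram": left,
        "right_ngram": right,
        "left_rel_freq": relative_frequency(corpus, left),
        "left_entropy": successor_entropy(corpus, left),
    }
```

In such a setup, one feature dictionary would be built per candidate position (for example, after each punctuation mark or between each pair of characters), labeled from a hand-segmented corpus, and passed to a decision tree learner of the reader's choice.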

Full Text