Multiple features fusion method for identifying text topic boundaries

Yong-Dong Xu, Guang-Ri Quan, Ya-Dong Wang, Zhi-Ming Xu

doi:10.1109/icmlc.2008.4620913

Abstract

In general, a document can be regarded as composed of coherent units called discourse segments. Discovering the segment boundaries is an important task for many natural language processing applications. In this paper, we propose a new method for identifying topic boundaries in Chinese text, based on the fusion of multiple features. Our approach first extracts multiple features of topic shift from the text. For each feature, we adopt a corresponding F-dotplotting model to calculate the boundary values between neighboring sentences. Subsequently, the useful features among these cues are automatically selected and combined by a statistical method based on logistic regression analysis to determine the topic boundaries. The experimental results show that the F-dotplotting method is more effective than the common dotplotting method, and that the multiple-feature fusion method based on the logistic regression model can effectively improve Chinese text topic segmentation performance.
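
The fusion step described above can be illustrated with a minimal sketch. It assumes each gap between neighboring sentences already has one boundary value per feature (how the F-dotplotting models produce these values is not shown here), and it combines them with the standard logistic function. The function names, weights, bias, threshold, and scores are all hypothetical values chosen for illustration; they are not taken from the paper, and in practice the weights would be fitted on labeled data.

```python
import math

def boundary_probability(scores, weights, bias):
    """Fuse per-feature boundary values for one sentence gap
    with a logistic model: p = 1 / (1 + exp(-(bias + w . x)))."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

def predict_boundaries(gap_scores, weights, bias, threshold=0.5):
    """Return the indices of sentence gaps whose fused
    probability exceeds the decision threshold."""
    return [
        i for i, scores in enumerate(gap_scores)
        if boundary_probability(scores, weights, bias) > threshold
    ]

# Hypothetical weights: a zero weight means the third feature
# contributes nothing, which mimics a feature being dropped.
weights = [2.0, 1.5, 0.0]
bias = -2.0

# Hypothetical boundary values: one row per sentence gap,
# one column per feature.
gap_scores = [
    [0.1, 0.2, 0.9],
    [0.9, 0.8, 0.1],
    [0.3, 0.4, 0.5],
]

boundaries = predict_boundaries(gap_scores, weights, bias)
print(boundaries)
```

With these illustrative numbers, only the second gap receives strong evidence from the two weighted features, so it is the only one flagged as a topic boundary.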

Full Text