An application of the nearest correlation matrix on web document classification

Houduo Qi, Zhonghang Xia, Guangming Xing

doi:10.3934/jimo.2007.3.701

Abstract

A Web document is organized as a set of textual data according to a predefined logical structure. It has been shown that collecting Web documents with similar structures can improve query efficiency. An XML document has no vectorial representation, which is required by most existing classification algorithms. The kernel method has been applied to represent structural data with pairwise similarity. In this case, a set of Web data can be fed into classification algorithms in the format of a kernel matrix. However, since the distance between a pair of Web documents is usually obtained approximately, the derived distance matrix is not a kernel matrix. In this paper, we propose to use the nearest correlation matrix (of the estimated distance matrix) as the kernel matrix, which can be computed quickly by a Newton-type method. Experimental studies show that the classification accuracy can be significantly improved.
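
To illustrate the repair step the abstract describes, the following is a minimal sketch of turning an indefinite symmetric matrix of estimated pairwise similarities into a valid correlation (and hence kernel) matrix. It is not the Newton-type method proposed in the paper: for brevity it substitutes the simpler alternating-projections scheme with Dykstra's correction, which targets the same nearest matrix in the Frobenius norm but typically converges more slowly. The function names, the tolerance, and the 3-by-3 input are assumptions made for illustration only.

```python
import numpy as np


def project_psd(a):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    w, v = np.linalg.eigh(a)
    # Clip negative eigenvalues to zero and rebuild the matrix.
    return (v * np.maximum(w, 0.0)) @ v.T


def nearest_correlation(g, tol=1e-8, max_iter=2000):
    """Approximate the nearest correlation matrix to g (Frobenius norm).

    Uses alternating projections with Dykstra's correction, NOT the
    Newton-type method of the paper. Names and defaults are hypothetical.
    """
    y = (g + g.T) / 2.0          # work with the symmetric part of the input
    ds = np.zeros_like(y)        # Dykstra correction for the PSD projection
    for _ in range(max_iter):
        r = y - ds
        x = project_psd(r)       # projection onto the PSD cone
        ds = x - r
        y = x.copy()
        np.fill_diagonal(y, 1.0)  # projection onto unit-diagonal matrices
        if np.linalg.norm(y - x, "fro") <= tol:
            break
    return y


# Hypothetical estimated similarity matrix; it has a negative eigenvalue,
# so it cannot be used directly as a kernel matrix.
g = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
k = nearest_correlation(g)
print(np.linalg.eigvalsh(g).min())  # negative: g is indefinite
print(np.round(k, 4))
```

The returned matrix is symmetric with unit diagonal and is positive semidefinite up to the stopping tolerance, so it could be passed to a classifier that accepts a precomputed kernel matrix.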
