Abstract

Web text mining is a growing research area in data mining. However, existing Web text mining algorithms concentrate on finding frequent patterns and discard the less frequent ones, which may contain outliers. In addition, domain knowledge differs in part from one industry to another, yet Web texts are analyzed using the same dictionary regardless of the domain they belong to. This paper proposes formal definitions of Web text outliers and Web text outlier mining, and presents a framework for Web text outlier mining based on domain knowledge. To verify the feasibility of the framework, an algorithm for mining Chinese Web text outliers is proposed, based on an improved vector space model (VSM) and n-grams. Experimental results on the insurance topic show that the algorithm effectively finds Chinese Web text outliers in Web text data, with higher precision and recall and lower complexity.
