Abstract

This chapter describes a novel multistage method for linguistic clustering of large collections of texts available on the Internet as a precursor to linguistic analysis of these texts. The method addresses the practicalities of clustering a very large set of text documents by combining unsupervised clustering with supervised classification. It first creates a multitude of independent clusterings of a randomized sample selected from the International Corpus of Learner English. Several consensus functions and sophisticated algorithms are then applied in two substages to combine these independent clusterings into one final consensus clustering, which is used to train fast classifiers that can profile very large collections of text and web data. This approach makes it possible to apply advanced, highly accurate clustering techniques to large collections by combining them with fast supervised classification algorithms. The effectiveness of the multistage method depends crucially on how well the supervised classification algorithms perform at the final stage, when they process large data sets available on the Internet. This performance may also serve as an indication of the quality of the combined consensus clustering obtained in the preceding stages. The authors’ experimental results compare the performance of several classification algorithms incorporated in this multistage scheme and demonstrate that several of them achieve very high precision and recall and can be used in practical implementations of the method.
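The multistage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithms, consensus functions, or classifiers used, so everything here is an assumption chosen for brevity. Toy 2-D points stand in for document feature vectors, a simple 2-means routine stands in for the independent clusterings, a co-association matrix with a 0.5 threshold stands in for the consensus function, and a nearest-centroid rule stands in for the fast classifier. All function names are hypothetical.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def base_clustering(points, seed, iters=10):
    """One independent clustering: 2-means seeded with a random point
    and the point farthest from it (an assumed stand-in algorithm)."""
    rng = random.Random(seed)
    first = rng.choice(points)
    centres = [first, max(points, key=lambda p: dist2(p, first))]
    labels = []
    for _ in range(iters):
        labels = [min((0, 1), key=lambda c: dist2(p, centres[c])) for p in points]
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = mean(members)
    return labels


def co_association(clusterings):
    """Fraction of clusterings in which each pair of items shares a cluster.
    Working with pairs avoids having to match label names across runs."""
    n = len(clusterings[0])
    return [[sum(run[i] == run[j] for run in clusterings) / len(clusterings)
             for j in range(n)] for i in range(n)]


def consensus_labels(co, threshold=0.5):
    """Consensus clustering: connected components of the graph that links
    items whose co-association exceeds the threshold."""
    n = len(co)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


def train_nearest_centroid(points, labels):
    """Fast supervised stage: classify by the nearest consensus-cluster centroid."""
    centroids = {l: mean([p for p, pl in zip(points, labels) if pl == l])
                 for l in set(labels)}
    return lambda p: min(centroids, key=lambda l: dist2(p, centroids[l]))


# Small sample (stage 1), many clusterings (stage 2), consensus (stage 3),
# then a cheap classifier that can be applied to unseen items (stage 4).
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
runs = [base_clustering(sample, seed) for seed in range(10)]
co = co_association(runs)
consensus = consensus_labels(co)
classify = train_nearest_centroid(sample, consensus)
```

In this sketch the expensive steps (repeated clustering and consensus) touch only the small sample, while `classify` costs one distance computation per cluster for each new item, which is the property that would let such a scheme scale to very large collections.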

