Abstract

Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. To deliver relevant information in different languages, efficient management of multilingual documents is worth studying. Classification and clustering are the two typical methods for document management. However, the lack of training data and the high effort required for corpus annotation increase the cost of classifying multilingual documents, which must also bridge language gaps. Clustering is therefore more suitable for such practical applications. Two main factors are involved in document clustering: the document representation method and the clustering algorithm. In this paper, we focus on the document representation method and demonstrate that the choice of representation method affects the quality of clustering results. In our experiment, we use parallel corpora (English‐Chinese documents on the topic of technology information) and comparable corpora (English and Chinese documents on the topics of mobile technology and wind energy) as datasets. We compare four document representation methods: Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, and Doc2Vec. Experimental results show that the accuracy of the Vector Space Model was not competitive with the other methods across all clustering tasks. Latent Semantic Indexing is overly sensitive to the corpus itself, as it behaved differently when clustering the two topics of the comparable corpora. Latent Dirichlet Allocation performs best when clustering documents in small comparable corpora, while Doc2Vec performs best on large document sets of parallel corpora. Accordingly, the characteristics of the corpora should be taken into account when choosing a document representation method in order to achieve better performance.
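As a rough illustration of the simplest of the four representations, the sketch below builds Vector Space Model vectors with TF-IDF weighting and compares documents by cosine similarity, the kind of measure a clustering algorithm would then consume. It is not the paper's implementation: the abstract does not state which term weighting or similarity measure was used, and the toy documents, the function names `tfidf_vectors` and `cosine`, and the monolingual setup are all assumptions made for the example. It also does nothing to bridge the language gap between English and Chinese documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents on the two comparable-corpora topics.
docs = [
    "mobile phone network signal".split(),
    "mobile phone battery screen".split(),
    "wind turbine energy blade".split(),
    "wind energy farm turbine".split(),
]

vecs = tfidf_vectors(docs)
same_topic = cosine(vecs[0], vecs[1])   # two mobile technology documents
cross_topic = cosine(vecs[0], vecs[2])  # mobile technology vs. wind energy
print(same_topic > cross_topic)
```

Because this representation only matches surface terms, two documents that share no vocabulary get a similarity of zero however related they are. That is one plausible reason a purely term-based model could lag behind latent or embedding-based representations on multilingual data.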


