Abstract

Neural word embedding methods, such as word2vec, produce very large feature vectors. In this paper, we investigate the length of the feature vector, aiming to optimize the word representation and to speed up the algorithm by reducing the impact of noise. We selected Principal Component Analysis (PCA), which has a proven record in dimensionality reduction, to achieve these objectives. Class-based language modeling serves as the extrinsic evaluation of the feature vectors, with perplexity (pp) as the metric, and K-means clustering is used to classify words; the execution time of the classification is also measured. We conclude that, for a given test set, if the training data is from the same domain, a large vector size can increase the precision with which word relations are described. In contrast, if the training data is from a different domain and contains a large number of contexts not expected to occur in the test data, a small vector size gives a better description and helps reduce the effect of noise on clustering decisions. Two training data domains were used in this analysis: Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi phone conversations, with test data from the same Iraqi domain. Based on this analysis, when training and test data share a domain, execution time is reduced by 61% while maintaining the same representation efficiency. For training data from a different domain, i.e. MSA, a perplexity reduction of 6.7% is achieved with execution time reduced by 92%. This underlines the importance of carefully choosing the feature vector size for overall performance.
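The pipeline the abstract describes — reduce word2vec vectors with PCA, then cluster the reduced vectors with K-means to form word classes — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random matrix stands in for real word2vec embeddings, and the sizes (1000 words, 300 dimensions reduced to 50, 10 clusters) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for word2vec output: 1000 "words", 300-dim vectors.
embeddings = rng.standard_normal((1000, 300))

def pca_reduce(X, k):
    """Project X onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T             # rows of Vt are principal directions

def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Minimal Lloyd's-algorithm K-means returning cluster labels."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, then assign nearest.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

reduced = pca_reduce(embeddings, k=50)     # 300 -> 50 dimensions
labels = kmeans(reduced, n_clusters=10)    # word classes for the class-based LM
print(reduced.shape, labels.shape)          # (1000, 50) (1000,)
```

Shrinking the vectors before clustering is what drives the reported speed-ups: the distance computation inside K-means scales linearly with vector length, so a 300-to-50 reduction cuts the dominant cost accordingly.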
