Abstract
The present study aims to generate low-dimensional explicit distributional semantic vectors. In explicit semantic vectors, each dimension corresponds to a word, which makes the word vectors interpretable. A new approach is proposed to obtain such vectors in three steps. First, the approach considers three criteria, namely word similarity, number of zeros, and word frequency, as features of the words in a corpus; rules for obtaining the initial basis words are then extracted from a decision tree built on these three features. Second, a binary weighting method based on the binary particle swarm optimization algorithm is proposed, which obtains NB = 1000 context words. In addition, a word selection method is used to provide NS = 1000 context words. Third, the golden words of the corpus are extracted using the binary weighting method and added, as the golden context words, to the context words selected by the word selection method. The ukWaC corpus is used to construct the word vectors, and the MEN, RG-65, and SimLex-999 test sets are used to evaluate them. The results are compared with a baseline that uses the 5K most frequent words in the corpus as context words and counts co-occurrences within a fixed window. The word vectors are obtained using the 1000 selected context words along with the golden context words. Compared with the baseline, the suggested approach increases Spearman's correlation coefficient on the MEN, RG-65, and SimLex-999 test sets by 4.66%, 14.73%, and 1.08%, respectively.
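To make the baseline concrete, the following is a minimal sketch of explicit word vectors built from fixed-window co-occurrence counts over the most frequent words. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy corpus (standing in for ukWaC), the window size of 2, the value k = 3 (standing in for 5K), and the use of raw counts with cosine similarity are all hypothetical choices, since the abstract does not specify them.

```python
from collections import Counter
import math


def top_k_context_words(tokens, k):
    """Return the k most frequent tokens to serve as context (basis) words."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:k]


def explicit_vectors(tokens, context_words, window=2):
    """Build one vector per word; dimension i counts co-occurrences with
    context_words[i] inside a symmetric window (window size is an assumption)."""
    index = {w: i for i, w in enumerate(context_words)}
    vectors = {}
    for pos, target in enumerate(tokens):
        vec = vectors.setdefault(target, [0] * len(context_words))
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for j in range(lo, hi):
            if j != pos and tokens[j] in index:
                vec[index[tokens[j]]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus; the paper uses ukWaC and the 5K most frequent words.
tokens = "the cat sat on the mat the dog sat on the rug".split()
context = top_k_context_words(tokens, k=3)
vectors = explicit_vectors(tokens, context, window=2)

# Each dimension is named by a context word, which is what makes the vector interpretable.
print(dict(zip(context, vectors["cat"])))
print(cosine(vectors["cat"], vectors["dog"]))
```

Because every dimension is labelled by a context word, a vector can be read directly as "how often this word appears near each basis word". Evaluating on MEN, RG-65, or SimLex-999 would then amount to correlating such similarity scores with the human ratings using Spearman's coefficient.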
More From: Turkish Journal of Electrical Engineering and Computer Sciences