Abstract
Natural language processing (NLP) provides a framework for large-scale text analysis. One common approach uses vector space models (VSMs), which embed word attributes, called features, into high-dimensional vectors. Comprehensive VSMs are trained on large corpora such as the Google News archive. Given a VSM, a thesaurus, a collection of semantically related words, can be created for a particular root word using cosine similarity. Many methods have been developed to reduce the complexity of these models by retaining useful semantic information while discarding non-informative features. One such method, variance thresholding, keeps the high-variance features above a manually determined threshold, providing greater differentiation between words for classification purposes. Our research developed a dimension-reducing methodology called dynamic variance thresholding (DyVaT). DyVaT reduces the specificity of word embeddings by retaining low-variance features instead, allowing for a broader thesaurus that preserves semantic similarity. The dynamic variance threshold, which determines how many low-variance features are retained, is selected automatically using the Kneedle algorithm, improving on results obtained with manually chosen thresholds. Our test case for examining the effectiveness of DyVaT in creating a contextual thesaurus is the visual, auditory, and kinesthetic learning-style context. We conclude that DyVaT is a valid method for generating loosely connected word collections with potential uses in NLP classification or clustering tasks.
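The abstract describes DyVaT only at a high level; the sketch below illustrates one plausible reading of it, not the authors' implementation. It assumes a NumPy embedding matrix (rows are words, columns are features), uses the `kneed` package's KneeLocator as the Kneedle algorithm, and keeps the low-variance features up to the detected knee before building a cosine-similarity thesaurus. All function and variable names here are hypothetical.

```python
# Hedged sketch of dynamic variance thresholding (DyVaT) as summarized
# in the abstract; details of the published method may differ.
import numpy as np
from kneed import KneeLocator  # Kneedle algorithm implementation


def dyvat_reduce(embeddings: np.ndarray) -> np.ndarray:
    """Keep low-variance embedding features, with the variance cutoff
    chosen automatically by the Kneedle algorithm (assumed behavior)."""
    variances = np.var(embeddings, axis=0)      # per-feature variance
    order = np.argsort(variances)               # feature indices, ascending variance
    sorted_var = variances[order]
    x = np.arange(len(sorted_var))
    # The sorted variance curve rises slowly, then sharply; Kneedle locates
    # the knee where high-variance features begin, and we keep what precedes it.
    knee = KneeLocator(x, sorted_var, curve="convex", direction="increasing").knee
    keep = order[: knee + 1] if knee is not None else order
    return embeddings[:, keep]


def thesaurus(root_idx: int, reduced: np.ndarray, vocab: list[str], k: int = 10) -> list[str]:
    """Rank vocabulary words by cosine similarity to a root word."""
    v = reduced[root_idx]
    sims = reduced @ v / (np.linalg.norm(reduced, axis=1) * np.linalg.norm(v) + 1e-12)
    return [vocab[i] for i in np.argsort(-sims)[1 : k + 1]]  # skip the root itself
```

Because low-variance features vary little across the vocabulary, keeping them flattens fine-grained distinctions, which is consistent with the abstract's goal of a broader, loosely connected thesaurus rather than a maximally discriminative one.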