Abstract

In this article, automatically generated and manually crafted semantic representations are compared, under the assumption that neither has primary status over the other. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful for assessing the quality or improving the coverage of hand-created semantic resources. We apply two unsupervised learning methods, Independent Component Analysis (ICA) and a probabilistic topic model at the word level using Latent Dirichlet Allocation (LDA), to create semantic representations from a large text corpus. We then compare the obtained results to two semantically labeled dictionaries and use the Self-Organizing Map to visualize the obtained representations. We show that both methods find a considerable amount of category information in an unsupervised way. Rather than only finding groups of similar words, they automatically find a number of features that characterize words. The unsupervised methods are also used for exploration, where they provide findings that go beyond the manually predefined label sets, and we demonstrate how the Self-Organizing Map visualization supports exploration and further analysis. In general, information-theoretically motivated and probabilistic methods provide results at a comparable level. Moreover, the automatic methods and human classifications give access to semantic categorizations that complement each other. Data-driven methods can furthermore be cost-effective and can adapt to a particular domain through an appropriate choice of data sets.

Highlights

  • In this article, automatically generated and manually crafted semantic representations are compared

  • In the first experiment, the Independent Component Analysis (ICA) and Latent Dirichlet Allocation (LDA) models were trained with the vector representations that correspond to the Battig vocabulary

  • Considering that the ICA method attempts to describe any kind of structure in the data, and the Battig vocabulary covers only 0.2% and the BLESS vocabulary only 0.8% of the vocabulary of 200 000 words, these results show that even partial labelings can be very useful when studying such a large dataset
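As a rough illustration of the kind of pipeline the highlights describe, the sketch below derives two-dimensional ICA and LDA features from a word-by-context count matrix. Everything in it is an assumption made for illustration: the six-word vocabulary, the synthetic counts, the choice of two components, the log transform before ICA, and the use of scikit-learn's `FastICA` and `LatentDirichletAllocation`. It is not the article's corpus, preprocessing, or implementation.

```python
import numpy as np
from sklearn.decomposition import FastICA, LatentDirichletAllocation

# Hypothetical toy vocabulary; the article uses a far larger one.
words = ["dog", "cat", "horse", "apple", "pear", "plum"]

# Synthetic word-by-context counts: each row is a word, each column a
# context feature. Two made-up profiles stand in for two categories.
rng = np.random.default_rng(0)
animal_profile = np.array([8, 7, 6, 1, 0, 1, 0, 1], dtype=float)
fruit_profile = np.array([0, 1, 1, 7, 8, 6, 1, 0], dtype=float)
counts = np.vstack(
    [rng.poisson(animal_profile) for _ in range(3)]
    + [rng.poisson(fruit_profile) for _ in range(3)]
)

# ICA on log-scaled counts: each word gets a vector of component activations.
ica = FastICA(n_components=2, random_state=0, max_iter=1000)
ica_features = ica.fit_transform(np.log1p(counts))

# LDA at the word level (assumed reading): each word is treated as a
# "document" whose "tokens" are its context counts, so each word gets a
# distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_features = lda.fit_transform(counts)

for word, ica_row, lda_row in zip(words, ica_features, lda_features):
    print(word, np.round(ica_row, 2), np.round(lda_row, 2))
```

With labeled dictionaries such as the Battig or BLESS vocabularies, the learned features for the labeled words could then be compared against the category labels, even though those words make up only a small share of the full vocabulary.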


Summary

Introduction

Automatically generated and manually crafted semantic representations are compared. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful for assessing the quality or improving the coverage of hand-created semantic resources. We explore the relationship between human and data-driven semantic similarity judgments. We aim to see a) whether the representations that are automatically generated in a data-driven manner coincide.

Challenge of semantics

Semantics is an intriguing and challenging area of linguistics. Linguists and researchers in nearby disciplines have created a number of theories related to semantics. These theories have been used as frameworks for semantic description or for labeling lexica and corpora (Cruse 1986).
